
Discrete Choice Analysis:

Predicting Demand and Market Shares


MIT, June 11-15, 2012


Case Studies Workbook



Credits

The principal authors of this edition of the case studies workbook are Gianluca Antonini,
Carmine Gioia, Emma Frejinger, and Michaël Thémans, with contributions by Maya
Abou Zeid, Ricardo Alvarez-Daziano, Ramachandran Balakrishna, Charisma Choudhury,
Matteo Sorci and Yang Wen. There have been many other Teaching Assistants over the
years who have provided significant inputs to the materials on which this workbook is
based.

The development of the case studies in this workbook was initiated and supervised by
Moshe Ben-Akiva for use in the MIT graduate course on Demand Modeling and in the
one-week continuing education course on Discrete Choice Analysis. Michel Bierlaire,
Denis Bolduc and Joan Walker participated in the development of the case studies and
contributed many comments and suggestions.



Contents
I Introduction and Biogeme
1 Introduction
2 Biogeme
2.1 Install Biogeme
2.2 Invoke Biogeme under Windows
2.3 Install Emacs
2.4 Input Files
2.5 Output Files
2.6 Step-by-Step Example
2.7 BioSim
II Case Studies
3 Introduction to Model Building
3.1 Practical Information
4 Binary Logit
4.1 Challenge Question
4.2 Choice-Lab-Fashion Marketing Case
4.3 Netherlands Mode Choice Case
4.4 Airline Itinerary Case
5 Logit
5.1 Challenge Question
5.2 Swissmetro Case
5.3 Choice of Residential Telephone Services Case
5.4 Airline Itinerary Case
6 Specification Testing
6.1 Swissmetro Case
6.2 Choice of Residential Telephone Services Case
6.3 Airline Itinerary Case
7 Forecasting
7.1 Guidelines
7.2 Swissmetro Case
7.3 Choice of Residential Telephone Services Case
7.4 Airline Itinerary Case
8 Multivariate (Generalized) Extreme Value Models
8.1 Challenge Question
8.2 Swissmetro Case
8.3 Choice of Residential Telephone Services Case
9 Mixtures of Logit and GEV Models
9.1 Challenge Question
9.2 Swissmetro Case
10 Simultaneous RP/SP Estimation
10.1 Model Specification with RP Data
10.2 Model Specification with SP Data
10.3 Model Specification with Combined RP-SP Data
A Datasets
A.1 Choice-Lab-Fashion Marketing Case
A.2 Netherlands Mode Choice Case
A.3 Swissmetro Case
A.4 Choice of Residential Telephone Services Case
A.5 Airline Itinerary Case
A.6 Facial Expressions Recognition Case
A.7 Italy Mode Choice Case
List of Tables
1.1 Datasets and applications
1.2 Datasets and applications
4.1 BL Challenge: Netherlands results
4.2 BL: Choice lab marketing case estimation results
4.3 BL: Choice lab marketing case estimation results
4.4 BL: Netherlands mode choice case estimation results
4.5 BL: Netherlands mode choice case estimation results
4.6 BL: Netherlands mode choice case estimation results
4.7 BL: Airline itinerary case estimation results
4.8 BL: Airline itinerary case estimation results
4.9 BL: Airline itinerary case estimation results
5.1 Logit model Challenge: Italy mode choice, Logit model estimation results
5.2 Logit model: Swissmetro estimation results
5.3 Logit model: Swissmetro estimation results
5.4 Logit model: Swissmetro estimation results
5.5 Logit model: Telephone services case estimation results
5.6 Logit model: Telephone services case estimation results
5.7 Logit model: Telephone services case estimation results
5.8 Logit model: Airline itinerary case estimation results
5.9 Logit model: Airline itinerary case estimation results
5.10 Logit model: Airline itinerary case estimation results
6.1 Specification Testing: Swissmetro market segmentation test
6.2 Specification Testing: Swissmetro IIA test
6.3 Specification Testing: Swissmetro models for Cox test
6.4 Specification Testing: Swissmetro M_1 estimation results
6.5 Specification Testing: Swissmetro M_2 estimation results
6.6 Specification Testing: Swissmetro M_C estimation results
6.7 Specification Testing: Swissmetro piecewise linear model
6.8 Specification Testing: Swissmetro power series model
6.9 Specification Testing: Swissmetro Box-Cox transformed model
6.10 Specification Testing: Telephone market segmentation test
6.11 Specification Testing: Telephone IIA test
6.12 Specification Testing: Telephone non-nested test
6.13 Specification Testing: Telephone piecewise linear model
6.14 Specification Testing: Telephone power series model
6.15 Specification Testing: Telephone Box-Cox transformed model
6.16 Specification Testing: Swissmetro market segmentation test
6.17 Specification Testing: Airline Itinerary IIA test
6.18 Specification Testing: Airline itinerary models for Cox test
6.19 Specification Testing: Airline itinerary M_1 estimation results
6.20 Specification Testing: Airline itinerary M_2 estimation results
6.21 Specification Testing: Airline itinerary M_C estimation results
6.22 Specification Testing: Airline itinerary piecewise linear model
6.23 Specification Testing: Airline itinerary power series model
6.24 Specification Testing: Airline itinerary Box-Cox transformed model
7.1 Forecasting: Swissmetro fuel cost policy
7.2 Forecasting: Telephone new cost policy
7.3 Forecasting: Airline itinerary fuel cost policy
8.1 MEV Challenge: Swissmetro NL estimation results
8.2 MEV: Swissmetro NL estimation results
8.3 MEV: Swissmetro CNL estimation results
8.4 MEV: Swissmetro CNL estimation unknown
8.5 MEV: Telephone NL estimation results
8.6 MEV: Telephone NL estimation results
8.7 MEV: Telephone CNL estimation results
8.8 MEV: Telephone CNL estimation with unknown
9.1 Mixtures Challenge: Airline itinerary case
9.2 Mixtures: Swissmetro alternative specific variance specification
9.3 Mixtures: Swissmetro error component specification
9.4 Mixtures: Swissmetro error component specification
9.5 Mixtures: Swissmetro random coefficient specification
9.6 Mixtures: Swissmetro mixture of nested Logit estimation
9.7 Mixtures: Swissmetro panel data specification
10.1 RP-SP: BL with RP data estimation results
10.2 RP-SP: BL with SP data estimation results
10.3 RP-SP: BL with RP-SP data estimation results
A.1 Choice-Lab Marketing Case: Description of variables
A.2 Choice-Lab Marketing Case: Descriptive statistics
A.3 Netherlands Mode Choice Case: Description of variables
A.4 Netherlands Mode Choice Case: Description of variables
A.5 Netherlands Mode Choice Case: Description of variables
A.6 Netherlands Mode Choice Case: Descriptive statistics
A.7 Swissmetro Case: Description of variables
A.8 Swissmetro Case: Description of variables
A.9 Swissmetro Case: Descriptive statistics
A.10 Swissmetro Case: Cantons
A.11 Telephone Services Case: Service options
A.12 Telephone Services Case: Description of variables
A.13 Telephone Services Case: Descriptive statistics
A.14 The choice of airline itinerary: Description of Variables
A.15 The choice of airline itinerary: Description of Variables
A.16 The choice of airline itinerary: Description of Variables
A.17 The choice of airline itinerary: description of Variables
A.18 The choice of airline itinerary: descriptive Statistics
A.19 The choice of airline itinerary: descriptive Statistics
A.20 Facial Expressions Case: Description of Variables
A.21 Facial Expressions Case: Description of Variables
A.22 Facial Expressions Case: Descriptive Statistics
A.23 Facial Expressions Case: Logit Model Results
A.24 Italy Mode Choice Case: Description of variables
A.25 Italy Mode Choice Case: Descriptive statistics
A.26 Italy Mode Choice Case: RP Logit Model Results
A.27 Italy Mode Choice Case: SP Logit Model Results
A.28 Italy Mode Choice Case: RP/SP Logit Model Results
A.29 Italy Mode Choice Case: RP NL Results
A.30 Italy Mode Choice Case: RP/SP NL Results
A.31 Italy Mode Choice Case: SP Logit Model with Agent Effect Results
A.32 Italy Mode Choice Case: RP/SP Logit Model with Agent Effect Results
A.33 Italy Mode Choice Case: RP/SP NL with Agent Effect Results
List of Figures
2.1 Biogeme: DOS example
2.2 Biogeme: DOS example
2.3 Biogeme: DOS example
2.4 Biogeme: Example of data file
2.5 Biogeme: Example of model file
2.6 Biogeme: Example of DOS commands
4.1 BL: Marketing case Biogeme snapshot
5.1 Logit model Challenge: Italy mode choice logit model specification
6.1 IIA test: Biogeme snapshot IITest section
6.2 Specification Testing: Swissmetro Biogeme snapshot
6.3 Specification Testing: Swissmetro Biogeme snapshot
6.4 IIA test: Biogeme snapshot IITest section
6.5 Specification Testing: Telephone Biogeme snapshot
6.6 Specification Testing: Telephone Biogeme snapshot
6.7 IIA test: Biogeme snapshot IITest section
6.8 Specification Testing: Airline itinerary Biogeme snapshot
6.9 Specification Testing: Airline itinerary Biogeme snapshot
7.1 Forecasting: Swissmetro market shares
7.2 Forecasting: Telephone market shares
7.3 Forecasting: Market Shares for Non-stop Itinerary
8.1 MEV: Swissmetro NL correlation structure
8.2 MEV Challenge: Swissmetro NL correlation structure
8.3 MEV: Swissmetro NL Biogeme snapshot
8.4 MEV: Swissmetro NL correlation structure
8.5 MEV: Swissmetro CNL correlation structure
8.6 MEV: Swissmetro CNL Biogeme snapshot
8.7 MEV: Telephone NL correlation structure
8.8 MEV: Telephone Biogeme snapshot
8.9 MEV: Telephone Biogeme snapshot
8.10 MEV: Telephone Biogeme snapshot
8.11 MEV: Telephone CNL correlations structure
8.12 MEV: Telephone Biogeme snapshot
8.13 MEV: Telephone Biogeme snapshot
9.1 Mixtures Challenge: Airline itinerary logit model specification with a random parameter
9.2 Mixtures: Biogeme snapshot alternative specific variance specification
9.3 Mixtures: Biogeme snapshot error component specification
9.4 Mixtures: Biogeme snapshot error component specification
9.5 Mixtures: Biogeme snapshot random coefficient specification
9.6 Mixtures: Biogeme snapshot Log Normal specification
9.7 Mixtures: Biogeme snapshot SB specification
A.1 The choice of airline itinerary: Survey Example
A.2 Facial Expressions Case: Primary Expressions
A.3 Facial Expressions Case: Facial Measures
A.4 Facial Expressions Case: Image Examples
A.5 Facial Expressions Case: Interpretation of Results
Part I
Introduction and Biogeme
Chapter 1
Introduction
The objective of this workbook is to offer the reader a guide to the
application of discrete choice models through case studies. The workbook
is addressed to both academics and practitioners, from beginners to
advanced users. It presents a stepwise approach to building, estimating,
and interpreting a rich variety of models with applications in the fields
of transportation, engineering, marketing and economics. Examples of model
specifications are provided for each case study together with possible
interpretations of the estimation results. The model building process is
illustrated step by step, starting with the simplest model and then adding
complexity to it. The purpose of providing these models is to illustrate
an iterative model specification process and to inspire the reader to
continue the model development process. The workbook does not replace the
theoretical treatment of discrete choice models, and should be used as a
companion to Ben-Akiva and Lerman (1985). Direct references to Ben-Akiva
and Lerman (1985) are therefore provided in each case study, and the
theoretical material in this document is consequently kept to a minimum.
The case studies start with simple binary logit models and then move on
to more complex models such as Generalized Extreme Value and Mixtures of
Logit models. An integral part of the workbook is the treatment of
forecasting, specification testing, estimation of models based on revealed
preference (RP) and stated preference (SP) data, and panel data. The
workbook includes the following chapters:
Chapter 2 presents an introduction to the freeware Biogeme which is
used for the model estimations. This chapter guides the reader through
the installation and utilization of the software. It also provides a small
hands-on example on how to get started and estimate a simple model.
Chapter 3 gives an introduction to model building and discusses some
general guidelines on how to work with the case studies.
Chapters 4 and 5 treat the binary logit model and the logit model,
respectively. These chapters are very important: they present the
standard and most widely used models in the field of discrete choice
modeling. Moreover, extensive hand-holding is provided in order to
familiarize the reader with Biogeme.
Chapter 6 deals with specification testing; it includes several important
topics such as the McFadden IIA test, non-linear specification tests, the
non-nested hypothesis test, and the market segmentation test.
Chapter 7 introduces forecasting techniques that are used to
estimate population market shares and to test policy scenarios.
Chapter 8 treats the specification and estimation of Multivariate
(Generalized) Extreme Value Models and includes the Nested Logit and
Cross Nested Logit models. These models are very useful in building
intuition for the understanding of the more complex techniques handled in
Chapter 9.
Chapter 9 deals with mixtures of logit models, which represent the state
of the art in discrete choice modeling. This chapter includes several
specifications: alternative specific variance models, error component,
random coefficient, and Mixed GEV models.
Chapter 10 treats the simultaneous estimation of models based on re-
vealed and stated preference data.
Appendix A contains the descriptions of the datasets.
The following five datasets have been used in the case studies:
Netherlands Mode Choice: Data on intercity travelers' choices between
the transport modes of rail and car.
Choice-Lab-Fashion: Data on clients of a business-to-business firm that
collects and processes financial and customer data for its clients in
the fashion industry. The dataset includes choices of which information
products were purchased by the client over time as well as the choice
to remain a client or to drop out.
Residential Telephone Choice: Data on households' choices of local
telephone services.
Swissmetro: Data on travelers' choices of transport mode among a
proposed underground system (Swissmetro), train and car.
Airline Itinerary: Data on travelers' rankings of different airline itineraries.
Table 1.1 indicates the use of the datasets with respect to the different case
studies.
In addition, the following two datasets are provided:
Italy Mode Choice: Data on travelers' choices between the transport
modes rail, bus and car.
Facial Expression Recognition: Data on people's interpretations of
facial expressions.
Table 1.2 indicates the type of models that can be specified with the different
datasets.
Type of Model            Dataset
                         Netherlands    Choice-Lab    Residential    Swissmetro    Airline
                         Mode Choice    Fashion       Telephone                    Itinerary
Binary Logit
Logit
Specification Testing
Forecasting
MEV
Mixtures of Logit
RP/SP
Table 1.1: Datasets and applications
Type of Model            Dataset
                         Italy          Facial
                         Mode Choice    Expressions
Binary Logit
Logit
Specification Testing
Forecasting
MEV
Mixtures of Logit
RP/SP
Table 1.2: Datasets and applications
Chapter 2
Biogeme
BIerlaire Optimization toolbox for GEv Model Estimation (Biogeme) is
freeware designed for the estimation of logit models, nested logit models and
more complex models in the Multivariate Extreme Value (MEV) family, as
well as mixtures of these models (e.g. mixed logit). All information related
to Biogeme is maintained at:
http://biogeme.epfl.ch
2.1 Install Biogeme
There is a graphical version available for Windows and for Mac OS X; no
installation is needed in order to run this version. Simply download Biogeme
from the web page or the course USB stick and save the file to a directory
of your choice. Then double-click on the winBiogeme.exe file to
start Biogeme.
In the remainder of this section, we describe how to install the command
line version of Biogeme under Windows. For installation under any other
platform, we refer the reader to the Biogeme home page.
1. Open a DOS window (from the Start menu, select Run. In the dialog
box, type cmd and select OK).
2. In order to use Biogeme from any directory on your computer, you
need to place the program in a directory that is in your path (an
environment variable). To find out which directories are in your path, type
path (in the DOS window) and press the Enter key. An example is
given in Figure 2.1, where there are several possible directories, for
example C:\WINDOWS\system32 or C:\WINDOWS. Note that the directories
are separated by a ; character.
Figure 2.1: DOS example of choosing a path
3. Select a directory in your path, for example C:\WINDOWS.
4. Download Biogeme from the web site or copy it from the course USB
stick to the chosen directory. The following files should be available:
winBiogeme.exe, biogeme.exe, and biosim.exe.
5. To check if the installation has been successful, just type biogeme in
the DOS window. A message displaying the version of Biogeme should
then appear (this is shown in Figure 2.2).
6. Please do not forget to register with the users' group at:
http://groups.yahoo.com/group/biogeme/
Here you can find answers to frequently asked questions as well as
information on new versions of the software.
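The path check in step 2 can also be done programmatically. The following is a minimal Python sketch, not part of the Biogeme distribution; the helper name split_path is hypothetical. It splits a path string on the separator character and returns the directories it lists.

```python
import os


def split_path(path_value, separator=os.pathsep):
    """Return the non-empty directories listed in a PATH-style string.

    On Windows the separator is ';', as noted in step 2.
    """
    return [entry for entry in path_value.split(separator) if entry]


if __name__ == "__main__":
    # List every directory currently in the PATH environment variable.
    for directory in split_path(os.environ.get("PATH", "")):
        print(directory)
```

Any directory printed by this script is a candidate location for biogeme.exe in step 3.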
Figure 2.2: Output after correctly installing Biogeme
2.2 Invoke Biogeme under Windows
Biogeme is invoked in a DOS command window or a Cygwin command window
under Windows using the following statement structure:
biogeme model_file sample_file.dat
Note that the model file is given without the file extension while the sample
file is given with its extension. When typing this command, the files are
assumed to be located in the current directory.
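The extension rule is easy to get wrong when the command is assembled by a script. The sketch below is an illustration only; the function name biogeme_command and the file names are hypothetical. It builds the argument list, dropping the extension from the model file and keeping it for the sample file.

```python
from pathlib import PureWindowsPath


def biogeme_command(model_file, sample_file):
    """Build the Biogeme argument list for files in the current directory.

    The model file name is passed without its extension; the sample
    file name is passed unchanged, extension included.
    """
    model_name = PureWindowsPath(model_file).stem  # "mymodel.mod" -> "mymodel"
    return ["biogeme", model_name, sample_file]


# Example with hypothetical file names.
print(" ".join(biogeme_command("mymodel.mod", "sample.dat")))
```

The resulting list could be handed to subprocess.run from the directory that contains both files, provided biogeme.exe is in the path as described in Section 2.1.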
Some useful DOS commands are listed below:
To select a drive (e.g. C), just type C: at the prompt.
To connect to a directory (e.g. C:\Biogeme), just type cd C:\Biogeme.
To see the content of a directory, use Windows Explorer, or type dir .
An example of DOS commands is given in Figure 2.3. The current directory
in the example is first C:\Documents and Settings\Emma Frejinger.
When typing the command dir, the content of this directory is displayed.
In order to move to the directory My Documents, the command cd "My
Documents" is used (note that the quotation marks are optional). Finally,
the current directory is:
C:\Documents and Settings\Emma Frejinger\My Documents.
In order to return to the previous (parent) directory, type cd .. at the prompt.
Figure 2.3: DOS example of commands
2.3 Install Emacs
To use Biogeme, you need a text editor. Wordpad is fine, but Emacs is
recommended. Note that Notepad¹ should not be used. If you want to
install Emacs (which is window driven), the procedure is the following:
1. Create a directory for Emacs, for example C:\Emacs
2. Download Emacs for Windows from the web site
http://www.gnu.org/software/emacs/
or copy the file Emacs-23.2.zip from the USB stick.
3. Unzip the file into the directory.
4. In the subdirectory bin, execute addpm.exe.
5. Emacs is now available from the Windows Start menu:
Start -> Programs -> Gnu Emacs -> Emacs
¹ Notepad adds characters at the end of each line that Biogeme cannot read.
2.4 Input Files
Biogeme reads the following files:
a file containing the model specification: model_file.mod;
a file containing the data: sample_file.dat;
a file containing the parameters controlling the behavior of Biogeme
and of its optimization algorithms: default.par.
The model and data files are essential, while the parameter file in general
does not need to be edited (it is created with default values when Biogeme
is invoked).
Model Specification File
You can take a look at the examples on the USB stick and read the instructions
given on the website http://biogeme.epfl.ch under the menus Biogeme and
Examples to understand the details of this file. In general, the specification
of the model file is explained in each case study. Here we list some
important facts for the labs.
Variable names are case-sensitive and should be typed exactly as they
appear in the list of variable names in the corresponding data file.
Every string in the file must be followed by a blank space (even if it is
followed by a parenthesis).
Starting values, lower bounds and upper bounds for all model parameters
to be estimated should be in float format (including the decimal point).
If there is an Alternative Specific Constant (ASC) defined for each
utility function, at least one of these must be fixed (typically set to
zero), or absent from the model.
0.0 is a reasonable starting value for ASCs and other parameters in
the utility functions.
Data File
All data files needed for the labs are provided on the USB stick. Their
structure is the following:
The first row contains the list of the variables in the file (names are
case-sensitive).
Each subsequent row contains the associated data, one row for each
observation.
No missing value is allowed and all rows must have the exact same
number of entries. If a value is missing, a meaningless value must be
written (e.g. 99999.9).
Typical information for a given observation is:
the observed choice;
the description of the choice set through attributes describing the
availability of each alternative;
the attributes of each alternative; and
the socio-economic characteristics of the decision-maker.
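The structural rules above can be checked before a data file is handed to Biogeme. The following Python sketch is an illustration under two assumptions that are not stated in the text: entries are separated by whitespace and every data entry is numeric, as in the lab data files. The function name check_data_rows is hypothetical.

```python
def check_data_rows(text):
    """Check a whitespace-separated data file held in a string.

    Returns the header (list of variable names) and a list of problems:
    rows whose number of entries differs from the header, and entries
    that are not numeric. Line numbers count non-blank lines only.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        entries = line.split()
        if len(entries) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} entries, found {len(entries)}"
            )
            continue
        for name, entry in zip(header, entries):
            try:
                float(entry)
            except ValueError:
                problems.append(f"line {number}: {name} is not numeric: {entry!r}")
    return header, problems
```

A row with a missing entry is reported instead of being silently misread; the remedy described above is to write a placeholder value such as 99999.9 in its place.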
Parameters File
This file is divided into different sections associated with different types of
parameters. Each section contains a list of parameters and their corresponding
values. The most useful parameters for standard users are defined in the
section [GEV], in particular the following ones:
gevAlgo which allows selection of the optimization algorithm to be
used for the maximum likelihood estimation;
gevTtestThreshold which sets the threshold for the t-test hypothesis
tests on explanatory variables in the model.
This is an example of a parameter file:
[GEV]
gevAlgo=CFSQP
//gevAlgo=SOLVOPT
//gevAlgo=DONLP2
//gevAlgo=BIO
gevTtestThreshold=1.96
The remaining sections are designed for advanced users, giving them the flexibility to
change the parameters' default values in the different optimization algorithms.
Note that if you do not specify a parameter file, Biogeme will create a default
one called default.par where the BIO algorithm is selected.
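To see which settings are active in a parameter file, the layout of the example above can be read with a few lines of Python. This is a sketch only: the syntax (section headers in square brackets, name=value lines, // comments) is inferred from the example, and the function name read_parameters is hypothetical.

```python
def read_parameters(text):
    """Parse 'name=value' lines grouped under [Section] headers.

    Lines starting with // are treated as comments and skipped, so only
    the active settings are returned. Values are kept as strings.
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            sections[current][name.strip()] = value.strip()
    return sections
```

Applied to the example file, only gevAlgo=CFSQP and gevTtestThreshold=1.96 are returned, since the other algorithm lines are commented out.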
2.5 Output Files
Biogeme automatically generates several output files, which are described below.
The most important is mymodel.html, which contains the estimation
results and some statistics in an easily readable format.
A file containing the results of the maximum likelihood estimation:
mymodel.rep.
The same file in HTML format: mymodel.html.
A file containing the specification of the estimated model in the same
format as the model specification file mymodel.mod: mymodel.res.
A file containing some descriptive statistics on the sample such as the
number of excluded observations, the total number of observations,
details of group membership, etc.: mymodel.sta.
The following files are provided in order to help diagnose possible problems:
A file containing messages produced by Biogeme during the run: mymodel.log.
A file containing the specification of the model as it has actually been
understood by Biogeme: speFile.debug.
A file containing the data stored in Biogeme to represent the model:
model.debug.
A file containing the values of the parameters which have been actually
used by Biogeme: parameters.out.
These filenames may be modified according to the following rules:
1. If an input file mymodel.xxx does not exist, Biogeme attempts to open
the file default.xxx. If this file does not exist, Biogeme exits with an
error. Typically, the parameter file is not model dependent. Therefore,
it is recommended to call it default.par to avoid copying it for each
different model to be estimated.
2. If an output file mymodel.xxx already exists, Biogeme does not overwrite
it. Instead, it creates the file mymodel~1.xxx. If the file mymodel~1.xxx
exists, Biogeme creates the file mymodel~2.xxx, and so on.
To avoid any ambiguity, Biogeme displays the filenames actually
used for a specific run.
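Rule 2 can be illustrated with a short Python sketch that reproduces the naming scheme as described here. It is not Biogeme's own code, and the function name output_filename is hypothetical.

```python
def output_filename(stem, extension, existing):
    """Return the name an output file would receive under rule 2.

    'existing' is the collection of file names already present. The
    first free name among stem.ext, stem~1.ext, stem~2.ext, ... is
    returned.
    """
    candidate = f"{stem}.{extension}"
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = f"{stem}~{counter}.{extension}"
    return candidate
```

After two earlier runs of the same model, for example, the third HTML report would be named mymodel~2.html, so the file with the highest counter is the most recent one.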
If you want more detailed information on the output files generated by Biogeme,
see the menu Biogeme on the website http://biogeme.epfl.ch.
2.6 Step-by-Step Example
In order to help first-time users of Biogeme, we provide in this section a simple
example where we go through the estimation of a model step by step. The
example works through the estimation of a binary logit model of travelers'
choices between auto and rail for intercity trips (Netherlands mode choice
dataset). It uses a dataset of 223 travelers. For each traveler, the chosen mode
(either rail or auto) for a particular trip was collected, as well as the travel
times and travel costs of both the traveler's rail alternative and the traveler's
auto alternative. These travel times and travel costs are used as explanatory
variables for the model, and the deterministic utility specifications are
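$$V_{\text{car}} = \text{ASC}_{\text{car}} + \beta_{\text{cost}}\,\text{cost}_{\text{car}} + \beta_{\text{time}}\,\text{time}_{\text{car}}$$
$$V_{\text{rail}} = \beta_{\text{cost}}\,\text{cost}_{\text{rail}} + \beta_{\text{time}}\,\text{time}_{\text{rail}}.$$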
The example works through 4 steps: (1) examining the data file, (2) examining
the model specification file, (3) estimating the model, and (4) examining
the outputs. As you go through the example, make sure that you know where
the referenced files are located, how to open the files, and the basic contents
of each of the files.
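Before turning to the files, it may help to see how the two deterministic utilities translate into choice probabilities. The sketch below evaluates the standard binary logit formula, P(car) = 1 / (1 + exp(V_rail - V_car)), in Python. The function name is hypothetical, and the parameter values in the usage lines are arbitrary placeholders chosen for illustration, not estimates from this dataset.

```python
import math


def binary_logit_probabilities(asc_car, beta_cost, beta_time,
                               car_cost, car_time, rail_cost, rail_time):
    """Return (P(car), P(rail)) for one traveler under a binary logit model."""
    # Deterministic utilities; the rail constant is normalized to zero.
    v_car = asc_car + beta_cost * car_cost + beta_time * car_time
    v_rail = beta_cost * rail_cost + beta_time * rail_time
    # P(car) = exp(V_car) / (exp(V_car) + exp(V_rail))
    p_car = 1.0 / (1.0 + math.exp(v_rail - v_car))
    return p_car, 1.0 - p_car


# Placeholder parameters (NOT estimated values) applied to one traveler
# whose car trip costs 5 and takes 1.167, and whose rail trip costs 40
# and takes 2.5.
p_car, p_rail = binary_logit_probabilities(
    asc_car=0.0, beta_cost=-0.1, beta_time=-1.0,
    car_cost=5.0, car_time=1.167, rail_cost=40.0, rail_time=2.5,
)
print(p_car, p_rail)
```

With all parameters set to 0.0, the starting values used for estimation, both probabilities equal 0.5; estimation adjusts the parameters so that the predicted probabilities fit the observed choices.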
Step 1: Model and Data Files
Before using Biogeme, you need to specify a model according to the data
file². In the case studies, you never need to modify the data file, but you
need to specify your model file accordingly. An example of a data file is given
in Figure 2.4, where the first six and last five rows of the complete file are
shown.
id choice rail_cost rail_time car_cost car_time
1 0 40 2.5 5 1.167
2 0 35 2.016 9 1.517
3 0 24 2.017 11.5 1.966
4 0 7.8 1.75 8.333 2
5 0 28 2.034 5 1.267
219 1 35 2.416 6.4 1.283
220 1 30 2.334 2.083 1.667
221 1 35.7 1.834 16.667 2.017
222 1 47 1.833 72 1.533
223 1 30 1.967 30 1.267
Figure 2.4: Example of Data File
Each row in the data file corresponds to one observation, except the first
one, which contains the column names. The first column, id, contains a unique
identifier of the observation. The column named choice shows which alternative
has been chosen. In this example, there are two alternatives, train and
car. The choice is coded with a variable taking the value 0 if car is chosen
and 1 if train is chosen. It can be seen that in the first five observations the
car alternative has been chosen, and in the last five observations the train
alternative has been chosen. The other four columns contain the values of
the alternative attributes: rail_cost, rail_time, car_cost and car_time.
² The files can be edited and viewed with a text editor such as Wordpad or GNU Emacs.
Note that Notepad should not be used.
Based on this data file, we specify a binary logit model containing the cost
and travel time attributes as well as the alternative specific constants (the
constant of one alternative is normalized to zero). This simple model
specification file is shown in Figure 2.5. (Comments in the file are given after
//.)
The section [Choice] denes in which column Biogeme can nd the identi-
er of the chosen alternative. In this example, the column name is choice.
In section [Beta], we dene the parameters that are included in the utili-
ties. Here we have four parameters; two alternative specic constants named
ASC CAR and ASC RAIL as well as the cost (BETA COST) and travel time
(BETA TIME) parameters. In addition to the name of each parameter, we
specify:
default value that will be used as a starting point for the estimation,
normally set to 0.0;
lower and upper bounds: normally you can keep -100.0 and 100.0.
These bounds serve as safe-guards for the algorithm; and
status variable that is 0 if the parameter should be estimated and 1 if
it should be set to the default value.
In this example, we estimate all the parameters except the alternative specic
constant of the rail alternative which is set to zero.
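To make the role of the status column concrete, here is a small Python
sketch that reads the [Beta] lines of the example model file (Figure 2.5)
and separates estimated from fixed parameters. It is only an illustration
of the file layout described above, not Biogeme's own parser; the function
name parse_beta is hypothetical.

```python
# The [Beta] lines of the example model file, copied from Figure 2.5.
BETA_SECTION = """\
// Name DefaultValue LowerBound UpperBound status
ASC_CAR 0.0 -100.0 100.0 0
ASC_RAIL 0.0 -100.0 100.0 1
BETA_COST 0.0 -100.0 100.0 0
BETA_TIME 0.0 -100.0 100.0 0
"""


def parse_beta(text):
    """Return {name: settings} for each parameter line of a [Beta] section."""
    params = {}
    for line in text.splitlines():
        line = line.split("//", 1)[0].strip()  # drop comments after //
        if not line:
            continue
        name, default, lower, upper, status = line.split()
        params[name] = {
            "default": float(default),
            "bounds": (float(lower), float(upper)),
            "fixed": status == "1",  # status 1: kept at the default value
        }
    return params


params = parse_beta(BETA_SECTION)
estimated = [name for name, p in params.items() if not p["fixed"]]
fixed = [name for name, p in params.items() if p["fixed"]]
print("estimated:", estimated)
print("fixed:", fixed)
```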
In section [Utilities], we specify the deterministic parts of the
utilities. Each row corresponds to one alternative and we need to specify:
• identifier of the alternative, which must be coherent with the identifier
  given in section [Choice], in our case 0 and 1;
• name of the alternative (can be arbitrarily chosen);
• availability of the alternative: here both alternatives are always
  available, so this value is set to one. Biogeme understands what one
  means because it is specified in the [Expressions] section;
• linear-in-parameter specification of the deterministic part of the
  utility, that is, a list of terms separated by a +. Each term is composed
  of the name of a parameter (as defined in the [Beta] section) and the
  name of a variable (as defined in the data file). The names of the
  variables and parameters must be written exactly in the same way as
  defined in the data file and [Beta] section, respectively.

[Choice]
choice
[Beta]
// Name DefaultValue LowerBound UpperBound status
ASC_CAR 0.0 -100.0 100.0 0
ASC_RAIL 0.0 -100.0 100.0 1
BETA_COST 0.0 -100.0 100.0 0
BETA_TIME 0.0 -100.0 100.0 0
[Utilities]
//Id Name Avail linear-in-parameter expression
0 Car one ASC_CAR * one + BETA_COST * car_cost +
BETA_TIME * car_time
1 Rail one ASC_RAIL * one + BETA_COST * rail_cost +
BETA_TIME * rail_time
[Expressions]
// Define here arithmetic expressions for name that are not directly
// available from the data
one = 1
[Model]
// Currently, only $MNL (multinomial logit), $NL (nested logit), $CNL
// (cross-nested logit) and $NGEV (Network GEV model) are valid keywords
//
$MNL
Note that there should be one line in the [Utilities] section for each
alternative in the model file (they are split in two here because of the
size).
Figure 2.5: Biogeme Example of Model File
In section [Expressions], you can define expressions that appear in the
availability conditions or utility functions. Here we have only specified
that one means the numerical value one.

Finally, we need to specify which type of discrete choice model we want to
estimate, in this case a logit model (also known as multinomial logit, MNL).
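The following Python sketch illustrates what the model specification
implies for one observation: the utilities of Figure 2.5 are evaluated and
turned into logit choice probabilities. The parameter values in beta are
hypothetical placeholders chosen for illustration (they are not estimates),
and the function name logit_probabilities is our own.

```python
import math


def logit_probabilities(utilities):
    """Return logit choice probabilities for a dict {alternative id: V}."""
    v_max = max(utilities.values())  # subtract the maximum for numerical stability
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}


# Hypothetical parameter values, for illustration only (not estimation results).
beta = {"ASC_CAR": 0.5, "ASC_RAIL": 0.0, "BETA_COST": -0.1, "BETA_TIME": -1.0}

# First observation of the data file in Figure 2.4.
obs = {"rail_cost": 40.0, "rail_time": 2.5, "car_cost": 5.0, "car_time": 1.167}

# Utilities as written in the [Utilities] section (0 = Car, 1 = Rail).
V = {
    0: beta["ASC_CAR"] + beta["BETA_COST"] * obs["car_cost"]
       + beta["BETA_TIME"] * obs["car_time"],
    1: beta["ASC_RAIL"] + beta["BETA_COST"] * obs["rail_cost"]
       + beta["BETA_TIME"] * obs["rail_time"],
}
P = logit_probabilities(V)
print({alt: round(p, 4) for alt, p in P.items()})
```

With only two alternatives, the probability of car reduces to
1 / (1 + exp(V_rail - V_car)), so only the difference in utilities matters.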
Now we have a data file that we name data.dat and a model file named
model.mod. Both files are saved to the same directory. Here we have chosen
to save them to C:\BiogemeFiles.
Step 2: Model Estimation
Under Windows, Biogeme is invoked in a DOS command window (see footnote 3).
First of all, you have to go to the directory where you have placed the
model and data files. Figure 2.6 shows the procedure for this example (the
command cd changes the current directory and the command dir displays the
content of the current directory). Second, when the current directory is
the one containing the model and data files, Biogeme can be invoked with
the command:
biogeme model data.dat
Note that the model file is given without file extension while the data
file is given with it. After the estimation is finished, Biogeme displays
the file names it has actually used for the estimation as well as the names
of the result files. All the result files are placed in the current
directory, that is, the directory where you have the model and data files.
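If you prefer to launch the estimation from a script instead of typing the
command, the same call can be sketched in Python as below. This assumes
that the executable is reachable on the PATH under the name biogeme, as in
the command above; the helper names biogeme_command and run_biogeme are
hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def biogeme_command(model_file, data_file):
    """Build 'biogeme <model without extension> <data file with extension>'."""
    return ["biogeme", Path(model_file).stem, Path(data_file).name]


def run_biogeme(workdir, model_file="model.mod", data_file="data.dat"):
    """Run Biogeme in workdir, where both input files must be located."""
    workdir = Path(workdir)
    for name in (model_file, data_file):
        if not (workdir / name).is_file():
            raise FileNotFoundError(f"{name} not found in {workdir}")
    if shutil.which("biogeme") is None:  # assumption: executable is on the PATH
        raise RuntimeError("biogeme executable not found on PATH")
    # The result files are written to the current directory, hence cwd=workdir.
    return subprocess.run(biogeme_command(model_file, data_file),
                          cwd=workdir, check=True)


print(biogeme_command("model.mod", "data.dat"))
```

Calling run_biogeme(r"C:\BiogemeFiles") would then correspond to the
example in this section.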
Footnote 3: The DOS command window can be opened by choosing Run... under
the Start Menu and then typing cmd.
Figure 2.6: Biogeme Example of DOS commands
Step 3: Estimation Results
For our example, Biogeme writes the following information after the
estimation is completed:
Biogeme Input files
===================
Parameters: default.par
Model specification: model.mod
Sample 1 : data.dat
Biogeme Output files
====================
Estimation results: model.rep
Estimation results (HTML): model.html
Result model spec. file: model.res
Sample statistics: model.sta
Biogeme Debug files
===================
Screen copy: model.log
Parameters debug: parameters.out
Model debug: model.debug
Model spec. file debug: __specFile.debug
Model informations: Multinomial Logit Model
==================
The minimum argument of exp was -3.45471
Note that there are three input files. In addition to the model and data
file, there is a file named default.par that contains the parameters which
control Biogeme. Since we did not provide such a file, Biogeme
automatically creates one with the default settings.

The estimation results can be found in model.html. This file contains the
same information as the model.rep file, but is written in HTML format,
which can conveniently be opened in any browser such as Mozilla Firefox or
Internet Explorer. There are two other result files:
• model.res, containing the specification of the estimated model in the
  same format as the model specification file (here model.mod); and
• model.sta, containing data statistics.
A copy of the messages displayed in the DOS command window can be found in
the model.log file. If you have problems with your estimation, you can
consult the debug files: model.log, parameters.out, model.debug and
__specFile.debug. See section 2.5 for more information on these files.
2.7 BioSim
BioSim is a package provided with Biogeme that can be used for computing
predicted probabilities. BioSim is invoked exactly like Biogeme. BioSim can
compute predicted probabilities for all model types that can be estimated
with Biogeme, as long as it is not a panel data setting. BioSim is used in
the case study on forecasting, Chapter 7.

Below we indicate how to use BioSim if you have a model named mymodel.mod
that you have just estimated with Biogeme using the data file mydata.dat.
1. Rename the result file mymodel.res to mymodel_res.mod. This file
   contains the estimated parameter values to be used for computing the
   probabilities.
2. Invoke BioSim with the command:
   biosim mymodel_res mydata.dat
3. BioSim reports the results in the file mymodel_res.enu. Each line in
   this file corresponds to a line in the data file. It is important to
   note that only observations that have been used in the
   estimation/simulation are reported in the mymodel_res.enu file. That is,
   if you have excluded observations in the [Exclude] section, these
   observations are not present in the .enu file.
   For each observation, BioSim reports the probability for the chosen
   alternative as well as the probability for each alternative in the
   choice set.
4. If you want to analyze the BioSim output file with software such as
   Excel, then save the file in text format .txt.
See menu Biogeme of website http://biogeme.epfl.ch for more details on
BioSim.
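Steps 1 and 2 above can be sketched in Python as follows. Note one
deliberate difference: the sketch copies mymodel.res instead of renaming
it, so that the original result file is kept. The helper name
prepare_biosim is hypothetical, and the demonstration uses a placeholder
file in a scratch directory rather than a real estimation result.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_biosim(workdir, model_name, data_file):
    """Create <model>_res.mod from <model>.res and return the BioSim command."""
    workdir = Path(workdir)
    result_file = workdir / f"{model_name}.res"
    sim_model = workdir / f"{model_name}_res.mod"
    shutil.copyfile(result_file, sim_model)  # copy, keeping the .res file
    # BioSim is invoked like Biogeme: model name without extension, then data file.
    return ["biosim", sim_model.stem, data_file]


# Demonstration in a scratch directory with a placeholder result file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mymodel.res").write_text("// placeholder for an estimated model\n")
    command = prepare_biosim(tmp, "mymodel", "mydata.dat")
    created = (Path(tmp) / "mymodel_res.mod").is_file()

print(command, created)
```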
Part II
Case Studies
Chapter 3
Introduction to Model Building
The process of building models is not straightforward and requires
knowledge of theory (e.g. consumer theory in the case of marketing),
statistical tools, as well as subjective judgment from the model builder.
Hence, it is not possible to give an exact algorithm for how to build
models, but there are some guidelines. Chapter 7 in Ben-Akiva and Lerman
(1985) contains good advice and procedures for model development. Based on
this chapter, we give below some general guidelines on how to approach the
case studies (see the introduction of each case study for specific
guidelines).
• Start each case study by studying the provided model specifications (the
  .mod files). Try to understand the underlying assumptions and how these
  assumptions are modeled.
• Estimate the example models with Biogeme and analyze the result files
  (.html). Compare your interpretations with those provided.
• Continue the model development and formulate your own assumptions (select
  the variables to include and how they should affect the utilities) and
  modify the model file accordingly. Estimate this modified model. Examples
  of questions to ask yourself after the estimation are: Do the
  coefficients you included have the expected signs? Are they significantly
  different from zero? Does the new model have a better model fit than the
  original one?
It is important for the model to have an intuitive interpretation. It is
always possible to improve the model fit by adding parameters to the
specification, but the model has to have an intuitive interpretation in
order to be useful.

As you go through the case studies, you become familiar with more and more
modeling concepts and statistical tools for analyzing your models.
Consequently, you are able to perform more and more sophisticated analysis
of your models.

The Binary Logit case study deals with different specifications of the
attributes, generic versus alternative specific attributes, as well as
including socio-economic characteristics of the decision-maker. In the case
study on Logit, we show how to statistically test (log-likelihood ratio
test) if an unrestricted model is significantly better than a restricted
one. More statistical tests are introduced in the Specification Testing
case study, where we discuss market segmentation and testing of correlation
among alternatives. We also test different ways of including variables in
the model. Before working on more general models allowing for correlation
among alternatives in the Multivariate Extreme Value Models case study, we
show how to use discrete choice models for forecasting. The specification
of error component and random coefficients models is covered in the case
study on Mixtures of Logit models. Finally, we discuss simultaneous RP/SP
estimation in the last case study.
3.1 Practical Information
Before starting a case study for the first time, you need to have the
following programs installed:
• The latest version of Biogeme, which is the reference estimation
  software. It is distributed on the course USB stick and on the Biogeme
  website (http://biogeme.epfl.ch). The installation process is described
  in section 2.1, page 21.
• A text editor for editing the model files and for reading the data;
  Wordpad works even though we prefer to work with GNU Emacs. Section 2.3
  (page 24) shows how to install Emacs on your computer. Please note that
  Notepad should not be used.
The programs that you need to use when working on the case studies are:
• Biogeme, which you can use in two different ways: with the graphical user
  interface (GUI) (for Windows or Mac OS X) or the command line version.
  How to invoke Biogeme is described in section 2.2 on page 23.
• Emacs or Wordpad.
• It is convenient to use Windows Explorer for opening the results files.
  Otherwise, the .html result file can be opened directly with Internet
  Explorer (or another browser of your choice).
Note that depending on the choice of optimization algorithm in Biogeme, the
estimation results can differ slightly. See section 2.4 for more details on
how to specify the Biogeme parameters.
Chapter 4
Binary Logit
This case study deals with the estimation of Binary Logit (BL) models using
a dataset of your interest. The case study will help you to get familiar
with the estimation techniques and the basic statistical tests used in the
specification process of BL models.

For this case study, you can choose between the Choice-Lab-Fashion
Marketing, the Netherlands Mode Choice and the Airline Itinerary datasets.
A detailed description of each dataset can be found in Appendix A.

Before starting the case study, read the general introduction to the case
studies in Chapter 3. The introduction discusses how to go through the case
study and gives you some guidelines on the model building process.

The examples of model specifications that we have provided can be found in
the following sections: Choice-Lab-Fashion Marketing in section 4.2 on page
46, Netherlands Mode Choice in section 4.3 on page 52 and Airline Itinerary
in section 4.4 on page 57.
4.1 Challenge Question
The Netherlands mode choice dataset. This case study deals with the
estimation of a mode choice behavior model for intercity travelers using
revealed preference data. The survey was conducted during 1987 for the
Netherlands Railways to assess factors that influence the choice between
rail and car for intercity travel.

Context. Nijmegen is a small city in the eastern side of the Netherlands
near the border with Germany. The city has typical rail connections with
the major cities in the western metropolitan area called the Randstad
(which contains Amsterdam, Rotterdam and The Hague). Trips from Nijmegen to
the Randstad take approximately two hours by both rail and car. A binary
choice model can be developed to model the mode choice of travelers for
intercity travel.

Data description. Please read Appendix A.2 of the workbook for details.

Files to use with Biogeme:
Model file: BL_NL_socioec_g2.mod
Data file: netherlands.dat

After estimating two models that only include variables that were
attributes of the alternatives, someone would like to test if a
socioeconomic variable gender, which indicates the respondent's gender, has
any impact in the model. He came up with the following model:
$$V_{car} = ASC_{car} + \beta_{time\_car} \, car\_time + \beta_{cost} \, car\_cost + \beta_{gender1} \, gender$$
$$V_{rail} = \beta_{time\_rail} \, rail\_time + \beta_{cost} \, rail\_cost + \beta_{gender2} \, gender$$

The variable is categorical and equals one if the gender is female and zero
if male.
The model is estimated in Biogeme, and the results are listed in Table 4.1.

Estimation results

Parameter  Parameter             Parameter  Robust          Robust
number     name                  estimate   standard error  t statistic
1          $ASC_{car}$            2.85      1.02             2.80
2          $\beta_{cost}$        -0.130     0.0265          -4.89
3          $\beta_{gender1}$     -0.338     5.80e+06         0.00
4          $\beta_{gender2}$      0.338     5.80e+06         0.00
5          $\beta_{time\_car}$   -2.34      0.495           -4.73
6          $\beta_{time\_rail}$  -0.529     0.414           -1.28

Summary statistics
Number of observations = 228
$L(0)$ = -158.038
$L(\hat{\beta})$ = -115.880
$\bar{\rho}^2$ = 0.229

Table 4.1: Estimation results with socioeconomic characteristics

Question: Do you agree with the above approach? Motivate your answer.
4.2 Choice-Lab-Fashion Marketing Case
Binary Logit with Customer Characteristics
Files to use with Biogeme:
Model file: BL_Marketing_1.mod
Data file: marketing.dat

In this model, we try to assess which factors characterize customers'
choice of dropping out as clients of Choice-Lab-Fashion. The decision maker
(Choice-Lab-Fashion customer) faces a binary choice: either to remain as a
client or to drop out as a client. The dependent variable (Choice) equals 1
if the customer drops out next year and 0 otherwise. The model is estimated
using the following variables:
• NegProfit: dummy variable for negative profit,
• NegEquity: dummy variable for negative equity,
• LRSC: dummy variable indicating if the legal status of the firm is
  limited responsibility stock owned company,
• LnNbEmpl: natural logarithm of total number of employees, and
• LnAge: natural logarithm of the company's age.
For estimation purposes, we normalize the alternative remain client, and
the estimated coefficients are therefore interpreted relative to it. The
following expressions are the systematic parts of the utilities for the two
alternatives:
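$$V_{remain} = 0$$
$$\begin{aligned}
V_{drop} = {} & ASC_{drop} + \beta_{NegProfit} \, NegProfit + \beta_{NegEquity} \, NegEquity \\
& + \beta_{LRSC} \, LRSC + \beta_{Empl} \, LnNbEmpl + \beta_{Age} \, LnAge.
\end{aligned}$$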
Figure 4.1 shows a snapshot of the Biogeme code that corresponds to the
systematic parts of the utility functions. Section [Choice] indicates the
dependent variable in the dataset, which is the variable identifying the
chosen alternative. The coding of the dependent variable is consistent with
the Id given in section [Utilities]. Section [Beta] lists the parameters
which we intend to use in our systematic utilities. If the status is set to
one, this means that the parameter is kept fixed at its value; otherwise,
it is estimated. This is how we normalize one of the alternative specific
constants (ASC_remain). The parameter names must be exactly the same as
those expressed in the [Utilities] section (note that Biogeme is case
sensitive). In [Utilities], we define the systematic utilities. Since both
options are available to all customers, we have set the availability to be
1 in the [Expressions] section. For further details on Biogeme, see
Chapter 2.

[Choice]
Choice
[Beta]
// Name Value LowerBound UpperBound status (0=variable, 1=fixed)
ASC_remain 0.0 -100.0 100.0 1
ASC_drop 0.0 -100.0 100.0 0
b_NegProfit 0.0 -100.0 100.0 0
b_NegEquity 0.0 -100.0 100.0 0
b_LRSC 0.0 -100.0 100.0 0
b_Empl 0.0 -100.0 100.0 0
b_Age 0.0 -100.0 100.0 0
[Utilities]
// Id Name Avail linear-in-parameter expression
0 Alt1 avail ASC_remain * one
1 Alt2 avail ASC_drop * one + b_NegProfit * NegProfit +
b_NegEquity * NegEquity + b_LRSC * LRSC + b_Empl * LnNbEmpl +
b_Age * LnAge
[Model]
// Currently, only $MNL (multinomial logit), $NL (nested logit), $CNL
// (cross-nested logit) and $NGEV (Network GEV model) are valid
// keywords
$MNL
[Expressions]
// Define here arithmetic expressions for name that are not directly
// available from the data
one = 1
avail = 1
Figure 4.1: Snapshot from the Biogeme code
Estimation results

Parameter  Parameter            Parameter  Robust          Robust
number     name                 estimate   standard error  t statistic
1          $ASC_{drop}$         -0.535     0.0880           -6.08
2          $\beta_{LRSC}$       -0.234     0.0470           -4.97
3          $\beta_{Empl}$       -0.186     0.0143          -12.98
4          $\beta_{Age}$        -0.0973    0.0286           -3.41
5          $\beta_{NegEquity}$   0.185     0.104             1.78
6          $\beta_{NegProfit}$   0.199     0.0483            4.11

Summary statistics
Number of observations = 15934
$L(0)$ = -11044.607
$L(\hat{\beta})$ = -7590.130
$\bar{\rho}^2$ = 0.312

Table 4.2: Estimation results
The estimation results for this first model (BL_Marketing_1.mod) are shown
in Table 4.2. Given our specification, the negative sign of $ASC_{drop}$
can be interpreted as the decision maker preferring to remain a client of
the company. The coefficient $\beta_{Age}$ is negative and statistically
significantly different from zero, indicating that the older the customer
(age of the firm), the less likely it is to leave the company. Note that
the coefficient $\beta_{Age}$ can also capture other effects. Young firms
might be more vulnerable to being closed down given financial difficulties
(in need to cut costs), so this could explain why they decide to drop out.
However, there might also be other viable explanations. For example, new
firms might be interested in buying a one-time list of addresses for direct
marketing purposes (i.e. product 3).

The significant and negative estimate of the coefficient $\beta_{LRSC}$
(limited responsibility stock companies) implies that stock owned limited
responsibility companies are less likely to drop out as clients compared to
non-stock limited responsibility firms.

The coefficient $\beta_{Empl}$ is negative and significantly different from
zero, which implies that larger firms are less likely to drop out. It could
be that large firms are better established in the market, or may be
operating in industries where access to companies' financial information is
key to their success. This could be, for example, banks and financial
institutions. We could also speculate that large companies have larger
client databases and establish credit policies based on credit rating
information provided by Choice-Lab-Fashion. A small company might only buy
a one-time credit rating report for one of its clients, and this might
happen very sporadically.

Of the financial indicator variables, only negative profit is significantly
different from zero. Companies needing to cut costs are more likely to drop
out as clients from Choice-Lab-Fashion, as expected.
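The summary statistics of Table 4.2 can be cross-checked with a few lines
of Python. The sketch below assumes the usual definition of the adjusted
likelihood ratio index, 1 - (L(beta_hat) - K) / L(0), with K the number of
estimated parameters, and assumes that L(0) is the log-likelihood of a
model giving probability 1/2 to each of the two alternatives. Under these
assumptions the reported values are reproduced.

```python
import math


def adjusted_rho_squared(ll_final, ll_null, n_params):
    """Adjusted likelihood ratio index: 1 - (L(beta_hat) - K) / L(0)."""
    return 1.0 - (ll_final - n_params) / ll_null


n_obs = 15934                    # number of observations in Table 4.2
ll_null = n_obs * math.log(0.5)  # two alternatives, each with probability 1/2
rho_bar_sq = adjusted_rho_squared(ll_final=-7590.130, ll_null=-11044.607,
                                  n_params=6)

print(round(ll_null, 3))     # compare with L(0) in Table 4.2
print(round(rho_bar_sq, 3))  # compare with the last line of Table 4.2
```

Because K is subtracted from the final log-likelihood, adding a parameter
only raises the index if the log-likelihood improves by more than one unit.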
Binary Logit with Type of Purchased Product
Files to use with Biogeme:
Model file: BL_Marketing_2.mod
Data file: marketing.dat

In this model, we keep all the independent variables from the previous
model and add a set of variables describing the product purchased by the
decision maker. The idea is to verify if there are any patterns of loyalty
that can be explained by the type of products that clients have purchased.
The systematic parts of the utilities are:
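$$V_{remain} = 0$$
$$\begin{aligned}
V_{drop} = {} & ASC_{drop} + \beta_{NegProfit} \, NegProfit + \beta_{NegEquity} \, NegEquity \\
& + \beta_{LRSC} \, LRSC + \beta_{Empl} \, LnNbEmpl + \beta_{Age} \, LnAge \\
& + \beta_{IndAnalysis} \, IndAnalysis + \beta_{CreditInfo} \, CreditInfo \\
& + \beta_{Accounts} \, Accounts + \beta_{Monitor} \, Monitor \\
& + \beta_{Web} \, Web + \beta_{CD} \, CD + \beta_{CRM} \, CRM + \beta_{Internet} \, Internet \\
& + \beta_{OpenDB} \, OpenDB + \beta_{Other} \, Other.
\end{aligned}$$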
In Table 4.3, we show the estimation results for this model. All product
choice coefficients have a negative sign and are significantly different
from zero. However, they vary in magnitude. The largest coefficient
absolute values are found for the products that provide integrated and web
based services (CRM, Internet and Web), which are the solutions that
provide clients with the most complete and updated data. We could speculate
that these might be solutions that clients use most frequently and that
play an important role in their day to day decisions. The alternative
specific constant $ASC_{drop}$ is positive and significant, compared to
negative and significant in the previous model. This indicates that, all
other things being equal, clients are more likely to drop out than remain
as clients. This should be investigated further.

We have now identified some variables that have a significant impact on
customer drop outs. However, we can provide Choice-Lab-Fashion with an
extra, valuable piece of information: a list of the top 100 clients that
have the highest probability of dropping out in the next year. Since we
have data only until 2002, what we can calculate is the probability that a
client will drop out in 2003. One way of doing so is to divide the dataset
into two samples: a training sample and a test sample. First, we use the
training sample (2000-2001) and estimate the model. Second, we calculate
the predicted probability of dropping out with the test sample (2002) using
the model estimated from the training sample. Third, we list the data in
descending order of predicted probability and pick the 100 clients with the
largest probability. Choice-Lab-Fashion could analyze the listing and
decide for which clients it is worth considering a retention strategy. We
remind the reader that the dataset also includes other variables.
Therefore, it is advisable to improve the specification and run additional
models.
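The second and third steps of this procedure can be sketched in Python as
follows. The client identifiers and utility values below are made up for
illustration; in practice the systematic utility of dropping out would be
computed from the coefficients estimated on the training sample, or the
probabilities would be taken from BioSim. The function names are
hypothetical.

```python
import math


def drop_probability(v_drop):
    """Binary logit probability of dropping out when V_remain = 0."""
    return 1.0 / (1.0 + math.exp(-v_drop))


def top_n_by_probability(clients, n):
    """Return the n clients with the largest predicted drop-out probability."""
    ranked = sorted(clients, key=lambda c: c["p_drop"], reverse=True)
    return ranked[:n]


# Hypothetical test-sample clients with made-up values of V_drop.
test_sample = [{"client_id": i, "v_drop": v}
               for i, v in enumerate([-1.2, 0.4, -0.3, 1.1, -2.0])]
for client in test_sample:
    client["p_drop"] = drop_probability(client["v_drop"])

shortlist = top_n_by_probability(test_sample, n=2)  # n=100 in the case study
print([client["client_id"] for client in shortlist])
```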
Estimation results

Parameter  Parameter              Parameter  Robust          Robust
number     name                   estimate   standard error  t statistic
1          $ASC_{drop}$            1.49      0.115            12.97
2          $\beta_{LRSC}$         -0.169     0.0497           -3.41
3          $\beta_{Empl}$         -0.131     0.0153           -8.55
4          $\beta_{Age}$          -0.216     0.0320           -6.74
5          $\beta_{NegEquity}$     0.322     0.114             2.83
6          $\beta_{NegProfit}$     0.275     0.0519            5.30
7          $\beta_{CRM}$          -2.56      0.707            -3.61
8          $\beta_{Internet}$     -2.24      0.139           -16.16
9          $\beta_{Web}$          -2.25      0.117           -19.26
10         $\beta_{CD}$           -1.80      0.0706          -25.47
11         $\beta_{Monitor}$      -1.02      0.252            -4.04
12         $\beta_{IndAnalysis}$  -1.08      0.0624          -17.36
13         $\beta_{Accounts}$     -1.04      0.0635          -16.36
14         $\beta_{Other}$        -0.613     0.0539          -11.37
15         $\beta_{CreditInfo}$   -0.568     0.0540          -10.51

Summary statistics
Number of observations = 15934
$L(0)$ = -11044.607
$L(\hat{\beta})$ = -6717.804
$\bar{\rho}^2$ = 0.390

Table 4.3: Estimation results
4.3 Netherlands Mode Choice Case
Model Specification with Generic Attributes
Files to use with Biogeme:
Model file: BL_NL_generic.mod
Data file: netherlands.dat

In this first model, we assume that the total travel time (in-vehicle and
out-of-vehicle) and travel cost of the modes are the only factors
influencing the mode choice. We also assume that the coefficients of the
explanatory variables are generic, i.e. they do not vary between
alternatives. The expression of utility for this simple model can be
written as:
$$V_{car} = ASC_{car} + \beta_{time} \, car\_time + \beta_{cost} \, car\_cost$$
$$V_{rail} = \beta_{time} \, rail\_time + \beta_{cost} \, rail\_cost$$
Estimation results

Parameter  Parameter       Parameter  Robust          Robust
number     name            estimate   standard error  t statistic
1          $ASC_{car}$     -0.798     0.275           -2.90
2          $\beta_{cost}$  -0.113     0.0241          -4.67
3          $\beta_{time}$  -1.33      0.354           -3.75

Summary statistics
Number of observations = 228
$L(0)$ = -158.038
$L(\hat{\beta})$ = -123.133
$\bar{\rho}^2$ = 0.202

Table 4.4: Estimation results with generic attributes
The estimation results are shown in Table 4.4. All the estimated
coefficients are statistically significantly different from zero. Looking
at the alternative specific constant, the negative sign indicates that, the
rest of the utilities being equal, car is less preferred than rail.
However, this may be due to the fact that the model is too simple and there
are important variables left out of the model. The negative signs for the
generic coefficients for cost and travel time indicate, as expected, that
the utility perceived by the decision maker for either of the two
alternatives decreases with an increase in cost and travel time.
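The significance statement above can be checked by hand: a t statistic is
the estimate divided by its standard error, and a common rule of thumb
compares its absolute value with 1.96 (two-sided test at the 5% level). The
Python sketch below applies this to the rounded values of Table 4.4. The
1.96 threshold is our assumption, and since the inputs are rounded, the
last digit of a ratio can differ slightly from the t statistic printed by
Biogeme.

```python
# (estimate, robust standard error) as reported in Table 4.4.
TABLE_4_4 = {
    "ASC_car": (-0.798, 0.275),
    "beta_cost": (-0.113, 0.0241),
    "beta_time": (-1.33, 0.354),
}
CRITICAL_VALUE = 1.96  # assumption: two-sided test at the 5% level

t_stats = {name: est / se for name, (est, se) in TABLE_4_4.items()}
significant = {name: abs(t) > CRITICAL_VALUE for name, t in t_stats.items()}

for name, t in t_stats.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name}: t = {t:.2f} ({verdict})")
```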
Model Specification with Alternative Specific Attributes
Files to use with Biogeme:
Model file: BL_NL_specific.mod
Data file: netherlands.dat

In the second specification, we relax the hypothesis of generic travel time
coefficients. The alternative specific coefficients are more relevant if
people perceive a minute spent in one mode to be different from a minute
spent in the other mode. To illustrate this idea, two different travel time
coefficients are introduced for car and rail. The corresponding utility
functions are given below:
$$V_{car} = ASC_{car} + \beta_{time\_car} \, car\_time + \beta_{cost} \, car\_cost$$
$$V_{rail} = \beta_{time\_rail} \, rail\_time + \beta_{cost} \, rail\_cost$$
The estimation results are shown in Table 4.5. This model has a better
adjusted likelihood ratio index than the model with generic travel time
coefficients. However, the coefficient for the travel time of the rail
alternative is not statistically significantly different from zero. The
coefficient for the travel time of the car alternative is negative and
significant as expected, and is also greater in absolute value than the
generic one presented in the previous table (-2.26 vs. -1.33). As in the
previous example, the negative sign indicates that the utility perceived by
the decision maker for the car alternative decreases with the increase of
travel time. However, it appears that travel time does not affect the car
and rail alternatives in the same way. The results indicate that people
have less negative utility for travel time in rail compared to car. This
may be due to the fact that people can make better use of their time when
traveling by rail. The alternative specific constant for the car
alternative now has the reversed sign, denoting increased preference for
car (given everything else the same), which is more intuitive. A likelihood
ratio test can be performed to test whether or not there is a significant
improvement in the goodness-of-fit in the modified specification with
alternative specific coefficients for travel times.

Estimation results

Parameter  Parameter             Parameter  Robust          Robust
number     name                  estimate   standard error  t statistic
1          $ASC_{car}$            2.43      0.973            2.50
2          $\beta_{cost}$        -0.123     0.0256          -4.79
3          $\beta_{time\_car}$   -2.26      0.485           -4.66
4          $\beta_{time\_rail}$  -0.543     0.396           -1.37

Summary statistics
Number of observations = 228
$L(0)$ = -158.038
$L(\hat{\beta})$ = -118.023
$\bar{\rho}^2$ = 0.228

Table 4.5: Estimation results with alternative-specific attributes
Generic vs. Specific Test
The likelihood ratio test (see pages 28 and 164-167 in Ben-Akiva and Lerman
(1985)) can be used to test the generic vs. the alternative-specific
specification. The likelihood ratio test statistic for the null hypothesis
of generic attributes is
$$-2\left(L(\hat{\beta}_{G}) - L(\hat{\beta}_{AS})\right)$$
where G and AS denote the generic and alternative-specific models,
respectively. It is $\chi^2$ distributed with the number of degrees of
freedom equal to the number of restrictions ($K_{AS} - K_{G}$). In this
case, $-2(-123.133 + 118.023) = 10.220$. Since this exceeds
$\chi^2_{0.95,1} = 3.841$ at a 95% level of confidence, we reject the null
hypothesis and conclude that the model with the alternative-specific
coefficients has a significant improvement in fit.
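The arithmetic of this test can be reproduced with the Python sketch below,
using the final log-likelihoods of Tables 4.4 and 4.5 (entered as negative
numbers) and the critical value 3.841 quoted above. The p-value line is an
addition of ours: for one degree of freedom, the chi-square tail
probability equals erfc(sqrt(x / 2)), so no statistics library is needed.

```python
import math


def likelihood_ratio_statistic(ll_restricted, ll_unrestricted):
    """Return -2 * (L(restricted) - L(unrestricted))."""
    return -2.0 * (ll_restricted - ll_unrestricted)


def chi2_tail_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


LL_GENERIC = -123.133   # final log-likelihood, Table 4.4
LL_SPECIFIC = -118.023  # final log-likelihood, Table 4.5
CRITICAL_95_1DF = 3.841  # chi-square critical value quoted in the text

stat = likelihood_ratio_statistic(LL_GENERIC, LL_SPECIFIC)
reject_generic = stat > CRITICAL_95_1DF
p_value = chi2_tail_1df(stat)

print(round(stat, 3), reject_generic, p_value)
```

The number of degrees of freedom is one here because the alternative
specific model has exactly one more parameter than the generic model.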
Model Specification with Socio-Economic Characteristics
Files to use with Biogeme:
Model file: BL_NL_socioec.mod
Data file: netherlands.dat

The previous two models only included variables that were attributes of the
alternatives. We now introduce a socioeconomic variable gender which
indicates the respondent's gender. The variable is categorical and equals
one if the gender is female and zero if male. Since the variable gender
does not vary by alternative (recall that only difference in utility
matters), we have normalized the gender coefficient of the car alternative
to zero. As is shown in the utility function below, the gender variable
only enters the utility of the rail alternative. However, this is an
arbitrary normalization, as we could also have normalized the rail
alternative.
$$V_{car} = ASC_{car} + \beta_{time\_car} \, car\_time + \beta_{cost} \, car\_cost$$
$$V_{rail} = \beta_{time\_rail} \, rail\_time + \beta_{cost} \, rail\_cost + \beta_{gender} \, gender$$
The estimation results are shown in Table 4.6. The results show that there
is a slight improvement in the adjusted likelihood ratio index. The
coefficient of the gender variable is positive and statistically
significant, which indicates that women have higher probability than men of
choosing the rail alternative with respect to the car alternative. The
reader can verify that if we had included the gender variable in the
utility of the car alternative instead of the rail alternative, the
conclusion would remain unchanged. In fact, the results would be exactly
the same. The only difference is that the coefficient would show the
opposite sign. In our case, it would become negative. The interpretation
would be that women would have lower probability than men of using the car
alternative with respect to the train alternative, which is exactly the
same result we had before. Regarding the coefficients of the other
explanatory variables, they are almost unchanged with respect to the
previous model.
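The equivalence of the two normalizations can be illustrated numerically.
In the Python sketch below, the gender coefficient 0.675 is the value
reported in Table 4.6, while base_car and base_rail are hypothetical values
standing in for the remaining terms of the two utilities. Placing the
coefficient in the rail utility, or its negative in the car utility, gives
the same probabilities because only the difference in utilities matters.

```python
import math


def p_rail(v_car, v_rail):
    """Binary logit probability of rail; it depends only on v_rail - v_car."""
    return 1.0 / (1.0 + math.exp(v_car - v_rail))


BETA_GENDER = 0.675               # gender coefficient reported in Table 4.6
base_car, base_rail = -1.0, -1.5  # hypothetical values of the other terms

results = []
for gender in (0, 1):
    # Gender term in the rail utility (the specification used in the text).
    in_rail = p_rail(base_car, base_rail + BETA_GENDER * gender)
    # Same term with the opposite sign in the car utility instead.
    in_car = p_rail(base_car - BETA_GENDER * gender, base_rail)
    results.append((gender, in_rail, in_car))
    print(gender, round(in_rail, 4), round(in_car, 4))
```

The same reasoning shows why a gender term cannot be estimated separately
in both utilities: only the difference between the two coefficients affects
the probabilities.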
Estimation results

Parameter  Parameter             Parameter  Robust          Robust
number     name                  estimate   standard error  t statistic
1          $ASC_{car}$            2.85      1.02             2.80
2          $\beta_{gender}$       0.675     0.329            2.05
3          $\beta_{cost}$        -0.130     0.0265          -4.89
4          $\beta_{time\_car}$   -2.34      0.495           -4.73
5          $\beta_{time\_rail}$  -0.529     0.414           -1.28

Summary statistics
Number of observations = 228
$L(0)$ = -158.038
$L(\hat{\beta})$ = -115.880
$\bar{\rho}^2$ = 0.235

Table 4.6: Estimation results with socioeconomic characteristics
4.4 Airline Itinerary Case
Model Specification with Generic Attributes
Files to use with Biogeme:
Model file: BL_airline_generic.mod
Data file: airline.dat

We assume the choice variable (dependent variable) includes the following
alternatives:
• Option 1: a non-stop flight,
• Option 2: a flight with one stop on the same airline.
In this first model, we assume leg room, fare, total trip time and schedule
delays (early and late) are the factors influencing the choice. We also
assume that the coefficients of travel time variables are generic, i.e.,
they do not vary between alternatives. The deterministic part of the
utilities for this simple model can be expressed as:
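$$\begin{aligned}
V_{1} = {} & \beta_{Fare} \, Opt1\_FARE + \beta_{Total\_TT} \, TripTimeHours\_1 \\
& + \beta_{Legroom} \, Opt1\_Legroom + \beta_{SchedDE} \, Opt1\_SchedDelayEarly \\
& + \beta_{SchedDL} \, Opt1\_SchedDelayLate \\
V_{2} = {} & ASC_{2} + \beta_{Fare} \, Opt2\_FARE + \beta_{Total\_TT} \, TripTimeHours\_2 \\
& + \beta_{Legroom} \, Opt2\_Legroom + \beta_{SchedDE} \, Opt2\_SchedDelayEarly \\
& + \beta_{SchedDL} \, Opt2\_SchedDelayLate
\end{aligned}$$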
where fare is coded as Opt1_FARE and Opt2_FARE in units of $100, in order
to reduce numerical issues; the schedule delay is categorized into early
and late as variables:
• Opt1_SchedDelayEarly,
• Opt1_SchedDelayLate,
• Opt2_SchedDelayEarly and
• Opt2_SchedDelayLate.
The leg room is coded as a continuous variable in inches. These variables
are coded in the [Expressions] section of the model file.
The estimation results are reported in Table 4.7. The results indicate
that, all other things being equal, the first option without a stop is
preferred. All the estimated coefficients are significantly different from
zero. The signs of the time coefficient $\beta_{Total\_TT}$ and the fare
coefficient $\beta_{Fare}$ are negative, as expected, meaning that the
utility of an alternative decreases with an increase in travel time and
fare. The signs of the schedule delay coefficients are both negative,
indicating that people don't like delays. The positive sign of the leg room
coefficient indicates that people like seats with more space.
Estimation results
Parameter  Parameter                     Parameter  Robust          Robust
number     name                          estimate   standard error  t statistic
1          $\text{ASC}_2$                -1.41      0.176           -8.02
2          $\beta_{\text{Fare}}$         -1.83      0.104           -17.65
3          $\beta_{\text{Legroom}}$      0.115      0.0179          6.41
4          $\beta_{\text{SchedDE}}$      -0.111     0.0213          -5.23
5          $\beta_{\text{SchedDL}}$      -0.118     0.0189          -6.25
6          $\beta_{\text{Total\_TT}}$    -0.236     0.0966          -2.44
Summary statistics
Number of observations = 3093
$\mathcal{L}(0) = -2143.904$
$\mathcal{L}(\hat{\beta}) = -1171.504$
$\bar{\rho}^2 = 0.451$
Table 4.7: Estimation results with generic attributes
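The summary statistics can be cross-checked by hand. The sketch below assumes
that the reported $\bar{\rho}^2$ is the adjusted likelihood ratio index
$1 - (\mathcal{L}(\hat{\beta}) - K)/\mathcal{L}(0)$, with $K$ the number of
estimated parameters, and that the log likelihoods are entered with their
negative sign. The function name is illustrative and is not part of Biogeme.

```python
def adjusted_rho_squared(ll_null: float, ll_final: float, n_params: int) -> float:
    """Adjusted likelihood ratio index: 1 - (L(beta_hat) - K) / L(0)."""
    return 1.0 - (ll_final - n_params) / ll_null


# Values from Table 4.7 (generic binary logit, 6 estimated parameters).
rho_bar_sq = adjusted_rho_squared(ll_null=-2143.904, ll_final=-1171.504, n_params=6)
print(round(rho_bar_sq, 3))  # 0.451, as reported in the table
```

Under the same assumption, the formula can be applied to any of the other
estimation tables by changing the three inputs.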
Logit Model with Alternative-Specific Attributes
Files to use with Biogeme:
Model file: BL_airline_specific.mod
Data file: airline.dat
In this second specification we relax the hypothesis of generic coefficients.
To illustrate this idea, two different time coefficients are introduced, one
for each alternative. The corresponding utility functions are reported below:
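$$
\begin{aligned}
V_1 &= \beta_{\text{Fare}}\,\text{Opt1\_FARE} + \beta_{\text{Total\_TT1}}\,\text{TripTimeHours\_1} + \beta_{\text{Legroom}}\,\text{Opt1\_Legroom} \\
&\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Opt2\_FARE} + \beta_{\text{Total\_TT2}}\,\text{TripTimeHours\_2} + \beta_{\text{Legroom}}\,\text{Opt2\_Legroom} \\
&\quad + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate}
\end{aligned}
$$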
The estimation results are reported in Table 4.8. In this case, a separate
time coefficient is estimated for each of the two options. Both signs are
negative, as expected. The absolute value of $\beta_{\text{Total\_TT1}}$ is
larger, meaning that people are more sensitive to time in the case of
non-stop flights. The interpretation of the other parameters remains the same.
Generic vs. Specific Test
The likelihood ratio test can be used to test the generic against the
alternative-specific model specification. The likelihood ratio test statistic
for the null hypothesis of generic attributes is
$$
-2\left(\mathcal{L}(\hat{\beta}_R) - \mathcal{L}(\hat{\beta}_U)\right),
$$
where R and U denote the restricted (generic) and unrestricted
(alternative-specific) models, respectively. It is asymptotically
$\chi^2$-distributed with the number of degrees of freedom equal to the
number of restrictions ($K_U - K_R$), with $K_U$ and $K_R$ the numbers of
estimated coefficients in the unrestricted and restricted models,
respectively. In this case, $K_U - K_R = 7 - 6 = 1$ and
$-2(-1171.504 + 1171.318) = 0.372$. Since the critical value is
$\chi^2_{0.90,1} = 2.71$ at the 90% level of confidence and 0.372 < 2.71,
the null hypothesis of a generic time coefficient cannot be rejected. The
model with alternative-specific time coefficients therefore does not
significantly improve the fit.
Estimation results
Parameter  Parameter                     Parameter  Robust          Robust
number     name                          estimate   standard error  t statistic
1          $\text{ASC}_2$                -1.48      0.205           -7.22
2          $\beta_{\text{Fare}}$         -1.82      0.105           -17.27
3          $\beta_{\text{Legroom}}$      0.115      0.0179          6.41
4          $\beta_{\text{SchedDE}}$      -0.112     0.0214          -5.21
5          $\beta_{\text{SchedDL}}$      -0.118     0.0190          -6.24
6          $\beta_{\text{Total\_TT1}}$   -0.257     0.104           -2.47
7          $\beta_{\text{Total\_TT2}}$   -0.236     0.0967          -2.44
Summary statistics
Number of observations = 3093
$\mathcal{L}(0) = -2143.904$
$\mathcal{L}(\hat{\beta}) = -1171.318$
$\bar{\rho}^2 = 0.450$
Table 4.8: Binary model with alternative-specific attributes
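The arithmetic of this test can be written as a minimal sketch. The function
name and signature are illustrative assumptions; the log likelihoods are the
final values of the generic and alternative-specific binary models, entered
with their negative sign, and the critical value 2.71 is the one quoted in the
text for one degree of freedom at the 90% level.

```python
def likelihood_ratio_test(ll_restricted, ll_unrestricted, critical_value):
    """Return the LR statistic -2(L_R - L_U) and whether H0 is rejected."""
    statistic = -2.0 * (ll_restricted - ll_unrestricted)
    return statistic, statistic > critical_value


# Generic (restricted) vs. alternative-specific (unrestricted) binary model.
stat, reject = likelihood_ratio_test(-1171.504, -1171.318, critical_value=2.71)
print(round(stat, 3), reject)  # 0.372 False
```

The statistic is far below the critical value, so the restriction of a generic
time coefficient is not rejected.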
Inclusion of Socio-Economic Characteristics
Files to use with Biogeme:
Model file: BL_airline_socioec.mod
Data file: airline.dat
The previous two models only include variables that are attributes of the
alternatives. We now introduce a socio-economic characteristic, namely the
gender of the respondent. MALE is a dummy variable equal to 1 if the
respondent is male and 0 if female. Since socio-economic variables do not
vary among the alternatives, and only differences in the utilities matter,
the gender coefficient can be identified in only one of the two utilities:
it enters the utility of alternative 2, and the corresponding coefficient
for alternative 1 is normalized to zero. This normalization is arbitrary, as
we could instead have normalized the coefficient for alternative 2. The
utility functions can now be written as follows:
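$$
\begin{aligned}
V_1 &= \beta_{\text{Fare}}\,\text{Opt1\_FARE} + \beta_{\text{Total\_TT}}\,\text{TripTimeHours\_1} + \beta_{\text{Legroom}}\,\text{Opt1\_Legroom} \\
&\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Male\_Opt2}}\,\text{Male} + \beta_{\text{Fare}}\,\text{Opt2\_FARE} + \beta_{\text{Total\_TT}}\,\text{TripTimeHours\_2} \\
&\quad + \beta_{\text{Legroom}}\,\text{Opt2\_Legroom} + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate}
\end{aligned}
$$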
Estimation results
Parameter  Parameter                     Parameter  Robust          Robust
number     name                          estimate   standard error  t statistic
1          $\text{ASC}_2$                -1.44      0.184           -7.86
2          $\beta_{\text{Fare}}$         -1.83      0.104           -17.66
3          $\beta_{\text{Legroom}}$      0.115      0.0179          6.41
4          $\beta_{\text{Male\_Opt2}}$   0.0620     0.105           0.59
5          $\beta_{\text{SchedDE}}$      -0.111     0.0212          -5.22
6          $\beta_{\text{SchedDL}}$      -0.118     0.0189          -6.26
7          $\beta_{\text{Total\_TT}}$    -0.234     0.0967          -2.42
Summary statistics
Number of observations = 3093
$\mathcal{L}(0) = -2143.904$
$\mathcal{L}(\hat{\beta}) = -1171.329$
$\bar{\rho}^2 = 0.450$
Table 4.9: Binary model with socio-economic characteristics
The estimation results are reported in Table 4.9. The coefficient
$\beta_{\text{Male\_Opt2}}$ is not significantly different from zero
(t statistic of 0.59), so the data give no evidence that men and women
differ in their preferences between the two options. The interpretation of
the other coefficients remains the same as in the previous model
specifications.
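As a quick illustration of how significance is read from the tables, the
sketch below recomputes a t statistic as the ratio of the estimate to its
robust standard error and compares it with 1.96, the two-sided 5% critical
value of the standard normal distribution. The helper name is hypothetical.
Because the inputs are the rounded table values, a recomputed ratio can differ
slightly from the t statistic printed by Biogeme.

```python
def is_significant(estimate, std_error, critical_t=1.96):
    """Return (t statistic, True if |t| exceeds the critical value)."""
    t_stat = estimate / std_error
    return t_stat, abs(t_stat) > critical_t


# Gender dummy and fare coefficient from the model with socio-economic variables.
t_male, sig_male = is_significant(0.0620, 0.105)
t_fare, sig_fare = is_significant(-1.83, 0.104)
print(round(t_male, 2), sig_male)  # 0.59 False
print(sig_fare)                    # True
```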
Chapter 5
Logit
The topic of this case study is the logit model, sometimes called the
Multinomial Logit (MNL). Different specifications are introduced using a
stepwise modeling strategy, which increases the complexity by adding
different variables at each step. The objectives of this case study can be
summarized as follows:
• Specification and estimation of a basic logit model making use of generic
  attributes.
• Specification and estimation of a logit model including
  alternative-specific attributes.
• Introduction of generic vs. specific test techniques (likelihood ratio
  test).
For this case study, you can choose between the Swissmetro, the Residential
Telephone Services and the Airline Itinerary datasets. A detailed description
of each dataset can be found in Appendix A.
Before starting the case study, read the general introduction to the case
studies in Chapter 3. The introduction discusses how to go through the case
study and gives you some guidelines on the model building process.
The examples of model specifications that we have provided can be found in
the following sections: Swissmetro in Section 5.2, Residential Telephone
Services in Section 5.3 and Airline Itinerary in Section 5.4.
5.1 Challenge Question
The Italy mode choice dataset. The data were collected in Cagliari, the
capital of Sardinia, Italy. In 1998, the local rail authority decided to
upgrade the service into a metropolitan-like commuter train service,
increasing the speed, the frequency and the number of stations inside the
corridor. In order to analyze the impact of a potential new train system,
three types of surveys were conducted: a qualitative survey using focus
groups to gain a good understanding of the phenomenon, a revealed preference
(RP) survey describing current trips, and a stated preference (SP) survey to
evaluate the introduction of radical improvements to the existing
alternative.
In this challenge question, we focus on the RP survey. Households were
randomly selected from the telephone directory and each member of the
family over the age of 12 was asked to participate. After testing the
consistency and validity of the data for mode choice modeling (only people
with an actual modal choice among Car, Bus and Train were considered), a
final sample of 318 observations was left for model estimation.
Data description. Please read Appendix A.7 of the workbook for details.
Files to use with Biogeme:
Model file: mnl-RP_Italy_Challenge.mod
Data file: italy.dat
Figure 5.1 gives a suggested Biogeme specification of the model.
Question: Does this model make sense to you? What results do you expect
when you try to estimate this model?
The results estimated by Biogeme are given in Table 5.1. Do they correspond
to your expectations?
[Choice]
ch
[Beta]
// Name Value LowerBound UpperBound status
ASC_car 0 -1000 1000 0
ASC_train 0 -1000 1000 0
B_cost 0 -1000 1000 0
B_Veh_time 0 -1000 1000 0
B_Wal_time 0 -1000 1000 0
B_nb_car 0 -1000 1000 0
[Utilities]
// Id Name Avail linear-in-parameter expression (beta1*x1 + beta2*x2 + ... )
1 TrainRP av1 ASC_train * one + B_Veh_time * tt_t + B_Wal_time * wt_t
+ B_cost * c_t
2 CarRP av2 ASC_car * one + B_Veh_time * tt_c + B_Wal_time * wt_c
+ B_cost * c_c + B_nb_car * nb_car
3 BusRP av3 B_Veh_time * tt_b + B_Wal_time * wt_b + B_cost * c_b
[Model]
$MNL
[Expressions]
one = 1
nb_car = car_lic * 10 * ( ch == 2 )
Figure 5.1: Italy mode choice, logit specification
Logit Model Estimation Results
Variable  Variable     Coefficient  Standard  t-stat.
number    name         estimate     error
1         ASC_car      -48.7        13.2      -3.70
2         ASC_train    -1.30        0.996     -1.31
3         B_Veh_time   -0.101       0.0775    -1.31
4         B_Wal_time   -0.257       0.0516    -4.98
5         B_cost       -4.32        1.78      -2.43
6         B_nb_car     33.3         6.25      5.32
Summary statistics
Number of observations = 318
$\mathcal{L}(0) = -294.215$
$\mathcal{L}(\hat{\beta}) = -22.406$
$\bar{\rho}^2 = 0.903$
Table 5.1: Estimation results for the logit model related to the Italy mode
choice dataset
5.2 Swissmetro Case
Model Specification with Generic Attributes
Files to use with Biogeme:
Model file: MNL_SM_generic.mod
Data file: swissmetro.dat
The dataset consists of survey data collected on the trains between St. Gallen
and Geneva in Switzerland. The idea is to analyze the impact of modal
innovation in transportation, represented by the Swissmetro, against the more
classic types of transport modes. The choice variable consists of three
alternatives: train, Swissmetro and car (for car owners). In this first model
specification, we assume that travel time, cost and headway of public
transportation modes influence the utility functions. We also assume that the
coefficients of the explanatory variables are generic, that is, they do not
vary over the alternatives. The corresponding expressions of the utilities are
defined as follows:
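$$
\begin{aligned}
V_{\text{car}} &= \text{ASC}_{\text{car}} + \beta_{\text{time}}\,\text{CAR\_TT} + \beta_{\text{cost}}\,\text{CAR\_CO} \\
V_{\text{train}} &= \beta_{\text{time}}\,\text{TRAIN\_TT} + \beta_{\text{cost}}\,\text{TRAIN\_COST} + \beta_{\text{he}}\,\text{TRAIN\_HE} \\
V_{\text{SM}} &= \text{ASC}_{\text{SM}} + \beta_{\text{time}}\,\text{SM\_TT} + \beta_{\text{cost}}\,\text{SM\_COST} + \beta_{\text{he}}\,\text{SM\_HE}
\end{aligned}
$$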
where CAR_TT is the car travel time, CAR_CO is the car cost, TRAIN_TT
is the train travel time, TRAIN_COST is the train cost (considering the
ownership of the Swiss annual season ticket, GA), TRAIN_HE is the train
headway (in minutes), SM_TT is the Swissmetro travel time, SM_COST is the
Swissmetro cost (considering the ownership of GA), and SM_HE is the
Swissmetro headway.
The estimation results are shown in Table 5.2. For estimation purposes,
we have normalized the alternative specific constant of train to zero. The
estimated values of the alternative specific constants
$\text{ASC}_{\text{car}}$ and $\text{ASC}_{\text{SM}}$ show that, all else
being equal, car and Swissmetro are preferred to train. Moreover, the higher
value of $\text{ASC}_{\text{SM}}$ shows a greater preference for Swissmetro
compared to car. As expected, both the travel time and cost coefficients have
negative signs. The higher the travel time or the cost of an alternative, the
lower the related utility. The negative estimate of the headway coefficient
$\beta_{\text{he}}$ indicates that the higher the headway, the lower the
frequency of service, and thus the lower the utility.
Logit model with generic attributes
Parameter  Parameter                   Parameter  Robust          Robust
number     name                        estimate   standard error  t statistic
1          $\text{ASC}_{\text{car}}$   0.189      0.0798          2.37
2          $\text{ASC}_{\text{SM}}$    0.451      0.0932          4.84
3          $\beta_{\text{cost}}$       -0.0108    0.000682        -15.90
4          $\beta_{\text{he}}$         -0.00535   0.000983        -5.45
5          $\beta_{\text{time}}$       -0.0128    0.00104         -12.23
Summary statistics
Number of observations = 6768
$\mathcal{L}(0) = -6964.663$
$\mathcal{L}(\hat{\beta}) = -5315.386$
$\bar{\rho}^2 = 0.236$
Table 5.2: Logit model with generic attributes
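To make the link between the estimated utilities and the predicted choices
concrete, the sketch below evaluates the three utilities with the generic
coefficients reported above and converts them into logit choice
probabilities. The trip attributes are hypothetical values chosen for
illustration (travel times and headways assumed to be in minutes, costs in
CHF); they are not taken from swissmetro.dat, and the function names are not
part of Biogeme.

```python
import math

# Coefficients of the generic logit model (Table 5.2).
BETA = {"asc_car": 0.189, "asc_sm": 0.451,
        "cost": -0.0108, "he": -0.00535, "time": -0.0128}


def utilities(x):
    """Deterministic utilities of car, train and Swissmetro for one trip."""
    v_car = BETA["asc_car"] + BETA["time"] * x["CAR_TT"] + BETA["cost"] * x["CAR_CO"]
    v_train = (BETA["time"] * x["TRAIN_TT"] + BETA["cost"] * x["TRAIN_COST"]
               + BETA["he"] * x["TRAIN_HE"])
    v_sm = (BETA["asc_sm"] + BETA["time"] * x["SM_TT"] + BETA["cost"] * x["SM_COST"]
            + BETA["he"] * x["SM_HE"])
    return {"car": v_car, "train": v_train, "SM": v_sm}


def logit_probabilities(v):
    """P(i) = exp(V_i) / sum_j exp(V_j), shifted by max(V) for numerical stability."""
    v_max = max(v.values())
    exp_v = {alt: math.exp(val - v_max) for alt, val in v.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}


# Hypothetical trip attributes (illustration only).
trip = {"CAR_TT": 120, "CAR_CO": 60,
        "TRAIN_TT": 150, "TRAIN_COST": 50, "TRAIN_HE": 60,
        "SM_TT": 70, "SM_COST": 70, "SM_HE": 20}
probs = logit_probabilities(utilities(trip))
print({alt: round(p, 3) for alt, p in probs.items()})
```

For this hypothetical trip the Swissmetro has the highest utility and hence
the highest predicted probability, followed by car and train.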
Model Specification with Alternative-Specific Attributes
Files to use with Biogeme:
Model file: MNL_SM_specific.mod
Data file: swissmetro.dat
In this second model, we relax the hypothesis of generic coefficients. To
illustrate this idea, we use three different cost coefficients, one for each
alternative. The corresponding utility functions are
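$$
\begin{aligned}
V_{\text{car}} &= \text{ASC}_{\text{car}} + \beta_{\text{time}}\,\text{CAR\_TT} + \beta_{\text{car\_cost}}\,\text{CAR\_CO} \\
V_{\text{train}} &= \beta_{\text{time}}\,\text{TRAIN\_TT} + \beta_{\text{train\_cost}}\,\text{TRAIN\_COST} + \beta_{\text{he}}\,\text{TRAIN\_HE} \\
V_{\text{SM}} &= \text{ASC}_{\text{SM}} + \beta_{\text{time}}\,\text{SM\_TT} + \beta_{\text{SM\_cost}}\,\text{SM\_COST} + \beta_{\text{he}}\,\text{SM\_HE}.
\end{aligned}
$$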
Logit model with alternative-specific travel cost
Parameter  Parameter                       Parameter  Robust          Robust
number     name                            estimate   standard error  t statistic
1          $\text{ASC}_{\text{car}}$       -0.971     0.134           -7.22
2          $\text{ASC}_{\text{SM}}$        -0.444     0.102           -4.34
3          $\beta_{\text{car\_cost}}$      -0.00949   0.00116         -8.21
4          $\beta_{\text{he}}$             -0.00542   0.00101         -5.36
5          $\beta_{\text{SM\_cost}}$       -0.0109    0.000703        -15.49
6          $\beta_{\text{time}}$           -0.0111    0.00120         -9.26
7          $\beta_{\text{train\_cost}}$    -0.0293    0.00169         -17.32
Summary statistics
Number of observations = 6768
$\mathcal{L}(0) = -6964.663$
$\mathcal{L}(\hat{\beta}) = -5068.559$
$\bar{\rho}^2 = 0.271$
Table 5.3: Logit model with alternative-specific cost attributes
The estimation results for this model specification are shown in Table 5.3.
All three alternative-specific cost coefficients are significantly different
from zero. The influence of the cost differs across alternatives, with a
larger negative impact on the train alternative than on car and Swissmetro.
In this model, the ASCs are negative, implying a preference, all else being
equal, for the train alternative. These results differ from those of the
previous model, where $\text{ASC}_{\text{car}}$ and $\text{ASC}_{\text{SM}}$
were positive and significant. The larger negative value of
$\text{ASC}_{\text{car}}$ implies that, relative to train, car is perceived
more negatively than Swissmetro. Considering that the deterministic
utilities are very simple, including only three explanatory variables, the
alternative specific constants can capture various effects. Their signs and
magnitudes should therefore be further investigated.
Generic vs. Specific Test
To test whether a coefficient should be generic or alternative-specific, we
use the likelihood ratio test (see pages 28 and 164-167 in Ben-Akiva and
Lerman, 1985). We compare the log likelihood functions of the restricted and
unrestricted models of interest. The restricted model includes a generic
travel cost coefficient over the three alternatives, and the unrestricted
model includes alternative-specific travel cost coefficients. Hence, the null
hypothesis is
$$
H_0: \beta_{\text{car\_cost}} = \beta_{\text{train\_cost}} = \beta_{\text{SM\_cost}}
$$
and the test statistic for the null hypothesis is given by
$$
-2(\mathcal{L}_R - \mathcal{L}_U)
$$
which is asymptotically distributed as $\chi^2$ with
$\mathrm{df} = K_U - K_R$ degrees of freedom, where $K_U$ and $K_R$ are the
numbers of estimated parameters in the unrestricted and restricted models,
respectively. We reject the null hypothesis that the restrictions are true if
$$
-2(\mathcal{L}_R - \mathcal{L}_U) > \chi^2_{((1-\alpha),\mathrm{df})}
$$
where $\alpha$ is the level of significance. In this specific case,
$\mathrm{df} = 7 - 5 = 2$ and using $\alpha = 0.05$ yields
$$
-2(-5315.386 + 5068.559) = 493.654 > 5.991
$$
We can therefore reject the null hypothesis and conclude that the travel cost
coefficient should be alternative-specific.
Model Specification with Socio-Economic Characteristics
Files to use with Biogeme:
Model file: MNL_SM_socioec.mod
Data file: swissmetro.dat
To capture the average of the differences between the individuals in the
sample, we make use of socio-economic characteristics. These types of
variables do not change over the choice set and are individual specific. In
this example, we add two variables to the model: a dummy variable (SENIOR)
for senior people (age above 65) and a dummy variable that captures the
effect of the Swiss annual season ticket for train (GA). A few observations,
where the variable AGE is unknown (coded as 6), are removed from the
estimation. The deterministic utilities are:
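$$
\begin{aligned}
V_{\text{car}} &= \text{ASC}_{\text{car}} + \beta_{\text{time}}\,\text{CAR\_TT} + \beta_{\text{car\_cost}}\,\text{CAR\_CO} + \beta_{\text{senior}}\,\text{SENIOR} \\
V_{\text{train}} &= \beta_{\text{time}}\,\text{TRAIN\_TT} + \beta_{\text{train\_cost}}\,\text{TRAIN\_COST} + \beta_{\text{he}}\,\text{TRAIN\_HE} + \beta_{\text{ga}}\,\text{GA} \\
V_{\text{SM}} &= \text{ASC}_{\text{SM}} + \beta_{\text{time}}\,\text{SM\_TT} + \beta_{\text{SM\_cost}}\,\text{SM\_COST} + \beta_{\text{he}}\,\text{SM\_HE} \\
&\quad + \beta_{\text{senior}}\,\text{SENIOR} + \beta_{\text{ga}}\,\text{GA}
\end{aligned}
$$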
The estimation results for this model are shown in Table 5.4. The
coefficients of the socio-economic variables are significantly different
from zero at a 95% confidence level. The negative sign of the age
coefficient (referring to the SENIOR dummy variable) reflects a preference
of older individuals for the train alternative. This seems a reasonable
conclusion, probably dictated by safety concerns with respect to the car and
a kind of inertia with respect to the modal innovation represented by the
Swissmetro alternative. The coefficient related to the ownership of the
Swiss annual season ticket (GA) is positive, as expected. It reflects a
preference for the SM and train alternatives with respect to car, given that
the traveler possesses a season ticket. Finally, the signs of the alternative
specific constants are the same as in the previous model specification,
although $\text{ASC}_{\text{SM}}$ is no longer significantly different from
zero (t statistic of -1.26).
Logit model with socio-economic variables
Parameter  Parameter                       Parameter  Robust          Robust
number     name                            estimate   standard error  t statistic
1          $\text{ASC}_{\text{car}}$       -0.608     0.143           -4.24
2          $\text{ASC}_{\text{SM}}$        -0.135     0.106           -1.26
3          $\beta_{\text{car\_cost}}$      -0.00936   0.00117         -8.02
4          $\beta_{\text{he}}$             -0.00586   0.00106         -5.55
5          $\beta_{\text{SM\_cost}}$       -0.0104    0.000744        -14.02
6          $\beta_{\text{time}}$           -0.0111    0.00121         -9.20
7          $\beta_{\text{train\_cost}}$    -0.0268    0.00176         -15.24
8          $\beta_{\text{senior}}$         -1.88      0.109           -17.31
9          $\beta_{\text{ga}}$             0.557      0.191           2.91
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -4927.167$
$\bar{\rho}^2 = 0.291$
Table 5.4: Logit model with socio-economic variables
5.3 Choice of Residential Telephone Services Case
Model Specification with Generic Attributes
Files to use with Biogeme:
Model file: MNL_Tel_generic.mod
Data file: telephone.dat
In this example, we model the household's choice of service option for local
telephone services. The choice variable (dependent variable) includes the
following alternatives: budget measured (BM), standard measured (SM), local
flat (LF), extended flat (EF) and metro flat (MF). In this first model, we
assume that the cost of the calling plan is the only factor influencing the
choice of the calling plan. We also assume that the coefficients of the
explanatory variables are generic, i.e., they do not vary among the
alternatives. The expressions of the utilities for this simple model can be
written as:
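$$
\begin{aligned}
V_{\text{BM}} &= \text{ASC}_{\text{BM}} + \beta_{\text{cost}}\,\ln(\text{cost\_BM}) \\
V_{\text{SM}} &= \beta_{\text{cost}}\,\ln(\text{cost\_SM}) \\
V_{\text{LF}} &= \text{ASC}_{\text{LF}} + \beta_{\text{cost}}\,\ln(\text{cost\_LF}) \\
V_{\text{EF}} &= \text{ASC}_{\text{EF}} + \beta_{\text{cost}}\,\ln(\text{cost\_EF}) \\
V_{\text{MF}} &= \text{ASC}_{\text{MF}} + \beta_{\text{cost}}\,\ln(\text{cost\_MF}).
\end{aligned}
$$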
Here we have included the natural logarithm of the cost in order to better
capture differences in cost among alternatives.
The estimation results are shown in Table 5.5. The results indicate that,
all else being equal, the budget measured (BM) alternative is the least
desired alternative and the metro area flat (MF) is the most preferred
alternative. The alternative specific constant for the extended flat (EF)
alternative is not significantly different from zero, as shown by the related
t statistic. The sign of the cost coefficient is negative, as expected,
meaning that the utility of an alternative decreases as its cost increases.
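One consequence of entering the cost in logarithmic form is that utility
responds to proportional rather than absolute cost changes. The sketch below
illustrates this with the generic cost coefficient of -2.03 from Table 5.5;
the cost levels are hypothetical and the helper name is an assumption made
for illustration.

```python
import math

BETA_COST = -2.03  # generic cost coefficient, Table 5.5


def cost_utility(cost):
    """Cost term of the deterministic utility: beta_cost * ln(cost)."""
    return BETA_COST * math.log(cost)


# Doubling a hypothetical cost from 5 to 10 or from 20 to 40 changes the
# utility by the same amount, beta_cost * ln(2), whatever the starting level.
delta_low = cost_utility(10.0) - cost_utility(5.0)
delta_high = cost_utility(40.0) - cost_utility(20.0)
print(round(delta_low, 3), round(delta_high, 3))  # -1.407 -1.407
```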
Model Specification with Alternative-Specific Attributes
Files to use with Biogeme:
Model file: MNL_Tel_specific.mod
Data file: telephone.dat
In this second specification, we relax the hypothesis of generic coefficients.
To illustrate this idea, two different cost coefficients are introduced, one
for the flat alternatives and the other for the measured alternatives. The
corresponding utility functions are shown below:
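$$
\begin{aligned}
V_{\text{BM}} &= \text{ASC}_{\text{BM}} + \beta_{\text{M\_cost}}\,\ln(\text{cost\_BM}) \\
V_{\text{SM}} &= \beta_{\text{M\_cost}}\,\ln(\text{cost\_SM}) \\
V_{\text{LF}} &= \text{ASC}_{\text{LF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_LF}) \\
V_{\text{EF}} &= \text{ASC}_{\text{EF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_EF}) \\
V_{\text{MF}} &= \text{ASC}_{\text{MF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_MF})
\end{aligned}
$$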
The estimation results are shown in Table 5.6. In this case, separate cost
coefficients are estimated for the flat and the measured alternatives. Both
signs are negative, as expected, and the larger absolute value of
$\beta_{\text{M\_cost}}$ indicates that people are more sensitive to cost in
the case of measured alternatives. The value and the sign of the budget
measured alternative specific constant still indicate that this option is
the least desired, all else being equal. The ASCs of the flat options are not
significantly different from zero.
Generic vs. Specific Test
The likelihood ratio test (see pages 28 and 164-167 in Ben-Akiva and
Lerman, 1985) can be used to test a generic versus an alternative-specific
model specification. The likelihood ratio test statistic for the null
hypothesis of generic attributes is
$$
-2\left(\mathcal{L}(\hat{\beta}_R) - \mathcal{L}(\hat{\beta}_U)\right)
$$
where R and U denote the restricted (generic) and unrestricted
(alternative-specific) models, respectively. It is asymptotically $\chi^2$
distributed with the number of degrees of freedom equal to the number of
restrictions ($K_U - K_R$), where $K_U$ and $K_R$ are the numbers of
estimated coefficients in the unrestricted and restricted models,
respectively. In this case, $K_U - K_R = 6 - 5 = 1$ and
$-2(-477.557 + 476.608) = 1.898$. Since $\chi^2_{0.95,1} = 3.841$ at a 95%
level of confidence and 1.898 < 3.841, we can conclude that the null
hypothesis of a generic cost coefficient cannot be rejected. The restricted
model should therefore be preferred.
Model Specification with Socio-Economic Characteristics
Files to use with Biogeme:
Model file: MNL_Tel_socioec.mod
Data file: telephone.dat
The previous two models only include variables that are attributes of the
alternatives. We now introduce a socio-economic characteristic, namely the
number of users in the household (users), in the utility of the flat options.
It should be noted that the socio-economic variables do not vary among the
alternatives and are individual specific. The utility functions can now be
written as follows:
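$$
\begin{aligned}
V_{\text{BM}} &= \text{ASC}_{\text{BM}} + \beta_{\text{M\_cost}}\,\ln(\text{cost\_BM}) \\
V_{\text{SM}} &= \beta_{\text{M\_cost}}\,\ln(\text{cost\_SM}) \\
V_{\text{LF}} &= \text{ASC}_{\text{LF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_LF}) + \beta_{\text{users}}\,\text{users} \\
V_{\text{EF}} &= \text{ASC}_{\text{EF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_EF}) + \beta_{\text{users}}\,\text{users} \\
V_{\text{MF}} &= \text{ASC}_{\text{MF}} + \beta_{\text{F\_cost}}\,\ln(\text{cost\_MF}) + \beta_{\text{users}}\,\text{users}
\end{aligned}
$$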
The estimation results are shown in Table 5.7. The coefficient of the users
variable is significantly different from zero and positive, indicating that
households have a stronger preference for the flat options when the number
of users is higher (as expected). The interpretation of the other
coefficients remains the same as in the previous model specifications.
Logit model with generic attributes
Parameter  Parameter                  Parameter  Robust          Robust
number     name                       estimate   standard error  t statistic
1          $\text{ASC}_{\text{BM}}$   -0.721     0.152           -4.76
2          $\text{ASC}_{\text{LF}}$   1.20       0.159           7.56
3          $\text{ASC}_{\text{EF}}$   1.00       0.703           1.42
4          $\text{ASC}_{\text{MF}}$   1.74       0.267           6.51
5          $\beta_{\text{cost}}$      -2.03      0.212           -9.55
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -477.557$
$\bar{\rho}^2 = 0.139$
Table 5.5: Logit model with generic attributes
Logit model with alternative-specific attributes
Parameter  Parameter                    Parameter  Robust          Robust
number     name                         estimate   standard error  t statistic
1          $\text{ASC}_{\text{BM}}$     -0.747     0.155           -4.82
2          $\text{ASC}_{\text{LF}}$     0.155      0.691           0.22
3          $\text{ASC}_{\text{EF}}$     -0.0920    1.00            -0.09
4          $\text{ASC}_{\text{MF}}$     0.479      0.817           0.59
5          $\beta_{\text{M\_cost}}$     -2.16      0.243           -8.90
6          $\beta_{\text{F\_cost}}$     -1.71      0.273           -6.25
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -476.608$
$\bar{\rho}^2 = 0.139$
Table 5.6: Logit model with alternative-specific attributes
Logit model with socio-economic characteristics
Parameter  Parameter                    Parameter  Robust          Robust
number     name                         estimate   standard error  t statistic
1          $\text{ASC}_{\text{BM}}$     -0.731     0.153           -4.77
2          $\text{ASC}_{\text{LF}}$     -0.0871    0.700           -0.12
3          $\text{ASC}_{\text{EF}}$     -0.319     1.02            -0.31
4          $\text{ASC}_{\text{MF}}$     0.274      0.830           0.33
5          $\beta_{\text{users}}$       0.394      0.108           3.63
6          $\beta_{\text{M\_cost}}$     -1.96      0.246           -7.96
7          $\beta_{\text{F\_cost}}$     -1.79      0.286           -6.25
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -468.791$
$\bar{\rho}^2 = 0.151$
Table 5.7: Logit model with socio-economic characteristics
5.4 Airline Itinerary Case
Logit Model with Generic Attributes
Files to use with Biogeme:
Model file: MNL_airline_generic.mod
Data file: airline.dat
The choice set consists of the following three alternatives:
1. a non-stop flight,
2. a flight with one stop on the same airline,
3. a flight with one stop and a change of airline.
We define the deterministic part of the utility by including the alternative
specific constants (ASCs) and five attributes, namely fare (in units of $100,
in order to reduce numerical issues), legroom, total travel time (Total_TT),
and early and late schedule delays (SchedDE and SchedDL), with their
respective generic coefficients $\beta_{\text{Fare}}$,
$\beta_{\text{Legroom}}$, $\beta_{\text{Total\_TT}}$,
$\beta_{\text{SchedDE}}$ and $\beta_{\text{SchedDL}}$:
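$$
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_1 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_1 + \beta_{\text{SchedDL}}\,\text{SchedDL}_1 \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_2 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_2 + \beta_{\text{SchedDL}}\,\text{SchedDL}_2 \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_3 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_3 + \beta_{\text{SchedDL}}\,\text{SchedDL}_3
\end{aligned}
$$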
One of the alternative specific constants (arbitrarily $\text{ASC}_1$) is
normalized to zero for identification. The corresponding alternative is the
reference alternative for the ASCs. This is important for the interpretation
in the next paragraphs.
Given our specification, and everything else being equal, an ASC with a
negative sign indicates a lower utility level for the corresponding
alternative compared to the normalized one (i.e., the first one). As can be
observed in Table 5.8, this is the case for both other alternatives
($\text{ASC}_2$ and $\text{ASC}_3$ are negative and statistically
significant). It means that alternative 1 is preferred to alternatives 2 and
3, i.e., the non-stop alternative is preferred to the alternatives with
stops, all other things being equal.
Generic logit model estimation
Parameter  Parameter                     Parameter  Robust          Robust
number     name                          estimate   standard error  t statistic
1          $\text{ASC}_2$                -1.26      0.126           -9.95
2          $\text{ASC}_3$                -1.49      0.127           -11.72
3          $\beta_{\text{Fare}}$         -0.0194    0.000795        -24.37
4          $\beta_{\text{Legroom}}$      0.222      0.0266          8.35
5          $\beta_{\text{SchedDE}}$      -0.130     0.0161          -8.08
6          $\beta_{\text{SchedDL}}$      -0.0883    0.0145          -6.10
7          $\beta_{\text{Total\_TT}}$    -0.326     0.0671          -4.85
. . .
Summary statistics
Number of observations = 3609
$\mathcal{L}(0) = -3964.892$
$\mathcal{L}(\hat{\beta}) = -2333.701$
$\bar{\rho}^2 = 0.410$
Table 5.8: Logit model with generic attributes
The parameter related to leg room has a positive sign and is significantly
different from zero. It implies that more leg room increases the utility of
the alternative. For the other parameters (fare, delays and travel time), the
sign is negative. It means that all these factors have a negative impact on
utility: they make the alternative less likely to be chosen.
Logit Model with Alternative-Specific Coefficients
Files to use with Biogeme:
Model file: MNL_airline_specific.mod
Data file: airline.dat
Next we present an (unrestricted) model with alternative-specific travel time
coefficients and compare it with the (restricted) model with generic
coefficients presented in the previous section. We carry out a statistical
test (likelihood ratio test) to assess whether one specification is
significantly better than the other. We perform the analysis on the
coefficient of the travel time. The deterministic utilities for this model
with alternative-specific travel time coefficients are:
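$$
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT\_1}}\,\text{Total\_TT}_1 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_1 + \beta_{\text{SchedDL}}\,\text{SchedDL}_1 \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT\_2}}\,\text{Total\_TT}_2 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_2 + \beta_{\text{SchedDL}}\,\text{SchedDL}_2 \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT\_3}}\,\text{Total\_TT}_3 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_3 + \beta_{\text{SchedDL}}\,\text{SchedDL}_3
\end{aligned}
$$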
Note that instead of only $\beta_{\text{Total\_TT}}$, we now have
$\beta_{\text{Total\_TT\_1}}$, $\beta_{\text{Total\_TT\_2}}$ and
$\beta_{\text{Total\_TT\_3}}$.
The results for the unrestricted model are reported in Table 5.9.
Generic vs. Specific Test
Under the null hypothesis, the travel time coefficient is generic:
$$
H_0: \beta_{\text{Total\_TT\_1}} = \beta_{\text{Total\_TT\_2}} = \beta_{\text{Total\_TT\_3}}
$$
We reject the null hypothesis (generic travel time coefficient) if
$$
-2(\mathcal{L}_R - \mathcal{L}_U) > \chi^2_{((1-\alpha),\mathrm{df})}
$$
Next we describe the standard steps to perform the test:
1. $\mathcal{L}_R$ and $\mathcal{L}_U$ are the final log likelihoods of the
   restricted and the unrestricted models, respectively:
   $\mathcal{L}_R = -2333.701$
   $\mathcal{L}_U = -2320.447$
2. The number of degrees of freedom is given by the difference in the number
   of estimated parameters between the models:
   $\mathrm{df} = K_U - K_R = 9 - 7 = 2$
Logit model with alternative-specific travel time coefficients
Parameter  Parameter                        Parameter  Robust          Robust
number     name                             estimate   standard error  t statistic
1          $\text{ASC}_2$                   -1.43      0.183           -7.81
2          $\text{ASC}_3$                   -1.64      0.192           -8.53
3          $\beta_{\text{Fare}}$            -0.0193    0.000802        -24.05
4          $\beta_{\text{Legroom}}$         0.226      0.0267          8.45
5          $\beta_{\text{SchedDE}}$         -0.139     0.0163          -8.53
6          $\beta_{\text{SchedDL}}$         -0.104     0.0137          -7.59
7          $\beta_{\text{Total\_TT\_1}}$    -0.332     0.0735          -4.52
8          $\beta_{\text{Total\_TT\_2}}$    -0.299     0.0696          -4.29
9          $\beta_{\text{Total\_TT\_3}}$    -0.302     0.0699          -4.31
. . .
Summary statistics
Number of observations = 3609
$\mathcal{L}(0) = -3964.892$
$\mathcal{L}(\hat{\beta}) = -2320.447$
$\bar{\rho}^2 = 0.412$
Table 5.9: Logit model with alternative-specific travel-time attributes
3. $-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-2333.701 + 2320.447) = 26.508$
4. The critical value $\chi^2_{(0.95,2)}$ is 5.991.
5. Since 26.508 > 5.991, we can reject the null hypothesis $H_0$ of a generic
   coefficient in favor of alternative-specific coefficients.
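For two degrees of freedom the $\chi^2$ distribution has a closed-form
survival function, $P(X > x) = e^{-x/2}$, so both the critical value and the
p-value of this test can be checked without statistical tables. The sketch
below uses only the Python standard library; the function names are
illustrative, and the closed form applies to two degrees of freedom only.

```python
import math


def chi2_sf_2df(x):
    """Survival function P(X > x) of a chi-square variable with 2 df."""
    return math.exp(-x / 2.0)


def chi2_critical_2df(confidence):
    """Critical value c such that P(X <= c) = confidence, for 2 df."""
    return -2.0 * math.log(1.0 - confidence)


# Final log likelihoods of the generic and alternative-specific models.
lr_stat = -2.0 * (-2333.701 - (-2320.447))
critical = chi2_critical_2df(0.95)
p_value = chi2_sf_2df(lr_stat)
print(round(lr_stat, 3), round(critical, 3), lr_stat > critical)  # 26.508 5.991 True
```

The p-value is far below 0.05, which agrees with the rejection of the generic
specification.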
Inclusion of Socio-Economic Characteristics
Files to use with Biogeme:
Model file: MNL_airline_socioecon.mod
Data file: airline.dat
It is reasonable to assume that people make choices not only in relation
to the attributes that characterize the alternatives but also depending on
some personal characteristics or socioeconomic indicators. The availability
of individual-specific information gives us the opportunity to model part of
the heterogeneity present in the population. We modify the previous model
by adding the income of the respondents to the utilities.
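$$
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT\_1}}\,\text{Total\_TT}_1 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_1 + \beta_{\text{SchedDL}}\,\text{SchedDL}_1 + \beta_{\text{Inc}_1}\,\text{Income} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT\_2}}\,\text{Total\_TT}_2 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_2 + \beta_{\text{SchedDL}}\,\text{SchedDL}_2 + \beta_{\text{Inc}_2}\,\text{Income} \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT\_3}}\,\text{Total\_TT}_3 \\
&\quad + \beta_{\text{SchedDE}}\,\text{SchedDE}_3 + \beta_{\text{SchedDL}}\,\text{SchedDL}_3 + \beta_{\text{Inc}_3}\,\text{Income}
\end{aligned}
$$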
Since the income variable does not vary between the alternatives and only
differences in utilities matter, we need to normalize the income coefficient
of one alternative to zero. We interpret the estimated coefficients of the
remaining alternatives with respect to the reference alternative, which
arbitrarily is alternative 1. This is similar to what we did when specifying
the alternative specific constants. We assume that the income of the
respondent affects each alternative differently.
The estimation results of this model are reported in Table 5.10.
Logit model with socio-economic variables
Parameter  Parameter                        Parameter  Robust          Robust
number     name                             estimate   standard error  t statistic
1          $\text{ASC}_2$                   -1.07      0.215           -4.96
2          $\text{ASC}_3$                   -1.05      0.228           -4.61
3          $\beta_{\text{Fare}}$            -0.0195    0.000807        -24.18
4          $\beta_{\text{Income}_2}$        -0.0419    0.0148          -2.83
5          $\beta_{\text{Income}_3}$        -0.0755    0.0154          -4.90
6          $\beta_{\text{Legroom}}$         0.227      0.0268          8.49
7          $\beta_{\text{MI}}$              -0.578     0.159           -3.64
8          $\beta_{\text{SchedDE}}$         -0.139     0.0163          -8.50
9          $\beta_{\text{SchedDL}}$         -0.104     0.0139          -7.49
10         $\beta_{\text{Total\_TT\_1}}$    -0.335     0.0735          -4.56
11         $\beta_{\text{Total\_TT\_2}}$    -0.301     0.0696          -4.32
12         $\beta_{\text{Total\_TT\_3}}$    -0.304     0.0698          -4.36
. . .
Summary statistics
Number of observations = 3609
$\mathcal{L}(0) = -3964.892$
$\mathcal{L}(\hat{\beta}) = -2307.488$
$\bar{\rho}^2 = 0.415$
Table 5.10: Logit model with socio-economic variables
Therefore we have specified two different parameters associated with the
income attribute; $\beta_{\text{Inc}}$ for alternative 1 has been normalized
to zero. The two parameter estimates have negative signs, implying that the
higher the income of the respondent, the lower the likelihood of choosing
these two alternatives (with stops) compared to the first one (without
stops).
In this model, we need to deal with missing data for income. We defined
Income as the income variable without the values -1 and 99. The [Exclude]
section tells Biogeme not to consider some observations. One solution would
be to exclude the observations with missing income (-1 and 99) from the
whole data set.
A second and better solution consists in defining another variable, called
MissingIncome (MI). MissingIncome is equal to 1 if the income variable
is -1 or 99. We no longer exclude any observation and the [Exclude]
section is not changed, but we add this new variable to the utility function.
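The coding of the two variables can be sketched as follows. This is an
illustration in Python, not the content of the Biogeme model file: the text
only states that Income excludes the values -1 and 99 and that MissingIncome
flags them, so setting Income to 0 for those observations is an assumption
made here, and the sample values are hypothetical.

```python
MISSING_CODES = {-1, 99}  # income codes treated as missing in the text


def code_income(raw_income):
    """Return (Income, MissingIncome) for one observation."""
    if raw_income in MISSING_CODES:
        # Assumption: Income contributes nothing when it is missing, and the
        # MissingIncome dummy absorbs the average effect of those respondents.
        return 0, 1
    return raw_income, 0


raw = [3, -1, 7, 99, 5]  # hypothetical income codes, not taken from airline.dat
coded = [code_income(value) for value in raw]
print(coded)  # [(3, 0), (0, 1), (7, 0), (0, 1), (5, 0)]
```

With this coding every observation is kept in the sample.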
Chapter 6
Specification Testing
The topic of this case study is the testing of different hypotheses regarding
both model specifications and structures. The objectives can be summarized
as follows:
• Illustration of the market segmentation concept and related testing.
• Explanation of the McFadden IIA test to test the assumption of independence between alternatives.
• Testing of non-nested hypotheses using the Cox test.
• Testing of non-linear specifications using the piecewise linear approximation, the power series expansion and the Box-Cox transformation methods.
For this case study, you can choose between the Swissmetro, the Residential
Telephone Services and the Airline Itinerary datasets. A detailed description
of each dataset can be found in Appendix A.
Before starting the case study, read the general introduction to the case
studies on page 17. The introduction discusses how to go through the case
study and gives you some guidelines on the model building process.
The examples of model specifications that we have provided can be found
in the following sections: Swissmetro in section 6.1, Residential Telephone
Services in section 6.2 and Airline Itinerary in section 6.3.
6.1 Swissmetro Case
Market Segmentation
Files to use with Biogeme:
Model files: SpecTest_SM_male.mod,
SpecTest_SM_female.mod,
SpecTest_SM_full.mod
Data file: swissmetro.dat
In this example, the segmentation is made on the gender variable. We first
create two market segments as follows:
• Male: all observations where MALE=1 belong to this subgroup.
• Female: all observations where MALE=0 belong to this subgroup.
Following the procedure described in Ben-Akiva and Lerman (1985) (pages
194-204), we estimate a model on the full data set. Then we run the same
model for each gender group separately. Note that we make use of the
[Exclude] section in the model specification file to define which observations
should be excluded from the estimation. We obtain the values shown
in Table 6.1. The expressions of the utility functions are the same for all
models. Note that we define the dummy variable SENIOR which takes the
value 1 for individuals with age above 65 and 0 otherwise.
\begin{align*}
V_{car} &= ASC_{car} + \beta_{time}\, CAR\_TT + \beta_{car\_cost}\, CAR\_CO + \beta_{senior}\, SENIOR \\
V_{train} &= \beta_{time}\, TRAIN\_TT + \beta_{train\_cost}\, TRAIN\_COST + \beta_{he}\, TRAIN\_HE + \beta_{ga}\, GA \\
V_{SM} &= ASC_{SM} + \beta_{time}\, SM\_TT + \beta_{SM\_cost}\, SM\_COST + \beta_{he}\, SM\_HE + \beta_{senior}\, SENIOR + \beta_{ga}\, GA
\end{align*}
The null hypothesis is of no taste variation across the market segments:
\[ H_0 : \beta^{Male} = \beta^{Female} \]
Note that in the above equation Male and Female refer to market segments
and not to variables in the dataset.
Model              Log likelihood   Number of coefficients
Male               -3680.002        9
Female             -1110.618        9
Restricted model   -4927.167        9
Table 6.1: Values for the market segmentation test
The likelihood ratio test (with 18 - 9 = 9 degrees of freedom) yields
\begin{align*}
LR &= -2\Big(\mathcal{L}_N(\hat{\beta}) - \sum_{g=1}^{G} \mathcal{L}_{N_g}(\hat{\beta}^g)\Big) \\
   &= -2(-4927.167 + 3680.002 + 1110.618) = 273.094 \\
\chi^2_{0.95,9} &= 16.920
\end{align*}
and we can therefore reject the null hypothesis at a 95% level of confidence.
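As a quick check of the arithmetic, the market segmentation test can be reproduced with a few lines of Python. This is a minimal sketch: the function name `segmentation_lr` is hypothetical, the log-likelihoods are those of Table 6.1, and the critical value is the one quoted in the text rather than computed.

```python
def segmentation_lr(pooled_ll, segment_lls):
    """Likelihood ratio statistic: pooled model against segment models."""
    return -2.0 * (pooled_ll - sum(segment_lls))


# Final log-likelihoods from Table 6.1.
pooled_ll = -4927.167
segment_lls = [-3680.002, -1110.618]  # male, female

# Each segment model has 9 coefficients, and so does the pooled model.
dof = 2 * 9 - 9

CHI2_95_9 = 16.920  # critical value quoted in the text

lr = segmentation_lr(pooled_ll, segment_lls)
reject_h0 = lr > CHI2_95_9

print(f"LR = {lr:.3f}, dof = {dof}, reject H0: {reject_h0}")
```

The statistic is far above the critical value, which is why the hypothesis of equal tastes across genders is rejected.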
McFadden IIA Test
Files to use with Biogeme:
Model files: SpecTest_SM_socioec_bis.mod, SpecTest_SM_IIA.mod
Data files: swissmetro.dat, swissmetro_exclude.dat
Command file: doit.bat
Supplementary software: biomerge.exe
We are studying the impact of the modal innovation, represented by the
Swissmetro, against traditional transport modes represented by car and train.
It would seem logical to expect some kind of relationship between the
traditional alternatives. They are probably correlated, where the source of this
correlation might be the presence of unobserved shared attributes between
the car and train alternatives. In order to test this assumption, we follow
the procedure that is described in McFadden (1987) and Train et al. (1989).
The procedure is semi-automatic in Biogeme. First we estimate a logit model
(SpecTest_SM_socioec_bis.mod) on the full data set swissmetro.dat. The
specification file SpecTest_SM_socioec_bis.mod contains a section describing the
correlation we want to test. The corresponding Biogeme snapshot is shown
in Figure 6.1. Alternative 1 corresponds to train, and 3 to car.
[IIATest]
C13 1 3
Figure 6.1: Biogeme snapshot: IIATest section
Then the estimated model is applied on the same data file, using BioSim. By defining
the section [IIATest] in the original .mod file, auxiliary variables are
automatically computed for each observation, and reported in the .enu output
file. The original .dat file and the .enu file are merged using BIOMERGE
in order to create a new data file. In fact, to do the merging we use
swissmetro_exclude.dat because some observations are excluded in the original
estimation. Now we specify a new model (SpecTest_SM_IIA.mod) which
includes the auxiliary variables in the utility functions associated with train
and car. Finally, we estimate this model on the new data file created by
merging. We show in Table 6.2 the estimation results. Note that the entire
procedure described above can be carried out automatically using the
command file doit.bat.
The focus in this test is not related to the sign of the estimated IIA parameter.
What is important is the value of the t-statistic for such a coefficient.
$\beta_{IIA}$ is significantly different from 0 at a 95% level of confidence. This indicates
that the IIA property does not hold for the car and train alternatives. This
kind of correlation can be captured with GEV models that are treated in one
of the case studies (Chapter 8).
Note that we can also do a likelihood ratio test for the null hypothesis
$H_0 : \beta_{IIA} = 0$. The test statistic for the null hypothesis is given by
\[ -2(\mathcal{L}_R - \mathcal{L}_U) = -2(-5245.512 + 5237.543) = 15.938 \]
where the restricted model is the model without the auxiliary variables
(SpecTest_SM_socioec_bis.mod) and the unrestricted model is the model with
the auxiliary variables. The test statistic is asymptotically $\chi^2$ distributed
with 1 degree of freedom since there is 1 restriction. Since 15.938 > 3.841
(the critical value of the $\chi^2$ distribution with 1 degree of freedom at a 95%
level of confidence), we reject the null hypothesis and conclude that the IIA
property does not hold for the car and train alternatives.
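Both ways of testing the single IIA restriction can be checked numerically. The sketch below is illustrative only: the helper names `t_statistic` and `likelihood_ratio` are hypothetical, the estimate and standard error come from Table 6.2, and the critical values (1.96 and 3.841) are entered by hand.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t-statistic of an estimate against a hypothesized value."""
    return (estimate - null_value) / std_error


def likelihood_ratio(ll_restricted, ll_unrestricted):
    """Likelihood ratio statistic for nested models."""
    return -2.0 * (ll_restricted - ll_unrestricted)


# beta_IIA estimate and robust standard error from Table 6.2.
t_iia = t_statistic(0.301, 0.128)

# Final log-likelihoods without and with the auxiliary variable.
lr_iia = likelihood_ratio(-5245.512, -5237.543)

T_CRIT_95 = 1.96    # two-sided 95% critical value of the normal distribution
CHI2_95_1 = 3.841   # critical value quoted in the text

print(f"t = {t_iia:.2f}, LR = {lr_iia:.3f}")
print("IIA rejected by t-test:", abs(t_iia) > T_CRIT_95)
print("IIA rejected by LR test:", lr_iia > CHI2_95_1)
```

The two tests agree here, as expected for a single restriction in a large sample.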
Logit model for car/train IIA test
Parameter   Parameter                Parameter   Robust           Robust
number      name                     estimate    standard error   t statistic
1           $ASC_{car}$              0.217       0.159            1.37
2           $ASC_{SM}$               0.486       0.129            3.76
3           $\beta_{cost}$           -0.00121    0.000116         -10.40
4           $\beta_{car\_time}$      -0.0103     0.000965         -10.69
5           $\beta_{train\_time}$    -0.0118     0.00116          -10.11
6           $\beta_{SM\_time}$       -0.0112     0.00168          -6.65
7           $\beta_{he}$             -0.00516    0.00111          -4.65
8           $\beta_{ga}$             6.66        0.703            9.48
9           $\beta_{IIA}$            0.301       0.128            2.35
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5237.543$
$\bar{\rho}^2 = 0.246$
Table 6.2: Logit model for IIA test
Test of Non-Nested Hypotheses
Files to use with Biogeme:
Model files: SpecTest_SM_M1.mod, SpecTest_SM_M2.mod,
SpecTest_SM_MC.mod
Data file: swissmetro.dat
In discrete choice analysis, we often perform tests based on the so-called
nested hypotheses, which means that we specify two models such that the first
one (the restricted model) is a special case of the second one (the unrestricted
model). For this type of comparison, the classical likelihood ratio test can
be applied. However, there are situations in which we aim at comparing
models which are not nested, meaning that one model cannot be obtained as
a restricted version of the other. One way to compare two non-nested models
is to build a composite model from which both models can be derived. We
can thus perform two likelihood ratio tests, one for each of the restricted models
against the composite model. This procedure is known as the Cox test of
separate families of hypotheses.
Cox Test
The Cox test is described in detail in Ben-Akiva and Lerman (1985), pages
171-174, and in the Textbook of the course, in section "Tests of Non-Nested
Hypothesis". Assume that we want to test a model $M_1$ against another
model $M_2$ (and one model is not a restricted version of the other). We start
by generating a composite model $M_C$ such that both models $M_1$ and $M_2$ are
restricted cases of $M_C$. We then test $M_1$ against $M_C$ and $M_2$ against $M_C$
using the likelihood ratio test. There are three possible outcomes of this test:
• One of the two models is rejected. Then we keep the one that is not rejected.
• Both models are rejected. Then better models should be developed. The composite model could be used as a new basis for future specifications.
• Both models are accepted. Then we choose the model with the higher $\bar{\rho}^2$ index.
We show next the expressions of the utility functions used for the three
different models $M_1$, $M_2$ and $M_C$. $M_1$ has the following systematic utilities
\begin{align*}
V_{car} &= ASC_{car} + \beta_{car\_time}\, CAR\_TT + \beta_{car\_cost}\, CAR\_CO \\
V_{train} &= \beta_{train\_time}\, TRAIN\_TT + \beta_{train\_cost}\, TRAIN\_CO \\
V_{SM} &= ASC_{SM} + \beta_{SM\_time}\, SM\_TT + \beta_{SM\_cost}\, SM\_CO
\end{align*}
where both the time and cost related coefficients are alternative specific. The
systematic utilities of $M_2$ are
\begin{align*}
V_{car} &= ASC_{car} + \beta_{time}\, CAR\_TT + \beta_{car\_cost}\, CAR\_CO \\
V_{train} &= \beta_{time}\, TRAIN\_TT + \beta_{train\_cost}\, TRAIN\_CO + \beta_{he}\, TRAIN\_HE + \beta_{ga}\, GA \\
V_{SM} &= ASC_{SM} + \beta_{time}\, SM\_TT + \beta_{SM\_cost}\, SM\_CO + \beta_{he}\, SM\_HE + \beta_{ga}\, GA
\end{align*}
where only the cost related coefficient is assumed to be alternative specific,
headway of train and SM has been added, and one socio-economic variable
has been added to the model. We now define the composite model $M_C$ with
the following systematic utilities
\begin{align*}
V_{car} &= ASC_{car} + \beta_{car\_time}\, CAR\_TT + \beta_{car\_cost}\, CAR\_CO \\
V_{train} &= \beta_{train\_time}\, TRAIN\_TT + \beta_{train\_cost}\, TRAIN\_CO + \beta_{he}\, TRAIN\_HE + \beta_{ga}\, GA \\
V_{SM} &= ASC_{SM} + \beta_{SM\_time}\, SM\_TT + \beta_{SM\_cost}\, SM\_CO + \beta_{he}\, SM\_HE + \beta_{ga}\, GA
\end{align*}
In Table 6.3, we summarize the differences between the various models, and
we show in Tables 6.4, 6.5 and 6.6 the estimation results for the $M_1$, $M_2$ and
$M_C$ models, respectively.
At this point, we can apply the likelihood ratio test for $M_1$ against $M_C$. In
this case, the null hypothesis is:
\[ H_0 : \beta_{he} = \beta_{ga} = 0 \]
Models used for the Cox test
Model   Parameters   Description
$M_1$   8            two ASCs, three alternative specific time coefficients and three alternative specific cost coefficients
$M_2$   8            two ASCs, one generic time coefficient, three alternative specific cost coefficients, one generic headway coefficient and one socio-economic coefficient
$M_C$   10           two ASCs, three alternative specific time coefficients, three alternative specific cost coefficients, one generic headway coefficient and one socio-economic coefficient
Table 6.3: Summary of the different model specifications
$M_1$ model: estimation results
Parameter   Parameter                Parameter   Robust           Robust
number      name                     estimate    standard error   t statistic
1           $ASC_{car}$              -0.260      0.138            -1.89
2           $ASC_{SM}$               0.113       0.106            1.06
3           $\beta_{car\_cost}$      -0.00785    0.00149          -5.26
4           $\beta_{train\_cost}$    -0.0308     0.00193          -15.98
5           $\beta_{SM\_cost}$       -0.0113     0.000790         -14.24
6           $\beta_{car\_time}$      -0.0129     0.00163          -7.91
7           $\beta_{train\_time}$    -0.00870    0.00118          -7.34
8           $\beta_{SM\_time}$       -0.0112     0.00178          -6.25
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5065.901$
$\bar{\rho}^2 = 0.271$
Table 6.4: Estimation results for the $M_1$ model
$M_2$ model: estimation results
Parameter   Parameter                Parameter   Robust           Robust
number      name                     estimate    standard error   t statistic
1           $ASC_{car}$              -0.872      0.140            -6.24
2           $ASC_{SM}$               -0.410      0.103            -3.99
3           $\beta_{car\_cost}$      -0.00934    0.00116          -8.02
4           $\beta_{train\_cost}$    -0.0284     0.00176          -16.08
5           $\beta_{SM\_cost}$       -0.0104     0.000743         -13.99
6           $\beta_{time}$           -0.0111     0.00120          -9.22
7           $\beta_{he}$             -0.00533    0.00102          -5.25
8           $\beta_{ga}$             0.521       0.191            2.72
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5055.843$
$\bar{\rho}^2 = 0.272$
Table 6.5: Estimation results for the $M_2$ model
$M_C$ model: estimation results
Parameter   Parameter                Parameter   Robust           Robust
number      name                     estimate    standard error   t statistic
1           $ASC_{car}$              -0.529      0.158            -3.35
2           $ASC_{SM}$               -0.126      0.116            -1.08
3           $\beta_{car\_cost}$      -0.00776    0.00150          -5.18
4           $\beta_{train\_cost}$    -0.0300     0.00200          -14.97
5           $\beta_{SM\_cost}$       -0.0108     0.000828         -12.99
6           $\beta_{car\_time}$      -0.0129     0.00162          -7.94
7           $\beta_{train\_time}$    -0.00866    0.00120          -7.22
8           $\beta_{SM\_time}$       -0.0111     0.00179          -6.19
9           $\beta_{he}$             -0.00535    0.00101          -5.31
10          $\beta_{ga}$             0.513       0.193            2.65
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5047.205$
$\bar{\rho}^2 = 0.273$
Table 6.6: Estimation results for the $M_C$ model
As usual, $-2(\mathcal{L}(M_1) - \mathcal{L}(M_C))$ is $\chi^2$ distributed with K = 2 degrees of
freedom. In this case, we have:
\[ -2(-5065.901 + 5047.205) = 37.392 > 5.991 \]
The result of this first test is that we can reject the null hypothesis. Applying
the same test for $M_2$ against $M_C$, we have
\[ H_0 : \beta_{car\_time} = \beta_{train\_time} = \beta_{SM\_time}. \]
In this case, the likelihood ratio test with K = 2 degrees of freedom gives
\[ -2(-5055.843 + 5047.205) = 17.276 > 5.991 \]
and we can therefore reject the null hypothesis in this case as well. Since
both models are rejected, better models should be developed. If both models
were accepted, we would choose the one with the higher $\bar{\rho}^2$ index.
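The three-outcome decision rule of the Cox test can be written down as a small function. The following is a sketch under stated assumptions: the function name `cox_decision` and the wording of the verdict strings are hypothetical, the log-likelihoods and $\bar{\rho}^2$ values are taken from Tables 6.4 to 6.6, and the critical value 5.991 is the one quoted in the text.

```python
def cox_decision(ll_m1, ll_m2, ll_mc, crit_m1, crit_m2, rho_m1, rho_m2):
    """Test M1 and M2 against the composite model MC and pick an outcome."""
    lr_m1 = -2.0 * (ll_m1 - ll_mc)
    lr_m2 = -2.0 * (ll_m2 - ll_mc)
    reject_m1 = lr_m1 > crit_m1
    reject_m2 = lr_m2 > crit_m2

    if reject_m1 and reject_m2:
        verdict = "both rejected: develop better models"
    elif reject_m1:
        verdict = "keep M2"
    elif reject_m2:
        verdict = "keep M1"
    else:
        # Neither model is rejected: fall back on the adjusted rho-squared.
        verdict = "keep M1" if rho_m1 >= rho_m2 else "keep M2"
    return lr_m1, lr_m2, verdict


# Swissmetro values: both tests have 2 degrees of freedom.
lr_m1, lr_m2, verdict = cox_decision(
    ll_m1=-5065.901, ll_m2=-5055.843, ll_mc=-5047.205,
    crit_m1=5.991, crit_m2=5.991,
    rho_m1=0.271, rho_m2=0.272,
)
print(f"LR(M1) = {lr_m1:.3f}, LR(M2) = {lr_m2:.3f}, outcome: {verdict}")
```

Passing a separate critical value for each test keeps the function usable when the two restricted models impose a different number of restrictions.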
Tests of Non-Linear Specifications
Files to use with Biogeme:
Model files: SpecTest_SM_piecewise.mod,
SpecTest_SM_powerseries.mod,
SpecTest_SM_boxcox.mod
Data file: swissmetro.dat
In the previous case study, the models were specified with linear in parameter
formulations of the deterministic parts of the utilities (i.e. parameters that
remain constant throughout the whole range of the values of each variable).
However, in some cases non-linear specifications may be more justified. In
this section, we test three different non-linear specifications of the
deterministic utility functions (see Ben-Akiva and Lerman, 1985, pages 174-179),
namely the piecewise linear approximation, the power series method and the
Box-Cox transformation.
[Expressions]
TRAIN_TT1 = min( TRAIN_TT , 90)
TRAIN_TT2 = max(0,min( TRAIN_TT - 90, 90))
TRAIN_TT3 = max(0,min( TRAIN_TT - 180 , 90))
TRAIN_TT4 = max(0,TRAIN_TT - 270)
Figure 6.2: Biogeme snapshot concerning the piecewise variables definition
Piecewise Linear Approximation
In this first example, we want to test the hypothesis that the travel time
parameter for the train alternative takes different values over different
ranges of the variable itself. We split the
range of values for travel time t (which is $t \in [35, 1022]$, expressed in
minutes) into four different intervals: $train\_tt1 \in [0, 90]$, $train\_tt2 \in\, ]90, 180]$,
$train\_tt3 \in\, ]180, 270]$ and $train\_tt4 > 270$. We show in Figure 6.2 the
corresponding Biogeme code.
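The same construction can be mirrored outside Biogeme to check how a given travel time is split across the four variables. This is a minimal Python sketch; the function name `piecewise_segments` is hypothetical, and the breakpoints 90, 180 and 270 are those of Figure 6.2.

```python
def piecewise_segments(x, breakpoints):
    """Split x into len(breakpoints) + 1 non-negative pieces that sum to x.

    Piece k holds the part of x that falls between breakpoint k-1 and
    breakpoint k, mirroring the min/max expressions of Figure 6.2.
    """
    pieces = []
    lower = 0.0
    for upper in breakpoints:
        pieces.append(max(0.0, min(x - lower, upper - lower)))
        lower = upper
    # Last piece: everything above the final breakpoint.
    pieces.append(max(0.0, x - lower))
    return pieces


BREAKPOINTS = [90.0, 180.0, 270.0]  # minutes, as in Figure 6.2

short_trip = piecewise_segments(35.0, BREAKPOINTS)
medium_trip = piecewise_segments(200.0, BREAKPOINTS)
long_trip = piecewise_segments(1022.0, BREAKPOINTS)
```

Because the pieces always sum to the original travel time, constraining the four coefficients to be equal gives back the linear specification, which is what makes the likelihood ratio test against the linear model valid.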
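The systematic utility expressions used in this model are
\begin{align*}
V_{car} &= ASC_{car} + \beta_{car\_time}\, CAR\_TT + \beta_{car\_cost}\, CAR\_CO \\
V_{train} &= \beta_{train\_time1}\, TRAIN\_TT1 + \beta_{train\_time2}\, TRAIN\_TT2 + \beta_{train\_time3}\, TRAIN\_TT3 + \beta_{train\_time4}\, TRAIN\_TT4 \\
          &\quad + \beta_{train\_cost}\, TRAIN\_CO + \beta_{he}\, TRAIN\_HE + \beta_{ga}\, GA \\
V_{SM} &= ASC_{SM} + \beta_{SM\_time}\, SM\_TT + \beta_{SM\_cost}\, SM\_CO + \beta_{he}\, SM\_HE + \beta_{ga}\, GA
\end{align*}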
We can see from the estimation results shown in Table 6.7 that all time
coefficients related to the piecewise linear expression are negative. The coefficient
associated with very long trips is the largest in magnitude, meaning that
travel time beyond four and a half hours penalizes the utility of the train
alternative most heavily.
We perform the likelihood ratio test where the restricted model is the one
with linear train travel time (the $M_C$ model from the previous section) and
the unrestricted model is the piecewise linear specification. The null
hypothesis is
\[ H_0 : \beta_{train\_time1} = \beta_{train\_time2} = \beta_{train\_time3} = \beta_{train\_time4} \]
The test yields
\[ -2(-5047.205 + 5041.952) = 10.506 \]
and since $\chi^2_{0.95,3} = 7.815$, we can reject the null hypothesis of a linear train
travel time at a 95% level of confidence.
Piecewise linear model: estimation results
Parameter   Parameter                 Parameter   Robust           Robust
number      name                      estimate    standard error   t statistic
1           $ASC_{car}$               -0.991      0.434            -2.28
2           $ASC_{SM}$                -0.584      0.421            -1.39
3           $\beta_{car\_cost}$       -0.00776    0.00150          -5.18
4           $\beta_{train\_cost}$     -0.0301     0.00204          -14.78
5           $\beta_{SM\_cost}$        -0.0107     0.000828         -12.97
6           $\beta_{car\_time}$       -0.0129     0.00162          -7.94
7           $\beta_{train\_time1}$    -0.0135     0.00508          -2.65
8           $\beta_{train\_time2}$    -0.0109     0.00180          -6.05
9           $\beta_{train\_time3}$    -0.00208    0.00224          -0.93
10          $\beta_{train\_time4}$    -0.0179     0.00551          -3.25
11          $\beta_{SM\_time}$        -0.0112     0.00179          -6.24
12          $\beta_{he}$              -0.00534    0.00101          -5.30
13          $\beta_{ga}$              0.515       0.193            2.67
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5041.952$
$\bar{\rho}^2 = 0.274$
Table 6.7: Estimation results for the piecewise linear model
The Power Series Expansion
We introduce here a power series expansion for the train travel time variable.
In principle, we could add a polynomial expression but here we introduce just
the squared term. The subsequent model specification is practically the same
as the $M_C$ model, with the exception of the train alternative:
\begin{align*}
V_{train} &= \beta_{train\_time}\, TRAIN\_TT + \beta_{train\_time\_sq}\, TRAIN\_TT\_SQ \\
          &\quad + \beta_{train\_cost}\, TRAIN\_CO + \beta_{he}\, TRAIN\_HE + \beta_{ga}\, GA
\end{align*}
The estimation results for this specification are shown in Table 6.8. The
estimated parameter associated with the linear term of the power series
expansion is negative while the estimated parameter associated with the squared
term is positive. However, the cumulative effect of the travel time variable
on the utility is still negative, as can be easily verified by a plot of utility
versus travel time for a reasonable range of rail travel time.
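Instead of a plot, the sign of the cumulative effect can be checked numerically. The sketch below uses the two estimates from Table 6.8 and assumes that the relevant range is the observed one quoted earlier, 35 to 1022 minutes; the function name `time_utility` is hypothetical.

```python
# Estimates of the linear and squared train travel time terms (Table 6.8).
B_TIME = -0.0109
B_TIME_SQ = 0.00000628


def time_utility(tt):
    """Contribution of train travel time (minutes) to the train utility."""
    return B_TIME * tt + B_TIME_SQ * tt ** 2


# Observed range of train travel time, in minutes.
grid = range(35, 1023)
all_negative = all(time_utility(t) < 0 for t in grid)

# Travel time at which the fitted parabola stops decreasing.
turning_point = -B_TIME / (2 * B_TIME_SQ)

print("negative over the whole range:", all_negative)
print(f"turning point at about {turning_point:.0f} minutes")
```

The contribution stays negative over the whole observed range, but the fitted curve flattens and starts to rise again beyond roughly 868 minutes, which is a known limitation of a quadratic term at the upper end of the data.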
We perform the likelihood ratio test where the restricted model is the one
with linear train travel time (the $M_C$ model from the previous section) and
the unrestricted model is the power series expansion specification. The null
hypothesis is
\[ H_0 : \beta_{train\_time\_sq} = 0 \]
The test yields
\[ -2(-5047.205 + 5046.573) = 1.264 \]
and since $\chi^2_{0.95,1} = 3.841$, we cannot reject the null hypothesis of a linear rail
travel time at a 95% level of confidence.
Power series model: estimation results
Parameter   Parameter                   Parameter    Robust           Robust
number      name                        estimate     standard error   t statistic
1           $ASC_{car}$                 -0.693       0.190            -3.65
2           $ASC_{SM}$                  -0.289       0.149            -1.94
3           $\beta_{car\_cost}$         -0.00776     0.00150          -5.18
4           $\beta_{train\_cost}$       -0.0299      0.00201          -14.86
5           $\beta_{SM\_cost}$          -0.0108      0.000828         -12.99
6           $\beta_{car\_time}$         -0.0129      0.00162          -7.95
7           $\beta_{train\_time}$       -0.0109      0.00190          -5.72
8           $\beta_{train\_time\_sq}$   0.00000628   0.00000282       2.23
9           $\beta_{SM\_time}$          -0.0111      0.00178          -6.23
10          $\beta_{he}$                -0.00537     0.00101          -5.31
11          $\beta_{ga}$                0.515        0.194            2.65
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5046.573$
$\bar{\rho}^2 = 0.273$
Table 6.8: Estimation results for the power series model
[GeneralizedUtilities]
1 B_TRAIN_TIME * ( ( ( TRAIN_TT ) ^ LAMBDA - 1 ) / LAMBDA )
Figure 6.3: Biogeme snapshot of Box-Cox transformation
The Box-Cox Transformation
In this section, we analyze the possibility of testing non-linear
transformations of variables that are non-linear in the unknown parameters. One
possible transformation is the Box-Cox, expressed as
\[ \frac{x^{\lambda} - 1}{\lambda}, \quad \text{where } x > 0. \]
We apply this transformation to the train time variable. The utilities remain
exactly the same, with the substitution of such a variable with its Box-Cox
transformation. This introduces one more unknown parameter, $\lambda$. We show
in Figure 6.3 a Biogeme snapshot from the model specification file to visualize
how non-linear in parameters utility functions are implemented.
The results related to the Box-Cox transformed model are shown in Table 6.9.
The Box-Cox transformation reduces to a linear function as a special case
when the parameter $\lambda$ is equal to 1. Looking at the estimated values, we see
that $\hat{\lambda}$ is significantly different from 1 at a 95% level of confidence (t-stat
= -2.13). Note though that the parameter $\beta_{train\_time}$ associated with train
travel time is not significant.
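The transformation and the t-statistic with respect to 1 can be illustrated as follows. This is a sketch, not the Biogeme implementation: the function name `box_cox` is hypothetical, and the logarithmic branch for $\lambda = 0$ is the standard limiting case of the Box-Cox transformation, which the snapshot in Figure 6.3 does not handle explicitly.

```python
import math


def box_cox(x, lam):
    """Box-Cox transformation (x**lam - 1) / lam for x > 0."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if lam == 0:
        # Limiting case as lam tends to 0 (assumption: not in Figure 6.3).
        return math.log(x)
    return (x ** lam - 1.0) / lam


# With lam = 1 the transformation is linear: x - 1.
linear_case = box_cox(100.0, 1.0)

# Estimate and robust standard error of lambda from Table 6.9.
lam_hat, lam_se = 0.465, 0.251
t_vs_zero = lam_hat / lam_se          # reported in the table
t_vs_one = (lam_hat - 1.0) / lam_se   # relevant for the linearity test

print(f"t-stat w.r.t. 0: {t_vs_zero:.2f}, w.r.t. 1: {t_vs_one:.2f}")
```

Note that the t-statistic printed by the estimation software tests $\lambda = 0$; the test of linearity needs the statistic recomputed against 1, as done above.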
We can also perform a likelihood ratio test as follows. The null hypothesis
is given by:
\[ H_0 : \lambda = 1 \]
The $\chi^2$ statistic for this null hypothesis is as follows:
\begin{align*}
-2(\mathcal{L}(\hat{\beta}_L) - \mathcal{L}(\hat{\beta}_{BC})) &= -2(-5047.205 + 5045.420) = 3.570 \\
\chi^2_{0.95,1} &= 3.841 > 3.570
\end{align*}
Therefore, the null hypothesis of a linear specification is accepted at a 95%
level of confidence. Note that the t-test and the likelihood ratio test for
testing one restriction are asymptotically equivalent. Here the t-stat with
respect to 1 is equal to -2.13, so $\lambda$ is close to being insignificant (w.r.t. 1). In
small samples, the likelihood ratio test is preferred to the t-test. Therefore,
we prefer the linear specification over the Box-Cox transformation in this
case.
Box-Cox transformed model: estimation results
Parameter   Parameter                Parameter   Robust           Robust
number      name                     estimate    standard error   t statistic
1           $ASC_{car}$              -1.72       1.01             -1.71
2           $ASC_{SM}$               -1.32       1.01             -1.31
3           $\beta_{car\_cost}$      -0.00776    0.00150          -5.18
4           $\beta_{train\_cost}$    -0.0298     0.00200          -14.90
5           $\beta_{SM\_cost}$       -0.0107     0.000828         -12.98
6           $\beta_{car\_time}$      -0.0129     0.00162          -7.95
7           $\beta_{train\_time}$    -0.128      0.160            -0.80
8           $\beta_{SM\_time}$       -0.0111     0.00178          -6.23
9           $\beta_{he}$             -0.00535    0.00101          -5.30
10          $\beta_{ga}$             0.508       0.194            2.62
11          $\lambda$                0.465       0.251            1.85
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5045.420$
$\bar{\rho}^2 = 0.273$
Table 6.9: Estimation results for the Box-Cox transformed model
6.2 Choice of Residential Telephone Services Case
Market Segmentation
Files to use with Biogeme:
Model files: SpecTest_Tel_low_inc.mod, SpecTest_Tel_med_inc.mod,
SpecTest_Tel_high_inc.mod, MNL_Tel_socioec.mod
Data file: telephone.dat
We test if there is taste variation across market segments. We define
different segments based on income and divide the population into three
income groups. We estimate separate models for each income group using
the same model specification, namely MNL_Tel_socioec.mod used in the logit
case study, and compare the estimation results with a model based on the
complete dataset. The results in terms of final log-likelihood are summarized
in Table 6.10.
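The null hypothesis is of no taste variation across the market segments, that
is
\[ H_0 : \beta^{HI} = \beta^{MI} = \beta^{LI}. \]
Performing a likelihood ratio test (with 6 + 7 + 7 - 7 = 13 degrees of freedom),
\begin{align*}
LR &= -2\Big(\mathcal{L}_N(\hat{\beta}) - \sum_{g=1}^{G} \mathcal{L}_{N_g}(\hat{\beta}^g)\Big) \\
   &= -2(-468.791 + 120.103 + 297.990 + 46.668) = 8.060 \\
\chi^2_{0.95,13} &= 22.360
\end{align*}
We can conclude that the null hypothesis cannot be rejected, that is, there is
no evidence of taste variation across the income segments.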
Model                            Definition               Log-likelihood   Nb. of coefficients
Low Income                       Income < 10000           -120.103         6
Medium Income                    10000 < Income < 40000   -297.990         7
High Income                      Income > 40000           -46.668          7
Pooled Data (Restricted Model)   All                      -468.791         7
Table 6.10: Results for the market segmentation test
IIA test estimation
Parameter   Parameter           Parameter   Robust           Robust
number      name                estimate    standard error   t statistic
1           $ASC_{BM}$          -0.185      0.233            -0.79
2           $ASC_{LF}$          0.801       0.166            4.82
3           $ASC_{EF}$          1.07        0.833            1.28
4           $ASC_{MF}$          1.83        0.279            6.56
5           $\beta_{cost}$      -1.26       0.228            -5.51
6           $\beta_{IIA_m}$     0.832       0.334            2.49
7           $\beta_{IIA_f}$     1.83        0.538            3.41
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -460.754$
$\bar{\rho}^2 = 0.165$
Table 6.11: Estimation results for the IIA test
[IIATest]
C12 1 2
C345 3 4 5
Figure 6.4: Biogeme snapshot: IIATest section
McFadden IIA Test
Files to use with Biogeme:
Model files: MNL_base.mod, MNL_base_IIA.mod
Data file: telephone.dat
Command file: doit.bat
Supplementary software: biomerge.exe
For the telephone dataset, it is possible that there are common unobserved
attributes between the measured options (alternatives BM and SM) and
common unobserved attributes among the flat options (alternatives LF, EF,
and MF). We can perform the McFadden IIA test to check this. The
procedure is described in McFadden (1987) and Train et al. (1989). We estimate
a logit model (MNL_base.mod) on the full dataset telephone.dat. The
specification file (MNL_base.mod) contains a section describing the correlation we
want to test. The corresponding Biogeme snapshot is shown in Figure 6.4.
Alternatives 1 and 2 correspond to measured options, alternatives 3, 4, 5 to
flat options. Then the estimated model is applied on the same data file, using
BioSim. By defining the section [IIATest] in the original .mod file, auxiliary
variables are automatically computed for each observation, and reported in
the .enu output file. The original .dat file and the .enu file are merged
using BIOMERGE in order to create a new data file. As discussed above,
we assume in this case that there are 2 subsets of alternatives suspected to
be correlated: $C_1 = \{BM, SM\}$ and $C_2 = \{LF, EF, MF\}$. Now we specify a
new model (MNL_base_IIA.mod) which includes the auxiliary variables in
the utility functions associated with measured and flat options. Finally, we
estimate the model on the new data file created by merging and obtain the
results shown in Table 6.11. Note that the entire procedure described above
can be carried out automatically using the command file doit.bat.
We do a likelihood ratio test where the null hypothesis is
\[ H_0 : \beta_{IIA_m} = \beta_{IIA_f} = 0. \]
The test statistic for the null hypothesis is given by
\[ -2(\mathcal{L}_R - \mathcal{L}_U) = -2(-477.557 + 460.747) = 33.620 \]
where the restricted model is the model without the auxiliary variables and
the unrestricted model is the model with the auxiliary variables. The test
statistic is asymptotically $\chi^2$ distributed with 2 degrees of freedom since
there are 2 restrictions. Since 33.620 > 5.991 (the critical value of the $\chi^2$
distribution with 2 degrees of freedom at a 95% level of confidence), we
reject the null hypothesis and conclude that the IIA assumption holds neither
for the group of measured alternatives nor for the group
of flat alternatives. In the presence of such correlations, GEV models like
the Nested Logit are more appropriate.
Test of Non-Nested Hypotheses
In discrete choice analysis, we often perform tests based on the so-called
nested hypotheses, which means that we specify two models such that the first
one (the restricted model) is a special case of the second one (the unrestricted
model). For this type of comparison, the classical likelihood ratio test can
be applied. However, there are situations in which we aim at comparing
models which are not nested, meaning that one model cannot be obtained as
a restricted version of the other. One way to compare two non-nested models
is to build a composite model from which both models can be derived. We
can thus perform two likelihood ratio tests, one for each of the restricted models
against the composite model. This procedure is known as the Cox test of
separate families of hypotheses.
Cox Test
Files to use with Biogeme:
Model files: SpecTest_Tel_M1.mod, SpecTest_Tel_M2.mod,
SpecTest_Tel_MC.mod
Data file: telephone.dat
The Cox test is described in detail in Ben-Akiva and Lerman (1985), pages
171-174, and in the Textbook of the course, in section "Tests of Non-Nested
Hypothesis". Assume that we want to test a model $M_1$ against another
model $M_2$ (and one model is not a restricted version of the other). We start
by generating a composite model $M_C$ such that both models $M_1$ and $M_2$ are
restricted cases of $M_C$. We then test $M_1$ against $M_C$ and $M_2$ against $M_C$
using the likelihood ratio test. There are three possible outcomes of this test:
• One of the two models is rejected. Then we keep the one that is not rejected.
• Both models are rejected. Then better models should be developed.
• Both models are accepted. Then we choose the model with the higher $\bar{\rho}^2$ index.
The deterministic parts of the utility functions for each of the three model
specifications are:
1. $M_1$
\begin{align*}
V_{BM} &= ASC_{BM} + \beta_{Mcost}\, cost_{BM} \\
V_{SM} &= \beta_{Mcost}\, cost_{SM} \\
V_{LF} &= ASC_{LF} + \beta_{Fcost}\, cost_{LF} \\
V_{EF} &= ASC_{EF} + \beta_{Fcost}\, cost_{EF} \\
V_{MF} &= ASC_{MF} + \beta_{Fcost}\, cost_{MF}
\end{align*}
2. $M_2$
\begin{align*}
V_{BM} &= ASC_{BM} + \beta_{cost}\, cost_{BM} \\
V_{SM} &= \beta_{cost}\, cost_{SM} \\
V_{LF} &= ASC_{LF} + \beta_{cost}\, cost_{LF} + \beta_{users}\, users \\
V_{EF} &= ASC_{EF} + \beta_{cost}\, cost_{EF} + \beta_{users}\, users \\
V_{MF} &= ASC_{MF} + \beta_{cost}\, cost_{MF} + \beta_{users}\, users
\end{align*}
3. $M_C$
\begin{align*}
V_{BM} &= ASC_{BM} + \beta_{Mcost}\, cost_{BM} \\
V_{SM} &= \beta_{Mcost}\, cost_{SM} \\
V_{LF} &= ASC_{LF} + \beta_{Fcost}\, cost_{LF} + \beta_{users}\, users \\
V_{EF} &= ASC_{EF} + \beta_{Fcost}\, cost_{EF} + \beta_{users}\, users \\
V_{MF} &= ASC_{MF} + \beta_{Fcost}\, cost_{MF} + \beta_{users}\, users
\end{align*}
Model   Nb. of parameters   Log-likelihood   $\bar{\rho}^2$
$M_1$   6                   -476.040         0.140
$M_2$   6                   -471.151         0.148
$M_C$   7                   -467.804         0.153
Table 6.12: Results from the non-nested hypothesis test
The estimation results of the different models are summarized in Table 6.12.
We first compare the $M_1$ model specification against the composite model
$M_C$ by means of a likelihood ratio test:
\begin{align*}
H_0 &: \beta_{users} = 0 \\
-2(\mathcal{L}(\hat{\beta}_{M_1}) - \mathcal{L}(\hat{\beta}_{M_C})) &= -2(-476.040 + 467.804) = 16.472 \\
\chi^2_{0.95,1} &= 3.841 < 16.472
\end{align*}
We can therefore reject the null hypothesis of not including socio-economic
variables. We then compare $M_2$ against $M_C$:
\begin{align*}
H_0 &: \beta_{Mcost} = \beta_{Fcost} \\
-2(\mathcal{L}(\hat{\beta}_{M_2}) - \mathcal{L}(\hat{\beta}_{M_C})) &= -2(-471.151 + 467.804) = 6.694 \\
\chi^2_{0.95,1} &= 3.841 < 6.694
\end{align*}
We can therefore reject the null hypothesis of generic coefficients. Since both
models are rejected, we need to develop better models. Had both models
been accepted, we could have used $\bar{\rho}^2$ to choose which model to keep.
The adjusted likelihood ratio index $\bar{\rho}^2$ is computed as follows (it is provided
in the Biogeme result file):
\[ \bar{\rho}^2 = 1 - \frac{\mathcal{L}(\hat{\beta}) - K}{\mathcal{L}(0)} \]
So, for the two models $M_1$ and $M_2$, we obtain respectively:
\begin{align*}
\bar{\rho}^2_1 &= 0.140 \\
\bar{\rho}^2_2 &= 0.148
\end{align*}
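The two values can be reproduced directly from the formula. In the sketch below the function name `adjusted_rho_squared` is hypothetical; the final log-likelihoods and numbers of parameters come from Table 6.12, and the null log-likelihood of -560.250 is the value reported for the telephone dataset in the summary statistics of the other tables (for example Table 6.11).

```python
def adjusted_rho_squared(final_ll, null_ll, n_params):
    """Adjusted likelihood ratio index: 1 - (L(beta_hat) - K) / L(0)."""
    return 1.0 - (final_ll - n_params) / null_ll


NULL_LL = -560.250  # L(0) for the telephone dataset (434 observations)

rho_m1 = adjusted_rho_squared(-476.040, NULL_LL, 6)
rho_m2 = adjusted_rho_squared(-471.151, NULL_LL, 6)
rho_mc = adjusted_rho_squared(-467.804, NULL_LL, 7)

print(f"M1: {rho_m1:.3f}, M2: {rho_m2:.3f}, MC: {rho_mc:.3f}")
```

Subtracting K from the final log-likelihood penalizes models with more parameters, which is what makes the index usable for comparing non-nested specifications.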
Tests of Non-Linear Specifications
In the previous case study, the models were specified with linear in
parameter formulations of the deterministic parts of the utilities (parameters that
remain constant throughout the whole range of the values of each variable).
However, in some cases, non-linear specifications may be more justified (e.g.
sensitivity to cost may not be the same in all cost ranges). In this section,
we test three different non-linear specifications of the deterministic utility
functions (see Ben-Akiva and Lerman, 1985, pages 174-179), namely the
piecewise linear approximation, the power series method and the Box-Cox
transformation. We have used the logit model with alternative specific cost
coefficients as the base model (SpecTest_Tel_M1.mod).
Piecewise Linear Approximation
Files to use with Biogeme:
Model file: SpecTest_Tel_piecewise.mod
Data file: telephone.dat
In the first model, we assume that the coefficient of measured cost takes
different values for different ranges of the cost variable. The measured cost
variable ranges from 3.28 to 433.5 dollars. We split the range
of values for $cost_i$ (which is $cost_i \in [3.28, 433.5]$, expressed in dollars) into
three different intervals: $cost_{i1} \in [0, 10]$, $cost_{i2} \in\, ]10, 50]$ and $cost_{i3} > 50$.
The selection of these ranges is based on a priori hypotheses about user
behavior and on the distribution of cost in the observed sample. The reader is
encouraged to experiment with different ranges. An extract from the Biogeme
model file to code the ranges of costs is presented in Figure 6.5.
[Expressions]
// Define here arithmetic expressions for name
// that are not directly available from the data
cost11 =min(cost1 ,10)
cost12 =max(0,min(cost1 - 10 ,40))
cost13 =max(0,cost1 - 50)
cost21 =min(cost2 ,10)
cost22 =max(0,min(cost2 - 10 ,40))
cost23 =max(0,cost2 - 50)
Figure 6.5: Biogeme snapshot for the piecewise linear approximation
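The deterministic utility functions are
\begin{align*}
V_{BM} &= ASC_{BM} + \beta_{Mcost1}\, cost_{BM1} + \beta_{Mcost2}\, cost_{BM2} + \beta_{Mcost3}\, cost_{BM3} \\
V_{SM} &= \beta_{Mcost1}\, cost_{SM1} + \beta_{Mcost2}\, cost_{SM2} + \beta_{Mcost3}\, cost_{SM3} \\
V_{LF} &= ASC_{LF} + \beta_{Fcost}\, cost_{LF} \\
V_{EF} &= ASC_{EF} + \beta_{Fcost}\, cost_{EF} \\
V_{MF} &= ASC_{MF} + \beta_{Fcost}\, cost_{MF}
\end{align*}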
The results shown in Table 6.13 indicate that the sensitivity to measured
cost becomes less important in the range $10 < cost_i < 50$ compared to the
range $cost_i < 10$, but increases steeply for higher costs. This model attains a
higher final log-likelihood than the model with a linear cost specification. To
test whether or not the improvement in goodness-of-fit is statistically
significant, we need to perform a likelihood ratio test between the two different
specifications.
The null hypothesis in this case is
\[ H_0 : \beta_{Mcost1} = \beta_{Mcost2} = \beta_{Mcost3} \]
The $\chi^2$ statistic for this null hypothesis is as follows:
\begin{align*}
-2(\mathcal{L}(\hat{\beta}_R) - \mathcal{L}(\hat{\beta}_U)) &= -2(-476.040 + 474.703) = 2.674 \\
\chi^2_{0.95,2} &= 5.991 > 2.674
\end{align*}
where the restricted model (R) is represented by the linear specification while
the unrestricted model (U) corresponds to the piecewise linear specification.
The improvement in goodness-of-fit due to the introduction of the piecewise
linear specification is not significant and the null hypothesis that the cost
coefficient is linear cannot be rejected.
Piecewise linear approximation
Parameter   Parameter           Parameter   Robust           Robust
number      name                estimate    standard error   t statistic
1           $ASC_{BM}$          -0.613      0.152            -4.03
2           $ASC_{LF}$          -0.631      0.500            -1.26
3           $ASC_{EF}$          -0.843      0.869            -0.97
4           $ASC_{MF}$          -0.261      0.640            -0.41
5           $\beta_{Mcost1}$    -0.294      0.0661           -4.44
6           $\beta_{Mcost2}$    -0.149      0.0665           -2.23
7           $\beta_{Mcost3}$    -1.23       0.629            -1.96
8           $\beta_{Fcost}$     -0.105      0.0217           -4.84
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -474.703$
$\bar{\rho}^2 = 0.138$
Table 6.13: Estimation results for the piecewise linear approximation
The Power Series Expansion
Files to use with Biogeme:
Model file: SpecTest_Tel_powerseries.mod
Data file: telephone.dat
In this test, we relax the hypothesis of a linear cost effect for measured options
by assuming a second order power series (a squared term and a linear term).
The corresponding systematic utility functions are
\begin{align*}
V_{BM} &= ASC_{BM} + \beta_{Mcost1}\, cost_{BM} + \beta_{Mcost2}\, cost_{BM}^2 \\
V_{SM} &= \beta_{Mcost1}\, cost_{SM} + \beta_{Mcost2}\, cost_{SM}^2 \\
V_{LF} &= ASC_{LF} + \beta_{Fcost}\, cost_{LF} \\
V_{EF} &= ASC_{EF} + \beta_{Fcost}\, cost_{EF} \\
V_{MF} &= ASC_{MF} + \beta_{Fcost}\, cost_{MF}.
\end{align*}
From the estimation results presented in Table 6.14, it may be noted that the
coecient of the squared term is positive while the coecient of the linear
term is negative, and the coecient of the linear term is greater in absolute
value than that of the squared term. However, since the squared term is very
small in magnitude, the total eect is expected to remain negative in the
cost range which can be easily veried through a plot of utility versus cost.
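As a rough check of the statement above, the cost contribution and its slope can be tabulated from the estimates in Table 6.14. This is a sketch only: the function names are ours, and the cost grid is illustrative rather than taken from telephone.dat.

```python
B_MCOST1 = -0.227     # linear term, Table 6.14
B_MCOST2 = 0.000475   # squared term, Table 6.14


def cost_contribution(cost):
    """Contribution of measured cost to the systematic utility."""
    return B_MCOST1 * cost + B_MCOST2 * cost ** 2


def marginal_utility(cost):
    """Derivative of the cost contribution with respect to cost."""
    return B_MCOST1 + 2.0 * B_MCOST2 * cost


# Cost at which the marginal utility of cost would change sign.
turning_point = -B_MCOST1 / (2.0 * B_MCOST2)

for cost in (5, 20, 50, 100):  # illustrative grid, not taken from the data
    print(cost, round(cost_contribution(cost), 3), round(marginal_utility(cost), 4))
print("marginal utility changes sign at cost =", round(turning_point, 1))
```

Under these estimates the sign change occurs at a cost far above the illustrative grid, which is consistent with a negative effect over moderate cost values.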
To test whether or not we should prefer the power series expansion specification
over the linear specification, we need to perform a likelihood ratio test.
The null hypothesis in this case is:

\[ H_0 : \beta_{\text{Mcost2}} = 0 \]

The $\chi^2$ statistic for this null hypothesis is as follows:

\[ -2\,(L(\hat{\beta}_R) - L(\hat{\beta}_U)) = -2\,(-476.040 + 475.465) = 1.150 \]

\[ \chi^2_{0.95,1} = 3.841 > 1.150 \]

where now the unrestricted model (U) corresponds to the power series
specification. Therefore, we cannot reject the null hypothesis of a linear
specification at a 95% level of confidence, and we select the linear
specification over the power series expansion specification.

Power series estimation

Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_BM        -0.563     0.147           -3.83
2          ASC_LF        -0.162     0.370           -0.44
3          ASC_EF        -0.377     0.813           -0.46
4          ASC_MF        0.215      0.532           0.41
5          beta_Mcost1   -0.227     0.0427          -5.32
6          beta_Mcost2   0.000475   0.0000936       5.07
7          beta_Fcost    -0.107     0.0218          -4.91

Summary statistics
Number of observations = 434
$L(0) = -560.250$
$L(\hat{\beta}) = -475.465$
$\bar{\rho}^2 = 0.139$

Table 6.14: Estimation results for the power series expansion
The Box-Cox Transformation

Files to use with Biogeme:
Model file: SpecTest_Tel_boxcox.mod
Data file: telephone.dat

In this section, we analyze the possibility of testing non-linear
transformations of variables which are non-linear in the unknown parameters.
One such transformation is the Box-Cox, expressed as

\[ \frac{x^{\lambda} - 1}{\lambda}, \quad \text{where } x \geq 0, \]

where $\lambda$ is a parameter that has to be estimated. We apply such a
transformation to the measured cost variable. The utilities remain the same with the
substitution of the measured cost variable with its Box-Cox transformation.
The Biogeme snapshot defining such a transformation is shown in Figure 6.6.
The parameter $\lambda$ is estimated along with the other parameters.

[Utilities]
// Id Name Avail linear-in-parameter expression
1 BM avail1 ASC_BM * one
2 SM avail2 ASC_SM * one
3 LF avail3 ASC_LF * one + B_FCOST * cost3
4 EF avail4 ASC_EF * one + B_FCOST * cost4
5 MF avail5 ASC_MF * one + B_FCOST * cost5
[GeneralizedUtilities]
1 B_MCOST * ( ( ( cost1 )^ LAMBDA - 1)/LAMBDA )
2 B_MCOST * ( ( ( cost2 )^ LAMBDA - 1)/LAMBDA )

Figure 6.6: Biogeme snapshot for the Box-Cox transformation
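To make the role of $\lambda$ concrete, the transformation can be sketched as a small Python function. The name box_cox and the handling of the limiting case are our own assumptions for illustration; Biogeme itself evaluates the expression written in the [GeneralizedUtilities] section.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform (x**lam - 1) / lam of a non-negative value x."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if lam == 0:
        # Limit of the transform as lam tends to 0 (requires x > 0).
        return math.log(x)
    return (x ** lam - 1.0) / lam


print(box_cox(10.0, 1.0))   # lam = 1: linear in x (equal to x - 1)
print(box_cox(10.0, 0.5))   # intermediate case
print(box_cox(10.0, 0.0))   # lam = 0: natural logarithm of x
```

With lam equal to 1 the variable enters linearly up to a constant shift, and as lam approaches 0 the transform approaches the logarithm, which is why tests of lam against 1 and against 0 are both informative.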
The estimation results are shown in Table 6.15. The estimate of $\lambda$ was not
found to be statistically significantly different from 0. However, it is
statistically significantly different from 1 (t-statistic w.r.t. 1 is -2.51).
Therefore, we should prefer this non-linear specification over the linear
specification. We can also perform a likelihood ratio test as follows. The null
hypothesis is given by:

\[ H_0 : \lambda = 1 \]

The $\chi^2$ statistic for this null hypothesis is as follows:

\[ -2\,(L(\hat{\beta}_L) - L(\hat{\beta}_{BC})) = -2\,(-476.040 + 472.624) = 6.832 \]

\[ \chi^2_{0.95,1} = 3.841 < 6.832 \]

Therefore, the null hypothesis of a linear specification can be rejected at a
95% level of confidence, and we prefer the Box-Cox transformation.

Box-Cox estimation

Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_BM        -0.695     0.166           -4.19
2          ASC_LF        -1.76      1.20            -1.46
3          ASC_EF        -1.98      1.39            -1.43
4          ASC_MF        -1.39      1.28            -1.09
5          beta_Fcost    -0.104     0.0215          -4.83
6          beta_Mcost    -1.30      0.880           -1.47
7          lambda        0.234      0.305           0.77

Summary statistics
Number of observations = 434
$L(0) = -560.250$
$L(\hat{\beta}) = -472.624$
$\bar{\rho}^2 = 0.144$

Table 6.15: Estimation results for the Box-Cox transformation
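The t-statistic reported by Biogeme tests each parameter against 0, so the test of $\lambda$ against 1 has to be computed by hand. A minimal sketch, using the estimate and robust standard error of $\lambda$ from Table 6.15 (the helper name t_statistic is ours):

```python
def t_statistic(estimate, std_error, hypothesized_value=0.0):
    """t-statistic of an estimate against a hypothesized value."""
    return (estimate - hypothesized_value) / std_error


LAMBDA_HAT, LAMBDA_SE = 0.234, 0.305   # Table 6.15

t_vs_0 = t_statistic(LAMBDA_HAT, LAMBDA_SE)        # H0: lambda = 0
t_vs_1 = t_statistic(LAMBDA_HAT, LAMBDA_SE, 1.0)   # H0: lambda = 1 (linear form)
print(round(t_vs_0, 2), round(t_vs_1, 2))
```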
6.3 Airline Itinerary Case

Market Segmentation

Files to use with Biogeme:
Model files: SpecTest_airline_male.mod,
SpecTest_airline_female.mod,
SpecTest_airline_GenderNA.mod,
SpecTest_airline_full.mod
Data file: airline.dat

In this example, we test if there exists taste variation across market segments.
The segmentation is made on the gender variable. We first create three
market segments as follows: Male, Female, and no answer (NA). The sum of
the number of observations for each segment is equal to the total number of
observations:

\[ N_{\text{Male}} + N_{\text{Female}} + N_{\text{NA}} = N \]

We estimate a model on the full data set. Then we estimate the same model
for each gender group separately. Note that we make use of the [Exclude]
section in the model specification file to define the observations which should
be excluded for the estimation. We obtain the values shown in Table 6.16.
The expressions of the utility functions are the same for all models:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_2 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate} \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT}}\,\text{Total\_TT}_3 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt3\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt3\_SchedDelayLate}
\end{aligned}
\]

Let us remark that one of the three alternative specific constants $\text{ASC}_1$,
$\text{ASC}_2$ and $\text{ASC}_3$ must be set to 1 for normalization purposes.
The null hypothesis assumes no taste variation across the market segments:

\[ H_0 : \beta_{\text{Male}} = \beta_{\text{Female}} = \beta_{\text{NA}} \]

where $\beta_{\text{segment}}$ is the vector of coefficients of the market segment.
Note that in the above equation Male, Female and NA refer to market segments and
not to variables in the dataset.

Model             Log likelihood  Number of coefficients
Male              -1195.819       9
Female            -929.325        9
NA                -178.017        9
Restricted model  -2320.447       9

Table 6.16: Values for the market segmentation test

The likelihood ratio test (with 27 - 9 = 18 degrees of freedom) yields

\[
\begin{aligned}
LR &= -2\left[ L_N(\hat{\beta}) - \left( L_{N_{\text{Male}}}(\hat{\beta}_{\text{Male}}) + L_{N_{\text{Female}}}(\hat{\beta}_{\text{Female}}) + L_{N_{\text{NA}}}(\hat{\beta}_{\text{NA}}) \right) \right] \\
   &= -2\,(-2320.447 + 1195.819 + 929.325 + 178.017) = 34.572
\end{aligned}
\]

\[ \chi^2_{0.95,18} = 28.87 \]

and we can therefore reject the null hypothesis at a 95% level of confidence:
market segmentation on gender does exist.
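The arithmetic of the market segmentation test can be reproduced from the values in Table 6.16. The following sketch uses variable names of our own choosing and takes the critical value from the text rather than computing it.

```python
# Final log likelihoods from Table 6.16.
segment_ll = {"Male": -1195.819, "Female": -929.325, "NA": -178.017}
pooled_ll = -2320.447          # restricted model, estimated on the full sample
n_coefficients = 9             # coefficients in each model

# LR = -2 [ L_N(pooled) - sum of the segment log likelihoods ]
lr = -2.0 * (pooled_ll - sum(segment_ll.values()))

# Degrees of freedom: coefficients over all segments minus those of the pooled model.
dof = n_coefficients * len(segment_ll) - n_coefficients

CHI2_95_18 = 28.87             # critical value quoted in the text
print(f"LR = {lr:.3f}, dof = {dof}, reject H0: {lr > CHI2_95_18}")
```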
McFadden IIA Test

Files to use with Biogeme:
Model files: SpecTest_airline_full.mod,
SpecTest_airline_IIA.mod
Data file: airline.dat

In this survey, the choice is made between three flight itineraries, two of which
are with the same company. It is possible that there are common unobserved
attributes between the two itineraries of the same company. It would seem
logical to expect a relationship between the traditional alternatives. They
might be correlated. In order to test this assumption, we perform the McFadden
IIA test. First we estimate a logit model (SpecTest_airline_full_bis.mod)
on the full data set airline.dat:

biogeme SpecTest_airline_full_bis airline.dat

The specification file SpecTest_airline_full_bis.mod contains a section
describing the correlation we want to test. The corresponding Biogeme snapshot
is shown in Figure 6.7. Alternative 1 corresponds to an itinerary without stops,
and alternative 2 to an itinerary with the same company but with one stop.

[IIATest]
C12 1 2

Figure 6.7: Biogeme snapshot: IIATest section

By defining the section [IIATest] in the original .mod file, auxiliary variables
are automatically computed for each observation, and reported in
the .enu output file. Biogeme also produces a file,
SpecTest_airline_full_bis.res, containing the specification of the estimated
model in the same format as the model specification file. We need to rename it
as a .mod file, SpecTest_airline_full_bis_res.mod, in order to apply it to the
same data file using BioSim:

biosim SpecTest_airline_full_bis_res airline.dat

The original .dat file and the SpecTest_airline_full_bis_res.enu file need to be
merged in order to create a new data file that contains both the original
model variables and the auxiliary variables. This step is performed using
BIOMERGE:

biomerge airline.dat SpecTest_airline_full_bis_res.enu

The merged data file is stored in a file named biomergeOutput.lis. We rename
this file as SpecTest_airline_IIATest.dat. Now we specify a new model
(SpecTest_airline_IIA.mod) which includes the auxiliary variables in the utility
functions associated with alternatives 1 and 2. Finally, we estimate this
model on the new data file created by merging the original data file and
SpecTest_airline_full_bis_res.enu, using the following command:

biogeme SpecTest_airline_IIA SpecTest_airline_IIATest.dat
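The sequence of steps described above can be summarized as a dry-run checklist. The sketch below only builds and prints the steps; it does not run Biogeme, BioSim or BIOMERGE, and the function name iia_test_steps is our own. The commands and file names are those given in the text.

```python
def iia_test_steps(model="SpecTest_airline_full_bis", data="airline.dat"):
    """List the commands and file renames of the IIA test procedure, in order."""
    res_model = model + "_res"
    return [
        ("run", f"biogeme {model} {data}"),
        ("rename", f"{model}.res -> {res_model}.mod"),
        ("run", f"biosim {res_model} {data}"),
        ("run", f"biomerge {data} {res_model}.enu"),
        ("rename", "biomergeOutput.lis -> SpecTest_airline_IIATest.dat"),
        ("run", "biogeme SpecTest_airline_IIA SpecTest_airline_IIATest.dat"),
    ]


for kind, step in iia_test_steps():
    print(f"{kind:>6}: {step}")
```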
Logit model for IIA test for itineraries 1 and 2

Parameter  Parameter        Parameter  Robust          Robust
number     name             estimate   standard error  t statistic
1          ASC_2            -1.51      0.211           -7.14
2          ASC_3            -1.65      0.194           -8.51
3          beta_Fare        -0.0198    0.00104         -18.94
4          beta_Legroom     0.232      0.0281          8.24
5          beta_SchedDE     -0.143     0.0168          -8.49
6          beta_SchedDL     -0.107     0.0145          -7.40
7          beta_Total_TT_1  -0.341     0.0744          -4.58
8          beta_Total_TT_2  -0.304     0.0700          -4.34
9          beta_Total_TT_3  -0.312     0.00111         -4.65
10         beta_IIA         -0.0489    0.0714          -4.37

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2320.155$
$\bar{\rho}^2 = 0.412$

Table 6.17: Logit model for IIA test
The estimation results are shown in Table 6.17.

In the IIA test, we are interested in the value of the t-statistic for the
coefficient related to the auxiliary variables. If $\beta_{\text{IIA}}$ is
significantly different from 0 at a 95% level of confidence, this indicates that
the IIA property does not hold for alternatives 1 and 2. It would mean that
alternatives 1 and 2 share some unobserved attributes.

However, Table 6.17 shows that parameter $\beta_{\text{IIA}}$ is not significantly
different from 0. Hence we cannot conclude that the IIA property does not hold.
The calibration of more complex models such as Generalized Extreme Value
(GEV) models, which capture correlation between alternatives sharing some
common characteristics, might not be justified in this case. We can hence
keep the logit specification.

Let us note that the whole procedure for the IIA test can be performed
automatically by double-clicking on the batch file doit.bat.
Test of Non-Nested Hypotheses

Files to use with Biogeme:
Model files: SpecTest_airline_full.mod ($M_1$),
SpecTest_airline_full_LogFare.mod ($M_2$),
SpecTest_airline_full_C.mod ($M_C$)
Data file: airline.dat

In discrete choice analysis, we often perform tests based on so-called nested
hypotheses, which means that we specify two models such that the first one
(the restricted model) is a special case of the second one (the unrestricted
model). For this type of comparison, the classical likelihood ratio test can
be applied. However, there are situations, such as non-linear specifications,
in which we aim at comparing models which are not nested, i.e. one model
cannot be obtained as a restricted version of the other. One way to compare
two non-nested models is to build a composite model from which both models
can be derived. We can thus perform two likelihood ratio tests, testing each
of the restricted models against the composite model. This procedure is
known as the Cox test of separate families of hypotheses.
Cox Test

The Cox test is described in detail in Ben-Akiva and Lerman (1985), pages
171-174, and in the Textbook of the course, section Tests of Non-Nested
Hypothesis. Assume that we want to test a model $M_1$ against another
model $M_2$ (and one model is not a restricted version of the other). We start
by generating a composite model $M_C$ such that both models $M_1$ and $M_2$ are
restricted cases of $M_C$. We then test $M_1$ against $M_C$ and $M_2$ against
$M_C$ using the likelihood ratio test. There are three possible outcomes of this test:

- One of the two models is rejected. Then we keep the one that is not
  rejected.
- Both models are rejected. Then better models should be developed.
  The composite model could be used as a new basis for future
  specifications.
- Both models are accepted. Then we choose the model with the highest
  $\bar{\rho}^2$ index.
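The three outcomes listed above amount to a simple decision rule, sketched here in Python. The function names are ours, and the sketch assumes that each model differs from the composite model by a single restriction, so that the 95% critical value for one degree of freedom applies to both tests. The usage line takes the final log likelihoods reported in Tables 6.19 to 6.21.

```python
CHI2_95_1 = 3.841  # 95% chi-square critical value, 1 degree of freedom


def lr_stat(ll_restricted, ll_composite):
    """Likelihood ratio statistic of a restricted model against the composite."""
    return -2.0 * (ll_restricted - ll_composite)


def cox_decision(ll_m1, ll_m2, ll_mc, critical_value=CHI2_95_1):
    """Apply the three-outcome rule of the Cox test."""
    reject_m1 = lr_stat(ll_m1, ll_mc) > critical_value
    reject_m2 = lr_stat(ll_m2, ll_mc) > critical_value
    if reject_m1 and reject_m2:
        return "both rejected: develop better models"
    if reject_m1:
        return "keep M2"
    if reject_m2:
        return "keep M1"
    return "both accepted: choose the higher adjusted rho-squared"


print(cox_decision(-2320.447, -2283.103, -2271.656))
```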
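We present here the expressions of the utility functions used for three different
models $M_1$, $M_2$ and $M_C$ developed on the airline itinerary case study.

$M_1$ has the following systematic utilities:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT}_1}\,\text{Total\_TT}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT}_2}\,\text{Total\_TT}_2 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate} \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT}_3}\,\text{Total\_TT}_3 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt3\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt3\_SchedDelayLate}
\end{aligned}
\]

where the cost-related term is linear in the fare.

The systematic utilities of $M_2$ are expressed as follows:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{LogFare}}\,\log(\text{Fare}_1) + \beta_{\text{Legroom}}\,\text{Legroom}_1 + \beta_{\text{Total\_TT}_1}\,\text{Total\_TT}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{LogFare}}\,\log(\text{Fare}_2) + \beta_{\text{Legroom}}\,\text{Legroom}_2 + \beta_{\text{Total\_TT}_2}\,\text{Total\_TT}_2 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate} \\
V_3 &= \text{ASC}_3 + \beta_{\text{LogFare}}\,\log(\text{Fare}_3) + \beta_{\text{Legroom}}\,\text{Legroom}_3 + \beta_{\text{Total\_TT}_3}\,\text{Total\_TT}_3 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt3\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt3\_SchedDelayLate}
\end{aligned}
\]

where the cost-related term is logarithmic in the fare.

We now define the composite model $M_C$ with the following systematic utilities:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{LogFare}}\,\log(\text{Fare}_1) + \beta_{\text{Legroom}}\,\text{Legroom}_1 \\
    &\quad + \beta_{\text{Total\_TT}_1}\,\text{Total\_TT}_1 + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{LogFare}}\,\log(\text{Fare}_2) + \beta_{\text{Legroom}}\,\text{Legroom}_2 \\
    &\quad + \beta_{\text{Total\_TT}_2}\,\text{Total\_TT}_2 + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate} \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{LogFare}}\,\log(\text{Fare}_3) + \beta_{\text{Legroom}}\,\text{Legroom}_3 \\
    &\quad + \beta_{\text{Total\_TT}_3}\,\text{Total\_TT}_3 + \beta_{\text{SchedDE}}\,\text{Opt3\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt3\_SchedDelayLate}
\end{aligned}
\]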
Table 6.18 summarizes the differences between the various models and
Tables 6.19, 6.20 and 6.21 show the estimation results for models $M_1$, $M_2$
and $M_C$, respectively.

Now we can apply the likelihood ratio test for $M_1$ against $M_C$. In this case,
the null hypothesis is:

\[ H_0 : \beta_{\text{LogFare}} = 0 \]

As usual, $-2\,(L(M_1) - L(M_C))$ is $\chi^2$ distributed with $K = 1$ degree of
freedom. In this case, we have:

\[ -2\,(-2320.447 + 2271.656) = 97.582 > 3.84 \]

The result of this first test is that we can reject the null hypothesis $H_0$: it
means the composite model is better than $M_1$. The linear model is rejected.

Applying the same test for $M_2$ against $M_C$, we have

\[ H_1 : \beta_{\text{Fare}} = 0. \]

In this case, the likelihood ratio test with $K = 1$ degree of freedom gives

\[ -2\,(-2283.103 + 2271.656) = 22.894 > 3.84 \]

and we can therefore reject the null hypothesis $H_1$ in this case as well. The
logarithmic model is also rejected. Since both models are rejected, better
models should be developed: we cannot keep the composite model with two
different cost-related coefficients since it does not have a behavioral
interpretation. If both models had been accepted, we would choose the one with the
highest $\bar{\rho}^2$ index.

Models used for the Cox test

Model  Parameters  Description
M_1    9           two ASCs, one generic linear cost coefficient, three
                   alternative specific time coefficients and three generic
                   coefficients (for legroom, schedule delay early
                   departure, schedule delay late departure)
M_2    9           two ASCs, one generic logarithmic cost coefficient, three
                   alternative specific time coefficients and three generic
                   coefficients (for legroom, schedule delay early
                   departure, schedule delay late departure)
M_C    10          two ASCs, one generic linear cost coefficient, one generic
                   logarithmic cost coefficient, three alternative specific
                   time coefficients and three generic coefficients (for
                   legroom, schedule delay early departure, schedule delay
                   late departure)

Table 6.18: Summary of the different model specifications

Parameter  Parameter    Parameter  Robust
number     name         estimate   standard error  t-stat  p-value
1          ASC_2        -1.43      0.183           -7.81   0.00
2          ASC_3        -1.64      0.192           -8.53   0.00
3          Fare         -0.0193    0.000802        -24.05  0.00
4          Legroom      0.226      0.0267          8.45    0.00
5          SchedDE      -0.139     0.0163          -8.53   0.00
6          SchedDL      -0.104     0.0137          -7.59   0.00
7          Total_TT_1   -0.332     0.0735          -4.52   0.00
8          Total_TT_2   -0.299     0.0696          -4.29   0.00
9          Total_TT_3   -0.302     0.0699          -4.31   0.00

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2320.447$
$\bar{\rho}^2 = 0.412$

Table 6.19: Estimation results for model $M_1$

Parameter  Parameter    Parameter  Robust
number     name         estimate   standard error  t-stat  p-value
1          ASC_2        -1.82      0.194           -9.39   0.00
2          ASC_3        -2.09      0.200           -10.46  0.00
3          Fare         -8.54      0.305           -28.02  0.00
4          Legroom      0.219      0.0261          8.38    0.00
5          SchedDE      -0.142     0.0167          -8.50   0.00
6          SchedDL      -0.105     0.0139          -7.54   0.00
7          Total_TT_1   -0.465     0.0729          -6.37   0.00
8          Total_TT_2   -0.335     0.0690          -4.86   0.00
9          Total_TT_3   -0.321     0.0692          -4.63   0.00

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2283.103$
$\bar{\rho}^2 = 0.422$

Table 6.20: Estimation results for model $M_2$
Parameter  Parameter    Parameter  Robust
number     name         estimate   standard error  t-stat  p-value
1          ASC_2        -1.69      0.193           -8.74   0.00
2          ASC_3        -1.94      0.199           -9.72   0.00
3          Fare         -0.00658   0.00154         -4.28   0.00
4          Legroom      0.223      0.0265          8.40    0.00
5          LogFare      -5.96      0.665           -8.96   0.00
6          SchedDE      -0.142     0.0167          -8.51   0.00
7          SchedDL      -0.106     0.0140          -7.57   0.00
8          Total_TT_1   -0.415     0.0739          -5.62   0.00
9          Total_TT_2   -0.324     0.0694          -4.67   0.00
10         Total_TT_3   -0.316     0.0697          -4.53   0.00

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2271.656$
$\bar{\rho}^2 = 0.425$

Table 6.21: Estimation results for model $M_C$

Tests of Non-Linear Specifications

Files to use with Biogeme:
Model files: SpecTest_airline_piecewise.mod,
SpecTest_airline_powerseries.mod,
SpecTest_airline_boxcox.mod
Data file: airline.dat

The models studied previously were specified with linear-in-parameter
formulations of the deterministic parts of the utilities (i.e. parameters that
remain constant throughout the whole range of the values of each variable).
However, in some cases non-linear specifications may be more justified. In this
section, we test three different non-linear specifications of the deterministic
utility functions: a piecewise linear specification of the time parameter of the
non-stop itinerary, a power series method and a Box-Cox transformation.
Piecewise Linear Approximation

In this first example, we want to test the hypothesis that the travel time
related parameter for the non-stop itinerary alternative assumes different
values for different ranges of values of the variable itself. We split the range
of values for travel time, $\text{TripTimeHours\_1} \in [0.67, 6.35]$ (expressed
in hours), into three different intervals: $\text{TripTimeHours\_1\_1} \in [0, 2]$,
$\text{TripTimeHours\_1\_2} \in\ ]2, 3]$, $\text{TripTimeHours\_1\_3} > 3$.
Figure 6.8 displays the corresponding Biogeme code.

[Expressions]
TripTimeHours_1_1 = min( TripTimeHours_1 , 2)
TripTimeHours_1_2 = max(0,min( TripTimeHours_1 - 2, 1))
TripTimeHours_1_3 = max(0,TripTimeHours_1 - 3)

Figure 6.8: Biogeme snapshot for the definition of the variables related to
the piecewise linear approximation
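The three expressions in Figure 6.8 split one travel time into the hours falling in each interval. The following Python sketch mirrors them one for one (the function name piecewise_time is ours) and shows that the three pieces always add up to the original travel time.

```python
def piecewise_time(trip_time_hours):
    """Split a travel time into the hours spent in [0, 2], ]2, 3] and above 3."""
    part_1 = min(trip_time_hours, 2.0)
    part_2 = max(0.0, min(trip_time_hours - 2.0, 1.0))
    part_3 = max(0.0, trip_time_hours - 3.0)
    return part_1, part_2, part_3


# The two bounds of the observed range and one intermediate value.
for t in (0.67, 2.5, 6.35):
    parts = piecewise_time(t)
    print(t, parts, sum(parts))
```

Because each piece receives its own coefficient, the slope of the utility with respect to travel time can differ between intervals while the utility itself stays continuous at 2 and 3 hours.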
The systematic utility expressions used in this model are given as follows:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
    &\quad + \beta_{\text{Total\_TT1\_1}}\,\text{Total\_TT1\_1} + \beta_{\text{Total\_TT1\_2}}\,\text{Total\_TT1\_2} + \beta_{\text{Total\_TT1\_3}}\,\text{Total\_TT1\_3} \\
V_2 &= \text{ASC}_2 + \beta_{\text{Fare}}\,\text{Fare}_2 + \beta_{\text{Legroom}}\,\text{Legroom}_2 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt2\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt2\_SchedDelayLate} + \beta_{\text{Total\_TT}_2}\,\text{Total\_TT}_2 \\
V_3 &= \text{ASC}_3 + \beta_{\text{Fare}}\,\text{Fare}_3 + \beta_{\text{Legroom}}\,\text{Legroom}_3 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt3\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt3\_SchedDelayLate} + \beta_{\text{Total\_TT}_3}\,\text{Total\_TT}_3
\end{aligned}
\]

The estimation results are shown in Table 6.22. All time coefficients related to
the piecewise linear expression are negative. The coefficient associated with
short trips (< 2 hours) is the largest in absolute value, meaning that the same
increase of travel time penalizes the utility of the non-stop alternative more
if the trip is shorter than 2 hours than if it is longer than 2 hours. Similarly,
the same increase penalizes the utility of the non-stop alternative more for
trips of intermediate duration (between 2 and 3 hours) than for trips lasting
longer than 3 hours.

Piecewise linear model: estimation results

Parameter  Parameter           Coeff.    Robust          Robust
number     name                estimate  standard error  t-stat
1          ASC_2               -2.33     0.412           -5.65
2          ASC_3               -2.55     0.438           -5.83
3          beta_Fare           -0.0193   0.000799        -24.10
4          beta_Legroom        0.227     0.0267          8.51
5          beta_SchedDE        -0.140    0.0165          -8.47
6          beta_SchedDL        -0.105    0.0137          -7.64
7          beta_Total_TT1_1    -0.825    0.238           -3.47
8          beta_Total_TT1_2    -0.443    0.188           -2.36
9          beta_Total_TT1_3    -0.229    0.0889          -2.57
10         beta_Total_TT_2     -0.300    0.0701          -4.29
11         beta_Total_TT_3     -0.301    0.0701          -4.29
...

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2315.041$
$\bar{\rho}^2 = 0.413$

Table 6.22: Estimation results for the piecewise linear model

We perform a likelihood ratio test where the restricted model is the one with
linear travel time for the non-stop alternative and the unrestricted model is
the piecewise linear specification. The null hypothesis is given as follows:

\[ H_0 : \beta_{\text{Total\_TT1\_1}} = \beta_{\text{Total\_TT1\_2}} = \beta_{\text{Total\_TT1\_3}} \]

The statistic for the likelihood ratio test is the following:

\[ -2\,(-2320.447 + 2315.041) = 10.812 \]

Since $\chi^2_{0.95,2} = 5.99 < 10.812$, we can reject the null hypothesis of a
linear travel time for the non-stop alternative at a 95% level of confidence.
The Power Series Expansion

We introduce here a power series expansion for the travel time of the non-stop
itinerary. Other polynomial expressions could be tried as well, but in
the following example, we only specify a squared term.

The specification of the model presented in this section is the same as the
one presented in the previous section, except for the utility of the
non-stop itinerary. The latter is given as follows:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
    &\quad + \beta_{\text{Total\_TT}_1}\,\text{Total\_TT}_1 + \beta_{\text{Total\_TT}_1\text{sq}}\,\text{Total\_TT}_1\text{sq}
\end{aligned}
\]

The estimation results for this specification are shown in Table 6.23. The
estimated parameter associated with the linear term of the power series
expansion is negative while the estimated parameter associated with the squared
term is positive. However, for reasonable travel times, the cumulative effect
of the travel time variable on the utility is still negative, as the coefficient
associated with the squared term is much smaller in absolute value.

Power series model: estimation results

Parameter  Parameter            Coeff.    Robust          Robust
number     name                 estimate  standard error  t-stat
1          ASC_2                -2.21     0.298           -7.42
2          ASC_3                -2.43     0.312           -7.78
3          beta_Fare            -0.0193   0.000800        -24.11
4          beta_Legroom         0.227     0.0267          8.51
5          beta_SchedDE         -0.139    0.0165          -8.46
6          beta_SchedDL         -0.105    0.0137          -7.63
7          beta_Total_TT_1      -0.870    0.172           -5.05
8          beta_Total_TT_1_sq   0.0745    0.0220          3.38
9          beta_Total_TT_2      -0.301    0.0701          -4.30
10         beta_Total_TT_3      -0.302    0.0701          -4.31
...

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2314.435$
$\bar{\rho}^2 = 0.414$

Table 6.23: Estimation results for the power series model

In order to see if the power series specification is better than the linear one,
we perform a likelihood ratio test. Here, the restricted model is the one with
linear travel time for the non-stop alternative and the unrestricted model is
the one with the power series expansion. The null hypothesis is given by:

\[ H_0 : \beta_{\text{Total\_TT}_1\text{sq}} = 0 \]

The statistic for the likelihood ratio test is given as follows:

\[ -2\,(-2320.447 + 2314.435) = 12.024 \]

Since $\chi^2_{0.95,1} = 3.841 < 12.024$, we can reject the null hypothesis of a
linear travel time for the non-stop alternative at a 95% level of confidence.
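The claim that the cumulative effect of travel time stays negative can be checked numerically from the two time coefficients in Table 6.23. The sketch below is an illustration with names of our own; the evaluation points include the bounds of the observed range of TripTimeHours_1, [0.67, 6.35].

```python
B_TT1 = -0.870      # linear term, Table 6.23
B_TT1_SQ = 0.0745   # squared term, Table 6.23


def time_contribution(hours):
    """Contribution of non-stop travel time to the systematic utility."""
    return B_TT1 * hours + B_TT1_SQ * hours ** 2


# Travel time at which the marginal effect of time is zero.
turning_point = -B_TT1 / (2.0 * B_TT1_SQ)
# Travel time at which the total contribution returns to zero.
zero_crossing = -B_TT1 / B_TT1_SQ

for hours in (0.67, 2.0, 4.0, 6.35):
    print(hours, round(time_contribution(hours), 3))
print(round(turning_point, 2), round(zero_crossing, 2))
```

With these estimates the total contribution is negative at every evaluation point, while the marginal effect of an additional hour changes sign a little below 6 hours, close to the upper end of the observed range.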
The Box-Cox Transformation

In this section, we specify a Box-Cox transformation, which is a non-linear
transformation of a variable that also depends on an unknown parameter $\lambda$.
Precisely, a Box-Cox transformation of a variable $x$ is given as follows:

\[ \frac{x^{\lambda} - 1}{\lambda}, \quad \text{where } x \geq 0. \]

We apply this transformation to the travel time variable for the non-stop
itinerary. The utilities are the same as the previous models, apart from the
one relative to the non-stop itinerary, which we report below:

\[
\begin{aligned}
V_1 &= \text{ASC}_1 + \beta_{\text{Fare}}\,\text{Fare}_1 + \beta_{\text{Legroom}}\,\text{Legroom}_1 \\
    &\quad + \beta_{\text{SchedDE}}\,\text{Opt1\_SchedDelayEarly} + \beta_{\text{SchedDL}}\,\text{Opt1\_SchedDelayLate} \\
    &\quad + \beta_{\text{Total\_TT}_1}\,\frac{\text{Total\_TT}_1^{\lambda} - 1}{\lambda}
\end{aligned}
\]

Let us note that in this specification, we have one more unknown parameter,
$\lambda$. Figure 6.9 displays a Biogeme snapshot from the model specification file.

[GeneralizedUtilities]
1 Total_TT1 * ( ( ( TripTimeHours_1 ) ^ LAMBDA - 1 ) / LAMBDA )

Figure 6.9: Biogeme snapshot of Box-Cox transformation

The results relative to the model including the Box-Cox transformation are
shown in Table 6.24.

Let us remark that the Box-Cox transformation reduces to a linear function
as a special case when the parameter $\lambda$ is equal to 1. The estimate of
$\lambda$ is significantly different from 1 at a 95% level of confidence, with a
t-test equal to -3.36.

We perform a likelihood ratio test between the linear model and the Box-Cox
model. The null hypothesis is given by:

\[ H_0 : \lambda = 1 \]

The statistic of the likelihood ratio test for this null hypothesis is given as
follows:

\[ -2\,(-2320.447 + 2314.574) = 11.746 \]

\[ \chi^2_{0.95,1} = 3.841 < 11.746 \]

The null hypothesis of a linear specification is hence rejected at a 95% level
of confidence. Therefore, the Box-Cox transformation of the time is more
adequate.

Box-Cox transformed model: estimation results

Parameter  Parameter   Coeff.    Robust          Robust
number     name        estimate  standard error  t-stat
1          ASC_2       -1.51     0.263           -5.77
2          ASC_3       -1.74     0.280           -6.22
3          Fare        -0.0193   0.000799        -24.12
4          lambda      -0.139    0.338           -0.41
5          Legroom     0.227     0.0267          8.52
6          SchedDE     -0.140    0.0165          -8.47
7          SchedDL     -0.105    0.0137          -7.63
8          Total_TT1   -1.24     0.372           -3.34
9          Total_TT2   -0.306    0.0681          -4.49
10         Total_TT3   -0.306    0.0683          -4.48
...

Summary statistics
Number of observations = 3609
$L(0) = -3964.892$
$L(\hat{\beta}) = -2314.574$
$\bar{\rho}^2 = 0.414$

Table 6.24: Estimation results for the Box-Cox transformed model
Chapter 7

Forecasting

The objective of this case study is to forecast market shares for different
policy scenarios using the models estimated in the logit model case study.
You can choose between the Swissmetro, Residential Telephone Services and
Airline Itinerary datasets. A detailed description of each dataset can be
found in Appendix A.

The provided forecasting examples are given in the following sections:
Swissmetro in section 7.2 on page 135, Residential Telephone Services in
section 7.3 on page 138 and Airline Itinerary in section 7.4 on page 141.

7.1 Guidelines

This case study differs from the previous ones since you do not develop new
model specifications. Instead, you use the model specifications from the logit
model case study. In addition to the programs you normally use, you need a
spreadsheet application such as OpenOffice Calc or Microsoft Office Excel.

The estimated coefficients of a discrete choice model can be used to calculate
the choice probability of each alternative for each observation in the sample.
In forecasting, however, we are interested in the aggregate market shares for
the entire population or for different segments. It could also be interesting
to know how these aggregate market shares are affected by a change in an
independent variable.

In this case study, you learn to aggregate the individual probabilities to
obtain market shares and to test the effect of different alternative scenarios
on the market shares. In all case studies it is assumed that the available
sample is a random sample of the population.

Start by studying the given base case as well as the corresponding
forecasting scenario based on cost policy changes. Use the given model and
spreadsheet (distributed on the course USB key) to test and analyze the
proposed scenarios.
7.2 Swissmetro Case

Forecasting the Effect of Change in Swissmetro Cost

Files to use with BIOGEME:
Model files: MNL_SM_socioec.mod
MNL_SM_socioec_res.mod
MNL_SM_socioec_res2.mod
Data file: swissmetro.dat
Excel worksheet: swissmetro.xls

In this case study, we forecast the effects of change in Swissmetro costs across
different market segments. (See Chapter 6 in Ben-Akiva and Lerman, 1985
for details on forecasting techniques.) Suppose that we know that market
segmentation exists on income. We can then consider three markets, namely,
low income, medium income and high income, that are defined as follows:

- Low Income: under $50,000 (INCOME = 0 or 1)
- Medium Income: between $50,000 and $100,000 (INCOME = 2)
- High Income: over $100,000 (INCOME = 3).

We use the model MNL_SM_socioec.mod, from the case study on logit models
(Chapter 5). The procedure used for forecasting market shares is the
following:

- Estimate the model with BIOGEME.
- Compute predicted probabilities with BioSim. See Section 2.7 on page 34
  for instructions on how to use BioSim.
- Excel can be used for editing and processing the data and probabilities.
  For example, you can open the data file with Excel and paste the
  probabilities given in the BioSim result file into the Excel file.

We have provided an Excel file (swissmetro.xls) containing the observations
and their corresponding probabilities. This file has also been used for
computing market shares by averaging the alternative probabilities over each
market segment.
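The averaging step performed in the spreadsheet can also be sketched in Python. This is an illustration only: the function name market_shares and the four sample rows are hypothetical and do not come from swissmetro.dat or from the BioSim output.

```python
from collections import defaultdict


def market_shares(rows):
    """Average predicted probabilities per alternative within each segment.

    rows: iterable of (segment, {alternative: probability}) pairs.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for segment, probs in rows:
        counts[segment] += 1
        for alternative, p in probs.items():
            totals[segment][alternative] += p
    return {
        segment: {alt: total / counts[segment] for alt, total in alts.items()}
        for segment, alts in totals.items()
    }


# Hypothetical predicted probabilities for four respondents (illustration only).
sample = [
    ("low", {"CAR": 0.10, "TRAIN": 0.30, "SM": 0.60}),
    ("low", {"CAR": 0.20, "TRAIN": 0.20, "SM": 0.60}),
    ("high", {"CAR": 0.40, "TRAIN": 0.10, "SM": 0.50}),
    ("high", {"CAR": 0.30, "TRAIN": 0.10, "SM": 0.60}),
]
shares = market_shares(sample)
print(shares)
```

Averaging probabilities in this way relies on the assumption, stated in the guidelines, that the sample is a random sample of the population.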
We would like to investigate the influence of cost on the market shares of
Swissmetro. We therefore increase the cost for the Swissmetro by 20%
and we forecast the market shares after this change. We modify the file
MNL_SM_socioec_res.mod to take into account the cost policy in the following
way:

[Expressions]
SM_COST = 1.2 * SM_CO * ( GA == 0 )

We name this file MNL_SM_socioec_res2.mod. It is provided with this case
study. We simulate again using BioSim in order to obtain the alternative
probabilities under this new scenario. The probabilities are integrated in the
Excel file (swissmetro.xls) and the market shares can be computed in the
same way as for the base case. The results for the base case and the new
cost scenario are given in Table 7.1. We can note a decrease in the market
shares of Swissmetro for all market segments. However, the decrease is small,
which indicates that travelers are not very sensitive to cost changes
for this new transportation mode.

Figure 7.1 shows the market shares of the Swissmetro alternative for the low
and high income segments as a function of changes in Swissmetro cost. We
can see that, surprisingly, the sensitivity to cost is higher for the high income
group than for the low income group. This might indicate that a different
model specification should be attempted (for example, one that includes
income as an explanatory variable). We can also note that, surprisingly, the
Swissmetro alternative has a higher market share for the low income group
than for the high income group. This could be due to the SP data collection
where the price for Swissmetro may not have been high enough to capture
the differences between these groups.
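The [Expressions] line used for the scenario can be read as a small function of the observed cost and of the GA indicator. The Python sketch below mirrors it under our own naming; the cost value of 50 is hypothetical, and the reading of GA as an indicator that sets the cost to zero when it is not 0 follows directly from the expression itself.

```python
def swissmetro_cost(sm_co, ga, multiplier=1.0):
    """Mirror of SM_COST = multiplier * SM_CO * ( GA == 0 )."""
    return multiplier * sm_co * (1 if ga == 0 else 0)


base = swissmetro_cost(50.0, ga=0)                       # base case
scenario = swissmetro_cost(50.0, ga=0, multiplier=1.2)   # cost increased by 20%
unaffected = swissmetro_cost(50.0, ga=1, multiplier=1.2) # cost is zero when GA is not 0
print(base, scenario, unaffected)
```

Other cost changes, such as those on the horizontal axis of Figure 7.1, correspond to other values of the multiplier.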
         Base case              Forecast
         Low    Med    Hi       Low    Med    Hi
         INC    INC    INC      INC    INC    INC
CAR      14     28     32       16     31     36
TRAIN    23     12     9        24     13     10
SM       62     60     60       60     56     54

Table 7.1: Market Shares (percent) for increased cost of Swissmetro

[Figure: market share (%) of Swissmetro, on a vertical axis from 40 to 70,
plotted against changes in Swissmetro cost (-20%, -10%, base, +10%, +20%),
with one curve for the high income segment and one for the low income segment.]

Figure 7.1: Swissmetro: Market Shares for Low and High Income Segments

It would also be interesting to investigate the impact on the market shares
for the following two policy scenarios:

- The Swissmetro SA has decided to provide a 20% discount to youths
  (age < 24) and 50% discount to elderly (age > 65) when using Swissmetro.
  To compensate for the lost revenue, the company considers
  increasing the general Swissmetro fare uniformly by 10%.
- The Swissmetro SA is considering an alternative option of making
  incremental investment in Swissmetro and initially starting with half the
  maglev trains they originally planned to purchase. To meet the growing
  demand, they are also considering doubling the frequency of the
  regular trains.
7.3 Choice of Residential Telephone Services
Case
Forecasting the Eect of Change in Cost Across Market
Segments
Files to use with Biogeme:
Model les: MNL Tel socioec.mod
MNL Tel socioec res.mod
MNL Tel socioec res2.mod
Data le: telephone.dat
Excel worksheet: telephone.xls
In this case study, we forecast the effects of a change in the cost of
alternatives across different market segments. (See Chapter 6 in Ben-Akiva
and Lerman, 1985, for details on forecasting techniques.) Suppose that we
know that market segmentation exists on income (Inc). We can then consider
three markets, namely, low income, medium income and high income. We
define these market segments as follows:
- Low Income: under $20,000 (Inc = 1 or 2)
- Medium Income: between $20,000 and $40,000 (Inc = 3 or 4)
- High Income: over $40,000 (Inc = 5).
We use the model MNL_Tel_socioec.mod from the case study on logit models
(Chapter 5). The procedure used for forecasting market shares is the
following:
- Estimate the model with Biogeme.
- Compute predicted probabilities with BioSim. See Section 2.7 on page 34
  for instructions on how to use BioSim.
- Excel can be used for editing and processing the data and probabilities.
  For example, you can open the data file with Excel and paste the
  probabilities given in the BioSim result file into the Excel file.
We have provided an Excel file (telephone.xls) containing the observations
and their corresponding probabilities. This file has also been used for
computing market shares by averaging the alternative probabilities over each
market segment.
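The averaging step can also be reproduced outside Excel. The following is a minimal Python sketch, assuming each observation has already been paired with its income code (Inc) and its BioSim probabilities. The three sample rows and the names `segment_of` and `market_shares` are made up for illustration and are not taken from telephone.xls.

```python
from collections import defaultdict

ALTERNATIVES = ["BM", "SM", "LF", "EF", "MF"]


def segment_of(inc):
    """Map the income code Inc (1 to 5) to the segments defined above."""
    if inc in (1, 2):
        return "low"
    if inc in (3, 4):
        return "medium"
    return "high"


def market_shares(rows):
    """Average the predicted probabilities over each income segment.

    rows: iterable of (inc, probs), with one probability per alternative.
    Returns {segment: {alternative: share in percent}}.
    """
    sums = defaultdict(lambda: [0.0] * len(ALTERNATIVES))
    counts = defaultdict(int)
    for inc, probs in rows:
        seg = segment_of(inc)
        counts[seg] += 1
        for k, p in enumerate(probs):
            sums[seg][k] += p
    return {
        seg: {alt: 100.0 * sums[seg][k] / counts[seg]
              for k, alt in enumerate(ALTERNATIVES)}
        for seg in sums
    }


# Hypothetical observations (illustrative values only, not real data).
rows = [
    (1, [0.20, 0.30, 0.40, 0.00, 0.10]),
    (2, [0.10, 0.40, 0.30, 0.10, 0.10]),
    (5, [0.10, 0.20, 0.40, 0.10, 0.20]),
]
shares = market_shares(rows)
for seg, by_alt in shares.items():
    print(seg, {alt: round(s, 1) for alt, s in by_alt.items()})
```

Running the same function on the probabilities simulated for a policy scenario gives the forecast shares for that scenario.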
Assume that the telephone company, in an effort to increase revenues,
considers raising the fixed costs for alternatives SM, LF, EF and MF by $4,
$6, $7 and $11, respectively. We would like to forecast the market shares
after this change. We modify the file MNL_Tel_socioec_res.mod to take into
account the cost policy in the following way:
[Expressions]
logcost1 = log(cost1 )
logcost2 = log(cost2 + 4 )
logcost3 = log(cost3 + 6 )
logcost4 = log(cost4 + 7 )
logcost5 = log(cost5 + 11 )
We name this file MNL_Tel_socioec_res2.mod, and it is provided with this case
study. We simulate again using BioSim in order to obtain the alternative
probabilities under this new scenario. The probabilities are integrated in
the Excel file (telephone.xls), and the market shares can be computed in the
same way as for the base case. The results for the base case and the new cost
scenario are given in Table 7.2. The cost change does not result in important
changes for the EF and MF alternatives. There is, however, an important
shift towards the BM alternative in all market segments.
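The size of these shifts can be read directly off Table 7.2. As a small worked example, the Python sketch below copies the table values and computes the change in percentage points for each alternative and segment; the variable names are arbitrary.

```python
# Market shares (percent) copied from Table 7.2, ordered Low, Med, Hi income.
base = {"BM": (19, 14, 13), "SM": (30, 28, 23), "LF": (40, 43, 41),
        "EF": (0, 1, 2), "MF": (11, 14, 21)}
forecast = {"BM": (34, 26, 23), "SM": (22, 21, 18), "LF": (34, 39, 37),
            "EF": (0, 1, 2), "MF": (10, 13, 19)}

# Change in percentage points (forecast minus base case).
change = {alt: tuple(f - b for b, f in zip(base[alt], forecast[alt]))
          for alt in base}

for alt, delta in change.items():
    print(alt, delta)
```

BM gains between 10 and 15 points depending on the segment, mostly at the expense of SM and LF, while EF is unchanged and MF loses at most 2 points.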
Figure 7.2 shows the market shares of the standard measured (SM) alternative
for the low and high income segments as a function of changes in SM cost.
We can see that the sensitivity to cost is about the same for the two market
segments. The SM alternative has, however, a higher market share for the low
income group than for the high income group.
It would also be interesting to investigate the impact on the market shares
for the following two policy scenarios:
- Due to legal restrictions, the telephone company is expected to subsidize
  the telephone costs of elderly households (a household with at least
  1 household member older than 65 years) and low-income households
  (a household with annual household income less than $20,000). The
  telephone company must provide a 50% discount on these households'
  telephone costs. To compensate for these losses in revenue, the
  company considers increasing the telephone costs of all other households
  uniformly by 10%.
- Due to recession, the number of employed persons per household has
  been reduced to half of that in the previous scenario, and the telephone
  company has decided to provide a 20% discount for households that have no
  employed persons. To compensate for these losses in revenue, the
  company considers increasing the telephone costs of households with
  at least one employed person by 10%.

         Base case            Forecast
         Low    Med    Hi     Low    Med    Hi
         INC    INC    INC    INC    INC    INC
BM       19     14     13     34     26     23
SM       30     28     23     22     21     18
LF       40     43     41     34     39     37
EF       0      1      2      0      1      2
MF       11     14     21     10     13     19
Table 7.2: Market Shares (percent)

[Line plot. Vertical axis: Market Share (%), 15 to 45. Horizontal axis:
Changes in SM Cost (-20%, -10%, base, +10%, +20%). Series: High income,
Low income.]
Figure 7.2: Market Shares for Low and High Income Segments, SM alternative
7.4 Airline Itinerary Case
Forecasting the Effect of Change in the Cost of the Non-stop Itinerary
Files to use with Biogeme:
Model files: MNL_airline.mod
MNL_airline_res.mod
MNL_airline_res2.mod
Data file: airline.dat
Excel worksheet: airline.xls
In this case study, we are interested in forecasting the effects of changes
in the fare of the non-stop airline itinerary for different market segments,
i.e. individuals who pay for their trips and individuals whose airplane ticket
is paid by a third party. We assume that there is evidence for market
segmentation between these two groups. Precisely, the segments are defined
as follows:
- Traveler pays: category "traveler is paying for the trip"
  (q03_WhoPays = 1)
- Third party pays: categories "employer pays" (q03_WhoPays = 2) and
  "third party pays" (q03_WhoPays = 3)
The base model we are using here is MNL_airline.mod. The procedure used to
forecast the market shares of the different airline itineraries is the
following:
- Estimate the model with Biogeme.
- Compute the predicted probabilities with BioSim. See Section 2.7 on
  page 34 for instructions on how to use BioSim.
- Excel can be used for editing and processing the data and probabilities.
  For example, you can open the data file with Excel and paste the
  probabilities given in the BioSim result file into the Excel file.
An Excel file, airline.xls, which contains the observations and their
corresponding probabilities, is provided. In this file you can also find the
market shares for each alternative, which were obtained by averaging the
probabilities of the alternative over each market segment.
We would like to investigate the influence of a change in the non-stop
itinerary fare on the market shares of the three alternatives. For example,
we increase the fare of the non-stop itinerary by 20% and observe the
subsequent changes in the market shares.
From the estimation procedure of model MNL_airline.mod we obtained the file
MNL_airline.res. This file has been renamed MNL_airline_res.mod and is
also provided in the folder that contains the files for this case study.
We now modify it in order to take into account the change of fare in the
non-stop itinerary. This is performed in the section called [Expressions] as
follows:
[Expressions]
HighFare_1 = 1.2 * Fare_1
The modified file is called MNL_airline_res2.mod and is also provided with
this case study. We perform a new simulation with BioSim in order to obtain
the probabilities of the different alternatives for this scenario. The
probabilities have been included in the Excel file (airline.xls) and the
market shares are computed in the same way as for the base case. The results
for the base case and the new cost scenario are reported in Table 7.3.
An important decrease in the market share of the non-stop itinerary can be
noticed for both market segments. This shows that individuals are sensitive
to variations in the fare of the direct flight.
Figure 7.3 shows the evolution of the market share of the non-stop flight
itinerary for the market segments of individuals who pay for their trips and
individuals who do not, with respect to several changes in the non-stop flight
fare. As expected, we notice that individuals who pay for their flight are
slightly more sensitive to changes in the airplane ticket price for the
non-stop alternative. This result shows that the variable indicating who pays
for the trip could be included as an explanatory variable in the model.
        Base case                    Forecast
        Traveler    Third party     Traveler    Third party
        pays        pays            pays        pays
Opt1    69.4        69.5            43.9        42.9
Opt2    16.4        15.9            29.9        30.1
Opt3    14.2        14.6            26.1        27.1
Table 7.3: Market Shares (percent) for an increased cost of the non-stop
itinerary

[Line plot. Vertical axis: Market Share (%), 40 to 80. Horizontal axis:
Changes in Cost of Non-stop Alternative (-20%, -10%, base, +10%, +20%).
Series: Traveler pays, Third party pays.]
Figure 7.3: Market Shares of the Non-stop Itinerary for the Traveler pays
and Third party pays Segments
Chapter 8
Multivariate (Generalized) Extreme Value Models
The topic of this case study is the specification and estimation of
Multivariate (Generalized) Extreme Value (MEV) models. Different
specifications are introduced using a stepwise modeling strategy, increasing
the complexity at each step. The objectives of this case study can be
summarized as follows:
- Specification and estimation of Nested Logit (NL) models.
- Testing of the nesting parameters.
- Estimation of Cross-Nested Logit (CNL) models with fixed alpha parameters.
- Estimation of CNL models with unknown alpha parameters.
For this case study, you can choose between the Swissmetro and Residential
Telephone Services datasets. A detailed description of each dataset can be
found in Appendix A.
We focus here on the correlation among alternatives and different ways to
include this correlation in the model structure. We iteratively test different
types of nesting structures for the Nested and Cross-Nested Logit models.
The examples of model specifications that we have provided can be found in
the following sections: Swissmetro in Section 8.2 on page 150 and Residential
Telephone Services in Section 8.3 on page 158.
8.1 Challenge Question
The Swissmetro dataset. Innovation in the market for intercity passenger
transportation is a difficult enterprise, as the existing modes (private car,
coach, rail, as well as regional and long-distance air services) continue to
innovate in their own right by offering new combinations of speeds, services,
prices and technologies. Consider for example high-speed rail links between
the major centers or direct regional jet services between smaller countries.
The Swissmetro SA in Geneva is promoting such an innovation: a mag-lev
underground system operating at speeds up to 500 km/h in partial vacuum
connecting the major Swiss conurbations, in particular along the Mittelland
corridor (St. Gallen, Zurich, Bern, Lausanne and Geneva).
The dataset consists of survey data collected on the trains between St. Gallen
and Geneva, Switzerland, during March 1998. The interviewed respondents
provided information in order to analyze the impact of the modal innovation
in transportation, represented by the Swissmetro, a revolutionary mag-lev
underground system, against the usual transport modes represented by car
and train. The Swissmetro is a true innovation. It is therefore not
appropriate to base forecasts of its impact on observations of existing
revealed preference (RP) data. As a consequence, a stated preference (SP)
survey was conducted, which yielded 6759 usable observations.
Data description. Please read Appendix A.3 of the workbook for details.
Estimation of a Nested Logit Model
Files to use with Biogeme:
Model file: GEV_SM_NL_Challenge.mod
Data file: swissmetro.dat
We hypothesize that the public transportation alternatives share
unobservable factors. We want our model to incorporate the potential
correlation pattern between the unobservable parts of the Swissmetro and train
alternatives. We group them inside the Public nest. The Car alternative
remains alone in the Private nest.
[Tree diagram: the Private nest contains Car; the Public nest contains Train
and SM.]
Figure 8.1: The correlation structure of the specified NL model
The model structure is shown in Figure 8.1.
The model file used by Biogeme is shown in Figure 8.2.
When we ran this model in Biogeme, we obtained the results shown in
Table 8.1.
Questions: Can we use this model? Motivate your answer.
[Choice]
CHOICE
[Beta]
// Name Value LowerBound UpperBound status (0=variable, 1=fixed)
ASC_CAR 0 -1000 1000 0
ASC_SBB 0 -1000 1000 0
ASC_SM 0 -1000 1000 0
B_COST 0 -1000 1000 0
B_CAR_TIME 0 -1000 1000 0
B_TRAIN_TIME 0 -1000 1000 0
B_SM_TIME 0 -1000 1000 0
B_HE 0 -1000 1000 0
B_GA 0 -1000 1000 0
[Utilities]
// Id Name Avail linear-in-parameter expression (beta1*x1 + beta2*x2 + ... )
1 SBB_SP TRAIN_AV_SP B_TRAIN_TIME * TRAIN_TT + B_COST * TRAIN_CO + B_HE * TRAIN_HE
+ B_GA * GA
2 SM_SP SM_AV ASC_SM * one + B_SM_TIME * SM_TT + B_COST * SM_CO + B_HE * SM_HE
+ B_GA * GA
3 Car_SP CAR_AV_SP ASC_CAR * one + B_CAR_TIME * CAR_TT + B_COST * CAR_CO
[Model]
$NL
[NLNests]
// Name paramvalue LowerBound UpperBound status list of alt
public 1.0 1 10 0 1 2
private 1.0 1 10 1 3
[Expressions]
one = 1
Figure 8.2: Swissmetro NL specification for Biogeme

NL Model Estimation Results
Variable  Variable          Coefficient  Robust     Robust     Robust
number    name              estimate     std error  t-stat. 0  t-stat. 1
1         ASC_CAR            0.256       0.163       1.57
2         ASC_SM             0.434       0.129       3.37
3         B_CAR_TIME        -0.0104      0.00111    -9.30
4         B_COST            -0.00124     0.000178   -6.95
5         B_GA               7.18        0.976       7.35
6         B_HE              -0.00541     0.00108    -5.01
7         B_SM_TIME         -0.0110      0.00187    -5.87
8         B_TRAIN_TIME      -0.0120      0.00179    -6.69
9         $\mu_{private}$    1.0
10        $\mu_{public}$     1.14        0.160       7.10       0.87
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5244.668$
$\bar{\rho}^2 = 0.245$
Table 8.1: Estimation results for the Swissmetro Nested Logit model
8.2 Swissmetro Case
Estimation of a Nested Logit Model
Files to use with Biogeme:
Model file: GEV_SM_NL.mod
Data file: swissmetro.dat
The application of the McFadden IIA test in the case study on specification
testing revealed that the IIA assumption does not hold between the car and
train alternatives. This is an indication of probable correlation between car
and train. We start with a Nested Logit (NL) specification, where the car and
train alternatives are both assigned to the same nest and the Swissmetro is
alone in a second nest, as shown in Figure 8.4. See Chapter 10 in Ben-Akiva
and Lerman (1985) for details on the NL model.
The expressions of the systematic utility functions for each alternative used
in this model specification are
$$
\begin{aligned}
V_{car} &= ASC_{car} + \beta_{CAR\_time}\, CAR\_TT + \beta_{cost}\, CAR\_CO \\
V_{train} &= \beta_{TRAIN\_time}\, TRAIN\_TT + \beta_{cost}\, TRAIN\_CO + \beta_{he}\, TRAIN\_HE + \beta_{GA}\, GA \\
V_{sm} &= ASC_{SM} + \beta_{SM\_time}\, SM\_TT + \beta_{cost}\, SM\_CO + \beta_{he}\, SM\_HE + \beta_{GA}\, GA,
\end{aligned}
$$
and Figure 8.3 shows an extract from the .mod file illustrating the nest
specification with Biogeme. Note that only one of the two nest parameters
can be estimated. The estimation results are shown in Table 8.2.
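To make the role of the nest parameter concrete, here is a minimal Python sketch of nested logit choice probabilities under the normalization used in this chapter ($\mu = 1$, one parameter $\mu_m \geq 1$ per nest). The function name `nested_logit_probs`, its argument layout and the utility values are assumptions chosen for illustration; they are neither Biogeme code nor estimation output. Only the value 1.64 is taken from Table 8.2.

```python
import math


def nested_logit_probs(V, nests):
    """Nested logit probabilities with the top-level scale mu normalized to 1.

    V:     {alternative: systematic utility}
    nests: {nest name: (mu_m, [alternatives in the nest])}
    """
    # Logsum (expected maximum utility) of each nest.
    logsum = {
        m: math.log(sum(math.exp(mu_m * V[j]) for j in alts)) / mu_m
        for m, (mu_m, alts) in nests.items()
    }
    denom = sum(math.exp(ls) for ls in logsum.values())
    probs = {}
    for m, (mu_m, alts) in nests.items():
        p_nest = math.exp(logsum[m]) / denom                 # P(nest m)
        within = sum(math.exp(mu_m * V[j]) for j in alts)
        for i in alts:
            # P(i) = P(nest m) * P(i | nest m)
            probs[i] = p_nest * math.exp(mu_m * V[i]) / within
    return probs


# Hypothetical utilities; nest layout as in Figure 8.4.
V = {"car": 0.1, "train": -0.3, "sm": 0.4}
nl = nested_logit_probs(V, {"classic": (1.64, ["car", "train"]),
                            "innovative": (1.0, ["sm"])})
mnl = nested_logit_probs(V, {"classic": (1.0, ["car", "train"]),
                             "innovative": (1.0, ["sm"])})
print({k: round(p, 3) for k, p in nl.items()})
print({k: round(p, 3) for k, p in mnl.items()})
```

With every $\mu_m$ equal to 1 the function reproduces the logit probabilities, which is the restriction tested when the nested logit model is compared with the logit model.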
The alternative-specific constants show a preference for the Swissmetro
alternative compared to the other modes, all else being equal. The
cost and travel time coefficients have the expected negative sign. The
coefficient related to the ownership of the Swiss annual season ticket (GA) is
positive as expected, reflecting the preference for the SM and train
alternatives with respect to the car alternative. The negative estimated value
of the headway parameter $\beta_{he}$ indicates that the higher the headway,
the lower the frequency of service, and thus the lower the utility. Finally,
the scale parameter of the random term associated with the classic nest has
been estimated as $\mu_{classic} = 1.64$.
[NLNests]
// Name paramvalue LowerBound UpperBound status list of alt
Classic 1.0 1 10 0 1 3
Innovative 1.0 1 10 1 2
Figure 8.3: Biogeme snapshot
[Tree diagram: the Innovative nest contains SM; the Classic nest contains Car
and Train.]
Figure 8.4: The correlation structure of the specified NL model
To be consistent with random utility theory, the inequality
$\mu / \mu_m < 1$, with $\mu$ being normalized to 1, implies $\mu_m > 1$. To
see if this is the case here, we can test the null hypothesis
$H_0: \mu_m = 1$. Since there is a single restriction, we can use either a
t-test or a likelihood ratio test, which are asymptotically equivalent. The
t-statistic with respect to 1 can be computed as follows:
$$t = \frac{\hat{\mu}_m - 1}{\text{std err of } \hat{\mu}_m}.$$
It is also output by Biogeme. Here the t-statistic with respect
to 1 is 4.86, which indicates that $\mu_{classic}$ is significantly different
from 1, and hence there is a significant correlation between the car and
train alternatives.
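As a quick check of the t-statistic formula, the Python sketch below applies it to the rounded estimate and robust standard error of the classic nest parameter reported in Table 8.2. The helper name `t_stat` is arbitrary, and 1.96 is the usual two-sided critical value at the 5% level, which this section does not state explicitly.

```python
def t_stat(estimate, std_err, null_value):
    """t-statistic for the null hypothesis: parameter = null_value."""
    return (estimate - null_value) / std_err


# Rounded values from Table 8.2 for the classic nest parameter.
t1 = t_stat(1.64, 0.132, 1.0)
reject = abs(t1) > 1.96  # two-sided test at the 5% level (assumed threshold)
print(round(t1, 2), reject)
```

The result is about 4.85 rather than the 4.86 reported by Biogeme, presumably because the table values are rounded; the conclusion is the same.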
We can also do a likelihood ratio test as follows. The test statistic for the
null hypothesis is given by
$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-5245.550 + 5207.794) = 75.512$$
where the restricted model is the logit model (SpecTest_SM_socioec_bis.mod)
and the unrestricted model is the nested logit model. The test statistic
is asymptotically $\chi^2$ distributed with 1 degree of freedom since there is
1 restriction. Since 75.512 > 3.841 (the critical value of the $\chi^2$
distribution with 1 degree of freedom at a 95% level of confidence), we reject
the null hypothesis (logit model) and accept the nested logit model.

NL model
Parameter  Parameter             Parameter  Robust          Robust     Robust
number     name                  estimate   standard error  t-stat. 0  t-stat. 1
1          $ASC_{car}$            0.0272    0.119             0.23
2          $ASC_{SM}$             0.243     0.119             2.05
3          $\beta_{cost}$        -0.000986  0.000105         -9.36
4          $\beta_{car\_time}$   -0.00874   0.00101          -8.64
5          $\beta_{train\_time}$ -0.0113    0.000958        -11.77
6          $\beta_{SM\_time}$    -0.00995   0.00163          -6.09
7          $\beta_{he}$          -0.00472   0.000862         -5.48
8          $\beta_{ga}$           5.39      0.582             9.26
9          $\mu_{classic}$        1.64      0.132            12.42      4.86
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5207.794$
$\bar{\rho}^2 = 0.250$
Table 8.2: NL estimation results

[Diagram: the Rail-Based nest contains SM and Train; the Classic nest
contains Train and Car. Train is linked to both nests.]
Figure 8.5: A representative scheme for the CNL correlation structure.
Estimation of a Cross-Nested Logit Model with Fixed Alphas
Files to use with Biogeme:
Model file: GEV_SM_CNL_fix.mod
Data file: swissmetro.dat
In this model, we relax the assumption that an alternative can belong to
only one nest and we assume that the train alternative can be assigned to
two different nests. This correlation structure is motivated by considering
the train alternative as a classic transportation mode (along with the car
against the more innovative Swissmetro) on one hand, and as a rail-based
mode (as the Swissmetro) on the other hand. We represent this cross-nested
structure in Figure 8.5. See Abbe et al. (2007) for a detailed description of
the Cross-Nested Logit (CNL) model.
In Figure 8.6 we show a snapshot from the Biogeme .mod file illustrating the
CNL nest specification. The estimation results are shown in Table 8.3. The
alternative-specific constants now have a negative sign. All other
coefficients have the expected signs.
[CNLNests]
// Name paramvalue LowerBound UpperBound status
classic 1.0 1 10 0
Rail_based 1.0 1 10 0
[CNLAlpha]
// Alt Nest value LowerBound UpperBound status
Car classic 1 0.00001 1.0 1
Train classic 0.5 0.00001 1.0 1
Train Rail_based 0.5 0.00001 1.0 1
SM Rail_based 1 0.00001 1.0 1
Figure 8.6: Biogeme snapshot
In this CNL specification, we have fixed the $\alpha_{train\_classic}$ and
$\alpha_{train\_rail}$ coefficients to 0.5. It means that we assume that the
train alternative belongs equally to both nests, classic and rail-based. This
assumption will be relaxed in the next section. Thus, CNL with fixed
$\alpha$s is a restricted model of CNL with variable $\alpha$s.
Estimation of a Cross-Nested Logit Model with Unknown Alphas
Files to use with Biogeme:
Model file: GEV_SM_CNL_var.mod
Data file: swissmetro.dat
In Table 8.4, we show the results for the CNL specification with variable
$\alpha$ coefficients. We also want to underline the fact that in both CNL
specifications the condition
$$\sum_m \alpha_{jm} = 1$$
has been imposed. Such a condition is not necessary for the validity of the
model. It is imposed for identification purposes. We refer the interested
reader to Abbe et al. (2007) for more theoretical details.
CNL model with fixed $\alpha$s
Parameter  Parameter             Parameter  Robust          Robust     Robust
number     name                  estimate   standard error  t-stat. 0  t-stat. 1
1          $ASC_{car}$           -0.838     0.0787          -10.65
2          $ASC_{SM}$            -0.457     0.0744           -6.15
3          $\beta_{cost}$        -0.00705   0.000526        -13.39
4          $\beta_{car\_time}$   -0.00628   0.00122          -5.17
5          $\beta_{train\_time}$ -0.00863   0.00105          -8.18
6          $\beta_{SM\_time}$    -0.00715   0.00151          -4.74
7          $\beta_{he}$          -0.00298   0.000533         -5.58
8          $\beta_{ga}$           0.618     0.0940            6.57
9          $\mu_{classic}$        2.85      0.260            10.93      7.09
10         $\mu_{rail\_based}$    4.73      0.483             9.78      7.71
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5120.738$
$\bar{\rho}^2 = 0.263$
Table 8.3: Estimation results for the CNL specification. The $\alpha$
coefficients are fixed.
CNL model with unknown $\alpha$s
Parameter  Parameter                 Parameter
number     name                      estimate   standard error  t-stat. 0  t-stat. 1
1          $ASC_{car}$               -0.849     0.0692          -12.26
2          $ASC_{SM}$                -0.460     0.0656           -7.01
3          $\beta_{cost}$            -0.00697   0.000440        -15.85
4          $\beta_{car\_time}$       -0.00621   0.000583        -10.66
5          $\beta_{train\_time}$     -0.00849   0.000660        -12.85
6          $\beta_{SM\_time}$        -0.00711   0.000745         -9.54
7          $\beta_{he}$              -0.00293   0.000510         -5.75
8          $\beta_{ga}$               0.620     0.0886            7.00
9          $\mu_{classic}$            2.87      0.212            13.54       8.82
10         $\mu_{rail\_based}$        4.90      0.722             6.78       5.40
11         $\alpha_{train\_classic}$  0.486     0.0265           18.35     -19.40
12         $\alpha_{train\_rail}$     0.514     0.0265           19.40     -18.35
Summary statistics
Number of observations = 6759
$\mathcal{L}(0) = -6958.425$
$\mathcal{L}(\hat{\beta}) = -5120.608$
$\bar{\rho}^2 = 0.262$
Table 8.4: Estimation results for the CNL specification. The $\alpha$
coefficients are estimated.
To select between the nested logit and CNL model with variable $\alpha$s, we
can test the null hypothesis
$H_0: \alpha_{train\_rail} = 0,\ \mu_{rail\_based} = 1$. Since there are
multiple restrictions, we cannot use multiple t-tests but should rather use a
likelihood ratio test as follows. The test statistic for the null hypothesis
is given by
$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-5207.794 + 5120.608) = 174.372$$
where the restricted model is the nested logit model and the unrestricted
model is the CNL model with variable $\alpha$s. The test statistic is
asymptotically $\chi^2$ distributed with 2 degrees of freedom since there are
2 restrictions. Since 174.372 > 5.991 (the critical value of the $\chi^2$
distribution with 2 degrees of freedom at a 95% level of confidence), we
reject the null hypothesis (nested logit model) and accept the CNL model with
variable $\alpha$s. We can thus conclude that the train alternative is
correlated with both Swissmetro and car alternatives.
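The likelihood ratio tests in this chapter all follow the same pattern, which can be sketched in a few lines of Python. The function name `likelihood_ratio_test` is an assumption for illustration; the critical values are the ones quoted in the text, and the log-likelihoods are the final values reported in Tables 8.2, 8.3 and 8.4.

```python
# Critical values of the chi-square distribution at the 95% level,
# as quoted in the text for 1 and 2 degrees of freedom.
CHI2_CRIT_95 = {1: 3.841, 2: 5.991}


def likelihood_ratio_test(ll_restricted, ll_unrestricted, n_restrictions):
    """Return (statistic, reject_null) for the statistic -2 (L_R - L_U)."""
    stat = -2.0 * (ll_restricted - ll_unrestricted)
    return stat, stat > CHI2_CRIT_95[n_restrictions]


# NL (Table 8.2) against CNL with estimated alphas (Table 8.4): 2 restrictions.
stat_nl, reject_nl = likelihood_ratio_test(-5207.794, -5120.608, 2)

# CNL with fixed alphas (Table 8.3) against CNL with estimated alphas: 1 restriction.
stat_fix, reject_fix = likelihood_ratio_test(-5120.738, -5120.608, 1)

print(round(stat_nl, 3), reject_nl)
print(round(stat_fix, 3), reject_fix)
```

The lookup table only covers the two cases used in this chapter; for other numbers of restrictions the critical value would have to be taken from a chi-square table.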
To select between the CNL model with fixed $\alpha$s and the CNL model with
variable $\alpha$s, we can test the null hypothesis
$H_0: \alpha_{train\_rail} = 0.5$. Since there is a single restriction, we can
use either a t-test or a likelihood ratio test, which are asymptotically
equivalent. The t-statistic with respect to 0.5 is 0.53, which indicates that
$\alpha_{train\_rail}$ is not significantly different from 0.5, and hence we
accept the null hypothesis (CNL model with fixed $\alpha$s) and reject the
CNL model with variable $\alpha$s.
We can also do a likelihood ratio test as follows. The test statistic for the
null hypothesis is given by
$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-5120.738 + 5120.608) = 0.260$$
where the restricted model is the CNL model with fixed $\alpha$s and the
unrestricted model is the CNL model with variable $\alpha$s. The test
statistic is asymptotically $\chi^2$ distributed with 1 degree of freedom
since there is 1 restriction. Since 0.260 < 3.841 (the critical value of the
$\chi^2$ distribution with 1 degree of freedom at a 95% level of confidence),
we accept the null hypothesis (CNL model with fixed $\alpha$s) and reject the
CNL model with variable $\alpha$s.
As a conclusion, since both the nested logit model and the CNL model with
fixed $\alpha$s are restricted models of the CNL model with variable
$\alpha$s, and since we have rejected the nested logit model and accepted the
CNL model with fixed $\alpha$s, we select the CNL model with fixed $\alpha$s.
8.3 Choice of Residential Telephone Services Case
Estimation of a Nested Logit Model
Files to use with Biogeme:
Model file: GEV_Tel_NL_unrestricted.mod
Data file: telephone.dat
The application of the McFadden IIA test in the case study on specification
testing revealed that the IIA assumption does not hold between the SM and
BM alternatives and does not hold among the EF, LF, and MF alternatives
either. We start by giving some examples of possible nesting structures for
the Nested Logit (NL) model in Figure 8.7. See Chapter 10 in Ben-Akiva
and Lerman (1985) for details on the NL model.
The sample model file describes the first nesting structure shown in
Figure 8.7. The expressions of the utilities for this simple NL model are
$$
\begin{aligned}
V_{BM} &= ASC_{BM} + \beta_{cost} \ln(cost_{BM}) \\
V_{SM} &= \beta_{cost} \ln(cost_{SM}) \\
V_{LF} &= ASC_{LF} + \beta_{cost} \ln(cost_{LF}) \\
V_{EF} &= ASC_{EF} + \beta_{cost} \ln(cost_{EF}) \\
V_{MF} &= ASC_{MF} + \beta_{cost} \ln(cost_{MF}).
\end{aligned}
$$
We show a snapshot of the Biogeme code in Figure 8.8. In the first column,
we write the name of the nest and in the last column the alternatives that
belong to it. Here the alternative numbers must correspond to those used in
the utility functions under the column ID. The estimation results of the NL
model are shown in Table 8.5.
[Three tree diagrams. First structure: the Measured nest contains BM and SM;
the Flat nest contains LF, EF and MF. Second structure: BM and SM are each
alone in their own nest; the Flat nest contains LF, EF and MF. Third
structure: the Measured nest contains BM and SM; LF, EF and MF are each alone
in their own nest.]
Figure 8.7: The possible nesting structures
[NLNests]
// Name paramvalue LowerBound UpperBound status list of alt
N_MEAS 1.0 1.0 10.0 0 1 2
N_FLAT 1.0 1.0 10.0 0 3 4 5
Figure 8.8: Biogeme snapshot

NL with generic attributes
Parameter  Parameter       Parameter  Robust          Robust     Robust
number     name            estimate   standard error  t stat. 0  t stat. 1
1          $ASC_{BM}$      -0.378     0.117           -3.22
2          $ASC_{LF}$       0.893     0.158            5.64
3          $ASC_{EF}$       0.847     0.391            2.17
4          $ASC_{MF}$       1.41      0.238            5.90
5          $\beta_{cost}$  -1.49      0.243           -6.13
6          $\mu_{meas}$     2.06      0.573            3.60       1.86
7          $\mu_{flat}$     2.29      0.763            3.00       1.69
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -473.219$
$\bar{\rho}^2 = 0.143$
Table 8.5: NL with generic attributes

To be consistent with random utility theory, the inequality
$\mu / \mu_m < 1$, with $\mu$ being normalized to 1, implies $\mu_m > 1$. To
see if this is the case here, we can test the null hypothesis
$H_0: \mu_{meas} = \mu_{flat} = 1$. Since there are multiple restrictions
here, we cannot do multiple t-tests. We should do a likelihood ratio test as
follows. The test statistic for the null hypothesis is given by
$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-477.557 + 473.219) = 8.676$$
where the restricted model is the logit model (MNL_Tel_generic.mod) and the
unrestricted model is the nested logit model. The test statistic is
asymptotically $\chi^2$ distributed with 2 degrees of freedom since there are
2 restrictions. Since 8.676 > 5.991 (the critical value of the $\chi^2$
distribution with 2 degrees of freedom at a 95% level of confidence), we
reject the null hypothesis (logit model) and accept the nested logit model.
The $\mu_m$s of the two nests can be set equal to each other too. This can be
done in two ways. One way is to keep the $\mu_m$s fixed to 1 and estimate
$\mu$ (the related Biogeme code is shown in Figure 8.9).
Alternatively, we can also constrain the two nest coefficients to be equal
while keeping $\mu$ fixed to 1 (Figure 8.10).
The estimation results for this last specification are shown in Table 8.6.
[Mu]
// Value LowerBound UpperBound Status
+1.0000000e+00 +0.0000000e+00 +1.0000000e+00 0
[NLNests]
// Name paramvalue LowerBound UpperBound status list of alt
N_MEAS 1.0 1.0 10.0 1 1 2
N_FLAT 1.0 1.0 10.0 1 3 4 5
Figure 8.9: Biogeme snapshot
[NLNests]
// Name paramvalue LowerBound UpperBound status list of alt
N_MEAS 1.0 1.0 10.0 0 1 2
N_FLAT 1.0 1.0 10.0 0 3 4 5
[ConstraintNestCoef]
// List of pairs of nests for which the associated
// coefficients must be constrained to be equal
// Syntax: COEF_NEST_A = COEF_NEST_B
N_MEAS = N_FLAT
Figure 8.10: Biogeme snapshot
NL with linear constraints
Parameter  Parameter       Parameter
number     name            estimate   standard error  t stat. 0  t stat. 1
1          $ASC_{BM}$      -0.368     0.110           -3.35
2          $ASC_{LF}$       0.882     0.167            5.29
3          $ASC_{EF}$       0.833     0.398            2.09
4          $ASC_{MF}$       1.39      0.251            5.51
5          $\beta_{cost}$  -1.50      0.257           -5.83
6          $\mu_{meas}$     2.16      0.519            4.17       2.24
7          $\mu_{flat}$     2.16      0.519            4.17       2.24
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -473.288$
$\bar{\rho}^2 = 0.143$
Table 8.6: NL with linear constraint on nest parameters
Estimation of a Cross-Nested Logit Model with Fixed Alphas
Files to use with Biogeme:
Model file: GEV_Tel_CNL_fix.mod
Data file: telephone.dat
In this section and the next one, we specify two different Cross-Nested Logit
(CNL) models (see Abbe et al. (2007) for a detailed description of the CNL
model) using both fixed and variable degrees of membership. These
specifications are mainly for demonstration purposes. However, an assumption
that might make sense is that the standard measured alternative (SM) is
likely to be correlated with both measured and flat options. Indeed, if we
look at its definition, it turns out that it may belong to both nests, since
it also has a fixed monthly charge. Based on this hypothesis, the proposed
cross-nested structure is shown in Figure 8.11.
[Diagram: the Measured nest contains BM and SM; the Flat nest contains LF, EF
and MF. SM is also linked to the Flat nest.]
Figure 8.11: The cross-nested structure
We present the CNL model with the same deterministic utility functions as
in the previous model. The corresponding snapshot from the Biogeme code
for this cross-nesting specification is shown in Figure 8.12.
Note that we define $\alpha_{CNL}$ so that the SM alternative belongs equally
to both the flat and the measured nests. This assumption will be relaxed in
the next section. Thus, CNL with fixed $\alpha$s is a restricted model of CNL
with variable $\alpha$s. The estimation results are shown in Table 8.7.
[CNLNests]
// Name paramvalue LowerBound UpperBound status
N_MEAS 1.0 1 10 0
N_FLAT 1.0 1 10 0
[CNLAlpha]
// Alt Nest value LowerBound UpperBound status
BM N_MEAS 1 0 1.0 1
SM N_MEAS 0.5 0 1.0 1
SM N_FLAT 0.5 0 1.0 1
LF N_FLAT 1 0 1.0 1
EF N_FLAT 1 0 1.0 1
MF N_FLAT 1 0 1.0 1
Figure 8.12: Biogeme snapshot

CNL estimation results
Parameter  Parameter       Parameter  Robust          Robust     Robust
number     name            estimate   standard error  t stat. 0  t stat. 1
1          $ASC_{BM}$      -0.791     0.0769          -10.28
2          $ASC_{LF}$       0.460     0.241             1.91
3          $ASC_{EF}$       0.405     0.393             1.03
4          $ASC_{MF}$       0.845     0.329             2.57
5          $\beta_{cost}$  -1.21      0.311            -3.91
6          $\mu_{meas}$     3.14      1.18              2.66      1.81
7          $\mu_{flat}$     2.36      1.14              2.08      1.19
Summary statistics
Number of observations = 434
$\mathcal{L}(0) = -560.250$
$\mathcal{L}(\hat{\beta}) = -474.429$
$\bar{\rho}^2 = 0.141$
Table 8.7: CNL estimation results

Cross-Nested Logit Model with Variable Alphas
Files to use with Biogeme:
Model file: GEV_Tel_CNL_var.mod
Data file: telephone.dat
In the previous CNL model, we assumed that the SM alternative belongs
equally to the measured nest and the flat nest by fixing $\alpha_{SM\_meas}$
and $\alpha_{SM\_flat}$ to be equal to 0.5. This assumption can be relaxed,
and we can estimate the share of SM in each nest during the estimation of the
model parameters. The corresponding Biogeme snapshot is shown in Figure 8.13.
From the results presented in Table 8.8, we see that the alternative SM has a
very small share in the flat nest.
We also want to underline the fact that in both CNL specifications the
condition
$$\sum_m \alpha_{jm} = 1$$
has been imposed. Such a condition is not necessary for the validity of the
model. It is imposed for identification purposes. We refer the interested
reader to Abbe et al. (2007) for more theoretical details.
To select between the nested logit and CNL model with variable $\alpha$s, we
can test the null hypothesis $H_0: \alpha_{SM\_flat} = 0$. Since there is a
single restriction, we can use either a t-test or a likelihood ratio test,
which are asymptotically equivalent. The t-statistic with respect to 0 is
0.00, which indicates that $\alpha_{SM\_flat}$ is not significantly different
from 0, and hence we accept the null hypothesis (nested logit model) and
reject the CNL model with variable $\alpha$s.
We can also do a likelihood ratio test as follows. The test statistic for the
null hypothesis is given by
$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-473.219 + 473.219) = 0.000$$
where the restricted model is the nested logit model and the unrestricted
model is the CNL model. The test statistic is asymptotically $\chi^2$
distributed with 1 degree of freedom since there is 1 restriction. Since
0.000 < 3.841 (the critical value of the $\chi^2$ distribution with 1 degree
of freedom at a 95% level of confidence), we accept the null hypothesis
(nested logit model) and reject the CNL model with variable $\alpha$s. We can
thus conclude that the SM alternative is correlated only with the measured
nest but not with the flat nest.
To select between the CNL model with fixed $\alpha$s and the CNL model with
variable $\alpha$s, we can test the null hypothesis
$H_0: \alpha_{SM,flat} = 0.5$. Since there is a single restriction, we can
use either a t-test or a likelihood ratio test, which are asymptotically
equivalent. The t-statistic with respect to 0.5 is -0.58, which indicates
that $\alpha_{SM,flat}$ is not significantly different from 0.5. Hence we
cannot reject the null hypothesis (CNL model with fixed $\alpha$s), and we
prefer it to the CNL model with variable $\alpha$s.
We can also do a likelihood ratio test as follows. The test statistic for the
null hypothesis is given by
$-2(L_R - L_U) = -2(-474.429 + 473.219) = 2.420$
where the restricted model is the CNL model with fixed $\alpha$s and the
unrestricted model is the CNL model with variable $\alpha$s. The test
statistic is
[CNLNests]
// Name paramvalue LowerBound UpperBound status
N_MEAS 1.0 1 10 0
N_FLAT 1.0 1 10 0
[CNLAlpha]
// Alt Nest value LowerBound UpperBound status
BM N_MEAS 1 0 1.0 1
SM N_MEAS 0.5 0 1.0 0
SM N_FLAT 0.5 0 1.0 0
LF N_FLAT 1 0 1.0 1
EF N_FLAT 1 0 1.0 1
MF N_FLAT 1 0 1.0 1
Figure 8.13: Biogeme snapshot
asymptotically $\chi^2$ distributed with 1 degree of freedom since there is
1 restriction. Since 2.420 < 3.841 (the critical value of the $\chi^2$
distribution with 1 degree of freedom at the 95% level of confidence), we
cannot reject the null hypothesis (CNL model with fixed $\alpha$s), and we
prefer it to the CNL model with variable $\alpha$s.
Since both the nested logit model and the CNL model with fixed $\alpha$s are
preferred to the unrestricted model (CNL model with variable $\alpha$s), we
select the nested logit model because it has a higher $\bar{\rho}^2$ than the
CNL model with fixed $\alpha$s (0.143 vs. 0.141).
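The $\bar{\rho}^2$ values used in this comparison can be checked by hand. The sketch below assumes the usual adjusted definition $\bar{\rho}^2 = 1 - (L(\hat{\beta}) - K)/L(0)$, where $K$ is the number of estimated parameters; this formula is an assumption, although it reproduces the values reported in Tables 8.7 and 8.8. The function name is hypothetical.

```python
def adjusted_rho_squared(loglik_final, loglik_null, n_params):
    """Adjusted rho-squared: 1 - (L(beta_hat) - K) / L(0).

    Both log-likelihoods are negative numbers; n_params is K.
    """
    return 1.0 - (loglik_final - n_params) / loglik_null


# CNL with fixed alphas (Table 8.7): 7 estimated parameters.
rho_cnl_fixed = adjusted_rho_squared(-474.429, -560.250, 7)
# CNL with variable alphas (Table 8.8): 9 estimated parameters.
rho_cnl_variable = adjusted_rho_squared(-473.219, -560.250, 9)

print(f"CNL, fixed alphas:    {rho_cnl_fixed:.3f}")
print(f"CNL, variable alphas: {rho_cnl_variable:.3f}")
```

Because the measure penalizes the number of parameters, the model with two extra $\alpha$ parameters ends up with a lower value despite its slightly higher log-likelihood.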
CNL with variable $\alpha$
Parameter  Parameter            Parameter  Standard  t stat. 0  t stat. 1
number     name                 estimate   error
1          $ASC_{BM}$           -0.378     1.07      -0.35
2          $ASC_{EF}$           0.847      1.13      0.75
3          $ASC_{LF}$           0.893      1.08      0.83
4          $ASC_{MF}$           1.41       1.09      1.28
5          $\beta_{cost}$       -1.49      0.257     -5.80
6          $\mu_{flat}$         2.29       0.640     3.58       2.02
7          $\mu_{meas}$         2.06       0.575     3.59       1.85
8          $\alpha_{SM,flat}$   9.40e-005  1.06      0.00       -0.94
9          $\alpha_{SM,meas}$   1.00       1.06      0.94       0.00
Summary statistics
Number of observations = 434
$L(0) = -560.250$
$L(\hat{\beta}) = -473.219$
$\bar{\rho}^2 = 0.139$
Table 8.8: CNL with variable $\alpha$
Chapter 9
Mixtures of Logit and GEV Models
This case study deals with the specification of mixtures of logit models. The
objectives can be summarized as follows:
- Gaining an overview of the different formulations of mixtures of logit
  and becoming familiar with the concepts of flexible correlation
  structures and taste heterogeneity.
- Specification and estimation of alternative specific variance models.
- Specification and estimation of error component models.
- Specification and estimation of random coefficients models.
- Specification and estimation of mixtures of GEV models.
For this case study, the Swissmetro dataset is considered. Details on the
dataset can be found in the Appendix, section A.3.
The general guidelines presented on page 17 discuss how to go through the
case study.
9.1 Challenge Question
The Airline Itinerary Case. The data come from an Internet choice survey
conducted by the Boeing Company in the Fall of 2004. Boeing was interested
in understanding the sensitivity that air passengers have toward the
attributes of an airline itinerary, such as fare, travel time, transfers,
legroom, and aircraft. The survey was executed on a sample of customers of
an Internet airline booking service. There are 1633 respondents, each
providing one Stated Preference response. Each respondent was faced with
three choice alternatives based on the origin-destination market request
that she entered into the itinerary search engine. The first alternative is
always a non-stop flight, the second a flight with one stop on the same
airline, and the third a flight with one stop and a change of airline.
Data description. Please read Appendix A.5 of the workbook for details.
Files to use with Biogeme:
Model file: Mixture airline.mod
Data file: airline.dat
We propose a specification of a logit model with a random parameter. The
utility functions include the alternative specific attributes for legroom
and for schedule delay (early and late departures). Two attributes capturing
the fare are also included: one for business trips and one for non-business
trips. The travel time parameter is assumed to be randomly distributed over
the population. Constants are included for all alternatives except the first
one, which has arbitrarily been chosen as the reference. Figure 9.1 gives a
suggested Biogeme specification of the model.
Question: Does this model make sense to you? What results do you expect
when you try to estimate this model?
The results estimated by Biogeme are given in Table 9.1. Do they correspond
to your expectations?
[Choice]
SP1_MostAttractive
[Beta]
// Name Value LowerBound UpperBound status
ASC_1 0 -10000 10000 1
ASC_2 0 -10000 10000 0
ASC_3 0 -10000 10000 0
BETA_LogFare_Business 0 -10000 10000 0
BETA_LogFare_NonBusiness 0 -10000 10000 0
BETA_TotalTripTime 0 -10000 10000 0
BETA_TotalTripTime_std 0 -10000 10000 0
BETA_Legroom 0 -10000 10000 0
BETA_SchedDelayEarly 0 -10000 10000 0
BETA_SchedDelayLate 0 -10000 10000 0
[Utilities]
// Id Name Avail linear-in-parameter expression (beta1*x1 + beta2*x2 + ... )
1 Opt1 one ASC_1 * one + BETA_LogFare_Business * Opt1LogFare_Business
+ BETA_LogFare_NonBusiness * Opt1LogFare_NonBusiness
+ BETA_TotalTripTime [ BETA_TotalTripTime_std ] * Opt1_TotalTriptime
+ BETA_Legroom * Opt1_Legroom
+ BETA_SchedDelayEarly * Opt1_SchedDelayEarly
+ BETA_SchedDelayLate * Opt1_SchedDelayLate
2 Opt2 one ASC_2 * one + BETA_LogFare_Business * Opt2LogFare_Business
+ BETA_LogFare_NonBusiness * Opt2LogFare_NonBusiness
+ BETA_TotalTripTime [ BETA_TotalTripTime_std ] * Opt2_TotalTriptime
+ BETA_Legroom * Opt2_Legroom
+ BETA_SchedDelayEarly * Opt2_SchedDelayEarly
+ BETA_SchedDelayLate * Opt2_SchedDelayLate
3 Opt3 one ASC_3 * one + BETA_LogFare_Business * Opt3LogFare_Business
+ BETA_LogFare_NonBusiness * Opt3LogFare_NonBusiness
+ BETA_TotalTripTime [ BETA_TotalTripTime_std ] * Opt3_TotalTriptime
+ BETA_Legroom * Opt3_Legroom
+ BETA_SchedDelayEarly * Opt3_SchedDelayEarly
+ BETA_SchedDelayLate * Opt3_SchedDelayLate
[Expressions]
one = 1
Opt1LogFare_Business = log( Opt1_Fare ) * ( Trip_Purpose <> 2 )
Opt1LogFare_NonBusiness = log( Opt1_Fare ) * ( Trip_Purpose == 2 )
Opt2LogFare_Business = log( Opt2_Fare ) * ( Trip_Purpose <> 2 )
Opt2LogFare_NonBusiness = log( Opt2_Fare ) * ( Trip_Purpose == 2 )
Opt3LogFare_Business = log( Opt3_Fare ) * ( Trip_Purpose <> 2 )
Opt3LogFare_NonBusiness = log( Opt3_Fare ) * ( Trip_Purpose == 2 )
[Model]
$MNL
[Draws]
100
Figure 9.1: Airline itinerary logit model specification with a random
parameter
Model Estimation Results
Variable  Variable                           Coefficient  Standard  t-stat. 0  p-value
number    name                               estimate     error
1         $ASC_2$                            -1.14        0.230     -4.95      0.00
2         $ASC_3$                            -1.22        0.229     -5.31      0.00
3         $\beta_{Legroom}$                  0.219        0.0455    4.81       0.00
4         $\beta_{LogFare\_Business}$        -7.54        1.01      -7.44      0.00
5         $\beta_{LogFare\_NonBusiness}$     -10.5        0.900     -11.66     0.00
6         $\beta_{SchedDelayEarly}$          -0.196       0.0285    -6.86      0.00
7         $\beta_{SchedDelayLate}$           -0.127       0.0257    -4.93      0.00
8         $\beta_{TotalTripTime}$            -0.665       0.191     -3.48      0.00
9         $\beta_{TotalTripTime\_std}$       -0.579       0.208     -2.78      0.01
Summary statistics
Number of observations = 1633
$L(0) = -1794.034$
$L(\hat{\beta}) = -1008.504$
$\bar{\rho}^2 = 0.433$
Table 9.1: Estimation results for the Airline itinerary logit model with a
random parameter
9.2 Swissmetro Case
Alternative Specific Variance Model
Files to use with Biogeme:
Model file: Mixture SM AltSpVar.mod
Data file: swissmetro.dat
In this first model specification, we assume that the ASCs are randomly
distributed. We show below the utility expressions and in Figure 9.2 the
related Biogeme snapshot (see footnote 1).
$V_{car} = ASC_{car} + \beta_{time} \text{CAR\_TT} + \beta_{cost} \text{CAR\_CO}$
$V_{train} = \beta_{time} \text{TRAIN\_TT} + \beta_{cost} \text{TRAIN\_CO} + \beta_{he} \text{TRAIN\_HE}$
$V_{SM} = ASC_{SM} + \beta_{time} \text{SM\_TT} + \beta_{cost} \text{SM\_CO} + \beta_{he} \text{SM\_HE}$
This model is very simple. The parameters are assumed to be generic over
the alternatives, and just a few variables are taken into account.
$ASC_{car}$ and $ASC_{SM}$ are now randomly distributed, with means
$\bar{\alpha}_{car}$ and $\bar{\alpha}_{SM}$ and standard deviations
$\sigma_{car}$ and $\sigma_{SM}$, all of which are estimated. We normalize
with respect to the train alternative, and the estimation results are shown
in Table 9.2. Note that this is a simplification of the proper estimation
process that is needed for alternative specific variance estimation. Recall
that the normalization is not arbitrary, in that only the minimum variance
alternative can be normalized to 0. Therefore, proper estimation requires
first that an unidentified model be estimated (with all three variances in
this case). Then, the model should be re-estimated with the smallest
variance from the unidentified model normalized to 0.
The estimated values of the time, cost and headway coefficients show their
negative impact on the utility functions. The estimated time and cost
coefficients are numerically very close, indicating a similar negative
impact, which is larger than that of headway. The estimated ASCs show that,
all the rest
Footnote 1: Lines in the Biogeme snapshots have been broken, but in the
original Biogeme .mod file they are not.
Estimation results
Parameter  Parameter              Parameter  Robust          Robust
number     name                   estimate   standard error  t statistic
1          $\bar{\alpha}_{car}$   0.244      0.107           2.29
2          $\bar{\alpha}_{SM}$    0.845      0.178           4.75
3          $\sigma_{car}$         0.0992     0.0974          1.02
4          $\sigma_{SM}$          2.92       0.417           7.00
5          $\beta_{cost}$         -0.0169    0.00155         -10.94
6          $\beta_{he}$           -0.00763   0.00133         -5.72
7          $\beta_{time}$         -0.0166    0.00192         -8.66
Summary statistics
Number of draws = 100
Number of observations = 6768
$L(0) = -6964.663$
$L(\hat{\beta}) = -5257.982$
$\bar{\rho}^2 = 0.244$
Table 9.2: Alternative specific variance specification
[Utilities]
// Id Name Avail linear-in-parameter expression
1 SBB_SP TRAIN_AV_SP ASC_SBB * one + BETA_TIME * TRAIN_TT +
BETA_COST * TRAIN_COST + BETA_HE * TRAIN_HE
2 SM_SP SM_AV ASC_SM [ ASC_SM_std ] * one + BETA_TIME * SM_TT
+ BETA_COST * SM_COST + BETA_HE * SM_HE
3 Car_SP CAR_AV_SP ASC_CAR [ ASC_CAR_std ] * one +
BETA_TIME * CAR_TT + BETA_COST * CAR_CO
Figure 9.2: The Biogeme snapshot illustrating the alternative specific
variance specification
remaining constant, both the car and Swissmetro alternatives are preferred,
on average, to the train alternative. The average preference for the
innovative transportation mode is larger in value, and its standard
deviation is significantly different from zero as well as greater than the
mean. This means that part of the population prefers the train to the
Swissmetro (all the rest being constant). We could argue that one of the
reasons is a stricter budget constraint, for example among individuals with
lower incomes. Note also that the standard deviation parameter
$\sigma_{car}$ for the ASC associated with the car alternative is not
significant. We could therefore define the parameter $ASC_{car}$ as a
constant in order to reduce the complexity of the model.
Only 100 random draws have been used for the estimation. Note that this is
not enough. We have chosen few draws in order to decrease the estimation
time for the case study. For more theoretical details on this choice, we
refer the reader to Train (2003) (see footnote 2).
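To illustrate why 100 draws are few, the following sketch simulates the Swissmetro choice probability under the alternative specific variance model and measures how much the simulated value fluctuates from one set of draws to another. It is an illustration only, not Biogeme's algorithm: the function names are hypothetical, the systematic utilities are set to zero for simplicity, and the means and standard deviations of the ASCs are the estimates of Table 9.2.

```python
import math
import random


def logit_probs(utilities):
    """Multinomial logit probabilities for a list of utilities."""
    largest = max(utilities)
    expu = [math.exp(u - largest) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def simulated_prob_sm(n_draws, rng, mean_car=0.244, std_car=0.0992,
                      mean_sm=0.845, std_sm=2.92):
    """Average the logit probability of SM over draws of the random ASCs.

    Assumption: all other terms of the utilities are zero (hypothetical case).
    """
    total = 0.0
    for _ in range(n_draws):
        asc_car = rng.gauss(mean_car, std_car)
        asc_sm = rng.gauss(mean_sm, std_sm)
        # Order of alternatives: train (reference), SM, car.
        total += logit_probs([0.0, asc_sm, asc_car])[1]
    return total / n_draws


def simulation_spread(n_draws, n_repeats=30, seed=0):
    """Standard deviation of the simulated probability across repetitions."""
    rng = random.Random(seed)
    values = [simulated_prob_sm(n_draws, rng) for _ in range(n_repeats)]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


spread_100 = simulation_spread(100)
spread_2000 = simulation_spread(2000)
print(f"spread with 100 draws:  {spread_100:.4f}")
print(f"spread with 2000 draws: {spread_2000:.4f}")
```

The spread shrinks roughly with the square root of the number of draws, which is the trade-off between reliability and computational time mentioned in footnote 2.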
Error Component Model
Files to use with Biogeme:
Model files: Mixture SM EC1.mod, Mixture SM EC2.mod
Data file: swissmetro.dat
Footnote 2: The number of random draws is an important issue in simulated
estimation. For reliable values, such a number should theoretically be
infinite ($\infty$), as the Simulated Maximum Likelihood estimator is not
consistent for a finite number of draws. In practical applications, the
trade-off between the reliability of the estimates and a reasonable
computational time becomes the most important issue. By default, Biogeme
uses pseudo-random draws.
[Utilities]
// Id Name Avail linear-in-parameter expression
1 SBB_SP TRAIN_AV_SP ASC_SBB * one + BETA_TIME * TRAIN_TT +
BETA_COST * TRAIN_COST + BETA_HE * TRAIN_HE +
RAIL [ RAIL_std ] * one
2 SM_SP SM_AV ASC_SM * one + BETA_TIME * SM_TT + BETA_COST * SM_COST
+ BETA_HE * SM_HE + RAIL [ RAIL_std ] * one
3 Car_SP CAR_AV_SP ASC_CAR * one + BETA_TIME * CAR_TT +
BETA_COST * CAR_CO
Figure 9.3: The Biogeme snapshot illustrating how the error component
specification is implemented.
This first error component model attempts to capture the correlation between
the train and Swissmetro alternatives. They are both rail-based
transportation modes, so the hypothesis is that they share unobserved
attributes. We show below the systematic utility expressions and in Figure
9.3 the related Biogeme snapshot.
$V_{car} = ASC_{car} + \beta_{time} \text{CAR\_TT} + \beta_{cost} \text{CAR\_CO}$
$V_{train} = \beta_{time} \text{TRAIN\_TT} + \beta_{cost} \text{TRAIN\_CO} + \beta_{he} \text{TRAIN\_HE} + \xi_{rail}$
$V_{SM} = ASC_{SM} + \beta_{time} \text{SM\_TT} + \beta_{cost} \text{SM\_CO} + \beta_{he} \text{SM\_HE} + \xi_{rail}$
The train and SM modes share the random term $\xi_{rail}$, which is assumed
to be normally distributed, $\xi_{rail} \sim N(m_{rail}, \sigma_{rail}^2)$.
We estimate the standard deviation $\sigma_{rail}$ of this error component,
while the mean $m_{rail}$ is fixed to zero.
The estimation results are shown in Table 9.3. The interpretation is
substantially the same as before. $\sigma_{rail}$ has been estimated
significantly different from zero, capturing the correlation between the
train and the Swissmetro alternatives. Its square, $\sigma_{rail}^2$, is the
element of the variance-covariance matrix capturing the covariance between
Swissmetro and train.
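The covariance structure implied by an error component can be written out explicitly. The sketch below builds the variance-covariance matrix of the utilities' random parts for the alternatives (train, SM, car). It assumes, beyond what the text states, that the remaining error terms are i.i.d. extreme value with scale 1 (variance $\pi^2/6$) and independent of $\xi_{rail}$; the function name and the loading matrix are hypothetical notation.

```python
import math


def error_component_covariance(loadings, sigmas):
    """Covariance matrix implied by normally distributed error components.

    loadings[k][j] is 1 if alternative j carries component k, else 0.
    Assumption: i.i.d. extreme value errors with scale 1 on the diagonal.
    """
    n_alt = len(loadings[0])
    ev_variance = math.pi ** 2 / 6.0
    cov = [[ev_variance if i == j else 0.0 for j in range(n_alt)]
           for i in range(n_alt)]
    for load, sigma in zip(loadings, sigmas):
        for i in range(n_alt):
            for j in range(n_alt):
                cov[i][j] += sigma ** 2 * load[i] * load[j]
    return cov


# Alternatives ordered as (train, SM, car); the rail component loads on the
# first two. The standard deviation 0.153 is the estimate of Table 9.3.
cov = error_component_covariance([[1, 1, 0]], [0.153])
corr_train_sm = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
print(f"cov(train, SM)  = {cov[0][1]:.4f}")
print(f"corr(train, SM) = {corr_train_sm:.4f}")
```

Under these assumptions the off-diagonal element equals $\sigma_{rail}^2$, and a further component is added by passing one more row of loadings together with its standard deviation.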
In the following model, we use a more complex error structure. The idea
is that train and SM are correlated, both being rail-based transportation
modes, but also that train and car are correlated, since they represent
more classical
Estimation results
Parameter  Parameter         Parameter  Robust          Robust
number     name              estimate   standard error  t statistic
1          $ASC_{car}$       0.184      0.0801          2.30
2          $ASC_{SM}$        0.449      0.0935          4.80
3          $\beta_{cost}$    -0.0109    0.000684        -15.92
4          $\beta_{he}$      -0.00536   0.000984        -5.45
5          $\beta_{time}$    -0.0128    0.00105         -12.19
6          $\sigma_{rail}$   0.153      0.0576          2.66
Summary statistics
Number of draws = 100
Number of observations = 6768
$L(0) = -6964.663$
$L(\hat{\beta}) = -5314.698$
$\bar{\rho}^2 = 0.236$
Table 9.3: Error component specification. The $\sigma_{rail}$ coefficient is
the standard deviation of the random term capturing the unobserved shared
attributes between the train and Swissmetro alternatives.
[Utilities]
// Id Name Avail linear-in-parameter expression
1 SBB_SP TRAIN_AV_SP ASC_SBB * one + BETA_TIME * TRAIN_TT +
BETA_COST * TRAIN_COST + BETA_HE * TRAIN_HE +
RAIL [ RAIL_std ] * one +
CLASSIC [ CLASSIC_std ] * one
2 SM_SP SM_AV ASC_SM * one + BETA_TIME * SM_TT +
BETA_COST * SM_COST + BETA_HE * SM_HE +
RAIL [ RAIL_std ] * one
3 Car_SP CAR_AV_SP ASC_CAR * one + BETA_TIME * CAR_TT +
BETA_COST * CAR_CO + CLASSIC [ CLASSIC_std ] * one
Figure 9.4: The Biogeme snapshot for the second error component
specification.
transportation modes with respect to the more innovative Swissmetro. The
corresponding utility functions are
$V_{car} = ASC_{car} + \beta_{time} \text{CAR\_TT} + \beta_{cost} \text{CAR\_CO} + \xi_{classic}$
$V_{train} = \beta_{time} \text{TRAIN\_TT} + \beta_{cost} \text{TRAIN\_CO} + \beta_{he} \text{TRAIN\_HE} + \xi_{rail} + \xi_{classic}$
$V_{SM} = ASC_{SM} + \beta_{time} \text{SM\_TT} + \beta_{cost} \text{SM\_CO} + \beta_{he} \text{SM\_HE} + \xi_{rail}$
and the related Biogeme snapshot is shown in Figure 9.4. As before, the
random terms are assumed to be normally distributed,
$\xi_{rail} \sim N(m_{rail}, \sigma_{rail}^2)$ and
$\xi_{classic} \sim N(m_{classic}, \sigma_{classic}^2)$. The standard
deviations, $\sigma_{rail}$ and $\sigma_{classic}$, are estimated, while the
means $m_{rail}$ and $m_{classic}$ are fixed to zero.
A similar correlation pattern could be specified by means of a Cross-Nested
Logit model where the SM alternative belongs to a rail nest, the car
alternative belongs to a classic nest and the train alternative is assigned
with certain degrees of membership to both rail and classic nests. In the
model, we have normalized with respect to the train alternative. The
estimation results are shown in Table 9.4.
$ASC_{SM}$ and $ASC_{car}$ have positive values, indicating a preference
towards Swissmetro and car over train, all the rest being constant. The
interpretation of the cost, time and headway coefficients remains the same.
Only the
Estimation results
Parameter  Parameter            Parameter  Robust          Robust
number     name                 estimate   standard error  t statistic
1          $ASC_{car}$          0.254      0.110           2.32
2          $ASC_{SM}$           0.865      0.238           3.63
3          $\beta_{cost}$       -0.0166    0.00165         -10.05
4          $\beta_{he}$         -0.00759   0.00134         -5.66
5          $\beta_{time}$       -0.0160    0.00197         -8.12
6          $\sigma_{classic}$   2.86       0.526           5.44
7          $\sigma_{rail}$      0.0982     0.101           0.97
Summary statistics
Number of draws = 100
Number of observations = 6768
$L(0) = -6964.663$
$L(\hat{\beta}) = -5261.818$
$\bar{\rho}^2 = 0.243$
Table 9.4: Error component specification. Train and car share unobserved
attributes through $\xi_{classic}$, and train and SM through $\xi_{rail}$.
[Utilities]
// Id Name Avail linear-in-parameter expression
1 SBB_SP TRAIN_AV_SP ASC_SBB * one + BETA_TIME * TRAIN_TT
+ BETA_TRAIN_COST [ BETA_TRAIN_COST_std ] * TRAIN_COST
+ BETA_HE [ BETA_HE_std ] * TRAIN_HE
2 SM_SP SM_AV ASC_SM * one + BETA_TIME * SM_TT
+ BETA_SM_COST [ BETA_SM_COST_std ] * SM_COST
+ BETA_HE [ BETA_HE_std ] * SM_HE
3 Car_SP CAR_AV_SP ASC_CAR * one + BETA_TIME * CAR_TT
+ BETA_CAR_COST [ BETA_CAR_COST_std ] * CAR_CO
Figure 9.5: The Biogeme snapshot for the random coefficient specification.
standard deviation related to $\xi_{classic}$ is significantly different
from zero (see footnote 3).
Random Coefficients
Files to use with Biogeme:
Model file: Mixture SM Randcoeff.mod
Data file: swissmetro.dat
In this specification, the unknown parameters are assumed to be randomly
distributed over the population. They capture the so-called taste variation
of individuals. The utility expressions are shown below and the related
Biogeme snapshot in Figure 9.5.
$V_{car} = ASC_{car} + \beta_{time} \text{CAR\_TT} + \beta_{car\_cost} \text{CAR\_CO}$
$V_{train} = \beta_{time} \text{TRAIN\_TT} + \beta_{train\_cost} \text{TRAIN\_CO} + \beta_{he} \text{TRAIN\_HE}$
$V_{SM} = ASC_{SM} + \beta_{time} \text{SM\_TT} + \beta_{SM\_cost} \text{SM\_CO} + \beta_{he} \text{SM\_HE}$
We have three alternative-specific coefficients for the cost variable, which
are normally distributed with means $m_{car\_cost}$, $m_{train\_cost}$, and
$m_{SM\_cost}$ and standard deviations $\sigma_{car\_cost}$,
$\sigma_{train\_cost}$, and $\sigma_{SM\_cost}$, respectively. The
coefficient
Footnote 3: In this workbook, the estimated standard deviations are always
reported with a positive sign. In Biogeme they may be reported as negative.
If so, just ignore the sign and consider the absolute value.
Estimation results
Parameter  Parameter                Parameter  Robust          Robust
number     name                     estimate   standard error  t statistic
1          $ASC_{car}$              -1.47      0.177           -8.30
2          $ASC_{SM}$               -0.915     0.130           -7.07
3          $m_{car\_cost}$          -0.0168    0.00409         -4.11
4          $\sigma_{car\_cost}$     0.00883    0.00329         2.68
5          $m_{train\_cost}$        -0.0588    0.00484         -12.14
6          $\sigma_{train\_cost}$   0.0229     0.00209         10.94
7          $m_{SM\_cost}$           -0.0162    0.00217         -7.48
8          $\sigma_{SM\_cost}$      0.00814    0.00204         3.99
9          $m_{he}$                 -0.00619   0.00121         -5.12
10         $\sigma_{he}$            0.00102    0.00415         0.25
11         $\beta_{time}$           -0.0129    0.00168         -7.72
Summary statistics
Number of draws = 100
Number of observations = 6768
$L(0) = -6964.663$
$L(\hat{\beta}) = -4979.704$
$\bar{\rho}^2 = 0.283$
Table 9.5: Random coefficient specification assuming normal distributions.
related to headway is also assumed to be randomly distributed over the
population, with mean $m_{he}$ and standard deviation $\sigma_{he}$.
The estimation results are shown in Table 9.5. The ASCs have negative
signs, and their values now show a preference, all the rest remaining
constant, for the train with respect to both the car and Swissmetro
alternatives. The mean for the car cost coefficient is negative, as
expected, and the standard deviation $\sigma_{car\_cost}$ is significantly
different from zero. Its numerical value indicates that the probability
that the parameter has a negative value is 97.15%. The assumed Normal
distribution allows for non-zero probabilities of having a positive car
cost coefficient. Similar considerations can be made for the other random
coefficients. The mean for the train cost coefficient is negative, as
expected, and both the mean and the standard deviation are
[GeneralizedUtilities]
1 exp( BETA_TIME [ BETA_TIME_std ] ) * TRAIN_TT
2 exp( BETA_TIME [ BETA_TIME_std ] ) * SM_TT
3 exp( BETA_TIME [ BETA_TIME_std ] ) * CAR_TT
Figure 9.6: The Biogeme Log-Normal specification.
significant. Computing the cumulative distribution function (cdf) of the
Normal distribution with these parameters, we observe that the cumulative
probability of having a train cost coefficient less than zero is 99.49%. For
the SM cost parameter (both mean and standard deviation are significant),
the cdf for negative values equals 97.67%. The mean of the headway
parameter is negative, as expected, and its standard deviation has not been
estimated significantly different from zero.
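The probabilities quoted for the cost coefficients follow from the Normal cdf evaluated at zero. The sketch below recomputes them from the rounded estimates of Table 9.5, using only the Python standard library; the function name is hypothetical, and small differences from the quoted percentages may arise because the table values are rounded.

```python
import math


def prob_negative(mean, std):
    """P(beta < 0) for beta ~ N(mean, std^2), via the error function."""
    z = (0.0 - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# (mean, standard deviation) of the cost coefficients, from Table 9.5.
estimates = {
    "car cost": (-0.0168, 0.00883),
    "train cost": (-0.0588, 0.0229),
    "SM cost": (-0.0162, 0.00814),
}

for name, (mean, std) in estimates.items():
    share = 100.0 * prob_negative(mean, std)
    print(f"{name}: P(coefficient < 0) = {share:.2f}%")
```

The complement of each value is the share of the population for which the model implies a positive cost coefficient, which is the drawback of the Normal assumption noted above.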
Different distributions. We show here two examples of Biogeme code to
specify a random coefficient model where the parameters are log-normally
and Johnson's $S_B$ distributed. The Biogeme snapshots are shown in Figures
9.6 and 9.7, respectively. Recall that a variable $X$ is log-normally
distributed if $y = \ln(X)$ is normally distributed. We can easily define
such a distribution in Biogeme by assuming a generic time coefficient to be
log-normally distributed.
In the case of Johnson's $S_B$ distribution, the functional form is derived
using a Logit-like transformation of a Normal distribution, as defined in
the following equation
$\xi = a + (b - a) \frac{e^{\zeta}}{e^{\zeta} + 1}$
where $\zeta \sim N(\mu, \sigma^2)$. This distribution is very flexible; it
is bounded between $a$ and $b$, and its shape can change from a very flat
one to a bimodal one by changing the parameters of the normal variable. It
requires the estimation of four parameters ($a$, $b$, $\mu$ and $\sigma$)
and a nonlinear specification, assuming, as before, a generic time
coefficient following such a distribution.
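Both transformations are easy to try out on simulated draws. The sketch below applies them to draws of a normal variable; it is an illustration in Python rather than Biogeme code, the function names are hypothetical, and the numerical values of $a$, $b$, $\mu$ and $\sigma$ are made up for the example.

```python
import math
import random


def logistic(z):
    """Numerically stable e^z / (e^z + 1)."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def johnson_sb(zeta, a, b):
    """Johnson's S_B transform: a + (b - a) e^zeta / (e^zeta + 1)."""
    return a + (b - a) * logistic(zeta)


def log_normal(zeta):
    """Log-normal transform: exp of a normal draw, always positive."""
    return math.exp(zeta)


# Hypothetical values chosen only for illustration.
a, b = -0.05, 0.0
mu, sigma = 0.0, 1.5

rng = random.Random(1)
normal_draws = [rng.gauss(mu, sigma) for _ in range(10000)]
sb_draws = [johnson_sb(z, a, b) for z in normal_draws]
ln_draws = [log_normal(z) for z in normal_draws]

print(f"S_B draws lie in [{min(sb_draws):.4f}, {max(sb_draws):.4f}]")
print(f"smallest log-normal draw: {min(ln_draws):.4f}")
```

Choosing $b = 0$, as in this made-up example, is one way to keep a coefficient negative for the whole population while still bounding it from below.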
The topic of the functional form for random coefficient distributions is
treated in more detail in, for example, Train (2003) and Walker et al.
(2007).
[GeneralizedUtilities]
1 ( A + ( ( B - A ) * ( exp( BETA_TIME [ BETA_TIME_std ] )
/ ( exp( BETA_TIME [ BETA_TIME_std ] ) + 1 ) ) ) ) * TRAIN_TT
2 ( A + ( ( B - A ) * ( exp( BETA_TIME [ BETA_TIME_std ] )
/ ( exp( BETA_TIME [ BETA_TIME_std ] ) + 1 ) ) ) ) * SM_TT
3 ( A + ( ( B - A ) * ( exp( BETA_TIME [ BETA_TIME_std ] )
/ ( exp( BETA_TIME [ BETA_TIME_std ] ) + 1 ) ) ) ) * CAR_TT
Figure 9.7: The Biogeme $S_B$ specification.
Mixture of GEV Models
Files to use with Biogeme:
Model file: Mixture SM M-NL.mod
Data file: swissmetro.dat
In this example, we capture the substitution patterns using a Nested Logit
model, and we allow for some parameters to be randomly distributed over
the population.
$V_{car} = ASC_{car} + \beta_{car\_time} \text{CAR\_TT} + \beta_{cost} \text{CAR\_CO}$
$V_{train} = \beta_{train\_time} \text{TRAIN\_TT} + \beta_{cost} \text{TRAIN\_CO} + \beta_{he} \text{TRAIN\_HE} + \beta_{ga} \text{GA} + \beta_{senior} \text{SENIOR}$
$V_{SM} = ASC_{SM} + \beta_{SM\_time} \text{SM\_TT} + \beta_{cost} \text{SM\_CO} + \beta_{he} \text{SM\_HE} + \beta_{ga} \text{GA} + \beta_{seats} \text{SM\_SEATS}$
We have added the socio-economic characteristics senior (a dummy variable
for senior people, i.e. age above 65), ga and SM seats. A few observations
have been removed where the variable Age was missing. We specify a nest
composed of alternatives car and train representing standard transportation
modes, while the Swissmetro alternative represents the technological
innovation. We further assume a generic cost parameter and three randomly
distributed alternative-specific time parameters. Normal distributions are
Estimation results
Parameter  Parameter                Parameter  Robust          Robust     Robust
number     name                     estimate   standard error  t stat. 0  t stat. 1
1          $ASC_{car}$              -0.145     0.120           -1.21
2          $ASC_{SM}$               0.185      0.115           1.61
3          $\beta_{senior}$         1.53       0.132           11.65
4          $m_{car\_time}$          -0.0134    0.000996        -13.43
5          $\beta_{cost}$           -0.00961   0.000817        -11.76
6          $\sigma_{car\_time}$     0.00462    0.000499        9.26
7          $\beta_{ga}$             1.01       0.149           6.79
8          $\beta_{he}$             -0.00467   0.000878        -5.32
9          $\beta_{seats}$          -0.262     0.100           -2.62
10         $m_{SM\_time}$           -0.0152    0.00132         -11.56
11         $\sigma_{SM\_time}$      0.00877    0.00139         6.29
12         $m_{train\_time}$        -0.0158    0.00109         -14.50
13         $\sigma_{train\_time}$   0.000741   0.000661        1.12
14         $\mu_{classic}$          1.85       0.141           13.10      6.02
Summary statistics
Number of draws = 100
Number of observations = 6759
$L(0) = -6958.425$
$L(\hat{\beta}) = -4956.477$
$\bar{\rho}^2 = 0.286$
Table 9.6: Mixture of Nested Logit estimation results
used for the random coefficients, that is,
$\beta_{car\_time} \sim N(m_{car\_time}, \sigma_{car\_time}^2)$
$\beta_{train\_time} \sim N(m_{train\_time}, \sigma_{train\_time}^2)$
$\beta_{SM\_time} \sim N(m_{SM\_time}, \sigma_{SM\_time}^2)$.
The estimation results are shown in Table 9.6. The nest parameter has
been estimated significantly different from 1, showing a correlation between
the train and car alternatives, as expected. The three mean parameters for
the time coefficients have been estimated with negative signs (as expected)
and are significantly different from zero. Their numerical values are only
slightly different, suggesting that a generic specification would probably
have been acceptable. For the car and Swissmetro time coefficients, the
estimated standard deviations are significant and smaller than the absolute
values of the corresponding means. This means that their distributions over
the population are fairly peaked, indicating that the way different
individuals perceive the negative impact of travel time on the
alternatives' utilities is not so different. Finally, given the narrow
shape of the estimated random coefficient distributions, choices other than
the normal, such as bounded distributions, would probably be suitable.
Mixture of Logit with Panel Data
Files to use with Biogeme:
Model file: Mixture SM panel.mod
Data file: swissmetro.dat
In this example, we take into account the fact that we have panel data in
the sample file. Indeed, the sample file is composed of nine observations
per individual. These nine observations correspond to the choices made by a
single respondent in nine hypothetical mode choice situations described in
the questionnaire of the Swissmetro survey. The idea is thus to specify a
model which is able to deal with sequences of observed choices and with the
intrinsic correlation among the choices of a sequence.
The specification file Mixture SM panel.mod is based on the model MNL SM
specific.mod with alternative-specific cost coefficients, which has been
analyzed in the Case Study dealing with logit models. We have added the
following section:
[PanelData]
ID
ZERO_SIGMA_PANEL
where ID is the name of the variable in the dataset identifying the
observations belonging to a given individual, and ZERO_SIGMA_PANEL is the
name of
the random coefficient which will not vary across observations from the same
individual.
The way we deal with panel data is therefore to use a Mixture of Logit model
with a random coefficients specification. More precisely, we add individual
specific error terms (specified in Biogeme by ZERO [ SIGMA_PANEL ] * one)
in two alternatives (we need to normalize one alternative), where the
standard deviation (SIGMA_PANEL) needs to be estimated while the mean (ZERO)
is fixed to zero. The utility functions for this model can therefore be
specified in Biogeme as follows:
[Utilities]
Car ASC_CAR * one + BETA_TIME * CAR_TT + BETA_CAR_COST *
CAR_CO + ZERO [ SIGMA_PANEL ] * one
Train ASC_SBB * one + BETA_TIME * TRAIN_TT +
BETA_TRAIN_COST * TRAIN_COST + BETA_HE * TRAIN_HE +
ZERO [ SIGMA_PANEL ] * one
SM ASC_SM * one + BETA_TIME * SM_TT + BETA_SM_COST *
SM_COST + BETA_HE * SM_HE
We see from the estimation results presented in Table 9.7 that the
coefficient $\sigma_{panel}$ is highly significant, which means that this
model captures intrinsic correlations among the observations of the same
individual. Moreover, the final log-likelihood value is -4235.440, which is
much higher (that is, smaller in absolute value) than the value -5068.560
obtained with the model MNL SM specific.mod without a panel term. The
interpretation of the other coefficients remains the same as that for the
coefficients of MNL SM specific.mod, except that $ASC_{SM}$ is no longer
significantly different from 0.
Estimation results
Parameter  Parameter              Parameter  Robust          Robust
number     name                   estimate   standard error  t statistic
1          $ASC_{car}$            -0.988     0.390           -2.53
2          $ASC_{SM}$             -0.291     0.531           -0.55
3          $\beta_{car\_cost}$    -0.0132    0.00324         -4.07
4          $\beta_{train\_cost}$  -0.0323    0.00574         -5.63
5          $\beta_{SM\_cost}$     -0.0163    0.00262         -6.22
6          $\beta_{he}$           -0.00757   0.00127         -5.96
7          $\beta_{time}$         -0.0190    0.00616         -3.09
8          $\sigma_{panel}$       2.39       0.216           11.06
Summary statistics
Number of draws = 100
Number of individuals = 752
$L(0) = -6964.663$
$L(\hat{\beta}) = -4235.440$
$\bar{\rho}^2 = 0.391$
Table 9.7: Mixture of logit model with panel data.
Chapter 10
Simultaneous RP/SP Estimation
This case study deals with the simultaneous estimation of a Binary Logit
model from revealed and stated preference (RP and SP) data. The objective
of this case study is to estimate Binary Logit models with RP, SP and
combined RP/SP datasets and compare the results of the three models.
The intercity mode choice dataset collected in Nijmegen, Netherlands, will
be used in this case study. The survey was conducted during 1987 for the
Netherlands Railways to assess factors that influence the choice between
rail and car for intercity travel. The detailed description of the data
collection method and variable definitions is presented in the Appendix,
section A.2.
10.1 Model Specification with RP Data
Files to use with Biogeme:
Model file: RP-SP NL rp.mod
Data file: netherlands.dat
The simple RP model consists of travel time and travel cost with generic
coefficients for both alternatives.
$V_{auto} = \beta_{time} \text{car\_time} + \beta_{cost} \text{car\_cost}$
$V_{rail} = ASC_{rp\text{-}rail} + \beta_{time} \text{rail\_time} + \beta_{cost} \text{rail\_cost}$
The estimation results are shown in Table 10.1. The results show that the
utility of a mode decreases as total travel time and travel cost increase.
10.2 Model Specification with SP Data
Files to use with Biogeme:
Model file: RP-SP NL sp.mod
Data file: netherlands.dat
The simple SP model is estimated with a generic cost coefficient, a generic
time coefficient, and an inertia variable (rpchoice) in the rail utility. The
inertia variable captures the effect of the actual choice of the respondent
on his/her SP response (based on the hypothesis that people who have chosen
a particular mode in an actual case will tend to have a bias towards that
mode).
The sample is composed of 1511 observations. The coefficients have the
expected sign, and they are significantly different from zero at the 95%
level of confidence. Note that the ASC associated with the rail alternative
is now negative. Combined with the inertia coefficient, this implies that
the intercept is negative for car users and positive for rail users. The
inertia effect of the actual choice is significant in the SP experiment.
The estimation results are shown in Table 10.2.
10.3 Model Specification with Combined RP-SP Data
Files to use with Biogeme:
Model file: RP-SP NL rpsp.mod
Data file: netherlands.dat
Having defined the utility functions for the RP model as follows:
$U^{RP} = V^{RP} + \varepsilon^{RP}$
and those of the SP model as follows:
$U^{SP} = V^{SP} + \varepsilon^{SP}$,
we have already estimated separately the RP model and the SP model. Now,
in order to perform a joint estimation of both models, that is an RP-SP
model, it is mandatory that the variances of the error terms are the same.
This is why we assume that:
$Var(\varepsilon^{RP}) = Var(\mu \varepsilon^{SP}) = \mu^2 Var(\varepsilon^{SP})$.
The utilities for the RP and SP models can now be rewritten as
$U^{RP} = V^{RP} + \varepsilon^{RP}$
$\mu U^{SP} = \mu V^{SP} + \mu \varepsilon^{SP}$
and the error terms ($\varepsilon^{RP}$ and $\mu \varepsilon^{SP}$) of both
models have the same variance. Assume that
$V_{in}^{SP} = \beta X_{in}^{SP}$ is a linear-in-parameters specification.
Then $\mu V_{in}^{SP} = \mu \beta X_{in}^{SP}$, where both $\mu$ and
$\beta$ must be estimated, introducing a nonlinear specification.
In this example, the combined RP-SP model consists of total travel time and
travel cost for both types of observations, and inertia (rpchoice) in rail
for the SP observations. The scale of the RP observations is fixed at 1, and
therefore $\mu$ represents the scale of the SP observations. The model is
estimated on a total of 1739 observations.
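As a numerical check of the role of the scale, the sketch below multiplies the coefficients of the combined model (Table 10.3) by the estimated scale of the SP observations and compares the result with the coefficients of the SP-only model (Table 10.2). The dictionary keys are hypothetical labels; the values are copied from the two tables, and agreement is only expected up to rounding.

```python
# Estimated scale of the SP observations (Table 10.3).
mu = 0.339

# Coefficients entering the SP utilities in the combined RP-SP model.
combined = {
    "ASC_sp_rail": -4.79,
    "inertia": 8.03,
    "cost": -0.05,
    "time": -1.32,
}

# Coefficients of the model estimated on SP data only (Table 10.2).
sp_only = {
    "ASC_sp_rail": -1.62,
    "inertia": 2.72,
    "cost": -0.0170,
    "time": -0.447,
}

# Multiplying by mu expresses the combined coefficients on the SP scale.
rescaled = {name: mu * value for name, value in combined.items()}

for name in combined:
    print(f"{name:12s} mu * combined = {rescaled[name]:9.4f}   "
          f"SP only = {sp_only[name]:9.4f}")
```

The close agreement reflects the fact that the SP-only model cannot separate the scale from the coefficients, so it reports their product.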
The estimation results are shown in Table 10.3. The negative and significant
coefficient for the alternative specific constant in the SP rail alternative
indicates that, all else being equal, car users tend to dislike rail in the
SP case.
The inertia dummy was found to have a large impact on the utility both in
terms of value and statistical signicance. The scale parameter was also
found to be signicantly dierent from one indicating a signicant dierence
in the variance between the RP and SP data.
Finally, we can use a likelihood ratio test to test for stability of
preferences.¹ Specifically, the null hypothesis is

$$H_0: \beta^{RP} = \beta^{SP}.$$

The test statistic for the null hypothesis is given by

$$-2(\mathcal{L}_R - \mathcal{L}_U) = -2(-780.124 + 123.133 + 656.991) = 0.000$$

where the restricted model is the combined RP-SP model and the unrestricted
model comprises the separate RP and SP models.
The test statistic is asymptotically $\chi^2$ distributed with the degrees of
freedom equal to $K_{RP} + K_{SP} - K_{RPSP} = 3 + 4 - 6 = 1$.
Since 0.000 < 3.841 (the critical value of the $\chi^2$ distribution with 1
degree of freedom at a 95 % level of confidence), we cannot reject the null
hypothesis of stability of preferences (i.e., we retain the combined RP-SP
model).
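The arithmetic of this likelihood ratio test can be reproduced with a short Python sketch. The log-likelihood values and parameter counts are those reported in Tables 10.1 to 10.3, and the critical value 3.841 is the one quoted in the text. The variable names are illustrative, and the p-value helper relies on the identity that holds only for one degree of freedom.

```python
import math

# Final log-likelihoods reported in Tables 10.1, 10.2 and 10.3.
ll_rp, ll_sp, ll_joint = -123.133, -656.991, -780.124
# Number of estimated parameters in each model.
k_rp, k_sp, k_joint = 3, 4, 6

# Restricted model: joint RP-SP. Unrestricted model: separate RP and SP models.
lr_stat = max(0.0, -2.0 * (ll_joint - (ll_rp + ll_sp)))  # guard against rounding noise
dof = k_rp + k_sp - k_joint
critical_95 = 3.841  # chi-square critical value for 1 degree of freedom, from the text


def chi2_sf_one_dof(x):
    """Upper-tail probability of a chi-square variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


reject_h0 = lr_stat > critical_95
print(f"LR = {lr_stat:.3f}, dof = {dof}, p-value = {chi2_sf_one_dof(lr_stat):.3f}")
print("reject H0" if reject_h0 else "cannot reject H0")
```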
¹ Note that the likelihood ratio test in such a situation is an approximate test. The
test results are asymptotically valid if the standard errors and the robust standard errors
are approximately the same. However, if there are substantial differences between the
standard errors and the robust standard errors, the likelihood ratio test results may be
misleading and Wald / Lagrange Multiplier tests are more appropriate.
BL with RP data
Parameter  Parameter                          Parameter  Robust          Robust
number     name                               estimate   standard error  t statistic
1          $\mathrm{ASC}_{\text{rp-rail}}$     0.798     0.275             2.90
2          $\beta_{\text{cost}}$              -0.0499    0.0107           -4.67
3          $\beta_{\text{time}}$              -1.33      0.354            -3.75
Summary statistics
Number of observations: 228
$\mathcal{L}(0) = -158.038$
$\mathcal{L}(\hat{\beta}) = -123.133$
$\bar{\rho}^2 = 0.202$
Table 10.1: BL with RP data estimation results
BL with SP data
Parameter  Parameter                          Parameter  Robust          Robust
number     name                               estimate   standard error  t statistic
1          $\mathrm{ASC}_{\text{sp-rail}}$    -1.62      0.128           -12.65
2          $\beta_{\text{inert}}$              2.72      0.144            18.91
3          $\beta_{\text{cost}}$              -0.0170    0.00384          -4.42
4          $\beta_{\text{time}}$              -0.447     0.0977           -4.58
Summary statistics
Number of observations: 1511
$\mathcal{L}(0) = -1047.350$
$\mathcal{L}(\hat{\beta}) = -656.991$
$\bar{\rho}^2 = 0.369$
Table 10.2: BL with SP data estimation results
BL with combined RP-SP data
Parameter  Parameter                          Parameter  Robust          Robust
number     name                               estimate   standard error  t statistic
1          $\mathrm{ASC}_{\text{rp-rail}}$     0.798     0.275             2.90
2          $\mathrm{ASC}_{\text{sp-rail}}$    -4.79      1.35             -3.54
3          $\beta_{\text{inert}}$              8.03      1.91              4.21
4          $\beta_{\text{cost}}$              -0.05      0.00965          -5.18
5          $\beta_{\text{time}}$              -1.32      0.293            -4.51
6          $\mu$                               0.339     0.0817           -8.09*
Summary statistics
Number of observations: 1739
$\mathcal{L}(0) = -1205.383$
$\mathcal{L}(\hat{\beta}) = -780.124$
$\bar{\rho}^2 = 0.348$
* Robust t statistic for the null hypothesis $\mu = 1$
Table 10.3: BL with RP and SP data estimation results
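As a rough consistency check on the reported estimates, the following Python sketch recomputes the t statistic of the scale parameter against 1 and compares the joint-model coefficients, multiplied by the estimated scale, with the SP-only estimates. All numbers are copied from the estimation result tables of this chapter; the dictionary keys and the tolerance of 0.01 are illustrative choices.

```python
# Scale parameter and its robust standard error (combined RP-SP model).
mu, mu_se = 0.339, 0.0817
t_against_one = (mu - 1.0) / mu_se  # t statistic for the hypothesis mu = 1

# Coefficients that apply to SP observations in the combined model ...
joint = {"ASC_sp_rail": -4.79, "inert": 8.03, "cost": -0.05, "time": -1.32}
# ... and the corresponding SP-only estimates.
sp_only = {"ASC_sp_rail": -1.62, "inert": 2.72, "cost": -0.0170, "time": -0.447}

# In the SP utilities the combined-model coefficients are multiplied by mu.
rescaled = {name: mu * value for name, value in joint.items()}

print(f"t statistic against 1: {t_against_one:.2f}")
for name in joint:
    print(f"{name:12s} mu * joint = {rescaled[name]:8.4f}   SP only = {sp_only[name]:8.4f}")
```

The rescaled values are close to the SP-only estimates, which is what one would expect when the combined model fits the SP data about as well as the separate SP model.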
Bibliography
Abbe, E., Bierlaire, M. and Toledo, T. (2007). Normalization and correlation
of cross-nested logit models, Transportation Research B: Methodological
41(7): 795-808.
Ben-Akiva, M. and Lerman, S. R. (1985). Discrete Choice Analysis: Theory
and Application to Travel Demand, MIT Press, Cambridge, MA.
Ben-Akiva, M. and Morikawa, T. (1990). Revealed preferences and stated
intentions, Transportation Research A 24A(6): 485-495.
Bierlaire, M., Axhausen, K. and Abay, G. (2001). The acceptance
of modal innovation: The case of Swissmetro, Proceedings of the
1st Swiss Transportation Research Conference, Ascona, Switzerland.
www.strc.ch/bierlaire.pdf.
Cherchi, E. and Ortúzar, J. (2002). Mixed RP/SP models incorporating
interaction effects, Transportation 29: 371-395.
Ekman, P. and Friesen, W. V. (1978). Facial Action Coding System
Investigator's Guide, Consulting Psychologists Press, Palo Alto, CA.
Kanade, T., Cohn, J. and Tian, Y. L. (2000). Comprehensive database
for facial expression analysis, Proceedings of the 4th IEEE International
Conference on Automatic Face and Gesture Recognition (FG'00), pp. 46-53.
McFadden, D. (1987). Regression-based specification tests for the
multinomial logit model, Journal of Econometrics 34(1/2): 63-82.
Train, K. (2003). Discrete Choice Methods with Simulation, Cambridge
University Press, University of California, Berkeley.
Train, K., Ben-Akiva, M. and Atherton, T. (1989). Consumption patterns
and self-selecting tariffs, Review of Economics and Statistics 71(1): 62-73.
Train, K., McFadden, D. and Ben-Akiva, M. (1987). The demand for local
telephone service: a fully discrete model of residential calling patterns
and service choices, Rand Journal of Economics.
Walker, J., Ben-Akiva, M. and Bolduc, D. (2007). Identification of
parameters in normal error component logit-mixture (NECLM) models, Journal
of Applied Econometrics 22(6): 1095-1125.
Appendix A
Datasets
A.1 Choice-Lab-Fashion Marketing Case
Context
Choice-Lab-Fashion¹ is a European company that specializes in collecting
and processing data on companies operating in the fashion industry. The
company sells marketing solutions and operates in a business-to-business
market covering producers and distributors of clothes, shoes and accessories.
Choice-Lab-Fashion provides its clients with a selection of products that will
help them make the right decisions in connection with competitor analysis,
segmentation, sales management and direct marketing. Choice-Lab-Fashion
has lately been experiencing a decrease in its customer base. As the fashion
industry has been consolidating in the past few years, one reason might be
related to the shrinking of the target market. Choice-Lab-Fashion has also
been introducing new products that might have been cannibalizing some
older ones.
The management team would like to investigate whether it is possible to
understand which factors characterize customer departure. However,
the company does not have a survey, and in the near future it would
like to see if it is possible to learn something from the available customer
data. The customer data, described below, include a set of socio-economic
characteristics of Choice-Lab-Fashion's clients and the type of product they
purchased. These variables are individual specific and do not vary between
alternatives.
¹ By using this dataset, the student agrees to use it only for academic purposes related
to this course. The user is responsible for keeping the data on a computer with secure
access. The user of the data must not transfer it or distribute it to third parties. The
name Choice-Lab-Fashion is fictitious.
Data
The Choice-Lab-Fashion customer database includes an unbalanced panel of
data from 2000 until 2002. For each row of data, we observe the customer ID,
the year of observation, some financial and economic indicators and the types
of products that the customer has purchased. The dependent variable indicates
a binary choice. It is equal to one when the customer decides to defect and
zero when the customer decides to stay with the company. (There is no more
data on a client once it has defected.)
Note that out of 16220 observations, there is one for which the dependent
variable equals zero (the customer is still with the company) but none of the
products has been purchased. We believe that this is probably an error from
the data provider, and this observation has therefore been excluded from the
model estimations (see the [Exclude] section of the .mod Biogeme files).
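The exclusion rule described above can be sketched in Python as follows. This is only an illustration of the logic, not the content of the [Exclude] section of the Biogeme files: the helper functions and the three sample rows are made up, and only the column names are taken from Table A.1.

```python
# Product dummies as named in Table A.1.
PRODUCT_COLUMNS = [
    "IndAnalysis", "CreditInfo", "Accounts", "Monitor", "Web",
    "CD", "CRM", "Internet", "OpenDB", "Other",
]


def is_inconsistent(row):
    """True if the customer stays (Choice == 0) but no product was purchased."""
    return row["Choice"] == 0 and sum(row[col] for col in PRODUCT_COLUMNS) == 0


def make_row(choice, purchased):
    """Build a hypothetical observation with the listed products set to 1."""
    row = {col: 0 for col in PRODUCT_COLUMNS}
    for col in purchased:
        row[col] = 1
    row["Choice"] = choice
    return row


# Three made-up observations; the second one violates the consistency rule.
rows = [
    make_row(0, ["CD", "Web"]),
    make_row(0, []),
    make_row(1, ["Accounts"]),
]

kept = [row for row in rows if not is_inconsistent(row)]
print(f"{len(rows) - len(kept)} observation(s) excluded, {len(kept)} kept")
```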
Unit of analysis: firm
Observation period: 2000 - 2002
Choice set: whether the firm remains a client or not.
Choice-Lab-Fashion has 10 different products, which are described below.
Product 1: Fashion Industry Analysis Report
The fashion industry analysis report provides key figures for the past
5 years on the clothing, shoes and accessories sectors. Fashion industry
analysis reports include mergers, acquisitions and bankruptcies that
have characterized the industry in the past 5 years. The report also
provides an opportunity to compare the 10 largest competitors in the
given sector (clothes, shoes or accessories).
Product 2: Fashion Credit Info
Choice-Lab-Fashion helps the client if it is about to grant credit to
either a domestic or foreign based customer. The fashion credit report
also contains an overview of the company's management, history,
and accounting figures. This service is mainly used by manufacturing
companies in relation to their distributors or wholesalers.
Product 3: Individual Accounts Database
This product allows clients to retrieve copies of the past 5 annual accounts
of companies via the Choice-Lab-Fashion web site. The accounts are
sent via email. This product is mainly used by first time shoppers
and does not generate a lot of revenue for Choice-Lab-Fashion. But
it gives Choice-Lab-Fashion exposure to potential clients. If the client
purchases any of the main data access products offered by
Choice-Lab-Fashion within 30 days of this first purchase, the cost of this
first purchase will be deducted from the client's new invoice.
Product 4: Customized Individual Business Monitoring
Choice-Lab-Fashion monitors different areas (i.e., product launches,
financial information, ownership change) of a list of companies of the
client's choice. This product is customized to the client's needs.
Product 5: Web Access Real-Time Fashion Data
Web access real-time fashion data is an Internet based program with
real-time data. It allows the user to perform company searches on the full
Choice-Lab-Fashion data or on geographic segments of it. The data
cannot be downloaded but the user can generate simple reports as text
files.
Product 6: CD-Fashion
CD-Fashion is the most complete database of companies in the form
of a CD which is updated semiannually. It contains accounting
and financial data, companies' addresses, and their ownership information.
As manufacturers often produce several brands, for the top 30
manufacturers in each country the data also include information on
the names of the brands produced. (Choice-Lab-Fashion collects this
information via the daily press and telephone interviews with the different
manufacturers every six months.)
Product 7: CRM-F Integrated
CRM-F Integrated is a user-friendly Internet based customer relations
management system. It is an integrated and professional tool to help
control and plan all activities directed towards customers, prospects
and suppliers. This solution is normally purchased by large accounts
that have a well developed and integrated IT platform.
Product 8: Internet-Credit
An internet-credit annual subscription gives access to credit information
on companies. The credit information provides a detailed overview
of the company's credit limits and credit rating. The data cannot be
downloaded but the client can generate small reports as text files.
Product 9: Open Fashion Data Base
Real-time access to the Choice-Lab-Fashion database. This feature ensures
that the client has access to the latest company data. These data are
updated daily by Choice-Lab-Fashion's staff. The data include new
product launches, mergers, acquisitions and bankruptcies. The data
can be downloaded as a data file.
Product 10: Other Customized Solutions
This product includes other solutions such as:
• Fashion Event Analysis: describes what is going on in a specific area in
  terms of events related to the fashion world.
• In-depth Interviews - Focus Groups: Choice-Lab-Fashion can carry out
  in-depth interviews and focus groups in the fashion industry for a list of
  companies identified by its customers.
Variables and Descriptive Statistics
In Table A.1, we summarize the variables of the dataset, and in Table A.2
we summarize the descriptive statistics.
Variable     Description
Choice       Equals 1 if customer drops next year; 0 otherwise
ID           Company ID
IndAnalysis  Equals 1 if product 1 has been purchased; 0 otherwise
CreditInfo   Equals 1 if product 2 has been purchased; 0 otherwise
Accounts     Equals 1 if product 3 has been purchased; 0 otherwise
Monitor      Equals 1 if product 4 has been purchased; 0 otherwise
Web          Equals 1 if product 5 has been purchased; 0 otherwise
CD           Equals 1 if product 6 has been purchased; 0 otherwise
CRM          Equals 1 if product 7 has been purchased; 0 otherwise
Internet     Equals 1 if product 8 has been purchased; 0 otherwise
OpenDB       Equals 1 if product 9 has been purchased; 0 otherwise
Other        Equals 1 if product 10 has been purchased; 0 otherwise
Age          Number of years the client has existed
Rating       Client credit rating: 100 represents the best and 0 the
             worst (this is a proxy for the current financial condition
             of the client)
Year         Year of observation
NegProfit    Equals 1 if profit < 0; 0 otherwise
NegEquity    Equals 1 if equity < 0; 0 otherwise
LRSC         Equals 1 if company is a limited responsibility stock
             owned company; 0 otherwise
LRC          Equals 1 if company is a limited responsibility
             company; 0 otherwise
NbEmpl       Total number of employees
LnNbEmpl     Natural log of the number of employees
LnAge        Natural log of the age of the company
Table A.1: Description of the variables in the dataset
Variable Mean Std. Dev. Min Max
Choice 0.19 0.39 0 1
ID 253164.74 259851.50 830 1091364
IndAnalysis 0.27 0.45 0 1
CreditInfo 0.35 0.48 0 1
Accounts 0.29 0.46 0 1
Monitor 0.02 0.14 0 1
Web 0.13 0.34 0 1
CD 0.34 0.47 0 1
CRM 0.00 0.06 0 1
Internet 0.06 0.24 0 1
OpenDB 0.00 0.04 0 1
Other 0.52 0.50 0 1
Age 29.94 27.27 1 380
Rating 55.96 18.66 0 100
Year 2000.98 0.82 2000 2002
NegProfit 0.24 0.43 0 1
NegEquity 0.04 0.19 0 1
LRSC 0.77 0.42 0 1
LRC 0.20 0.40 0 1
NbEmpl 52.76 106.53 1 989
LnNbEmpl 2.83 1.58 0 6.90
LnAge 3.10 0.76 0 5.94
Table A.2: Descriptive statistics
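The log-transformed variables can be checked against the raw ones with a few lines of Python. This is only a sanity check on the reported extremes: the maxima of NbEmpl and Age are taken from Table A.2, and the helper name is illustrative.

```python
import math


def natural_log_2dp(value):
    """Natural logarithm rounded to two decimals, as reported in Table A.2."""
    return round(math.log(value), 2)


# Maxima and minima of the raw variables as reported in Table A.2.
print(natural_log_2dp(989))  # maximum of NbEmpl -> compare with maximum of LnNbEmpl
print(natural_log_2dp(380))  # maximum of Age    -> compare with maximum of LnAge
print(natural_log_2dp(1))    # minimum of both   -> compare with the reported minimum
```

The results agree with the reported maxima of 6.90 and 5.94 and with the minimum of 0.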
A.2 Netherlands Mode Choice Case
Context
Nijmegen is a small city in the eastern part of the Netherlands, near the
border with Germany. The city has typical rail connections with the major cities
in the western metropolitan area called the Randstad (that contains Ams-
terdam, Rotterdam and The Hague). Trips from Nijmegen to the Randstad
take approximately two hours by both rail and car. A binary choice model
can be developed to model the mode choice of travelers for intercity travel.
Data Collection
This dataset was collected by a survey conducted in this corridor during
1987 by the Netherlands Railways to assess factors that influence the choice
between car and rail (see Ben-Akiva and Morikawa, 1990). The sample
consisted of residents of Nijmegen who:
• made a trip in the previous three months to Amsterdam, Rotterdam
  or The Hague;
• did not use a yearly rail pass, or other types of pass which would
  eliminate the marginal cost of the trip;
• had the possibility of using a car, namely, possessed a driver's license
  and had a car available in the household; and
• had the possibility of using rail, namely, did not have any very heavy
  baggage, were not handicapped, and did not need to visit multiple
  destinations.
Qualifying residents of Nijmegen were identified in a random telephone
survey and requested to participate in a home interview. 235 interviews were
conducted out of the 365 people who were reached by telephone and satisfied
the above criteria. The entire home interview was administered using
laptop microcomputers, so the respondents replied to the questions on the
computer screen. The respondents were requested to report the characteristics
of the above-mentioned trip, and those of a trip to the same destination
but with the unchosen mode. So the attribute values of both modes were
provided by the respondents rather than calculated from network data. The
data have 228 observations (some observations had to be discarded because
of inconsistency), each including the following items:
• mode used (rail or car)
• trip purpose
• travel cost (for both chosen mode and unchosen mode)
• in-vehicle travel time (for both chosen mode and unchosen mode)
• access and egress time (for both chosen mode and unchosen mode)
• number of transfers for rail mode
• socio-economic characteristics of the respondent (e.g., age, gender)
Variables and Descriptive Statistics
In addition to the 228 RP observations, all individuals (except two) provided
up to nine stated preference (SP) responses to hypothetical changes in
network attributes. There is a total of 1739 RP and SP observations available.
The variables in this dataset are summarized in Tables A.3, A.4 and A.5 (if
the type of data is not specified, it means that the variable appears in both
RP and SP).
Note that even though the out-of-vehicle times are obtained from the RP
survey, the same values can be used for SP because in the SP survey, respondents
referred to the trip they reported in the RP survey, and so they would have
considered out-of-vehicle time in evaluating the hypothetical alternatives.
In Table A.6, we show the descriptive statistics for some of the variables.
Note that for RP specific attributes, the descriptive statistics in Table A.6
only concern a subsample of the observations.
Name            Description                                         Data
id              Unique numerical identifier for each subject
rp              1 if the record is an RP choice,
                0 otherwise
sp              1 if the record is an SP choice,
                0 otherwise (note: rp + sp = 1)
choice          Mode choice (and setting) indicator:
                0 for auto in RP context,
                1 for rail in RP context,
                10 for auto in SP context,
                11 for rail in SP context
rp choice       Mode choice indicator for the person's actual
                choice:
                0 for auto,
                1 for rail (note: rpchoice = choice for RP
                records)
rail ivtt       In-vehicle travel time for rail (hours)
rail cost       Cost (per person) for rail (Guilders)
rail transfers  Number of transfers for rail
rp transfers    Number of rail transfers in the RP choice           RP
                (note: rail transfers = rp transfers for RP
                records)
rail comfort    Comfort level for rail in the SP exercises:         SP
                0 = least comfortable,
                1 = medium comfort,
                2 = most comfortable;
                -1 for RP records
Table A.3: Description of variables
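The coding of the choice variable combines the survey context and the chosen mode in a single number. A minimal Python sketch of how it can be decoded is given below; the function name is hypothetical, and the codes are those listed in Table A.3. The last lines use the mean of choice for SP records reported in Table A.6 (10.27) to recover the share of rail choices in the SP data.

```python
def decode_choice(choice):
    """Split the combined indicator into (context, mode) using the Table A.3 codes."""
    if choice not in (0, 1, 10, 11):
        raise ValueError(f"unexpected choice code: {choice}")
    context = "SP" if choice >= 10 else "RP"
    mode = "rail" if choice % 10 == 1 else "auto"
    return context, mode


for code in (0, 1, 10, 11):
    print(code, decode_choice(code))

# SP choices are coded 10 (auto) or 11 (rail), so their mean minus 10
# is the share of rail choices among the SP records.
sp_rail_share = 10.27 - 10
print(f"share of rail in SP records: {sp_rail_share:.2f}")
```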
Name             Description                                        Data
rp rail ovt      Access plus egress time for rail (hours) in the    RP
                 RP choice
rail acc mode    Walk access dummy for rail in the RP choice:       RP
                 1 = respondent walked to station,
                 0 = other access mode;
                 -1 for SP records
rail egr mode    Walk egress dummy for rail in the RP choice:       RP
                 1 = respondent walked from station,
                 0 = other egress mode;
                 -1 for SP records
seat status      First class dummy for rail in the RP choice:       RP
                 1 = respondent traveled in first class,
                 0 = other class(es);
                 -1 for SP records
car ivtt         In-vehicle time for auto (hours)
car cost         Cost (per person) for auto (Guilders)
rp car ovt       Out-of-vehicle time (hours) for auto in the        RP
                 RP choice
car parking fee  Free parking dummy for auto in the RP choice:      RP
                 1 = traveler can park for free,
                 0 = traveler must pay for parking;
                 -1 for SP records
purpose          Business trip dummy:
                 1 = business trip,
                 0 = other purposes
Table A.4: Description of variables
Name           Description
arrival time   Fixed arrival time dummy:
               1 = traveler must arrive at a given time,
               0 = traveler has flexibility in arrival time
gender         Gender dummy:
               1 = female,
               0 = male
npersons       Number of persons traveling together
age            Age dummy:
               1 = 41 or older,
               0 = 40 or younger
employ status  Unemployment dummy:
               1 = unemployed,
               0 = employed
mainearn       Main earner dummy:
               1 = main earner in the family,
               0 otherwise
Table A.5: Description of variables
Mean Std. Dev. Minimum Maximum
choice (RP) 0.36 0.48 0 1
choice (SP) 10.27 0.44 10 11
npersons 2.46 1.30 1 6
car ivtt 1.71 0.38 0.75 3.05
car cost 16.52 15.74 0.25 112.5
rail ivtt 2.00 0.49 0.75 4.17
rail cost 31.09 11.79 5.45 93.75
purpose 0.16 0.37 0 1
rail transfers 0.57 0.68 0 3
gender 0.45 0.50 0 1
age 0.33 0.47 0 1
employ status 0.49 0.50 0 1
mainearn 0.48 0.50 0 1
arrival time 0.39 0.49 0 1
rail acc mode 0.25 0.43 0 1
rail egr mode 0.26 0.44 0 1
seat status 0.07 0.26 0 1
car parking fee 0.65 0.48 0 1
rail comfort 0.74 0.64 0 2
rp rail ovt 0.55 0.25 0.08 1.50
rp car ovt 0.09 0.11 0 0.83
Table A.6: Descriptive statistics
A.3 Swissmetro Case
This dataset consists of survey data collected on the trains between St.
Gallen and Geneva, Switzerland, during March 1998. The respondents pro-
vided information in order to analyze the impact of the modal innovation
in transportation, represented by the Swissmetro, a revolutionary mag-lev
underground system, against the usual transport modes represented by car
and train.
Context
Innovation in the market for intercity passenger transportation is a difficult
enterprise, as the existing modes (private car, coach, rail, as well as regional
and long-distance air services) continue to innovate in their own right by
offering new combinations of speeds, services, prices and technologies. Consider
for example high-speed rail links between the major centers or direct re-
gional jet services between smaller countries. The Swissmetro SA in Geneva
is promoting such an innovation: a mag-lev underground system operating at
speeds up to 500 km/h in partial vacuum connecting the major Swiss conur-
bations, in particular along the Mittelland corridor (St. Gallen, Zurich, Bern,
Lausanne and Geneva).
Data Collection
The Swissmetro is a true innovation. It is therefore not appropriate to
base forecasts of its impact on observations of existing revealed preferences
(RP) data. It is necessary to obtain data from surveys of hypothetical
markets/situations, which include the innovation, to assess the impact. Survey
data were collected on rail-based travels, interviewing 470 respondents. Due
to data problems, only 441 are used here. Nine stated choice situations were
generated for each of the 441 respondents, offering three alternatives: rail,
Swissmetro and car (only for car owners).
A similar method for relevant car trips with a household or telephone
survey was deemed impractical. The sample was therefore constructed using
license plate observations on the motorways in the corridor by means of
video recorders. A total of 10529 relevant license plates were recorded
during September 1997. The central Swiss car license agency had agreed to send
up to 10000 owners of these cars a survey-pack. By April 1998, 9658 letters
had been mailed, of which 1758 were returned. A total of 1070 persons filled
in the survey completely and were willing to participate in the second SP
survey, which was generated using the same approach used for the rail
interviews. 750 usable SP surveys were returned from the license-plate based
survey.
Variables and Descriptive Statistics
The variables of the dataset are described in Tables A.7 and A.8, and the de-
scriptive statistics are summarized in Table A.9. A more detailed description
of the data set as well as the data collection procedure is given in Bierlaire
et al. (2001).
Variable  Description
GROUP     Different groups in the population
SURVEY    Survey performed in train (0) or car (1)
SP        It is fixed to 1 (stated preference survey)
ID        Respondent identifier
PURPOSE   Travel purpose. 1: Commuter, 2: Shopping, 3: Business,
          4: Leisure, 5: Return from work, 6: Return from
          shopping, 7: Return from business, 8: Return from
          leisure, 9: Other
FIRST     First class traveler (0 = no, 1 = yes)
TICKET    Travel ticket. 0: None, 1: Two way with half price card,
          2: One way with half price card, 3: Two way normal
          price, 4: One way normal price, 5: Half day, 6: Annual
          season ticket, 7: Annual season ticket Junior or Senior,
          8: Free travel after 7pm card, 9: Group ticket, 10: Other
WHO       Who pays (0: unknown, 1: self, 2: employer, 3: half-half)
LUGGAGE   0: none, 1: one piece, 3: several pieces
AGE       It captures the age class of individuals. The age-class
          coding scheme is of the type:
          1: age ≤ 24, 2: 24 < age ≤ 39, 3: 39 < age ≤ 54,
          4: 54 < age ≤ 65, 5: 65 < age, 6: not known
MALE      Traveler's gender. 0: female, 1: male
INCOME    Traveler's income per year [thousand CHF].
          0 or 1: under 50, 2: between 50 and 100, 3: over 100,
          4: unknown
GA        Variable capturing the effect of the Swiss annual season
          ticket for the rail system and most local public
          transport. It is 1 if the individual owns a GA, zero otherwise.
ORIGIN    Travel origin (a number corresponding to a Canton, see
          Table A.10)
Table A.7: Description of variables
Variable  Description
DEST      Travel destination (a number corresponding to a Canton,
          see Table A.10)
TRAIN AV  Train availability dummy
CAR AV    Car availability dummy
SM AV     SM availability dummy
TRAIN TT  Train travel time [minutes]. Travel times are
          door-to-door, making assumptions about car-based distances
          (1.25*crow-flight distance)
TRAIN CO  Train cost [CHF]. If the traveler has a GA, this cost
          equals the cost of the annual ticket.
TRAIN HE  Train headway [minutes].
          Example: If there are two trains per hour, the value of
          TRAIN HE is 30.
SM TT     SM travel time [minutes] considering the future
          Swissmetro speed of 500 km/h
SM CO     SM cost [CHF] calculated at the current relevant rail
          fare, without considering GA, multiplied by a fixed
          factor (1.2) to reflect the higher speed.
SM HE     SM headway [minutes].
          Example: If there are two Swissmetros per hour, the
          value of SM HE is 30.
SM SEATS  Seats configuration in the Swissmetro (dummy). Airline
          seats (1) or not (0).
CAR TT    Car travel time [minutes]
CAR CO    Car cost [CHF] considering a fixed average cost per
          kilometer (1.20 CHF/km)
CHOICE    Choice indicator. 0: unknown, 1: Train, 2: SM, 3: Car
Table A.8: Description of variables
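Several attributes in Table A.8 are derived from simple rules. The following Python sketch restates those rules; the function names and the example inputs (a fare of 50 CHF, a trip of 100 km) are hypothetical, while the factors 1.2, 1.20 CHF/km and 1.25 are the ones given in the table.

```python
def headway_minutes(departures_per_hour):
    """Headway in minutes for a given number of departures per hour."""
    return 60.0 / departures_per_hour


def swissmetro_cost(rail_fare_chf, factor=1.2):
    """SM cost: the relevant rail fare multiplied by a fixed factor."""
    return factor * rail_fare_chf


def car_distance_km(crow_flight_km, detour_factor=1.25):
    """Car-based distance assumed to be 1.25 times the crow-flight distance."""
    return detour_factor * crow_flight_km


def car_cost(distance_km, cost_per_km=1.20):
    """Car cost from a fixed average cost per kilometer (CHF/km)."""
    return cost_per_km * distance_km


print(headway_minutes(2))      # two departures per hour
print(swissmetro_cost(50.0))   # hypothetical rail fare of 50 CHF
print(car_cost(car_distance_km(80.0)))  # hypothetical crow-flight distance of 80 km
```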
Variable Min Max Mean St. Dev.
GROUP 2 3 2.63 0.48
SURVEY 0 1 0.63 0.48
SP 1 1 1.00 0.00
ID 1 1192 596.50 344.12
PURPOSE 1 9 2.91 1.15
FIRST 0 1 0.47 0.50
TICKET 1 10 2.89 2.19
WHO 0 3 1.49 0.71
LUGGAGE 0 3 0.68 0.60
AGE 1 6 2.90 1.03
MALE 0 1 0.75 0.43
INCOME 0 4 2.33 0.94
GA 0 1 0.14 0.35
ORIGIN 1 25 13.32 10.14
DEST 1 26 10.80 9.75
TRAIN AV 1 1 1.00 0.00
CAR AV 0 1 0.84 0.36
SM AV 1 1 1.00 0.00
TRAIN TT 31 1049 166.63 77.35
TRAIN CO 4 5040 514.34 1088.93
TRAIN HE 30 120 70.10 37.43
SM TT 8 796 87.47 53.55
SM CO 6 6720 670.34 1441.59
SM HE 10 30 20.02 8.16
SM SEATS 0 1 0.12 0.32
CAR TT 0 1560 123.80 88.71
CAR CO 0 520 78.74 55.26
CHOICE 1 3 2.15 0.63
Table A.9: Descriptive statistics
Number Canton
1 ZH
2 BE
3 LU
4 UR
5 SZ
6 OW
7 NW
8 GL
9 ZG
10 FR
11 SO
12 BS
13 BL
14 Schaffhausen
15 AR
16 AI
17 SG
18 GR
19 AG
20 TH
21 TI
22 VD
23 VS
24 NE
25 GE
26 JU
Table A.10: Coding of Cantons
A.4 Choice of Residential Telephone Services Case
Context
Local telephone service typically involves the choice between flat (i.e., a fixed
monthly charge for unlimited calls within a specified geographical area) and
measured (i.e., a reduced fixed monthly charge for a limited number of calls
and additional usage charges for additional calls) services. Various flat rate
services differ by the size of the geographical area within which calling is
provided at no extra charge, the monthly charge being higher for larger areas.
Measured services differ with respect to the threshold number (or dollar
value) of calls beyond which the customer is charged. The availability of
each service may depend on the geographical location within the service
area.
In developing a model of the residential demand for local telephone service,
it is necessary to explicitly account for the inter-relationship between class of
service choice and usage patterns. For example, expected usage patterns will
influence the household's choice of service option since households with high
usage levels typically could minimize their monthly bill for local telephone
service by choosing some sort of flat rate service, while households with
relatively low usage would be better off with a measured service. Given that
a household has chosen a particular service option, usage patterns would be
dependent to a certain extent upon the service option that is chosen since it
determines the marginal price of calls. To accommodate these
interrelationships, the model representing the household's choice of calling
patterns and service options needs to include:
1. choice of the service option, which is modeled conditional upon the
calling portfolio chosen by the household;
2. choice of the calling portfolio or the usage pattern as represented by
the number and duration of calls by time of day and calling band.
This case study deals only with the first choice.
Data Collection
A household survey was conducted in 1984 for a telephone company among
434 households in Pennsylvania. The dataset involves choices among five
calling plans and consists of various attributes and socio-economic
characteristics. It was originally used to develop a model system to predict
residential telephone demand (Train et al., 1987).
Variables and Descriptive Statistics
In the current application, five types of services are involved: two measured
options and three flat options. The availability of these service options varies
depending upon geographic location. Table A.11 below lists the five service
alternatives and their availability within the different service areas. Names
and definitions of the variables are shown in Table A.12. Some descriptive
statistics of the dataset are summarized in Table A.13.
Complications caused by very few respondents choosing alternative
4: If you examine the dataset, you see that only 3 of the respondents chose
alternative 4 (extended area flat service). This implies that it is not possible
to estimate numerous alternative specific coefficients for alternative 4. The
intuition is that the dataset does not provide enough information on why
people chose or did not choose alternative 4. If you try to estimate too
many alternative specific coefficients for alternative 4, you get a "Singularity
in the Hessian" error, and in order to estimate the model you have to reduce
the number of coefficients specific to alternative 4. A practical solution to
this problem is to use an enriched sample, although such a sample is not
available here. It is, however, not recommended to omit the observations for
which the chosen alternative is 4 or to combine alternative 4 with a different
alternative.
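Before adding alternative specific coefficients, it can help to count how often each alternative is chosen. The Python sketch below shows one way of flagging rarely chosen alternatives. The choice counts are made up for illustration, except that alternative 4 is chosen 3 times as stated above, and the threshold of 5 is an arbitrary assumption rather than a rule from the workbook.

```python
from collections import Counter


def sparse_alternatives(choices, alternatives, min_count):
    """Return {alternative: count} for alternatives chosen fewer than min_count times."""
    counts = Counter(choices)
    return {alt: counts.get(alt, 0) for alt in alternatives
            if counts.get(alt, 0) < min_count}


# Made-up vector of chosen alternatives; only the count for alternative 4
# (three respondents) reflects the text.
choices = [1] * 10 + [2] * 12 + [3] * 15 + [4] * 3 + [5] * 9

flagged = sparse_alternatives(choices, alternatives=[1, 2, 3, 4, 5], min_count=5)
print(flagged)
```

An alternative flagged in this way is a candidate for a more parsimonious specification, in line with the advice above to reduce the number of coefficients specific to alternative 4 rather than dropping its observations.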
Availability columns:
(A) metro, suburban and some perimeter areas
(B) other perimeter areas
(C) non-metro areas

Service option         Description                                      (A)  (B)  (C)
1. Budget measured     No fixed monthly charge; usage charges apply     yes  yes  yes
                       to each call made.
2. Standard measured   A fixed monthly charge covers up to a            yes  yes  yes
                       specified dollar amount (greater than the
                       fixed charge) of local calling, after which
                       usage charges apply to each call made.
3. Local flat          A greater monthly charge that may depend         yes  yes  yes
                       upon residential location; unlimited free
                       calling within local calling area; usage
                       charges apply to calls made outside local
                       calling area.
4. Extended area flat  A further increase in the fixed monthly          no   yes  no
                       charge to permit unlimited free calling
                       within an extended area.
5. Metro area flat     The greatest fixed monthly charge that           yes  yes  no
                       permits unlimited free calling within the
                       entire metropolitan area.
Table A.11: Service options and their availability
Name             Description
age0             number of household members under age 6
age1             number of household members age 6-12
age2             number of household members age 13-19
age3             number of household members age 20-29
age4             number of household members age 30-39
age5             number of household members age 40-54
age6             number of household members age 55-64
age7             number of household members 65 and older
area             location of household residence
                 1=metro, 2=suburban, 3=perimeter with extended,
                 4=perimeter without extended, 5=non-metro
avail1, avail2,  binary indicators of availability of each option.
avail3, avail4,  availX=0 if alternative X is not available to the
avail5           household, availX=1 if alternative X is available to
                 the household
choice           chosen alternative (dependent variable)
                 1=budget measured, 2=standard measured, 3=local
                 flat, 4=extended flat, 5=metro flat
cost1, cost2,    costX = monthly cost (in $) of alternative X.
cost3, cost4,
cost5
employ           number of household members employed
inc              annual household income
                 1=under $10,000, 2=$10,000-20,000, 3=$20,000-30,000,
                 4=$30,000-40,000, 5=over $40,000
ones             ones = 1 for all observations
status           marital status
                 1=single, 2=married, 3=widowed, 4=divorced, 5=other
users            number of phone users in household
Table A.12: Description of variables
mean max min stand dev range
age0 0.21 4 0 0.53 4
age1 0.23 3 0 0.58 3
age2 0.24 4 0 0.67 4
age3 0.41 3 0 0.71 3
age4 0.44 2 0 0.73 2
age5 0.36 2 0 0.67 2
age6 0.31 3 0 0.61 3
age7 0.38 2 0 0.65 2
area 2.93 5 1 1.65 4
avail1 1.00 1 1 0.00 0
avail2 1.00 1 1 0.00 0
avail3 1.00 1 1 0.00 0
avail4 0.03 1 0 0.17 1
avail5 0.65 1 0 0.48 1
choice 2.65 5 1 1.17 4
cost1 11.73 433.5 3.28 24.13 430.22
cost2 11.49 432.8 5.78 23.90 427.02
cost3 14.82 435.5 7.03 23.56 428.47
cost4 62.19 433.03 10.48 117.88 422.55
cost5 27.48 38.28 23.28 4.17 15
employ 1.07 3 0 0.89 3
inc 2.53 5 1 1.28 4
ones 1.00 1 1 0.00 0
status 2.22 5 1 0.91 4
users 2.30 6 1 1.28 5
Table A.13: Descriptive Statistics
A.5 Airline Itinerary Case
These data come from an Internet choice survey conducted by the Boeing
Company in the Fall of 2004. Boeing was interested in understanding the
sensitivity that air passengers have toward the attributes of an airline itinerary,
such as fare, travel time, transfers, legroom, and aircraft. The survey was
executed on a sample of the customers of an Internet airline booking service.
The Internet service takes a specific user request for travel in a city pair and
interrogates the web sites of airlines that provide service in that market,
returning to the user a compiled list of available itineraries. While that
interrogation was taking place, randomly selected customers were recruited to
be surveyed.
A typical page of the survey instrument is shown in Figure A.1. The
respondent was offered three choices based on the origin-destination market request
that the respondent entered into the itinerary search engine. The first
alternative is always a non-stop flight, the second always a flight with 1 stop on
the same airline, and the third is always a flight with 1 stop and a change
of airline. The respondent was asked to rank the available choices and was also
given the option to decline all of the stated options. Demographic data
collected included age, gender, income, occupation, and education. Situational
variables that were identified included: a) the desired departure time; b) trip
purpose; c) who is paying for the trip; and d) the number in the travel party.
All trips were for origin-destination city pairs in the United States.
There are 1633 respondents, each providing 1 SP response. Descriptions of
the available variables are reported in Tables A.14 to A.17 and some
descriptive statistics are given in Tables A.18 and A.19.
Figure A.1: Example of Survey Instrument
Variable  Description
SubjectId  Unique identifier for each respondent.
q17_Gender  1 if male, 2 if female, 99 or -1 if missing.
q15_Age  Age (1 = Less than 18 years, 2 = 18-24 years, 3 = 25-34 years,
    3.5 = 25-44 years, 4 = 35-44 years, 5 = 45-54 years, 6 = 55-64 years,
    7 = 65-74 years, 8 = 75 years or older, 99 or -1 if missing)
q19_Occupation  Occupation (01 = Executive and Managerial, 02 = Professional,
    03 = Technicians and related support, 04 = Sales,
    05 = Administrative support, 06 = Services,
    07 = Precision production, craft, repair,
    08 = Machine operators, assemblers, inspectors,
    09 = Transportation and material moving,
    10 = Handlers, cleaners, helpers, 11 = Farming, forestry, and fishing,
    12 = Armed forces, 99 or -1 if missing)
q16_Income  Annual income in 100$; -1 or 99 if income information is missing
q20_Education  Education (01 = Less than High School Diploma,
    02 = High School Graduate, 03 = Some college, No Degree,
    04 = Associate Degree - Occupational, 05 = Associate Degree - Academic,
    06 = Bachelors Degree, 07 = Masters Degree, 08 = Professional Degree,
    09 = Doctorate Degree, 99 or -1 if missing)
q11_DepartureOrArrivalIsImportant  Importance of punctuality of departure or
    arrival (1 = departure is important; 2 = arrival is important;
    otherwise, not important)
Table A.14: Description of Respondent Specific Variables
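The missing-value codes listed in Table A.14 (99 or -1) have to be recoded before these variables are used in a model, otherwise they would be read as valid category values. The following is a minimal Python sketch of that step. It assumes each observation has already been read into a dictionary keyed by column name; the helper name `clean_missing`, the underscore spelling of the column names and the sample record are illustrative assumptions, not part of the data set documentation.

```python
# Codes that Table A.14 uses to flag a missing respondent variable.
MISSING_CODES = {-1, 99}


def clean_missing(record, columns):
    """Return a copy of record where missing-value codes become None.

    Only the columns named in `columns` are inspected; other entries
    are copied unchanged.
    """
    cleaned = dict(record)
    for col in columns:
        if cleaned.get(col) in MISSING_CODES:
            cleaned[col] = None
    return cleaned


# Hypothetical respondent: gender and income missing, age category 4.
row = {"q17_Gender": 99, "q15_Age": 4, "q16_Income": -1}
print(clean_missing(row, ["q17_Gender", "q15_Age", "q16_Income"]))
```

Observations with a `None` entry can then be excluded from estimation, or handled with a separate missing-value indicator, depending on the model specification.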
Variable Description
BestAlternative_X  The chosen alternative is X
Table A.15: Description of Survey Responses
Variable  Description
q02_TripPurpose  Trip purpose (1=business, 2=leisure,
    3=attending conference/seminar/training, 4=both business and leisure,
    0=trip purpose missing)
q03_WhoPays  1 if the traveler is paying for the trip, 2 if it is his
    employer, 3 if it is a third party, 0 if missing
q12_IdealDepTime  Respondent's ideal departure time (hours after midnight),
    -1 indicates a missing value
q14_PartySize  Number of persons traveling, -1 and 99 indicate missing values
OriginGMT  Origin city time zone (minutes from GMT (Greenwich Mean Time))
DestinationGMT  Destination city time zone (minutes from GMT)
Direction  Direction of itinerary (1=East to West, 2=West to East,
    3=North-South, 0=missing)
Table A.16: Description of Trip Specific Attributes
Variable  Description
DepartureTimeHours_X  Option X: Departure time, local (hours after midnight)
ArrivalTimeHours_X  Option X: Arrival time, local (hours after midnight)
FlyingTimeHours_X  Option X: Total time in air (hours)
TripTimeHours_X  Option X: Total trip time (hours)
Legroom_X  Option X: Legroom (1 = 2 inches less than typical, 2 = typical,
    3 = 2 inches more than typical, 4 = 4 inches more than typical)
AirlineFirstFlight_X  Option X: Airline for first leg (identified only by an
    arbitrary airline number for proprietary reasons)
AirlineSecondFlight_X  Option X: Airline for second leg, if there is a second
    leg (identified only by an arbitrary airline number for proprietary
    reasons)
AirplaneFirstFlight_X  Option X: Airplane for first leg (identified only by
    an arbitrary airplane number for proprietary reasons)
AirplaneSecondFlight_X  Option X: Airplane for second leg, if there is a
    second leg (identified only by an arbitrary airplane number for
    proprietary reasons)
Fare_X  Option X: Fare ($)
Table A.17: Description of Alternative Specific Attributes where X
Corresponds to Choice Option (1), (2) and (3)
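Because departure and arrival times are given in local time, they have to be combined with the time zone variables of Table A.16 to obtain an elapsed trip time. The sketch below shows one way to do this in Python. It assumes that OriginGMT and DestinationGMT measure minutes behind GMT (so that 300 corresponds to US Eastern time and 480 to US Pacific time); this sign convention is an assumption, although it is consistent with the option 1 sample means in Table A.18 (15.21 - 11.72 + (397.34 - 382.18)/60 is approximately 3.74 hours, the reported mean trip time). The function name and the example itinerary are hypothetical.

```python
def trip_time_hours(dep_local, arr_local, origin_gmt_min, dest_gmt_min):
    """Elapsed trip time in hours from local clock times.

    Assumption: the GMT variables are minutes *behind* GMT, so that
    GMT time = local time + offset / 60.
    """
    dep_gmt = dep_local + origin_gmt_min / 60.0
    arr_gmt = arr_local + dest_gmt_min / 60.0
    return arr_gmt - dep_gmt


# Hypothetical itinerary: depart at 9:00 Eastern, arrive at 12:00 Pacific.
print(trip_time_hours(9.0, 12.0, 300, 480))  # 6.0
```

For an East to West trip the three hours gained on the clock are added back, which is why the elapsed time exceeds the difference between the two local times.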
Variable Average St. Dev. Min Max
SubjectId 1807.50 1043.41 1.00 3613.00
q17_Gender 1.46 0.50 1.00 2.00
q15_Age 3.95 1.15 1.00 8.00
q19_Occupation 2.54 1.90 1.00 12.00
q16_Income 8.09 3.53 1.00 14.00
q20_Education 5.88 1.71 1.00 9.00
q02_TripPurpose 2.04 0.76 1.00 4.00
q03_WhoPays 1.20 0.46 1.00 3.00
q14_PartySize 1.70 0.99 1.00 5.00
OriginGMT 382.18 82.08 300.00 480.00
DestinationGMT 397.34 82.87 300.00 480.00
Direction 1.59 0.49 1.00 2.00
BestAlternative_1 0.69 0.46 0.00 1.00
BestAlternative_2 0.16 0.37 0.00 1.00
DepartureTimeHours_1 11.72 3.34 6.00 18.00
ArrivalTimeHours_1 15.21 3.35 7.67 21.63
FlyingTimeHours_1 3.74 1.59 0.67 6.35
TripTimeHours_1 3.74 1.59 0.67 6.35
Legroom_1 2.46 1.12 1.00 4.00
AirlineFirstFlight_1 4.61 2.56 1.00 11.00
AirlineSecondFlight_1 0.00 0.00 0.00 0.00
AirplaneFirstFlight_1 4.52 2.30 1.00 8.00
AirplaneSecondFlight_1 0.00 0.00 0.00 0.00
Fare_1 405.66 199.87 80.00 1330.00
Table A.18: Descriptive Statistics of Variables
Variable Average St. Dev. Min Max
DepartureTimeHours_2 11.67 3.35 6.00 18.00
ArrivalTimeHours_2 16.92 3.36 9.17 24.10
FlyingTimeHours_2 4.24 1.59 1.17 6.85
TripTimeHours_2 5.50 1.68 1.83 8.85
Legroom_2 2.48 1.13 1.00 4.00
AirlineFirstFlight_2 4.68 2.65 1.00 11.00
AirlineSecondFlight_2 0.00 0.00 0.00 0.00
AirplaneFirstFlight_2 4.51 2.29 1.00 8.00
AirplaneSecondFlight_2 0.00 0.00 0.00 0.00
Fare_2 407.07 200.96 80.00 1390.00
DepartureTimeHours_3 11.66 3.34 6.00 18.00
ArrivalTimeHours_3 16.89 3.41 9.25 24.03
FlyingTimeHours_3 4.24 1.59 1.17 6.85
TripTimeHours_3 5.48 1.67 1.92 8.85
Legroom_3 2.53 1.13 1.00 4.00
AirlineFirstFlight_3 4.65 2.59 1.00 11.00
AirlineSecondFlight_3 4.65 2.65 1.00 11.00
AirplaneFirstFlight_3 4.50 2.31 1.00 8.00
AirplaneSecondFlight_3 4.50 2.28 1.00 8.00
Fare_3 405.20 197.68 80.00 1275.00
Table A.19: Descriptive Statistics of Variables
A.6 Facial Expressions Recognition Case
. . . the face is the most extraordinary communicator, capable
of accurately signaling emotion in a bare blink of a second, capable
of concealing emotion equally well. . .
Deborah Blum
These data come from an ongoing Internet survey called the EPFL Facial
Expressions Evaluation Survey available at:
http://lts5www.epfl.ch/face
The goal is to collect a dataset of observations from a heterogeneous group
of respondents in order to investigate which human factors play a role in
the perception of human expressions. The goal is also to understand which
facial parts are important and what their impact is on the expression
recognition task performed by different people.
During the survey each respondent is asked to associate 23 images of facial
expressions with one out of seven proposed alternatives:
1. happiness;
2. surprise;
3. fear;
4. disgust;
5. sadness;
6. anger; or
7. neutral.
In the beginning of the survey the respondent is also asked to provide the
socio-economic characteristics described in Table A.20.
The images used in the survey come from the Cohn-Kanade database (Kanade
et al., 2000). Examples are shown in Figure A.4. This database consists of
expression sequences of persons (for clarity called subjects), starting from
a neutral expression and ending most of the time in the peak of the facial
expression. The subjects are university students enrolled in introductory
psychology classes. They ranged in age from 18 to 30 years. Subjects were
instructed by an experimenter to perform a series of 23 facial displays. Six
of the displays were based on descriptions of prototypic emotions:
happiness, anger, fear, disgust, sadness and surprise. There are 104
subjects in the database but only 10 of them (8 women and 2 men) gave their
consent for publication. The subset of the Cohn-Kanade database used in this
survey consists of the 1274 images of these 10 subjects.
Figure A.2: List of the AUs related to the 6 primary expressions
In the field of automated facial expression recognition systems, the Facial
Action Coding System (FACS) has become the leading standard for measuring
facial expressions. FACS is a system originally developed by Paul Ekman
and Wallace Friesen (see Ekman and Friesen, 1978). It is a human-observer
based system designed to detect subtle changes in facial features. It
defines expressions as a combination of a subset of the 46 Action Units
(AUs), which correspond to the contraction or relaxation of one or several
muscles. Figure A.2 shows the subset of AUs that are related to the six
prototypic expressions.
In this case study we consider the measures related to mouth and eyes, see
Table A.21 and Figure A.3 for descriptions of the attributes. Some
statistics are reported in Table A.22.
In the following sections we provide an example specification of a logit
model.
Variable  Description
UserID  Unique identifier for each participant.
UserLocation  Location for the current survey (01 = Home, 02 = Work,
    03 = Other)
UserGender  1 if male, 0 otherwise
UserBirthDate  Age in years, 0 if no information
UserOccupation  Occupation (00 = None, 01 = Medical, 02 = Educational,
    03 = Management, 04 = Scientific, 05 = Engineering, 06 = Technical,
    07 = Rural, 08 = Other)
UserFormation  Education (04 = High School, 05 = University, 06 = PhD,
    07 = Other)
UserEthnic  Ethnic (00 = None, 01 = White, 02 = Black, 03 = Asian,
    04 = Mixed White-Black, 05 = Mixed White-Asian, 06 = Mixed Asian-Black,
    07 = Other)
UserRegion  Continent (00 = None, 01 = Africa, 02 = Antarctica, 03 = Asia,
    04 = Australia, 05 = Europe, 06 = North America, 07 = South America)
UserScienceKW  Participant scientific knowledge (00 = None,
    02 = Behavioral Science, 03 = Social Science, 04 = Computer Science,
    05 = Cognitive Science, 06 = Other)
UserLanguage  Survey language (01 = French, 02 = English, 03 = Italian)
Table A.20: Description of Participant Socio-Economic Variables
Figure A.3: Facial measures: width and height of left and right eye and
mouth
Variable Description
Choice Choice indicator (1 = happiness, 2 = surprise,
3 = fear, 4 = disgust, 5 = sadness, 6 = anger,
7 = neutral)
mouth_w mouth width, normalized pixel measure
mouth_h mouth height, normalized pixel measure
leye_w left eye width, normalized pixel measure
leye_h left eye height, normalized pixel measure
reye_w right eye width, normalized pixel measure
reye_h right eye height, normalized pixel measure
Table A.21: Description of Variables
Figure A.4: Examples of images in the database
Variable Mean St. Dev. Min Max
Choice 3.84 2.07 1.00 7.00
UserID 902.74 484.29 59.00 1713.00
UserLocation 1.50 0.58 1.00 3.00
UserGender 0.58 0.49 0.00 1.00
UserBirthDate 1977.23 9.12 1935 2001
UserJob 3.89 2.52 0.00 8.00
UserEthnic 1.14 1.08 0.00 7.00
UserRegion 4.66 1.40 0.00 7.00
UserScienceKW 3.48 2.26 0.00 6.00
UserLanguage 1.64 0.75 1.00 3.00
mouth_w 0.14 0.02 0.10 0.20
mouth_h 0.07 0.04 0.02 0.22
leye_w 0.07 0.01 0.06 0.09
leye_h 0.03 0.01 0.02 0.05
reye_w 0.07 0.01 0.06 0.09
reye_h 0.04 0.01 0.01 0.06
Table A.22: Descriptive Statistics of Variables
Example of Model Specification
In this section we describe a logit model specification. The corresponding
Biogeme model file is called MNL_exp_fm.mod and the data file
expressions.dat.
The deterministic utility functions include alternative specific constants
for all alternatives except neutral, which has been selected as the
reference. We also estimate alternative specific coefficients related to
eye height (average of the left and right eye height), mouth width and mouth
height. Note that these coefficients cannot be included in all utilities:
the facial measures take the same value for every alternative, so only
differences with respect to the reference alternative can be identified.
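\begin{align*}
V_{\text{neutral}} &= 0 \\
V_{\text{anger}} &= \text{ASC}_{A} + \beta_{\text{mouth\_height\_A}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_A}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_A}}\,\text{eyes\_height} \\
V_{\text{disgust}} &= \text{ASC}_{D} + \beta_{\text{mouth\_height\_D}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_D}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_D}}\,\text{eyes\_height} \\
V_{\text{fear}} &= \text{ASC}_{F} + \beta_{\text{mouth\_height\_F}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_F}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_F}}\,\text{eyes\_height} \\
V_{\text{happiness}} &= \text{ASC}_{H} + \beta_{\text{mouth\_height\_H}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_H}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_H}}\,\text{eyes\_height} \\
V_{\text{sadness}} &= \text{ASC}_{SA} + \beta_{\text{mouth\_height\_SA}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_SA}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_SA}}\,\text{eyes\_height} \\
V_{\text{surprise}} &= \text{ASC}_{SU} + \beta_{\text{mouth\_height\_SU}}\,\text{mouth\_height} + \beta_{\text{mouth\_width\_SU}}\,\text{mouth\_width} + \beta_{\text{eyes\_height\_SU}}\,\text{eyes\_height}
\end{align*}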
The estimation results are shown in Table A.23. All the explanatory
variables have their expected signs. For example,
$\beta_{\text{eyes\_height\_A}}$ and $\beta_{\text{mouth\_height\_A}}$ have
negative values, meaning that the utility of the anger alternative increases
if the heights of the mouth and eyes are smaller than in the neutral
expression. The coefficients related to the surprise alternative,
$\beta_{\text{mouth\_height\_SU}}$ and $\beta_{\text{mouth\_width\_SU}}$,
show as expected that the utility of this alternative increases when the
mouth height is larger and the mouth width smaller than in the neutral
expression. Images illustrating the interpretation of these coefficients are
shown in Figure A.5.
Figure A.5: Estimated parameters interpretation: First row Anger and second
row Surprise
Logit model estimation
Parameter  Parameter          Parameter  Robust          Robust
number     name               estimate   standard error  t statistic
1          ASC_A              3.96       0.897           4.41
2          ASC_D              8.23       0.845           9.74
3          ASC_F              -7.36      0.931           -7.90
4          ASC_H              -16.9      1.44            -11.73
5          ASC_SA             3.88       0.801           4.84
6          ASC_SU             -0.895     0.930           -0.96
7          β_eyes_height_A    -58.3      11.8            -4.93
8          β_eyes_height_D    -146       12.8            -11.46
9          β_eyes_height_F    -25.8      12.0            -2.15
10         β_eyes_height_H    -53.6      12.3            -4.35
11         β_eyes_height_SA   -3.93      11.4            -0.35
12         β_eyes_height_SU   -37.9      11.4            -3.34
13         β_mouth_height_A   -35.5      9.28            -3.82
14         β_mouth_height_D   10.3       4.03            2.56
15         β_mouth_height_F   57.0       4.66            12.23
16         β_mouth_height_H   17.2       8.48            2.03
17         β_mouth_height_SA  -46.6      6.91            -6.74
18         β_mouth_height_SU  66.7       4.97            13.41
19         β_mouth_width_A    -5.14      5.89            -0.87
20         β_mouth_width_D    -24.8      4.54            -5.46
21         β_mouth_width_F    29.1       5.52            5.27
22         β_mouth_width_H    119        9.42            12.6
23         β_mouth_width_SA   -13.6      5.44            -2.50
24         β_mouth_width_SU   -18.9      5.77            -3.27
Summary statistics
Number of observations = 2889
$L(0) = -5621.73$
$L(\hat{\beta}) = -3721.22$
$\bar{\rho}^2 = 0.334$
Table A.23: Estimation results for the logit model
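To illustrate how the estimates in Table A.23 translate into choice probabilities, the following Python sketch evaluates the seven utility functions for one image and applies the logit formula. The parameter values are copied from Table A.23; the function name `choice_probabilities` and the example measures (roughly the sample means of Table A.22) are illustrative assumptions. It is a sketch of the logit calculation, not a substitute for the Biogeme model file.

```python
import math

# Estimates copied from Table A.23, ordered as
# (ASC, eyes_height coefficient, mouth_height coefficient, mouth_width coefficient).
PARAMS = {
    "anger":     (3.96,   -58.3,  -35.5,  -5.14),
    "disgust":   (8.23,   -146.0,  10.3,  -24.8),
    "fear":      (-7.36,  -25.8,   57.0,   29.1),
    "happiness": (-16.9,  -53.6,   17.2,   119.0),
    "sadness":   (3.88,   -3.93,  -46.6,  -13.6),
    "surprise":  (-0.895, -37.9,   66.7,  -18.9),
}


def choice_probabilities(leye_h, reye_h, mouth_h, mouth_w):
    """Logit probabilities of the seven expressions for one image."""
    eyes_h = 0.5 * (leye_h + reye_h)  # average of left and right eye height
    utilities = {"neutral": 0.0}      # reference alternative
    for alt, (asc, b_eye, b_mh, b_mw) in PARAMS.items():
        utilities[alt] = asc + b_eye * eyes_h + b_mh * mouth_h + b_mw * mouth_w
    v_max = max(utilities.values())   # subtract the maximum for numerical stability
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}


# Illustrative image with measures close to the sample means of Table A.22.
for alt, p in choice_probabilities(0.03, 0.04, 0.07, 0.14).items():
    print(f"{alt:10s} {p:.3f}")
```

Re-running the function with a larger mouth height shifts probability toward surprise, in line with the sign of the corresponding coefficient.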
A.7 Italy Mode Choice Case
Context
Cagliari is the capital of Sardinia, Italy. With a metropolitan area of more
than 450,000 inhabitants, it contains one third of the island's population.
Although Sardinia's railway system can be described as a suburban service,
the last 20 km, overlapping the Cagliari-Assemini corridor, actually passes
through an urban area. The area under study, to the north of Cagliari, had
20,490 inhabitants at the time of the study in 1998, and generated about
10,000 trips a day in the corridor of interest. Of these, 75% use car, 20%
go by bus, 3% by train and 2% by other modes. More than 80% of the work and
other-purpose trips are made by car. To prevent rail demand in the corridor
from falling further (it has dropped to 3%), the local rail authority
decided to upgrade the service into a metropolitan-like commuter train
service, increasing not only speed and frequency, but also the number of
stations inside this corridor.
In order to analyze the impact of a potential new train system three types
of surveys were conducted: a qualitative survey using focus groups to gain a
good understanding of the phenomenon, a revealed preference (RP) survey
describing current trips, and a stated preference (SP) survey to evaluate the
introduction of radical improvements to the existing alternative.
A description of the data as well as the modeling results are reported in
Cherchi and Ortuzar, 2002. The RP data concern choice between car, bus
and train; the SP data consider the binary choice between a new train service
(quicker, more frequent, with a lower fare and more stations than the current
one) and the alternative currently chosen by car and bus users.
Data Collection
The data were collected in 1998. First, the RP survey collected data on
travellers' actual trips and socio-economic characteristics in the form of a
24-hour travel diary.
Households were randomly selected from the telephone directory and each
member of the family over the age of 12 was asked to participate. With
a response rate of 83%, a total of 524 responses were obtained, yielding a
total of 1840 reported trips. Of these trips, only 748 observations actually
referred to the corridor of interest. After testing the consistency and
validity of the data for mode choice modeling (only people with an actual
modal choice among car, bus and train were considered), a final sample of
338 observations was left for model estimation.
Then, an SP survey was applied to 90% of the 338 individuals already
interviewed. The SP survey was designed as a stated choice experiment
between a proposed new train service and the mode currently used, with
customized attribute levels (a percentage variation of the values declared
in the RP survey). The design for bus users considered four variables at
three levels each: travel time, cost, frequency and comfort. The design for
car users included three variables at three levels each: travel time, cost
and frequency; while comfort was modeled by one two-level variable. Each
respondent provided 9 choices, according to the situations defined by an
appropriate block design. After validation a total of 1396 mixed RP/SP
observations were obtained.
Variables and Descriptive Statistics
We use a sub-sample of 1152 observations where the first 318 correspond to
the RP data and the remaining 834 to the SP data. The available variables
are described in Table A.24 and some descriptive statistics in Table A.25.
Variable Description
ch Mode Choice. 1: Train RP, 2: Car RP, 3: Bus RP, 4:
Train SP, 5: Car SP, 6: Bus SP
av1 Train RP availability dummy
av2 Car RP availability dummy
av3 Bus RP availability dummy
av4 Train SP availability dummy
av5 Car SP availability dummy
av6 Bus SP availability dummy
tt_t Train travel time [minutes]
tt_c Car travel time [minutes]
tt_b Bus travel time [minutes]
wt_t Train walking time [minutes]
wt_c Car walking time [minutes]
wt_b Bus walking time [minutes]
c_t Train travel cost [Euro]
c_c Car travel cost [Euro]
c_b Bus travel cost [Euro]
tr_t Number of train transfers
tr_b Number of bus transfers
frq_t Number of trains in an interval of time (60 min)
frq_b Number of buses in an interval of time (60 min)
cw_t Train low comfort dummy
cfav_t Train average comfort dummy
cw_b Bus low comfort dummy
cfav_b Bus average comfort dummy
car_lic Number of cars available in each household divided by
the number of members with driving licenses
id Actual survey ID
rp RP responses dummy
sp SP responses dummy
Table A.24: Description of variables
Mean Std. Dev. Minimum Maximum
ch 4.00 1.26 1.00 6.00
av1 0.25 0.44 0.00 1.00
av2 0.18 0.38 0.00 1.00
av3 0.28 0.45 0.00 1.00
av4 0.72 0.45 0.00 1.00
av5 0.37 0.48 0.00 1.00
av6 0.35 0.48 0.00 1.00
tt_t 19.52 7.45 0.00 55.00
tt_c 12.32 12.49 0.00 80.00
tt_b 16.61 15.25 0.00 70.00
wt_t 20.03 8.84 0.00 62.00
wt_c 1.34 4.31 0.00 40.00
wt_b 9.02 9.29 0.00 45.00
c_t 1.15 0.72 0.00 2.58
c_c 1.13 1.14 0.00 7.18
c_b 0.31 0.41 0.00 2.01
tr_t 0.49 0.51 0.00 2.00
tr_b 0.24 0.44 0.00 2.00
frq_t 5.24 2.38 0.00 12.00
frq_b 3.56 2.97 0.00 12.00
cw_t 0.12 0.32 0.00 1.00
cfav_t 0.53 0.50 0.00 1.00
cw_b 0.57 0.50 0.00 1.00
cfav_b 0.06 0.23 0.00 1.00
car_lic 0.20 0.36 0.00 1.00
id 311.54 108.53 1.00 420.00
rp 0.28 0.45 0.00 1.00
sp 0.72 0.45 0.00 1.00
Table A.25: Descriptive statistics
Example of RP/SP Logit Model Specification
In this section, we describe logit model specifications for the RP data
alone, SP data alone, and combined RP/SP data. The corresponding Biogeme
model files are called mnl-RP.mod, mnl-SP.mod and mnl-RPSP.mod,
respectively. The data file is called italy.dat.
The deterministic utility functions for the RP alternatives include travel
times, walking times, travel costs, frequency for train and bus, and number
of transfers for train and bus. For car we include the variable car_lic
(ratio between the number of cars in the household and the number of members
with a driving license). We also include constants for all alternatives
except bus, which is arbitrarily chosen as the reference.
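\begin{align*}
V_{\text{train\_RP}} &= \text{ASC}_{\text{train\_RP}} + \beta_{\text{tt}}\,\text{tt\_t} + \beta_{\text{wt}}\,\text{wt\_t} + \beta_{\text{c}}\,\text{c\_t} + \beta_{\text{frq}}\,\text{frq\_t} + \beta_{\text{tr}}\,\text{tr\_t} \\
V_{\text{car\_RP}} &= \text{ASC}_{\text{car\_RP}} + \beta_{\text{tt}}\,\text{tt\_c} + \beta_{\text{wt}}\,\text{wt\_c} + \beta_{\text{c}}\,\text{c\_c} + \beta_{\text{carlic}}\,\text{car\_lic} \\
V_{\text{bus\_RP}} &= \beta_{\text{tt}}\,\text{tt\_b} + \beta_{\text{wt}}\,\text{wt\_b} + \beta_{\text{c}}\,\text{c\_b} + \beta_{\text{frq}}\,\text{frq\_b} + \beta_{\text{tr}}\,\text{tr\_b}
\end{align*}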
The estimation results are reported in Table A.26. All the coefficients of
the explanatory variables have their expected signs. Note from the constants
that, when the remaining deterministic utilities are equal, the car and bus
alternatives are preferred over the train alternative.
We use the same deterministic utility functions for the SP alternatives,
except that we add the variables related to comfort and remove the car
license variable. From the estimation results reported in Table A.27 we
note that the interpretation of the coefficients of the explanatory
variables remains the same and the new coefficients have intuitive signs.
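\begin{align*}
V_{\text{train\_SP}} &= \text{ASC}_{\text{train\_SP}} + \beta_{\text{tt}}\,\text{tt\_t} + \beta_{\text{wt}}\,\text{wt\_t} + \beta_{\text{c}}\,\text{c\_t} + \beta_{\text{frq}}\,\text{frq\_t} + \beta_{\text{tr}}\,\text{tr\_t} + \beta_{\text{cf1}}\,\text{cw\_t} + \beta_{\text{cf2}}\,\text{cfav\_t} \\
V_{\text{car\_SP}} &= \text{ASC}_{\text{car\_SP}} + \beta_{\text{tt}}\,\text{tt\_c} + \beta_{\text{wt}}\,\text{wt\_c} + \beta_{\text{c}}\,\text{c\_c} \\
V_{\text{bus\_SP}} &= \beta_{\text{tt}}\,\text{tt\_b} + \beta_{\text{wt}}\,\text{wt\_b} + \beta_{\text{c}}\,\text{c\_b} + \beta_{\text{frq}}\,\text{frq\_b} + \beta_{\text{tr}}\,\text{tr\_b} + \beta_{\text{cf1}}\,\text{cw\_b} + \beta_{\text{cf2}}\,\text{cfav\_b}
\end{align*}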
RP Logit Model Estimation
Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_train_RP  -0.997     0.594           -1.68
2          ASC_car_RP    -0.276     1.02            -0.27
3          β_c           -0.972     0.283           -3.43
4          β_carlic      5.46       1.64            3.33
5          β_frq         0.203      0.129           1.58
6          β_tr          -0.757     0.497           -1.52
7          β_tt          -0.0390    0.0189          -2.06
8          β_wt          -0.102     0.0313          -3.26
Summary statistics
Number of observations = 318
$L(0) = -294.22$
$L(\hat{\beta}) = -86.720$
$\bar{\rho}^2 = 0.678$
Table A.26: RP data: Estimation results for a logit model
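The goodness-of-fit statistic in the summary of Table A.26 can be checked from the two log-likelihood values and the number of estimated parameters. The sketch below assumes that the reported statistic is the adjusted likelihood ratio index, $\bar{\rho}^2 = 1 - (L(\hat{\beta}) - K)/L(0)$, where $K$ is the number of parameters; this reproduces the reported value, but the formula is not stated in this section. Log-likelihoods are entered as negative numbers, and the function name is hypothetical.

```python
def adjusted_rho_squared(ll_null, ll_final, n_params):
    """Adjusted likelihood ratio index: 1 - (L(beta_hat) - K) / L(0).

    ll_null and ll_final are log-likelihoods (negative numbers);
    n_params is the number of estimated parameters K.
    """
    return 1.0 - (ll_final - n_params) / ll_null


# RP logit model of Table A.26: L(0) = -294.22, L(beta_hat) = -86.720, K = 8.
print(round(adjusted_rho_squared(-294.22, -86.720, 8), 3))  # 0.678
```

The same function applied to the SP logit model (834 observations, 9 parameters) gives 0.106, the value reported in Table A.27.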
SP Logit Estimation
Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_train_SP  -0.756     0.260           -2.91
2          ASC_car_SP    0.0426     0.421           0.10
3          β_c           -1.37      0.355           -3.86
4          β_cf1         -2.03      0.278           -7.32
5          β_cf2         -1.07      0.173           -6.16
6          β_frq         0.226      0.036           6.21
7          β_tr          -0.395     0.213           -1.85
8          β_tt          -0.0689    0.0152          -4.55
9          β_wt          -0.0233    0.0118          -1.97
Summary statistics
Number of observations = 834
$L(0) = -578.085$
$L(\hat{\beta}) = -507.646$
$\bar{\rho}^2 = 0.106$
Table A.27: SP data: Estimation results for a logit model
Recall that the utility functions for the RP model are defined as
\[ U^{RP} = V^{RP} + \varepsilon^{RP}, \]
and those of the SP model as
\[ U^{SP} = V^{SP} + \varepsilon^{SP}. \]
Previously we presented estimation results of the RP and SP models
separately. We now want to perform a joint estimation of both models, that
is, an RP/SP model, where the coefficients common to both models (travelT,
walkT, cost and frq) are estimated based on both datasets. In order to
do so, the variances of the error terms must be the same. We therefore
introduce a scale parameter $\mu$ and assume that
\[ \operatorname{Var}(\varepsilon^{RP}) = \operatorname{Var}(\mu\,\varepsilon^{SP}) = \mu^{2}\,\operatorname{Var}(\varepsilon^{SP}). \]
The utilities for the RP and SP models can now be rewritten as
\begin{align*}
U^{RP} &= V^{RP} + \varepsilon^{RP} \\
\mu U^{SP} &= \mu V^{SP} + \mu\,\varepsilon^{SP}
\end{align*}
and the error terms ($\varepsilon^{RP}$ and $\mu\,\varepsilon^{SP}$) of both
models have the same variance. Assume that
$V^{SP}_{in} = \beta X^{SP}_{in}$ is a linear-in-parameters specification.
Then $\mu V^{SP}_{in} = \mu\beta X^{SP}_{in}$, where both $\mu$ and $\beta$
are to be estimated.
The estimation results of the combined model are shown in Table A.28. All
the parameters have the expected sign, but the parameter scaling the SP
alternatives is not significantly different from one. Accordingly, we
cannot reject the hypothesis that the RP and the SP error terms have the
same variance.
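The conclusion about the SP scale rests on two t statistics: one testing the estimate against zero and one testing it against one. The following Python sketch reproduces both from the scale estimate and robust standard error reported in Table A.28 (0.922 and 0.348). The function name `t_statistic` is hypothetical, and the 1.96 threshold mentioned afterwards is the usual 5% two-sided critical value, which is an assumption about the test level used.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t statistic of an estimate against a hypothesized value."""
    return (estimate - null_value) / std_error


scale, robust_se = 0.922, 0.348  # SP scale parameter, Table A.28

print(round(t_statistic(scale, robust_se), 2))       # 2.65, test against 0
print(round(t_statistic(scale, robust_se, 1.0), 2))  # -0.22, test against 1
```

Since the absolute value of the second statistic is well below 1.96, a scale of one, and therefore equal RP and SP error variances, cannot be rejected.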
Combined RP/SP Logit Model Estimation
Parameter  Parameter     Parameter  Robust     Robust     Robust
number     name          estimate   st. error  t stat. 0  t stat. 1
1          ASC_train_RP  -1.43      0.437      -3.27
2          ASC_car_RP    0.298      0.837      0.36
3          ASC_train_SP  -0.659     0.288      -2.29
4          ASC_car_SP    -0.422     0.526      -0.80
5          β_c           -1.13      0.268      -4.22
6          β_carlic      5.43       1.59       3.42
7          β_cf1         -2.18      0.812      -2.69
8          β_cf2         -1.15      0.462      -2.50
9          β_frq         0.231      0.0790     2.93
10         β_tr          -0.471     0.245      -1.92
11         β_tt          -0.0644    0.0152     -4.23
12         β_wt          -0.0444    0.0236     -1.88
13         μ             0.922      0.348      2.65       -0.22
Summary statistics
Number of observations = 1152
$L(0) = -872.30$
$L(\hat{\beta}) = -599.798$
$\bar{\rho}^2 = 0.297$
Table A.28: Combined RP/SP data: Estimation results for a logit model
Example of NL Model Specification
In this section, we describe an example of an NL model specification for the
RP data alone, and for the combined RP/SP data. The corresponding Biogeme
model files are called nl-RP.mod and nl-RPSP.mod, respectively.
We postulate a nested structure for the public transport RP alternatives
(train and bus). This correlation structure implies the estimation of the
scale parameters describing each nest. The estimation results are shown in
Tables A.29 and A.30, which also show the additional parameters
$\mu_{\text{car}}$ for the car-only nest and $\mu_{\text{pub}}$ for the
public transport nest. Note that in the case of the mixed RP/SP estimation
we have to include one-alternative nests for the SP alternatives, in the
same way as we do for the RP car alternative.
RP NL estimation
Parameter  Parameter     Parameter  Standard  t        t
number     name          estimate   error     stat. 0  stat. 1
1          ASC_train_RP  0.0165     0.246     0.07
2          ASC_car_RP    -0.290     0.831     -0.35
3          β_c           -0.738     0.216     -3.42
4          β_carlic      4.34       1.18      3.69
5          β_frq         0.103      0.0629    1.63
6          β_tr          -0.616     0.315     -1.95
7          β_tt          -0.0201    0.0133    -1.52
8          β_wt          -0.0578    0.0181    -3.19
9          μ_car         5.65       2.14      2.64     2.17
10         μ_pub         5.65       2.14      2.64     2.17
Summary statistics
Number of observations = 318
$L(0) = -294.22$
$L(\hat{\beta}) = -76.34$
$\bar{\rho}^2 = 0.707$
Table A.29: RP data: Estimation results for a NL model
In both cases the specified nested structure is supported by the estimated
nesting parameters and their t statistics.
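To make the role of the nest parameters concrete, the sketch below computes nested logit choice probabilities for a public transport nest (train, bus) and a car-only nest. It uses the textbook two-level formulation with the scale of the upper level normalized to one; whether this matches the normalization used in the Biogeme model files is an assumption. The function name and the utility values are hypothetical and are not taken from the estimation results.

```python
import math


def nested_logit_probabilities(utilities, nests, mu_nest, mu_root=1.0):
    """Two-level nested logit probabilities.

    utilities: {alternative: V}, nests: {nest: [alternatives]},
    mu_nest: {nest: scale parameter of the nest}.
    """
    # Sum of exp(mu_m * V) within each nest, and the resulting logsum.
    inner = {m: sum(math.exp(mu_nest[m] * utilities[a]) for a in alts)
             for m, alts in nests.items()}
    logsum = {m: math.log(inner[m]) / mu_nest[m] for m in nests}
    denom = sum(math.exp(mu_root * logsum[m]) for m in nests)

    probs = {}
    for m, alts in nests.items():
        p_nest = math.exp(mu_root * logsum[m]) / denom      # P(nest)
        for a in alts:
            p_cond = math.exp(mu_nest[m] * utilities[a]) / inner[m]  # P(alt | nest)
            probs[a] = p_nest * p_cond
    return probs


# Hypothetical deterministic utilities, for illustration only.
V = {"train": -1.0, "bus": -0.8, "car": -0.5}
nests = {"pub": ["train", "bus"], "car": ["car"]}
print(nested_logit_probabilities(V, nests, {"pub": 2.0, "car": 1.0}))
```

With all nest parameters equal to one the function returns the ordinary logit probabilities, which is why the tables report a t statistic of each nest parameter against one. For a nest containing a single alternative the nest parameter has no effect on the probabilities.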
Combined RP/SP NL estimation
Parameter  Parameter     Parameter  Standard  t        t
number     name          estimate   error     stat. 0  stat. 1
1          ASC_train_RP  -0.192     0.172     -1.12
2          ASC_car_RP    0.0673     0.758     0.09
3          ASC_train_SP  -0.345     0.195     -1.77
4          ASC_car_SP    -0.341     0.305     -1.12
5          β_c           -0.815     0.217     -3.76
6          β_carlic      4.47       1.18      3.78
7          β_cf1         -1.42      0.476     -2.99
8          β_cf2         -0.765     0.271     -2.82
9          β_frq         0.141      0.0459    3.08
10         β_tr          -0.382     0.167     -2.28
11         β_tt          -0.0380    0.0124    -3.07
12         β_wt          -0.0335    0.0119    -2.82
13         μ             1.38       0.447     3.09     0.86
14         μ_car         4.99       1.84      2.71     2.17
15         μ_pub         4.99       1.84      2.71     2.17
16         μ_SP_train    4.99       1.84      2.71     2.17
17         μ_SP_car      4.99       1.84      2.71     2.17
18         μ_SP_bus      4.99       1.84      2.71     2.17
Summary statistics
Number of observations = 1152
$L(0) = -872.30$
$L(\hat{\beta}) = -590.90$
$\bar{\rho}^2 = 0.302$
Table A.30: Combined RP/SP data: Estimation results for a NL model
Example of Agent Effect Model Specification
Since there are several observations for each individual in the SP survey
(panel data), we describe examples of different model specifications
accounting for the intrinsic correlation among observations of the same
individual. The corresponding Biogeme model files are called
mnl-SP_agentEffect.mod for an SP logit model, mnl-RPSP_agentEffect.mod for a
mixed RP/SP logit model and nl-RPSP_agentEffect.mod for the mixed RP/SP
nested logit.
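In order to account for an agent effect for the SP repeated choices (each
respondent provided 9 choices) we consider an additive common random
term affecting the utilities:
\begin{align*}
V_{\text{train\_RP}} &= \text{ASC}_{\text{train\_RP}} + \beta_{\text{tt}}\,\text{tt\_t} + \beta_{\text{wt}}\,\text{wt\_t} + \beta_{\text{c}}\,\text{c\_t} + \beta_{\text{frq}}\,\text{frq\_t} + \beta_{\text{tr}}\,\text{tr\_t} + \xi_{\text{panel}} \\
V_{\text{car\_RP}} &= \text{ASC}_{\text{car\_RP}} + \beta_{\text{tt}}\,\text{tt\_c} + \beta_{\text{wt}}\,\text{wt\_c} + \beta_{\text{c}}\,\text{c\_c} + \beta_{\text{carlic}}\,\text{car\_lic} \\
V_{\text{bus\_RP}} &= \beta_{\text{tt}}\,\text{tt\_b} + \beta_{\text{wt}}\,\text{wt\_b} + \beta_{\text{c}}\,\text{c\_b} + \beta_{\text{frq}}\,\text{frq\_b} + \beta_{\text{tr}}\,\text{tr\_b} + \xi_{\text{panel}} \\
V_{\text{train\_SP}} &= \text{ASC}_{\text{train\_SP}} + \beta_{\text{tt}}\,\text{tt\_t} + \beta_{\text{wt}}\,\text{wt\_t} + \beta_{\text{c}}\,\text{c\_t} + \beta_{\text{frq}}\,\text{frq\_t} + \beta_{\text{tr}}\,\text{tr\_t} + \beta_{\text{cf1}}\,\text{cw\_t} + \beta_{\text{cf2}}\,\text{cfav\_t} + \xi_{\text{panel}} \\
V_{\text{car\_SP}} &= \text{ASC}_{\text{car\_SP}} + \beta_{\text{tt}}\,\text{tt\_c} + \beta_{\text{wt}}\,\text{wt\_c} + \beta_{\text{c}}\,\text{c\_c} + \beta_{\text{carlic}}\,\text{car\_lic} \\
V_{\text{bus\_SP}} &= \beta_{\text{tt}}\,\text{tt\_b} + \beta_{\text{wt}}\,\text{wt\_b} + \beta_{\text{c}}\,\text{c\_b} + \beta_{\text{frq}}\,\text{frq\_b} + \beta_{\text{tr}}\,\text{tr\_b} + \beta_{\text{cf1}}\,\text{cw\_b} + \beta_{\text{cf2}}\,\text{cfav\_b} + \xi_{\text{panel}}
\end{align*}
The term $\xi_{\text{panel}}$ corresponds to an additional error term which
we assume is normally distributed with a zero mean and a standard deviation
$\sigma_{\text{panel}}$. It represents random taste variation across
individuals. The results of using 500 draws for simulated maximum
likelihood estimation are presented in the following tables, where the
estimate of the parameter $\sigma_{\text{panel}}$ is included.
We present a logit model estimation with random agent effect (Table A.31), a
mixed RP/SP logit model estimation with random agent effect (Table A.32)
and a mixed RP/SP NL estimation with random agent effect (Table A.33). If
we analyze the last model we can see that we obtain a significant parameter
for the standard deviation of the random agent effect
$\sigma_{\text{panel}}$ (as is also the case for the SP logit model in
Table A.31, but not for the combined RP/SP logit model in Table A.32, where
its t statistic is 0.86) and parameters validating the nested correlation
structure.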
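The simulated likelihood behind these agent effect models can be sketched as follows: for one respondent, the same draw of the agent effect enters all of that respondent's choices, the probabilities of the observed choices are multiplied, and the product is averaged over the draws. The Python code below is a minimal illustration for a binary train versus car choice with the random term added to the train utility only. The function names, the utilities and the choice sequence are hypothetical; only the number of draws (500) comes from the text, and the value 0.75 is used as an example standard deviation in the spirit of Table A.31.

```python
import math
import random


def logit_prob(v_chosen, v_other):
    """Binary logit probability of the chosen alternative."""
    return 1.0 / (1.0 + math.exp(v_other - v_chosen))


def simulated_panel_probability(v_train, v_car, chose_train, sigma_panel,
                                n_draws=500, seed=0):
    """Simulated probability of one respondent's sequence of choices.

    The same normal draw of the agent effect is added to the train utility
    in every choice situation of the respondent.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        xi = sigma_panel * rng.gauss(0.0, 1.0)  # one draw per respondent
        prob = 1.0
        for vt, vc, train_chosen in zip(v_train, v_car, chose_train):
            vt_r = vt + xi
            if train_chosen:
                prob *= logit_prob(vt_r, vc)
            else:
                prob *= logit_prob(vc, vt_r)
        total += prob
    return total / n_draws


# Hypothetical respondent with three SP choice situations.
v_train = [0.2, -0.1, 0.5]
v_car = [0.0, 0.3, 0.1]
chose_train = [True, False, True]
print(simulated_panel_probability(v_train, v_car, chose_train, sigma_panel=0.75))
```

Setting `sigma_panel` to zero removes the agent effect, and the function then returns the plain product of logit probabilities used when observations are treated as independent.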
SP Logit Model Estimation with Agent Effect
Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_train_SP  -0.827     0.276           -2.99
2          ASC_car_SP    0.136      0.515           0.26
3          β_c           -1.46      0.337           -4.33
4          β_cf1         -2.09      0.336           -6.24
5          β_cf2         -1.13      0.212           -5.32
6          β_frq         0.236      0.0366          6.45
7          β_tr          -0.389     0.302           -1.29
8          β_tt          -0.0742    0.0185          -4.02
9          β_wt          -0.0191    0.0172          -1.11
10         σ_panel       0.750      0.143           5.24
Summary statistics
Number of observations = 834
$L(0) = -578.08$
$L(\hat{\beta}) = -501.81$
$\bar{\rho}^2 = 0.115$
Table A.31: SP data: Estimation results for a logit model with agent effect
Combined RP/SP Logit Model Estimation with Agent Effect
Parameter  Parameter     Parameter  Robust          Robust
number     name          estimate   standard error  t statistic
1          ASC_train_RP  -1.33      0.716           -1.85
2          ASC_car_RP    0.479      1.04            0.46
3          ASC_train_SP  -0.814     0.607           -1.34
4          ASC_car_SP    -0.616     1.12            -0.55
5          β_c           -1.34      0.613           -2.19
6          β_carlic      6.30       3.26            1.94
7          β_cf1         -2.79      2.52            -1.10
8          β_cf2         -1.52      1.46            -1.04
9          β_frq         0.294      0.233           1.26
10         β_tr          -0.546     0.459           -1.19
11         β_tt          -0.0796    0.0398          -2.00
12         β_wt          -0.0593    0.0633          -0.94
13         σ_panel       1.07       1.25            0.86
14         μ             0.735      0.695           -0.38*
Summary statistics
Number of observations = 1152
$L(0) = -872.300$
$L(\hat{\beta}) = -593.425$
$\bar{\rho}^2 = 0.304$
* t statistic w.r.t. 1
Table A.32: Combined RP/SP data: Estimation results for a logit model
with agent effect
Combined RP/SP NL estimation with Agent Effect
Parameter  Parameter     Parameter  Standard  t        t
number     name          estimate   error     stat. 0  stat. 1
1          ASC_train_RP  -0.202     0.172     -1.17
2          ASC_car_RP    0.0602     0.784     0.08
3          ASC_train_SP  -0.330     0.189     -1.74
4          ASC_car_SP    -0.339     0.308     -1.10
5          β_c           -0.824     0.232     -3.56
6          β_carlic      4.56       1.23      3.70
7          β_cf1         -1.37      0.471     -2.90
8          β_cf2         -0.748     0.272     -2.75
9          β_frq         0.139      0.0467    2.98
10         β_tr          -0.395     0.195     -2.03
11         β_tt          -0.0385    0.0130    -2.96
12         β_wt          -0.0338    0.0125    -2.71
13         σ_panel       0.472      0.187     2.52
14         μ             1.49       0.498     2.99     0.98
15         μ_car         4.97       1.90      2.61     2.09
16         μ_pub         4.97       1.90      2.61     2.09
17         μ_SP_train    4.97       1.90      2.61     2.09
18         μ_SP_car      4.97       1.90      2.61     2.09
19         μ_SP_bus      4.97       1.90      2.61     2.09
Summary statistics
Number of observations = 1152
$L(0) = -872.300$
$L(\hat{\beta}) = -585.647$
$\bar{\rho}^2 = 0.307$
Table A.33: Combined RP/SP data: Estimation results for a NL model with
agent effect