Essays in Bayesian Econometrics: Modeling, Estimation, and Inference


UC Irvine

UC Irvine Electronic Theses and Dissertations

Title
Essays in Bayesian Econometrics: Modeling, Estimation, and Inference

Permalink
https://escholarship.org/uc/item/5cj868cq

Author
Yoshioka, Kai

Publication Date
2020

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library


UNIVERSITY OF CALIFORNIA,
IRVINE

Essays in Bayesian Econometrics: Modeling, Estimation, and Inference

DISSERTATION

submitted in partial satisfaction of the requirements


for the degree of

DOCTOR OF PHILOSOPHY

in Economics

by

Kai L. Yoshioka

Dissertation Committee:
Associate Professor Ivan G. Jeliazkov, Chair
Professor Emeritus David Brownstone
Professor Emeritus Dale J. Poirier

2020
© 2020 Kai L. Yoshioka
DEDICATION

To my parents.

TABLE OF CONTENTS

Page

LIST OF FIGURES v

LIST OF TABLES vi

ACKNOWLEDGMENTS vii

VITA viii

ABSTRACT OF THE DISSERTATION x

1 Nonparametric Modeling under Nonadditive Effects, Endogeneity, and Sample Selection 1
  1.1 Model 3
  1.2 Prior 6
      1.2.1 Markov Process Smoothness Prior 6
  1.3 Estimation 10
      1.3.1 Sampling β 12
      1.3.2 Sampling Ω ∖ Ω33 12
      1.3.3 Sampling g 13
      1.3.4 Sampling τ² 14
      1.3.5 Sampling z* 14
  1.4 Computation 15
      1.4.1 Cholesky Approach 15
      1.4.2 Ordering Methods 16
      1.4.3 Conditional Sampling 17
      1.4.4 Grouping 17
  1.5 Simulation 18
      1.5.1 Fit 18
      1.5.2 Computational Efficiency 20
      1.5.3 Mixing 21
  1.6 Application 23
      1.6.1 Data 23
      1.6.2 Results 25
  1.7 Conclusion 28

2 A Model for Built Environment Effects on Mode Usage 30
  2.1 Model 34
      2.1.1 Likelihood Function 37
      2.1.2 Prior Distribution 39
  2.2 Estimation 39
      2.2.1 Sampling β 41
      2.2.2 Sampling Ωg 41
      2.2.3 Sampling z 41
  2.3 Cross Entropy Index for Land Use Imbalance 43
  2.4 Data 48
      2.4.1 Dependent Variables 51
      2.4.2 Independent Variables 53
  2.5 Results 55
      2.5.1 Results for β 55
      2.5.2 Policy Simulation Results 58
      2.5.3 Results for Ω 60
  2.6 Conclusion 61

3 An Efficient Gibbs Procedure for the Binary Mixed Logit Model 62


  3.1 Model 63
      3.1.1 Complete-Data Likelihood 63
      3.1.2 Prior 66
  3.2 Estimation 66
      3.2.1 Sampling z 67
      3.2.2 Sampling β 68
      3.2.3 Sampling b 68
      3.2.4 Sampling D 69
      3.2.5 Sampling κ 69
  3.3 Model Comparison 70
      3.3.1 Estimating ln π(β*, D* | y) 72
      3.3.2 Estimating ln π(b*_i | y, β*, D*) 73
  3.4 Simulation Study 74
  3.5 Conclusion 78

Bibliography 80

Appendix A Chapter 2 Appendix 85

Appendix B Chapter 3 Appendix 92

LIST OF FIGURES

Page

1.1 Simulation Output for n = 500 20
1.2 Precision Matrix Sparsity Patterns 21
1.3 Inefficiency Factor Boxplots 22
1.4 Cross Sections of g 28

2.1 Confounding Effects 31
2.2 Multivariate Ordinal Outcomes Model with a Binary Selection Mechanism 32
2.3 Two Cities with an Entropy Index Value of 1 45
2.4 Comparison of Reference Distributions 47
2.5 Entropy Index vs. Cross Entropy Index 48

3.1 Autocorrelation Functions 76
3.2 Autocorrelation Functions for D22 77

LIST OF TABLES

Page

1.1 Descriptive Statistics 24
1.2 Estimation Results for β 26

2.1 Cities Surveyed in the 5th Nationwide Person Trip Survey 50
2.2 Usage by Travel Mode and License Status 52
2.3 Summary Statistics 54
2.4 Posterior Means and Standard Deviations for β 57
2.5 Effect of Simultaneously Doubling Density and Halving Land Use Imbalance 60
2.6 Posterior Probability of Ωjk < 0 in Ω 61

3.1 Simulation Results based on the Proposed Gibbs Algorithm 75
3.2 Inefficiency Factors for Different Sampling Schemes 77
3.3 Inefficiency Factors for n ∈ {250, 500, 1000} and T ∈ {10, 15, 20} 78

ACKNOWLEDGMENTS

I would like to thank the members of my dissertation committee, Professor Ivan Jeliazkov,
Professor David Brownstone, and Professor Dale Poirier, for their infinite wisdom and endless
support. It is an honor to have been advised by such a distinguished panel of econometricians.

I thank my advisor, Professor Ivan Jeliazkov, for his untiring guidance, his many encouragements, his contagious enthusiasm, and for his patience. From basic writing tips to technical
advice on efficiently computing the marginal likelihood, he has supported me at every step
of the research process. His willingness to spend time giving detailed comments and explanations was crucial in my development as a researcher and a teacher. I owe my confidence
and my capabilities as an econometrician to Professor Jeliazkov.

I thank Professor David Brownstone, whose insights and expertise in both econometrics and
transportation economics were paramount for the completion of this dissertation. I have
gained numerous leads, references, and tools from his vast knowledge of the literature. I owe
my understanding of applied econometrics to Professor Brownstone.

I thank Professor Dale Poirier, who introduced me to Bayesian econometrics and taught me
the correct interpretation of the frequentist confidence interval. His teachings have shaped
the core of my statistical worldview. I owe my deeper understanding of data, modeling, and
econometrics to Professor Poirier.

I would also like to thank Professor Tomomi Miyazaki, whose input was crucial for the
empirical applications presented in this dissertation, as well as the Ministry of Land, Infrastructure, Transport and Tourism of Japan for providing data. I thank Koji Adachi and
Tatsuki Enomoto for data cleaning and handling. I am grateful to the Department of Economics for the Summer Research Fellowship, which was essential in making research progress
during the summer. Funding from the Associate Dean Fellowship is gratefully acknowledged.

I want to thank Alexander Luttmann, Padma Sharma, Sanjana Goswami, Paul Jackson,
and Bryan Castro for their many insights, support, and fellowship; Aunty Denise, Uncle
Paul, and the Yamaguchi family for making southern California my home away from home;
Professor Mike Hand, whose enthusiasm for statistics started it all; Professor Miho Fujiwara,
Professor Mark Janeba, and Professor Nathan Sivers Boyce for their advisorship during my
undergraduate years at Willamette University; Mr. Jayson Kunihiro, my high school math
teacher, for his tireless devotion to his students; and my cousin, Dr. Kimberly Burnett, for
getting me started in economics and helping me numerous times along the way.

And finally, I thank my parents, Neal and Miho, and my sister, Noe, whose love, support,
and encouragement brought me to where I am today.

VITA

Kai L. Yoshioka

EDUCATION
Doctor of Philosophy in Economics, 2020
University of California, Irvine (Irvine, California)

Master of Science in Statistics, 2019
University of California, Irvine (Irvine, California)

Master of Arts in Economics, 2015
University of California, Irvine (Irvine, California)

Bachelor of Arts in Economics and Mathematics, 2014
Willamette University (Salem, Oregon)

RESEARCH EXPERIENCE
Graduate Student Researcher, 2018–2019
University of California, Irvine (Irvine, California)

TEACHING EXPERIENCE
Teaching Assistant, 2014–2019
University of California, Irvine (Irvine, California)

Lecturer, 2017
University of California, Irvine (Irvine, California)

AWARDS
Best Teaching Assistant, 2019
University of California, Irvine (Irvine, California)

Summer Research Fellowship, 2016, 2017, 2018
University of California, Irvine (Irvine, California)

Phi Beta Kappa, 2014
Willamette University (Salem, Oregon)

PROFESSIONAL EXPERIENCE
Software Development Intern/Apprentice at EViews, 2018–2020
University of California, Irvine (Irvine, California)

CONFERENCE AND SEMINAR PRESENTATIONS


Joint Statistical Meetings 2019
Western Economic Association International 2019
Advances in Econometrics Conference 2018

REFEREE EXPERIENCE

Advances in Econometrics

ABSTRACT OF THE DISSERTATION
Essays in Bayesian Econometrics: Modeling, Estimation, and Inference

By

Kai L. Yoshioka

Doctor of Philosophy in Economics

University of California, Irvine, 2020

Associate Professor Ivan G. Jeliazkov, Chair

This dissertation features a selection of Bayesian estimation frameworks for a variety of data
and modeling scenarios. Economists seldom work with data arising from a prototypical linear
regression model. In reality, data may be discrete, mixed, or missing; the true model may be
semiparametric or nonparametric; effects may be nonadditive or heterogeneous across sampling units; observational data may be subject to sample selection and endogeneity issues; and the data may come from a sequential process. We discuss estimation
methodologies for multifaceted data arising from complex processes. Special emphasis is
placed on computational and mixing efficiency.

The first chapter develops a class of Markov process smoothness priors for potentially nonadditive multivariate mean functions. The proposed class of priors is incorporated into a larger
semiparametric estimation framework that allows for endogeneity and sample selection, and
is used to study how age and population density jointly affect car use in Japan. We find
that an increase in density leads to a reduction in car use, and that this effect is smaller
for elderly drivers than for young drivers. We also uncover interesting nonlinearities in the
relationship between age and car use. Our empirical findings highlight the importance of
allowing for nonlinear and nonadditive effects in our models.

The second chapter develops an econometric framework for estimating the effect of the built

environment on transportation mode choice and usage when a large fraction of the population under study is nonlicensed. We use a multivariate ordinal outcomes model with a binary selection component to allow for heterogeneity in built environment effects and for the indirect effects urban form has on mode usage via license choice. Our exposition focuses on the joint modeling of correlated and discrete outcomes (binary and ordinal), strategizing with identification restrictions and nonidentification, and the efficient estimation of model parameters.
A separate contribution this paper makes to the urban/transportation economics literature
is the cross entropy index for land use imbalance, which we propose as a replacement for the
entropy index for land use mix/balance. Using data from the 5th Nationwide Person Trip
Survey (NPTS), we investigate whether the built environment is a policy-relevant determinant of travel behavior among the Japanese elderly. Effects are found to be nonzero but modest at best. Our results and conclusions are broadly consistent with those based on data from the United States.

In the third chapter, we design a Bayesian estimation method for the binary logit model
with fixed and random effects. To this end, we develop a Gibbs procedure that uses sparse
matrix algorithms, vectorization (parallelization), and collapsing to achieve computational
and mixing efficiency. The method is fully Gibbs in that it does not involve a Metropolis-
Hastings step anywhere. We also discuss the computation of the marginal likelihood, which
is used to compare and select models. A detailed simulation study shows that the proposed
methodology works well.

Chapter 1

Nonparametric Modeling under Nonadditive Effects, Endogeneity, and Sample Selection

This paper presents a nonparametric/semiparametric estimation framework that allows for


nonadditive effects, sample selection, and endogeneity. Semiparametric models, also known
as partially linear models, have parametric (linear) and nonparametric (nonlinear) mean
components. Nonadditivity occurs when the nonparametric part of the mean function is not
additive.

The intuition behind nonparametric estimation is the same for non-Bayesians and Bayesians:
smooth over nearby data. Non-Bayesians typically use kernels to achieve smoothing, whereas
Bayesians use their priors. A popular class of smoothing priors is based on the Gaussian process. For works that use Gaussian process smoothness priors, also known as Markov process
smoothness priors, see Chib and Jeliazkov (2006), Chib and Greenberg (2007), Jeliazkov
(2008), Chib et al. (2009), and Jeliazkov (2013). Comprehensive coverage of Gaussian processes as they apply to machine learning is provided by Williams and Rasmussen (2006).

The class of Markov process smoothness priors is flexible and easy to use. However, its use so
far has been limited to additive models. If the true model is nonadditive, fitting an additive
model to the data can yield misleading results. Additivity also precludes the analysis of
interaction effects, which are the focus of many studies. We overcome these limitations by
extending the class of Markov process smoothness priors to cover nonadditive multivariate
mean functions.

Endogenous variables are variables determined within the model. Not properly accounting for endogeneity leads to bias and inconsistency. The usual approach to dealing with endogeneity is to use instrumental variables, which are correlated with the endogenous variable (i.e., relevant) but are uncorrelated with the error term in the response equation (i.e., excludable).

Sample selection, also known as incidental truncation and informative missingness, occurs
when a subset of the data is nonrandomly missing for a subset of the sample. Ignoring selection can lead to incorrect inference, whereas explicitly modeling the selection mechanism restores valid inference. The classical solution to the prototypical sample selection problem was given
by Heckman (1976, 1977). See Wooldridge (2010) for further developments on the topic.

The proposed estimation framework is applied to a sample of Japanese travel survey participants to study how age and population density affect car use in Japan. Age and density effects are modeled nonparametrically. Age is expected to have a nonlinear effect on car use owing to its relationship to life-cycle events (e.g., entering the workforce, starting a family, and retiring). By modeling age and density jointly, we allow density effects to vary with age. Following the transportation economics literature, population density is modeled as a potentially endogenous variable. Selection occurs in the context of our study because car usage cannot be observed for those who are not licensed to drive.

The remainder of this paper is organized as follows. Section 1.1 presents the nonparametric model with endogeneity and sample selection. Priors are developed in Section 1.2, where we elaborate on the Markov process smoothness prior for multivariate functions. Estimation is carried out using a Markov chain Monte Carlo (MCMC) sampler, which is given in Section 1.3. Section 1.4 discusses several techniques for speeding up the proposed estimation algorithm. A simulated data set is used to study the performance of the nonadditive fitting technique in Section 1.5. Our empirical investigation using Japanese travel survey data is conducted in Section 1.6, and Section 1.7 concludes.

1.1 Model

Following Albert and Chib (1993), the model for subject i ∈ {1, . . . , n} is

\[ z_{i1} = x_{i1}'\beta_1 + g(w_i') + \varepsilon_{i1} \qquad (1.1) \]
\[ z_{i2} = x_{i2}'\beta_2 + \varepsilon_{i2} \qquad (1.2) \]
\[ z_{i3} = x_{i3}'\beta_3 + \varepsilon_{i3}, \qquad (1.3) \]

where zij for j ∈ {1, 2, 3} is a latent (unobserved) variable, xij and wi are covariate vectors,
βj is a coefficient vector, g is a smooth nonparametric function, and εij is an error term.
Equations (1.1), (1.2), and (1.3) model the outcome of interest, the endogenous variable,
and the selection variable, respectively.

Covariates in xij enter the model parametrically. Covariates in wi enter the outcome of interest nonparametrically. Equation (1.1) is level-identified by excluding an intercept term from xi1 (see Shively et al. (1999) for an alternative identification approach). The endogenous variable zi2 is assumed to enter both zi1 and zi3. In Equation (1.1), zi2 can enter either xi1 or wi. For identification reasons, xi2 should contain at least one variable that does not appear in any of the other covariate vectors. This variable is our instrument.

Let yij = lj (zij ) denote the observed counterpart to zij , where lj is a link function that is
specific to equation j. The data type on yij determines lj . For continuous and unrestricted
yij , the identity link is often used so that yij = zij . Binary outcomes can be modeled using
yij = 1{zij > 0}, where 1{·} is the indicator function that returns one when its argument is
true and zero otherwise. For outcomes censored below zero, we have yij = zij × 1{zij > 0} (Tobin, 1958). Ordinal data can be modeled using
\[ y_{ij} = \sum_{t_j} 1\{z_{ij} > \gamma_{j,t_j}\}, \]
where the set of ordered cutpoints {γj,tj} discretizes the real number line into Tj categories. In our
application, the outcome of interest is car use, which is measured on a three-bin ordinal
scale that starts at zero. The endogenous variable, the log of density, is continuous and
unrestricted. The selection variable is an indicator for whether a person has a driver’s
license. Thus we have

\[ y_{i1} = 1\{z_{i1} > 0\} + 1\{z_{i1} > 1\}, \qquad y_{i2} = z_{i2}, \qquad y_{i3} = 1\{z_{i3} > 0\}. \]

Let S = {i : yi3 = 1} and U = {i : yi3 = 0} denote the selected and unselected samples.
For subjects in the selected sample, yij is observed for all j. For those in the unselected
sample, the outcome of interest, yi1 , is missing. Sample sizes are NS = |S|, NU = |U |, and
N = NS + NU for the selected, unselected, and potential samples, respectively.
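As a concrete illustration, the three links used in our application can be sketched in code (a sketch with illustrative names and data; `observe` is not part of the dissertation's software):

```python
import numpy as np

def observe(z1, z2, z3):
    """Map latent draws (z_i1, z_i2, z_i3) to observed outcomes using the
    application's links: three-bin ordinal, identity, and binary."""
    y1 = (z1 > 0).astype(int) + (z1 > 1).astype(int)  # car use in {0, 1, 2}
    y2 = z2                                           # log density (identity)
    y3 = (z3 > 0).astype(int)                         # license indicator
    return y1, y2, y3

# Three hypothetical subjects with latent car-use propensities -0.4, 0.3, 1.7
y1, y2, y3 = observe(np.array([-0.4, 0.3, 1.7]),
                     np.array([2.1, 2.5, 3.0]),
                     np.array([-1.0, 0.2, 0.9]))
```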

Define the vectors
\[ z_i = \begin{bmatrix} z_{i1} \\ z_{i2} \\ z_{i3} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}, \qquad g_i = \begin{bmatrix} g(w_i') \\ 0 \\ 0 \end{bmatrix}, \qquad \varepsilon_i = \begin{bmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \\ \varepsilon_{i3} \end{bmatrix}. \]
The data matrix
\[ X_i = \begin{bmatrix} x_{i1}' & & \\ & x_{i2}' & \\ & & x_{i3}' \end{bmatrix} \]
is given in seemingly unrelated regression (SUR) form (Zellner, 1962), where zeros in the matrix are suppressed for notational clarity. The model for i can then be expressed as

\[ z_i = X_i\beta + g_i + \varepsilon_i, \]
where the error vector (εi | Ω) ∼ N(0_{3×1}, Ω) is assumed to be multivariate normal and iid across subjects. The error covariance matrix is
\[ \Omega = \begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix}. \]

Following Chib et al. (2009), the scaling restriction Ω33 = 1 is not imposed until after
the posterior distribution is known. Other distributional forms can be used for the errors.
We work with the normal distribution because it lays the groundwork for more flexible
distributions, including finite mixtures and scale mixtures.

The region of truncation for subject i is Bi = Bi1 × {yi2} × Bi3, where
\[ B_{i1} = \begin{cases} (-\infty, 0] & \text{if } y_{i1} = 0 \\ (0, 1] & \text{if } y_{i1} = 1 \\ (1, \infty) & \text{if } y_{i1} = 2 \\ (-\infty, \infty) & \text{if } y_{i1} \text{ is missing} \end{cases} \qquad\text{and}\qquad B_{i3} = \begin{cases} (-\infty, 0] & \text{if } y_{i3} = 0 \\ (0, \infty) & \text{if } y_{i3} = 1. \end{cases} \]
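Looking up a subject's truncation region is mechanical; a minimal sketch (the helper `region_B` is our hypothetical name) is:

```python
import math

def region_B(y1, y3, y2):
    """Return B_i = B_i1 x {y_i2} x B_i3 as interval tuples; pass y1=None
    when the outcome of interest is missing (i in U)."""
    B1_map = {0: (-math.inf, 0.0), 1: (0.0, 1.0), 2: (1.0, math.inf)}
    B1 = B1_map.get(y1, (-math.inf, math.inf))  # unrestricted when missing
    B3 = (0.0, math.inf) if y3 == 1 else (-math.inf, 0.0)
    return B1, (y2, y2), B3
```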

Let θ = {β, Ω, g, τ²} denote the set of model parameters, where g is a vector of ordinates of the nonparametric function g, and τ² is known as the smoothing parameter. Then the complete-data likelihood is
\[ f(y, z \mid \theta) = \prod_{i=1}^{n} 1\{z_i \in B_i\}\, f(z_i \mid \theta). \]

1.2 Prior

Our prior on θ is
\[ \pi(\theta) = \pi(\beta)\,\pi(\Omega)\,\pi(g \mid \tau^2)\,\pi(\tau^2), \]
where β ∼ N(β0, B0), Ω ∼ IW(Ψ, ν), (g | τ²) ∼ N(0_{m×1}, τ²K⁻¹), and τ² ∼ IG(γ/2, δ/2) leads to semiconjugacy. The remainder of this section is devoted to developing a Markov process smoothness prior for g that allows for nonadditive effects. The notation used here draws heavily from Ohlson et al. (2013).

1.2.1 Markov Process Smoothness Prior

Let wik denote the k-th covariate in wi , where i ∈ S and wi has q elements. For each k,
sort and stack the unique values of wik to produce the mk -dimensional vector dk . In the
univariate function (q = 1) case, d = d1 is known as the design point vector. The elements
of d, called design points, are ordered grid points on a one-dimensional grid. In the general
(q ≥ 1) case, the design point is a q-dimensional row vector

pt = [d1,t1 d2,t2 · · · dq,tq ],

where dk,tk denotes the tk -th element of dk and t = [t1 t2 · · · tq ] is a vector of indices. The
set of all design points {pt } comprises the covariate lattice, a q-dimensional lattice in the

covariate space. If g_t = g(p_t) denotes the ordinate of g at p_t, then
\[ G = [g_t] : m_1 \times m_2 \times \cdots \times m_q \]
is a tensor of order q and dimensions m1, m2, . . . , mq. We refer to G as the ordinate tensor. Ordinates in G are arranged in the same way the underlying lattice points are arranged in the covariate lattice.

Let m = \prod_{k=1}^{q} m_k be the number of ordinates in G, and let 𝐦 = [m1 m2 · · · mq]. The ordinate tensor G can be vectorized with the help of the m-dimensional vector
\[ e_t^{\mathbf{m}} = e_{t_1}^{m_1} \otimes e_{t_2}^{m_2} \otimes \cdots \otimes e_{t_q}^{m_q}, \]
where e_{t_k}^{m_k} is the m_k-dimensional unit basis vector that has a one in the t_k-th position and zeros everywhere else. The symbol ⊗ is the Kronecker product. Vectorizing G yields the ordinate vector
\[ g = \mathrm{vec}(G) = \sum_{I_{\mathbf{m}}} g_t\, e_t^{\mathbf{m}}, \]
where I_𝐦 = {t1, t2, . . . , tq : 1 ≤ tk ≤ mk, 1 ≤ k ≤ q} is an index set. In the bivariate case where G is a matrix, vec(G) is equivalent to vectorizing G row-wise.

Design points are stacked to form the m × q design point matrix D, where design points are ordered according to
\[ t = \sum_{k=1}^{q-1} (t_k - 1) \prod_{l=k+1}^{q} m_l + t_q. \]
Let h_{k,t_k} = d_{k,t_k} − d_{k,t_k−1} for t_k ≥ 2. In the univariate function case, h_t = h_{1,t} denotes the distance between two consecutive design points. In the bivariate function case, h_{1,t} denotes the distance between two consecutive rows in the 2-dimensional covariate lattice. Define
\[ H_k = \begin{bmatrix} 1 & & & & \\ & 1 & & & \\ \frac{h_{k,3}}{h_{k,2}} & -\left(1 + \frac{h_{k,3}}{h_{k,2}}\right) & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & \frac{h_{k,m_k}}{h_{k,m_k-1}} & -\left(1 + \frac{h_{k,m_k}}{h_{k,m_k-1}}\right) & 1 \end{bmatrix} \]
and
\[ \Sigma_k = \begin{bmatrix} G_{k,0} & & & \\ & h_{k,3} & & \\ & & \ddots & \\ & & & h_{k,m_k} \end{bmatrix}, \]
where G_{k,0} is a 2 × 2 matrix. Let K = K1 ⊗ K2 ⊗ · · · ⊗ Kq, where K_k = H_k'\Sigma_k^{-1}H_k. The Markov process smoothness prior on the ordinate vector g is
\[ (g \mid \tau^2) \sim N(0_{m \times 1},\, \tau^2 K^{-1}). \]
This prior is equivalent to specifying (G | τ²) ∼ N_𝐦(0_{m×1}, τ²K⁻¹) as our prior on the ordinate tensor G, where the notation N_𝐦 denotes the multilinear normal (MLN) distribution of order q (Ohlson et al., 2013).

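Numerically, the construction of K is a loop over covariates followed by Kronecker products. The sketch below assumes G_{k,0} = I₂ for the 2 × 2 initial-condition block (the text leaves G_{k,0} generic), and the function names are ours:

```python
import numpy as np

def second_order_K(d):
    """Precision factor K_k = H_k' Sigma_k^{-1} H_k for one covariate,
    given its sorted design-point vector d.  Assumes G_{k,0} = I_2."""
    m = len(d)
    h = np.diff(d)                      # h_{k,t} = d_{k,t} - d_{k,t-1}
    H = np.eye(m)                       # rows 1-2: initial conditions
    for t in range(2, m):               # rows 3..m of H_k (0-indexed)
        r = h[t - 1] / h[t - 2]
        H[t, t - 2] = r
        H[t, t - 1] = -(1.0 + r)
    Sigma = np.diag(np.r_[1.0, 1.0, h[1:]])   # diag(G_{k,0}, h_{k,3..m})
    return H.T @ np.linalg.inv(Sigma) @ H

def full_K(design_vectors):
    """K = K_1 kron K_2 kron ... kron K_q."""
    K = second_order_K(design_vectors[0])
    for d in design_vectors[1:]:
        K = np.kron(K, second_order_K(d))
    return K
```

Since each H_k is unit lower triangular and each Σ_k is positive definite, every K_k, and hence K, is positive definite.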
1.2.1.1 Bivariate Function Case

The MLN prior over G is a Markov process smoothness prior. We show this for the bivariate function case. The ordinate tensor G in the bivariate case is the m1 × m2 matrix
\[ G = \begin{bmatrix} g(d_{1,1}, d_{2,1}) & \cdots & g(d_{1,1}, d_{2,m_2}) \\ \vdots & \ddots & \vdots \\ g(d_{1,m_1}, d_{2,1}) & \cdots & g(d_{1,m_1}, d_{2,m_2}) \end{bmatrix} = \begin{bmatrix} g_1' \\ \vdots \\ g_{m_1}' \end{bmatrix}, \]
where g_t' for t ∈ {1, . . . , m1} are the rows of G.

The distance between consecutive lattice rows is h_{1,t} = d_{1,t} − d_{1,t−1} for t ≥ 2. Suppose that the rows in G follow the second-order vector Markov process
\[ g_t = \left(1 + \frac{h_{1,t}}{h_{1,t-1}}\right) g_{t-1} - \frac{h_{1,t}}{h_{1,t-1}}\, g_{t-2} + u_t, \]
where u_t' is a row from the error matrix
\[ (U \mid \tau^2) \sim N_{m_1, m_2}\!\left(0_{m_1 \times m_2},\; \tau^2\, \Sigma_1 \otimes H_2^{-1}\Sigma_2 H_2^{-1\prime}\right). \]
The notation N_{m1,m2} denotes the matrix normal distribution, also known as the matricvariate normal or the bilinear normal distribution, with m1 rows and m2 columns (Ohlson et al., 2013; Gupta and Nagar, 2018). The vector process is initialized by setting g1 and g2 to u1 and u2, respectively.

Let g = vec(G) and u = vec(U), where vectorization takes place row-wise. The system of equations obtained from the vector Markov process on g_t and its initial conditions can be written succinctly as
\[ (H_1 \otimes I_{m_2})\, g = u. \]
Premultiplying both sides by the inverse of (H1 ⊗ I_{m2}), we get the Markov process smoothness prior on g, (g | τ²) ∼ N(0_{m×1}, τ²K⁻¹), where K = H_1'\Sigma_1^{-1}H_1 \otimes H_2'\Sigma_2^{-1}H_2. This prior is equivalent to the matrix normal prior on G of the form (G | τ²) ∼ N_{m1,m2}(0_{m1×m2}, τ²K⁻¹).

1.3 Estimation

Combining the complete-data likelihood from Section 1.1 with the prior from Section 1.2 yields the fully augmented posterior distribution
\[ \pi(\theta, z \mid y) \propto \pi(\theta) \prod_{i=1}^{n} 1\{z_i \in B_i\}\, f(z_i \mid \theta). \]

Integrating {zi1 : i ∈ U} out, we get the partially augmented posterior distribution
\[ \pi(\theta, z^* \mid y) \propto \pi(\theta) \left[ \prod_{i \in S} 1\{z_i \in B_i\}\, f(z_i \mid \theta) \right] \left[ \prod_{i \in U} 1\{z_{iU} \in B_{iU}\}\, f(z_{iU} \mid \theta) \right], \]
where z* = z ∖ {zi1 : i ∈ U}. The symbol "∖" is shorthand for "except" so that, for example, z* is z without the elements in {zi1 : i ∈ U}. Define the vectors and matrices
\[ z_{iU} = \begin{bmatrix} z_{i2} \\ z_{i3} \end{bmatrix}, \qquad X_{iU} = \begin{bmatrix} x_{i2}' & \\ & x_{i3}' \end{bmatrix}, \qquad \beta_U = \begin{bmatrix} \beta_2 \\ \beta_3 \end{bmatrix}, \qquad \Omega_U = \begin{bmatrix} \Omega_{22} & \Omega_{23} \\ \Omega_{32} & \Omega_{33} \end{bmatrix}. \]
The contribution of subject i ∈ U to the partially augmented posterior is
\[ 1\{z_{iU} \in B_{iU}\}\, f(z_{iU} \mid \theta), \]
where BiU = {yi2} × (−∞, 0], ziU = XiU βU + εiU, and (εiU | ΩU) ∼ N(0_{2×1}, ΩU).

Inference is based on π(θ ∖ Ω33, z* | y, Ω33 = 1), which is the partially augmented posterior π(θ, z* | y) with the scaling restriction Ω33 = 1 included in the conditioning set. Enforcing the scaling restriction only affects how Ω is sampled; the steps for sampling β, g, τ², and z* are not affected.

Posterior simulation is carried out using the Gibbs procedure summarized below. Parameters
and latent data that do not enter a posterior full conditional are suppressed in the conditioning set. The MCMC chain is initiated at some starting value and evolved until convergence.
Evolving the chain further produces a set of correlated draws from the posterior distribution.
These draws are used to make inference, generate predictions, compare models, and so on.
Implementation details are provided in the subsections.

Algorithm 1.

1. Sample β from π(β | y, Ω, g, z*).

2. Sample Ω ∖ Ω33 from π(Ω ∖ Ω33 | y, β, Ω33 = 1, g, z*).

3. Sample g from π(g | y, β, Ω, τ², z*).

4. Sample τ² from π(τ² | y, g).

5. (a) For i ∈ S, sample zi1 from π(zi1 | y, θ, zi2, zi3) and zi3 from π(zi3 | y, θ, zi1, zi2).

   (b) For i ∈ U, sample zi3 from π(zi3 | y, θ, zi2).
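The control flow of Algorithm 1 can be sketched as a plain Gibbs loop; the `updates` callables are hypothetical stand-ins for the full conditionals derived in Sections 1.3.1 through 1.3.5:

```python
def gibbs(y, updates, state, n_burn=1000, n_keep=5000):
    """Run the five-block Gibbs sampler of Algorithm 1.  Each update maps
    (state, y) to a new state; draws after burn-in are retained."""
    draws = []
    for it in range(n_burn + n_keep):
        for block in ("beta", "Omega", "g", "tau2", "z"):  # steps 1-5
            state = updates[block](state, y)
        if it >= n_burn:
            draws.append(dict(state))
    return draws
```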

In principle, we could have skipped the step where {zi1 : i ∈ U} was integrated out of the posterior; either way, we are left with draws from π(θ ∖ Ω33 | y, Ω33 = 1) once we discard the latent data. There are four important benefits to collapsing the posterior, the first three of which are true in general (i.e., apply to other models, too). First, it reduces the computational and storage demands of the sampler. Second, collapsing improves the mixing efficiency of the underlying Markov chain. Third, estimation can be carried out even if some or all of the covariates in Equation (1.1) are missing for i ∈ U. The fourth benefit applies specifically to semiparametric/nonparametric models: marginalizing {zi1 : i ∈ U} out of the posterior can drastically reduce the dimensionality of g. This point deserves special emphasis because smaller m means faster runtimes, lower storage requirements, and improved mixing.

1.3.1 Sampling β

Define the selection matrix C such that βU = Cβ, and let XiC = XiU C. The posterior full conditional on β is (β | y, Ω, g, z*) ∼ N(β̂, B̂), where
\[ \hat{B}^{-1} = B_0^{-1} + \sum_{i \in S} X_i' \Omega^{-1} X_i + \sum_{i \in U} X_{iC}' \Omega_U^{-1} X_{iC} \]
and
\[ \hat{\beta} = \hat{B} \left[ B_0^{-1}\beta_0 + \sum_{i \in S} X_i' \Omega^{-1}(z_i - g_i) + \sum_{i \in U} X_{iC}' \Omega_U^{-1} z_{iU} \right]. \]
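In code, the two sums accumulate the posterior precision and the weighted right-hand side; the draw then uses a Cholesky factor of B̂⁻¹. This is a sketch under our own naming, not the dissertation's implementation:

```python
import numpy as np

def sample_beta(rng, B0inv, beta0, X_sel, resid_sel, Om_inv,
                X_uns, z_uns, OmU_inv):
    """One draw of beta ~ N(beta_hat, B_hat).  X_sel holds the blocks X_i
    for i in S (resid_sel the vectors z_i - g_i); X_uns holds the blocks
    X_iC for i in U (z_uns the vectors z_iU)."""
    Binv = B0inv.copy()                     # B_hat^{-1} accumulator
    rhs = B0inv @ beta0
    for Xi, ri in zip(X_sel, resid_sel):
        Binv += Xi.T @ Om_inv @ Xi
        rhs += Xi.T @ Om_inv @ ri
    for Xi, zi in zip(X_uns, z_uns):
        Binv += Xi.T @ OmU_inv @ Xi
        rhs += Xi.T @ OmU_inv @ zi
    L = np.linalg.cholesky(Binv)            # B_hat^{-1} = L L'
    beta_hat = np.linalg.solve(Binv, rhs)
    # Solving L' x = z gives x with covariance (L L')^{-1} = B_hat
    return beta_hat + np.linalg.solve(L.T, rng.standard_normal(len(beta0)))
```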

1.3.2 Sampling Ω ∖ Ω33

The sampling of Ω is somewhat complicated owing to the restriction Ω33 = 1 and the fact that the posterior distribution is only partially augmented (i.e., augmented with z* instead of z). Here we update Ω ∖ Ω33 in the manner described in Chib et al. (2009).

Let R = \Psi + \sum_{i \in S} \varepsilon_i\varepsilon_i' and T = \Psi + \sum_{i=1}^{n} \varepsilon_i\varepsilon_i', where εi1 for i ∈ U is set to zero. Let u be the vector containing the second and third indices so that, for example, Ωu1 = Ω(23)1. Make the change of variables
\[ \{\Omega_{11}, \Omega_{u1}, \Omega_{22}, \Omega_{32}, \Omega_{33}\} \to \{\Omega_{11\cdot u}, B_{u1}, \Omega_{22\cdot 3}, B_{32}, \Omega_{33}\}, \]
where
\[ \Omega_{tt\cdot l} = \Omega_{tt} - \Omega_{tl}\Omega_{ll}^{-1}\Omega_{lt} \quad\text{and}\quad B_{lt} = \Omega_{ll}^{-1}\Omega_{lt}, \]
where t and l can either be indices or vectors of indices. The following one-block, five-step procedure is used to generate Ω:

1. Sample Ω22·3 from (Ω22·3 | y, β, g, z*) ∼ IG[(ν + n − 1)/2, T22·3/2].

2. Sample Ω11·u from (Ω11·u | y, β, g, z*) ∼ IG[(ν + nS + 1)/2, R11·u/2].

3. Sample B32 from (B32 | y, β, Ω22·3, g, z*) ∼ N(T33⁻¹T32, T33⁻¹Ω22·3).

4. Sample Bu1 from (Bu1 | y, β, Ω11·u, g, z*) ∼ N(Ruu⁻¹Ru1, Ruu⁻¹Ω11·u).

5. Construct
\[ \Omega = \begin{bmatrix} \Omega_{11\cdot u} + B_{u1}'\Omega_U B_{u1} & B_{u1}'\Omega_U \\ \Omega_U B_{u1} & \Omega_U \end{bmatrix}, \quad\text{where}\quad \Omega_U = \begin{bmatrix} \Omega_{22\cdot 3} + B_{32}' B_{32} & B_{32}' \\ B_{32} & 1 \end{bmatrix}. \]

1.3.3 Sampling g

Stack zi over i ∈ S to create the 3NS × 1 vector zS. Do the same with Xi and εi to get XS
and εS. Let Q be the 3NS × m incidence matrix, where

    Q3(i−1)+1,t = 1 if wi′ equals the t-th row of D, and Q3(i−1)+1,t = 0 otherwise,

and Q3(i−1)+2,t = Q3i,t = 0 for all i ∈ S and t ∈ {1, . . . , m}. The posterior full conditional
distribution on the ordinate vector is given by (g|y, β, Ω, τ², z∗) ∼ N(ĝ, Ĝ), where

    Ĝ^{-1} = K/τ² + Q′VQ   and   ĝ = ĜQ′V(zS − XS β),

where V = I_{NS} ⊗ Ω^{-1}. Generating the ordinate vector can be a computationally demanding
process. Section 1.4 discusses several sampling and modeling strategies that can be used to
speed up this process.

1.3.4 Sampling τ 2

The smoothing parameter τ² is sampled from (τ²|y, g) ∼ IG(γ̂/2, δ̂/2), where

    γ̂ = γ + m   and   δ̂ = δ + g′Kg.

1.3.5 Sampling z ∗

The latent data in z∗ are sampled using the method of Geweke (1991). Note that the posterior
full conditionals on zij ∈ z∗ are univariate truncated normal distributions. Specifically,
we have

    (zi1|y, θ, zi2, zi3) ∼ TN_{Bi1}[E(zi1|θ, zi2, zi3), var(zi1|θ, zi2, zi3)]

and

    (zi3|y, θ, zi1, zi2) ∼ TN_{(0,∞)}[E(zi3|θ, zi1, zi2), var(zi3|θ, zi1, zi2)]

for i ∈ S. For i ∈ U, we have

    (zi3|y, θ, zi2) ∼ TN_{(−∞,0]}[E(zi3|θ, zi2), var(zi3|θ, zi2)].

The rejection sampler from Robert (1995) is used to sample from truncation regions in the
tails of the normal distribution.
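For illustration, `scipy.stats.truncnorm` can serve as a practical stand-in for the Geweke (1991) and Robert (1995) samplers; note that `truncnorm` takes its bounds on the standardized scale, so the mean and standard deviation must be shifted out first. The helper name below is our own.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_truncnorm(mu, sigma, lo, hi, rng):
    """Draw from N(mu, sigma^2) truncated to (lo, hi).

    truncnorm parameterizes the bounds on the standard normal scale,
    so they are shifted and scaled before sampling.
    """
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)

rng = np.random.default_rng(1)
# e.g., a full conditional supported on (0, inf), deep in the upper tail:
draws = np.array([draw_truncnorm(-2.0, 1.0, 0.0, np.inf, rng) for _ in range(500)])
```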

1.4 Computation

In Algorithm 1, the ordinate vector g is sampled from an m-dimensional normal density,
where m is typically very large. In many applications, generating g directly is impractical.
This section discusses four sampling and modeling strategies that can be used to speed up
the ordinate vector sampling process.

1.4.1 Cholesky Approach

The nonzeros in K are confined to a diagonal band. Because Q′VQ is a diagonal matrix,
the precision matrix Ĝ^{-1} has the same banded structure as K. If the bandwidth of K is
narrow, then Algorithm 2 can be used to generate g in O(m) time. This procedure makes
nonparametric estimation computationally practical, and is one reason why Markov process
smoothness priors are popular when the nonparametric function g is univariate or additive.

Algorithm 2.

1. Compute the Cholesky factor R of Ĝ^{-1} such that R′R = Ĝ^{-1}.

2. Let b = Rĝ. Solve R′b = Q′V(zS − XS β) for b, then solve Rĝ = b for ĝ.

3. Generate u ∼ N(0_{m×1}, I_m) and solve Rη = u for η. Add ĝ to get g = ĝ + η.
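The three steps above can be sketched as follows. For a genuinely banded Ĝ^{-1} one would use a banded factorization routine such as `scipy.linalg.cholesky_banded` to keep the cost O(m); the dense version below simply illustrates the factor-and-solve logic.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_g(G_inv, a, rng):
    """Draw g ~ N(g_hat, G), where G^{-1} = G_inv and g_hat = G a,
    via the Cholesky factor R with R'R = G^{-1} (dense illustrative sketch)."""
    R = cholesky(G_inv)                       # upper triangular, R'R = G_inv
    b = solve_triangular(R, a, trans='T')     # step 2a: solve R'b = a
    g_hat = solve_triangular(R, b)            # step 2b: solve R g_hat = b
    eta = solve_triangular(R, rng.standard_normal(a.shape[0]))  # step 3
    return g_hat + eta, g_hat
```

Solving Rη = u gives η with covariance R^{-1}R^{-T} = Ĝ, so ĝ + η is a draw from the target density without ever forming Ĝ.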

How efficient Algorithm 2 is depends on the number of nonzeros in R, denoted nnz(R). Fewer
nonzero values in R mean that computations involving R, including the factorization process
used to find it, require fewer operations. An easy way to reduce nnz(R) is to use ordering
methods, which we describe next. These methods improve the computational efficiency of
Algorithm 2 without changing the model or affecting the mixing properties of the underlying
Markov chain.

1.4.2 Ordering Methods

In general, the Cholesky factor of a sparse matrix is denser than the matrix itself: a subset
of the zero values in the original sparse matrix are “filled in” with nonzero values in the
Cholesky factor. Ordering methods reduce instances of fill-in by judiciously permuting the
rows and columns of the original sparse matrix.

Let P be a permutation matrix that reduces instances of fill-in, and let “←” denote the
assignment operator so that, for example, g ← P 0 g means “assign P 0 g to g.” Make the
assignments K ← P 0 KP and Q ← QP. Doing so automatically results in the assignments
g ← P 0 g, ĝ ← P 0 ĝ, and Ĝ ← P 0 ĜP. Algorithm 2 generates the permuted ordinate vector
relatively efficiently because the permuted precision matrix has its rows and columns sorted
in a way that reduces nnz(R). The computational gains from using ordering methods can
be substantial in iterative methods such as MCMC sampling, where similar calculations are
done many times over.

The problem of finding P that minimizes fill-in is NP-complete (Yannakakis, 1981).
Fortunately, there are fast heuristics for reducing (not necessarily minimizing) fill-in. One
such method is the approximate minimum degree (AMD) permutation, which is used in
MATLAB.

Ordering methods determine P based on the location of zeros in the original sparse matrix.
Since the location of zeros in Ĝ−1 is that of K, the permutation matrix P only needs to
be found once, and can be done so prior to running the sampler. Ordering methods are
easy to implement, as knowledge of the underlying theory is not required if computer code
for ordering methods is available. In MATLAB, for example, P is the third output to the
built-in Cholesky factorization function.
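SciPy does not ship an AMD routine, but its reverse Cuthill-McKee implementation is a related fill-reducing (bandwidth-reducing) heuristic that illustrates the same point. The toy comparison below, on a lattice-structured precision-like matrix scrambled by a random permutation, is our own construction and shows how reordering shrinks nnz(R).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def chol_nnz(A_dense, tol=1e-12):
    """Count nonzeros in the (dense, for illustration) Cholesky factor."""
    L = np.linalg.cholesky(A_dense)
    return int((np.abs(L) > tol).sum())

# SPD matrix with 2D-lattice sparsity, like a smoothness-prior precision:
n = 12
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n)) + sp.eye(n * n)).tocsr()

rng = np.random.default_rng(0)
p_bad = rng.permutation(n * n)                # a deliberately poor ordering
A_bad = A[p_bad][:, p_bad].tocsr()
p_rcm = reverse_cuthill_mckee(A_bad, symmetric_mode=True)  # heuristic reorder
A_rcm = A_bad[p_rcm][:, p_rcm].tocsr()

nnz_bad = chol_nnz(A_bad.toarray())
nnz_rcm = chol_nnz(A_rcm.toarray())
```

In runs of this sketch, the reordered factor has far fewer nonzeros than the scrambled one, mirroring the fill-in reductions reported later in Section 1.5.2.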

1.4.3 Conditional Sampling

Significant speedup can be achieved, albeit at a cost to mixing efficiency, by judiciously
assigning ordinates into blocks and sampling each block conditionally on the rest. Let P
be a permutation matrix, and make the assignments K ← P′KP and Q ← QP so that
g ← P′g, ĝ ← P′ĝ, and Ĝ ← P′ĜP. Let a = Q′V(zS − XS β). Partition g into B ≥ 2
blocks, and partition a and Ĝ conformably. Let Ĝ^{tl} denote the (t, l)-th block of Ĝ^{-1}. The
permutation matrix P is chosen so that Ĝ^{bb} has a narrow band for all b ∈ {1, . . . , B}.

Let gb denote the b-th ordinate block, and let g_{rb} = g r gb denote all other blocks. The
posterior full conditional on gb is (gb|y, β, Ω, g_{rb}, τ², z∗) ∼ N(ĝ_{b|rb}, Ĝ_{b|rb}), where

    Ĝ_{b|rb}^{-1} = Ĝ^{bb}   and   ĝ_{b|rb} = Ĝ_{b|rb} (a_b − Ĝ^{b(rb)} g_{rb}).

The precision matrix Ĝ_{b|rb}^{-1} has a small bandwidth by construction. Thus, the Cholesky
approach can be used to generate gb inexpensively (see Algorithm 3).

Algorithm 3.

1. Compute the Cholesky factor R of Ĝ_{b|rb}^{-1} such that Ĝ_{b|rb}^{-1} = R′R.

2. Let c = Rĝ_{b|rb}. Solve R′c = a_b − Ĝ^{b(rb)} g_{rb} for c, then Rĝ_{b|rb} = c for ĝ_{b|rb}.

3. Generate u ∼ N(0_{wb×1}, I_{wb}), where wb is the number of ordinates in the b-th ordinate
   block. Solve Rη = u for η and add ĝ_{b|rb} to get gb = ĝ_{b|rb} + η.

1.4.4 Grouping

Covariates can be grouped, i.e., continuous covariates can be discretized, and discrete co-
variates can be coarsened. Grouping reduces the overall number of ordinates, reduces the

number of unidentified ordinates, leads to repetition, and prevents the ordinate vector from
growing too quickly relative to sample size. Benefits include faster runtimes, better mix-
ing of the Markov chain, lower storage requirements, and increased interpretability of the
estimation results. For an excellent coverage of the topic, see Poirier (2020).
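As a small illustration of grouping, a continuous covariate can be discretized into decile bins, collapsing many distinct covariate values into a handful of ordinates. The variable names and the decile choice below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(20, 100, size=1000)              # hypothetical covariate

edges = np.quantile(age, np.linspace(0, 1, 11))    # decile boundaries
bins = np.digitize(age, edges[1:-1])               # group labels 0..9
mids = 0.5 * (edges[:-1] + edges[1:])              # group representatives
age_grouped = mids[bins]                           # grouped covariate
```

Here roughly 1,000 distinct covariate values collapse to at most 10, so the ordinate vector stays small and every ordinate is hit repeatedly in-sample.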

1.5 Simulation

We carried out several simulation studies to examine the performance of the proposed non-
parametric estimation method. We simulated data for the model

yi = g(wi1 , wi2 ) + εi ,

where yi is a continuous outcome and (εi |σ 2 ) ∼ N (0, σ 2 ) is an iid error term. The error
variance was set to σ 2 = 0.1. The object of interest is the nonadditive bivariate function

g(wi1 , wi2 ) = sin(wi1 wi2 ) + wi1 cos(wi2 ),

where the covariates wi1 ∈ {−3, −2.9, . . . , 2} and wi2 ∈ {−2, −1.9, . . . , 3} were generated
from discrete uniform distributions. The prior on σ 2 is IG(ν/2, ψ/2). Prior hyperparameters
were set to Gj0 = 10 × I2 and γ = δ = ν = ψ = 1.

1.5.1 Fit

Simulating a sample of size n = 500 yielded a 51 × 51 grid (lattice) over the covariate space.
Of the m = 2,601 grid points, only 450 were realized in-sample. The use of discrete covariates
resulted in repetition.

Posterior sampling was carried out using Algorithm 4. The incidence matrix Q here is n × m,
where Qit = 1 when wi′ equals the t-th row of D, and Qit = 0 otherwise. The sampler ran
for 11,000 cycles, the first 1,000 of which were discarded for burn-in.

Algorithm 4.

1. Sample g from (g|y, τ², σ²) ∼ N(ĝ, Ĝ), where

       Ĝ^{-1} = K/τ² + Q′Q/σ²   and   ĝ = ĜQ′y/σ².

2. Sample τ² from (τ²|y, g) ∼ IG[(γ + m)/2, (δ + g′Kg)/2].

3. Sample σ² from (σ²|y, g) ∼ IG[(ν + n)/2, (ψ + ε′ε)/2], where ε = y − Qg.
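The three steps of Algorithm 4 can be sketched end-to-end on toy data. The univariate grid, the second-difference penalty matrix K (with a small jitter to keep it invertible), and all tuning choices below are our own simplifications, not the chapter's exact bivariate design.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

rng = np.random.default_rng(3)

# Toy data: a univariate grid version of the simulation design.
m, n = 20, 300
grid = np.linspace(-3.0, 3.0, m)
w = rng.integers(0, m, size=n)                 # each unit lands on a grid point
g_true = np.sin(grid)
y = g_true[w] + rng.normal(0.0, np.sqrt(0.1), size=n)

Q = np.zeros((n, m)); Q[np.arange(n), w] = 1.0  # incidence matrix
D2 = np.diff(np.eye(m), n=2, axis=0)            # second-difference operator
K = D2.T @ D2 + 1e-6 * np.eye(m)                # jittered penalty matrix

gamma = delta = nu = psi = 1.0
tau2, sig2 = 1.0, 1.0
keep = []
for it in range(600):
    # Step 1: g | y, tau2, sig2 ~ N(g_hat, G)
    G_inv = K / tau2 + Q.T @ Q / sig2
    R = cholesky(G_inv)                         # upper, R'R = G_inv
    g_hat = cho_solve((R, False), Q.T @ y / sig2)
    g = g_hat + solve_triangular(R, rng.standard_normal(m))
    # Step 2: tau2 | y, g ~ IG((gamma + m)/2, (delta + g'Kg)/2)
    tau2 = 1.0 / rng.gamma((gamma + m) / 2.0, 2.0 / (delta + g @ K @ g))
    # Step 3: sig2 | y, g ~ IG((nu + n)/2, (psi + e'e)/2)
    e = y - Q @ g
    sig2 = 1.0 / rng.gamma((nu + n) / 2.0, 2.0 / (psi + e @ e))
    if it >= 100:
        keep.append(g)

g_post = np.mean(keep, axis=0)
rmse = np.sqrt(np.mean((g_post - g_true) ** 2))
```

Inverse-gamma draws are taken as reciprocals of gamma draws, using NumPy's shape/scale parameterization.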

Simulation results are presented in Figure 1.1. The top-left graph shows the 51×51 covariate
lattice obtained based on the unique in-sample values for wi1 and wi2 . Solid circles represent
realized covariate vectors. The nonparametric function g is shown in the top-middle graph.
Point estimates (posterior means) for the ordinates in g are shown in the top-right graph.

The bottom graph in Figure 1.1 compares ordinate point estimates to their true values. 95%
credibility intervals (CI) are also shown. It is interesting to note that, of the 2,601 ordinates
being estimated, only 450 are identified. Yet, the graph shows that we have learned something
new about every ordinate from the data, including the 2,151 ordinates that are not identified.
Learning takes place on unidentified ordinates because ordinates are dependent in the prior
(Poirier, 1998, 2020). The Markov process smoothness prior on g facilitates how ordinates
share information over the covariate lattice.

Figure 1.1: Simulation Output for n = 500

1.5.2 Computational Efficiency

This subsection looks at how runtimes are affected by the different ordinate vector sampling
schemes discussed in Section 1.4. Using the same simulation setup as before, the Gibbs
sampler that generates the ordinate vectors directly took over 9 hours to go through 11,000
cycles. In comparison, the sampler that uses the Cholesky approach took 5 minutes and 53
seconds to run. Ordering methods reduced this to 2 minutes and 28 seconds. The sampler
that uses conditional sampling ran in 14 seconds.

To provide some intuition on how permuting and blocking ordinates results in faster run-
times, Figure 1.2 shows the sparsity pattern of the precision matrix Ĝ−1 under the Cholesky,
ordering methods (OM), and conditional sampling (CS) approaches. The left graph shows
the precision matrix in its original banded form. The Cholesky factor here had 265,045

nonzero values. The precision matrix shown in the middle had its rows and columns per-
muted to reduce instances of fill-in. The Cholesky factor here had 181,137 nonzero entries.
The precision matrix on the right was generated using a sorting algorithm that produces
tridiagonal Ĝbb for all b ∈ {1, . . . , B}. The Cholesky factors of the Ĝbb had 2,550 nonzeros
each, where B = 3.

Figure 1.2: Precision Matrix Sparsity Patterns

1.5.3 Mixing


We gauged the mixing performance of the sampler using inefficiency factors. Let {θ^{(g)}}
denote a set of MCMC draws on a generic scalar parameter θ. The inefficiency factor on θ is
computed using the formula 1 + 2 ∑_{l=1}^{L−1} (1 − l/L) ρ̂_l, where ρ̂_l is the sample autocorrelation
of {θ^{(g)}} at lag l, and L is the lag length at which autocorrelation tapers off. The inefficiency
factor is numerically equivalent to the ratio of variances

    var(θ̄_{MCMC}) / var(θ̄_{IID}),

where the sample mean in the numerator is computed using the set of MCMC draws {θ^{(g)}},
and the sample mean in the denominator is calculated based on a hypothetical set of IID
draws. An inefficiency factor of 2 means that twice as many MCMC draws are needed to
compute the sample mean to the same level of precision as the set of IID draws.
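The formula above can be computed directly from a vector of draws. The implementation below, with a fixed tapering lag L supplied by the user, is a standard sketch; the function name and default window are our own choices.

```python
import numpy as np

def inefficiency_factor(draws, L):
    """1 + 2 * sum_{l=1}^{L-1} (1 - l/L) * rho_l, where rho_l is the
    lag-l sample autocorrelation of the chain."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    n = x.size
    # sample autocovariances at lags 0, ..., L-1
    acov = np.array([np.dot(x[: n - l], x[l:]) / n for l in range(L)])
    rho = acov / acov[0]
    lags = np.arange(1, L)
    return 1.0 + 2.0 * np.sum((1.0 - lags / L) * rho[1:])
```

For IID draws the factor is close to one, while a strongly autocorrelated chain (e.g., an AR(1) with coefficient 0.9) yields a factor far above one.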

Inefficiency factors were computed for each ordinate in g. Figure 1.3 uses boxplots to show the
distribution of inefficiency factors under different blocking schemes. Both the Cholesky and
ordering method approaches are one-block schemes since they generate all ordinates jointly.
The conditional sampling method is multi-block. Results here are based on a simulated
data set that uses wi1 ∈ {−3, −2.5, . . . , 2}, wi2 ∈ {−2, −1.5, . . . , 3}. Under the single-block
scheme, inefficiency factors were all near or equal to one, indicating that the MCMC draws
were essentially as good as IID. Inefficiency factors under the multi-block scheme were typi-
cally much larger.

Although slower, generating all ordinates jointly results in excellent mixing. In contrast,
conditional sampling runs quickly but mixes slowly. Grouping can help in both cases. For
the single-block sampling schemes, grouping speeds up sampling. For the multi-block scheme,
grouping increases instances of repetition and decreases the number of unidentified ordinates,
leading to improved mixing.

Figure 1.3: Inefficiency Factor Boxplots

1.6 Application

1.6.1 Data

We applied the estimation framework developed in the preceding sections to a sample of
individuals who took part in the 5th Nationwide Person Trip Survey (NPTS). The survey
was conducted in Japan by the Ministry of Land, Infrastructure, Transport and Tourism
(MLIT). The study surveyed 35,000 households from 70 cities. Members of participating
households kept a travel diary for one day. A sample from the 4th NPTS was used in
Parady et al. (2015) to study the relationship between the built environment and nonwork
trip frequency. For our study, we restrict the sample to those in the age range 20 to 100,
who completed their travel diary on a weekday.

Car use is measured on a three-bin ordinal scale, where the first bin corresponds to no car
trips, the second bin corresponds to 1 to 2 car trips, and the third bin corresponds to 3+
car trips. The percentages of the sample in bins one, two, and three are 53.64%, 27.54%, and
18.83%, respectively. Individual-level controls include age and dummy variables for male,
unemployment, and housewife/househusband. Household-level controls include household
size and dummy variables for living in a multi-dwelling unit (MDU), having a motorcycle,
having a motorized bicycle, and having a bicycle at the household. The survey did not
inquire about income and education levels.

Population density data were collected from the 2010 national census. We used the percent
of the population in the age range 15 to 64 at the municipality level as our density measure.
This variable was instrumented on the percent of the population in the age range 15 to 64
at the prefecture level. See Fang (2008), Brueckner and Largey (2008), Brownstone and
Golob (2009), and Brownstone and Fang (2014) for studies that treat population density as
a potentially endogenous variable. Our study follows the identification strategy of Brueckner

and Largey (2008) in instrumenting density on aggregate density. Descriptive statistics are
provided in Table 1.1.

Table 1.1: Descriptive Statistics

Variables Mean SD

Endogenous Variables
Driver's License 0.793
Population Density (Municipality) 63.034 3.191
Car Use

Individual-Level Characteristics
Age 53.939 17.023
Male 0.486
Unemployed 0.229
Housewife or Househusband 0.150

Household-Level Characteristics
Household Size 3.091 1.411
MDU 0.229
Motorcycle 0.077
Motorized Bicycle 0.144
Bicycle 0.676

Prefecture-Level Characteristics
Population Density (Prefecture) 62.800 2.493

1.6.2 Results

1.6.2.1 Results for β1 and β3

Regression results for β1 and β3 are reported in Table 1.2. We report the mean, standard de-
viation, the 2.5th percentile, and the 97.5th percentile of the marginal posterior distribution
on each coefficient.

Based on our results for Equation (1.1), we see that car use is positively associated with
household size. This relationship seems plausible considering adults in multigenerational
households typically have children and/or parents that need a ride to their destinations.
Living in an MDU, having a motorcycle or a bicycle (motorized and nonmotorized), being a
housewife or househusband, and being unemployed are associated with less driving. These
results are also consistent with our expectations: MDUs are typically closer to where goods
and services are provided than single-dwelling units; transportation modes are substitutes
to some extent; and housewives/househusbands and those not employed spend more time at
home than those who work for pay.

Focusing now on our results for Equation (1.3), we see that men are more likely than women
to have their driver’s licenses. Being licensed and having a motorcycle are positively related,
presumably because those that are licensed to operate motorcycles are also licensed to drive
cars. Having a bicycle, being a housewife/househusband, and being unemployed are asso-
ciated with being unlicensed. Age and density are inversely related to the probability of
having a driver’s license.

Table 1.2: Estimation Results for β

Coefficient Covariate Mean SD 2.5% 97.5%

β1 Household Size 0.018 0.004 0.011 0.027


Male 0.029 0.015 -0.002 0.057
MDU -0.154 0.014 -0.181 -0.126
Motorcycle -0.102 0.019 -0.140 -0.066
Motorized Bicycle -0.149 0.016 -0.179 -0.117
Bicycle -0.074 0.012 -0.097 -0.050
Housewife/Househusband -0.213 0.017 -0.248 -0.181
Unemployed -0.244 0.017 -0.278 -0.212

β3 Household Size -0.005 0.006 -0.015 0.006


Male 1.027 0.014 1.000 1.055
MDU -0.295 0.018 -0.335 -0.260
Motorcycle 0.113 0.026 0.062 0.166
Motorized Bicycle -0.120 0.019 -0.159 -0.085
Bicycle -0.233 0.015 -0.262 -0.204
Housewife/Househusband -0.182 0.018 -0.217 -0.146
Unemployed -0.616 0.016 -0.646 -0.583
Age -1.975 0.030 -2.032 -1.912
Population Density (Municipality) -0.519 0.041 -0.601 -0.439

1.6.2.2 Results for g

Figure 1.4 shows cross sections of g(Age, Density). Age and density were grouped into
dodeciles. The left graph depicts the relationship between age and car use in low and high
density areas. As expected, we find that the effect of age is nonlinear, with car use increasing
with age for the young, relatively flat for the middle-aged, and decreasing with age for
the old. Nonlinearities can be explained in terms of life-cycle events: The young drive more
as the probability of being independent, having one’s own vehicle, and having children in-
creases with age. Car use stabilizes as people settle down as middle-aged adults. Eventually,
car use decreases with age due to aging.

In a parametric model, the effect of age is usually modeled as a quadratic function. Figure
1.4 suggests such a quadratic function would not be appropriate here, as it cannot capture
the three “stages” g displays over age.

The right graph depicts the relationship between density and car use for the young and for
the elderly. Density and car use are found to be inversely related. This result is intuitive
because high density is associated with better accessibility to goods, services, and public
transit. We also see that density effects are weaker on the elderly. From an urban planning
point of view, knowing that the elderly may not be as receptive to changes made to the built
environment may be important.

Our results for density display the strength of nonadditive modeling. By design, additive
models cannot capture differences in effects, at least not without making additional
assumptions.

Figure 1.4: Cross Sections of g

1.7 Conclusion

We presented a semiparametric/nonparametric estimation methodology that deals with
nonadditive effects, sample selection, and endogeneity. Nonadditive estimation is achieved using
a Markov process smoothness prior over the ordinate vector. This is shown to be equivalent
to specifying a multilinear normal prior over the ordinate tensor. Methods for reducing the
computational burden of the estimation algorithm were presented and studied in detail. Ap-
plying the method on a sample of Japanese travel survey participants, we detected nonlinear
and nonadditive effects of age and density on car use.

The proposed econometric framework can be extended in several ways. One possible exten-
sion is to include an additive component to the nonparametric function so that covariates can
enter the mean function linearly, additively, or nonadditively. Theory and model comparison
methods can be used to decide where a covariate should be included.

We discussed grouping as a way to reduce the computational burden of the proposed
estimation algorithm. Grouping can be combined with the Cholesky and ordering method
sampling approaches to speed up estimation. It can be used with conditional sampling to
reduce the number of unidentified ordinates, which helps with mixing. See Poirier (2020) for
a novel nonparametric estimation methodology based on grouping.

Chapter 2

A Model for Built Environment


Effects on Mode Usage

We consider in this paper the problem of estimating the effect of the built environment on
transportation mode choice and usage when a large fraction of the population under study
does not have the option to drive. We address in particular two important ways in which a
mode usage model could be misspecified in this context: (1) assuming that built environment
effects are the same for those who can and cannot legally drive, and (2) treating the choice
to be licensed as exogenous. We present an econometric framework to overcome these mis-
specification concerns and apply it to travel diary data to study whether the transportation
habits of the Japanese elderly can be shaped using the tools of urban planning. To motivate
our model, we begin by elaborating on how (1) and (2) could lead to incorrect inference.

Previous studies have found that an increase in density leads to more nonmotorized travel
and less driving (Bento et al., 2005; Leck, 2006; Parady et al., 2015). These two effects
seem highly plausible because jointly, they appeal to the notion of a substitution effect: all
else held constant, improving accessibility to goods and services results in people opting to

walk or bike instead of driving. Of course, if this were true, then densification should have
a different effect on a person who cannot drive. If those who cannot drive make up a large
portion of the population of interest, then failing to distinguish the density effect between the
nonlicensed and licensed groups may lead to a serious case of confounding. This is shown in
Figure 2.1. In the left pane, the licensed far outnumber the unlicensed, resulting in a line of
best fit that resembles the former’s conditional expectation function (CEF). Confounding is
inconsequential here insofar as the average effect is concerned. If the two groups are similar
in size, however, then confounding can be severe as shown in the right pane.

Figure 2.1: Confounding Effects

One possible solution to the problem of confounding is to fit two models, one for each group,
so that built environment effects are estimated separately. Fitting two models also allows
for different unobserved substitution patterns in the error term, and furthermore, renders
trivial the task of excluding driving as a feasible transportation mode option for those not
licensed. The issue with this strategy is that it treats license choice as exogenous. To see
why this is problematic, suppose that improving accessibility to rail leads to less driving by
drivers and also fewer drivers overall. Fitting a model for each group may reveal the former

effect but not the latter.

To address these misspecification concerns, we build a multivariate ordinal outcomes model


with a binary selection mechanism to estimate the effect of urban form on transportation
mode choice and usage. A graphical representation of our model is given in Figure 2.2. This
figure shows that, in the first stage, individuals select into either the nonlicensed group or
the licensed group. This decision is observed for all units in the sample. In the second
stage, individuals decide on how much of each available mode option to use. Available mode
options are walking/biking, public transit, getting a ride, and for those who can do so legally,
driving. Mode usage is measured on a three-bin ordinal scale. This is done for practical and
theoretical reasons, as we will explain later. A key feature of our model is that mode usage
is modeled jointly. Correlation patterns not captured by the included covariates appear in
the covariance matrix of the error vector.

Figure 2.2: Multivariate Ordinal Outcomes Model with a Binary Selection Mechanism

[Diagram: in the first stage, the Driver's License decision splits the sample into the
Nonlicensed and Licensed groups. The Nonlicensed branch has the mode options
Walking/Biking, Public Transit, and Getting a Ride; the Licensed branch has the same
three options plus Driving.]

Despite its intuitive appeal and innocuous appearance, the econometric modeling of the
proposed model is not straightforward: the license outcome is binary whereas the mode usage
outcomes are ordinal, all outcomes are discrete and correlated, the likelihood function is comprised of
many high-dimensional integrals, and the modeling of counterfactuals yields unidentified
parameters. Fortunately, the Bayesian paradigm offers practical solutions to all of these
problems. Our estimation strategy relies on those of Albert and Chib (1993), Chib (2007),

and Chib et al. (2009).

Empirically, our econometric model is motivated by growing concerns for the future of road
safety in Japan, where babyboomers will reach the age of 75+ by 2025 (the so-called 2025
problem). A report by the National Police Agency Transportation Bureau (NPATB) in
Japan found that in 2017, the average number of fatal accidents involving drivers age 75+
was double that of drivers under the age of 75. It also found that the number of individuals
75+ with driver’s licenses doubled in the last decade (NPATB, 2018). The ultimate goal
of this paper is to determine whether urban planning can help ensure a high traffic safety
standard in Japan in the face of the 2025 problem.

A separate contribution we make to the urban/transportation economics literature is the


cross entropy index for measuring land use imbalance. The literature has pointed out the
many problems with the entropy index, which numerous studies have used to measure land
use mix/balance. In this paper, we argue that if a reference distribution is available, a
natural alternative to the entropy index is the cross entropy index. We do this by showing
that the entropy and cross entropy indices are elegantly linked: If the reference distribution
is the uniform distribution, then the entropy index is a valid measure of land use balance,
and furthermore, balance and imbalance sum to unity.
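One way to formalize the link claimed above: take the entropy index as normalized entropy and the cross entropy index as normalized relative entropy against a reference distribution q. With q uniform, the two sum to one. This formalization is our reading of the claim, not a quotation of the chapter's definitions.

```python
import numpy as np

def entropy_index(p):
    """Normalized entropy of land-use shares p; equals 1 when shares
    are perfectly balanced (uniform)."""
    p = np.asarray(p, dtype=float)
    logs = np.log(np.where(p > 0, p, 1.0))   # treat 0*log(0) as 0
    return -np.sum(p * logs) / np.log(p.size)

def cross_entropy_index(p, q):
    """Normalized relative entropy KL(p||q): an 'imbalance' measure
    against a reference distribution q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    logs = np.log(np.where(p > 0, p / q, 1.0))
    return np.sum(p * logs) / np.log(p.size)
```

Under these definitions, KL(p‖uniform) = log K − H(p), so dividing by log K makes balance and imbalance sum to unity exactly.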

The remainder of this paper is organized as follows: Section 2.1 elaborates on the proposed
econometric model with emphasis on strategizing with identification restrictions and non-
identification. MCMC estimation is discussed in Section 2.2. The cross entropy index and
data are covered in Sections 2.3 and 2.4, respectively. Results are given in Section 2.5, and
Section 2.6 concludes.

2.1 Model

The econometric model represented graphically by Figure 2.2 contains eight equations: a
selection equation, three mode usage equations for the nonlicensed group, and four mode
usage equations for the licensed group. Following the latent variables framework of Albert
and Chib (1993), we have for sample unit i ∈ {1, . . . , n} :

Selection Mechanism - License :   zi1 = xi1′β1 + εi1    (2.1)

Nonlicensed - Walking/Biking :    zi2 = xi2′β2 + εi2    (2.2)

Nonlicensed - Public Transit :    zi3 = xi3′β3 + εi3    (2.3)

Nonlicensed - Riding :            zi4 = xi4′β4 + εi4    (2.4)

Licensed - Walking/Biking :       zi5 = xi5′β5 + εi5    (2.5)

Licensed - Public Transit :       zi6 = xi6′β6 + εi6    (2.6)

Licensed - Riding :               zi7 = xi7′β7 + εi7    (2.7)

Licensed - Driving :              zi8 = xi8′β8 + εi8,   (2.8)

where, for equation j ∈ {1, . . . , 8}, xij is a vector of covariates, βj is a vector of coefficients,
and εij is an error term. Let N and L denote the nonlicensed and licensed groups, respectively.
Due to the nature of the data generating process, the latent data (zi5, zi6, zi7, zi8) is
missing for i ∈ N. Likewise, (zi2, zi3, zi4) is missing for i ∈ L.

The non-missing latent data vectors ziN = (zi1, zi2, zi3, zi4) and ziL = (zi1, zi5, zi6, zi7, zi8)
relate to their discrete observed counterparts yiN = (yi1, yi2, yi3, yi4) and
yiL = (yi1, yi5, yi6, yi7, yi8) through the link functions

    yi1 = 1{zi1 > 0}   and   yij = 1{zij > 0} + 1{zij > 1} for j > 1.    (2.9)

Here, 1 {·} is the indicator function which equals one when its argument is true and zero
otherwise. The link functions in (2.9) imply that yi1 ∈ {0, 1} is binary, yij ∈ {0, 1, 2} for
j > 1 is ordinal, and every cutpoint is set to either zero or one.
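The link functions in (2.9) are easy to state in code; the helper name below is our own.

```python
import numpy as np

def link(z1, zj):
    """Map latent utilities to observed outcomes per (2.9):
    a binary selection outcome and three-bin ordinal usage outcomes,
    with cutpoints fixed at zero and one."""
    y1 = int(z1 > 0)
    zj = np.asarray(zj, dtype=float)
    yj = (zj > 0).astype(int) + (zj > 1).astype(int)
    return y1, yj
```

For example, a latent usage value in (−∞, 0] maps to bin 0, a value in (0, 1] to bin 1, and a value in (1, ∞) to bin 2.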

Fixing every cutpoint identifies Equations (2.2) through (2.8), freeing up all but the top-left
corner of the error vector covariance matrix (Nandram and Chen, 1996). Although
restricted covariance matrices are harder to deal with than unrestricted covariance matrices,
the restriction here actually simplifies both the specification and estimation of the model.
To see why, suppose that the error vector εi = (εi1, εi2, εi3, εi4, εi5, εi6, εi7, εi8)′ has the normal
distribution N(0, Ω), where the restricted covariance matrix Ω is given by

        ⎡ 1    Ω12  Ω13  Ω14  Ω15  Ω16  Ω17  Ω18 ⎤
        ⎢ Ω21  Ω22  Ω23  Ω24   ·    ·    ·    ·  ⎥
        ⎢ Ω31  Ω32  Ω33  Ω34   ·    ·    ·    ·  ⎥
        ⎢ Ω41  Ω42  Ω43  Ω44   ·    ·    ·    ·  ⎥
    Ω = ⎢ Ω51   ·    ·    ·   Ω55  Ω56  Ω57  Ω58 ⎥ .
        ⎢ Ω61   ·    ·    ·   Ω65  Ω66  Ω67  Ω68 ⎥
        ⎢ Ω71   ·    ·    ·   Ω75  Ω76  Ω77  Ω78 ⎥
        ⎣ Ω81   ·    ·    ·   Ω85  Ω86  Ω87  Ω88 ⎦

The “1” in the top-left corner is the usual scaling restriction for the binary probit model, and
the dots “ · ” represent unidentified parameters. Following Chib (2007), we do not model the
unidentified covariance terms. This is different from assuming that the unidentified covariance
terms are zero, and it does not mean that the data are uninformative about the unidentified
parameters (see Koop and Poirier (1997) and Poirier and Tobias (2003) for a discussion of
the topic). “Ignoring” the unidentified terms is convenient because the identified portion of

Ω can be partitioned into ΩN and ΩL , where

 
    ΩN = ⎡ 1    Ω12  Ω13  Ω14 ⎤
         ⎢ Ω21  Ω22  Ω23  Ω24 ⎥
         ⎢ Ω31  Ω32  Ω33  Ω34 ⎥
         ⎣ Ω41  Ω42  Ω43  Ω44 ⎦

and

    ΩL = ⎡ 1    Ω15  Ω16  Ω17  Ω18 ⎤
         ⎢ Ω51  Ω55  Ω56  Ω57  Ω58 ⎥
         ⎢ Ω61  Ω65  Ω66  Ω67  Ω68 ⎥ .
         ⎢ Ω71  Ω75  Ω76  Ω77  Ω78 ⎥
         ⎣ Ω81  Ω85  Ω86  Ω87  Ω88 ⎦

The sole overlap between ΩN and ΩL is Ω11, which is already set to one. The scaling
restriction simplifies the estimation of the covariance matrix by making it so that the identified
terms in Ω update using data from either N or L and not both. For ease of specification,
we work with {ΩN, ΩL} instead of Ω. Group-specific error vectors can now be defined as
εiN = (εi1, εi2, εi3, εi4)′ ∼ N(0, ΩN) and εiL = (εi1, εi5, εi6, εi7, εi8)′ ∼ N(0, ΩL). There are
several ways to deal with covariance matrices that have a single on-diagonal restriction; see
McCulloch et al. (2000), Chan and Jeliazkov (2009), and Chib et al. (2009). This paper uses
Chib et al. (2009) as it is more efficient than McCulloch et al. (2000) and easier to implement
than Chan and Jeliazkov (2009).

As a final note, specification and estimation can be simplified in the way described above as
long as mode usage is measured on an ordinal scale with at least three bins. The number of
bins may vary across modes and groups. The benefit to using three bins, however, is that
the cutpoints need not be sampled (Nandram and Chen, 1996; Jeliazkov et al., 2008).

2.1.1 Likelihood Function

Let β = (β1′, β2′, β3′, β4′, β5′, β6′, β7′, β8′)′ denote the coefficient vector. The selection matrices

    SN = ⎡ I 0 0 0 0 0 0 0 ⎤
         ⎢ 0 I 0 0 0 0 0 0 ⎥
         ⎢ 0 0 I 0 0 0 0 0 ⎥
         ⎣ 0 0 0 I 0 0 0 0 ⎦

and

    SL = ⎡ I 0 0 0 0 0 0 0 ⎤
         ⎢ 0 0 0 0 I 0 0 0 ⎥
         ⎢ 0 0 0 0 0 I 0 0 ⎥
         ⎢ 0 0 0 0 0 0 I 0 ⎥
         ⎣ 0 0 0 0 0 0 0 I ⎦

select those βj ’s in the coefficient vector pertaining to each group so that SN β = (β1′, β2′, β3′, β4′)′
and SL β = (β1′, β5′, β6′, β7′, β8′)′. Data matrices are in seemingly unrelated regression (SUR)
form:

    XiN = ⎡ xi1′  0     0     0    ⎤
          ⎢ 0     xi2′  0     0    ⎥
          ⎢ 0     0     xi3′  0    ⎥
          ⎣ 0     0     0     xi4′ ⎦

and

    XiL = ⎡ xi1′  0     0     0     0    ⎤
          ⎢ 0     xi5′  0     0     0    ⎥
          ⎢ 0     0     xi6′  0     0    ⎥ .
          ⎢ 0     0     0     xi7′  0    ⎥
          ⎣ 0     0     0     0     xi8′ ⎦

Let g ∈ {N , L} , NN = |N | , and NL = |L| . Define the vectors and matrices

     
 z1g   X1g   1g 
 .   .   . 
zg =  .  .  . 
 . , Xg = 
 . , and g = 
 . .
     
zNg g XN g g N g g

The latent data generating process is given by

zg = Xg Sg β + εg ∼ N (Xg Sg β, INg ⊗ Ωg ),

where “⊗” is the Kronecker product.

For notational ease and faster runtimes, we also make use of the matrix normal representation
of the latent data generating process: Let vec⁻¹p,q (·) denote the inverse of the vectorization
function vec(·) that takes a column vector as its input and outputs a p × q matrix. Let Jg
denote the number of equations that pertain to group g so that JN = 4 and JL = 5. Define
the matrices

Z′g = vec⁻¹Jg,Ng (zg ),    M′g = vec⁻¹Jg,Ng (Xg Sg β),    and    E′g = vec⁻¹Jg,Ng (εg ).

Then
Zg = Mg + Eg ∼ MN Ng,Jg (Mg , INg ⊗ Ωg ),

where MN denotes the matrix normal distribution.
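To make this representation concrete, the following numpy sketch (ours, with an assumed Ωg and mean matrix — not code from the dissertation) draws Zg = Mg + Eg by giving every row of Eg the distribution N(0, Ωg):

```python
import numpy as np

def simulate_latent(M, Omega, rng):
    """Draw Z ~ MN(M, I_N (x) Omega): each row of Z is N(M[i], Omega)."""
    N, J = M.shape
    L = np.linalg.cholesky(Omega)           # Omega = L L'
    E = rng.standard_normal((N, J)) @ L.T   # rows are iid N(0, Omega)
    return M + E

rng = np.random.default_rng(0)
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])
M = np.zeros((5000, 2))
Z = simulate_latent(M, Omega, rng)
```

With a large number of rows, the sample covariance of the rows of Z should be close to Ωg, which is a quick sanity check on the representation.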

Regions of truncation are given by BiN = Bi1 × Bi2 × Bi3 × Bi4 and BiL = Bi1 × Bi5 × Bi6 ×
Bi7 × Bi8 , where







Bi1 = (−∞, 0] if yi1 = 0 and Bi1 = (0, ∞) if yi1 = 1, while for j > 1,

Bij = (−∞, 0] if yij = 0,    Bij = (0, 1] if yij = 1,    and    Bij = (1, ∞) if yij = 2.
 if yij = 2

Let θ = {β, ΩN , ΩL } denote the set of identified model parameters, and let Bg be the
collection of truncation regions for all individuals in group g. The data-augmented likelihood
is given by
f (y, z|θ) = ∏g 1{zg ∈ Bg } fN (zg | Xg Sg β, INg ⊗ Ωg ),

where y and z are the observed data and the non-missing latent data, respectively. Note that,
because the likelihood is not augmented with missing latent data, it is not a complete-data
likelihood.

2.1.2 Prior Distribution

The model is completed by specifying a prior distribution over θ. Let β, ΩN , and ΩL be
independent a priori. As per convention, the coefficient vector has a normal prior: β ∼
N (b0 , B0 ). We follow Chib et al. (2009) in specifying a prior for Ωg : Partition Ωg as

 
Ωg = [ 1      Ωg12
       Ωg21   Ωg22 ]

and let Ωg22·1 = Ωg22 − Ωg21 Ωg12 . Note that the Jacobian of the one-to-one transformation
{Ωg21 , Ωg22 } → {Ωg21 , Ωg22·1 } is 1. Let Qg be a positive definite matrix of the same size as
Ωg . We induce a prior for Ωg by specifying the following prior over the set of transformed
parameters:

Ωg22·1 ∼ IW (Qg22·1 , νg )    and    (Ωg21 | Ωg22·1 ) ∼ N (Qg21 Q⁻¹g11 , Ωg22·1 Q⁻¹g11 ).

Our prior is given by

π (θ) = fN (β | b0 , B0 ) ∏g fN (Ωg21 | Qg21 Q⁻¹g11 , Ωg22·1 Q⁻¹g11 ) fIW (Ωg22·1 | Qg22·1 , νg ).

2.2 Estimation

Putting together the data-augmented likelihood and prior, we get the data-augmented pos-
terior
π (θ, z|y) ∝ f (y, z|θ) π (θ) .

Estimation and inference is facilitated via posterior Gibbs sampling (Gelfand and Smith,
1990). Our Gibbs estimation algorithm can be summarized in the following way:

Gibbs Sampling Algorithm

In each MCMC iteration:

1. Sample β ∼ (β|Z, θ \ β)

2. For g ∈ {N , L} :

(a) Sample Ωg22·1 ∼ (Ωg22·1 |Zg , β)

(b) Sample Ωg21 ∼ (Ωg21 |Zg , β, Ωg22·1 )

(c) Construct Ωg from Ωg22·1 and Ωg21

3. For each i ∈ N ∪ L :

(a) For each j ∈ {1, 2, 3, 4} for i ∈ N and j ∈ {1, 5, 6, 7, 8} for i ∈ L :

i. Sample zij ∼ (zij |yij , zig \ zij , β, Ωg )

Note the following: First, following Chib et al. (2009), counterfactuals are marginalized out
of the sampler. Collapsing the Gibbs sampler in this way simplifies prior inputs, reduces
storage costs as well as computational demands, and improves mixing (Liu, 1994; Chib, 2007;
Li, 2011). Second, covariance matrices are sampled in a single block, which improves the
mixing of the MCMC chain. Third, latent data are sampled within three nested for-loops.
The outer, middle, and inner loops correspond to the MCMC iteration, sample unit, and
equation, respectively. This triple for-loop setup is common in the Bayesian analysis
of multivariate discrete data, and it is slow to run. We describe a way to vectorize out
the middle for-loop in subsection 2.2.3 to cut code runtime. Implementation details for our
Gibbs sampler are given below.
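To illustrate the data-augmentation pattern that organizes the sampler above, here is a minimal, self-contained sketch in the simplest single-equation setting — a textbook probit sampler that alternates the latent-data step and the coefficient step. This is our own illustration, not the dissertation's multivariate sampler, which additionally handles the covariance matrices and group structure:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, b0, B0inv, n_iter=800, seed=0):
    """Toy data-augmentation Gibbs sampler for a single-equation probit:
    alternate z | y, beta (truncated normal) and beta | z (normal)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    B = np.linalg.inv(B0inv + X.T @ X)   # posterior covariance, fixed across iterations
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for s in range(n_iter):
        mu = X @ beta
        # standardized truncation bounds: z_i > 0 if y_i = 1, z_i <= 0 otherwise
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        b = B @ (B0inv @ b0 + X.T @ z)
        beta = rng.multivariate_normal(b, B)
        draws[s] = beta
    return draws

# small simulated check with a known coefficient vector
rng = np.random.default_rng(1)
beta_true = np.array([0.5, -1.0])
X = np.column_stack([np.ones(800), rng.standard_normal(800)])
y = (X @ beta_true + rng.standard_normal(800) > 0).astype(int)
draws = probit_gibbs(y, X, b0=np.zeros(2), B0inv=np.eye(2) / 100.0)
post_mean = draws[200:].mean(axis=0)
```

The two steps mirror Steps 1 and 3 of the algorithm above; the posterior mean of the retained draws should land near the data-generating coefficients.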

2.2.1 Sampling β

The posterior full conditional distribution for β is N (b, B) , where

b = B ( B0⁻¹ b0 + Σg S′g X′g vec(Ω⁻¹g Z′g ) )

and

B = ( B0⁻¹ + Σg S′g X′g (INg ⊗ Ω⁻¹g ) Xg Sg )⁻¹ .
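As an illustrative sketch (numpy; the group list, shapes, and names are our assumptions, not the dissertation's code), the moments b and B can be accumulated group by group without ever forming INg ⊗ Ω⁻¹g explicitly:

```python
import numpy as np

def beta_full_conditional(groups, b0, B0inv):
    """Posterior moments (b, B) for beta.
    groups: iterable of (XS, Z, Omega), where XS = Xg @ Sg has shape
    (Ng*Jg, k) with rows stacked unit by unit, and Z is the Ng x Jg
    latent data matrix."""
    P = B0inv.copy()
    r = B0inv @ b0
    for XS, Z, Om in groups:
        Ng, Jg = Z.shape
        Oinv = np.linalg.inv(Om)
        XS3 = XS.reshape(Ng, Jg, -1)
        # sum_i XS_i' Oinv XS_i == XS'(I_Ng kron Oinv)XS, without the Kronecker
        P += np.einsum('ijk,jl,ilm->km', XS3, Oinv, XS3)
        # XS' vec(Oinv Z'): the rows of Z @ Oinv are (Oinv z_i)'
        r += XS.T @ (Z @ Oinv).reshape(-1)
    B = np.linalg.inv(P)
    return B @ r, B

# with one group, one equation, and unit variance this reduces to the
# textbook Bayesian linear-regression posterior
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
z = rng.standard_normal(50)
b, B = beta_full_conditional([(X, z.reshape(-1, 1), np.array([[1.0]]))],
                             np.zeros(3), np.eye(3))
```

A single multivariate normal draw β ∼ N(b, B) then completes Step 1 of the algorithm.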

2.2.2 Sampling Ωg

Let Rg = Qg + E′g Eg . Further let Rg22·1 = Rg22 − Rg21 Rg12 /Rg11 , where

Rg = [ Rg11   Rg12
       Rg21   Rg22 ] .

To simulate Ωg , first draw Ωg22·1 ∼ IW (Rg22·1 , νg + Ng ). Next, sample (Ωg21 | Ωg22·1 ) ∼
N (Rg21 R⁻¹g11 , Ωg22·1 R⁻¹g11 ). Recover Ωg using the inverse transform

Ωg = [ 1      Ωg12
       Ωg21   Ωg22·1 + Ωg21 Ωg12 ] .
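A sketch of this two-step draw (scipy/numpy; the function and variable names are ours, and the residual matrix Eg is treated as given):

```python
import numpy as np
from scipy.stats import invwishart

def draw_omega(E, Q, nu, rng):
    """One draw of Omega with Omega[0, 0] restricted to 1: draw the Schur
    complement from an inverse Wishart, then the off-diagonal block from
    its conditional normal, and reassemble via the inverse transform."""
    Ng, J = E.shape
    R = Q + E.T @ E
    R11, R21 = R[0, 0], R[1:, 0]
    R22_1 = R[1:, 1:] - np.outer(R21, R21) / R11
    Om22_1 = np.atleast_2d(invwishart.rvs(df=nu + Ng, scale=R22_1,
                                          random_state=rng))
    L = np.linalg.cholesky(Om22_1 / R11)          # conditional covariance factor
    Om21 = R21 / R11 + L @ rng.standard_normal(J - 1)
    Om = np.empty((J, J))
    Om[0, 0] = 1.0
    Om[0, 1:] = Om21
    Om[1:, 0] = Om21
    Om[1:, 1:] = Om22_1 + np.outer(Om21, Om21)    # inverse transform
    return Om

rng = np.random.default_rng(0)
E = rng.standard_normal((200, 4))
Om = draw_omega(E, Q=np.eye(4), nu=6, rng=rng)
```

Because the reassembled matrix has Schur complement Ωg22·1 > 0 and a unit (1,1) entry, every draw is a valid correlation-restricted covariance matrix.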

2.2.3 Sampling z

Let µigj·\j and Ωgj·\j denote the conditional mean and conditional variance of zij . The non-
missing latent data z is generated from truncated univariate conditional normal distributions:

1. For i ∈ N ,

i. Sample (zi1 |yi1 , ziN \ zi1 , β, ΩN ) ∼ T NBi1 (µiN 1·234 , ΩN 1·234 )

ii. Sample (zi2 |yi2 , ziN \ zi2 , β, ΩN ) ∼ T NBi2 (µiN 2·134 , ΩN 2·134 )

iii. Sample (zi3 |yi3 , ziN \ zi3 , β, ΩN ) ∼ T NBi3 (µiN 3·124 , ΩN 3·124 )

iv. Sample (zi4 |yi4 , ziN \ zi4 , β, ΩN ) ∼ T NBi4 (µiN 4·123 , ΩN 4·123 )

2. For i ∈ L,

i. Sample (zi1 |yi1 , ziL \ zi1 , β, ΩL ) ∼ T NBi1 (µiL1·5678 , ΩL1·5678 )

ii. Sample (zi5 |yi5 , ziL \ zi5 , β, ΩL ) ∼ T NBi5 (µiL5·1678 , ΩL5·1678 )

iii. Sample (zi6 |yi6 , ziL \ zi6 , β, ΩL ) ∼ T NBi6 (µiL6·1578 , ΩL6·1578 )

iv. Sample (zi7 |yi7 , ziL \ zi7 , β, ΩL ) ∼ T NBi7 (µiL7·1568 , ΩL7·1568 )

v. Sample (zi8 |yi8 , ziL \ zi8 , β, ΩL ) ∼ T NBi8 (µiL8·1567 , ΩL8·1567 )

Sampling the latent data using a triple for-loop is slow, especially when the sample size and
the number of equations are both large. To cut runtime, we vectorize out the middle for-
loop using the following multivariate truncated normal sampling technique. This technique
requires code that efficiently generates a column vector of inid (independent, not identically
distributed) univariate truncated normals. Our code uses a modified version of Robert (1995)
to sample from truncation regions that are deep in the tails of the normal distribution. Let
x ∼ T N[a,b] (µ, σ 2 ) denote a column vector of inid univariate truncated normal draws, where
a, b, µ, and σ 2 are lower bound, upper bound, mean, and variance vectors, respectively,
inid
such that xi ∼ T N[ai ,bi ] (µi , σi2 ).

Let Zgj denote the jth column of Zg . Let Zg\j denote Zg without its jth column. Define
Mgj and Mg\j accordingly. Let Ωgjj be the jjth element of Ωg , Ωg\jj be the jth column of
Ωg omitting the jth row, Ωgj\j be the jth row of Ωg omitting the jth column, and Ωg\j\j
be Ωg with the jth row and jth column omitted. Let Ωgjj·\j = Ωgjj − Ωgj\j Ω⁻¹g\j\j Ωg\jj . By

the property of the matrix normal distribution, we have that

(Zgj | Zg\j , β, Ωg ) ∼ MN Ng,1 ( Mgj + (Zg\j − Mg\j ) Ω⁻¹g\j\j Ωg\jj ,  INg Ωgjj·\j ),

which is a column vector of inid univariate normal distributions. Let Bgj be the set of
equation j truncation regions for all individuals in group g. The latent data zigj for all
sample units i can be drawn at once using

 
(Zgj | Ygj , Zg\j , β, Ωg ) ∼ TN Bgj ( Mgj + (Zg\j − Mg\j ) Ω⁻¹g\j\j Ωg\jj ,  ιNg Ωgjj·\j ),

where ιNg is the Ng × 1 vector of ones.
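A sketch of this column-wise update (numpy/scipy; names are ours). For simplicity, it draws the truncated normals by the inverse-CDF method rather than the Robert (1995) accept–reject scheme the chapter uses for regions deep in the tails:

```python
import numpy as np
from scipy.stats import norm

def draw_latent_column(Z, M, Omega, j, lo, hi, rng):
    """Redraw latent column j for all Ng units at once. Given Z_{-j},
    column j is inid normal with a common conditional variance;
    each entry is truncated to its own interval [lo_i, hi_i]."""
    keep = np.arange(Z.shape[1]) != j
    w = np.linalg.solve(Omega[np.ix_(keep, keep)], Omega[keep, j])
    mu = M[:, j] + (Z[:, keep] - M[:, keep]) @ w    # conditional means
    s = np.sqrt(Omega[j, j] - Omega[j, keep] @ w)   # common conditional s.d.
    a, b = norm.cdf((lo - mu) / s), norm.cdf((hi - mu) / s)
    Z[:, j] = mu + s * norm.ppf(a + (b - a) * rng.random(mu.shape[0]))
    return Z

rng = np.random.default_rng(0)
Omega = np.array([[1.0, 0.4, 0.2], [0.4, 1.5, 0.3], [0.2, 0.3, 2.0]])
Z = rng.standard_normal((1000, 3))
M = np.zeros((1000, 3))
lo, hi = np.zeros(1000), np.ones(1000)   # e.g. every unit has y_ij = 1
Z = draw_latent_column(Z, M, Omega, 1, lo, hi, rng)
```

Every draw in the redrawn column respects its unit-specific truncation region, so the middle for-loop over sample units is eliminated in one vectorized step.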

2.3 Cross Entropy Index for Land Use Imbalance

In this section, we introduce the cross entropy index for land use imbalance and recommend
it as an alternative to the entropy index for land use mix/balance. Along with density, land
use mix/balance is a popular urban landscape measure of accessibility to goods and services.
It is typically associated with the notion of entropy from information theory, which, for a
random variable X with finite support X and probability mass function p, is given by

H(p) = − Σx∈X p(x) ln p(x).        (2.10)

This formula can be traced back to Shannon and Weaver (1963) in the field of communication,
who relate their work to Boltzmann’s H-theorem from statistical mechanics. Cervero (1989)
was the first in the area of transportation studies to use Equation (2.10) to measure land use
integration. This was done by setting X to the set of all possible land uses in a given area
and p(x) to the proportion of land in said area devoted to use x ∈ X . The entropy index

H ∗ (p) = H(p)/ln (|X |) in popular use today is due to Kockelman (1997), who proposed
standardizing H to H ∗ so that H ∗ lies between 0 and 1, where H ∗ = 1 represents optimal
mix/balance.
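For concreteness, the standardized index is straightforward to compute (a small numpy sketch of Equation (2.10) scaled by ln |X |, with the usual 0 ln 0 = 0 convention; this is our illustration, not code from the dissertation):

```python
import numpy as np

def entropy_index(p):
    """H*(p) = -sum_x p(x) ln p(x) / ln|X|, with the 0 ln 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])) / np.log(p.size))
```

The index equals 1 at the uniform allocation and 0 when all land is in a single use; note also that it returns identical values for the 60/30/10 and 10/30/60 splits, which is the symmetry problem discussed later in this section.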

The entropy index, while popular, has three known issues: (i) it is not a valid measure of
land use integration; (ii) it prescribes the discrete uniform distribution, u(x) = 1/|X |, as the
optimal distribution of land uses over X ; and (iii) it is symmetric with respect to land use
types (Kockelman, 1997; Mitchell Hess et al., 2001; Song et al., 2013). To see why the
entropy index is not a valid measure of land use integration, consider two square cities, city
A and city B, whose land areas are allocated to residential and commercial uses as shown in
Figure 2.3. Clearly, city B has better land use mix than city A. The entropy index value is
1 for both cities, however, indicating that the two are equally (and also maximally) mixed.
Evidently, the entropy index is not a measure of land use integration. Some authors attempt
to rectify this problem by referring to the entropy index as a measure of land use balance
rather than land use mix. Unless there is good reason to believe that the optimal distribution
of land uses is the discrete uniform distribution, however, it is not a valid measure of balance
either.

Figure 2.3: Two Cities with an Entropy Index Value of 1
City A City B

To understand the land use symmetry problem, suppose that there are two other cities C
and D whose land areas are split 60%/30%/10% and 10%/30%/60%, respectively, between
commercial, residential, and recreational use, in that order. One would expect city C to
have better balance than city D; yet, they generate the same entropy index value. This is
because land uses are treated symmetrically, i.e., land use labels can be switched around
without affecting the value of the index.

Existing solutions to the aforementioned validity and land use symmetry problems involve
modifying the entropy index so that balance is measured relative to some reference distri-
bution; see, for example, Kockelman (1997) and Song et al. (2013). In the remainder of
this section, we demonstrate that if a reference distribution is available, the natural solu-
tion should be to abandon balance and entropy altogether and to instead measure land use
imbalance using cross entropy.

The cross entropy of probability mass function p relative to the reference distribution q is

D(p||q) = Σx∈X p(x) ln ( p(x)/q(x) ).        (2.11)

Also known as relative entropy and Kullback-Leibler (KL) divergence, cross entropy has
many uses in the fields of information theory and statistics. Its purpose here is to measure
land use imbalance. As before, let p(x) denote the proportion of land area devoted to
use x ∈ X . Let q(x) denote the reference or optimal counterpart to p(x). Then, Equation
(2.11) quantifies the discrepancy between the actual and ideal land use distributions. Perfect
balance is achieved when there is no imbalance, i.e., D(p||q) = 0, which occurs at p = q.
When p ≠ q, there is imbalance, and D(p||q) takes a positive value. To match the entropy
index H ∗ , we refer to D∗ (p||q) = D(p||q)/ln (|X |) as the cross entropy index.

We believe that the cross entropy index is the natural alternative to the entropy index
because the two can be linked elegantly in the following fashion: For any pmf q over X , we
have
X p (x) lnq (x)
D∗ (p||q) = − − H ∗ (p).
x∈X
ln (|X |)

Now suppose that q is the discrete uniform distribution, u, as prescribed by the entropy
index. Then, the entropy index H ∗ is a valid measure of balance, and the above expression
simplifies to

D∗ (p||u) = 1 − H ∗ (p). (2.12)

In other words, balance and imbalance sum to unity. Equation (2.12) shows that under the
uniform ideal q = u, the distribution p = u achieves perfect balance, H ∗ (p) = 1, and zero
imbalance, D∗ (p||u) = 0.
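The index and the identity in Equation (2.12) can be checked numerically (a numpy sketch of our own; it assumes q(x) > 0 wherever p(x) > 0):

```python
import numpy as np

def cross_entropy_index(p, q):
    """D*(p||q) = sum_x p(x) ln(p(x)/q(x)) / ln|X|, with 0 ln 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(p.size))

p = np.array([0.6, 0.3, 0.1])
u = np.full(3, 1.0 / 3.0)
H_star = float(-np.sum(p * np.log(p)) / np.log(3))   # entropy index of p
```

Against the uniform reference, the computed imbalance equals one minus the entropy index, exactly as Equation (2.12) states; and unlike the entropy index, the measure is not symmetric in the land use labels once a non-uniform reference is used.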

Figure 2.4: Comparison of Reference Distributions

The cross entropy index requires a choice for q. The go-to reference distribution in the liter-
ature is the land use distribution for the greater region, such as that of the state/prefecture
or country under study. Note that the reference distribution may vary from one place to
another; for example, reference distributions can be region-specific to incorporate regional
variation in the optimal land use distribution. For our application, we set the reference dis-
tribution to the country-level distribution. Figure 2.4 shows the reference distributions for
the cross entropy index (left) and the entropy index (right). We proxy land use proportions
with the proportion of the working population in twenty industries (see the next section for
a list of these industries). A scatterplot comparing the cross entropy index D∗ to the entropy
index H ∗ for every city in Japan is given in Figure 2.5. Although we can visually verify that
the two are inversely related, there is substantial variability around this relationship.

Figure 2.5: Entropy Index vs. Cross Entropy Index

2.4 Data

To study the transportation habits of the Japanese elderly, we collect individual and household-
level data from the 5th Nationwide Person Trip Survey (NPTS), which was carried out by
the Ministry of Land, Infrastructure, Transport and Tourism (MLIT) in 2010. The study
surveyed 35,000 households from 70 cities (see Table 2.1 for a list of surveyed cities). Mem-
bers from participating households kept a travel diary for one “typical” day, either a midweek
day (Tuesday, Wednesday, Thursday) or a Sunday in a two-day weekend, in either October
or November (MLIT, 2012). The travel diary and household survey can be found in MLIT
(2012).

We restrict our sample to participants between the ages of 65 and 100, which yields a sample
of size n = 25,743. Data from the 4th NPTS was used to study the relationship between
the built environment and non-work trip frequency in Parady et al. (2015). City-level built
environment measures are constructed using national census and geographic information

system (GIS) data.

Table 2.1: Cities Surveyed in the 5th Nationwide Person Trip Survey

Three Major Metropolitan Areas
  Central        Saitama, Chiba, Tokyo, Yokohama, Kawasaki, Nagoya, Kyoto, Osaka, Kobe
  Peripheral 1   Toride, Tokorozawa, Matsudo, Inagi, Sakai, Toyonaka, Nara
  Peripheral 2   Ome, Odawara, Gifu, Toyohashi, Kasugai, Tsushima, Tokai, Yokkaichi,
                 Kameyama, Omihachiman, Uji, Izumisano, Akashi

Regional Urban Area I (Central City Pop. ≥ 1M)
  Central        Sapporo, Sendai, Hiroshima, Kitakyushu, Fukuoka
  Peripheral     Otaru, Chitose, Shiogama, Kure, Otake, Dazaifu

Regional Urban Area II (Central City Pop. ≥ 400K)
  Central        Utsunomiya, Kanazawa, Shizuoka, Matsuyama, Kumamoto, Kagoshima
  Peripheral     Oyabe, Komatsu, Iwata, Soja, Isahaya, Usuki

Regional Urban Area III (Central City Pop. < 400K)
  Central        Hirosaki, Morioka, Koriyama, Matsue, Tokushima, Kochi
  Peripheral     Takasaki, Yamanashi, Kainan, Yasugi, Nangoku, Urasoe

Regional Area    Yuzawa, Ina, Joetsu, Nagato, Imabari, Hitoyoshi

Adapted from MLIT (2007).
2.4.1 Dependent Variables

Dependent variables include a binary indicator for whether or not a person is licensed and
three/four ordinal mode usage measures. Mode usage measures are generated from the
travel diary portion of the NPTS. Available transportation modes are walking/biking, public
transit, getting a ride, and for those who can do so legally, driving. Since a trip may involve
multiple modes of travel, we follow MLIT’s reporting guidelines in prioritizing public transit
over the use of a private vehicle, which in turn has priority over nonmotorized travel (MLIT,
2012).

Whereas Parady et al. (2015) uses count models to measure usage, this paper uses a three-
bin ordinal scale, with zero representing no trips, one representing one or two trips, and
two representing three or more trips. We do this for several reasons. First, multivariate
ordinal data models are more flexible than traditional multivariate count data models. The
former can accommodate positive correlations, negative correlations, over-dispersion, and
under-dispersion. In comparison, the multivariate Poisson model requires correlations to be
positive, and the multivariate Poisson-lognormal model, while capable of handling negative
correlations, assumes that the data is overdispersed (Jeliazkov et al., 2008). Second, because
a typical day’s worth of travelling is split between three to four transportation mode cate-
gories, the vast majority of trip counts are zero, and the non-zero counts tend to be small.
Count data models are not suited to deal with this kind of data. Lastly, as mentioned in
section 2.1, the three-bin ordinal scale is convenient for the purposes of model specification
and estimation.
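The mapping from daily trip counts to the three-bin ordinal scale is simple; as a small numpy illustration of our own (not survey-processing code from the dissertation):

```python
import numpy as np

def to_ordinal(counts):
    """Map trip counts to {0: no trips, 1: one or two trips, 2: three or more}."""
    return np.digitize(counts, bins=[1, 3])

coded = to_ordinal(np.array([0, 1, 2, 3, 7]))
```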

Table 2.2: Usage by Travel Mode and License Status

Walking/Biking

No Trips 1 or 2 Trips 3+ Trips Total

Nonlicensed 7,994 (31.05%) 2,089 (8.11%) 788 (3.06%) 10,871 (42.23%)


Licensed 12,220 (47.47%) 1,997 (7.76%) 655 (2.54%) 14,872 (57.77%)

Total 20,214 (78.52%) 4,086 (15.87%) 1,443 (5.61%) 25,743 (100%)

Transit

No Trips 1 or 2 Trips 3+ Trips Total

Nonlicensed 10,189 (39.58%) 599 (2.33%) 83 (0.32%) 10,871 (42.23%)


Licensed 14,312 (55.60%) 491 (1.91%) 69 (0.27%) 14,872 (57.77%)

Total 24,501 (95.18%) 1,090 (4.23%) 152 (0.59%) 25,743 (100%)

Getting a Ride

No Trips 1 or 2 Trips 3+ Trips Total

Nonlicensed 8,558 (33.24%) 1,624 (6.31%) 689 (2.68%) 10,871 (42.23%)


Licensed 13,187 (51.23%) 1,214 (4.72%) 471 (1.83%) 14,872 (57.77%)

Total 21,745 (84.47%) 2,838 (11.02%) 1,160 (4.51%) 25,743 (100%)

Drive

No Trips 1 or 2 Trips 3+ Trips Total

Nonlicensed 10,871 (42.23%) 0 (0.00%) 0 (0.00%) 10,871 (42.23%)


Licensed 6,662 (25.88%) 4,614 (17.92%) 3,596 (13.97%) 14,872 (57.77%)

Total 17,533 (68.11%) 4,614 (17.92%) 3,596 (13.97%) 25,743 (100%)

Table 2.2 tabulates mode usage by travel mode and license status, and offers two important
insights: First, nearly half (42.23%) of the sample cannot drive, yet driving is the most
popular mode option (it has the smallest no trip count). Second, those who can drive report
less usage on all non-driving modes than those who cannot drive. This suggests that driving
is a substitute for all other modes. Since mode usage is modeled jointly, any substitution
effects unaccounted for by the included covariates are captured by the covariance matrix.

2.4.2 Independent Variables

Individual and household-level controls are generated from the household survey portion
of the NPTS. These are age, sex, employment status, household size, homeownership sta-
tus, vehicle count, and an indicator for the household having a bicycle. An indicator for
weekday is also included in the mode use equations. Unlike the National Household Travel
Survey (NHTS), the NPTS does not collect information on household income and household
members’ education levels.

Built environment features considered here are population density, land use imbalance, and
accessibility to public transit. Measures for the first two are constructed using data from
the 2010 national census. The proportion of land devoted to different uses is proxied by
the proportion of the working population in twenty industries. These industries are: (1)
agriculture and forestry; (2) fisheries; (3) mining and quarrying of stone and gravel; (4)
construction; (5) manufacturing; (6) electricity, gas, heat supply and water; (7) information
and communications; (8) transport and postal activities; (9) wholesale and retail trade; (10)
finance and insurance; (11) real estate and goods rental and leasing; (12) scientific research,
professional and technical services; (13) accommodations, eating and drinking services; (14)
living-related and personal services and amusement services; (15) education, learning sup-
port; (16) medical, health care and welfare; (17) compound services; (18) services, N.E.C.;

(19) public services; and (20) miscellaneous.

Table 2.3: Summary Statistics

Variable Mean S.D. Min Max

Individual-level variables

Age 73.93 7.04 65 100


Male 0.51 — — —
Unemployed 0.58 — — —
Weekday 0.48 — — —

Household-level variables

Household Size 2.75 1.35 1 7


Homeowner 0.94 — — —
Vehicle Count 1.58 1.12 0 5
Bicycle 0.64 — — —

City-level variables

Population Density (per m2 ) 1.29 1.55 0.06 7.90


Land Use Imbalance 0.08 0.06 0.02 0.26
Bus Stops (per km2 ) 1.83 1.38 0.26 7.27
Train Stations (per km2 × 10) 0.85 0.93 0 4.24

Accessibility to transit is measured as the number of bus stops and train stations in the
city normalized by city area. Data on transit facilities can be found on the National Land
Numerical Information Download Service website. Summary statistics for our independent
variables are reported in Table 2.3.

2.5 Results

Regression results are based on an uninformative prior that uses the following hyperparam-
eters: b0 = 0 × ι, B0 = 100 × I, Qg = I for g ∈ {N , L}, νN = 6, and νL = 7. The MCMC
chain ran for 10,000 iterations after a burn-in of 2,000 draws. Priors and posteriors are given
in the Appendix.

2.5.1 Results for β

Posterior means and standard deviations for the coefficient vector β are given in Table 2.4.
The coefficient estimates themselves are neither interpretable nor comparable with other
estimates due to the discrete/non-linear nature of the data. Effects that are decisively
positive or negative (posterior sign probabilities of at least 0.975) are indicated by a star (*).
We comment on the direction of effects here and on policy relevance in the next subsection.

For the elderly who cannot drive, density is positively associated with transit use and neg-
atively associated with getting rides. For the elderly who can drive, densification leads to
more walking and biking, more transit use, and less driving. Private vehicle use in general
goes down (riding for those who cannot drive and driving for those who can), and the use
of other transport modes go up. This suggests that the effect of density on mode choice and
usage is facilitated by a substitution mechanism that involves the use of a private vehicle.

As for land use balance, we find that better balance (less imbalance) leads to more nonmo-
torized travel for both groups and also more transit use by the licensed group. In contrast
with density, we find no evidence suggesting that substitution effects are at play here.

Bus accessibility is found to have either no effect or an indeterminate effect everywhere. In
comparison, increased train accessibility leads to less driving by license holders and fewer

license holders overall.

Table 2.4: Posterior Means and Standard Deviations for β

                                  Nonlicensed                                           Licensed
                License   Walking/Biking  Transit   Getting a Ride   Walking/Biking  Transit   Getting a Ride  Driving
Age             -0.105*   -0.056*         -0.045*   -0.045*          -0.011*         -0.027*   -0.009          -0.014*
                (0.002)   (0.004)         (0.008)   (0.005)          (0.005)         (0.006)   (0.006)         (0.003)
Male             1.885*    0.563*          0.441*   -0.047           -0.058           0.090    -0.902*          0.354*
                (0.023)   (0.075)         (0.149)   (0.100)          (0.076)         (0.089)   (0.097)         (0.043)
Unemployed      -0.178*   -0.205*         -0.245*   -0.184*           0.099*         -0.178*   -0.062          -0.144*
                (0.022)   (0.032)         (0.048)   (0.039)          (0.031)         (0.048)   (0.042)         (0.023)
Weekday            —       0.271*          0.155*   -0.020            0.026           0.108*   -0.221*          0.329*
                          (0.029)         (0.041)   (0.035)          (0.029)         (0.044)   (0.040)         (0.022)
Household Size  -0.469*   -0.063*         -0.012    -0.171*           0.045*          0.048    -0.004          -0.085*
                (0.011)   (0.021)         (0.038)   (0.027)          (0.022)         (0.029)   (0.029)         (0.014)
Homeowner        0.348*    0.111*          0.187*    0.445*           0.020           0.067     0.059           0.027
                (0.046)   (0.056)         (0.078)   (0.077)          (0.072)         (0.101)   (0.100)         (0.054)
Vehicle Count    0.809*   -0.014          -0.204*    0.306*          -0.208*         -0.274*    0.001           0.106*
                (0.015)   (0.033)         (0.063)   (0.041)          (0.031)         (0.041)   (0.041)         (0.018)
Bicycle         -0.262*    0.402*         -0.096    -0.172*           0.375*         -0.072     0.025           0.013
                (0.023)   (0.035)         (0.050)   (0.040)          (0.033)         (0.048)   (0.042)         (0.023)
Density          0.025     0.019           0.074*   -0.086*           0.052*          0.072*   -0.001          -0.042*
                (0.014)   (0.020)         (0.025)   (0.026)          (0.020)         (0.026)   (0.029)         (0.016)
LUI             -0.143    -1.602*          0.333     0.524           -1.267*         -2.473*   -0.528           0.154
                (0.193)   (0.276)         (0.406)   (0.316)          (0.278)         (0.540)   (0.380)         (0.200)
Bus Stops        0.012    -0.003           0.021     0.026           -0.021           0.046     0.000          -0.021
                (0.013)   (0.018)         (0.023)   (0.025)          (0.019)         (0.025)   (0.028)         (0.015)
Train Stations  -0.066*   -0.008          -0.006     0.008            0.041           0.097*    0.006          -0.048*
                (0.021)   (0.030)         (0.041)   (0.037)          (0.029)         (0.041)   (0.040)         (0.022)

Star (*) indicates posterior sign probabilities that are > 0.975.
2.5.2 Policy Simulation Results

Regression tables for generalized linear models (GLMs) are useful in that they convey infor-
mation about statistical significance and the direction of effects. They are not in themselves
useful for policymaking purposes, however, as they provide no information on effect sizes
(Brownstone, 2008). In this section, we discuss two popular covariate effect estimation
methods, the partial effect at the average (PEA) and the average partial effect (APE). We
then describe a third estimation method that overcomes the shortcomings of the first two.
The third method is used to determine whether urban form effect sizes are sufficiently large
to be considered policy relevant.

To motivate the discussion, suppose that there is an exogenous urban form shock S such that
S : xi^pre → xi^post (i.e., S transforms the pre-shock covariate vector xi^pre into its post-shock
form xi^post ) for all i. Note that S can represent the built environment changing in numerous
ways at once (e.g., population density doubling and land use imbalance halving at the same
time). Further suppose we want to estimate the overall effect that S has on the probability
of being licensed (yi1 = 1). Given β1 , the effect of S for subject i is

(Effecti | β1 ) = P (yi1 = 1 | β1 , xi1^post ) − P (yi1 = 1 | β1 , xi1^pre ) = Φ(xi1^post′ β1 ) − Φ(xi1^pre′ β1 ).        (2.13)

Note that, because the effect of S depends on the covariates, it varies from person to person.

Let x̄1^pre and x̄1^post denote the sample means of xi1^pre and xi1^post , respectively, and let β̂1 be
a point estimate for β1 . One way to find an overall effect based on Equation (2.13) is to
compute the partial effect at the average:

Effect-hat = Φ(x̄1^post′ β̂1 ) − Φ(x̄1^pre′ β̂1 ).

The problem with finding the effect at the average is that the average covariate vector
typically corresponds to an uninteresting, unrepresentative, and/or nonexistent individual.
A better estimator for the overall effect is the average partial effect, which takes the average
of individual effects:

Effect-hat = (1/N ) Σi=1..N [ Φ(xi1^post′ β̂1 ) − Φ(xi1^pre′ β̂1 ) ].

Neither the PEA nor the APE accounts for estimation uncertainty, however, and this can yield mis-
leading effect estimates (Jeliazkov and Vossmeyer, 2016). Moreover, statistical software
packages typically do not automatically give confidence intervals for the APE and PEA
estimators, and policy relevance cannot be judged on point estimates alone.

To account for both data variability and estimation uncertainty, we follow Fang (2008) and
Brownstone and Fang (2014) in examining the posterior distribution of the average treatment
effect (ATE), where the average is taken with respect to the sample:

(Effect | β1 ) = (1/N ) Σi=1..N (Effecti | β1 ) = (1/N ) Σi=1..N [ Φ(xi1^post′ β1 ) − Φ(xi1^pre′ β1 ) ].

The ATE formula is similar to that of APE, the difference being that the former is evaluated
at every β1 drawn from the posterior distribution π(β1 |y). The non-Bayesian analog to this
is to take draws of β̂1 from its sampling distribution and plug them into the APE formula.
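In practice this amounts to pushing every stored posterior draw of β1 through the averaged-effect formula; a small numpy/scipy sketch (variable names and shapes are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def ate_posterior(beta_draws, X_pre, X_post):
    """Posterior sample of the ATE on P(y = 1): for each stored draw of
    beta, average Phi(x_i_post' beta) - Phi(x_i_pre' beta) over the sample.
    beta_draws: (n_draws, k); X_pre, X_post: (n_units, k)."""
    diffs = norm.cdf(X_post @ beta_draws.T) - norm.cdf(X_pre @ beta_draws.T)
    return diffs.mean(axis=0)          # one ATE value per posterior draw

# toy check: a single covariate moving from 0 to 1 under two draws of beta
ate = ate_posterior(np.array([[1.0], [0.0]]),
                    np.zeros((10, 1)), np.ones((10, 1)))
```

Reporting the mean and quantiles of the resulting vector gives a point estimate of the effect together with a credibility interval, which is what policy relevance is judged on.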

Extending the ATE approach to the multivariate case requires only straightforward modi-
fications. For example, suppose we are interested in studying the effect of S on the joint
probability of not having a driver’s license (yi1 = 0) and taking zero nonmotorized trips
(yi2 = 0). Then, given (β1 , β2 , Ω12 , Ω22 ), the effect of S for subject i is

(Effecti | β1 , β2 , Ω12 , Ω22 )

= P (yi1 = yi2 = 0 | β1 , β2 , Ω12 , Ω22 , xi1^post , xi2^post ) − P (yi1 = yi2 = 0 | β1 , β2 , Ω12 , Ω22 , xi1^pre , xi2^pre )

= Φ2 ( (0, 0)′ ; (xi1^post′ β1 , xi2^post′ β2 )′ , [ 1 Ω12 ; Ω21 Ω22 ] ) − Φ2 ( (0, 0)′ ; (xi1^pre′ β1 , xi2^pre′ β2 )′ , [ 1 Ω12 ; Ω21 Ω22 ] ),

where Φ2 ( · ; µ, Σ ) denotes the bivariate normal cdf with mean µ and covariance Σ evaluated at its first argument.

We can obtain the posterior distribution for the ATE by plugging draws of (β1 , β2 , Ω12 , Ω22 )
from its posterior π(β1 , β2 , Ω12 , Ω22 |y) into the expression above.
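The bivariate case only swaps the univariate normal cdf for a bivariate one; a sketch for a single unit using scipy's multivariate normal cdf (function and argument names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def joint_zero_effect(b1, b2, Om12, Om22, x1_pre, x2_pre, x1_post, x2_post):
    """Change in P(y1 = 0, y2 = 0) for one unit: bivariate normal cdf at
    (0, 0) with mean (x1'b1, x2'b2) and covariance [[1, Om12], [Om12, Om22]],
    post minus pre."""
    S = [[1.0, Om12], [Om12, Om22]]
    pr = lambda m1, m2: multivariate_normal.cdf([0.0, 0.0], mean=[m1, m2], cov=S)
    return pr(x1_post @ b1, x2_post @ b2) - pr(x1_pre @ b1, x2_pre @ b2)

eff = joint_zero_effect(np.array([0.5]), np.array([-0.3]), 0.0, 1.0,
                        np.array([0.0]), np.array([0.0]),
                        np.array([1.0]), np.array([1.0]))
```

With Ω12 = 0, the joint probability factorizes into a product of univariate normal probabilities, which provides a convenient check on the computation.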

The primary disadvantage with the method we employ is that it is computationally demand-
ing relative to the PEA and APE methods. To cut code runtime, we use a thinned posterior
sample of the model parameters.

Table 2.5 reports probability changes following the simultaneous doubling of population
density and the halving of land use imbalance. Effect sizes are estimated precisely and found
to be very small. This suggests that urban planning measures have virtually no effect on
the transportation habits of the Japanese elderly. Our results and conclusions are
broadly consistent with those based on the United States (Bento et al., 2005; Fang, 2008;
Ewing and Cervero, 2010; Brownstone and Fang, 2014).

Table 2.5: Effect of Simultaneously Doubling Density and Halving Land Use Imbalance

Walking/Biking No Trips 1 or 2 Trips 3+ Trips

Nonlicensed -0.015 (0.004) 0.003 (0.002) 0.003 (0.002)


Licensed -0.009 (0.005) 0.011 (0.003) 0.007 (0.002)

Transit

Nonlicensed -0.013 (0.005) 0.004 (0.002) 0.001 (0.001)


Licensed -0.003 (0.005) 0.008 (0.002) 0.003 (0.001)

Getting a Ride

Nonlicensed 0.004 (0.004) -0.007 (0.002) -0.005 (0.001)


Licensed 0.006 (0.005) 0.002 (0.002) 0.001 (0.001)

Driving

Nonlicensed — — —
Licensed 0.015 (0.005) -0.001 (0.002) -0.006 (0.003)
Posterior standard deviations given in parentheses.

2.5.3 Results for Ω

A benefit of modeling mode usage jointly rather than separately is that substitution effects
not accounted for by the covariates in the model appear in Ω. To study how modes relate
to one another through the error term, we find the percentage of posterior draws that are
negative (indicating substitution) for select covariance terms. These are reported in Table
2.6. Our findings are as follows: Controlling for covariates, nonmotorized travel and getting
rides are viewed as substitutes. The elderly who drive view transit as complementary to
both nonmotorized travel and getting rides. Driving is also viewed as a substitute for all
other mode options even after accounting for the included covariates.

Table 2.6: Posterior Probability of Ωjk < 0 in Ω

              Nonlicensed                                 Licensed
              Transit   Ride                              Transit   Ride    Drive
Nonmotorized    38%     100%              Nonmotorized      0%      98%     100%
Transit                  26%              Transit                    0%     100%
                                          Ride                              100%

2.6 Conclusion

This paper presents an econometric framework to study the effects of the built environment
on transportation mode choice and usage when a large fraction of the population under
study is nonlicensed. We use a multivariate ordinal outcomes model with a binary selection
mechanism to allow for both heterogeneous and indirect urban form effects. Emphasis was
placed on the joint modeling of correlated discrete outcomes, strategizing with identification
restrictions and the lack of identification, and the efficient estimation of model parameters.
We also discussed the computationally efficient sampling of truncated multivariate normal
latent data.

This paper introduces the cross entropy index for land use imbalance and recommends it
as an alternative to the entropy index for land use mix/balance. Whereas entropy imposes
the assumption that the optimal land use distribution is given by the uniform distribution,
cross entropy gives the researcher the ability to choose a reference distribution. We show
that the two indices are connected in that if the uniform distribution is optimal, then the
entropy index is a valid measure of balance, and furthermore balance and imbalance sum to
unity. The topic of reference distribution selection is not pursued here and is open to future
research.

We apply our model to a sample of Japanese elderly to investigate whether urban planning
tools can be used to improve traffic safety conditions in Japan. Although we are successful
at identifying heterogeneous and indirect urban form effects, ultimately we find that built
environment effects are too small to warrant attention from policymakers. Our conclusion is
similar to those based on the United States, where built environment effects are reportedly
nonzero but economically irrelevant.

Chapter 3

An Efficient Gibbs Procedure for the


Binary Mixed Logit Model

This paper presents an efficient Gibbs sampling algorithm (Geman and Geman, 1984; Gelfand
et al., 1990) for the Bayesian estimation of the binary mixed logit model. Unlike Metropolis-
Hastings (MH) methods, the proposed procedure is fully automatic: it requires no tuning,
no choice of proposal density, no potentially time-consuming optimization during sampling,
no Markov chain Monte Carlo (MCMC) subroutines, no monitoring of acceptance rates,
and so on.

To set up the model in question, let yit ∈ {0, 1} be the binary outcome of interest, where
the indices i and t denote unit (e.g., an individual, household, firm) and time. Let x̃it and
wit represent two covariate vectors that contain two disjoint sets of covariates. The binary
mixed logit model is given by

yit = 1{x̃′it δ + w′it di + εit > 0},

where 1{·} is the indicator function that takes the value of one when its argument is true
and zero otherwise, δ is a vector of fixed effects, di is a vector of random effects, and
εit ∼ Logistic(0, 1) is an iid logistic error term which has a variance of π 2 /3. Whereas the
presence of random effects induces correlation within each unit (also known as cluster), units
are modeled as independent of one another.

A major hurdle in the Gibbs estimation of logit models is the simulation of the scale
variable. Chen and Dey (1998) suggested the use of an MH step, but this means that the
overall sampler is also MH. Holmes et al. (2006) provide pseudo-code for generating the scale
variable using rejection sampling. This paper proposes the use of a griddy Gibbs sampler,
where the grid is made arbitrarily fine so that discretization has virtually no effect on
estimation.

In designing the MCMC procedure, we emphasize both computational and mixing efficiency.
The former is achieved by leveraging sparse matrix algorithms and vectorizing the sampling
of certain quantities. Mixing efficiency is attained by the judicious collapsing of the Gibbs
sampler (Liu, 1994). This paper also provides details on the estimation of the marginal
likelihood, which can be used to compare and select a model.

The remainder of the paper is organized as follows. The binary mixed logit model is discussed
in further detail in Section 3.1, and its estimation algorithm is covered in Section 3.2. Section
3.3 provides details on computing the marginal likelihood. Section 3.4 presents a simulation
study, and Section 3.5 concludes. Select derivations are given in the appendix.

3.1 Model

3.1.1 Complete-Data Likelihood

Following Albert and Chib (1993), the binary mixed logit model is expressed in terms of the
latent (unobserved) variable zit as

zit = x̃′it δ + w′it di + εit .

The observed datum yit is obtained from zit through the binary link function yit = 1{zit > 0}.
We make the common assumption that random effects are normally distributed. That is, we
assume that (di |d, D) ∼ N (d, D), where D is the heterogeneity matrix. Notice that, even
though the random effect di is specific to a unit, the mean of the random effects, d, is not.
A convenient reformulation of the model that makes use of this observation is

zit = x′it β + w′it bi + εit ,     (3.1)

where xit = [w′it x̃′it ]′ is a covariate vector, β = [d′ δ′]′ is a vector of common effects,
and (bi |D) ∼ N (0p×1 , D) captures unit-specific deviations from the mean effect d, so that
bi = di − d.

The recognition that logistic errors have a scale mixture of normals representation is key
in the Gibbs analysis of the logit model. In particular, εit ∼ Logistic(0, 1) arises from the
hierarchical model κit ∼ KS and (εit |κit ) ∼ N (0, 4κ²it ), where KS denotes the Kolmogorov-
Smirnov distribution with density fKS (κ) = 8κ Σ_{j=1}^{∞} (−1)^{j+1} j² exp(−2j²κ²)
(Andrews and Mallows, 1974; Poirier, 1978; Jeliazkov and Rahman, 2012).
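As a numerical sanity check (not part of the dissertation's algorithm), the sketch below evaluates the truncated KS series and integrates the normal-KS mixture over a grid; the result recovers the standard logistic density. The grid bounds and truncation length are ad hoc choices that keep the series numerically stable.

```python
import numpy as np

def f_ks(kappa, terms=100):
    """Truncated series for the Kolmogorov-Smirnov density,
    f_KS(k) = 8k * sum_{j>=1} (-1)^(j+1) j^2 exp(-2 j^2 k^2)."""
    kappa = np.asarray(kappa, dtype=float)
    j = np.arange(1, terms + 1)[:, None]
    terms_j = (-1.0) ** (j + 1) * j**2 * np.exp(-2.0 * j**2 * kappa**2)
    return 8.0 * kappa * terms_j.sum(axis=0)

def f_logistic(e):
    """Standard logistic density (variance pi^2 / 3)."""
    return np.exp(-e) / (1.0 + np.exp(-e)) ** 2

def mixture_density(e, grid):
    """Integrate f_N(e | 0, 4 k^2) f_KS(k) dk by the trapezoid rule."""
    var = 4.0 * grid**2
    integrand = np.exp(-e**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var) * f_ks(grid)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))

# Grid starts away from 0, where f_KS is numerically negligible anyway.
grid = np.linspace(0.15, 3.5, 4000)
for e in (0.0, 1.0, 2.5):
    print(e, mixture_density(e, grid), f_logistic(e))  # the two columns agree
```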

For notational convenience, zit is stacked over t and then over i. Let

zi = [zi1 · · · ziT ]′ ,   Xi = [xi1 · · · xiT ]′ ,   Wi = [wi1 · · · wiT ]′ ,   and   εi = [εi1 · · · εiT ]′ ,

and let Ki = 4 diag(κ²i1 , . . . , κ²iT ). Stacking Equation (3.1) over t ∈ {1, . . . , T } yields

zi = Xi β + Wi bi + εi , (3.2)

where (εi |κi ) ∼ N (0T ×1 , Ki ) and κi = {κi1 , . . . , κiT }. Let

z = [z′1 · · · z′n ]′ ,   X = [X′1 · · · X′n ]′ ,   W = blockdiag(W1 , . . . , Wn ),   b = [b′1 · · · b′n ]′ ,   ε = [ε′1 · · · ε′n ]′ ,

and K = blockdiag(K1 , . . . , Kn ). In the foregoing, W is block-diagonal, whereas K is
diagonal (since each Ki is diagonal).

Stack Equation (3.2) over i ∈ {1, . . . , n} to get

z = Xβ + W b + ε, (3.3)

where (b|D) ∼ N (0np×1 , In ⊗ D), (ε|κ) ∼ N (0nT ×1 , K), κ = {κ1 , . . . , κn }, and “⊗” is the
Kronecker product.
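As a rough illustration of this stacked representation, the sketch below (hypothetical dimensions and simulated inputs) assembles W and K as sparse block-diagonal matrices using scipy.sparse; it is this structure that makes the later computations cheap.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, T, k, p = 4, 3, 5, 2  # illustrative dimensions: n clusters, T periods

Xi_list = [rng.standard_normal((T, k)) for _ in range(n)]
Wi_list = [Xi[:, :p] for Xi in Xi_list]  # w_it: the first p columns of x_it
Ki_list = [sp.diags(4.0 * rng.uniform(0.5, 2.0, T) ** 2) for _ in range(n)]

X = np.vstack(Xi_list)                    # (nT x k), dense: no special structure
W = sp.block_diag(Wi_list, format="csr")  # (nT x np), sparse block-diagonal
K = sp.block_diag(Ki_list, format="csr")  # (nT x nT), diagonal

print(X.shape, W.shape, K.shape)  # (12, 5) (12, 8) (12, 12)
```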

The process of stacking zit reveals how identification works in this model. First consider
Equation (3.2), where zit is stacked over t for some i. At this (cluster) level, information
on β and bi come from the same source, namely intracluster variation. We can make this
observation by noticing that Xi and Wi have some of the same columns. It follows that
β and bi are not separately identified within a single likelihood contribution. Now consider Equation (3.3). Here, β and b
are identified because the former draws from two sources of variation (intracluster and in-
tercluster variation), whereas the latter only draws from one (intracluster variation). The
identifiability of the overall likelihood is reflected in X and W, which have no columns in
common (since the latter is block-diagonal whereas the former is not). We can also infer
that bi draws information solely from the i-th cluster based on the block-diagonal structure
of W.

Let yi and y be the observed counterparts to zi and z, respectively. The complete-data


likelihood, augmented with the complete set of auxiliary variables {z, b, κ}, is

f (y, z, b, κ|β, D) = f (y|z)f (z|β, b, κ)f (b|D)f (κ),

where

f (y|z) = ∏_{i=1}^{n} 1{zi ∈ Bi } = ∏_{i=1}^{n} ∏_{t=1}^{T} 1{zit ∈ Bit }   and   f (κ) = ∏_{i=1}^{n} ∏_{t=1}^{T} fKS (κit ).

The region of truncation Bi is given by Bi = Bi1 × · · · × BiT , where

Bit = (−∞, 0] if yit = 0,   and   Bit = (0, ∞) if yit = 1.

3.1.2 Prior

The model is completed by specifying (semiconjugate) prior distributions over the model
parameters β and D. It is assumed that β and D are a priori independent and given by
β ∼ N (β0 , B0 ) and D ∼ IW (R0 , r0 ).

3.2 Estimation

Combining the complete-data likelihood with the prior yields the posterior

π(z, β, b, D, κ|y) ∝ f (y|z)f (z|β, b, κ)f (b|D)f (κ)π(β)π(D),

which we simulate via Gibbs sampling. The sampling algorithm is summarized below.
Parameter blocks and latent data blocks that do not factor into a posterior conditional
distribution are suppressed in the notation. The notation "\" means "except," so that, for
example, zi \ zit is zi with zit omitted.

Gibbs Estimation Algorithm for the Binary Mixed Logit Model

1. Sample zit from π(zit |y, β, D, κ, zi \ zit ), which is marginal of b, for all i and t.

2. Sample {β, b} from π(β, b|y, z, D, κ), which is done in two steps:

(a) Sample β, marginally of b, from π(β|y, z, D, κ).


(b) Sample b, conditionally on β, from π(b|y, z, β, D, κ).

3. Sample D from π(D|y, b).

4. Sample κit from π(κit |y, z, β, b, D, κ \ κit ) for all i and t.

Details of the sampler are given in the subsections, with select derivations presented in the
appendix.

A standard Gibbs algorithm draws z, β, and b conditionally on one another. While
technically correct, the Markov chain that underlies this algorithm mixes slowly. To resolve
this problem, we follow Holmes et al. (2006) in collapsing the Gibbs sampler (Liu, 1994),
and Chib and Carlin (1999) in blocking the fixed and random effects.

We collapse the standard Gibbs algorithm by drawing z marginally of b. Collapsing typically


improves mixing efficiency at some cost to computational efficiency. The proposed sampler is
no exception. In particular, the marginalization of b leads to intracluster correlation among
the elements in z. Whereas the elements of z can be generated in parallel in a standard
Gibbs sampler, the collapsed sampler must generate them sequentially, at least within each
cluster. This paper modifies the methodology of Holmes et al. (2006) to generate z in a
computationally efficient manner. The proposed sampling scheme generates zit sequentially
within each cluster but in parallel across clusters.

By “blocking the fixed and random effects,” we mean that β and b are sampled as a sin-
gle Gibbs block. This is achieved by first sampling β marginally of b, then sampling b
conditionally on β.

3.2.1 Sampling z

The posterior conditional of zi marginal of bi is the multivariate truncated normal
distribution

(zi |y, β, D, κ) ∼ T N_{Bi} (zi |Xi β, Ki + Wi DW′i ).

A draw from this distribution is obtained by drawing from its full conditionals, which are of
the form

(zit |y, β, D, κ, zi \ zit ) ∼ T N_{Bit} (zit |mit , vit ),   t ∈ {1, . . . , T }

(Geweke, 1991). The full conditional densities, which are univariate truncated normals, can
be sampled efficiently using Robert (1995). Following Holmes et al. (2006), the parameters
mit and vit are computed efficiently as

mit = (x′it β + w′it b̂i )/(1 − hit ) − hit zit /(1 − hit )   and   vit = 4κ²it /(1 − hit ),

where hit is the t-th element of hi ≡ Ki^{-1} diag(Wi D̂i W′i ), zit takes its current value,
and b̂i and D̂i are the mean vector and covariance matrix of the posterior full conditional
of bi (see Subsection 3.3.2).

During the sampling process, b̂i = b̂i (zi ) has to be updated after a new value for zit is drawn,
before moving on to zi,t+1 . An efficient updating rule is

b̂i^{new} = b̂i^{old} + D̂i W′i Ki^{-1} (zi^{new} − zi^{old} ) = b̂i^{old} + (D̂i W′i Ki^{-1} )·,t (zit^{new} − zit^{old} ),

where b̂i^{old} and b̂i^{new} are b̂i before and after the update, and (D̂i W′i Ki^{-1} )·,t is the t-th column
of D̂i W′i Ki^{-1} .
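Because b̂i is linear in zi, the one-column update can be checked against full recomputation. The snippet below is a minimal numerical verification with illustrative dimensions; for simplicity it takes D^{-1} = Ip, which is not a choice made in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 6, 2  # illustrative dimensions for one cluster

Wi = rng.standard_normal((T, p))
Ki_inv = np.diag(1.0 / (4.0 * rng.uniform(0.5, 2.0, T) ** 2))
Di_hat = np.linalg.inv(np.eye(p) + Wi.T @ Ki_inv @ Wi)  # takes D^{-1} = I_p for simplicity
A = Di_hat @ Wi.T @ Ki_inv  # D_hat_i W_i' K_i^{-1}, computed once per iteration

mu_i = rng.standard_normal(T)  # stand-in for X_i beta
z_old = rng.standard_normal(T)
b_hat_old = A @ (z_old - mu_i)

t = 3  # the element of z_i that was just redrawn
z_new = z_old.copy()
z_new[t] = 1.7

b_hat_full = A @ (z_new - mu_i)                           # recompute from scratch
b_hat_fast = b_hat_old + A[:, t] * (z_new[t] - z_old[t])  # one-column update
print(np.allclose(b_hat_full, b_hat_fast))  # True
```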

To save on runtime, the sampling of z should be "vectorized" so that zit for all i are drawn
simultaneously at each t. This way, z is obtained in T for-loop iterations instead of nT .
The sampling scheme for zi described above is amenable to vectorization. For example,
hit for all i and t can be taken from h = K^{-1} diag(W D̂W′ ), and the updating rule for
all i simultaneously is

b̂^{new} = b̂^{old} + D̂W′ K^{-1} (z^{new} − z^{old} ),

where expressions for D̂ and b̂ can be found in Subsection 3.2.3. The calculations for hit ,
vit , and D̂W′ K^{-1} only have to be performed once per MCMC iteration.

3.2.2 Sampling β

Fixed effects are sampled marginally of the random effects, so that β is drawn from
(β|y, z, D, κ) ∼ N (β̂, B̂), where

B̂^{-1} = B0^{-1} + X′ [K + W (In ⊗ D)W′ ]^{-1} X   and   β̂ = B̂{B0^{-1} β0 + X′ [K + W (In ⊗ D)W′ ]^{-1} z}.

3.2.3 Sampling b

The posterior full conditional for b is (b|y, z, β, D, κ) ∼ N (b̂, D̂), where

D̂^{-1} = (In ⊗ D^{-1} ) + W′ K^{-1} W   and   b̂ = D̂W′ K^{-1} (z − Xβ).

The key to generating b quickly is to leverage the fact that the precision matrix D̂−1 is both
sparse and block-diagonal. Sparsity in D̂−1 comes from the panel structure, i.e., large n and
small T. Block-diagonality comes from the random effects being conditionally independent
across clusters.

Let R be the Cholesky factor of D̂^{-1} , so that R′R = D̂^{-1} . The Cholesky factor of a sparse
matrix is, in general, dense. Fortunately, R inherits the block-diagonal structure of D̂^{-1} ,
and is therefore sparse (in addition to being block-diagonal and upper triangular). Owing to
this sparsity, computations that involve R are fast, including the Cholesky factorization
used to find it.

Let a = Rb̂. The steps below use R to sample b from its full conditional density in a
computationally efficient manner.

1. Solve R′a = D̂^{-1} b̂ = W′ K^{-1} (z − Xβ) for a. Use this solution to then solve Rb̂ = a
for b̂.

2. Generate η ∼ N (0np×1 , Inp ). Solve Ru = η for u ∼ N (0np×1 , D̂). Add b̂ to get
b = b̂ + u ∼ N (b̂, D̂).
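The two steps amount to triangular solves against the Cholesky factor of the precision matrix. A dense stand-in sketch follows (in practice R would come from a sparse Cholesky routine, which this illustration does not use):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
m = 6  # dimension of the stacked b (np in the text); small and dense here
M = rng.standard_normal((m, m))
P = M @ M.T + m * np.eye(m)     # stand-in for the precision D_hat^{-1}
c = rng.standard_normal(m)      # stand-in for W' K^{-1} (z - X beta)

R = cholesky(P, lower=False)    # upper-triangular R with R' R = P

# Step 1: b_hat solves P b_hat = c via two triangular solves.
a = solve_triangular(R, c, trans="T", lower=False)  # R' a = c
b_hat = solve_triangular(R, a, lower=False)         # R b_hat = a

# Step 2: u = R^{-1} eta has covariance P^{-1} = D_hat.
eta = rng.standard_normal(m)
b = b_hat + solve_triangular(R, eta, lower=False)   # b ~ N(b_hat, D_hat)

print(np.allclose(P @ b_hat, c))  # True: b_hat is the full-conditional mean
```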

3.2.4 Sampling D

Let Γ = [b1 · · · bn ]′ . The posterior full conditional for D is (D|y, b) ∼ IW (R̂, r̂), where
R̂ = R0 + Γ′Γ and r̂ = r0 + n.

3.2.5 Sampling κ

The posterior full conditional density for κit is

π(κit |y, z, β, b, D, κ \ κit ) = fN (εit |0, 4κ²it ) fKS (κit ) / fL (εit |0, 1),     (3.4)

where the denominator comes from the observation that

fL (εit |0, 1) = ∫_0^∞ fN (εit |0, 4κ²it ) fKS (κit ) dκit

(Poirier, 1978; Jeliazkov and Rahman, 2012). Sampling proceeds using the method of inverse
transformation, where the cumulative distribution function (CDF) of the full conditional
of κit is approximated over a fine grid based on Expression (3.4).
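A minimal sketch of this griddy inverse-CDF step is given below; the grid endpoints and the truncation of the KS series are illustrative choices, not values taken from the text.

```python
import numpy as np

def f_ks(kappa, terms=100):
    """Truncated Kolmogorov-Smirnov density series."""
    j = np.arange(1, terms + 1)[:, None]
    s = (-1.0) ** (j + 1) * j**2 * np.exp(-2.0 * j**2 * kappa**2)
    return 8.0 * kappa * s.sum(axis=0)

def sample_kappa(eps, rng, grid=np.linspace(0.15, 3.5, 2000)):
    """One griddy draw from pi(kappa | eps) by inverse CDF on a fine grid."""
    var = 4.0 * grid**2
    dens = np.exp(-eps**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var) * f_ks(grid)
    cdf = np.cumsum(dens)
    cdf /= cdf[-1]  # normalize; the uniform grid spacing cancels
    return np.interp(rng.uniform(), cdf, grid)

rng = np.random.default_rng(3)
draws = np.array([sample_kappa(0.8, rng) for _ in range(2000)])
print(draws.min(), draws.mean(), draws.max())
```

Making the grid finer simply tightens the piecewise-linear CDF approximation, which is why the discretization can be made to have virtually no effect on estimation.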

3.3 Model Comparison

The comparison of models is just as important an exercise as fitting models. Typically, the
researcher does not know what a suitable model is a priori. Instead, she has a collection of
candidate models, each embodying a different hypothesis about the process generating y.
Let {M1 , . . . , ML } denote this collection, where Ml indexes a candidate model.

Let f (y|θl , Ml ) and π(θl |Ml ) denote the likelihood function and prior distribution under
model Ml , where θl is the parameter vector specific to Ml . Bayesian model comparison
proceeds by forming the posterior odds ratio,

P r(Mi |y) / P r(Mj |y) = [P r(Mi ) / P r(Mj )] × [m(y|Mi ) / m(y|Mj )],

for every pair of models Mi and Mj in {Ml }. The first fraction on the right-hand side is
called the prior odds ratio, and it represents the known a priori odds of Mi against Mj .
The second fraction is called the Bayes factor. It is a ratio of marginal likelihoods,
m(y|Ml ) = ∫ f (y|θl , Ml ) π(θl |Ml ) dθl , and is typically unknown.

There are numerous ways of estimating or approximating the Bayes factor. We use the
method of Chib (1995), as it was designed specifically for the Gibbs sampler. For notational
clarity, model indices are suppressed in the following.

The marginal likelihood is the normalizing constant of the posterior distribution:

π(z, β, b, D, κ|y) = f (y|z) f (z|β, b, κ) f (b|D) f (κ) π(β) π(D) / m(y).

Integrating out the latent data z, b, and κ yields

π(β, D|y) = f (y|β, D) π(β) π(D) / m(y),

from which we obtain the expression that becomes the basis for our marginal likelihood
estimator:

m(y) = f (y|β ∗ , D∗ ) π(β ∗ ) π(D∗ ) / π(β ∗ , D∗ |y).     (3.5)

High posterior density values are used for β ∗ and D∗ out of efficiency considerations.

Due to the presence of heterogeneity, the likelihood ordinate

f (y|β ∗ , D∗ ) = ∫ f (y|β ∗ , b) f (b|D∗ ) db

needs to be estimated. A candidate estimator for f (y|β ∗ , D∗ ) is the so-called naive
estimator

f̂ (y|β ∗ , D∗ ) = (1/J) Σ_{j=1}^{J} f (y|β ∗ , b^{(j)} ),

where b^{(j)} ∼ f (b|D∗ ) are simulated iid draws. This approach has been widely used in early
works in economics and transportation (e.g., Brownstone and Train, 1998). Although
conceptually straightforward, the naive estimator can perform poorly when f (y|β ∗ , b, D∗ )
and f (b|D∗ ) exhibit mismatch. Examples of mismatch include the situation where
f (y|β ∗ , b, D∗ ) puts "mass" in the tails of f (b|D∗ ), or where f (b|D∗ ) is diffuse relative to
f (y|β ∗ , b, D∗ ). The latter is known as the witch's hat example. When mismatch occurs, the
naive estimator is based on a handful of influential draws, resulting in high variability.
Estimators that use quasi-Monte Carlo methods such as Halton sequences may improve upon
the naive estimator (for more on the topic, see Train, 2009). They are, however, subject to
the same drawbacks as the naive estimator.
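The naive estimator is easy to state in code. The toy example below (made-up data: one cluster with a scalar random intercept) replicates the estimator several times to make its simulation variability visible.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy single-cluster data: T = 8 binary outcomes, scalar random intercept.
T = 8
mu = rng.standard_normal(T)                # stand-in for x_it' beta*
y = (mu + 1.0 + rng.logistic(size=T) > 0).astype(float)
s = 2.0 * y - 1.0

def lik(b):
    """f(y | beta*, b): a product of logistic CDFs."""
    return np.exp(-np.logaddexp(0.0, -s * (mu + b)).sum())

def naive(J, D_star=1.0):
    """Naive estimator: average the likelihood over iid draws b ~ f(b | D*)."""
    b_draws = rng.normal(0.0, np.sqrt(D_star), J)
    return np.mean([lik(b) for b in b_draws])

reps = [naive(500) for _ in range(20)]
print(np.mean(reps), np.std(reps))  # spread across replications is pure simulation noise
```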

This paper uses the insight that f (y|β, D) is the integrating constant of

π(b|y, β, D) = f (y|β, b) f (b|D) / f (y|β, D)

to develop a simulation-efficient estimator for the likelihood ordinate. That is, by viewing
f (y|β ∗ , D∗ ) as a marginal likelihood, we can apply the method of Chib (1995) to obtain

f (y|β ∗ , D∗ ) = f (y|β ∗ , b∗ ) f (b∗ |D∗ ) / π(b∗ |y, β ∗ , D∗ ),     (3.6)

where b∗ is set to a high density value. Plugging (3.6) into (3.5), we get

m(y) = f (y|β ∗ , b∗ ) f (b∗ |D∗ ) π(β ∗ ) π(D∗ ) / [π(β ∗ , D∗ |y) π(b∗ |y, β ∗ , D∗ )],

where the ordinates in the numerator are computed exactly. Estimating the marginal
likelihood thus boils down to estimating the posterior ordinates in the denominator.

Other estimators for m(y) are conceivable. However, the one described here has several

advantages over other alternatives. First, the likelihood function f (y|β, b) is known, and
is in fact just a product of logistic CDFs. Second, the high-dimensional posterior ordinate
π(b∗ |y, β ∗ , D∗ ) can be broken down into the product of lower-dimensional ordinates
∏_{i=1}^{n} π(b∗i |y, β ∗ , D∗ ). This representation leads to substantial efficiency gains because
estimating lower-dimensional ordinates separately is more efficient than estimating a high-
dimensional ordinate jointly. Third, the proposed estimator only requires a single reduced
run, and can therefore be obtained in a reasonably timely manner.

The marginal likelihood is computed on the (natural) log scale to ensure numerical stability.
The following subsections provide details on the computation of the log-marginal likelihood
estimate

ln m̂(y) = ln f (y|β ∗ , b∗ ) + ln f (b∗ |D∗ ) + ln π(β ∗ ) + ln π(D∗ )
          − ln π̂(β ∗ , D∗ |y) − Σ_{i=1}^{n} ln π̂(b∗i |y, β ∗ , D∗ ),

where the directly available terms are

ln f (y|β ∗ , b∗ ) = − Σ_{i=1}^{n} Σ_{t=1}^{T} ln[1 + exp{−sit µ∗it }],
with sit = 2yit − 1 and µ∗it = x′it β ∗ + w′it b∗i ,

ln f (b∗ |D∗ ) = ln fN (b∗ |0np×1 , In ⊗ D∗ ),

ln π(β ∗ ) = ln fN (β ∗ |β0 , B0 ),   and

ln π(D∗ ) = ln fIW (D∗ |R0 , r0 ).
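This likelihood ordinate can be computed stably with a log-sum-exp primitive. The small sketch below (made-up values for µ∗it) confirms that −Σ ln[1 + exp{−sit µ∗it}] is just the sum of log logistic CDFs:

```python
import numpy as np

rng = np.random.default_rng(5)
mu_star = rng.standard_normal(12)               # made-up mu*_it, flattened over (i, t)
y = (rng.uniform(size=12) < 0.5).astype(float)
s = 2.0 * y - 1.0

# ln f(y | beta*, b*) = -sum ln(1 + exp{-s_it mu*_it}), computed stably.
loglik = -np.logaddexp(0.0, -s * mu_star).sum()

# The same quantity written as a sum of log logistic CDFs.
check = np.sum(np.log(1.0 / (1.0 + np.exp(-s * mu_star))))
print(np.isclose(loglik, check))  # True
```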

3.3.1 Estimating ln π(β ∗ , D∗ |y)

Estimation of the log-posterior ordinate ln π(β ∗ , D∗ |y) is accomplished using the Gibbs
transition kernel. Consider the Gibbs sampler for π(β, b, D|y, z, κ). Its transition kernel is
given by

K(β ′ , b′ , D′ |y, z, κ, β, b, D) = π(β ′ , b′ |y, z, D′ , κ) π(D′ |y, b),

where β ′ , b′ , and D′ denote draws in the next iteration. By definition, the transition kernel
satisfies the invariance condition

π(β ′ , b′ , D′ |y, z, κ) = ∫ K(β ′ , b′ , D′ |y, z, κ, β, b, D) π(β, b, D|y, z, κ) dβ db dD.

Multiplying both sides by π(z, κ|y) and integrating with respect to z and κ, we get

π(β ′ , b′ , D′ |y) = ∫ π(β ′ , b′ |y, z, D′ , κ) π(D′ |y, b) π(z, b, κ|y) dz db dκ.

Finally, integrating both sides with respect to b′ results in

π(β ′ , D′ |y) = ∫ π(β ′ |y, z, D′ , κ) π(D′ |y, b) π(z, b, κ|y) dz db dκ,

from which we get the posterior ordinate estimator

ln π̂(β ∗ , D∗ |y) = ln [ (1/G) Σ_{g=1}^{G} fN (β ∗ |β̂ ∗(g) , B̂ ∗(g) ) fIW (D∗ |R̂(g) , r̂) ],

where

(B̂ ∗(g) )^{-1} = B0^{-1} + X′ [K (g) + W (In ⊗ D∗ )W′ ]^{-1} X,

β̂ ∗(g) = B̂ ∗(g) {B0^{-1} β0 + X′ [K (g) + W (In ⊗ D∗ )W′ ]^{-1} z (g) },

R̂(g) = R0 + Γ(g)′ Γ(g) ,

and {z (g) , b(g) , κ(g) } ∼ π(z, b, κ|y) are draws obtained during the main Gibbs run.

3.3.2 Estimating ln π(b∗i |y, β ∗ , D∗ )

Begin by noting that

π(bi |yi , β, D) = ∫ fN (bi |b̂i , D̂i ) π(zi , κi |yi , β, D) dzi dκi ,

where

D̂i^{-1} = D^{-1} + W′i Ki^{-1} Wi   and   b̂i = D̂i W′i Ki^{-1} (zi − Xi β).

We obtain from this expression the simulation-consistent estimator of ln π(b∗i |yi , β ∗ , D∗ ),

ln π̂(b∗i |yi , β ∗ , D∗ ) = ln [ (1/G) Σ_{g=1}^{G} fN (b∗i | b̂i^{∗(g)} , D̂i^{∗(g)} ) ],

where

(D̂i^{∗(g)} )^{-1} = (D∗ )^{-1} + W′i (Ki^{(g)} )^{-1} Wi   and   b̂i^{∗(g)} = D̂i^{∗(g)} W′i (Ki^{(g)} )^{-1} (zi^{(g)} − Xi β ∗ ),

and {zi^{(g)} , bi^{(g)} , κi^{(g)} } ∼ π(zi , bi , κi |yi , β ∗ , D∗ ) are draws obtained from a reduced
Gibbs run. The reduced-run draws {zi^{(g)} , bi^{(g)} , κi^{(g)} } for all i can be obtained
simultaneously by running a reduced Gibbs sampler over π(z, b, κ|y, β ∗ , D∗ ). This way,
components of the sampler used in estimation can be reused instead of developing a new
sampler.

3.4 Simulation Study

We carried out a simulation study to examine the performance of the proposed estimation
algorithm. Data were simulated from

yit = 1{x′it β + w′it bi + εit > 0}   and   (bi |D) ∼ N (0p×1 , D),

where β = [−0.25, −0.20, −0.15, . . . , 0.25]′ consists of 11 elements and D = 0.5 × I2 . To
ensure that yit varies within each cluster, xit was generated so that x′it β fell within ±2 of 0
for nearly all i and t. This is an important consideration in the data simulation process,
as the estimability of the model hinges on the presence of intracluster variation. With the
exception of the intercept terms xit1 and wit1 , individual covariates were generated
independently from normal distributions. Prior hyperparameters were set to β0 = 0q×1 ,
B0 = Iq , R0 = Ip , and r0 = p + 1. Simulation results are based on a posterior sample of size
10,000 following a burn-in of 1,000 draws, for a total of 11,000 cycles.

We compute inefficiency factors to gauge the performance of the Markov chain. Let θ(g)
denote a sequence of MCMC draws of some scalar parameter θ. The inefficiency factor
corresponding to θ is estimated using 1 + 2 Σ_{l=1}^{L−1} (1 − l/L) ρ̂l , where ρ̂l is the sample
autocorrelation of θ(g) at lag l, and L is set to the lag length at which the autocorrelation
tapers off (i.e., dips below some threshold, usually 0.05). The inefficiency factor is
numerically equivalent to the ratio of variances var(θ̄MCMC )/var(θ̄IID ), where the numerator
sample mean is based on a set of correlated (MCMC) draws, and the denominator sample
mean uses a hypothetical sample of independent (IID) draws. An inefficiency factor of 1
indicates that the set of MCMC draws has just as much information on the sample mean θ̄
as the set of IID draws of the same sample size. An inefficiency factor of 2 means that twice
as many MCMC draws are needed to compute θ̄ to the same level of precision as a set of
IID draws.
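A sketch of this inefficiency-factor calculation follows; the windowing rule mirrors the description above, and the AR(1) chain is an illustrative stand-in for autocorrelated MCMC output.

```python
import numpy as np

def inefficiency_factor(draws, threshold=0.05, max_lag=200):
    """1 + 2 * sum_{l=1}^{L-1} (1 - l/L) rho_hat_l, where L is the first lag
    at which the sample autocorrelation dips below the threshold."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    n = x.size
    denom = (x @ x) / n
    acf = np.array([1.0] + [(x[:-l] @ x[l:]) / ((n - l) * denom)
                            for l in range(1, max_lag)])
    below = np.nonzero(acf[1:] < threshold)[0]
    L = below[0] + 1 if below.size else max_lag
    lags = np.arange(1, L)
    return 1.0 + 2.0 * np.sum((1.0 - lags / L) * acf[lags])

rng = np.random.default_rng(6)
iid = rng.standard_normal(5000)        # IID draws: factor should be near 1
ar = np.zeros(5000)                    # AR(1) chain, phi = 0.8: sticky draws
for t in range(1, ar.size):
    ar[t] = 0.8 * ar[t - 1] + rng.standard_normal()
print(inefficiency_factor(iid), inefficiency_factor(ar))
```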

Table 3.1 reports estimation results based on a sample that uses n = 500 and T = 10. Model
parameters are estimated with reasonable to high accuracy, including those in D that are
only indirectly linked to the observed data. The posterior standard deviations on β1 and
β2 are higher than those on βj where j > 2 because xit1 and xit2 appear in wit , whereas
xitj for j > 2 do not. Inefficiency factors on the coefficients are under 3 across the board,
indicating good mixing. The Markov chain mixes relatively slowly over the elements in the
heterogeneity matrix D because D is generated based on b, which itself is generated during
the estimation process. Figure 3.1 shows autocorrelation functions (ACFs) for β1 and D11 .
The Gibbs sampler took less than 3 minutes to generate 11,000 draws.

Table 3.1: Simulation Results based on the Proposed Gibbs Algorithm

Parameter   True Value   Posterior Mean   Posterior SD   Inefficiency Factor
β1            -0.250         -0.258          0.094             1.961
β2            -0.200         -0.205          0.046             1.656
β3            -0.150         -0.156          0.025             2.749
β4            -0.100         -0.115          0.026             2.392
β5            -0.050         -0.034          0.025             2.399
β6             0.000          0.009          0.025             2.565
β7             0.050          0.066          0.025             2.619
β8             0.100          0.082          0.026             2.417
β9             0.150          0.160          0.026             2.426
β10            0.200          0.202          0.025             2.606
β11            0.250          0.268          0.026             2.689
D11            0.500          0.574          0.104            13.259
D12            0.000          0.026          0.063            10.097
D22            0.500          0.593          0.083            10.265

Figure 3.1: Autocorrelation Functions. (a) β1 ; (b) D11 . Reference line placed at ρ̂ = 0.05.

To give credence to the idea that collapsing and blocking leads to efficiency gains, we report
inefficiency factors under different sampling schemes in Table 3.2. The first column corre-
sponds to the proposed (collapsed and blocked) Gibbs sampler, where z and β are both
sampled marginally of b. The second column reports on the uncollapsed (but blocked) Gibbs
sampler, where z is sampled from its full conditional distribution, but β is drawn marginally
of b. Results for the standard (uncollapsed and unblocked) Gibbs sampler, which generates
z and β from their full conditionals, are given in the last column.

Comparing the first two columns in Table 3.2, we see that sampling z marginally of b leads
to improved mixing over the elements in D. Intuitively, this is because collapsing removes
the strong posterior correlation between z and b from the correlation in the draws of b,
which in turn reduces correlation in the draws of D. Driving the point further, Figure 3.2
shows ACFs for D22 under the proposed and standard Gibbs samplers. Lag lengths were
chosen to show L, the point at which the sample autocorrelation dips below 0.05. Under the
proposed and standard samplers, we have L = 31 and L = 71, respectively. Inefficiency
factors on the coefficient terms in β are comparable, owing to the fact that the way z is
sampled relative to b has no direct impact on β.

Comparing the second and third columns in Table 3.2, we see that drawing β marginally
of b improves the mixing of the chain over β. Inefficiency factors on D are nearly identical.
The key takeaway from these comparisons is that, by marginalizing out b from z and β, the
underlying Markov chain mixes better over D and β, respectively.

Table 3.2: Inefficiency Factors for Different Sampling Schemes

Parameter   Proposed         Uncollapsed but Blocked   Standard
            Gibbs Sampler    Gibbs Sampler             Gibbs Sampler
β1           1.961            2.325                     3.444
β2           1.656            1.956                    12.560
β3           2.749            2.673                     3.338
β4           2.392            2.549                     3.072
β5           2.399            2.635                     3.166
β6           2.565            2.613                     3.224
β7           2.619            2.636                     3.168
β8           2.417            2.561                     3.109
β9           2.426            2.605                     3.169
β10          2.606            2.862                     3.303
β11          2.689            2.704                     3.194
D11         13.259           16.945                    16.997
D12         10.097           14.498                    14.487
D22         10.265           20.477                    20.729

Figure 3.2: Autocorrelation Functions for D22 . (a) Proposed Gibbs Sampler; (b) Standard
Gibbs Sampler. Reference line placed at ρ̂ = 0.05.

Returning to the proposed Gibbs sampler, we study the effect that n and T have on the
mixing of the chain by running the sampler for various combinations of n and T. Results

are reported in Table 3.3. As expected, we find that an increase in cluster size T results in
improved mixing over the elements in D. The heterogeneity matrix is better identified when
T is large because b is more precisely estimated. We find that n has no discernible effect on
mixing efficiency.

Table 3.3: Inefficiency Factors for n ∈ {250, 500, 1000} and T ∈ {10, 15, 20}

         n = 250                  n = 500                  n = 1000
θ      T = 10  T = 15  T = 20   T = 10  T = 15  T = 20   T = 10  T = 15  T = 20
β1      2.21    2.27    2.21     1.96    2.07    1.92     2.22    2.10    1.90
β2      1.72    1.38    1.42     1.66    1.36    1.38     1.43    1.31    1.37
β3      2.65    2.32    2.38     2.75    2.35    2.42     2.81    2.59    2.42
β4      2.42    2.39    2.41     2.39    2.51    2.53     2.55    2.44    2.46
β5      2.38    2.38    2.40     2.40    2.39    2.58     2.35    2.38    2.45
β6      2.74    2.59    2.39     2.56    2.32    2.58     2.36    2.41    2.63
β7      2.65    2.31    2.32     2.62    2.42    2.41     2.35    2.38    2.44
β8      2.60    2.36    2.38     2.42    2.55    2.37     2.38    2.58    2.41
β9      2.71    2.54    2.39     2.43    2.42    2.63     2.62    2.66    2.67
β10     2.69    2.43    2.67     2.61    2.63    2.61     2.60    2.68    2.68
β11     2.97    2.70    2.44     2.69    2.80    2.73     2.66    2.67    2.66
D11    13.01    7.02    7.14    13.26    8.49    5.36    17.87    8.35    6.55
D12    11.26    6.11    5.68    10.10    6.45    4.51    10.84    6.23    5.64
D22    10.84    5.27    4.88    10.26    5.57    5.23    10.05    6.89    6.37

3.5 Conclusion

This paper discusses the efficient estimation of the binary mixed logit model. The method
is fully automatic and simpler than existing methods that rely on MH or rejection sampling.
Simulation results show that the proposed estimation method works well.

In developing the proposed estimation algorithm, we emphasized both computational and
mixing efficiency. On the computational side, we showed how the latent zit can be sampled
in parallel over clusters. For improved mixing, we collapsed and blocked the Gibbs sampler.
A detailed simulation study showed how the Markov chain benefited from these
correlation reduction methods.

Bibliography

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response
data. Journal of the American Statistical Association, 88(422):669–679.
Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal
of the Royal Statistical Society: Series B (Methodological), 36(1):99–102.
Bento, A. M., Cropper, M. L., Mobarak, A. M., and Vinha, K. (2005). The effects of
urban spatial structure on travel demand in the United States. Review of Economics and
Statistics, 87(3):466–478.
Brownstone, D. (2008). Key relationships between the built environment and VMT. Special
Report 298: Driving and the Built Environment: The Effects of Compact Development
on Motorized Travel, Energy Use, and CO2 Emissions. Committee on the Relationships
Among Development Patterns, Vehicle Miles Traveled, and Energy Consumption,
Transportation Research Board and the Division on Engineering and Physical Sciences,
Irvine, CA.
Brownstone, D. and Fang, H. (2014). A vehicle ownership and utilization choice model with
endogenous residential density. Journal of Transport and Land Use, 7(2):135–151.
Brownstone, D. and Golob, T. F. (2009). The impact of residential density on vehicle usage
and energy consumption. Journal of Urban Economics, 65(1):91–98.
Brownstone, D. and Train, K. (1998). Forecasting new product penetration with flexible
substitution patterns. Journal of Econometrics, 89(1-2):109–129.
Brueckner, J. K. and Largey, A. G. (2008). Social interaction and urban sprawl. Journal of
Urban Economics, 64(1):18–34.
Cervero, R. (1989). America’s Suburban Centers. Routledge.
Chan, J. C.-C. and Jeliazkov, I. (2009). MCMC estimation of restricted covariance matrices.
Journal of Computational and Graphical Statistics, 18(2):457–480.
Chen, M.-H. and Dey, D. K. (1998). Bayesian modeling of correlated binary responses via
scale mixture of multivariate normal link functions. Sankhyā: The Indian Journal of
Statistics, Series A, pages 322–343.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American
Statistical Association, 90(432):1313–1321.

Chib, S. (2007). Analysis of treatment response data without the joint distribution of po-
tential outcomes. Journal of Econometrics, 140(2):401–412.

Chib, S. and Carlin, B. P. (1999). On MCMC sampling in hierarchical longitudinal models.


Statistics and Computing, 9(1):17–26.

Chib, S. and Greenberg, E. (2007). Semiparametric modeling and estimation of instrumental


variable models. Journal of Computational and Graphical Statistics, 16(1):86–114.

Chib, S., Greenberg, E., and Jeliazkov, I. (2009). Estimation of semiparametric models in
the presence of endogeneity and sample selection. Journal of Computational and Graphical
Statistics, 18(2):321–348.

Chib, S. and Jeliazkov, I. (2006). Inference in semiparametric dynamic models for binary
longitudinal data. Journal of the American Statistical Association, 101(474):685–700.

Ewing, R. and Cervero, R. (2010). Travel and the built environment: a meta-analysis.
Journal of the American Planning Association, 76(3):265–294.

Fang, H. A. (2008). A discrete–continuous model of households’ vehicle choice and usage,


with an application to the effects of residential density. Transportation Research Part B:
Methodological, 42(9):736–758.

Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. (1990). Illustration of
Bayesian inference in normal data models using Gibbs sampling. Journal of the American
Statistical Association, 85(412):972–985.

Gelfand, A. E. and Smith, A. F. (1990). Sampling-based approaches to calculating marginal


densities. Journal of the American statistical association, 85(410):398–409.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, (6):721–741.

Geweke, J. (1991). Efficient simulation from the multivariate normal and student-t distri-
butions subject to linear constraints and the evaluation of constraint probabilities. In
Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface,
pages 571–578. Fairfax, Virginia: Interface Foundation of North America, Inc.

Gupta, A. K. and Nagar, D. K. (2018). Matrix Variate Distributions. Chapman and


Hall/CRC.

Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of Economic and Social Measurement, volume 5, number 4, pages 475–492. NBER.

Heckman, J. J. (1977). Sample selection bias as a specification error (with an application to the estimation of labor supply functions). Technical report, National Bureau of Economic Research.

Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of a sum of matrices.
SIAM Review, 23(1):53–60.

Holmes, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.

Jeliazkov, I. (2008). Specification and inference in nonparametric additive regression.

Jeliazkov, I. (2013). Nonparametric vector autoregressions: Specification, estimation and inference. VAR Models in Macroeconomics-New Developments and Applications: Essays in Honor of Christopher A. Sims, 32.

Jeliazkov, I., Graves, J., and Kutzbach, M. (2008). Fitting and comparison of models for
multivariate ordinal outcomes. In Bayesian Econometrics, pages 115–156. Emerald Group
Publishing Limited.

Jeliazkov, I. and Rahman, M. A. (2012). Binary and ordinal data analysis in economics:
Modeling and estimation. Mathematical Modeling with Multidisciplinary Applications,
pages 123–150.

Jeliazkov, I. and Vossmeyer, A. (2016). The impact of estimation uncertainty on covariate effects in nonlinear models. Statistical Papers, pages 1–12.

Kockelman, K. (1997). Travel behavior as function of accessibility, land use mixing, and
land use balance: evidence from San Francisco Bay Area. Transportation Research Record:
Journal of the Transportation Research Board, (1607):116–125.

Koop, G. and Poirier, D. J. (1997). Learning about the across-regime correlation in switching
regression models. Journal of Econometrics, 78(2):217–227.

Leck, E. (2006). The impact of urban form on travel behavior: A meta-analysis. Berkeley
Planning Journal, 19(1).

Li, P. (2011). Estimation of sample selection models with two selection mechanisms. Com-
putational Statistics & Data Analysis, 55(2):1099–1108.

Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to
a gene regulation problem. Journal of the American Statistical Association, 89(427):958–
966.

McCulloch, R. E., Polson, N. G., and Rossi, P. E. (2000). A Bayesian analysis of the multi-
nomial probit model with fully identified parameters. Journal of Econometrics, 99(1):173–
193.

Mitchell Hess, P., Vernez Moudon, A., and Logsdon, M. (2001). Measuring land use patterns
for transportation research. Transportation Research Record: Journal of the Transporta-
tion Research Board, (1780):17–24.

MLIT (2007). Results from the 4th Nationwide Person Trip Survey (press release).

MLIT (2012). Toshi ni okeru hito no ugoki: Heisei 22 nen zenkoku toshi koutsuu tokusei chousa shuukei kekka kara [Urban travel patterns: From the aggregate results of the 2010 nationwide urban transportation characteristics survey].

Nandram, B. and Chen, M.-H. (1996). Reparameterizing the generalized linear model to
accelerate Gibbs sampler convergence. Journal of Statistical Computation and Simulation,
54(1-3):129–144.

NPATB (2018). Heisei 29 nen ni okeru koutsuu shibou jiko no tokuchou nado ni tsuite [On the characteristics of fatal traffic accidents in 2017].

Ohlson, M., Ahmad, M. R., and Von Rosen, D. (2013). The multilinear normal distribution:
Introduction and some basic properties. Journal of Multivariate Analysis, 113:37–47.

Parady, G. T., Chikaraishi, M., Takami, K., Ohmori, N., and Harata, N. (2015). On the
effect of the built environment and preferences on non-work travel: Evidence from Japan.
European Journal of Transport & Infrastructure Research, 15(1).

Poirier, D. J. (1978). A curious relationship between probit and logit models. Southern Economic Journal, 44(3):640.

Poirier, D. J. (1998). Revising beliefs in nonidentified models. Econometric Theory, 14(4):483–509.

Poirier, D. J. (2020). Mostly Harmless Bayesian Econometrics. Unpublished manuscript.

Poirier, D. J. and Tobias, J. L. (2003). On the predictive distributions of outcome gains in the presence of an unidentified parameter. Journal of Business & Economic Statistics, 21(2):258–268.

Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 5(2):121–125.

Shannon, C. E. and Weaver, W. (1963). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.

Shively, T. S., Kohn, R., and Wood, S. (1999). Variable selection and function estimation
in additive nonparametric regression using a data-based prior. Journal of the American
Statistical Association, 94(447):777–794.

Song, Y., Merlin, L., and Rodriguez, D. (2013). Comparing measures of urban land use mix.
Computers, Environment and Urban Systems, 42:1–13.

Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26(1):24–36.

Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.

Williams, C. K. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT
press.

Yannakakis, M. (1981). Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, 2(1):77–79.

Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57(298):348–368.

Appendix A

Chapter 2 Appendix

Figure A.1: Posteriors for Equation 1 (License)

Posterior given by solid line, prior by dashed line.

Figure A.2: Posteriors for Equation 2 (Nonlicensed - Walking/Biking)

Posterior given by solid line, prior by dashed line.

Figure A.3: Posteriors for Equation 3 (Nonlicensed - Transit)

Posterior given by solid line, prior by dashed line.

Figure A.4: Posteriors for Equation 4 (Nonlicensed - Ride)

Posterior given by solid line, prior by dashed line.

Figure A.5: Posteriors for Equation 5 (Licensed - Walking/Biking)

Posterior given by solid line, prior by dashed line.

Figure A.6: Posteriors for Equation 6 (Licensed - Transit)

Posterior given by solid line, prior by dashed line.

Figure A.7: Posteriors for Equation 7 (Licensed - Ride)

Posterior given by solid line, prior by dashed line.

Figure A.8: Posteriors for Equation 8 (Licensed - Drive)

Posterior given by solid line, prior by dashed line.

Figure A.9: Posteriors for Implied Correlation Matrix (Nonlicensed)

Posterior given by solid line, prior by dashed line.

Figure A.10: Posteriors for Implied Correlation Matrix (Licensed)

Posterior given by solid line, prior by dashed line.

Appendix B

Chapter 3 Appendix

B.1 Derivations for mi and vi

Expressions for $m_{it}$ and $v_{it}$ are obtained in two steps. In the first step, $m_{it}$ and $v_{it}$ are expressed in terms of $\hat{b}_{i\setminus t}$ and $\hat{D}_{i\setminus t}$, respectively, which are the parameters of the full-conditional posterior of $b_i$ given the leave-one-out data. This is accomplished using the leave-one-out predictive density:
\begin{align*}
\pi(z_{it} \mid y, \beta, D, \kappa, z_{i\setminus t})
&= \int \pi(z_{it} \mid y_{it}, \beta, \kappa_{it}, b_i)\, \pi(b_i \mid y_{i\setminus t}, z_{i\setminus t}, \beta, D, \kappa_{i\setminus t})\, db_i \\
&\propto \int \mathbf{1}\{z_{it} \in B_{it}\}\, f_N(z_{it} \mid x_{it}'\beta + w_{it}' b_i,\, 4\kappa_{it}^2)\, f_N(b_i \mid \hat{b}_{i\setminus t}, \hat{D}_{i\setminus t})\, db_i \\
&\propto \mathbf{1}\{z_{it} \in B_{it}\}\, f_N(z_{it} \mid x_{it}'\beta + w_{it}' \hat{b}_{i\setminus t},\, 4\kappa_{it}^2 + w_{it}' \hat{D}_{i\setminus t} w_{it}),
\end{align*}
where
\[
\hat{D}_{i\setminus t} = \left( D^{-1} + W_{i\setminus t}' K_{i\setminus t}^{-1} W_{i\setminus t} \right)^{-1}
\quad \text{and} \quad
\hat{b}_{i\setminus t} = \hat{D}_{i\setminus t} W_{i\setminus t}' K_{i\setminus t}^{-1} (z_{i\setminus t} - X_{i\setminus t}\beta).
\]
It follows that $m_{it} = x_{it}'\beta + w_{it}'\hat{b}_{i\setminus t}$ and $v_{it} = 4\kappa_{it}^2 + w_{it}'\hat{D}_{i\setminus t} w_{it}$.
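The marginalization step above can be checked by simulation. The sketch below is a minimal NumPy check in which every quantity is an illustrative random stand-in, not an object from the chapter: it draws $b_i \sim N(\hat{b}_{i\setminus t}, \hat{D}_{i\setminus t})$ and $z_{it} \mid b_i \sim N(x_{it}'\beta + w_{it}'b_i, 4\kappa_{it}^2)$, ignores the truncation to $B_{it}$, and confirms that the simulated mean and variance of $z_{it}$ match $m_{it}$ and $v_{it}$.

```python
import numpy as np

# Simulation check of the marginalization (all values are illustrative stand-ins).
rng = np.random.default_rng(3)
q = 3
w = rng.standard_normal(q)                 # stand-in for w_it
xb = 0.7                                   # stand-in for x_it' beta
kap2 = 1.3                                 # stand-in for kappa_it^2
b_hat = rng.standard_normal(q)             # stand-in for b_hat_{i\t}
A = rng.standard_normal((q, q))
D_hat_loo = 0.1 * (A @ A.T + np.eye(q))    # stand-in for D_hat_{i\t} (positive definite)

# Draw b_i ~ N(b_hat, D_hat_loo), then z_it | b_i ~ N(x'beta + w'b_i, 4*kappa^2)
n = 400_000
b = rng.multivariate_normal(b_hat, D_hat_loo, size=n)
z = xb + b @ w + rng.normal(0.0, np.sqrt(4.0 * kap2), size=n)

m_it = xb + w @ b_hat                      # m_it = x'beta + w' b_hat_{i\t}
v_it = 4.0 * kap2 + w @ D_hat_loo @ w      # v_it = 4*kappa^2 + w' D_hat_{i\t} w
print(abs(z.mean() - m_it) < 0.05, abs(z.var() - v_it) < 0.1)
```

The tolerances are loose relative to the Monte Carlo standard errors at this sample size, so the check is not seed-sensitive.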

In the second step, $m_{it}$ and $v_{it}$ are written in terms of $\hat{b}_i$ and $\hat{D}_i$, respectively. We begin with the latter. Because $\hat{D}_i^{-1} = \hat{D}_{i\setminus t}^{-1} + (4\kappa_{it}^2)^{-1} w_{it} w_{it}'$, we can use Henderson and Searle (1981) to arrive at
\[
\hat{D}_{i\setminus t} = \hat{D}_i + \frac{(4\kappa_{it}^2)^{-1} \hat{D}_i w_{it} w_{it}' \hat{D}_i}{1 - (4\kappa_{it}^2)^{-1} w_{it}' \hat{D}_i w_{it}}.
\]
From here, it is straightforward to get to $v_{it} = 4\kappa_{it}^2 \left[ 1 + (4\kappa_{it}^2)^{-1} w_{it}' \hat{D}_{i\setminus t} w_{it} \right] = \frac{4\kappa_{it}^2}{1 - h_{it}}$, where $h_{it} = (4\kappa_{it}^2)^{-1} w_{it}' \hat{D}_i w_{it}$.

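The rank-one downdate and the resulting shortcut for $v_{it}$ can be verified numerically. The sketch below uses randomly generated stand-ins for $D$, $W_i$, and $\kappa_{it}$ (dimensions and values are illustrative assumptions, not quantities from the chapter):

```python
import numpy as np

# Numerical check of the rank-one downdate (Henderson and Searle, 1981)
# and the v_it shortcut; all quantities are illustrative stand-ins.
rng = np.random.default_rng(0)
T, q = 6, 3
W = rng.standard_normal((T, q))            # stand-in for W_i
kap2 = rng.uniform(0.5, 2.0, size=T)       # stand-in for kappa_it^2
A = rng.standard_normal((q, q))
D = A @ A.T + q * np.eye(q)                # positive-definite prior covariance D

Kinv = np.diag(1.0 / (4.0 * kap2))         # K_i^{-1} with entries (4*kappa_it^2)^{-1}
D_hat = np.linalg.inv(np.linalg.inv(D) + W.T @ Kinv @ W)

t = 2                                      # period to leave out
w = W[t]
keep = np.arange(T) != t

# Leave-one-out covariance from its definition ...
Kinv_loo = np.diag(1.0 / (4.0 * kap2[keep]))
D_hat_loo = np.linalg.inv(np.linalg.inv(D) + W[keep].T @ Kinv_loo @ W[keep])

# ... and from the rank-one downdate of D_hat
c = 1.0 / (4.0 * kap2[t])
h = c * (w @ D_hat @ w)                    # h_it
D_hat_sm = D_hat + c * np.outer(D_hat @ w, D_hat @ w) / (1.0 - h)

# v_it from its definition and via the 4*kappa^2 / (1 - h_it) shortcut
v_direct = 4.0 * kap2[t] + w @ D_hat_loo @ w
v_short = 4.0 * kap2[t] / (1.0 - h)
print(np.allclose(D_hat_loo, D_hat_sm), np.isclose(v_direct, v_short))
```

Because $\hat{D}_{i\setminus t}^{-1}$ is itself positive definite, $1 - h_{it} > 0$ always holds, so the denominator in the downdate is never degenerate.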
For $m_{it}$, we begin by noting that $W_{i\setminus t}' K_{i\setminus t}^{-1} (z_{i\setminus t} - X_{i\setminus t}\beta) = \hat{D}_i^{-1} \hat{b}_i - (4\kappa_{it}^2)^{-1} w_{it} (z_{it} - x_{it}'\beta)$. Plugging this into $\hat{b}_{i\setminus t}$, we get
\begin{align*}
\hat{b}_{i\setminus t}
&= \hat{D}_{i\setminus t} \left[ \hat{D}_i^{-1} \hat{b}_i - (4\kappa_{it}^2)^{-1} w_{it} (z_{it} - x_{it}'\beta) \right] \\
&= \left[ \hat{D}_i + \frac{(4\kappa_{it}^2)^{-1} \hat{D}_i w_{it} w_{it}' \hat{D}_i}{1 - h_{it}} \right] \left[ \hat{D}_i^{-1} \hat{b}_i - (4\kappa_{it}^2)^{-1} w_{it} (z_{it} - x_{it}'\beta) \right] \\
&= \hat{b}_i - (4\kappa_{it}^2)^{-1} \hat{D}_i w_{it} (z_{it} - x_{it}'\beta) + \frac{(4\kappa_{it}^2)^{-1} \hat{D}_i w_{it} w_{it}' \hat{b}_i}{1 - h_{it}} \\
&\qquad - \frac{(4\kappa_{it}^2)^{-2} \hat{D}_i w_{it} w_{it}' \hat{D}_i w_{it} (z_{it} - x_{it}'\beta)}{1 - h_{it}}.
\end{align*}
Plugging the final expression into $m_{it} = x_{it}'\beta + w_{it}' \hat{b}_{i\setminus t}$, only simple algebraic manipulations are needed to get to
\[
m_{it} = \frac{1}{1 - h_{it}} \left( x_{it}'\beta + w_{it}' \hat{b}_i \right) - \frac{h_{it}}{1 - h_{it}} z_{it}.
\]
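The closed form for $m_{it}$ can likewise be checked against the direct leave-one-out computation $m_{it} = x_{it}'\beta + w_{it}'\hat{b}_{i\setminus t}$. The sketch below uses the same kind of illustrative random stand-ins as above (hypothetical $X_i$, $\beta$, and $z_i$ included):

```python
import numpy as np

# Numerical check of the m_it shortcut; all quantities are illustrative stand-ins.
rng = np.random.default_rng(1)
T, q, k = 6, 3, 2
W = rng.standard_normal((T, q))            # stand-in for W_i
X = rng.standard_normal((T, k))            # stand-in for X_i
beta = rng.standard_normal(k)
z = rng.standard_normal(T)                 # stand-in for z_i
kap2 = rng.uniform(0.5, 2.0, size=T)       # stand-in for kappa_it^2
A = rng.standard_normal((q, q))
D = A @ A.T + q * np.eye(q)                # positive-definite D

Kinv = np.diag(1.0 / (4.0 * kap2))
D_hat = np.linalg.inv(np.linalg.inv(D) + W.T @ Kinv @ W)
b_hat = D_hat @ W.T @ Kinv @ (z - X @ beta)   # full-sample b_hat_i

t = 2
w, x = W[t], X[t]
keep = np.arange(T) != t

# Direct leave-one-out computation: m_it = x'beta + w' b_hat_{i\t}
Kinv_loo = np.diag(1.0 / (4.0 * kap2[keep]))
D_hat_loo = np.linalg.inv(np.linalg.inv(D) + W[keep].T @ Kinv_loo @ W[keep])
b_hat_loo = D_hat_loo @ W[keep].T @ Kinv_loo @ (z[keep] - X[keep] @ beta)
m_direct = x @ beta + w @ b_hat_loo

# Shortcut: m_it = [(x'beta + w'b_hat_i) - h_it * z_it] / (1 - h_it)
h = w @ D_hat @ w / (4.0 * kap2[t])
m_short = ((x @ beta + w @ b_hat) - h * z[t]) / (1.0 - h)
print(np.isclose(m_direct, m_short))
```

The shortcut avoids refitting the leave-one-out posterior for every $t$, which is the point of the two-step derivation.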

B.2 Derivations for the Updating Rule on b̂i

The updating rule for $\hat{b}_i$ is obtained by noticing that
\begin{align*}
\hat{b}_i^{\,\text{new}}
&= \hat{D}_i W_i' K_i^{-1} (z_i^{\,\text{new}} - X_i \beta) \\
&= \hat{D}_i W_i' K_i^{-1} z_i^{\,\text{new}} - \hat{D}_i W_i' K_i^{-1} X_i \beta \\
&= \hat{D}_i W_i' K_i^{-1} z_i^{\,\text{new}} - \left( \hat{D}_i W_i' K_i^{-1} z_i^{\,\text{old}} - \hat{b}_i^{\,\text{old}} \right) \\
&= \hat{b}_i^{\,\text{old}} + \hat{D}_i W_i' K_i^{-1} (z_i^{\,\text{new}} - z_i^{\,\text{old}}).
\end{align*}
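The one-line updating rule can be verified numerically: recomputing $\hat{b}_i$ from scratch after $z_i$ changes must agree with the incremental update. The quantities below are illustrative random stand-ins, not objects from the chapter:

```python
import numpy as np

# Check that the incremental update matches full recomputation of b_hat_i
# after z_i changes (all quantities are illustrative stand-ins).
rng = np.random.default_rng(2)
T, q, k = 6, 3, 2
W = rng.standard_normal((T, q))
X = rng.standard_normal((T, k))
beta = rng.standard_normal(k)
kap2 = rng.uniform(0.5, 2.0, size=T)
A = rng.standard_normal((q, q))
D = A @ A.T + q * np.eye(q)

Kinv = np.diag(1.0 / (4.0 * kap2))
D_hat = np.linalg.inv(np.linalg.inv(D) + W.T @ Kinv @ W)  # unchanged by z-updates

z_old = rng.standard_normal(T)
z_new = rng.standard_normal(T)
b_old = D_hat @ W.T @ Kinv @ (z_old - X @ beta)

b_new_full = D_hat @ W.T @ Kinv @ (z_new - X @ beta)       # full recomputation
b_new_incr = b_old + D_hat @ W.T @ Kinv @ (z_new - z_old)  # incremental rule
print(np.allclose(b_new_full, b_new_incr))
```

Since $\hat{D}_i W_i' K_i^{-1}$ does not depend on $z_i$, it can be computed once and reused across MCMC updates of $z_i$.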

