
Assessing the Accuracy of Bayesian Additive

Regression Tree Credible Intervals

by

Fatimah AL Ahmad

Thesis
submitted in partial fulfillment of the requirements for
the Degree of Master of Science (Mathematics and Statistics)

Acadia University
Winter Convocation 2016

© by Fatimah Mohammed AL Ahmad, 2016


This thesis by Fatimah AL Ahmad was defended successfully in an oral examination on
1st April, 2016.

The examining committee for the thesis was:

________________________

Dr. Harish Kapoor, Chair

________________________

Dr. Thomas Loughin, External Reader

________________________

Dr. Wilson Lu, Internal Reader

________________________

Dr. Hugh Chipman, Supervisor

_________________________

Dr. Ying Zhang, Acting Head

This thesis is accepted in its present form by the Division of Research and Graduate
Studies as satisfying the thesis requirements for the degree Master of Science
(Mathematics and Statistics).

………………………………………….

I, Fatimah AL Ahmad, grant permission to the University Librarian at Acadia University to
reproduce, loan or distribute copies of my thesis in microform, paper or electronic
formats on a non-profit basis. I, however, retain the copyright in my thesis.

_________________________

Fatimah AL Ahmad, Author

_________________________

Dr. Hugh Chipman, Supervisor

_________________________

Date

Contents

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 BART method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 The BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Bayesian estimation of the BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 BART prior parameters and default values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 Prior variance on 𝜇𝑖𝑗 |𝑇𝑖 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2 The σ prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.3.3 The choice of m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 BART models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.1 CV.BART mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4.2 Default.BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 The simulation study and setting up the experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1 The experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.1 The setting of experimental factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.2 Additional variables not treated as factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.1.3 Response variables for the experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2 Choosing the number of MCMC iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 The simulated function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4 Parallel computing mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1 Analysis of coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Analysis of width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.3 Analysis of SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.4 Similarities and differences between responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.5 Comparison between the two, three, and seven-way interaction models . . . . . . . . 65

4.6 The correlation between the responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

List of Figures

2.1 A simple realization of g(x; T; M) in the BART model . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 The simulated function and data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 BART uncertainty prediction. Training data are plotted as individual points . . . . . . 12

2.4 The prior σ distribution when 𝜎̂ = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 The relation between MCMC samples and different responses (width, coverage, SSE)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Posterior samples of σ versus MCMC iterations number, for training samples of size

1000 and 10000. Burn-in samples are red . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3 Six realizations of randomly simulated f(x) in two dimensions, as described in the

text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 The main effects for coverage90, 95, and 99 respectively. The factors’ order along

the horizontal axis is n, dimension p, σ, predictor correlation r, junk variables, error

distribution, and BART method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Main effects for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 The n: σ effect on coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.4 The p: σ effect for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5 The n: method effect for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.6 The main effects for width90, 95, and 99. The factors order is n, p, σ, predictor

correlation r, junk variables, error distribution, and method . . . . . . . . . . . . . . . . . . . . . . 55

4.7 The main effects for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8 n: junk effect for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.9 The main effects for SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.10 The n: junk effect on SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.11 The n: p effect for predictive SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

List of Tables

2.1 Mean of BART CI coverage, width and SSE over 500 test points, for 10 replicates . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 The prior parameters’ combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 Summary of the factors of the designed experiment . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Predictive performance at various MCMC iterations (N) and sample sizes (n) . . . . 36

3.3 The 8 rows of D in the first job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.1 ANOVA table for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 The ANOVA table for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 ANOVA table for SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.4 The R² of interaction models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.5 The correlation between responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Abstract

A common type of supervised learning problem is to use training data to estimate a

predictive model for a numeric response. Many supervised learning models such as

Bayesian Additive Regression Trees (BART) try to flexibly model the data. This Bayesian

“sum of trees” model uses MCMC backfitting to simulate posterior samples. BART also

provides credible intervals (CIs) for prediction.

This thesis studies the accuracy of BART credible intervals and analyzes various

factors’ effects on it. These factors include the sample size, dimension, noise standard

deviation, predictors’ correlations, junk variables, type of error distribution, and BART

method. Simulation is used to compute CI accuracy with a designed experiment that

systematically varies the factors to find their effects. Analysis of experimental results

gives conclusions about BART CI accuracy. It is found to depend considerably on sample

size and error variance.

Acknowledgements

I would like to thank the Ministry of Higher Education in the Kingdom of Saudi Arabia

for financial support that enabled me to complete my studies. A special thanks to Dr.

Chipman who gave me the opportunity to work under his supervision. I am grateful for

all his effort, support, encouragement, and suggestions which helped me to finish this

thesis experiment and writing. I would also like to thank the faculty, staff and students

of Acadia University, specifically those who work and study in the Department of

Mathematics and Statistics.

To my husband, my entire family, and all my friends: I am really thankful and

appreciate all your assistance in different aspects of my life. Finally, thank you to

Canada, which has welcomed us among its international students and families, and

thank you to all the kind people we have met here who made us feel at home.

Chapter 1
Introduction

An important statistical problem is inference and modelling for an unknown function

f to predict a numeric response Y by using a p-dimensional vector x of predictors. This

kind of supervised learning problem, with a numeric response, is often referred to as a

regression model. Training data, consisting of (x, Y) pairs, are used to “learn” or

estimate the unknown function f. Supervised learning models have various kinds of

structure. For instance, they can be parametric, like a linear regression model, or

nonparametric, like a decision tree or a random forest. The supervised learning model studied in this thesis is

the nonparametric “Bayesian Additive Regression Trees” (BART), developed by

Chipman, George and McCulloch (2010), hereafter referred to as CGM (2010).

CGM (2010) develop BART as a Bayesian “sum of trees” model. A large number

(typically 50-200) of decision trees are estimated in such a way that their sum is an

accurate prediction of the response. It is an ensemble method, like bagging and random

forests, in which a large number of decision trees are combined into a prediction model. It

uses a Bayesian MCMC algorithm that generates simulated samples from the posterior

distribution for fitting and inference.

The Bayesian specification of BART provides posterior distributions that can be used

to quantify uncertainty in predictions for the response at an input value. In particular,

MCMC samples enable the construction of credible intervals (CIs) for the prediction of

response Y at input x. The objective of this thesis is to examine the accuracy of BART's

uncertainty prediction under the effect of various factors. These factors are mostly

related to properties of the training data, and include the sample size n, noise standard

deviation σ, dimension p, predictors’ correlation r, junk variables, types of error

distribution, and BART method. These factors have different numbers of levels (2 or 4)

and different types (numeric or categorical).

To study the accuracy of BART CIs under all these factors, a simulation study

is conducted in which the accuracy measures can be computed with a known

response function. A designed experiment in the seven factors is chosen to carry out

the study. A designed experiment enables the analysis of the factors’ influences on the

various responses. To keep the design and analysis as simple as possible, a full factorial

design is selected to conduct this study.

For the analysis, ANOVA tables are used to draw conclusions about BART CI

accuracy and the extent to which the factors affect the responses. We study

main effects and two-way interactions, which explain most of the variation in the

responses. We use three kinds of responses that measure performance of BART CIs:

coverage, width, and the SSE of prediction. The total number of responses is seven,

corresponding to coverage at three levels (90, 95, 99%), width at three levels (90, 95,

99%), and the predictive SSE.

The remainder of this thesis is organized as follows: Chapter two reviews the

background of the BART model which is utilized here to fit and model the population

function f. Chapter three describes the simulation study and the details of setting up

the experiment, such as the factors and their levels, the generated function f, and the

management of the experiment's runs as “parallel jobs” on a compute cluster. Chapter

four presents the analysis for three of the seven responses: coverage95, width95, and

the predictive SSE. These give a good representation of the analysis of all responses,

since the results for levels 90, 95, and 99% are similar to each other. In Chapter five, we conclude

with the most important results and discuss future work.

Chapter 2
BART methods

This chapter has four sections. The first and second sections give background on

BART models, BART formulas, and BART credible intervals. The third section discusses

the choice of BART prior parameters. Section four describes two different versions of

the BART model: CV.BART and Default.BART.

2.1 The BART model

Here is a brief explanation of the BART model. The population model is given as

(2.1) y = f(x) + ε,   ε ~ N(0, σ²).

The BART model estimates the population model (2.1) where the conditional mean at a

specific point x equals the true function f(x),

(2.2) y = E(y|x) + ε.

The E(y|x) of BART can be expressed as a sum of decision trees

(2.3) y = g(x; T₁, M₁) + ... + g(x; T_m, M_m) + ε.

In (2.3) the constant m determines the number of decision trees used in the sum of

trees. Each g represents the output from a different tree. That is, 𝑇1 , 𝑇2 , … , 𝑇𝑚 are

different trees, and associated with tree 𝑇𝑖 is a vector of terminal node parameters 𝑀𝑖 .

Suppose there are 𝑏𝑖 terminal nodes in 𝑇𝑖 . Then 𝑀𝑖 = (𝜇𝑖1 , 𝜇𝑖2 , … , 𝜇𝑖𝑏𝑖 ). The function

“g” is a generic function that takes a predictor vector x, a tree T and a set of terminal

node parameters M, and returns the output of the tree for input x. Thus, a model

tree T_i defines a binary splitting on the x variables, and following its decision rules

leads to one of the μ values in M_i.

Anand (2015) illustrated how a single tree T produces an output. We summarize this

example here. Figure 2.1 shows a tree model with three terminal nodes. The

parameter T represents the tree structure and the two decision rules 𝑥5 < 1 and 𝑥2 < 4.

The parameter M = (𝜇1 , 𝜇2 , 𝜇3 ) = (-2, 5, 7) represents the outputs from the three

terminal nodes. The input x = (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , 𝑥5 ) = (1.1, 5.4, 0.1, 2.3, 0.5) would lead to

prediction 𝜇2 = 5 since we branch left on 𝑥5 = 0.5 < 1 and then right on 𝑥2 = 5.4 ≥ 4.
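As a concrete illustration, the following R sketch (our own illustration, not code from CGM or Anand) hard-codes the tree of Figure 2.1 and evaluates it at the input above:

    # g(x; T, M) for the Figure 2.1 tree: rules x5 < 1 and x2 < 4,
    # terminal node parameters M = (-2, 5, 7).
    g.example <- function(x) {
      if (x[5] < 1) {              # branch left on x5
        if (x[2] < 4) -2 else 5    # mu_1 = -2, mu_2 = 5
      } else {
        7                          # mu_3 = 7
      }
    }
    x <- c(1.1, 5.4, 0.1, 2.3, 0.5)
    g.example(x)                   # returns 5, the prediction mu_2 above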

The parameters of BART (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ..., 𝑀𝑚 , 𝜎) are unknown and must be

estimated from training data. Because these parameters are estimated, there is

statistical uncertainty in these estimates. This leads to uncertainty in functions of the

parameter, such as f(x), the predicted mean response at a particular input x. CGM use

Bayesian methods to estimate the parameters and quantify uncertainty. This will be

described in the next section.

Figure 2.1: A simple realization of g(x; T; M) in the BART model.

2.2 Bayesian estimation of the BART model

All statistical learning models share the same aim of producing a good prediction for

the response y at any new point x. To estimate the BART model from training data,

Bayesian methods are used. In a Bayesian analysis, we must specify a prior distribution

P(θ) for the parameter θ. Inference combines the likelihood P(y|θ) with the prior

distribution P(θ) through Bayes' theorem to give the posterior distribution

P(θ|y) = P(y|θ) P(θ) / P(y),

where 𝑃(у) is the marginal probability function of y.

For BART, the parameter vector 𝜃 is the BART parameters (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ...,

𝑀𝑚 , and 𝜎). Prior probability distributions for the parameters will be discussed later in

Section 2.3, after outlining MCMC for BART.

Markov chain Monte Carlo (MCMC) is the computational technique used to calculate

posterior distributions for the parameters and quantify uncertainty. To show the

mechanism of MCMC, let us suppose that the objective is to find the posterior mean of

parameter 𝜃1 , where 𝜃1 is the first element of the θ vector and let us say the total

number of parameters is z. The posterior mean is given as

E(θ₁ | y) = ∫ θ₁ P(θ | y) dθ₁ dθ₂ ⋯ dθ_z.

For large z, the integral is difficult to evaluate. Instead, we construct a Markov

chain that generates N samples from the posterior distribution P(θ | y), denoted

θ^(1), θ^(2), …, θ^(N). Samples of θ₁ are then obtained by taking the first element of

each vector θ^(i), i = 1, …, N. The posterior mean of θ₁ is approximated by the

sample mean of θ₁^(1), θ₁^(2), …, θ₁^(N).

With the BART model we are not interested in posterior distributions for individual

trees (𝑇𝑗 's) or terminal node parameters (𝑀𝑗 's). We are more interested in a posterior

distribution for the predictions, which is a function of the parameters. Analytically

obtaining a posterior distribution for predictions is not possible because this would

involve integration of the posterior distribution over all these parameters. However, it

is straightforward to compute the posterior for f(x) using the MCMC samples. We

compute f(x) for each MCMC sample, and these sampled f(x) values then correspond to

samples from the posterior for f(x). Credible intervals (CIs) can also be obtained from

MCMC samples of the posterior. For instance, for a parameter 𝜃1 , a 95% CI could be

computed as the 2.5% and 97.5% quantiles of the MCMC samples of 𝜃1 . The

computational details of MCMC, such as the number of MCMC iterations, and the

amount of burn-in and thinning for the chain, will be discussed later in Chapter 3.

The MCMC sample size has to be adequate to give good estimates of BART CIs;

the CIs should be more accurate with a larger posterior sample. For that reason,

Chapter 3 reports a trial examining whether BART CIs are estimated better with 10,000

MCMC samples than with smaller numbers such as 4000 and 1000.

Two standard adjustments are applied to the MCMC samples: burn-in and

thinning. The Markov chain cannot give draws from the stationary posterior

distribution at its earliest iterations, because it still depends strongly on the starting

values. During the burn-in period of MCMC, samples are discarded until the chain

appears to have converged to a stationary distribution.

Once the MCMC has converged, sampled values may still be autocorrelated.

Thinning, which is the discarding of some MCMC samples, can reduce dependence and

use less memory to store sampled values. In our implementation, a fifth of MCMC

draws are discarded as a burn-in while the rest are thinned. Suppose that the MCMC

iterations are labelled 1, 2, ..., i, ..., 10,000, and the corresponding sampled parameters

are θ^(1), θ^(2), ..., θ^(i), ..., θ^(10000). MCMC iteration i samples θ^(i) using a Markov chain,

conditional on θ^(i−1). Although the MCMC is constructed so that it generates samples

from the posterior distribution of θ, because it is a Markov chain, θ^(i) values will be

correlated with θ^(i−1) values. When this correlation is high, it is not necessary to save θ^(i)

for all values of i. That is, if we save θ^(i), then θ^(i+1) will have values quite similar to

θ^(i). The correlation between MCMC samples decreases as there are more iterations

between them. The process of keeping every k-th sample is called "thinning" and it is a

way of reducing computer storage, while keeping most of the information contained in

the MCMC samples. In our use of BART, we keep every tenth MCMC sample from the

posterior, discarding the other 9 samples.

BART generates credible intervals for f(x) using the MCMC samples. At a particular

x, every MCMC sample gives a corresponding value of f(x), so the draws together form

a sample from the posterior distribution of the mean function f(x) at that x. Taking

quantiles of these f(x) values yields a credible interval for f(x) at that specific x. For

example, a 95% CI is obtained from the 2.5% and 97.5% quantiles of the MCMC

samples of f(x) at a particular x.
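Concretely, if fx.samples holds the saved posterior draws of f(x) at one test point (a hypothetical variable name; the rnorm call below is only a placeholder so the lines run on their own), the interval is one line of R:

    fx.samples <- rnorm(400, mean = 1.2, sd = 0.1)          # placeholder draws
    ci95 <- quantile(fx.samples, probs = c(0.025, 0.975))   # 95% CI for f(x)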

The new work of this thesis is studying the credible interval properties: coverage,

width, and sum of squared errors “SSE”. We calculate all three quantities by using a test

data set. This test set is from a simulation, so it consists of values of x and f(x) for a large

sample of inputs. Thus, we know the actual value of f(x). This is necessary for

computing coverage and SSE. Suppose BART has a CI at 𝑥𝑖 given by the CI lower bounds

“LB(x_i)” and upper bounds “UB(x_i)”. Equations (2.4)–(2.6) define the three CI

properties.

(2.4) coverage = (1/n.test) ∑_{i=1}^{n.test} I( LB(x_i) < f(x_i) < UB(x_i) ),

(2.5) width = (1/n.test) ∑_{i=1}^{n.test} ( UB(x_i) − LB(x_i) ),

and

(2.6) SSE = ∑_{i=1}^{n.test} ( f(x_i) − ŷ_i )².

In our experiments in Chapter 4, we use n.test = 10⁴.
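A minimal R implementation of (2.4)–(2.6), assuming vectors LB, UB, f.true (the known f(x_i) values) and yhat (the point predictions) over the n.test test points (all names our own):

    # CI properties over a test set, following (2.4)-(2.6).
    ci.properties <- function(LB, UB, f.true, yhat) {
      c(coverage = mean(LB < f.true & f.true < UB),  # (2.4)
        width    = mean(UB - LB),                    # (2.5)
        SSE      = sum((f.true - yhat)^2))           # (2.6)
    }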

To illustrate BART credible intervals and their properties, consider the following

example. Suppose the true function is f(x) = sin(2πx) + 2·I(x > 0.5) + exp{2.5x − 0.5},

where I(·) is the indicator function. Forty data points are generated randomly: the

predictor x has a uniform distribution, and the response is generated as f(x) plus

i.i.d. N(0, 0.25²) random errors. Figure 2.2 shows f(x).

Figure 2.2: The simulated function and data set.

We use Default.BART with 1100 MCMC samples to estimate the model and predict at

500 test points. Predictions of f(x) and 90% credible intervals are shown in Figure 2.3. It

shows that the BART fit is a step function, where each step corresponds to a change

in the output of one of the m trees.
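The fit can be reproduced along the following lines with the BayesTree package (a sketch: the exact seed and simulated data behind the figures are not recorded here, and f is the reconstruction given above):

    library(BayesTree)
    set.seed(1)                                   # hypothetical seed
    f <- function(x) sin(2*pi*x) + 2*(x > 0.5) + exp(2.5*x - 0.5)
    x.train <- runif(40)                          # forty training points
    y.train <- f(x.train) + rnorm(40, sd = 0.25)
    x.test  <- seq(0, 1, length.out = 500)        # 500 test points
    fit <- bart(x.train = matrix(x.train), y.train = y.train,
                x.test = matrix(x.test),
                ndpost = 1000, nskip = 100)       # 1100 MCMC iterations total
    # Pointwise 90% credible intervals from the posterior draws of f(x):
    ci <- apply(fit$yhat.test, 2, quantile, probs = c(0.05, 0.95))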

Figure 2.3: BART uncertainty prediction. Training data are plotted as individual points.

Table 2.1 summarizes 10 replications of the experiment described above. In each

replicate, a different training set is generated, and CIs are obtained at 500 test points.

The mean values of coverage, width and SSE over the 500 test points are shown in

Table 2.1. There is considerable variation in the values, due to the randomness of the

different samples. This suggests that in experiments to study CI properties, it will be

necessary to simulate multiple data sets for each combination of factors. This is

discussed further in Section 3.1.2.

Properties of BART 90% CIs

replicate   coverage   width   SSE
1           0.926      0.702   41.289
2           0.864      0.674   29.240
3           0.874      0.664   67.795
4           0.972      0.872   16.255
5           0.924      0.734   27.493
6           0.826      0.638   56.190
7           0.954      0.760   52.083
8           0.902      0.653   31.152
9           0.956      0.840   23.738
10          0.810      0.609   52.074
mean        0.901      0.715   39.731

Table 2.1: Mean of BART CI coverage, width and SSE over 500 test points, for 10 replicates.

2.3 BART prior parameters and default values

As described in the previous section, BART is a Bayesian “sum of trees” model.

CGM developed a specification for the prior that depends on four key prior

parameters. One of these parameters is related to the prior for the means μ_ij, two

others are related to the prior for σ, and the fourth controls the number of trees in

the BART model. These parameters are k, ν, q, and m, respectively.

The prior distributions are an essential part of the BART model, in particular for the

sum-of-trees components (T₁, M₁), …, (T_m, M_m) and σ. CGM factor the joint prior on all

parameters as

(2.7) P((T₁, M₁), …, (T_m, M_m), σ) = [ ∏_i P(M_i, T_i) ] P(σ) = [ ∏_i P(M_i | T_i) P(T_i) ] P(σ),

(2.8) P(M_i | T_i) = ∏_j P(μ_ij | T_i).

In (2.7), CGM assume that in the prior, different trees (and their terminal node

parameters) are independent from each other. In (2.8), given tree 𝑇𝑖 , the terminal node

parameters of that tree are conditionally independent.

To describe the prior parameters, we discuss each one individually in a separate

section. Unless otherwise noted, the prior specifications are summaries of priors

developed by CGM.

2.3.1 Prior variance on μ_ij | T_i

The prior P(μ_ij | T_i) is specified as normal with mean μ_μ and variance σ_μ². The

sum-of-trees formula (2.3) shows that E(y|x) equals the sum of m μ_ij's, so the prior

mean and variance of E(y|x) are m times those of a single μ_ij, since the μ_ij's are

i.i.d. Thus, the prior for E(y|x) is N(mμ_μ, mσ_μ²). CGM assume that the E(y|x) values

mostly fall between the minimum and maximum of the observed y data. They suggest

specifying μ_μ and σ_μ² so that N(mμ_μ, mσ_μ²) gives most of the prior probability to

the range (y_min, y_max). This choice of μ_μ and σ_μ can be obtained by setting

y_min = mμ_μ − k√m σ_μ and y_max = mμ_μ + k√m σ_μ.

CGM suggest values of 1, 2, 3, or 5 for the parameter k. Each value of k specifies a

particular prior probability that E(y|x) falls within the interval (y_min, y_max). For

instance, k = 2 gives approximately 95% prior probability.

This strategy uses the minimum (y_min) and maximum (y_max) of the data to define

a probability for the range of plausible E(y|x) values. To apply it, we shift and rescale

the y values so that y_min = −0.5 and y_max = 0.5. We then specify the prior mean and

standard deviation of μ_ij as μ_μ = 0 and σ_μ satisfying k√m σ_μ = 0.5, yielding

(2.9) μ_ij ~ N(0, σ_μ²), where σ_μ = 0.5/(k√m).
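For example, under the default choices k = 2 and m = 200 discussed below, each terminal node parameter receives a very tight prior:

    k <- 2; m <- 200                  # default values suggested by CGM
    sigma.mu <- 0.5 / (k * sqrt(m))   # prior sd of each mu_ij, about 0.0177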

From (2.9), k can be viewed as a scaling and shrinkage parameter. It keeps the effect

of each individual tree in (2.3) small by shrinking the μ_ij toward zero. As m and k

increase, the μ_ij's have smaller prior variance, and the posterior distribution of each

μ_ij is concentrated on values close to zero.

CGM describe two methods to specify k. One uses cross-validation (the CV.BART

method discussed later in Section 2.4). In the other method, default values are chosen;

CGM suggest k = 2 as a reasonable default. The corresponding Default.BART method is

described later in Section 2.4.

2.3.2 The σ prior

The prior parameters for σ are the degrees of freedom ν and the quantile q, which are

used in specifying a scaled inverse chi-squared distribution for σ². CGM specify the

prior on the residual variance σ² as σ² ~ νλ/χ²_ν. Each of ν and λ has a particular task:

ν determines the spread of the prior, and λ determines its location. Small values of ν

correspond to greater spread in the prior distribution. CGM

suggest choosing ν between 1 and 10, and then choosing λ so that an upper percentile

of the σ prior (say q = 75%, 90% or 99%) corresponds to 𝜎̂, a rough estimate of σ. CGM

suggest that this rough estimate of σ be obtained in one of two ways: either the residual

standard deviation from a linear regression, or a proportion (such as 20%) of the sample

standard deviation of the y values.

CGM suggest three combinations of the parameters (ν, q): (10, 0.75), (3, 0.90), and

(3, 0.99), called conservative, default, and aggressive, respectively. Figure 2.4 shows

the σ prior for all three (ν, q) settings when σ̂ = 2; this specification encodes the belief

that σ < 2. Figure 2.4 also illustrates an inverse relationship between q and σ: when q

increases, the prior on σ moves toward smaller values.

Figure 2.4: The prior σ distribution when 𝜎̂ = 2.


We can choose among the three values of (ν, q) by using cross-validation or by

using a default value. CGM suggest the default setting (3, 0.90).
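Given ν, q and the rough estimate σ̂, the scale λ follows by inverting the condition P(σ < σ̂) = q under σ² ~ νλ/χ²_ν, which gives λ = σ̂²·χ²_{ν,1−q}/ν. A small R sketch (our own, not CGM's code):

    # lambda such that the sigma prior puts probability q below sigma.hat.
    lambda.from.q <- function(sigma.hat, nu, q) {
      sigma.hat^2 * qchisq(1 - q, df = nu) / nu
    }
    lambda.from.q(sigma.hat = 2, nu = 3, q = 0.90)  # the default (3, 0.90) setting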

2.3.3 The choice of m

The parameter m determines the number of trees in the BART model. CGM suggest

trying only two values, m = 50 and m = 200, chosen either by cross-validation or by

using the default value m = 200. In checking the performance of this choice, CGM

observed that BART prediction improves as m increases up to a certain point, beyond

which performance slowly degrades. Therefore, m should be neither too small nor

unnecessarily large. The values 50 and 200 are typically large enough to give

reasonable performance.

2.4 BART models

This research uses two versions of the BART model, distinguished by the way in

which the four key prior parameters are specified: either the default settings of CGM

are used, or the parameters are chosen by cross-validation. These versions are

Default.BART and CV.BART. Section 2.4.1 describes the algorithm for CV.BART,

while Section 2.4.2 explains the Default.BART model.

2.4.1 CV.BART mechanism

The implementation of cross-validation is similar to that given in Section 5.1 of CGM.

The CV.BART algorithm is outlined below. It begins with a training and a testing dataset

(simulated, in the case of this thesis).

1- The prior parameter combinations are set up as a small designed experiment with

all factorial combinations of

k = 1, 2*, 3, 5

(ν, q) = (10, 0.75), (3, 0.90)*, (3, 0.99)

m = 50, 200*.

Default values are indicated by “*”.

The number of these combinations is 4 × 3 × 2 = 24; Table 2.2 shows them.

2- Divide training data randomly into five sets of roughly equal size, denoted

C₁, C₂, C₃, C₄, C₅. Denote the set of all training data except C_j by C_(j), for j = 1, …, 5.

The prior parameters
i    k    (ν, q)    m
1 1 (10, 0.75) 50
2 2 (10, 0.75) 50
3 3 (10, 0.75) 50
4 5 (10, 0.75) 50
5 1 (10, 0.75) 200
6 2 (10, 0.75) 200
7 3 (10, 0.75) 200
8 5 (10, 0.75) 200
9 1 (3, 0.90) 50
10 2 (3, 0.90) 50
11 3 (3, 0.90) 50
12 5 (3, 0.90) 50
13 1 (3, 0.90) 200
14 2 (3, 0.90) 200
15 3 (3, 0.90) 200
16 5 (3, 0.90) 200
17 1 (3, 0.99) 50
18 2 (3, 0.99) 50
19 3 (3, 0.99) 50
20 5 (3, 0.99) 50
21 1 (3, 0.99) 200
22 2 (3, 0.99) 200
23 3 (3, 0.99) 200
24 5 (3, 0.99) 200

Table 2.2: The prior parameters’ combinations.

3- For the BART parameters in one of the 24 rows of Table 2.2, train 5 BART

models. The jth model (j = 1, …, 5) is trained using C_(j). The prediction for

observation i using model j is denoted Ŷ_i^(j), j = 1, …, 5.

4- The validation SSE corresponding to a row of Table 2.2 is

(2.11) SSE = ∑_{j=1}^{5} ∑_{i∈C_j} ( Y_i − Ŷ_i^(j) )².

Note that the predicted values are for the fold of data not used in training.

5- The validation 𝑆𝑆𝐸 will be calculated for each of the 24 rows of Table 2.2. This

involves fitting a total of 5 x 24 = 120 BART models.

6- Determine the best CV.BART combination of prior parameters by finding the row

z among the 24 combinations in Table 2.2 that has the lowest value of 𝑆𝑆𝐸.

Thus, z represents the best CV.BART model over 24 CV.BART models.

7- Using the parameters from row z, re-estimate the BART model on the full training

data set, without splitting.

8- Use the fitted model from step 7 to make predictions for the test data. Then

calculate the credible interval coverage, credible interval width, and error sum of

squares “SSE”. These quantities are defined in Section 2.2.
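A condensed R sketch of steps 2–6 follows; it assumes x.train (a matrix) and y.train hold the training data, and all other names are our own. The 24-row params table reproduces Table 2.2, and the short MCMC runs (100 burn-in, 400 sampling) match the settings described later in Section 3.1.2:

    library(BayesTree)
    nuq    <- data.frame(nu = c(10, 3, 3), q = c(0.75, 0.90, 0.99))
    grid   <- expand.grid(k = c(1, 2, 3, 5), m = c(50, 200), nq = 1:3)
    params <- data.frame(k = grid$k, nu = nuq$nu[grid$nq],
                         q = nuq$q[grid$nq], m = grid$m)   # Table 2.2
    fold   <- sample(rep(1:5, length.out = nrow(x.train))) # step 2
    cv.sse <- numeric(24)
    for (z in 1:24) {                                      # steps 3-5
      sse <- 0
      for (j in 1:5) {
        fit <- bart(x.train = x.train[fold != j, , drop = FALSE],
                    y.train = y.train[fold != j],
                    x.test  = x.train[fold == j, , drop = FALSE],
                    k = params$k[z], sigdf = params$nu[z],
                    sigquant = params$q[z], ntree = params$m[z],
                    ndpost = 400, nskip = 100, verbose = FALSE)
        pred <- colMeans(fit$yhat.test)                    # fold-j predictions
        sse  <- sse + sum((y.train[fold == j] - pred)^2)
      }
      cv.sse[z] <- sse                                     # (2.11)
    }
    z.best <- which.min(cv.sse)                            # step 6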

2.4.2 Default.BART model

The Default.BART model is very simple compared to CV.BART. A single BART model is

fit using the default prior parameter values, denoted with “*” in step 1 above.

Therefore, to find the Default.BART CI properties, we only need to follow steps 7 and 8

above, using the default parameters instead of the cross-validated ones.

Chapter 3
The simulation study and setting up the experiment

The thesis studies the accuracy of BART credible intervals under the effect of specific

factors. An analysis of variance (ANOVA) summarizes these factors’ effects on BART CIs.

The experiment has seven factors: sample size “n”, predictors’ dimension “p”, noise

standard deviation “σ”, junk variables, predictor correlation “r”, type of error

distribution, and BART method. A convenient way to conduct this experiment is using a

simulation study in which the true mean function is known. This makes it possible to

calculate measures of CI performance. This simulation is carried out using a designed

experiment.

This chapter describes the study, including the experimental factors and their levels.

It presents a separate small study to determine a sufficient number of MCMC

iterations. It also gives the details of the simulated function used here and describes

the use of parallel computing to run the experiment.

The results of the experiment will be analyzed in Chapter 4.

3.1 The experimental settings

3.1.1 The setting of experimental factors

Experimental design has extensive applications in many fields. In manufacturing, for

instance, it assists in identifying the significant factors that affect the quantity or

quality of production. Experimental design techniques can improve process yields and

reduce time and overall cost (Montgomery, 2013). A designed experiment is a

systematic arrangement for evaluating the performance of a response under the

influence of one or more independent factors, and it can determine which design

parameters importantly affect the response. Using an additive model allows the main

effects and interaction effects of the factors to be estimated simply.

Oehlert (2010) and Montgomery (2013) describe different types of experimental

design, such as full factorial, fractional factorial, and block designs. Each design has

particular characteristics and features that suit different situations. Given this

research's goal of studying the performance of BART CIs under various factors, we

carry out the study with a full factorial design: it is the simplest design, and its

analysis is easy.

In all experimental designs, it is necessary to determine which factors to study and

their levels. The selection of the seven factors in our study reflects our belief in

their importance. Six of the seven factors are related directly to the population model

(2.1), while the factor “BART method” concerns the choice of prior parameters for BART.

Other studies consider similar sets of factors. For instance, Friedman (2001) used three

factors to examine the performance of various function estimation methods. These

factors were the sample size, the underlying true function, and whether the error

distribution is normal or slash. An experiment in CGM (2010) considers some factors

similar to those studied in this thesis. They conducted an experiment to show the

performance and features of the BART model against the true underlying function. They

used a fixed function with five actual predictors 𝑥𝑖 and included additional junk variables

to give a total of 10, 100, or 1000 predictors. Thus, in the BART study the number of

junk variables and the total dimension were varied. These previous examples show

that our seven factors are reasonable choices for examining their effects on BART CIs.

These factors have various types and numbers of levels. Some are four level factors,

while other factors have only two levels. Some are numeric factors, while others are

categorical. More details of our factors and their levels are given below.

1. Sample size “n”

The sample size determines the number of observations in a training data set. It is

the first factor we decided to include in this full factorial design, to see how the size of

training data sets can impact BART CI performance. The sample size “n” is a numeric

factor with four levels: n = (40, 200, 1000, 10000). The reason for not choosing larger n

values is that the BART process would take a very long time to run.

2. Dimension “p”

In our simulated data sets, it is necessary to determine how the values of x are

generated. The dimension p gives the number of important predictors, that is, the

number of predictors that affect the response function. This excludes junk variables,

which are described below. The four levels p = (1, 2, 5, 10) were chosen for this

factor. We did not select larger p values because BART would take a very long time

to run.

The decision to study the effect of dimension on BART CIs is motivated by studies in

CGM (2010). That study compared BART with other supervised learning methods such

as gradient boosting, random forests, and neural nets. The goal was to examine which

method performed best on a simulated function of p = 5 inputs, with additional junk

variables. The dimension p = 5 in that study falls within the range 1, 2, 5, 10 that we

study in this thesis.

3. Junk variables

In the experiment, the mean function f(x) is a function of p variables. In addition to

those p variables, we may add “junk” variables which have no effect on f(x) or the

response y. The factor “junk” is a two level categorical factor with levels (“no”, “yes”).

If “junk” = “no”, then the predictor matrix X has p columns. If it is equal to “yes”, then X

has 11*p columns, and for every actual x variable, there are 10 “junk” x variables. For

example, if p=10 and junk=”yes”, then our X matrix will have 110 columns. The first 10

columns will be used in f(x) to generate the mean function, while the other 100 columns

will be junk variables. Thus, the percentage of active inputs are less than 10%. The

purpose of considering junk variables is to see if the performance of CIs is affected for

Default.BART and CV.BART. Experiments in CGM (2010) suggest that Default.BART had

worse performance than CV.BART for high-dimensional problems with many junk

variables. That experiment had p=5 fixed, and varied the number of junk variables.

That is, in all experiments CGM used a fixed function of five predictors and varied the

number of junk variables over 5, 95, and 995, so that the total number of active and

junk variables was 10, 100, or 1000. Both BART versions behaved similarly at 10

variables, but Default.BART was worse than CV.BART at the larger values 100 and

1000.

In this thesis, we emphasize that f(x) is built to be a function of only the p active

variables, but BART is given no information about which variables (if any) are junk.

4. Standard deviation “σ”

The noise level of the residual in the population model (2.1) is √Var(ε) = σ. We vary σ

over four levels, including very low values: σ = (0.01, 0.1, 0.25, 0.5).

5. Predictor correlation “r”

We decided that any nonzero correlation between our simulated predictors would be

positive, meaning an increasing relationship between any two predictors. The

correlation r is a numeric factor with two levels (0, 0.5). All predictors in the X matrix

(including junk variables, if present) are generated as

multivariate normal random variables with mean vector 0 and covariance matrix R. If

r=0, then R=𝙸. If r=0.5, then all off-diagonal elements of R are 0.5, and the diagonal of R

contains 1’s.
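This generation step can be written directly with the mvtnorm package (which the experiment code loads, as noted in Section 3.4); here ncols is a hypothetical name counting all active plus junk columns:

    library(mvtnorm)
    make.X <- function(n, ncols, r) {
      R <- matrix(r, ncols, ncols)    # equicorrelated covariance matrix
      diag(R) <- 1
      rmvnorm(n, mean = rep(0, ncols), sigma = R)
    }
    X <- make.X(n = 200, ncols = 22, r = 0.5)  # e.g. p = 2 with junk = "yes"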

6. Type of error distribution

In (2.1), the distribution of the noise term ε is specified as normal. We study the

effect of violating this assumption, by using either normal or t errors. Thus, the factor

“error distribution” is a categorical factor with two levels corresponding to Gaussian or a

t-distribution with 3 degrees of freedom. The t-distribution was chosen to have three df

because this is the smallest integer degrees of freedom for which the variance is finite.

Errors were generated as a scalar multiple of either a standard normal or a t₃ random

variable, with multiplier σ. Thus, the variance of the t errors was actually 3σ².
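In R, both error types are one-line draws; note that rt(n, df = 3) has variance 3, which is why the t errors have variance 3σ² (a sketch with our own function name):

    make.errors <- function(n, sigma, dist = c("N", "t")) {
      dist <- match.arg(dist)
      if (dist == "N") sigma * rnorm(n) else sigma * rt(n, df = 3)
    }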

7. BART method

We study two different forms of the BART model: CV.BART and Default.BART, as

described in Section 2.4. The main difference is that the default version uses default

values of all prior parameters, while the CV version chooses these parameters by

cross-validation. Thus, “BART method” is a categorical factor with two levels. Both BART

methods are implemented with the BayesTree package in R.

The seven factors are used to construct a design matrix “D” which is a full factorial

design matrix. Table 3.1 summarizes the seven factors.

We simulate five replications for each combination of the experimental design factors

(i.e., for each row of the D matrix). The design has 4 × 4 × 4 × 2 × 2 × 2 × 2 = 1024

factor combinations; with replications, D becomes a large matrix with

5 × 1024 = 5120 rows.

Factor               Description                                 Number of levels   Levels

n                    sample size                                 4                  40, 200, 10³, 10⁴
p                    number of important predictors              4                  1, 2, 5, 10
junk variables       extra variables having no effect on f(x)    2                  “yes”, “no”
σ                    noise standard deviation                    4                  0.01, 0.1, 0.25, 0.5
r                    predictor correlation                       2                  0, 0.5
error distribution   distribution type of ε                      2                  “N”, “t”
BART method          procedure to choose prior parameters        2                  “cv”, “default”

Table 3.1: Summary of the factors of the designed experiment.

3.1.2 Additional variables not treated as factors

There are several other quantities that could have been treated as experimental

factors, but have instead been set to fixed levels. In this section we identify those

quantities and discuss how their levels were chosen. These quantities are BART prior

parameters, and the number of MCMC iterations. There is a brief clarification for these

quantities below:

1. BART prior parameters

Four prior parameters that are important for BART are mentioned in Chapter 2. Each

of CV.BART and Default.BART has a particular way to select these prior parameters.

These parameters are k (the prior on μ_ij), (ν, q) (the prior on σ), and m (the number

of trees in the ensemble). These prior parameters are set as described in Section 2.3,

except for a restriction relating to the choice of the prior on σ.

In Chapter two we indicated that there are two ways to obtain 𝜎̂, a rough guess of σ

used in specifying the prior for σ. One may use the residual standard error from a linear

regression model, or a sample standard deviation sd(y). In this thesis, we calculate σ̂

as a fraction of the overall training data variation: σ̂ = sd(y)/2. The method for

specifying σ̂ can therefore be considered a fixed factor. The reason for not using the

linear model to specify σ̂ is that in some cases the training set has fewer observations

than variables (e.g., n = 40 with p = 5 or 10 and junk = 'yes' gives 55 or 110 variables

but only 40 observations).

2. MCMC iterations

a. The MCMC sample

As described in Section 2.2, the MCMC samples of parameter values of the BART

model are used to generate credible intervals for predictions. The number of MCMC

iterations affects the quality of the BART CI: increasing the number of iterations lets

the MCMC explore the posterior distribution more fully, but it also leads to very long

run times for BART. We conducted a small experiment to choose a number of MCMC

iterations that balances CI accuracy with run time; we call this value “an

adequate number of MCMC iterations”. The details of the small experiment and results

are discussed in Section 3.2. Another important issue is that the MCMC algorithm will

not converge immediately, and we have to discard some of the early MCMC iterations

that are called “burn-in” iterations. In this thesis, burn-in iterations always represent

the first 20% of an MCMC run. CGM typically used 20% burn-in. BART runs more

quickly with 1 in 10 thinning. Later in this Chapter, we will see that the eventual

decision is to run 5000 iterations, discarding the first 1000 iterations as burn-in, and

keeping samples from the last 4000 iterations.

At a specific x, each MCMC iteration produces a corresponding output f(x), so the

4000 post-burn-in iterations give 4000 samples from the posterior distribution of f(x),

from which the BART CI is formed. For this study, these samples must be stored at

each of 10,000 test points, and the saved samples (iterations 1001–5000) consume

considerable computer memory. To reduce storage, we “thin” the MCMC sample by

recording only results from every 10th MCMC iteration; that is, we keep the 1001st,

1011th, 1021st, … samples after burn-in. Consequently, 90% of the computer memory

is saved with negligible effect on the quality of the results. To summarize, 5000 MCMC

iterations are run; the first 1000 are discarded as burn-in; the last 4000 are thinned,

resulting in 400 saved MCMC samples.
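In terms of iteration indices, the kept draws are simply (a sketch):

    keep <- seq(from = 1001, by = 10, length.out = 400)  # 1001, 1011, ..., 4991
    stopifnot(max(keep) <= 5000)                         # all within the 5000 run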

b. Using shorter MCMC runs for CV.BART

The CV.BART procedure, described earlier in Section 2.4.1, is very computationally

intensive, requiring 120 runs of the BART algorithm, corresponding to 5 folds and 24

combinations of prior parameter choices. To speed up the CV.BART algorithm, we

shorten its MCMC runs to 100 burn-in iterations followed by 400 sampling iterations,

again keeping every 10th value. The best of the 24 parameter combinations in

CV.BART is selected by minimal cross-validated SSE. CGM (2010) observe that if low

predictive error is the objective, rather than accurate coverage, fewer MCMC

iterations are needed. Note that the final re-fitting of BART with all

training data and the optimal parameters uses the same MCMC settings as default

BART.

3.1.3 Response variables for the experiment

The goal of this thesis is to study the BART CI properties coverage (at levels 90%,

95%, and 99%), width (90%, 95%, and 99%), and predictive SSE, as defined in

Chapter 2. All these responses are calculated using a test data set with 10⁴

observations and error-free f(x) values, generated using the corresponding settings of

p, junk, correlation, and σ. If the results were evaluated using the f(x) values at the

training points, they would be less reliable: a model may overfit the training data,

while prediction on test data gives a more realistic assessment of predictive accuracy.

3.2 Choosing the number of MCMC iterations

Markov chain Monte Carlo (MCMC) samples are an important part of this research,

giving the posterior probability distribution for model predictions. Specifically, they

allow calculation of the quantiles of the posterior distribution of E(Y|X).

This section considers the choice of an appropriate number of MCMC iterations. A

larger number of iterations gives more accurate prediction since they cover the

posterior distribution more fully. However, a smaller number of iterations will give

results faster. The number of MCMC iterations needs to be chosen to balance accuracy

and speed.

For finding an adequate number of MCMC iterations, a small experiment was

conducted. We used the following test function: f(x) = 10 sin(πx₁x₂) + 20(x₃ − 0.5)² +

10x₄ + 5x₅ (Chipman, George and McCulloch 2010; Friedman 1991).

In this simulated data, observations were generated randomly with i.i.d. N(0,1)

residuals. Ten x variables are generated independently from a Uniform (0,1)

distribution. The first five x variables affect the response, and the rest are junk

variables. The values N = 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000 were used as

different numbers of MCMC iterations to fit the BART model. Each time the BART model

was estimated with a particular number of iterations and the prediction SSE, coverage

and width of 90% credible intervals were recorded. Two different training sample sizes

were selected to compare the behaviour of the Markov chain. These sample sizes were

n₁ = 1000 and n₂ = 10000. In all cases, a test set of 10000 observations was used.

An initial burn-in portion of the MCMC samples was discarded. In all cases, the

number of burn-in iterations was taken to be 25% of the number of the saved iterations

that we denote as “N”. For instance, if N=500 iterations were saved, then prior to that

125 burn-in iterations were discarded.

Table 3.2 gives the responses for the different training sample sizes and numbers of

MCMC iterations. All CIs were constructed at the 90% level. From

Table 3.2, it is obvious that coverage and width have a positive relation with the number

of MCMC iterations while there is a negative relationship between SSE and the number

of MCMC iterations. Figure 3.1 plots columns 2-7 of Table 3.2 as a function of the

number of MCMC iterations. The same trends identified in the table are visible in the

plots.

n=1000 n=10000

N coverage% width SSE coverage% width SSE

500 88.59 2.19 4600.23 76.69 0.744 945.38

1000 91.28 2.27 4367.53 77.53 0.743 945.15

1500 92.81 2.31 4132.30 79.29 0.760 900.11

2000 93.23 2.33 4039.64 81.87 0.787 852.72

2500 93.85 2.35 3955.37 83.88 0.808 833.77

3000 94.18 2.37 3940.56 83.98 0.813 833.60

4000 94.42 2.36 3902.00 85.48 0.831 814.12

5000 94.90 2.39 3814.29 85.31 0.836 815.50

10000 96.72 2.56 3385.59 86.23 0.839 793.16

Table 3.2: Predictive performance at various MCMC iterations (N) and sample sizes (n).

Figure 3.1: The relation between MCMC samples and different responses (width,

coverage, SSE).

The largest changes to SSE, width and coverage happen by 4000 MCMC iterations.

That suggests N=4000 as an adequate number of iterations. Choosing more MCMC

iterations gives results that are slightly more reliable but not very different from those at 4000 iterations. Thus,

4000 is a good number of iterations for this research, giving a good balance of

computing speed and accuracy.

When this experiment was repeated, there were only small changes in results for the

n=1000 training set. However, the same trends as in Table 3.2 and Figure 3.1 were

observed. For n=10000, the results did not change.

Figure 3.2 helps explain the different behaviour of the Markov chain at the two

training sample sizes. It plots the posterior samples of σ against the

MCMC iteration, with burn-in samples indicated in red and saved samples indicated in

black.

Figure 3.2: Posterior samples of σ versus MCMC iterations number, for training

samples of size 1000 and 10000. Burn-in samples are red.

The parameter σ is plotted instead of other model parameters because it summarizes

the fit of the BART model. The true value of σ is 1.0. The right plot of Figure 3.2 shows

more uniform MCMC iterations and less noise than in the plot for n=1000. This is to be

expected, since the larger training set will result in a posterior distribution with less

uncertainty. The plots indicate that the MCMC is mixing well among these iterations.

In conclusion, although a larger number of MCMC iterations gives more reliable

findings, smaller values do not give very different results. While 4000 MCMC

iterations is less than 10000, it leads to credible predictions and results similar to those at 10000.

3.3 The simulated function

To have more general results, for each of the 5120 different experimental runs in D, a

different f(x) is generated, instead of using just one population function. This function

could be called a mean function because it is equal to E(y|x), where y = f(x) + ε in the

population model. For each combination of the D factors, the function f(x) is chosen

randomly and used as the conditional mean in simulating the data set. This leads to a

specific data set for each row of D.

Friedman (2001) described an algorithm for generating this random function. The

thesis uses the implementation of that algorithm from Chipman (2011). This Friedman

function can be considered an uncontrollable factor of the experiment, since it varies

randomly from row to row of D. The generated function is given as

(3.1) f(x) = ∑_{j=1}^{20} a_j h_j = ∑_{j=1}^{20} a_j h_j(x_j; μ_j, V_j).

The coefficients a_j are scalars randomly generated as U(−1, 1). The function h_j is

an unnormalized multivariate Gaussian probability density function,

(3.2) h_j(x_j; μ_j, V_j) = exp( −(1/2) (x_j − μ_j)ᵀ V_j⁻¹ (x_j − μ_j) ).

The size “𝑝𝑗 " of each 𝑥𝑗 is taken randomly as floor [1.5 + r] where r is drawn from an

exponential distribution with mean λ=2. If r > p, then we set r = p. The vector 𝑥𝑗 has a

random subset of the p predictor variables of length 𝑝𝑗 where 0 < 𝑝𝑗 ≤ p. The

40
parameters of ℎ𝑗 are the mean vector “𝜇𝑗 ” and covariance matrix “𝑉𝑗 " which are not

constant. The vector { 𝜇𝑗 }120 is randomly generated from MVN(0, I). The

𝑝𝑗 x𝑝𝑗 matrix "𝑉𝑗 " is a random orthonormal matrix given as

(3.3) 𝑉𝐽 = 𝑈𝑗 𝑊𝑗 𝑈𝑗𝑇

where 𝑈𝑗 is a uniformly random orthonormal matrix and 𝑊𝑗 = diag {𝑤1 , … , 𝑤𝑝𝑗𝑗 }. The

𝑊𝑗 ’s elements are generated as the square of a 𝑈(𝑎 = 0.1, 𝑏 = 2. ).
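One way to generate such a V_j in R (our sketch; a uniformly random orthonormal U_j can be taken from the QR decomposition of a Gaussian matrix):

    make.V <- function(pj) {
      U <- qr.Q(qr(matrix(rnorm(pj^2), pj, pj)))  # random orthonormal U_j
      W <- diag(runif(pj, 0.1, 2)^2, nrow = pj)   # squared U(0.1, 2) entries
      U %*% W %*% t(U)                            # V_j = U_j W_j U_j^T
    }
    V <- make.V(2)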

Each h_j has its own μ_j and V_j; these two parameters control the shape of h_j,

which contributes to the shape of f(x). Figure 3.3 shows contour plots of generated

f(x) corresponding to different values of μ_j and V_j. It shows six simulated functions

with p = 2 and with 10 h_j terms instead of 20 as in (3.1). Each of the ten h_j is a

function of one or two variables chosen randomly from the two inputs in x.

Figure 3.3: Six realizations of randomly simulated f(x) in two dimensions, as described in

the text.

Each realization of f(x) has a different shape, with different peaks and surfaces. This

confirms the variety of functions produced, which makes the thesis results more general.

We made some small changes to the function generation mechanism: the number of

h_j terms is fixed at 20, and the random length p_j of x_j is bounded by p, the number

of active variables in the model specified by the design. In the Friedman algorithm and

the Chipman implementation, the number of h_j terms was random rather than fixed

at 20.

3.4 Parallel computing mechanism

In this thesis, there are computational challenges caused by the size of the

experiment, and the long execution time of BART. These challenges led us to use

parallel computing resources from ACENET.

This section describes the use of parallel computing to carry out the 5120 runs of the

experiment. All computing was carried out on ACENET parallel computers. ACENET is a

high performance computing consortium for academic research, consisting of several thousand individual computers connected together (http://www.ace-net.ca/wiki/Compute_Resources). The ACENET clusters are suitable for jobs that require large computing resources and that can be distributed among several computers: a large job is split into multiple smaller jobs, which are then executed on individual computers in the cluster. The clusters are also well suited to executing multiple runs of a process with different sets of data, for example running the same algorithm on several subsets of the given data.

This thesis has a large experiment where the final design matrix D has 5120 runs after

replication. To run such a large experiment on a personal computer, we would need to

spend months. To save time, we decided to run the thesis experiment on a cluster of

computers by dividing the entire experiment into smaller “jobs”. The technique used here is to break the 5120 rows of the D matrix into 640 independent jobs, which can be run on one or more ACENET clusters. Each job consists of 8 rows (runs) of the D matrix and is an independent task. Table 3.3 shows the first of the 640 jobs. We divide the 5120 runs among the 640 jobs so that execution time is similar from one job to another: less than 48 hours for jobs that include CV.BART, and less than 24 hours for jobs that contain the Default.BART method. In practice, each CV.BART job took approximately 16 to 29 hours to complete, while Default.BART jobs took about 4 to 10 hours.

This balancing of execution time was achieved by dividing the runs among jobs according to sample size. All 640 jobs have the same levels of sample size, in the same order, so the n column of Table 3.3 is exactly the same for every job.

row n p σ junk r error.dist method replicate

1 40 1 0.01 no 0 t cv 1

2 200 1 0.01 no 0 t cv 1

3 1000 1 0.01 no 0 t cv 1

4 10000 1 0.01 no 0 t cv 1

5 40 2 0.01 no 0 t cv 1

6 200 2 0.01 no 0 t cv 1

7 1000 2 0.01 no 0 t cv 1

8 10000 2 0.01 no 0 t cv 1

Table 3.3: The 8 rows of D in the first job.

Setting up the jobs this way was reasonable because the execution time of the BART algorithm is roughly proportional to the training set size n. In the MCMC experiment described in Section 3.2, we saw that MCMC runs take longer as n increases. Therefore, we arranged each job to contain the same combination of n levels.
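A minimal sketch of this row-to-job assignment in R, under the assumption (visible in Table 3.3) that consecutive rows of D cycle through the four n levels; the variable names are illustrative:

# Sketch: split the 5120 rows of D into 640 jobs of 8 rows each
njobs        <- 640
rows.per.job <- nrow(D) / njobs                  # 5120 / 640 = 8
job.rows     <- split(1:nrow(D), rep(1:njobs, each = rows.per.job))
# consecutive rows vary n fastest, so every job gets the same mix of n
D[job.rows[[1]], ]                               # the 8 runs of job 1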

To run these jobs on ACENET machines, we used a collection of R code for parallel computing (Chipman, 2011). It includes five main files: main.R, doit.R, setbatch.R, postprocess.R, and cleantemp.R. Each of these files has a particular task; briefly, their tasks are:

- doit.R has most of the code to execute the runs of this thesis experiment. It builds the D matrix; specifies the combinations of prior parameters, the number of replications, and the MCMC iteration parameters; loads all the R packages the program needs, such as “mvtnorm”; and splits the D matrix into small jobs, each containing 8 rows of D. It executes Default.BART or CV.BART and then saves the results in files to be read later. It sources some of the files below, such as setbatch.R.

- main.R is the first file of the group to be run. It prepares all other files before execution on the cluster and specifies each job’s maximum execution time in hours and minutes, so it can be thought of as a timer file that states how long a job may take. Executing main.R prepares separate files for the 640 different jobs to be run; for each job, an R file specifies which rows of D will be used by doit.R. Code is also generated to submit the jobs to the cluster queue for execution.

- setbatch.R is a function that sets up the parallel jobs in an ACENET queue. It assigns names to the files and parallel jobs so that they can be identified on ACENET.

- postprocess.R is a results-collection file, run after all parallel jobs have completed. It reads all results files and carries out the required analysis.

- cleantemp.R is a cleaning file, used to discard the temporary files produced by the computations.
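As a rough illustration of the per-job pattern that doit.R follows, a sketch of one job's work is given below. The helper run.one.experiment is hypothetical shorthand for fitting BART on one row's factor combination and computing the responses; the actual thesis code differs.

# Sketch: one parallel job processes its 8 assigned rows of D
job.id  <- as.integer(commandArgs(trailingOnly = TRUE)[1])    # job index
my.rows <- ((job.id - 1) * 8 + 1):(job.id * 8)                # its rows of D
for (i in my.rows) {
  result <- run.one.experiment(D[i, ])                # hypothetical helper
  saveRDS(result, sprintf("result_row%04d.rds", i))   # read by postprocess.R
}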

We made some minor modifications to this setbatch example to run our jobs on ACENET. For example, doit.R contains our main program and sources all files the program depends on, such as mainfunc.R, cvbart.R, and Friedman.R. The roles of these three programs are as follows:

- mainfunc.R is a complementary file to doit.R that contains the rest of the program settings. All the small changes to the function-generation settings, such as fixing the number of ℎ𝑗 terms at 20, were made here.

- cvbart.R includes the CV.BART method described in Section 2.4.1.

- Friedman.R defines the Chipman (2011) implementation of the Friedman function generator discussed in Section 3.3.

Chapter 4
Analysis

This thesis uses ANOVA to analyze the experimental results. ANOVA displays the main effects and interaction effects of the factors. The analysis focuses on main effects and two-way interactions and ignores three-way interactions, since including three-way interactions in the model does not lead to large increases in 𝑅². More details are given later in this chapter.

This experiment has seven responses to study: coverage90, coverage95, coverage99,

width90, width95, width99, and predictive SSE. However, this analysis concentrates on

the three responses coverage95, width95, and SSE. The definitions of the responses were given in Section 2.2. Results for the 90% and 99% levels are quite similar to those for the 95% level, as illustrated later in this chapter.

Below the factors’ main effects and interactions are explored via plots and ANOVA

tables.
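For reference, the two-way interaction ANOVAs reported below can be obtained with a call of the following form, assuming the 5120 results are collected in a data frame (here called results) with the seven factors coded as R factors; the column names are placeholders.

# Sketch: ANOVA with all main effects and two-way interactions
fit2 <- aov(coverage95 ~ (n + p + sigma + r + junk + error.dist + method)^2,
            data = results)
summary(fit2)    # effects as in Table 4.1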

4.1 Analysis of coverage

We have three possible coverage responses: coverage90, coverage95, and

coverage99. Figure 4.1 presents the main effects for all three responses. The effects

have the same relative size across different coverages, with a shift in mean level from

one response to another, and there is a high correlation between these responses (Section 4.6 contains the details). We choose to analyze only coverage95 because its results are similar to those of the 90% and 99% responses.

Figure 4.1: The main effects for coverage90, 95, and 99 respectively. The factors’ order

along the horizontal axis is n, dimension p, σ, predictor correlation r, junk variables,

error distribution, and BART method.

Figure 4.2: Main effects for coverage95.

Figure 4.2 displays the main effects for coverage95. We use it to examine the effect

sizes. The factor n has the largest effect on the coverage of BART credible intervals. The

factors σ and error distribution have the second and third largest main effects,

respectively. Then, junk predictor, dimension p, and BART method have small but

visible effects in Figure 4.2. The predictor correlation “r” appears to have little or no

impact on coverage. Overall, the mean level of coverage (about 80%, indicated by the

horizontal line in Figure 4.2) is considerably lower than the nominal 95% level.

For the most part there is an inverse relationship between the sample size n and coverage: coverage increases when n decreases, except at n = 40. Thus, for large sample sizes such as n = 10,000, the coverage of BART credible intervals is much lower than the desired 95%. The factor σ has a positive relationship with coverage, with very low coverage at σ = 0.01. Although junk predictors and error distribution have little effect on coverage, coverage is closer to the nominal 95% level when there are no junk variables and the errors are normally distributed. The difference between the cv and default versions of BART is very small.

We now give a more detailed analysis of coverage95, using an ANOVA table and interaction plots. Table 4.1 shows the ANOVA for the two-way interaction model. The rows of the ANOVA table are ordered by the mean square values (MS). F-tests and the corresponding p-values can be used to identify significant effects. The last column of the table indicates significance with stars: * for 0.01 < p ≤ 0.05, ** for 0.001 < p ≤ 0.01, and *** for p ≤ 0.001.

In the remainder of this section we examine some of the large effects. The ordering of effects for coverage95 shows that n, σ, n:σ, error distribution, p:σ, and n:method are the six most important effects for the coverage of BART credible intervals. All main effects except r are highly significant, with p-values less than 10⁻⁴, and seven two-way interactions are at least as significant.

Df SS MS F-value Pr(>F)
n 3 73.672 24.557 4168.230 <2.2e-16 ***
σ 3 21.214 7.071 1200.234 <2.2e-16 ***
n: σ 9 43.023 4.780 811.399 <2.2e-16 ***
error distribution 1 0.887 0.888 150.639 <2.2e-16 ***
p: σ 9 6.456 0.717 121.765 <2.2e-16 ***
n: method 3 1.678 0.559 94.945 <2.2e-16 ***
p: junk 3 1.264 0.421 71.529 <2.2e-16 ***
n: p 9 3.348 0.372 63.140 <2.2e-16 ***
junk 1 0.263 0.263 44.582 2.702e-11 ***
n: junk 3 0.769 0.256 43.526 <2.2e-16 ***
n: error distribution 3 0.410 0.137 23.176 6.791e-15 ***
method 1 0.091 0.091 15.390 8.861e-05 ***
p: method 3 0.171 0.057 9.655 2.369e-06 ***
p 3 0.089 0.030 5.046 0.001718 **
σ: error distribution 3 0.063 0.021 3.577 0.013338 *
σ: r 3 0.059 0.020 3.359 0.018007 *
σ: method 3 0.056 0.019 3.172 0.023263 *
σ: junk 3 0.036 0.012 2.011 0.0901220
r: method 1 0.009 0.009 1.574 0.209706
r: junk 1 0.007 0.007 1.115 0.291069
junk: method 1 0.005 0.005 0.780 0.377086
p: error distribution 3 0.012 0.004 0.682 0.562851
n: r 3 0.005 0.002 0.265 0.850610
p: r 3 0.003 0.001 0.169 0.917679
r 1 0.001 0.001 0.162 0.687229
error distribution: method 1 0.001 0.001 0.105 0.746514
junk: error distribution 1 0.000 0.000 0.044 0.833871
r: error distribution 1 0.000 0.000 0.002 0.967963
residual 5037 29.676 0.006

Table 4.1: ANOVA table for coverage95.

Earlier in this section, interpretations of the main effects for coverage95 were given.

Here we examine the three interactions among effects with the six largest MS values.

We first consider the n:σ interaction displayed as an interaction plot in Figure 4.3.

Figure 4.3: The n:σ effect on coverage95.

The n:σ interaction plot shows that when σ=0.01 and n=10000 coverage is

approximately 30%, well below the nominal 95% level. As a result, BART credible

intervals will be very inaccurate at very small values of σ and very large values of sample

size n. At other levels of σ (0.1, 0.25 and 0.5), coverage is less sensitive to the effect of

sample size n. Figure 4.3 shows coverage closest to the nominal 95% level when n has

small values (40 or 200) and σ is large (0.25 or 0.5).
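Plots like Figure 4.3 can be produced with base R, for example with interaction.plot, again assuming the placeholder results data frame sketched earlier:

# Sketch: interaction plot of mean coverage95 by n and sigma
with(results,
     interaction.plot(x.factor = n, trace.factor = sigma,
                      response = coverage95, ylab = "mean coverage95"))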

Figure 4.4 illustrates the p:σ interaction effect on coverage95. Coverage is much lower when σ = 0.01, and for other values of σ, coverage is similar. The factor p has a complicated relationship with coverage. For instance, p sometimes gives the best coverage at its lowest values, as at p = 1, 2 with σ = 0.25, 0.5. In other cases, p gives the worst coverage at p = 1 and 2, when σ equals 0.01.

Figure 4.4: The p:σ effect for coverage95.

Figure 4.5 shows the n:method interaction effect. This interaction appears smaller than the other two, since the two lines are reasonably close to each other. The coverage of Default.BART is more sensitive to n than the coverage of CV.BART. Both BART methods give their best coverage at the lowest levels of sample size and their worst coverage at the largest. At n = 40 and n = 10000, CV.BART has coverage closer to 95% than does Default.BART, and both methods give good coverage at n = 40 and 200.

Figure 4.5: The n: method effect for coverage95.

4.2 Analysis of width

Here we begin by comparing main effect plots of credible interval width at the three different nominal levels (Figure 4.6). Figure 4.6 suggests the factors’ effects have the same relative sizes across the widths corresponding to the 90%, 95% and 99% levels, apart from a shift in mean level from one response to another. Thus, we analyze only width95 (Figure 4.7).

Figure 4.6: The main effects for width90, 95, and 99. The factors order is n, p, σ,

predictor correlation r, junk variables, error distribution, and method.

Figure 4.7 shows the sizes of the different effects. The factor n has the largest impact on credible interval widths. The factors σ and junk variables have the second and third largest effects, while dimension p and error distribution have the fourth and fifth largest. The correlation “r” shows some impact on width, with slightly narrower intervals when predictors are correlated. The BART method appears to have little or no impact on width. Ordering some factor effects by size, such as junk and p, can be difficult because of differing numbers of factor levels; a more precise ordering is given later using ANOVA.

Figure 4.7: The main effects for width95.

The labelled effects in Figure 4.7 can be used to identify the relationship between each factor and interval width. There is a negative relationship between sample size n and width: reasonably, interval width decreases as n increases. The factors σ and p have a positive relationship with width, which increases as they increase. The categorical variables junk and error distribution give narrower intervals when there are no junk variables or when the response’s error has a t distribution.

The width95 ANOVA is illustrated in Table 4.2, with rows ordered according to the

size of MS.

The p-values in Table 4.2 indicate that all main effects except method are significant, and that r has a smaller effect than the other five main effects. The first four effects are considerably larger than the others based on their mean squares. The sample size n is the most significant factor: it has the largest MS, at least three times larger than that of any other factor. In addition, the main effects have the largest impact on width, holding the five largest MS values. Although the factor junk has a larger MS value than p, the ordering is reversed when considering SS. That is, the four levels of p account for more overall variation than the two levels of junk, but when standardized by degrees of freedom, junk has the larger mean effect.

Df SS MS F-value Pr(>F)
n 3 1369.23 456.41 6369.026 <2.2e-16 ***
σ 3 450.36 150.12 2094.890 <2.2e-16 ***
junk 1 114.38 114.38 1596.080 <2.2e-16 ***
p 3 219.22 73.07 1019.734 <2.2e-16 ***
error distribution 1 26.94 26.94 375.989 <2.2e-16 ***
n: junk 3 74.10 24.70 344.673 <2.2e-16 ***
n: method 3 40.79 13.60 189.729 <2.2e-16 ***
n: p 9 69.46 7.72 107.697 <2.2e-16 ***
σ: error distribution 3 18.94 6.31 88.122 <2.2e-16 ***
n: σ 9 44.51 4.95 69.010 <2.2e-16 ***
σ: junk 3 6.25 2.08 29.065 <2.2e-16 ***
r 1 1.92 1.92 26.820 2.320e-07 ***
junk: method 1 1.39 1.39 19.344 1.114e-05 ***
p: method 3 4.11 1.37 19.120 2.518e-12 ***
n: error distribution 3 3.32 1.11 15.442 5.350e-10 ***
r: junk 1 0.87 0.87 12.208 0.0004799 ***
p: σ 9 7.40 0.82 11.474 <2.2e-16 ***
junk: error distribution 1 0.75 0.75 10.497 0.0012035 **
error distribution: method 1 0.43 0.43 5.957 0.0146920 *
n: r 3 1.21 0.40 5.607 0.0007794 ***
p: r 3 0.79 0.26 3.694 0.0113546 *
σ: method 3 0.59 0.20 2.726 0.0426079 *
p: junk 3 0.55 0.18 2.567 0.0527259
r: method 1 0.16 0.16 2.174 0.1404399
r: error distribution 1 0.05 0.05 0.631 0.4268844
σ: r 3 0.12 0.04 0.541 0.6544282
p: error distribution 3 0.11 0.04 0.533 0.6596303
method 1 0.01 0.01 0.071 0.7897343
residual 5037 360.96 0.07

Table 4.2: The ANOVA table for width95.

We now focus on the interaction n:junk, which is the sixth largest effect. Figure 4.8 shows the n:junk interaction. When n increases, width decreases, confirming the inverse relationship between n and width noted previously. The n:junk interaction corresponds to the lines not being parallel: when there are junk variables, the effect of sample size on width is larger.

Figure 4.8: n: junk effect for width95.

4.3 Analysis of SSE

Figure 4.9 displays the main effects for the predictive SSE. All factors have a visible

effect on the sum of squared errors (SSE).

Figure 4.9: The main effects for SSE.

The factor n has the largest effect on SSE. The factors junk, σ, and p are the next largest main effects. As with the coverage95 response, the ranking of effect sizes is complicated by the varying number of levels across factors. The remaining main effects are smaller but still visible in Figure 4.9. SSE decreases as n increases, and as p and σ decrease. SSE is smaller with no junk variables, and slightly smaller with t errors, cross validation, and correlated predictors.

The ANOVA for the response “predictive SSE” is shown in Table 4.3. As in the earlier ANOVA tables, rows are ordered by the MS column.

Df SS MS F-value Pr(>F)
n 3 8260741213 2753580404 2529.662 <2.2e-16 ***
junk 1 1149670834 1149670834 1056.181 <2.2e-16 ***
σ 3 2876836357 958945452 880.965 <2.2e-16 ***
n: junk 3 1654503795 551501265 506.654 <2.2e-16 ***
p 3 1595725522 531908507 488.654 <2.2e-16 ***
n: p 9 1279609758 142178862 130.617 <2.2e-16 ***
n: σ 9 1019100689 113233410 104.0254 <2.2e-16 ***
error distribution 1 74723433 74723433 68.647 <2.2e-16 ***
p: junk 3 168766357 56255452 51.681 <2.2e-16 ***
method 1 52389336 52389336 48.129 4.495e-12 ***
junk: method 1 47800181 47800181 43.913 3.792e-11 ***
σ: error distribution 3 94367583 31455861 28.898 <2.2e-16 ***
p: σ 9 248525987 27613999 25.369 <2.2e-16 ***
n: method 3 77281290 25760430 23.666 3.322e-15 ***
σ: junk 3 62331280 20777093 19.088 2.640e-12 ***
r 1 12565677 12565677 11.544 0.0006850 ***
r: junk 1 10704712 10704712 9.834 0.0017228 **
n:r 3 19421017 6473672 5.947 0.0004809 ***
n: error distribution 3 13244277 4414759 4.056 0.0068745 **
p: r 3 10530418 3510139 3.2247 0.0216316 *
error distribution: method 1 1873202 1873202 1.721 0.1896402
σ: r 3 3744049 1248016 1.1465 0.3287928
p: method 3 3168696 1056232 0.9703 0.4056361
p: error distribution 3 1063607 354536 0.326 0.8067886
r: method 1 297215 297215 0.273 0.6013186
σ: method 3 566540 188847 0.174 0.9143665
r: error distribution 1 35733 35733 0.0328 0.8562321
junk: error distribution 1 2911 2911 0.003 0.9587577
residuals 5037 5482860544 1088517

Table 4.3: ANOVA table for SSE.

Overwhelmingly, n has the most significant effect: its MS is more than twice the next largest. After n, the effects junk, σ, n:junk, p, and n:p have the largest impact on SSE, in that order.

Now we examine interactions among the largest six effects. First, consider the n:

junk interaction displayed in Figure 4.10. At both levels of “junk”, there is an inverse

relationship between n and SSE. When there are junk variables, SSE is larger at small

values of n, and about the same as without junk variables, for large n.

Figure 4.10: The n: junk effect on SSE.

In Figure 4.11, the interaction between n and p is evident from the non-parallel lines. At all levels of n, there is a positive relationship between p and SSE: SSE is smaller when p is smaller. The decreasing relationship between SSE and n changes with the dimension p. As p increases, SSE values for small n become larger, resulting in a stronger relationship between SSE and sample size.

Figure 4.11: The n: p effect for predictive SSE.

4.4 Similarities and differences between responses

In this section similarities and differences in the separate analysis of the responses

“coverage, “width”, and “SSE” are summarized in the points below.

1- The effect of junk is slightly different for coverage99 than for coverage90 and coverage95: it has no effect on coverage99, while it has a small effect on both coverage90 and coverage95.

2- The effect sizes for each factor are approximately the same across the three “coverage” responses, and likewise across the 90, 95, and 99% “width” responses. For both coverage and width, only the overall mean level shifts across the three responses.

3- The factor p has very low impact on coverage while it has a large influence on

width and SSE responses.

4- At very low values of σ, such as 0.01, BART models have coverage far below the

nominal level.

5- Most of the factors have significant main effects for coverage except the

correlation “r”.

6- Most of the factors have significant main effects for width except BART method.

7- For SSE, all the seven factors have an impact.

8- The main effects for width have the same pattern as the main effects for SSE.

9- The main effects for coverage have a similar pattern as for SSE, with two

exceptions: junk and method. Their effects on coverage are the opposite from

their effects on SSE.

10- From points 8 and 9, we can see that although the three responses are related, the relationship between width and SSE is stronger than the relationship between coverage and SSE.

BART gives its worst coverage when n is very large and σ is small. This might occur because the MCMC iterations do not mix well enough to explore the whole posterior distribution. The experiment in Chapter 3 for choosing an adequate number of MCMC iterations suggests the same thing: coverage is better when BART is used on a smaller training data set.

Generally, the BART method has a small effect on the coverage and SSE responses, and almost no effect on width. Therefore, we recommend using Default.BART instead of CV.BART, which is much slower than the default version: CV.BART fits 120 BART models, corresponding to 24 combinations of prior parameters × 5 folds, and keeps the best cross-validation choice. In addition, we suggest increasing the number of MCMC iterations by a factor of two or more to increase the accuracy of Default.BART. Even though that will cause Default.BART to run more slowly, it is still much faster than using cross validation.
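As a sketch of this recommendation, assuming the BayesTree implementation of CGM's BART with its usual ndpost and nskip arguments, doubling the MCMC budget and extracting 95% credible intervals might look as follows; x.train, y.train and x.test are placeholders.

# Sketch: Default.BART with a doubled MCMC budget
library(BayesTree)
fit <- bart(x.train, y.train, x.test,
            ndpost = 2000, nskip = 200)    # twice the package defaults
# pointwise 95% credible intervals for f(x) at the test inputs
ci <- apply(fit$yhat.test, 2, quantile, probs = c(0.025, 0.975))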

4.5 Comparison between the two, three, and seven-way interaction models

The linear model for assessing the main predictive effect of seven independent

variables n, σ, p, r, junk, error distribution and method is

(4.1)  $y = \mu + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_{13} x_{13} + \varepsilon$

In (4.1), each four-level factor (n, p and σ) has three predictors 𝑥𝑖 and three parameters

𝑏𝑖 . For example, we can code 𝑥1 =1 if n=200 and 0 otherwise, 𝑥2 =1 if n=1000 and 0

otherwise, 𝑥3 =1 if n=10000 and 0 otherwise. Each one of the two-level factors such as

junk and method has only one predictor and its parameter. This gives a total of 3 + 3 + 3

+ 1 + 1 + 1 + 1 = 13 terms corresponding to main effects.
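The dummy coding just described is what R's model.matrix produces under the default treatment contrasts; a small sketch:

# Sketch: a four-level factor expands to three 0/1 predictors
n <- factor(c(40, 200, 1000, 10000))
model.matrix(~ n)   # intercept plus x1 = I(n = 200), x2 = I(n = 1000), x3 = I(n = 10000)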

We consider studying various interaction effects. It is easy to define the 2-way

interaction as a product of two main effect terms appearing in (4.1). Therefore, the

regression formula for a two-way interaction model contains all main effect terms in

(4.1) plus 69 additional terms. Each term consists of a product between two main

effects and a related parameter. Similarly, a three-way interaction effect can be defined

as a product of three main effect terms. The three-way regression model consists of the two-way regression model plus 193 extra terms. The 69 two-factor terms and the 193 three-factor terms correspond to 21 and 35 combinations of factors respectively, for a total of 56 interactions.
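These term counts follow from multiplying the degrees of freedom of the factors involved; the short R check below reproduces the 69 and 193 quoted above.

# Sketch: interaction degrees of freedom from the factor df
df <- c(n = 3, p = 3, sigma = 3, r = 1, junk = 1, error.dist = 1, method = 1)
sum(combn(df, 2, prod))   # 69 two-way interaction terms over 21 factor pairs
sum(combn(df, 3, prod))   # 193 three-way interaction terms over 35 triples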

We also consider a full model, in which all possible interaction effects are estimated.

This corresponds to a seven-way interaction model, with 1023 degrees of freedom for

effects. This full model is not considered for interpretation, but only to illustrate how

much variation is explained by the second and third order models. The seven-way linear model consists of the three-way linear model plus 748 terms and their corresponding parameters. These terms correspond to 35 + 21 + 7 + 1 = 64 combinations of factors, relating to the four-, five-, six-, and seven-way interactions respectively.

To decide whether to consider a model with just two-way interactions, both two-way

and three-way interactions, or all two, three, …, seven-way interactions, F tests were

conducted. These tests compare whether the full model with all interactions up to order

7 could be simplified to:

 a third order model (main effects, 2 way interactions and 3 way interactions)

 a second order model (main effects and 2 way interactions)

 a first order model (main effects)

For the test comparing the full model and third order model, the responses coverage95

and SSE had p-values of approximately 0.0005, while the response width95 had a p-

value of 0.82. All other comparisons between the full model and first or second order models had p-values of essentially 0. The tests indicate that for all responses, interactions up to the third order were significant, and for two of the three responses, higher order interactions were also significant.
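These comparisons can be carried out with nested-model F tests in R, for example (using the placeholder results data frame as before):

# Sketch: nested F tests of first-, second-, third- and seventh-order models
fit1 <- lm(coverage95 ~ n + p + sigma + r + junk + error.dist + method,
           data = results)
fit2 <- update(fit1, . ~ .^2)     # add two-way interactions
fit3 <- update(fit1, . ~ .^3)     # add up to three-way interactions
fit7 <- update(fit1, . ~ .^7)     # full model, all interactions
anova(fit1, fit2, fit3, fit7)     # sequential F tests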

Although third order and some higher order terms are significant, the largest effects

are main effects and two-way interactions. This is indicated by Table 4.4, which gives 𝑅 2

values for first, second, third and seventh order models. There is a large increase in 𝑅 2

when moving from a main effect model to a second order model. Although 𝑅 2 values

continue to increase for higher order models, the increases are smaller and correspond

to a large number of degrees of freedom. For instance, there are 193 additional degrees

of freedom associated with the three-way interactions. Considering the relatively

smaller increases in 𝑅 2 , analysis in this thesis focused on main effects and two-way

interactions, which represented the largest amount of variation in the responses.

Response main effects 2fi 3fi 7fi

coverage95 0.525 0.838 0.886 0.907

width95 0.774 0.872 0.895 0.910

SSE 0.579 0.774 0.797 0.833

Table 4.4: The 𝑅 2 of interaction models.

4.6 The correlation between the responses

In this section, we examine the correlation between our seven responses. This indicates whether it is necessary to analyze each response individually: if the correlation between responses of a certain type is high, it is enough to present the analysis for one response of that type. Table 4.5 shows the correlations between the seven responses.

response coverage90 coverage95 coverage99 width90 width95 width99 SSE

coverage90 1 0.985 0.919 0.379 0.380 0.382 0.032

coverage95 0.985 1 0.970 0.389 0.389 0.391 0.069

coverage99 0.919 0.970 1 0.373 0.373 0.374 0.114

width90 0.379 0.389 0.373 1 0.999 0.999 0.799

width95 0.380 0.389 0.373 0.999 1 0.999 0.798

width99 0.382 0.391 0.374 0.999 0.999 1 0.797

SSE 0.032 0.069 0.114 0.799 0.798 0.797 1

Table 4.5: The correlation between responses.

The correlations among the coverage responses are very strong, and coverage95 has the largest correlation with the other two coverages; choosing coverage95 for analysis was therefore reasonable. The correlation among all three widths is 0.999, so it was enough to present the analysis for only one width, and choosing the “middle” width, i.e. width95, seems reasonable. There is a moderate correlation between coverage and width, which does not exceed 0.4. SSE and the coverage variables have almost no correlation, while width and SSE have a strong correlation of approximately 0.8.
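Table 4.5 can be computed directly from the response columns, assuming they sit together in the placeholder results data frame used throughout:

# Sketch: correlation matrix of the seven responses
round(cor(results[, c("coverage90", "coverage95", "coverage99",
                      "width90", "width95", "width99", "SSE")]), 3)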

Chapter 5
Conclusion and future work

Supervised learning is a fundamental statistical problem that uses training data to make inferences about an unknown population function that predicts an output from predictor variables. In this thesis, we study the supervised learning model BART, which models that population function to give a numeric value of the response. It is a development of a Bayesian “sum of trees” model in which each tree is constrained to have a small effect by a regularization prior. This ensemble model uses MCMC to generate samples from the posterior distribution, enabling the calculation of credible intervals.

This thesis focuses on examining the accuracy of BART credible intervals across various factors such as the sample size “n”, the noise standard deviation “σ”, and the error distribution. To conduct the simulation study, we selected a full factorial design, which varies the levels of all factors systematically and is easy to analyze. ANOVA was used to analyze the seven responses studied here: three coverages (90, 95, 99%), three widths (90, 95, 99%), and the predictive SSE.

The analysis in Chapter 4 shows that the credible intervals from CV.BART and Default.BART are similar over all factor combinations. The effect of the type of BART method on the coverage and SSE responses is very small, and there is no effect on width. Thus, we suggest using Default.BART, since it is much faster than CV.BART: recall that for 5-fold CV with 24 combinations of prior parameters, CV.BART requires 120 runs of BART.

A very important conclusion is that the coverage of BART credible intervals was significantly affected by the noise standard deviation σ and the sample size n. Coverage was poor at very low values of σ and very high values of n. This may be because of poor mixing of the MCMC iterations at large values of n: they do not explore the entire posterior distribution well. The experiment in Chapter 3 for selecting an adequate number of MCMC iterations supports this explanation; that small experiment shows that coverage gets better when BART is used on a smaller training data set.

We recommend increasing the number of MCMC iterations, which could improve the coverage of BART credible intervals. Although this will make Default.BART take more time, it will still be less computationally intensive than CV.BART.

This research suggests multiple directions for future work. We could further examine

the effects of various different factors on BART CI’s. The first suggestion is studying the

influence of a larger number of MCMC iterations such as 4000, 10000, and 20000

iterations. In Chapter 4 we saw that coverage was not always good with MCMC

iterations = 4000. Increasing the MCMC iterations to larger values such as 10000 and 20000 should give better coverage and might also improve CI width and SSE. A second factor that could be examined is the prior parameters. The idea of using the prior parameters as a factor of the designed experiment would be to see if there is a better default than the one CGM give. Each row of the design matrix would have a particular combination of prior parameters, and by analyzing the results we could determine the best combination.

It would also be interesting to conduct a similar designed experiment using a completely different model from BART to estimate credible intervals. Treed Gaussian processes (Gramacy & Lee, 2008), Bayesian generalized additive models (Hastie & Tibshirani, 2000) and random forests (Breiman, 2001) are all examples of statistical learning models that give credible intervals. We could then gauge the accuracy of these models’ intervals across all the factors of the designed experiment. There are also newer versions of random forests and of the BART model. Wager, Hastie, & Efron (2014) built a random forests model based on the jackknife and infinitesimal jackknife, which uses variance estimates for bagging proposed by Efron (1992, 2014). Pratola (2013) created a version of BART with better MCMC mixing than CGM; it should show better interval coverage because its MCMC algorithm mixes better than the implementation of BART used in this thesis.

Another direction for future work is studying the influence of the seven factors on BART credible intervals, but with different factor levels, for instance larger values of the sample size and predictor dimension. This would consume more time, but it could give a more general conclusion. It might also be worthwhile to consider an error distribution that differs more from the Normal distribution, for example a t distribution with very low degrees of freedom, such as 1, or a skewed distribution such as the gamma.

A fractional factorial design is another possibility for future work, instead of a full factorial design. In a full factorial design, all possible combinations of the factor levels are run; a fractional factorial design is a fraction of the full factorial design, reducing its size. Fractional factorials of 2-level designs are well studied. Fractional factorials for factors with 3 or more levels can be more challenging to construct, although optimal design could be used to construct them. Fractional factorial designs will have some aliasing of estimated effects. Since this thesis is concerned with estimating two-way interaction effects, we would need a fractional factorial design with no aliasing or confounding between the main effects and two-factor interactions.

In the experiments described in Chapter 4, we noticed that some jobs with the same combination of factors varied in execution time. Usually this variation is between 1 and 1.5 hours, though it is around three hours in some extreme cases. For instance, the 4th and 132nd parallel jobs have the same combination of factors, but they finished in 19 and 22 hours respectively. Therefore, it would probably be worthwhile to add a job’s execution time as a new response. The effect of the factors on execution time, and also the random variation in execution time, might give insight into the use of BART.

Bibliography

[1] Anand, K. (2015) An Expected Improvement Criterion for the Global Optimization of a Noisy Computer Simulator. Unpublished M.Sc. thesis, Acadia University.

[2] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5–32.

[3] Chipman, H. (2011) “Friedman Function Machine”, http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:friedman.function.machine, retrieved October 25, 2014.

[4] Chipman, H., George, E. & McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4, 266-298.

[5] Chipman, H. (2011) “setbatch”, http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:setbatch, retrieved October 25, 2014.

[6] Efron, B. (1992) Jackknife-After-Bootstrap Standard Errors and Influence Functions.

Journal of the Royal Statistical Society, Series B, 54, 83–127.

[7] Efron, B. (2014) Estimation and Accuracy After Model Selection. Journal of the

American Statistical Association, 109, 991-1007.

[8] Friedman, J.H. (2001) Greedy Function Approximation: A Gradient Boosting Machine.

Annals of Statistics, 29, 1189-1232.

[9] Friedman, J.H. (1991) Multivariate Adaptive Regression Splines. The Annals of

Statistics, 19, 1–67.

[10] Gramacy, R. B., & Lee, H. K. H., (2008). Bayesian Treed Gaussian Process Models

With an Application to Computer Modeling. Journal of the American Statistical

Association, 103, 1119-1130.

[11] Hastie, T. & Tibshirani, R., (2000) Bayesian Backfitting, Statistical Science. 15, 196–

223.

[12] James, G., Witten, D., Hastie, T. & Tibshirani, R., (2013) An Introduction to Statistical

Learning with Applications in R, Springer, New York.

[13] Montgomery, D. C. (2013) Design and Analysis of Experiments. (8th ed). USA: John

Wiley & Sons, Inc.

[14] Oehlert, G. W. (2010) A First Course in Design and Analysis of Experiments.

Retrieved from http://users.stat.umn.edu/~gary/book/fcdae.pdf

[15] Pratola, M. T. (2013) Efficient Metropolis-Hastings Proposal Mechanisms for Bayesian Regression Tree Models, arXiv e-print 1312.1395.

[16] Wager, S., Hastie, T. & Efron, B., (2014) Confidence Intervals for Random Forests:

The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research. 15,

1625−1651.

