by
Fatimah AL Ahmad
Thesis
submitted in partial fulfillment of the requirements for
the Degree of Master of Science (Mathematics and Statistics)
Acadia University
Winter Convocation 2016
This thesis is accepted in its present form by the Division of Research and Graduate
Studies as satisfying the thesis requirements for the degree Master of Science
(Mathematics and Statistics).
I, Fatimah AL Ahmad, grant permission to the University Librarian at Acadia University to
reproduce, loan or distribute copies of my thesis in microform, paper or electronic
formats on a non-profit basis. I, however, retain the copyright in my thesis.
Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 BART method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Choosing the number of MCMC iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Comparison between the two, three, and seven-way interaction models . . . . . . . . 65
List of Figures
2.3 BART uncertainty prediction. Training data are plotted as individual points . . . . . . 12
3.1 The relation between MCMC samples and different responses (width, coverage, SSE) . . . 37
3.2 Posterior samples of σ versus MCMC iteration number, for training samples of size
3.3 Six realizations of randomly simulated f(x) in two dimensions, as described in the
text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 The main effects for coverage90, 95, and 99 respectively. The factors’ order along
4.6 The main effects for width90, 95, and 99. The factors’ order is n, p, σ, predictor
List of Tables
2.1 Mean of BART CI coverage, width and SSE over 500 test points, for 10 replicates . . . 13
3.2 Predictive performance at various MCMC iterations (N) and sample sizes (n) . . . . 36
Abstract
A central task in statistical learning is building a predictive model for a numeric response. Many supervised learning models such as
Bayesian Additive Regression Trees (BART) try to flexibly model the data. This Bayesian
“sum of trees” model uses MCMC backfitting to simulate posterior samples. BART also
This thesis studies the accuracy of BART credible intervals and analyzes various
factors’ effects on it. These factors include the sample size, dimension, noise standard
deviation, predictors’ correlations, junk variables, type of error distribution, and BART method. A designed experiment systematically varies the factors to find their effects. Analysis of experimental results
Acknowledgements
I would like to thank the Ministry of Higher Education in the Kingdom of Saudi Arabia
for financial support that enabled me to complete my studies. A special thanks to Dr.
Chipman who gave me the opportunity to work under his supervision. I am grateful for
all his effort, support, encouragement, and suggestions which helped me to finish this
thesis experiment and writing. I would also like to thank the faculty, staff and students of Acadia University, specifically those who work and study in the Department of Mathematics and Statistics. I appreciate all your assistance in different aspects of my life. Finally, thank you to
Canada, which has welcomed us among its international students and families, and thank you to all the kind people we have met here, who made us feel at home.
Chapter 1
Introduction
Supervised learning builds a predictive model for a response, for example a regression model. Training data, consisting of (x, Y) pairs, are used to “learn” or
estimate the unknown function f. Supervised learning models have various kinds of
structure. For instance, they can be parametric, like a linear regression model, or
nonparametric, like a decision tree or a Random Forest. This thesis’ supervised model is BART.
CGM (2010) develop BART as a Bayesian “sum of trees” model. A large number
(typically 50-200) of decision trees are estimated in such a way that their sum is an
accurate prediction of the response. It is an ensemble method, like bagging and random forests, where a large number of decision trees are combined into a prediction model. It uses a Bayesian MCMC algorithm that generates simulated samples from the posterior distribution of the model parameters.
The Bayesian specification of BART provides posterior distributions that can be used to quantify prediction uncertainty.
MCMC samples enable the construction of credible intervals (CIs) for the prediction of
response Y at input x. The objective of this thesis is to examine the accuracy of BART’s uncertainty prediction under the effect of various factors. These factors are mostly related to properties of the training data, and include the sample size n, dimension p, noise standard deviation σ, junk variables, predictor correlation, type of error distribution, and BART method. These factors have different numbers of levels (2 or 4).
To study the accuracy of BART CIs under all these factors, a simulation study is conducted, where it is possible to compute the accuracy measures with a known response function. A designed experiment in the seven factors is chosen to carry out the study. A designed experiment enables the analysis of the factors’ influences on the various responses. To keep the design and analysis as simple as possible, a full factorial design is used.
For the analysis, ANOVA tables are used to draw conclusions about BART CI accuracy and how much the factors impact the responses’ performance. We study main effects and two-way interactions, which explain most of the variation in the responses. We use three kinds of responses that measure performance of BART CIs: coverage, width, and the SSE of prediction. The total number of responses is seven, corresponding to coverage at three levels (90, 95, 99%), width at three levels (90, 95, 99%), and the predictive SSE.
The remainder of this thesis is organized as follows: Chapter two reviews the
background of the BART model which is utilized here to fit and model the population
function f. Chapter three describes the simulation study and the details of setting up
the experiments, such as the factors and their levels, the generated function f, and the response variables. Chapter four presents the analysis for three of the seven responses: coverage95, width95, and the predictive SSE. These give a good representation of the analysis of all responses, since the results for levels 90, 95, and 99% are similar to each other. In Chapter five, we conclude.
Chapter 2
BART method
This chapter has four sections. The first and second sections give background on
BART models, BART formulas, and BART credible intervals. The third section discusses
the choice of BART prior parameters. Section four describes two different versions of the BART model.
Here is a brief explanation of the BART model. The population model is given as
(2.1) y = f(x) + ε,    ε ~ N(0, σ²),
where f(x) = E(y|x) is an unknown mean function. The BART model estimates the population model (2.1), where the response at an input x is the conditional mean plus noise,
(2.2) y = E(y|x) + ε.
BART represents the conditional mean as a sum of m trees,
(2.3) E(y|x) = g(x; 𝑇1 , 𝑀1 ) + g(x; 𝑇2 , 𝑀2 ) + ⋯ + g(x; 𝑇𝑚 , 𝑀𝑚 ).
In (2.3) the constant m determines the number of decision trees used in the sum of trees. Each g represents the output from a different tree. That is, 𝑇1 , 𝑇2 , … , 𝑇𝑚 are different trees, and associated with tree 𝑇𝑖 is a vector of terminal node parameters 𝑀𝑖 .
Suppose there are 𝑏𝑖 terminal nodes in 𝑇𝑖 . Then 𝑀𝑖 = (𝜇𝑖1 , 𝜇𝑖2 , … , 𝜇𝑖𝑏𝑖 ). The function
“g” is a generic function that takes a predictor vector x, a tree T and a set of terminal
node parameters 𝑀, and generates the output of the tree for input x. Thus, a tree 𝑇𝑖 is defined by binary splits on the x variables; following these decision rules for a given input x leads to a terminal node and its output.
Anand (2015) illustrated how a single tree T produces an output. We summarize this
example here. Figure 2.1 shows a tree model with three terminal nodes. The
parameter T represents the tree structure and the two decision rules 𝑥5 < 1 and 𝑥2 < 4.
The parameter M = (𝜇1 , 𝜇2 , 𝜇3 ) = (-2, 5, 7) represents the outputs from the three
terminal nodes. The input x = (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , 𝑥5 ) = (1.1, 5.4, 0.1, 2.3, 0.5) would lead to
prediction 𝜇2 = 5 since we branch left on 𝑥5 = 0.5 < 1 and then right on 𝑥2 = 5.4 ≥ 4.
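The traversal logic of g can be sketched in a few lines of Python; the dict-based tree encoding here is our own illustrative choice, not BART's internal representation:

```python
def g(x, tree, leaves):
    """Follow the tree's decision rules for input x; return the leaf mean."""
    node = tree
    while isinstance(node, dict):          # internal node: apply its split rule
        node = node["left"] if x[node["var"]] < node["cut"] else node["right"]
    return leaves[node]                    # terminal node index -> mu value

# Figure 2.1: split on x5 < 1; on the left branch, split on x2 < 4; M = (-2, 5, 7).
T = {"var": 4, "cut": 1.0,                 # x5 (0-based index 4) < 1 ?
     "left": {"var": 1, "cut": 4.0,        # x2 (index 1) < 4 ?
              "left": 0,                   # mu_1 = -2
              "right": 1},                 # mu_2 = 5
     "right": 2}                           # mu_3 = 7
M = [-2, 5, 7]

x = [1.1, 5.4, 0.1, 2.3, 0.5]
print(g(x, T, M))   # prints 5: branch left on x5 = 0.5 < 1, then right on x2 = 5.4 >= 4
```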
The parameters of BART (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ..., 𝑀𝑚 , 𝜎) are unknown and must be
estimated from training data. Because these parameters are estimated, there is uncertainty in any function of the parameters, such as f(x), the predicted mean response at a particular input x. CGM use Bayesian methods to estimate the parameters and quantify uncertainty. This will be described next.
Figure 2.1 : A simple realization of g(x; T; M) in the BART model.
All statistical learning models share the aim of obtaining a good prediction of the response y at any new point x. To estimate the BART model from training data, Bayesian methods are used. In a Bayesian analysis, we must specify a prior distribution 𝑃(𝜃) for the parameter θ. Inference combines the likelihood 𝑃(у|𝜃) with the prior distribution 𝑃(𝜃) via Bayes’ theorem,
𝑃(𝜃|у) = 𝑃(у|𝜃)𝑃(𝜃) / 𝑃(у),
where 𝑃(у) is the marginal probability function of y.
For BART, the parameter vector 𝜃 is the BART parameters (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ...,
𝑀𝑚 , and 𝜎). Prior probability distributions for the parameters will be discussed later in Section 2.3.
Markov chain Monte Carlo (MCMC) is the computational technique used to calculate
posterior distributions for the parameters and quantify uncertainty. To show the
mechanism of MCMC, let us suppose that the objective is to find the posterior mean of parameter 𝜃1 , where 𝜃1 is the first element of the θ vector, and let the total number of parameters be z. The posterior mean of 𝜃1 is then a z-dimensional integral of 𝜃1 against 𝑃(𝜃|у). For large z, this integral will be complex to evaluate. Instead, we construct a Markov chain that takes N samples from the posterior distribution 𝑃(𝜃|у); the posterior mean of 𝜃1 is then approximated by the average of the N sampled values.
With the BART model we are not interested in posterior distributions for individual
trees (𝑇𝑗 's) or terminal node parameters (𝑀𝑗 's). We are more interested in a posterior distribution for the prediction f(x). Analytically, obtaining a posterior distribution for predictions is not possible because this would
involve integration of the posterior distribution over all these parameters. However, it
is straightforward to compute the posterior for f(x) using the MCMC samples. We
compute f(x) for each MCMC sample, and these sampled f(x) values then correspond to
samples from the posterior for f(x). Credible intervals (CIs) can also be obtained from
MCMC samples of the posterior. For instance, for a parameter 𝜃1 , a 95% CI could be
computed as the 2.5% and 97.5% quantiles of the MCMC samples of 𝜃1 . The
computational details of MCMC, such as the number of MCMC iterations, and the
amount of burn-in and thinning for the chain, will be discussed later in Chapter 3.
The sample size of MCMC has to be adequate to give a good estimation for BART CIs.
The CI should be more accurate when there is a larger posterior sample. For that, we
conducted a trial in Chapter 3 to see if BART CIs give better estimation with 10,000 MCMC iterations. An MCMC run can be divided into two parts: burn-in iterations, followed by sampling iterations that may be thinned. MCMC samples cannot give the stationary
posterior distribution at its beginning iterations because the Markov chain will depend
strongly on the starting values. During the burn-in period of MCMC, samples are discarded.
Once the MCMC has converged, sampled values may still be autocorrelated.
Thinning, which is the discarding of some MCMC samples, can reduce dependence and
use less memory to store sampled values. In our implementation, a fifth of MCMC
draws are discarded as a burn-in while the rest are thinned. Suppose that the MCMC
iterations are labelled 1, 2, ..., i, ..., 10,000, and the corresponding sampled parameters
are 𝜃1 , 𝜃2 , ..., 𝜃𝑖 , ..., 𝜃10000 . MCMC iteration i samples 𝜃𝑖 using a Markov chain, so 𝜃𝑖 is correlated with 𝜃𝑖−1 . When this correlation is high, it is not necessary to save 𝜃𝑖 for all values of i: if we save 𝜃𝑖 , then 𝜃𝑖+1 will have a value quite similar to 𝜃𝑖 . The correlation between MCMC samples decreases as there are more iterations
between them. The process of keeping every k-th sample is called "thinning" and it is a
way of reducing computer storage, while keeping most of the information contained in
the MCMC samples. In our use of BART, we keep every tenth MCMC sample after burn-in.
BART generates the credible intervals for f(x) by using MCMC samples. At a particular
x, every MCMC sample gives a corresponding output f(x); together these form a set of samples from the posterior distribution of the mean function f(x). By taking quantiles of these f(x) values we obtain a credible interval at that specific x. For example, a 95% CI is given by the 2.5% and 97.5% quantiles of the MCMC samples of f(x) at a particular x.
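As a minimal sketch of this quantile computation (with synthetic numbers standing in for BART's MCMC output, and nearest-rank quantiles rather than any particular interpolation rule):

```python
def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior samples of f(x) at one x."""
    s = sorted(samples)
    tail = (1 - level) / 2
    lo = s[int(tail * (len(s) - 1))]          # nearest-rank lower quantile
    hi = s[int((1 - tail) * (len(s) - 1))]    # nearest-rank upper quantile
    return lo, hi

# Stand-in for 400 thinned MCMC draws of f(x): the 95% interval sits near
# the 2.5% and 97.5% points of the sample.
samples = list(range(400))
print(credible_interval(samples))   # prints (9, 389)
```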
The new work of this thesis is a study of the credible interval properties: coverage, width, and sum of squared errors (SSE). We calculate all three quantities by using a test
data set. This test set is from a simulation, so it consists of values of x and f(x) for a large
sample of inputs. Thus, we know the actual value of f(x). This is necessary for
computing coverage and SSE. Suppose BART has a CI at 𝑥𝑖 given by the CI lower bounds
“LB(𝑥𝑖 )” and upper bounds “UB(𝑥𝑖 )”. Equations (2.4), (2.5), (2.6) illustrate the CI
properties’ formulas.
(2.4) coverage = (1/n.test) ∑ 𝛪( 𝐿𝐵(𝑥𝑖 ) < 𝑓(𝑥𝑖 ) < 𝑈𝐵(𝑥𝑖 ) ),
(2.5) width = (1/n.test) ∑ ( 𝑈𝐵(𝑥𝑖 ) − 𝐿𝐵(𝑥𝑖 ) ),
and, with 𝑓̂(𝑥𝑖 ) denoting the BART prediction (posterior mean) at 𝑥𝑖 ,
(2.6) SSE = ∑ ( 𝑓(𝑥𝑖 ) − 𝑓̂(𝑥𝑖 ) )²,
where each sum runs over the test points i = 1, …, n.test.
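As a minimal sketch of computing these three properties over a test set (the function and variable names here are illustrative, not from the thesis code):

```python
def ci_properties(f_true, f_hat, lb, ub):
    """Coverage, mean width, and SSE of CIs over a test set."""
    n = len(f_true)
    coverage = sum(l < f < u for f, l, u in zip(f_true, lb, ub)) / n
    width = sum(u - l for l, u in zip(lb, ub)) / n
    sse = sum((f - fh) ** 2 for f, fh in zip(f_true, f_hat))
    return coverage, width, sse

# Tiny example: 4 test points; the third interval misses the true f(x).
f_true = [1.0, 2.0, 3.0, 4.0]
f_hat  = [1.1, 1.9, 3.5, 4.0]
lb     = [0.5, 1.5, 3.2, 3.5]
ub     = [1.5, 2.5, 3.9, 4.5]
print(ci_properties(f_true, f_hat, lb, ub))   # coverage 3/4 = 0.75
```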
To give a quick illustration of BART credible intervals and their properties, consider the following example. Suppose the true function is f(x) = sin(2πx) + 2·𝛪(x > 0.5) + exp{2.5x − 0.5}, and forty data points are generated randomly. The predictor x has a uniform distribution and the response is generated as f(x) plus i.i.d. N(0, 0.25²) random errors. We use Default.BART with 1100 MCMC samples to estimate the model and predict at 500 test points. Predictions of f(x) and 90% credible intervals are shown in Figure 2.3. It displays that the fitted BART function is a step function, where each step corresponds to a change in the trees’ decision rules.
Figure 2.3: BART uncertainty prediction. Training data are plotted as individual points.
This simulation is repeated for 10 replicates. In each replicate, a different training set is generated, and CIs are obtained at 500 test points.
mean values of coverage, width and SSE over the 500 test points are shown in Table 2.1.
In Table 2.1, it seems that there is considerable variation in the values, due to the randomness in the training data. To accurately estimate the CI properties, it will be necessary to simulate multiple data sets for each combination of experimental settings.
Table 2.1: Mean of BART CI coverage, width and SSE over 500 test points, for 10
replicates.
2.3 BART prior parameters

From the previous section, the BART model is a Bayesian “sum of trees” model. CGM developed a specification for the prior that depends on four key prior parameters. One of these parameters is related to the prior for the terminal node means 𝜇𝑖𝑗 , two other parameters are related to the prior for σ, while the fourth parameter controls the number of trees in the BART model. These parameters are k, (ѵ, q), and m, respectively.
The prior distributions are an essential part of the BART model, in particular for the sum-of-trees model components (𝑇1 , 𝑀1 ), … , (𝑇𝑚 , 𝑀𝑚 ), and σ. CGM factor the joint prior on all parameters as
(2.7) 𝑃( (𝑇1 , 𝑀1 ), … , (𝑇𝑚 , 𝑀𝑚 ), 𝜎 ) = [ ∏𝑖 𝑃(𝑀𝑖 |𝑇𝑖 ) 𝑃(𝑇𝑖 ) ] 𝑃(𝜎),
with
(2.8) 𝑃(𝑀𝑖 |𝑇𝑖 ) = ∏𝑗 𝑃(𝜇𝑖𝑗 |𝑇𝑖 ).
In (2.7), CGM assume that in the prior, different trees (and their terminal node parameters) are independent from each other. In (2.8), given tree 𝑇𝑖 , the terminal node parameters 𝜇𝑖𝑗 are independent of each other. These priors are described in the rest of this section. Unless otherwise noted, the prior specifications are summaries of priors developed by CGM.
2.3.1 Prior variance on 𝝁𝒊𝒋 │𝑻𝒊
The prior 𝑃(𝜇𝑖𝑗 |𝑇𝑖 ), is specified as normal with mean 𝜇𝜇 , and variance 𝜎𝜇2 . In (2.3)
the sum of trees formula shows that E(y|x) is equal to the sum of m 𝜇𝑖𝑗 ′s. Then, the
prior mean and variance of E(y|x) will be m times the prior mean and variance of a
single 𝜇𝑖𝑗 since the 𝜇𝑖𝑗 ′s are i.i.d. Thus, the prior for E(y|x) is N(𝑚𝜇𝜇 , 𝑚𝜎𝜇2 ). CGM
assume that mostly, the E(y|x) values fall between the minimum and maximum of
observed y’s data. They suggest specification of 𝜇𝜇 and 𝜎𝜇2 values so that N(𝑚𝜇𝜇 , 𝑚𝜎𝜇2 )
gives most of the prior probability to the range (у𝑚𝑖𝑛 , у𝑚𝑎𝑥 ). Such values of 𝜇𝜇 and 𝜎𝜇 can be obtained by setting у𝑚𝑖𝑛 = 𝑚𝜇𝜇 − 𝑘√𝑚𝜎𝜇 and у𝑚𝑎𝑥 = 𝑚𝜇𝜇 + 𝑘√𝑚𝜎𝜇 .
CGM suggest values of 1, 2, 3, or 5 for the parameter k. Each value of k specifies a particular prior probability that E(y|x) falls within the interval (у𝑚𝑖𝑛 , у𝑚𝑎𝑥 ). For instance, k = 2 places approximately 95% prior probability on this range.
This strategy can be summarized as using the minimum (у𝑚𝑖𝑛 ), and maximum (у𝑚𝑎𝑥 )
of the data to define a probability for the range of plausible E(y|x ) values. To apply this
strategy, we shift and rescale the у values from у𝑚𝑖𝑛 = −0.5 to у𝑚𝑎𝑥 = 0.5. After that,
we specify the prior mean and standard deviation of 𝜇𝑖𝑗 as 𝜇𝜇 = 0 and
(2.9) 𝜎𝜇 = 0.5 / (𝑘√𝑚).
From (2.9), we can regard 𝑘 as a scaling and shrinkage parameter. It keeps the effect of each individual tree in (2.3) small by shrinking the 𝜇𝑖𝑗 toward zero. When m and 𝑘 increase, the 𝜇𝑖𝑗 ’s have smaller prior variance, and the posterior distribution for each 𝜇𝑖𝑗 is likewise shrunk toward zero.
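Under the rescaling to (у𝑚𝑖𝑛 , у𝑚𝑎𝑥 ) = (−0.5, 0.5) with 𝜇𝜇 = 0, the formula in (2.9) can be evaluated and sanity-checked numerically; this short sketch simply verifies that k standard deviations of the induced prior on E(y|x) reach the data boundary:

```python
import math

def sigma_mu(k, m):
    """Prior sd of a single terminal-node mean: sigma_mu = 0.5 / (k * sqrt(m))."""
    return 0.5 / (k * math.sqrt(m))

k, m = 2, 200                                    # example settings
# E(y|x) is a sum of m i.i.d. means, so its prior sd is sqrt(m) * sigma_mu.
prior_sd_of_Ey = math.sqrt(m) * sigma_mu(k, m)
# k sd's of the E(y|x) prior should reach y_max = 0.5 after rescaling.
print(k * prior_sd_of_Ey)
```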
CGM describe two methods to specify k. One uses cross-validation (the CV.BART method discussed later in Section 2.4). In the other method, default values are chosen.
2.3.2 Prior on σ

The prior parameters for σ are the degrees of freedom ѵ and the quantile q, which are used in specifying an inverse chi-squared distribution for 𝜎². CGM specify the prior on 𝜎² as ѵλ/𝜒ѵ², a scaled inverse chi-squared distribution. Each of ѵ and λ has a particular task: ѵ determines the spread of the prior, and λ determines its location. Small values of ѵ correspond to greater spread in the prior distribution. CGM
suggest choosing ѵ between 1 and 10, and then choosing λ so that an upper percentile
of the σ prior (say q = 75%, 90% or 99%) corresponds to 𝜎̂, a rough estimate of σ. CGM
suggest that this rough estimate of σ be obtained in one of two ways: either the residual
standard deviation from a linear regression, or a proportion (such as 20%) of the sample standard deviation of y.
CGM suggest three combinations of the parameters (ѵ, q): (10, 0.75), (3, 0.90), and (3, 0.99). These three settings are called conservative, default, and aggressive, respectively. Figure 2.4 shows the three resulting priors when 𝜎̂ = 2; each places most of its probability on σ < 2, reflecting the belief that σ < 2. In this thesis, (ѵ, q) is set using a default value; CGM suggest the default setting (3, 0.90).
2.3.3 Number of trees m

The parameter m determines the number of trees in the BART model. CGM suggest trying only two values, m = 50 and m = 200, and choosing between them by cross-validation. In checking the performance of this choice, CGM observed that BART prediction improves as m increases, up to a point after which performance slowly degrades. Therefore, it is important that m be large enough but not excessive; the values 50 and 200 are typically large enough.
2.4 Two versions of BART

This research uses two types of BART models, which are distinguished by the way in which the four key prior parameters are specified. Either the default settings of CGM are used, or the parameters are chosen by cross-validation. These models are called Default.BART and CV.BART. Section 2.4.1 describes the algorithm for CV.BART.

2.4.1 The CV.BART algorithm

The CV.BART algorithm is outlined below. It begins with a training and a testing dataset.
1- The prior parameters’ combinations are set as a small designed experiment with
k = 1, 2*, 3, 5
(ѵ, q) = (10, 0.75), (3, 0.90)*, (3, 0.99)
m = 50, 200*.
The number of these combinations is i = k’s levels x (ѵ, q)’s levels x m’s levels = 4 x 3 x 2 = 24; they are listed in Table 2.2.
2- Divide the training data randomly into five folds of roughly equal size, denoted 𝐶1 , … , 𝐶5 .
Table 2.2: The prior parameter combinations.
i k (ѵ, q) m
1 1 (10, 0.75) 50
2 2 (10, 0.75) 50
3 3 (10, 0.75) 50
4 5 (10, 0.75) 50
5 1 (10, 0.75) 200
6 2 (10, 0.75) 200
7 3 (10, 0.75) 200
8 5 (10, 0.75) 200
9 1 (3, 0.90) 50
10 2 (3, 0.90) 50
11 3 (3, 0.90) 50
12 5 (3, 0.90) 50
13 1 (3, 0.90) 200
14 2 (3, 0.90) 200
15 3 (3, 0.90) 200
16 5 (3, 0.90) 200
17 1 (3, 0.99) 50
18 2 (3, 0.99) 50
19 3 (3, 0.99) 50
20 5 (3, 0.99) 50
21 1 (3, 0.99) 200
22 2 (3, 0.99) 200
23 3 (3, 0.99) 200
24 5 (3, 0.99) 200
3- For the prior parameters in one of the 24 rows of Table 2.2, train 5 BART models. The jth model (j = 1, …, 5) is trained using 𝐶(𝑗) , the training data excluding fold 𝐶𝑗 . The prediction for observation i using model j is denoted 𝑌̂𝑖(𝑗) , j = 1, …, 5.
4- Compute the validation error sum of squares
(2.11) 𝑆𝑆𝐸 = ∑5𝑗=1 ∑𝑖𝜖𝐶𝑗 (𝑌𝑖 − 𝑌̂𝑖(𝑗) )².
Note that the predicted values are for the fold of data not used in training.
5- The validation 𝑆𝑆𝐸 is calculated for each of the 24 rows of Table 2.2. This gives 24 𝑆𝑆𝐸 values.
6- Determine the best CV.BART combination of prior parameters by finding the row z among the 24 combinations in Table 2.2 that has the lowest value of 𝑆𝑆𝐸.
7- Using the parameters from row z, re-estimate the BART model on all the training data.
8- Use the fitted model from step 7 to make predictions for the test data. Then calculate the credible interval coverage, credible interval width, and error sum of squares.
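The parameter search in steps 1-6 can be sketched as follows; `fit_bart` and `predict` are hypothetical placeholders for the actual BART fitting and prediction routines, so only the grid construction and SSE bookkeeping are concrete here:

```python
from itertools import product

K_LEVELS = [1, 2, 3, 5]
VQ_LEVELS = [(10, 0.75), (3, 0.90), (3, 0.99)]
M_LEVELS = [50, 200]

# Step 1: the 24 prior parameter combinations of Table 2.2.
grid = [{"k": k, "vq": vq, "m": m}
        for vq, k, m in product(VQ_LEVELS, K_LEVELS, M_LEVELS)]
print(len(grid))   # prints 24

def cv_sse(params, folds, fit_bart, predict):
    """Steps 3-4: 5-fold validation SSE for one parameter combination.

    `folds` is a list of (train, holdout) pairs; the model for fold j is
    trained without fold j and scored on it.
    """
    sse = 0.0
    for train, holdout in folds:
        model = fit_bart(train, **params)        # hypothetical fitter
        for x, y in holdout:
            sse += (y - predict(model, x)) ** 2  # squared error on held-out fold
    return sse

# Steps 5-6: the winning row z minimizes the validation SSE, e.g.
# best = min(grid, key=lambda p: cv_sse(p, folds, fit_bart, predict))
```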
The Default.BART model is very simple compared to CV.BART. A single BART model is fit using the specific prior parameter values denoted with (*) in step 1 above. Therefore, to find the Default.BART CI properties, we just follow steps 7 and 8 above, using the default parameter values instead of the CV-selected ones.
Chapter 3
The simulation study and setting up the experiment
The thesis studies the accuracy of BART credible intervals under the effect of specific
factors. An analysis of variance (ANOVA) summarizes these factors’ effects on BART CIs.
The experiment has seven factors: sample size “n”, predictors’ dimension “p”, noise
standard deviation “σ”, junk variables, predictor correlation “r”, type of error
distribution, and BART method. A convenient way to conduct this experiment is a simulation study in which the true mean function is known. This makes it possible to compute the accuracy measures exactly for every run of the experiment.
This chapter describes this study, including the experimental factors and their levels.
It presents a separate small study for determining a sufficient number of MCMC iterations. It also includes the details of the simulated function used here and the response variables for the experiment.
3.1 The experimental settings
Designed experiments are widely used in industry and other applications. They assist in identifying significant factors that impact a quantity of interest, can improve process yields, and can reduce time and overall cost. A key feature is the ability to determine which settings of the design parameters affect the response values. Using an additive model allows one to estimate the main effects and interactions. There are several types of design, such as full factorial, fractional factorial, and block designs. Each design has particular characteristics and features which determine when it should be used. Given this research’s goal of studying the performance of BART CIs under various factors, we decided to carry out the study using a full factorial design: it is the simplest design, and its analysis is easy.
The seven factors and their levels are described below. The selection of the seven factors in our study comes from our belief in their importance. Six of the seven factors are related directly to the population model
(2.1), while the factor “BART method” concerns the choice of prior parameters for BART.
Other studies consider similar sets of factors. For instance, Friedman (2001) used three factors: the sample size, the underlying true function, and whether the error distribution was normal. CGM (2010) studied factors similar to those studied in this thesis. They conducted an experiment to show the
performance and features of the BART model against the true underlying function. They
used a fixed function with five actual predictors 𝑥𝑖 and included additional junk variables
to give a total of 10, 100, or 1000 predictors. Thus, in the BART study the number of
junk variables and the dimension were varied. These previous examples show that our seven factors are reasonable selections for examining effects on BART CIs.
These factors have various types and numbers of levels. Some are four level factors,
while other factors have only two levels. Some are numeric factors, while others are
categorical. More details of our factors and their levels are given below.
1. Sample size “n”

The sample size determines the number of observations in a training data set. It is
the first factor we decided to include in this full factorial design, to see how the size of
training data sets can impact BART CI performance. The sample size “n” is a numeric
factor with four levels: n = (40, 200, 1000, 10000). The reason for not choosing larger n
values is that the BART process would take a very long time to run.
2. Dimension “p”
In our simulated data set, it is necessary to determine the generated values of x. The
dimension p gives the number of important predictors. The value p is the number of
predictors that affect the response function. This excludes junk variables, which are
described below. The four levels p = (1, 2, 5, 10) were chosen for this factor. We did not
select the p values to be too large because we know BART would take a very long time
to run.
The decision to study the effect of dimension on BART CIs is motivated by studies in
CGM (2010). That trial was a comparison between BART and other supervised learning
methods such as gradient boosting, random forests, and neural nets. The goal was to
examine which method had the best performance on a simulated function of p=5
inputs, with additional junk variables. The dimension p=5 in that study is similar to the levels chosen here.
3. Junk variables
In addition to those p variables, we may add “junk” variables which have no effect on f(x) or the
response y. The factor “junk” is a two level categorical factor with levels (“no”, “yes”).
If “junk” = “no”, then the predictor matrix X has p columns. If it is equal to “yes”, then X
has 11*p columns, and for every actual x variable, there are 10 “junk” x variables. For
example, if p=10 and junk=”yes”, then our X matrix will have 110 columns. The first 10
columns will be used in f(x) to generate the mean function, while the other 100 columns
will be junk variables. Thus, the percentage of active inputs is less than 10%. The
purpose of considering junk variables is to see if the performance of CIs is affected for
Default.BART and CV.BART. Experiments in CGM (2010) suggest that Default.BART had
worse performance than CV.BART for high-dimensional problems with many junk
variables. That experiment had p=5 fixed, and varied the number of junk variables.
That is, in all experiments CGM used a fixed function in five predictors and a different
number of junk variables which were 5, 95, and 995. Thus, the total number of active
and junk variables were 10, 100, and 1000. Both BART versions illustrated similar
behaviour at 10, but Default.BART was worse than CV.BART at the greater values 100,
and 1000.
In this thesis, we emphasize that f(x) is built to be a function only of the p variables
but BART does not have information on which variables are the junk variables (if there
are any).
4. Standard deviation “σ”
The noise level of the residual in population formula (2.1) is √Var (𝜀) = σ. We vary σ
over four levels including very low values where σ = (0.01, 0.1, 0.25, 0.5).
5. Predictor correlation “r”

This factor specifies the correlation between any two predictors. The correlation r is a numeric factor with two levels (0,
0.5). All predictors in the X matrix (including junk variables, if present) are generated as
multivariate normal random variables with mean vector 0 and covariance matrix R. If
r=0, then R=𝙸. If r=0.5, then all off-diagonal elements of R are 0.5, and the diagonal of R
contains 1’s.
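The equicorrelated predictors can be generated with a one-factor construction, x_i = √r·z0 + √(1−r)·z_i (our own shortcut for this covariance structure; the thesis simply specifies a multivariate normal with covariance matrix R):

```python
import math
import random

def correlated_row(p, r, rng):
    """One row of X: p standard normals with pairwise correlation r."""
    z0 = rng.gauss(0, 1)                         # shared factor
    s_r, s_1r = math.sqrt(r), math.sqrt(1 - r)
    return [s_r * z0 + s_1r * rng.gauss(0, 1) for _ in range(p)]

# Sanity check: with r = 0.5, the sample correlation of two columns
# should be near 0.5 and each variance near 1.
rng = random.Random(0)
rows = [correlated_row(2, 0.5, rng) for _ in range(20000)]
x = [row[0] for row in rows]
y = [row[1] for row in rows]
mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in rows) / len(rows)
vx = sum((a - mx) ** 2 for a in x) / len(x)
vy = sum((b - my) ** 2 for b in y) / len(y)
corr = cov / math.sqrt(vx * vy)
print(round(corr, 2))
```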
6. Type of error distribution

In (2.1), the distribution of the noise term ε is specified as normal. We study the effect of violating this assumption by using either normal or t errors. Thus, the factor “error distribution” is categorical with two levels: a normal distribution, or a t-distribution with 3 degrees of freedom. The t-distribution was chosen to have three df
because this is the smallest degrees of freedom for which the variance is finite. Errors
were generated as a scalar multiple of either a standard normal or a t-distribution, with the scalar chosen so that the error standard deviation equals σ.
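Assuming the scalar standardizes the t(3) draw (Var(t₃) = 3/(3−2) = 3, so the draw is divided by √3 before scaling by σ; the exact constant is our reading of the text), error generation might be sketched as:

```python
import math
import random

def sim_error(sigma, dist, rng):
    """One error draw with standard deviation sigma, from N(0,1) or t(3)."""
    if dist == "normal":
        return sigma * rng.gauss(0, 1)
    # t(3) via its definition: standard normal over sqrt(chi-square_3 / 3).
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(3))
    t3 = rng.gauss(0, 1) / math.sqrt(chi2 / 3)
    return sigma * t3 / math.sqrt(3)      # divide by sqrt(Var(t_3)) = sqrt(3)

# Sanity check: the empirical sd of scaled t(3) errors should be near sigma.
rng = random.Random(1)
draws = [sim_error(0.5, "t3", rng) for _ in range(50000)]
sd_hat = math.sqrt(sum(e * e for e in draws) / len(draws))
print(round(sd_hat, 2))
```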
7. BART method
We study two different forms of the BART model: CV.BART and Default.BART, as
described in Section 2.4. The main difference is that the default version uses default values of all prior parameters, while the CV version chooses these parameters by cross-validation. Thus, “BART method” is a categorical factor with two levels. Both BART methods are fit to the same simulated data sets.
The seven factors are used to construct a design matrix “D” as a full factorial design. We simulate five replications for each combination of the experimental design factors (i.e., for each row of the D matrix). The design matrix D has 4 x 4 x 4 x 2 x 2 x 2 x 2 = 1024 rows of factor combinations; with replications, the full experiment consists of 5 x 1024 = 5120 runs.
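The full factorial design matrix D and the replicated run list can be sketched as follows (the factor levels come from the text; the one-dict-per-row layout is illustrative):

```python
from itertools import product

levels = {
    "n": [40, 200, 1000, 10000],
    "p": [1, 2, 5, 10],
    "sigma": [0.01, 0.1, 0.25, 0.5],
    "junk": ["no", "yes"],
    "r": [0, 0.5],
    "errdist": ["normal", "t3"],
    "method": ["Default.BART", "CV.BART"],
}

# One row per combination of all seven factor levels.
D = [dict(zip(levels, combo)) for combo in product(*levels.values())]
# Five replicates per design row give the full run list.
runs = [(row, rep) for row in D for rep in range(1, 6)]

print(len(D), len(runs))   # prints 1024 5120
```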
Table 3.1: The experimental factors, their descriptions, numbers of levels, and levels.
3.1.2 Quantities held fixed

There are several other quantities that could have been treated as experimental factors, but have instead been set to fixed levels. In this section we identify those quantities and discuss how their levels were chosen. These quantities are the BART prior parameters and the number of MCMC iterations. A brief clarification of each is given below:
1. BART prior parameters
Four prior parameters that are important for BART are mentioned in Chapter 2. Each
of CV.BART and Default.BART has a particular way to select these prior parameters.
These parameters are the prior on 𝜇𝑖𝑗 “k”, the prior on σ “(v,q)”, and the number of
trees “m” in the ensemble. These prior parameters’ levels are set as described in Section 2.3, except for a restriction relating to the choice of the prior on σ.
In Chapter two we indicated that there are two ways to obtain 𝜎̂, a rough guess of σ
used in specifying the prior for σ. One may use the residual standard error from a linear
regression model or a sample standard deviation sd(у). In this thesis study, we decided to always use sd(у), so the method for specifying 𝜎̂ can be considered fixed. The reason for not choosing the linear model to specify 𝜎̂ is that in some cases the training set can have fewer observations than variables (e.g. n=40 with p=5 or 10 and junk=’yes’ gives 55 or 110 predictors).
2. MCMC iterations
As described in Section 2.2, the MCMC samples of parameters values of the BART
model are used to generate credible intervals for predictions. The number of MCMC
iterations affects the quality of the BART CI, where increasing the number of iterations
enables the MCMC to better explore the posterior probability distribution. However,
increasing the number of MCMC iterations will lead to very long run times for BART. We
decided to run a small experiment to choose the best number of MCMC iterations that balances CI accuracy with algorithm run time. This best value we call “an adequate number of MCMC iterations”. The details of the small experiment and results
are discussed in Section 3.2. Another important issue is that the MCMC algorithm will
not converge immediately, and we have to discard some of the early MCMC iterations
that are called “burn-in” iterations. In this thesis, burn-in iterations always represent
the first 20% of an MCMC run. CGM typically used 20% burn-in. BART runs more
quickly with 1 in 10 thinning. Later in this Chapter, we will see that the eventual
decision is to run 5000 iterations, discarding the first 1000 iterations as burn-in, and
iteration “sample” produces a corresponding output f(x). Thus, the 4000 MCMC
iterations shall give 4000 samples from the posterior distribution of the f(x) to give BART
CI. For this study, these samples must be stored at each of 10,000 test points. To reduce
this storage requirement, we "thin" the MCMC sample by recording only results from every 10th MCMC iteration.
That is, we keep the 1001st, 1011th, 1021st, … MCMC samples after burn-in.
Consequently, 90% of the computer memory is saved, with negligible effect on the
quality of the results. To summarize: 5000 MCMC iterations are run; the first 1000 are
discarded as burn-in; the remaining 4000 values are thinned, resulting in 400 saved MCMC
samples.
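The burn-in and thinning scheme above can be sketched as follows (an illustrative Python sketch, not the R code used in the thesis):

```python
def kept_iterations(total=5000, burn_in=1000, thin=10):
    """Return the MCMC iteration numbers (1-based) that are saved
    after discarding burn-in and keeping every `thin`-th draw."""
    # Iterations 1..burn_in are discarded; afterwards we keep the
    # 1001st, 1011th, 1021st, ... draw, matching the text.
    return list(range(burn_in + 1, total + 1, thin))

kept = kept_iterations()
print(len(kept), kept[:3], kept[-1])  # 400 saved draws: 1001, 1011, ..., 4991
```

With the settings used in this thesis the scheme keeps 400 of 5000 iterations, which is the 90% memory saving described above.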
intensive, requiring 120 runs of the BART algorithm, corresponding to 5 folds and 24
shorten our MCMC runs to 100 burn-in iterations followed by 400 sampling iterations
again keeping every 10th value. The selection of the best of the 24 parameter
that if low predictive error, rather than accurate coverage, is the objective,
fewer MCMC iterations are needed. Note that the final re-fitting of BART with all
training data and the optimal parameters uses the same MCMC settings as default
BART.
3.1.3 Response variables for the experiment
The goal of this thesis is to study the BART properties CI coverage (at levels 90%, 95%, and
99%), CI width (at the same levels), and predictive SSE; these are defined in Chapter 2.
All these results are calculated using a test data set with 10,000 observations and error-free f(x)
values, generated using the corresponding settings of p, junk, correlation and
σ. We use the test data to compute our response variables. If the results were
evaluated using the f(x) values at the training points, they would not be as reliable: a
model may overfit the training data, but prediction for the test data gives a more realistic assessment.
Markov chain Monte Carlo (MCMC) samples are an important part of this research,
giving the posterior probability distribution for model predictions. A
larger number of iterations gives more accurate predictions, since the samples cover the
posterior distribution more fully. However, a smaller number of iterations gives
results faster. The number of MCMC iterations needs to be chosen to balance accuracy
and speed.
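The three responses can be computed from the saved posterior samples roughly as follows. This is a minimal Python sketch (the thesis used R), assuming equal-tailed intervals taken as crude empirical quantiles of sorted posterior draws, with the posterior mean as the point prediction:

```python
def summarize(post_draws, f_true, level=0.95):
    """Coverage, mean width of equal-tailed credible intervals, and SSE.
    post_draws: list of lists, posterior samples of f(x) at each test point.
    f_true: true (noise-free) f(x) values at the same points."""
    alpha = 1 - level
    hits, widths, sse = 0, [], 0.0
    for draws, truth in zip(post_draws, f_true):
        s = sorted(draws)
        # crude empirical quantiles for the interval endpoints
        lo = s[int(alpha / 2 * (len(s) - 1))]
        hi = s[int((1 - alpha / 2) * (len(s) - 1))]
        hits += lo <= truth <= hi
        widths.append(hi - lo)
        # the point prediction is the posterior mean
        mean = sum(draws) / len(draws)
        sse += (truth - mean) ** 2
    return hits / len(f_true), sum(widths) / len(widths), sse
```

Coverage is the fraction of test points whose true f(x) falls inside its credible interval, width is averaged over test points, and SSE sums the squared prediction errors.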
To find an adequate number of MCMC iterations, a small experiment was
conducted. We used the following test function: f(x) = 10 sin(π x_1 x_2) + 20(x_3 − 0.5)² + 10 x_4 + 5 x_5 (Chipman, George and McCulloch 2010; Friedman 1991).
In these simulated data, observations were generated randomly from an i.i.d. N(0,1)
distribution. The first five x variables affect the response, and the rest are junk
variables. The values N= 500, 1000, 1500, 2000, 3000, 4000, 5000, 10000 were used as
different numbers of MCMC iterations to fit the BART model. Each time the BART model
was estimated with a particular number of iterations and the prediction SSE, coverage
and width of 90% credible intervals were recorded. Two different training sample sizes
were selected to compare the behaviour of the Markov chain. These sample sizes were
n_1 = 1000 and n_2 = 10000. In all cases, a test set of 10000 observations was used.
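The Friedman test function is easy to state in code. Below is an illustrative Python sketch; the U(0,1) distribution for the x variables in the usage line is an assumption about the simulation setup (it is the standard choice for this function), not a detail taken from the thesis:

```python
import math, random

def friedman(x):
    """Friedman (1991) test function used in the MCMC-size experiment.
    x: sequence with at least 5 entries; entries beyond x[4] are 'junk'."""
    return (10 * math.sin(math.pi * x[0] * x[1])
            + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3]
            + 5 * x[4])

# Hypothetical usage: one noisy training observation y = f(x) + eps.
random.seed(1)
x = [random.random() for _ in range(10)]   # junk variables ignored by f
y = friedman(x) + random.gauss(0, 1)       # i.i.d. N(0, 1) error
```

Only the first five coordinates affect the response, so any additional coordinates play the role of junk variables.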
An initial burn-in portion of the MCMC samples was discarded. In all cases, the
number of burn-in iterations was taken to be 25% of the number of saved iterations,
which we denote by "N". For instance, if N = 500 iterations were saved, then prior to that
Table 3.2 reports the response values for different training set sample
sizes and numbers of MCMC iterations. All CIs were constructed with level 90%. From
Table 3.2, it is obvious that coverage and width have a positive relation with the number
of MCMC iterations while there is a negative relationship between SSE and the number
of MCMC iterations. Figure 3.1 plots columns 2-7 of Table 3.2 as a function of the
number of MCMC iterations. The same trends identified in the table are visible in the
plots.
Table 3.2: Predictive performance at various MCMC iterations (N) and sample sizes (n).
Figure 3.1: The relation between MCMC samples and different responses (width,
coverage, SSE).
The largest changes to SSE, width and coverage happen by 4000 MCMC iterations. A larger number of
iterations gives more reliable results, but not very different from 4000 iterations. Thus,
4000 is a good number of iterations for this research, giving a good balance of accuracy and run time.
When this experiment was repeated, there were only small changes in results for the
n=1000 training set. However, the same trends as in Table 3.2 and Figure 3.1 were observed.
Figure 3.2 suggests an explanation for the different behaviour of the Markov chain at
the two training sample sizes. It plots the posterior samples of σ against the
MCMC iteration, with burn-in samples indicated in red and saved samples indicated in
black.
Figure 3.2: Posterior samples of σ versus MCMC iteration number, for training
the fit of the BART model. The true value of σ is 1.0. The right plot of Figure 3.2 shows
more uniform MCMC iterations and less noise than in the plot for n=1000. This is to be
expected, since the larger training set will result in a posterior distribution with less
uncertainty. The plots indicate that the MCMC is mixing well among these iterations.
In conclusion, although a larger number of MCMC iterations does give more
reliable findings, smaller values do not give very different results. While 4000 MCMC
iterations is fewer than 10000, it leads to credible predictions and results similar to those at 10000.
3.3 The simulated function
To obtain more general results, for each of the 5120 experimental runs in D a
different f(x) is generated, instead of using just one population function. This function
can be called a mean function because it equals E(y│x), where y = f(x) + ε in the
population model. At each combination of the factor levels in D, the function f(x) is chosen
randomly and used as the conditional mean in simulating the data set. This leads to a
Friedman (2001) described an algorithm for generating this random function. The
thesis uses the implementation of that algorithm from Chipman (2011). This Friedman
randomly across each row of D. The formula of this generated function is given as

(3.1) f(x) = Σ_{j=1}^{20} a_j h_j(x_j; μ_j, V_j)
The coefficients a_j are scalars that are randomly generated as U(−1, 1). The function h_j is given by

(3.2) h_j(x_j; μ_j, V_j) = exp(−(1/2) (x_j − μ_j)ᵀ V_j⁻¹ (x_j − μ_j))
The size p_j of each x_j is taken randomly as floor[1.5 + r], where r is drawn from an
exponential distribution with mean λ = 2. If r > p, then we set r = p. The vector x_j has a
parameters of h_j are the mean vector μ_j and covariance matrix V_j, which are not
constant. The vectors {μ_j, j = 1, …, 20} are randomly generated from MVN(0, I). The
covariance matrix is generated as

(3.3) V_j = U_j W_j U_jᵀ

where U_j is a uniformly random orthonormal matrix and W_j = diag{w_1, …, w_{p_j}}. The
In each h_j, there are a new μ_j and V_j. These two parameters control the shape of h_j,
which contributes to the shape of f(x). Figure 3.3 shows contour plots of generated f(x),
setting p = 2 and using 10 h_j terms instead of 20 as in (3.1). Each of the
ten h_j is a sub-function of a vector of one or two randomly chosen variables from two
Figure 3.3: Six realizations of randomly simulated f(x) in two dimensions, as described in
the text.
Each realization of f(x) has a different shape, with different peaks and surfaces. This
illustrates the variety of the generated functions, which makes the thesis results more general.
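The generation mechanism can be illustrated with a simplified sketch. This is not the Chipman (2011) R implementation: it assumes p ≤ 2, builds the rotation U_j from a single random angle, and draws the eigenvalues w_i from an (assumed) exponential distribution:

```python
import math, random

def random_bump_function(p=2, terms=10, seed=0):
    """Simplified sketch of the random mean functions pictured in
    Figure 3.3 (assumptions: p <= 2; 2-D rotations for U_j;
    exponential eigenvalues for W_j)."""
    rng = random.Random(seed)
    bumps = []
    for _ in range(terms):
        pj = min(p, math.floor(1.5 + rng.expovariate(1 / 2)))  # size of x_j
        idx = rng.sample(range(p), pj)                # variables used by h_j
        mu = [rng.gauss(0, 1) for _ in idx]           # mean vector mu_j
        w = [rng.expovariate(1.0) for _ in idx]       # eigenvalues of V_j
        theta = rng.uniform(0, 2 * math.pi)           # random rotation (U_j)
        a = rng.uniform(-1, 1)                        # coefficient a_j
        bumps.append((idx, mu, w, theta, a))

    def f(x):
        total = 0.0
        for idx, mu, w, theta, a in bumps:
            d = [x[i] - m for i, m in zip(idx, mu)]
            if len(d) == 2:                           # rotate into U_j's frame
                c, s = math.cos(theta), math.sin(theta)
                d = [c * d[0] + s * d[1], -s * d[0] + c * d[1]]
            q = sum(di * di / wi for di, wi in zip(d, w))
            total += a * math.exp(-0.5 * q)           # a_j * h_j(x_j)
        return total
    return f
```

Each call with a different seed yields a different random surface, mirroring the variety visible in Figure 3.3.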
We made some small changes to the mechanism of function generation by fixing
the number of h_j functions at 20 and making the random length p_j of the x_j variables
the design. In the Friedman algorithm and the Chipman implementation, the number
In this thesis, there are computational challenges caused by the size of the
experiment and the long execution time of BART. These challenges led us to use parallel computing.
This section describes the use of parallel computing to carry out the 5120 runs of the
experiment. All computing was carried out on ACENET parallel computers. ACENET is a
that require large computing resources and can be distributed among several
computers. A large job can be split into multiple small jobs, which are then executed
on individual computers in the clusters. The cluster also helps in executing multiple
runs of a process with different sets of data, for example the same algorithm is run with
This thesis has a large experiment, where the final design matrix D has 5120 runs;
running them sequentially would spend months. To save time, we decided to run the thesis experiment on a cluster of
computers by dividing the entire experiment into smaller "jobs". The technique used
here is breaking the 5120 rows of the D matrix into 640 independent jobs which can be
run on one or more ACENET clusters. Each job consists of 8 rows or runs of the D
matrix. Each one of these jobs is an independent task. Table 3.3 shows the first of the
640 jobs. We divide the 5120 runs among the 640 jobs so that the execution time of
jobs is similar from one job to another. This time is less than 48 hours for jobs which
include CV.BART, and less than 24 hours for jobs that contain the Default.BART
method. For CV.BART, each job took approximately 16 to 29 hours to
complete, while jobs corresponding to Default.BART took about 4 to 10 hours.
This balancing of run times was achieved by dividing the jobs equally
according to sample size: all 640 jobs have the same levels of sample size, in the
same order, so the n column of Table 3.3 is exactly the same for every
job. It was reasonable to set up our jobs in this way since the execution of the BART
row n p σ junk r error.dist method replicate
1 40 1 0.01 no 0 t cv 1
2 200 1 0.01 no 0 t cv 1
3 1000 1 0.01 no 0 t cv 1
4 10000 1 0.01 no 0 t cv 1
5 40 2 0.01 no 0 t cv 1
6 200 2 0.01 no 0 t cv 1
7 1000 2 0.01 no 0 t cv 1
8 10000 2 0.01 no 0 t cv 1
algorithm is roughly proportional to training set size n. From the MCMC
experiment in Section 3.2, we have seen that MCMC runs take longer
as n increases. Therefore, we select our jobs to have the same
combinations of n levels.
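The splitting of D into jobs can be sketched as follows (illustrative Python, not the thesis's R/ACENET scripts; design rows are represented here only by their n value):

```python
def make_jobs(design_rows, n_jobs=640):
    """Split the rows of the design matrix D into equally sized jobs.
    Because the rows cycle through the n levels in a fixed order, every
    job sees the same sequence of sample sizes, balancing run time."""
    job_size = len(design_rows) // n_jobs           # 5120 / 640 = 8 rows
    return [design_rows[i:i + job_size]
            for i in range(0, len(design_rows), job_size)]

# Hypothetical illustration: 5120 rows cycling through the four n levels.
rows = [{"n": n} for _ in range(1280) for n in (40, 200, 1000, 10000)]
jobs = make_jobs(rows)
```

Under this ordering, every job contains the same multiset of n values, which is why run times are similar from one job to another.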
To run these jobs on ACENET machines, we used a collection of R code for parallel
computing (Chipman, 2011). It includes five main files:
main.R, doit.R, setbatch.R, postprocess.R, and cleantemp.R. Each of these files has
a particular task. Briefly, their tasks are:
- doit.R has most of the code to execute the runs for this thesis experiment. It builds
replications, and the MCMC iteration parameters. It also loads all the R packages the
program needs, such as "mvtnorm", and it splits the D matrix into small jobs, each
containing 8 rows of D. It executes Default.BART or CV.BART and then saves the
results in files to be read later. It includes some of the following files as
- main.R is the first of these files to be run. It prepares all
other files before execution on the cluster. It specifies each job's maximum execution
time in hours and minutes; thus, it can be considered a timer file that specifies how
long a job may take. Execution of main.R prepares separate files associated
with the 640 different jobs to be run. For each job, an R file specifies which rows of D
will be used by doit.R. Code is also generated that submits the jobs to the cluster queue
for execution.
- setbatch.R is a function that sets up these parallel jobs in an ACENET queue.
Its purpose is to name the files and parallel jobs so they can be identified on ACENET.
- postprocess.R is a results-collection file, which is run after all parallel jobs have been
completed. It reads all results files and carries out the required analysis.
- cleantemp.R is a cleaning file, used to discard the junk files produced by the
computations.
We made some minor modifications to this setbatch example to run our jobs on
ACENET. For example, the doit.R file has our main program and it contains all files that
- mainfunc.R is a complementary file to doit.R that contains the rest of the program
settings. Here all the small changes to the function generation settings were made, such
Chapter 4
Analysis
This thesis uses ANOVA to analyze the experimental results. ANOVA displays the
main effects and interaction effects of the factors. This thesis focuses on main effects and
two-way interactions and ignores three-way interactions, since including three-
way interactions in the model does not lead to large increases in R². More details shall
width90, width95, width99, and predictive SSE. However, this analysis concentrates on
the three responses coverage95, width95, and SSE. The definitions of the responses
were given in Section 2.2. Results for responses at the 90% and 99% levels are quite
Below the factors’ main effects and interactions are explored via plots and ANOVA
tables.
4.1 Analysis of coverage
coverage99. Figure 4.1 presents the main effects for all three responses. The effects
have the same relative size across the different coverage levels, with a shift in mean level from
one response to another. There is a high correlation between these responses; Section
4.6 contains the details. We choose to analyze only 95% coverage because its results
Figure 4.1: The main effects for coverage90, 95, and 99 respectively. The factors’ order
Figure 4.2: Main effects for coverage95.
Figure 4.2 displays the main effects for coverage95. We use it to examine the effect
sizes. The factor n has the largest effect on the coverage of BART credible intervals. The
factors σ and error distribution have the second and third largest main effects,
respectively. Then, junk predictor, dimension p, and BART method have small but
visible effects in Figure 4.2. The predictor correlation “r” appears to have little or no
impact on coverage. Overall, the mean level of coverage (about 80%, indicated by the
horizontal line in Figure 4.2) is considerably lower than the nominal 95% level.
Mostly there is an inverse relationship between the sample size n and coverage.
Coverage increases when n decreases, except when n=40. Thus, for large sample size,
such as n = 10,000, the coverage of BART credible intervals is much lower than the
desired 95%. The factor σ has a positive relationship with coverage, with a very low
coverage at σ = 0.01. Although the presence of junk predictors and the
error distribution have little effect on coverage, coverage is closer to the nominal
95% level when there are no junk variables and the error is normally distributed. The
We now give a more detailed analysis of coverage95, using an ANOVA table and
interaction plots. Table 4.1 shows the ANOVA for the two-way interaction model.
The rows of the ANOVA table are ordered by mean square (MS). F-tests and
the corresponding p-values can be used to identify significant effects. The last column
of the table indicates the significance of effects: * for p < 0.05, ** for p < 0.01, and
*** for p < 0.001.
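This star coding, which follows the usual R convention, can be written as a small helper (an illustrative sketch):

```python
def stars(p):
    """Map a p-value to the significance codes used in the ANOVA tables."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# e.g. stars(2.2e-16) -> "***", stars(0.018) -> "*", stars(0.09) -> ""
```

Applying it to the p-values in Table 4.1 reproduces the codes shown there, e.g. "**" for p (p-value 0.001718) and "*" for σ:r (p-value 0.018007).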
In the remainder of this section we examine some of the large effects. The ordering of the
largest effects for coverage95 shows that n, σ, n:σ, error distribution, p:σ, and n:
method are the six most important effects for the coverage of BART credible intervals.
All main effects except r are highly significant, with p-values less than 10⁻⁴, and seven
Df SS MS F-value Pr(>F)
n 3 73.672 24.557 4168.230 <2.2e-16 ***
σ 3 21.214 7.071 1200.234 <2.2e-16 ***
n: σ 9 43.023 4.780 811.399 <2.2e-16 ***
error distribution 1 0.887 0.888 150.639 <2.2e-16 ***
p: σ 9 6.456 0.717 121.765 <2.2e-16 ***
n: method 3 1.678 0.559 94.945 <2.2e-16 ***
p: junk 3 1.264 0.421 71.529 <2.2e-16 ***
n: p 9 3.348 0.372 63.140 <2.2e-16 ***
junk 1 0.263 0.263 44.582 2.702e-11 ***
n: junk 3 0.769 0.256 43.526 <2.2e-16 ***
n: error distribution 3 0.410 0.137 23.176 6.791e-15 ***
method 1 0.091 0.091 15.390 8.861e-05 ***
p: method 3 0.171 0.057 9.655 2.369e-06 ***
p 3 0.089 0.030 5.046 0.001718 **
σ: error distribution 3 0.063 0.021 3.577 0.013338 *
σ: r 3 0.059 0.020 3.359 0.018007 *
σ: method 3 0.056 0.019 3.172 0.023263 *
σ: junk 3 0.036 0.012 2.011 0.0901220
r: method 1 0.009 0.009 1.574 0.209706
r: junk 1 0.007 0.007 1.115 0.291069
junk: method 1 0.005 0.005 0.780 0.377086
p: error distribution 3 0.012 0.004 0.682 0.562851
n: r 3 0.005 0.002 0.265 0.850610
p: r 3 0.003 0.001 0.169 0.917679
r 1 0.001 0.001 0.162 0.687229
error distribution: method 1 0.001 0.001 0.105 0.746514
junk: error distribution 1 0.000 0.000 0.044 0.833871
r: error distribution 1 0.000 0.000 0.002 0.967963
residual 5037 29.676 0.006
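Each F value in the table is the effect mean square divided by the residual mean square. A quick sketch, using only numbers from Table 4.1, reproduces the tabled F values to within rounding:

```python
# Residual mean square from the last row of Table 4.1: SS / df.
ms_resid = 29.676 / 5037

def f_stat(ss, df):
    """F statistic for an effect with sum of squares `ss` on `df` df,
    tested against the residual mean square."""
    return (ss / df) / ms_resid

print(round(f_stat(73.672, 3)))   # n:     ~4168
print(round(f_stat(21.214, 3)))   # sigma: ~1200
```

This is simply a consistency check on the table, not a re-analysis of the thesis data.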
Earlier in this section, interpretations of the main effects for coverage95 were given.
Here we examine the three interactions among effects with the six largest MS values.
We first consider the n:σ interaction displayed as an interaction plot in Figure 4.3.
The n:σ interaction plot shows that when σ=0.01 and n=10000 coverage is
approximately 30%, well below the nominal 95% level. As a result, BART credible
intervals will be very inaccurate at very small values of σ and very large values of sample
size n. At other levels of σ (0.1, 0.25 and 0.5), coverage is less sensitive to the effect of
sample size n. Figure 4.3 shows coverage closest to the nominal 95% level when n has
Figure 4.4 illustrates the p:σ interaction for coverage95. Coverage is much lower
when σ = 0.01; for other values of σ, coverage is similar. The factor p has a
complicated relationship with coverage. For instance, sometimes p gives the best
coverage at its lowest values, as at p = 1, 2 with σ = 0.25, 0.5.
In other cases, p gives the worst coverage at p = 1 and 2, when σ equals 0.01
Figure 4.5 shows the n:method interaction effect. The n:method interaction appears
smaller than the other two interactions, since the two lines are reasonably close to each
other. The coverage of Default.BART seems more sensitive to n than the coverage of
CV.BART, varying more with n. Both BART methods
give the best coverages at the lowest levels of sample size. The factor n gives the worst
values for coverages at its large values. At n = 40 and n =10000 CV.BART has coverage
closer to 95% than does Default.BART. Both methods give good coverage at n=40 and
200, and the coverage of CV.BART changes less with n than the coverage of
Default.BART.
Here we begin by comparing main effect plots of credible interval width over the
three different coverages (Figure 4.6). Figure 4.6 suggests the factors’ effects have the
same relative sizes across widths corresponding to 90%, 95% and 99% levels, other than
a shift in mean level from one response to another. Thus, we only analyze width95
(Figure 4.7).
Figure 4.6: The main effects for width90, 95, and 99. The factors' order is n, p, σ,
Figure 4.7 shows the size of different effects. The factor n has the largest impact on
credible interval widths. The factors σ and junk variables are the second and third
largest effect factors while dimension p and error distribution have the fourth and fifth
largest effects. The correlation “r” shows some impact on width, with a slightly
narrower width when predictors are correlated. BART methods appear to have little or
no impact on width. The ordering of some factor effects by size, such as junk and p, can
The labelled effects in Figure 4.7 can be used to identify the relationship between
each factor and interval width. There is a negative relation between sample size n and
width. Reasonably, CI width decreases when n increases. The factors σ and p have a
positive relationship with width: it increases when they increase. The categorical
variables junk and error distribution give lower width when there are no junk variables
The width95 ANOVA is illustrated in Table 4.2, with rows ordered according to the
size of MS.
The p-values in Table 4.2 indicate that all main effects except method are significant,
and r has a smaller effect than the other five main effects. However, the first four effects
are all considerably larger than the other effects, based upon their mean squares. The
sample size n is the most significant factor: it has the largest MS, at least three times
larger than that of any other significant factor. In addition, the main effects have the largest
impact on width, with the five largest MS values. Although the factor junk has a larger
MS value than p, the ordering is reversed when considering SS. That is, the four levels
of p account for more overall variation than the two levels of junk, but when
standardized by the degrees of freedom, junk has the larger mean effect.
We now focus on the n:junk interaction, which is the 6th largest effect. Figure 4.8
shows the n:junk interaction. When n increases, width decreases, confirming the
inverse relation between n and width stated previously. The interaction
corresponds to the fact that the lines are not parallel. That is, when there are junk
Figure 4.9 displays the main effects for the predictive SSE. All factors have a visible
Figure 4.9: The main effects for SSE.
The factor n has the largest effect on SSE. The factors junk, σ, and p are the next
largest main effects. As with the coverage95 response, the ranking of effect size is
complicated by the varying number of levels for different factors. The rest of the main
effects are smaller but still visible in Figure 4.9. SSE decreases as n increases, p
decreases and σ decreases. SSE is smaller with no junk variables, and slightly smaller
The ANOVA for the response "predictive SSE" is shown in Table 4.3. As in the earlier
Df SS MS F-value Pr(>F)
n 3 8260741213 2753580404 2529.662 <2.2e-16 ***
junk 1 1149670834 1149670834 1056.181 <2.2e-16 ***
σ 3 2876836357 958945452 880.965 <2.2e-16 ***
n: junk 3 1654503795 551501265 506.654 <2.2e-16 ***
p 3 1595725522 531908507 488.654 <2.2e-16 ***
n: p 9 1279609758 142178862 130.617 <2.2e-16 ***
n: σ 9 1019100689 113233410 104.025 <2.2e-16 ***
error distribution 1 74723433 74723433 68.647 <2.2e-16 ***
p: junk 3 168766357 56255452 51.681 <2.2e-16 ***
method 1 52389336 52389336 48.129 4.495e-12 ***
junk: method 1 47800181 47800181 43.913 3.792e-11 ***
σ: error distribution 3 94367583 31455861 28.898 <2.2e-16 ***
p: σ 9 248525987 27613999 25.369 <2.2e-16 ***
n: method 3 77281290 25760430 23.666 3.322e-15 ***
σ: junk 3 62331280 20777093 19.088 2.640e-12 ***
r 1 12565677 12565677 11.544 0.0006850 ***
r: junk 1 10704712 10704712 9.834 0.0017228 **
n:r 3 19421017 6473672 5.947 0.0004809 ***
n: error distribution 3 13244277 4414759 4.056 0.0068745 **
p: r 3 10530418 3510139 3.2247 0.0216316 *
error distribution: method 1 1873202 1873202 1.721 0.1896402
σ: r 3 3744049 1248016 1.1465 0.3287928
p: method 3 3168696 1056232 0.9703 0.4056361
p: error distribution 3 1063607 354536 0.326 0.8067886
r: method 1 297215 297215 0.273 0.6013186
σ: method 3 566540 188847 0.174 0.9143665
r: error distribution 1 35733 35733 0.0328 0.8562321
junk: error distribution 1 2911 2911 0.003 0.9587577
residuals 5037 5482860544 1088517
Overwhelmingly, n has the most significant effect. Among the MS values of the six
largest effects, the MS of n is approximately twice as large as the next largest. Then
Now we examine interactions among the largest six effects. First, consider the n:
junk interaction displayed in Figure 4.10. At both levels of “junk”, there is an inverse
relationship between n and SSE. When there are junk variables, SSE is larger at small
values of n, and about the same as without junk variables, for large n.
In Figure 4.11, the interaction between n and p is evident from the non-parallel lines.
At all levels of n, there is a positive relationship between p and SSE: SSE is
smaller when p is smaller. The decreasing relationship between SSE and n changes
depending on the dimension p. As p increases, SSE values for small n become larger,
In this section, similarities and differences in the separate analyses of the responses
1- The effect of junk is a little different for coverage99 than for coverage90 and 95: it has no
effect on coverage99, while there is a small effect on both coverage90 and 95.
2- The effect sizes for each factor are approximately the same over the three
“coverage” responses. For the “width” response, the effect sizes are the same
over the 90, 95, and 99% levels. For both coverage and width, there is a shift in
3- The factor p has very low impact on coverage while it has a large influence on
4- At very low values of σ, such as 0.01, BART models have coverage far below the
nominal level.
5- Most of the factors have significant main effects for coverage except the
correlation “r”.
6- Most of the factors have significant main effects for width except BART method.
8- The main effects for width have the same pattern as the main effects for SSE.
9- The main effects for coverage have a similar pattern as for SSE, with two
exceptions: junk and method. Their effects on coverage are the opposite from
10- From points 8 and 9, we can see that although there is a very good relation
among these three responses, the relation between width and SSE is stronger
BART gives its worst coverage when n is very high and σ is small. This might
occur because the MCMC iterations do not mix well enough to explore the whole posterior
Chapter 3, our observations suggest the same thing: the coverage values are better
Generally, the BART method has a small effect on the coverage and SSE responses, and it
of using CV.BART, which is much slower than the default version. It represents the best
though that will cause Default.BART to run more slowly, it is still much faster than using
cross validation.
4.5 Comparison between the two, three, and seven-way interaction models
The linear model for assessing the main predictive effect of seven independent
In (4.1), each four-level factor (n, p and σ) has three predictors x_i and three parameters
otherwise, x_3 = 1 if n = 10000 and 0 otherwise. Each two-level factor, such as
junk and method, has only one predictor and one parameter. This gives a total of 3 + 3 + 3 + 4 × 1 = 13 main-effect parameters.
interaction as a product of two main-effect terms appearing in (4.1). Therefore, the
regression formula for a two-way interaction model contains all main-effect terms in
(4.1) plus 69 additional terms. Each term consists of a product of two main
effects and an associated parameter. Similarly, a three-way interaction effect is defined
as a product of three main factors. The three-way regression model
consists of the two-way regression model plus 193 extra terms. The 69 two-factor terms
We also consider a full model, in which all possible interaction effects are estimated.
This corresponds to a seven-way interaction model, with 1023 degrees of freedom for
effects. This full model is not considered for interpretation, but only to illustrate how
much variation is explained by the second and third order models. The seven-way linear
model consists of the three-way linear model plus 748 terms and their corresponding
parameters. These terms correspond to 35 + 21 + 7 + 1 = 64 combinations of factors of orders four through seven.
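These degree-of-freedom counts follow from multiplying the degrees of freedom of the factors in each combination. A short sketch (illustrative Python; the factor names are shorthand for those in D) reproduces 13, 69, 193 and the full-model total of 1023:

```python
from itertools import combinations
from math import prod

# Degrees of freedom for each factor: three 4-level factors (n, p, sigma)
# and four 2-level factors (junk, r, error distribution, method).
df = {"n": 3, "p": 3, "sigma": 3, "junk": 1, "r": 1, "error": 1, "method": 1}

def interaction_df(order):
    """Total df contributed by all `order`-way interaction terms."""
    return sum(prod(df[f] for f in combo)
               for combo in combinations(df, order))

print(interaction_df(1))                             # 13 main-effect df
print(interaction_df(2))                             # 69 two-way terms
print(interaction_df(3))                             # 193 three-way terms
print(sum(interaction_df(k) for k in range(1, 8)))   # 1023 in the full model
```

The full-model total also equals 4·4·4·2·2·2·2 − 1 = 1023, the number of factor-level combinations minus one.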
To decide whether to consider a model with just two-way interactions, both two-way
and three-way interactions, or all two-, three-, …, seven-way interactions, F tests were
conducted. These tests compare nested models, for example the full model with all interactions versus
a third-order model (main effects, two-way interactions and three-way interactions).
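The nested-model F tests compare residual sums of squares of a reduced and a full model. A generic sketch (the function is illustrative; the residual quantities from the thesis's fitted models are not reproduced here):

```python
def partial_f(ss_res_reduced, df_res_reduced, ss_res_full, df_res_full):
    """F statistic for comparing nested linear models:
    F = [(RSS_r - RSS_f) / (df_r - df_f)] / (RSS_f / df_f).
    Under H0 (the extra terms are all zero), F follows an
    F distribution with (df_r - df_f, df_f) degrees of freedom."""
    num = (ss_res_reduced - ss_res_full) / (df_res_reduced - df_res_full)
    den = ss_res_full / df_res_full
    return num / den

# Hypothetical numbers, for illustration only:
# partial_f(120.0, 5037, 100.0, 4913)
```

Large values of F indicate that the additional interaction terms explain more variation than expected by chance.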
For the test comparing the full model and the third-order model, the responses coverage95
and SSE had p-values of approximately 0.0005, while the response width95 had a p-
value of 0.82. All other comparisons between the full model and first- or second-order
models had p-values of essentially 0. The tests indicate that for all responses, interactions up to
the third order were significant, and for two of the three responses, higher order
Although third-order and some higher-order terms are significant, the largest effects
are main effects and two-way interactions. This is indicated by Table 4.4, which gives R²
values for first-, second-, third- and seventh-order models. There is a large increase in R²
when moving from a main-effect model to a second-order model. Although R² values
continue to increase for higher-order models, the increases are smaller and correspond
to a large number of degrees of freedom. For instance, there are 193 additional degrees
smaller increases in R², the analysis in this thesis focused on main effects and two-way
In this section, we examine the correlation between our seven responses. This
analysis for one response of this type. Table 4.5 shows the correlation values between
response coverage90 coverage95 coverage99 width90 width95 width99 SSE
The correlations among the coverage responses are similar and very strong. Coverage95
has the largest correlation with the other two coverages; therefore, choosing
coverage95 for analysis was reasonable. The correlation among all three width types is
0.99, so it was enough to present the analysis for only one width. Choosing the
between coverage and width, it does not exceed 0.4. SSE and the coverage variables
have almost no correlation. Width and SSE have a strong correlation of approximately
0.8.
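A minimal sketch of this calculation, assuming Table 4.5 reports ordinary sample (Pearson) correlations between the response values over the experimental runs:

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two response vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Applied to, say, the width95 and SSE columns of the results, this would produce the kind of correlation values summarized in Table 4.5.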
Chapter 5
Conclusion and future work
output by using predictor variables. In this thesis, we study the supervised learning
model BART, which models that population function to give a numeric value of
the response. It is a development of a Bayesian "sum of trees" model, where each tree
model uses MCMC to generate samples from the posterior distribution, enabling the
This thesis focuses on examining the accuracy of BART CIs across various factors such
as the sample size "n", the noise standard deviation "σ", and the error distribution. To
conduct the simulation study, we selected a full factorial design, which varies the levels
of the factors systematically and is easy to analyze. ANOVA was used to analyze the seven
responses studied here: three sorts of coverage (90, 95,
99)%, three sorts of width (90, 95, 99)%, and the SSE of prediction.
The analysis in Chapter 4 shows that the credible intervals from CV.BART and
Default.BART are similar over all factor combinations. The effect of the type of BART
method on coverage and SSE responses is very small, and there is no effect on width.
Thus, we suggest using Default.BART since it is much faster than CV.BART. Recall that
for 5-fold CV, with 24 combinations of prior parameters, BART.CV requires 120 runs of
BART.
A very important conclusion is that the coverage of BART CIs was significantly
affected by the noise standard deviation σ and sample size n. It was poor at very low values
of σ and very high values of n. This may be because of poor mixing of the MCMC iterations
at large values of n: they do not cover or explore the entire posterior distribution well. In
suggests this result. That small experiment shows that coverage gets better when BART
We recommend increasing the number of MCMC iterations, which could improve the
coverage of BART CIs. Although this will cause Default.BART to take more time, it will
This research suggests multiple directions for future work. We could further examine
the effects of various factors on BART CIs. The first suggestion is studying the
influence of a larger number of MCMC iterations, such as 4000, 10000, and 20000
iterations. In Chapter 4 we saw that coverage was not always good with MCMC
iterations = 4000. Increasing the MCMC iterations to large values such as 10000 and
20000 should give better coverage and might also improve CI width and SSE. A second
factor which could be examined is the prior parameters. The idea of utilizing the prior
Default than what CGM give. At each row of the design matrix, there would be a particular
combination of prior parameters, and by analyzing the results we could determine the
best combination.
A further direction is to use a completely different model than BART to estimate
credible intervals. Treed Gaussian processes (Gramacy & Lee, 2008), Bayesian
generalized additive models (Hastie & Tibshirani, 2000) and random forests (Breiman,
2001) are all examples of statistical learning models that give credible or confidence
intervals. We could then gauge the accuracy of these models' CIs across all the factors
of the designed experiment. There are also newer versions of random forests and of the
BART model. Wager, Hastie & Efron (2014) built a random forests model based on the
jackknife and infinitesimal jackknife, which uses the variance estimates for bagging
proposed by Efron (1992, 2014). Pratola (2012) created a version of BART whose MCMC
algorithm mixes better than CGM's; it should therefore show better CI coverage than the
implementation of BART used in this thesis.
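The delete-one jackknife that underlies those variance estimates is easy to state. A generic sketch of the ordinary jackknife for a simple statistic (not Wager et al.'s infinitesimal jackknife for forests):

```python
import statistics

def jackknife_variance(xs, stat):
    """Delete-one jackknife variance estimate for the statistic stat."""
    n = len(xs)
    reps = [stat(xs[:i] + xs[i + 1:]) for i in range(n)]   # leave-one-out replicates
    rep_mean = sum(reps) / n
    return (n - 1) / n * sum((r - rep_mean) ** 2 for r in reps)

data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.0, 2.5, 3.7]
jack_var = jackknife_variance(data, statistics.mean)
# For the sample mean, the jackknife reproduces the usual estimate s^2 / n exactly.
```

For bagged predictors such as random forests, the same idea is applied with the bootstrap replicates standing in for the leave-one-out fits.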
Another direction for future work is studying the influence of the seven factors on
BART CIs, but with different factor levels, for instance larger values of the sample size
and predictor dimension. This would consume more time, but it could give additional
insight. We could also choose a noise distribution that is more different from the
Normal distribution.
Fractional factorial design is another possibility for future work, instead of conducting
a full factorial design. In the full factorial design, all possible combinations of the factor
levels are run. A fractional factorial design is a fraction of the full factorial design,
reducing its size. Fractional factorials of 2-level designs are well studied; fractional
factorials for factors with 3 or more levels can be constructed, but they are more
complex than 2-level fractional factorials. Fractional factorial designs will have some
aliasing of estimated effects. Since estimating two-way interaction effects is a concern
in this thesis, we should use a fractional factorial design with no aliasing, or at least
minimal aliasing, of two-way interactions.
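As a sketch of the idea, assuming for illustration seven 2-level factors (the thesis factors are not all at 2 levels), a half fraction can be generated from the defining relation I = ABCDEFG:

```python
from itertools import product

# Half fraction of a 2^7 factorial via the defining relation I = ABCDEFG:
# choose six factors freely and set the seventh to their product.
runs = []
for a, b, c, d, e, f in product([-1, 1], repeat=6):
    runs.append((a, b, c, d, e, f, a * b * c * d * e * f))

print(len(runs))  # 64 runs instead of the full factorial's 2**7 = 128
# The defining word has length 7 (resolution VII), so two-way interactions
# are aliased only with five-way interactions and remain cleanly estimable.
```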
In the experiments described in Chapter 4, we noticed that for some jobs which have
the same combination of factors, there is variation in their execution time. Usually this
variation is between 1 and 1.5 hours, while in some extreme cases it is around three
hours. For instance, the 4th and 132nd parallel jobs have the same combination of
factors but very different execution times. It would probably be suitable to add a job's
execution time as a new response. The effect of the factors on execution time, and also
the random variation in execution time, might give insight into the computational cost
of running BART.
Bibliography
[1] Anand, K. (2015) An Expected Improvement Criterion for the Global Optimization of
[4] Chipman, H., George, E. & McCulloch, R. (2010) BART: Bayesian Additive Regression
Trees. The Annals of Applied Statistics. 4, 266–298.
[7] Efron, B. (2014) Estimation and Accuracy After Model Selection. Journal of the
American Statistical Association. 109, 991–1007.
[8] Friedman, J.H. (2001) Greedy Function Approximation: A Gradient Boosting Machine.
The Annals of Statistics. 29, 1189–1232.
[9] Friedman, J.H. (1991) Multivariate Adaptive Regression Splines. The Annals of
Statistics. 19, 1–67.
[10] Gramacy, R.B. & Lee, H.K.H. (2008) Bayesian Treed Gaussian Process Models with
an Application to Computer Modeling. Journal of the American Statistical Association.
103, 1119–1130.
[11] Hastie, T. & Tibshirani, R. (2000) Bayesian Backfitting. Statistical Science. 15, 196–
223.
[12] James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical
Learning with Applications in R. New York: Springer.
[13] Montgomery, D. C. (2013) Design and Analysis of Experiments. (8th ed). USA: John
Wiley & Sons.
[16] Wager, S., Hastie, T. & Efron, B. (2014) Confidence Intervals for Random Forests:
The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research. 15,
1625–1651.