
Assessing the Accuracy of Bayesian Additive

Regression Tree Credible Intervals

by

Fatimah AL Ahmad

Thesis
submitted in partial fulfillment of the requirements for
the Degree of Master of Science (Mathematics and Statistics)

Acadia University
Winter Convocation 2016

© by Fatimah Mohammed AL Ahmad, 2016


This thesis by Fatimah AL Ahmad was defended successfully in an oral examination on
1st April, 2016.

The examining committee for the thesis was:

________________________

Dr. Harish Kapoor, Chair

________________________

Dr. Thomas Loughin, External Reader

________________________

Dr. Wilson Lu, Internal Reader

________________________

Dr. Hugh Chipman, Supervisor

_________________________

Dr. Ying Zhang, Acting Head

This thesis is accepted in its present form by the Division of Research and Graduate
Studies as satisfying the thesis requirements for the degree Master of Science
(Mathematics and Statistics).

………………………………………….

I, Fatimah AL Ahmad, grant permission to the University Librarian at Acadia University to
reproduce, loan or distribute copies of my thesis in microform, paper or electronic
formats on a non-profit basis. I, however, retain the copyright in my thesis.

_________________________

Fatimah AL Ahmad, Author

_________________________

Dr. Hugh Chipman, Supervisor

_________________________

Date

Contents

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 BART method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 The BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Bayesian estimation of the BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 BART prior parameters and default values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3.1 Prior variance on 𝜇𝑖𝑗 |𝑇𝑖 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2 The σ prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.3.3 The choice of m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 BART models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.1 CV.BART mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4.2 Default.BART model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 The simulation study and setting up the experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1 The experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.1 The setting of experimental factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.2 Additional variables not treated as factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.1.3 Response variables for the experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2 Choosing the number of MCMC iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 The simulated function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.4 Parallel computing mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1 Analysis of coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Analysis of width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.3 Analysis of SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.4 Similarities and differences between responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.5 Comparison between the two, three, and seven-way interaction models . . . . . . . . 65

4.6 The correlation between the responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

List of Figures

2.1 A simple realization of g(x; T; M) in the BART model . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 The simulated function and data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 BART uncertainty prediction. Training data are plotted as individual points . . . . . . 12

2.4 The prior σ distribution when 𝜎̂ = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 The relation between MCMC samples and different responses (width, coverage, SSE)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Posterior samples of σ versus MCMC iterations number, for training samples of size

1000 and 10000. Burn-in samples are red . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3 Six realizations of randomly simulated f(x) in two dimensions, as described in the

text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 The main effects for coverage90, 95, and 99 respectively. The factors’ order along

the horizontal axis is n, dimension p, σ, predictor correlation r, junk variables, error

distribution, and BART method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Main effects for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 The n: σ effect on coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.4 The p: σ effect for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5 The n: method effect for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.6 The main effects for width90, 95, and 99. The factors order is n, p, σ, predictor

correlation r, junk variables, error distribution, and method . . . . . . . . . . . . . . . . . . . . . . 55

4.7 The main effects for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8 n: junk effect for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.9 The main effects for SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.10 The n: junk effect on SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.11 The n: p effect for predictive SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

List of Tables

2.1 Mean of BART CI coverage, width and SSE over 500 test points, for 10 replicates . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 The prior parameters’ combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 Summary of the factors of the designed experiment . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Predictive performance at various MCMC iterations (N) and sample sizes (n) . . . . 36

3.3 The 8 rows of D in the first job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.1 ANOVA table for coverage95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 The ANOVA table for width95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 ANOVA table for SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.4 The R² of interaction models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.5 The correlation between responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Abstract

A common type of supervised learning problem is to use training data to estimate a

predictive model for a numeric response. Many supervised learning models such as

Bayesian Additive Regression Trees (BART) try to flexibly model the data. This Bayesian

“sum of trees” model uses MCMC backfitting to simulate posterior samples. BART also

provides credible intervals (CIs) for prediction.

This thesis studies the accuracy of BART credible intervals and analyzes various

factors’ effects on it. These factors include the sample size, dimension, noise standard

deviation, predictors’ correlations, junk variables, type of error distribution, and BART

method. Simulation is used to compute CI accuracy with a designed experiment that

systematically varies the factors to find their effects. Analysis of experimental results

gives conclusions about BART CI accuracy. It is found to depend considerably on sample

size and error variance.

Acknowledgements

I would like to thank the Ministry of Higher Education in the Kingdom of Saudi Arabia

for financial support that enabled me to complete my studies. A special thanks to Dr.

Chipman who gave me the opportunity to work under his supervision. I am grateful for

all his effort, support, encouragement, and suggestions which helped me to finish this

thesis experiment and writing. I would also like to thank the faculty, staff and students

of Acadia University, specifically those who work and study in the Department of

Mathematics and Statistics.

To my husband, my entire family, and all my friends: I am really thankful and

appreciate all your assistance in different aspects of my life. Finally, thank you to

Canada, which has welcomed us among its international students and families, and

thank you to all the kind people we have met here who made us feel at home.

Chapter 1
Introduction

An important statistical problem is inference and modelling for an unknown function

f to predict a numeric response Y by using a p-dimensional vector x of predictors. This

kind of supervised learning problem, with a numeric response, is often referred to as a

regression model. Training data, consisting of (x, Y) pairs, are used to “learn” or

estimate the unknown function f. Supervised learning models have various kinds of

structure. For instance, they can be parametric, like a linear regression model, or

nonparametric, like a decision tree or a random forest. The supervised learning model studied in this thesis is

the nonparametric “Bayesian Additive Regression Trees” (BART), developed by

Chipman, George and McCulloch (2010), hereafter referred to as CGM (2010).

CGM (2010) develop BART as a Bayesian “sum of trees” model. A large number

(typically 50-200) of decision trees are estimated in such a way that their sum is an

accurate prediction of the response. It is an ensemble method, like bagging and random

forests, in which a large number of decision trees are combined into a prediction model. It

uses a Bayesian MCMC algorithm that generates simulated samples from the posterior

distribution for fitting and inference.

The Bayesian specification of BART provides posterior distributions that can be used

to quantify uncertainty in predictions for the response at an input value. In particular,

MCMC samples enable the construction of credible intervals (CIs) for the prediction of

response Y at input x. The objective of this thesis is to examine the accuracy of BART's

uncertainty prediction under the effect of various factors. These factors are mostly

related to properties of the training data, and include the sample size n, noise standard

deviation σ, dimension p, predictors’ correlation r, junk variables, types of error

distribution, and BART method. These factors have different numbers of levels (2 or 4)

and different types (numeric or categorical).

To study the accuracy of BART CIs under all these factors, a simulation study

is conducted in which the accuracy measures can be computed with a known

response function. A designed experiment in the seven factors is chosen to carry out

the study. A designed experiment enables the analysis of the factors’ influences on the

various responses. To keep the design and analysis as simple as possible, a full factorial

design is selected to conduct this study.

For the analysis, ANOVA tables are used to draw conclusions about BART CI

accuracy and the extent to which the factors affect the responses. We study

main effects and two-way interactions, which explain most of the variation in the

responses. We use three kinds of responses that measure performance of BART CIs:

coverage, width, and the SSE of prediction. The total number of responses is seven,

corresponding to coverage at three levels (90, 95, 99%), width at three levels (90, 95,

99%), and the predictive SSE.

The remainder of this thesis is organized as follows: Chapter two reviews the

background of the BART model which is utilized here to fit and model the population

function f. Chapter three describes the simulation study and the details of setting up

the experiment, such as the factors and their levels, the generated function f, and the

management of the experiment's runs as “parallel jobs” on a compute cluster. Chapter

four presents the analysis for three of the seven responses: coverage95, width95, and

the predictive SSE. These give a good representation of the analysis of all responses,

since the results for levels 90, 95, and 99% are similar to each other. In Chapter five, we conclude

with the most important results and discuss future work.

Chapter 2
BART methods

This chapter has four sections. The first and second sections give background on

BART models, BART formulas, and BART credible intervals. The third section discusses

the choice of BART prior parameters. Section four describes two different versions of

the BART model: CV.BART and Default.BART.

2.1 The BART model

Here is a brief explanation of the BART model. The population model is given as

(2.1) y = f(x) + ε,   ε ~ N(0, σ²).

The BART model estimates the population model (2.1) where the conditional mean at a

specific point x equals the true function f(x),

(2.2) y = E(y|x) + ε.

The E(y|x) of BART can be expressed as a sum of decision trees

(2.3) y = g(x; T₁, M₁) + ... + g(x; T_m, M_m) + ε.

In (2.3) the constant m determines the number of decision trees used in the sum of

trees. Each g represents the output from a different tree. That is, 𝑇1 , 𝑇2 , … , 𝑇𝑚 are

different trees, and associated with tree 𝑇𝑖 is a vector of terminal node parameters 𝑀𝑖 .

Suppose there are 𝑏𝑖 terminal nodes in 𝑇𝑖 . Then 𝑀𝑖 = (𝜇𝑖1 , 𝜇𝑖2 , … , 𝜇𝑖𝑏𝑖 ). The function

“g” is a generic function that takes a predictor vector x, a tree T and a set of terminal

node parameters M, and returns the output of the tree for input x. Thus, a model

tree T_i defines a binary splitting on the x variables, and following its decision rules

leads to one of the μ values in M_i.

Anand (2015) illustrated how a single tree T produces an output. We summarize this

example here. Figure 2.1 shows a tree model with three terminal nodes. The

parameter T represents the tree structure and the two decision rules 𝑥5 < 1 and 𝑥2 < 4.

The parameter M = (𝜇1 , 𝜇2 , 𝜇3 ) = (-2, 5, 7) represents the outputs from the three

terminal nodes. The input x = (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 , 𝑥5 ) = (1.1, 5.4, 0.1, 2.3, 0.5) would lead to

prediction 𝜇2 = 5 since we branch left on 𝑥5 = 0.5 < 1 and then right on 𝑥2 = 5.4 ≥ 4.
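As a concrete illustration, the following R sketch (our own illustration, not code from CGM or Anand) hard-codes the tree of Figure 2.1 and evaluates it at the input above:

    # g(x; T, M) for the Figure 2.1 tree: rules x5 < 1 and x2 < 4,
    # terminal node parameters M = (-2, 5, 7).
    g.example <- function(x) {
      if (x[5] < 1) {              # branch left on x5
        if (x[2] < 4) -2 else 5    # mu_1 = -2, mu_2 = 5
      } else {
        7                          # mu_3 = 7
      }
    }
    x <- c(1.1, 5.4, 0.1, 2.3, 0.5)
    g.example(x)                   # returns 5, the prediction mu_2 above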

The parameters of BART (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ..., 𝑀𝑚 , 𝜎) are unknown and must be

estimated from training data. Because these parameters are estimated, there is

statistical uncertainty in these estimates. This leads to uncertainty in functions of the

parameter, such as f(x), the predicted mean response at a particular input x. CGM use

Bayesian methods to estimate the parameters and quantify uncertainty. This will be

described in the next section.

Figure 2.1: A simple realization of g(x; T; M) in the BART model.

2.2 Bayesian estimation of the BART model

All statistical learning models share the same aim of producing a good prediction for

the response y at any new point x. To estimate the BART model from training data,

Bayesian methods are used. In a Bayesian analysis, we must specify a prior distribution

P(θ) for the parameter θ. Inference combines the likelihood P(y|θ) with the prior

distribution P(θ) through Bayes' theorem to give the posterior distribution

P(θ|y) = P(y|θ) P(θ) / P(y),

where 𝑃(у) is the marginal probability function of y.

For BART, the parameter vector 𝜃 is the BART parameters (𝑇1 , ..., 𝑇𝑚 , 𝑀1 , ...,

𝑀𝑚 , and 𝜎). Prior probability distributions for the parameters will be discussed later in

Section 2.3, after outlining MCMC for BART.

Markov chain Monte Carlo (MCMC) is the computational technique used to calculate

posterior distributions for the parameters and quantify uncertainty. To show the

mechanism of MCMC, let us suppose that the objective is to find the posterior mean of

parameter 𝜃1 , where 𝜃1 is the first element of the θ vector and let us say the total

number of parameters is z. The posterior mean is given as

E(θ₁ | y) = ∫ θ₁ P(θ | y) dθ₁ dθ₂ ⋯ dθ_z.

For large z, the integral is difficult to evaluate. Instead, we construct a Markov

chain that generates N samples from the posterior distribution P(θ | y), denoted

θ^(1), θ^(2), …, θ^(N). Samples of θ₁ are then obtained by taking the first element of

each vector θ^(i), i = 1, …, N. The posterior mean of θ₁ is approximated by the

sample mean of θ₁^(1), θ₁^(2), …, θ₁^(N).

With the BART model we are not interested in posterior distributions for individual

trees (𝑇𝑗 's) or terminal node parameters (𝑀𝑗 's). We are more interested in a posterior

distribution for the predictions, which is a function of the parameters. Analytically

obtaining a posterior distribution for predictions is not possible because this would

involve integration of the posterior distribution over all these parameters. However, it

is straightforward to compute the posterior for f(x) using the MCMC samples. We

compute f(x) for each MCMC sample, and these sampled f(x) values then correspond to

samples from the posterior for f(x). Credible intervals (CIs) can also be obtained from

MCMC samples of the posterior. For instance, for a parameter 𝜃1 , a 95% CI could be

computed as the 2.5% and 97.5% quantiles of the MCMC samples of 𝜃1 . The

computational details of MCMC, such as the number of MCMC iterations, and the

amount of burn-in and thinning for the chain, will be discussed later in Chapter 3.

The MCMC sample size has to be adequate to give good estimates of BART CIs;

the CIs should be more accurate with a larger posterior sample. For that reason,

Chapter 3 reports a trial examining whether BART CIs are estimated better with 10,000

MCMC samples than with smaller numbers such as 4000 and 1000.

Two standard adjustments are applied to the MCMC samples: burn-in and

thinning. The Markov chain cannot give draws from the stationary posterior

distribution at its earliest iterations, because it still depends strongly on the starting

values. During the burn-in period of MCMC, samples are discarded until the chain

appears to have converged to a stationary distribution.

Once the MCMC has converged, sampled values may still be autocorrelated.

Thinning, which is the discarding of some MCMC samples, can reduce dependence and

use less memory to store sampled values. In our implementation, a fifth of MCMC

draws are discarded as a burn-in while the rest are thinned. Suppose that the MCMC

iterations are labelled 1, 2, ..., i, ..., 10,000, and the corresponding sampled parameters

are θ^(1), θ^(2), ..., θ^(i), ..., θ^(10000). MCMC iteration i samples θ^(i) using a Markov chain,

conditional on θ^(i−1). Although the MCMC is constructed so that it generates samples

from the posterior distribution of θ, because it is a Markov chain, θ^(i) values will be

correlated with θ^(i−1) values. When this correlation is high, it is not necessary to save θ^(i)

for all values of i. That is, if we save θ^(i), then θ^(i+1) will have values quite similar to

θ^(i). The correlation between MCMC samples decreases as there are more iterations

between them. The process of keeping every k-th sample is called "thinning" and it is a

way of reducing computer storage, while keeping most of the information contained in

the MCMC samples. In our use of BART, we keep every tenth MCMC sample from the

posterior, discarding the other 9 samples.

BART generates credible intervals for f(x) using the MCMC samples. At a particular

x, every MCMC sample gives a corresponding value of f(x), so the draws together form

a sample from the posterior distribution of the mean function f(x) at that x. Taking

quantiles of these f(x) values yields a credible interval for f(x) at that specific x. For

example, a 95% CI is obtained from the 2.5% and 97.5% quantiles of the MCMC

samples of f(x) at a particular x.
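Concretely, if fx.samples holds the saved posterior draws of f(x) at one test point (a hypothetical variable name; the rnorm call below is only a placeholder so the lines run on their own), the interval is one line of R:

    fx.samples <- rnorm(400, mean = 1.2, sd = 0.1)          # placeholder draws
    ci95 <- quantile(fx.samples, probs = c(0.025, 0.975))   # 95% CI for f(x)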

The new work of this thesis is studying the credible interval properties: coverage,

width, and sum of squared errors “SSE”. We calculate all three quantities by using a test

data set. This test set is from a simulation, so it consists of values of x and f(x) for a large

sample of inputs. Thus, we know the actual value of f(x). This is necessary for

computing coverage and SSE. Suppose BART has a CI at 𝑥𝑖 given by the CI lower bounds

“LB(x_i)” and upper bounds “UB(x_i)”. Equations (2.4)–(2.6) define the three CI

properties.

(2.4) coverage = (1/n.test) ∑_{i=1}^{n.test} I( LB(x_i) < f(x_i) < UB(x_i) ),

(2.5) width = (1/n.test) ∑_{i=1}^{n.test} ( UB(x_i) − LB(x_i) ),

and

(2.6) SSE = ∑_{i=1}^{n.test} ( f(x_i) − ŷ_i )².

In our experiments in Chapter 4, we use n.test = 10⁴.
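A minimal R implementation of (2.4)–(2.6), assuming vectors LB, UB, f.true (the known f(x_i) values) and yhat (the point predictions) over the n.test test points (all names our own):

    # CI properties over a test set, following (2.4)-(2.6).
    ci.properties <- function(LB, UB, f.true, yhat) {
      c(coverage = mean(LB < f.true & f.true < UB),  # (2.4)
        width    = mean(UB - LB),                    # (2.5)
        SSE      = sum((f.true - yhat)^2))           # (2.6)
    }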

To illustrate BART credible intervals and their properties, consider the following

example. Suppose the true function is f(x) = sin(2πx) + 2·I(x > 0.5) + exp{2.5x − 0.5},

where I(·) is the indicator function. Forty data points are generated randomly: the

predictor x has a uniform distribution, and the response is generated as f(x) plus

i.i.d. N(0, 0.25²) random errors. Figure 2.2 shows f(x).

Figure 2.2: The simulated function and data set.

We use Default.BART with 1100 MCMC samples to estimate the model and predict at

500 test points. Predictions of f(x) and 90% credible intervals are shown in Figure 2.3. It

shows that the BART fit is a step function, where each step corresponds to a change

in the output of one of the m trees.
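The fit can be reproduced along the following lines with the BayesTree package (a sketch: the exact seed and simulated data behind the figures are not recorded here, and f is the reconstruction given above):

    library(BayesTree)
    set.seed(1)                                   # hypothetical seed
    f <- function(x) sin(2*pi*x) + 2*(x > 0.5) + exp(2.5*x - 0.5)
    x.train <- runif(40)                          # forty training points
    y.train <- f(x.train) + rnorm(40, sd = 0.25)
    x.test  <- seq(0, 1, length.out = 500)        # 500 test points
    fit <- bart(x.train = matrix(x.train), y.train = y.train,
                x.test = matrix(x.test),
                ndpost = 1000, nskip = 100)       # 1100 MCMC iterations total
    # Pointwise 90% credible intervals from the posterior draws of f(x):
    ci <- apply(fit$yhat.test, 2, quantile, probs = c(0.05, 0.95))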

Figure 2.3: BART uncertainty prediction. Training data are plotted as individual points.

Table 2.1 summarizes 10 replications of the experiment described above. In each

replicate, a different training set is generated, and CIs are obtained at 500 test points.

The mean values of coverage, width and SSE over the 500 test points are shown in

Table 2.1. There is considerable variation in the values, due to the randomness of the

different samples. This suggests that in experiments to study CI properties, it will be

necessary to simulate multiple data sets for each combination of factors. This is

discussed further in Section 3.1.2.

Properties of BART 90% CIs

replicate   coverage   width   SSE
1           0.926      0.702   41.289
2           0.864      0.674   29.240
3           0.874      0.664   67.795
4           0.972      0.872   16.255
5           0.924      0.734   27.493
6           0.826      0.638   56.190
7           0.954      0.760   52.083
8           0.902      0.653   31.152
9           0.956      0.840   23.738
10          0.810      0.609   52.074
mean        0.901      0.715   39.731

Table 2.1: Mean of BART CI coverage, width and SSE over 500 test points, for 10 replicates.

2.3 BART prior parameters and default values

As described in the previous section, BART is a Bayesian “sum of trees” model.

CGM developed a specification for the prior that depends on four key prior

parameters. One of these parameters is related to the prior for the means μ_ij, two

others are related to the prior for σ, and the fourth controls the number of trees in

the BART model. These parameters are k, ν, q, and m, respectively.

The prior distributions are an essential part of the BART model, in particular for the

sum-of-trees components (T₁, M₁), …, (T_m, M_m) and σ. CGM factor the joint prior on all

parameters as

(2.7) P((T₁, M₁), …, (T_m, M_m), σ) = [ ∏_i P(M_i, T_i) ] P(σ) = [ ∏_i P(M_i | T_i) P(T_i) ] P(σ),

(2.8) P(M_i | T_i) = ∏_j P(μ_ij | T_i).

In (2.7), CGM assume that in the prior, different trees (and their terminal node

parameters) are independent from each other. In (2.8), given tree 𝑇𝑖 , the terminal node

parameters of that tree are conditionally independent.

To describe the prior parameters, we discuss each one individually in a separate

section. Unless otherwise noted, the prior specifications are summaries of priors

developed by CGM.

2.3.1 Prior variance on μ_ij | T_i

The prior P(μ_ij | T_i) is specified as normal with mean μ_μ and variance σ_μ². The

sum-of-trees formula (2.3) shows that E(y|x) equals the sum of m μ_ij's, so the prior

mean and variance of E(y|x) are m times those of a single μ_ij, since the μ_ij's are

i.i.d. Thus, the prior for E(y|x) is N(mμ_μ, mσ_μ²). CGM assume that the E(y|x) values

mostly fall between the minimum and maximum of the observed y data. They suggest

specifying μ_μ and σ_μ² so that N(mμ_μ, mσ_μ²) gives most of the prior probability to

the range (y_min, y_max). This choice of μ_μ and σ_μ can be obtained by setting

y_min = mμ_μ − k√m σ_μ and y_max = mμ_μ + k√m σ_μ.

CGM suggest values of 1, 2, 3, or 5 for the parameter k. Each value of k specifies a

particular prior probability that E(y|x) falls within the interval (y_min, y_max). For

instance, k = 2 gives approximately 95% prior probability.

This strategy uses the minimum (y_min) and maximum (y_max) of the data to define

a probability for the range of plausible E(y|x) values. To apply it, we shift and rescale

the y values so that y_min = −0.5 and y_max = 0.5. We then specify the prior mean and

standard deviation of μ_ij as μ_μ = 0 and σ_μ satisfying k√m σ_μ = 0.5, yielding

(2.9) μ_ij ~ N(0, σ_μ²), where σ_μ = 0.5/(k√m).
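For example, under the default choices k = 2 and m = 200 discussed below, each terminal node parameter receives a very tight prior:

    k <- 2; m <- 200                  # default values suggested by CGM
    sigma.mu <- 0.5 / (k * sqrt(m))   # prior sd of each mu_ij, about 0.0177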

From (2.9), k can be viewed as a scaling and shrinkage parameter. It keeps the effect

of each individual tree in (2.3) small by shrinking the μ_ij toward zero. As m and k

increase, the μ_ij's have smaller prior variance, and the posterior distribution of each

μ_ij is concentrated on values close to zero.

CGM describe two methods to specify k. One uses cross-validation (the CV.BART

method discussed later in Section 2.4). In the other method, default values are chosen;

CGM suggest k = 2 as a reasonable default. The corresponding Default.BART method is

described later in Section 2.4.

2.3.2 The σ prior

The prior parameters for σ are the degrees of freedom ν and the quantile q, which are

used in specifying a scaled inverse chi-squared distribution for σ². CGM specify the

prior on the residual variance σ² as σ² ~ νλ/χ²_ν. Each of ν and λ has a particular task:

ν determines the spread of the prior, and λ determines its location. Small values of ν

correspond to greater spread in the prior distribution. CGM

suggest choosing ν between 1 and 10, and then choosing λ so that an upper percentile

of the σ prior (say q = 75%, 90% or 99%) corresponds to 𝜎̂, a rough estimate of σ. CGM

suggest that this rough estimate of σ be obtained in one of two ways: either the residual

standard deviation from a linear regression, or a proportion (such as 20%) of the sample

standard deviation of the y values.

CGM suggest three combinations of the parameters (ν, q): (10, 0.75), (3, 0.90), and

(3, 0.99), called conservative, default, and aggressive, respectively. Figure 2.4 shows

the σ prior for all three (ν, q) settings when σ̂ = 2; this specification encodes the belief

that σ < 2. Figure 2.4 also illustrates an inverse relationship between q and σ: when q

increases, the prior on σ moves toward smaller values.

Figure 2.4: The prior σ distribution when 𝜎̂ = 2.


We can choose among the three values of (ν, q) by using cross-validation or by

using a default value. CGM suggest the default setting (3, 0.90).
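Given ν, q and the rough estimate σ̂, the scale λ follows by inverting the condition P(σ < σ̂) = q under σ² ~ νλ/χ²_ν, which gives λ = σ̂²·χ²_{ν,1−q}/ν. A small R sketch (our own, not CGM's code):

    # lambda such that the sigma prior puts probability q below sigma.hat.
    lambda.from.q <- function(sigma.hat, nu, q) {
      sigma.hat^2 * qchisq(1 - q, df = nu) / nu
    }
    lambda.from.q(sigma.hat = 2, nu = 3, q = 0.90)  # the default (3, 0.90) setting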

2.3.3 The choice of m

The parameter m determines the number of trees in the BART model. CGM suggest

trying only two values, m = 50 and m = 200, chosen either by cross-validation or by

using the default value m = 200. In checking the performance of this choice, CGM

observed that BART prediction improves as m increases up to a certain point, beyond

which performance slowly degrades. Therefore, m should be neither too small nor

unnecessarily large. The values 50 and 200 are typically large enough to give

reasonable performance.

2.4 BART models

This research uses two versions of the BART model, distinguished by the way in

which the four key prior parameters are specified: either the default settings of CGM

are used, or the parameters are chosen by cross-validation. These versions are

Default.BART and CV.BART. Section 2.4.1 describes the algorithm for CV.BART,

while Section 2.4.2 explains the Default.BART model.

2.4.1 CV.BART mechanism

The implementation of cross-validation is similar to that given in Section 5.1 of CGM.

The CV.BART algorithm is outlined below. It begins with a training and a testing dataset

(simulated, in the case of this thesis).

1- The prior parameter combinations are set up as a small designed experiment with

all factorial combinations of

k = 1, 2*, 3, 5

(ν, q) = (10, 0.75), (3, 0.90)*, (3, 0.99)

m = 50, 200*.

Default values are indicated by “*”.

The number of these combinations is 4 × 3 × 2 = 24; Table 2.2 shows them.

2- Divide training data randomly into five sets of roughly equal size, denoted

C₁, C₂, C₃, C₄, C₅. Denote the set of all training data except C_j by C_(j), for j = 1, …, 5.

The prior parameters
i    k    (ν, q)    m
1 1 (10, 0.75) 50
2 2 (10, 0.75) 50
3 3 (10, 0.75) 50
4 5 (10, 0.75) 50
5 1 (10, 0.75) 200
6 2 (10, 0.75) 200
7 3 (10, 0.75) 200
8 5 (10, 0.75) 200
9 1 (3, 0.90) 50
10 2 (3, 0.90) 50
11 3 (3, 0.90) 50
12 5 (3, 0.90) 50
13 1 (3, 0.90) 200
14 2 (3, 0.90) 200
15 3 (3, 0.90) 200
16 5 (3, 0.90) 200
17 1 (3, 0.99) 50
18 2 (3, 0.99) 50
19 3 (3, 0.99) 50
20 5 (3, 0.99) 50
21 1 (3, 0.99) 200
22 2 (3, 0.99) 200
23 3 (3, 0.99) 200
24 5 (3, 0.99) 200

Table 2.2: The prior parameters’ combinations.

3- For the BART parameters in one of the 24 rows of Table 2.2, train 5 BART

models. The jth model (j = 1, …, 5) is trained using C_(j). The prediction for

observation i using model j is denoted Ŷ_i^(j), j = 1, …, 5.

4- The validation SSE corresponding to a row of Table 2.2 is

(2.11) SSE = ∑_{j=1}^{5} ∑_{i∈C_j} ( Y_i − Ŷ_i^(j) )².

Note that the predicted values are for the fold of data not used in training.

5- The validation 𝑆𝑆𝐸 will be calculated for each of the 24 rows of Table 2.2. This

involves fitting a total of 5 x 24 = 120 BART models.

6- Determine the best CV.BART combination of prior parameters by finding the row

z among the 24 combinations in Table 2.2 that has the lowest value of 𝑆𝑆𝐸.

Thus, z represents the best CV.BART model over 24 CV.BART models.

7- Using the parameters from row z, re-estimate the BART model on the full training

data set, without splitting.

8- Use the fitted model from step 7 to make predictions for the test data. Then

calculate the credible interval coverage, credible interval width, and error sum of

squares “SSE”. These quantities are defined in Section 2.2.
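A condensed R sketch of steps 2–6 follows; it assumes x.train (a matrix) and y.train hold the training data, and all other names are our own. The 24-row params table reproduces Table 2.2, and the short MCMC runs (100 burn-in, 400 sampling) match the settings described later in Section 3.1.2:

    library(BayesTree)
    nuq    <- data.frame(nu = c(10, 3, 3), q = c(0.75, 0.90, 0.99))
    grid   <- expand.grid(k = c(1, 2, 3, 5), m = c(50, 200), nq = 1:3)
    params <- data.frame(k = grid$k, nu = nuq$nu[grid$nq],
                         q = nuq$q[grid$nq], m = grid$m)   # Table 2.2
    fold   <- sample(rep(1:5, length.out = nrow(x.train))) # step 2
    cv.sse <- numeric(24)
    for (z in 1:24) {                                      # steps 3-5
      sse <- 0
      for (j in 1:5) {
        fit <- bart(x.train = x.train[fold != j, , drop = FALSE],
                    y.train = y.train[fold != j],
                    x.test  = x.train[fold == j, , drop = FALSE],
                    k = params$k[z], sigdf = params$nu[z],
                    sigquant = params$q[z], ntree = params$m[z],
                    ndpost = 400, nskip = 100, verbose = FALSE)
        pred <- colMeans(fit$yhat.test)                    # fold-j predictions
        sse  <- sse + sum((y.train[fold == j] - pred)^2)
      }
      cv.sse[z] <- sse                                     # (2.11)
    }
    z.best <- which.min(cv.sse)                            # step 6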

2.4.2 Default.BART model

The Default.BART model is very simple compared to CV.BART. A single BART model is

fit using the default prior parameter values, denoted with “*” in step 1 above.

Therefore, to find the Default.BART CI properties, we only need to follow steps 7 and 8

above, using the default parameters instead of the cross-validated ones.

Chapter 3
The simulation study and setting up the experiment

The thesis studies the accuracy of BART credible intervals under the effect of specific

factors. An analysis of variance (ANOVA) summarizes these factors’ effects on BART CIs.

The experiment has seven factors: sample size “n”, predictors’ dimension “p”, noise

standard deviation “σ”, junk variables, predictor correlation “r”, type of error

distribution, and BART method. A convenient way to conduct this experiment is using a

simulation study in which the true mean function is known. This makes it possible to

calculate measures of CI performance. This simulation is carried out using a designed

experiment.

This chapter describes the study, including the experimental factors and their levels.

It presents a separate small study to determine a sufficient number of MCMC

iterations. It also gives the details of the simulated function used here and describes

the use of parallel computing to run the experiment.

The results of the experiment will be analyzed in Chapter 4.

3.1 The experimental settings

3.1.1 The setting of experimental factors

Experimental design has extensive applications in many fields. In manufacturing, for

instance, it assists in identifying the significant factors that affect the quantity or

quality of production. Experimental design techniques can improve process yields and

reduce time and overall cost (Montgomery, 2013). A designed experiment is a

systematic arrangement for evaluating the performance of a response under the

influence of one or more independent factors, and it can determine which design

parameters importantly affect the response. Using an additive model allows the main

effects and interaction effects of the factors to be estimated simply.

Oehlert (2010) and Montgomery (2013) describe different types of experimental

design, such as full factorial, fractional factorial, and block designs. Each design has

particular characteristics and features that suit different situations. Given this

research's goal of studying the performance of BART CIs under various factors, we

carry out the study with a full factorial design: it is the simplest design, and its

analysis is easy.

In all experimental designs, it is necessary to determine which factors to study and

their levels. The selection of the seven factors in our study reflects our belief in

their importance. Six of the seven factors are related directly to the population model

(2.1), while the factor “BART method” concerns the choice of prior parameters for BART.

Other studies consider similar sets of factors. For instance, Friedman (2001) used three

factors to examine the performance of various function estimation methods. These

factors were the sample size, the underlying true function, and whether the error

distribution is normal or slash. An experiment in CGM (2010) considers some factors

similar to those studied in this thesis. They conducted an experiment to show the

performance and features of the BART model against the true underlying function. They

used a fixed function with five actual predictors 𝑥𝑖 and included additional junk variables

to give a total of 10, 100, or 1000 predictors. Thus, in the BART study the number of

junk variables and the total dimension were varied. These previous examples show

that our seven factors are reasonable choices for examining their effects on BART CIs.

These factors have various types and numbers of levels. Some are four level factors,

while other factors have only two levels. Some are numeric factors, while others are

categorical. More details of our factors and their levels are given below.

1. Sample size “n”

The sample size determines the number of observations in a training data set. It is

the first factor we decided to include in this full factorial design, to see how the size of

training data sets can impact BART CI performance. The sample size “n” is a numeric

factor with four levels: n = (40, 200, 1000, 10000). The reason for not choosing larger n

values is that the BART process would take a very long time to run.

2. Dimension “p”

In our simulated data sets, it is necessary to determine how the values of x are

generated. The dimension p gives the number of important predictors, that is, the

number of predictors that affect the response function. This excludes junk variables,

which are described below. The four levels p = (1, 2, 5, 10) were chosen for this

factor. We did not select larger p values because BART would take a very long time

to run.

The decision to study the effect of dimension on BART CIs is motivated by studies in

CGM (2010). That study compared BART with other supervised learning methods such

as gradient boosting, random forests, and neural nets. The goal was to examine which

method performed best on a simulated function of p = 5 inputs, with additional junk

variables. The dimension p = 5 in that study falls within the range 1, 2, 5, 10 that we

study in this thesis.

3. Junk variables

In the experiment, the mean function f(x) is a function of p variables. In addition to

those p variables, we may add “junk” variables which have no effect on f(x) or the

response y. The factor “junk” is a two level categorical factor with levels (“no”, “yes”).

If “junk” = “no”, then the predictor matrix X has p columns. If it is equal to “yes”, then X

has 11*p columns, and for every actual x variable, there are 10 “junk” x variables. For

example, if p=10 and junk=”yes”, then our X matrix will have 110 columns. The first 10

columns will be used in f(x) to generate the mean function, while the other 100 columns

will be junk variables. Thus, the percentage of active inputs are less than 10%. The

purpose of considering junk variables is to see if the performance of CIs is affected for

Default.BART and CV.BART. Experiments in CGM (2010) suggest that Default.BART had

worse performance than CV.BART for high-dimensional problems with many junk

variables. That experiment had p=5 fixed, and varied the number of junk variables.

That is, in all experiments CGM used a fixed function of five predictors and varied the

number of junk variables over 5, 95, and 995, so that the total number of active and

junk variables was 10, 100, or 1000. Both BART versions behaved similarly at 10

variables, but Default.BART was worse than CV.BART at the larger values 100 and

1000.

In this thesis, we emphasize that f(x) is built to be a function of only the p active

variables, but BART is given no information about which variables (if any) are junk.

4. Standard deviation “σ”

The noise level of the residual in the population model (2.1) is √Var(ε) = σ. We vary σ

over four levels, including very low values: σ = (0.01, 0.1, 0.25, 0.5).

5. Predictor correlation “r”

We decided that any nonzero correlation between our simulated predictors would be

positive, meaning an increasing relationship between any two predictors. The

correlation r is a numeric factor with two levels (0, 0.5). All predictors in the X matrix

(including junk variables, if present) are generated as

multivariate normal random variables with mean vector 0 and covariance matrix R. If

r=0, then R=𝙸. If r=0.5, then all off-diagonal elements of R are 0.5, and the diagonal of R

contains 1’s.
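This generation step can be written directly with the mvtnorm package (which the experiment code loads, as noted in Section 3.4); here ncols is a hypothetical name counting all active plus junk columns:

    library(mvtnorm)
    make.X <- function(n, ncols, r) {
      R <- matrix(r, ncols, ncols)    # equicorrelated covariance matrix
      diag(R) <- 1
      rmvnorm(n, mean = rep(0, ncols), sigma = R)
    }
    X <- make.X(n = 200, ncols = 22, r = 0.5)  # e.g. p = 2 with junk = "yes"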

6. Type of error distribution

In (2.1), the distribution of the noise term ε is specified as normal. We study the

effect of violating this assumption, by using either normal or t errors. Thus, the factor

“error distribution” is a categorical factor with two levels corresponding to Gaussian or a

t-distribution with 3 degrees of freedom. The t-distribution was chosen to have three df

because this is the smallest integer degrees of freedom for which the variance is finite.

Errors were generated as a scalar multiple of either a standard normal or a t₃ random

variable, with multiplier σ. Thus, the variance of the t errors was actually 3σ².
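In R, both error types are one-line draws; note that rt(n, df = 3) has variance 3, which is why the t errors have variance 3σ² (a sketch with our own function name):

    make.errors <- function(n, sigma, dist = c("N", "t")) {
      dist <- match.arg(dist)
      if (dist == "N") sigma * rnorm(n) else sigma * rt(n, df = 3)
    }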

7. BART method

We study two different forms of the BART model: CV.BART and Default.BART, as

described in Section 2.4. The main difference is that the default version uses default

values of all prior parameters, while the CV version chooses these parameters by

cross-validation. Thus, “BART method” is a categorical factor with two levels. Both BART

methods are implemented with the BayesTree package in R.

The seven factors are used to construct a design matrix “D” which is a full factorial

design matrix. Table 3.1 summarizes the seven factors.

We simulate five replications for each combination of the experimental design factors

(i.e., for each row of the D matrix). The design has 4 × 4 × 4 × 2 × 2 × 2 × 2 = 1024

factor combinations; with replications, D becomes a large matrix with

5 × 1024 = 5120 rows.

Factor               Description                                 Number of levels   Levels

n                    sample size                                 4                  40, 200, 10³, 10⁴
p                    number of important predictors              4                  1, 2, 5, 10
junk variables       extra variables having no effect on f(x)    2                  “yes”, “no”
σ                    noise standard deviation                    4                  0.01, 0.1, 0.25, 0.5
r                    predictor correlation                       2                  0, 0.5
error distribution   distribution type of ε                      2                  “N”, “t”
BART method          procedure to choose prior parameters        2                  “cv”, “default”

Table 3.1: Summary of the factors of the designed experiment.

3.1.2 Additional variables not treated as factors

There are several other quantities that could have been treated as experimental

factors, but have instead been set to fixed levels. In this section we identify those

quantities and discuss how their levels were chosen. These quantities are BART prior

parameters, and the number of MCMC iterations. There is a brief clarification for these

quantities below:

1. BART prior parameters

Four prior parameters that are important for BART are mentioned in Chapter 2. Each

of CV.BART and Default.BART has a particular way to select these prior parameters.

These parameters are k (the prior on μ_ij), (ν, q) (the prior on σ), and m (the number

of trees in the ensemble). These prior parameters are set as described in Section 2.3,

except for a restriction relating to the choice of the prior on σ.

In Chapter two we indicated that there are two ways to obtain 𝜎̂, a rough guess of σ

used in specifying the prior for σ. One may use the residual standard error from a linear

regression model, or a sample standard deviation sd(y). In this thesis, we calculate σ̂

as a fraction of the overall training data variation: σ̂ = sd(y)/2. The method for

specifying σ̂ can therefore be considered a fixed factor. The reason for not using the

linear model to specify σ̂ is that in some cases the training set has fewer observations

than variables (e.g., n = 40 with p = 5 or 10 and junk = 'yes' gives 55 or 110 variables

but only 40 observations).

2. MCMC iterations

a. The MCMC sample

As described in Section 2.2, the MCMC samples of parameter values of the BART

model are used to generate credible intervals for predictions. The number of MCMC

iterations affects the quality of the BART CI: increasing the number of iterations lets

the MCMC explore the posterior distribution more fully, but it also leads to very long

run times for BART. We conducted a small experiment to choose a number of MCMC

iterations that balances CI accuracy with run time; we call this value “an

adequate number of MCMC iterations”. The details of the small experiment and results

are discussed in Section 3.2. Another important issue is that the MCMC algorithm will

not converge immediately, and we have to discard some of the early MCMC iterations

that are called “burn-in” iterations. In this thesis, burn-in iterations always represent

the first 20% of an MCMC run. CGM typically used 20% burn-in. BART runs more

quickly with 1 in 10 thinning. Later in this Chapter, we will see that the eventual

decision is to run 5000 iterations, discarding the first 1000 iterations as burn-in, and

keeping samples from the last 4000 iterations.

At a specific x, each MCMC iteration produces a corresponding output f(x), so the

4000 post-burn-in iterations give 4000 samples from the posterior distribution of f(x),

from which the BART CI is formed. For this study, these samples must be stored at

each of 10,000 test points, and the saved samples (iterations 1001–5000) consume

considerable computer memory. To reduce storage, we “thin” the MCMC sample by

recording only results from every 10th MCMC iteration; that is, we keep the 1001st,

1011th, 1021st, … samples after burn-in. Consequently, 90% of the computer memory

is saved with negligible effect on the quality of the results. To summarize, 5000 MCMC

iterations are run; the first 1000 are discarded as burn-in; the last 4000 are thinned,

resulting in 400 saved MCMC samples.
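In terms of iteration indices, the kept draws are simply (a sketch):

    keep <- seq(from = 1001, by = 10, length.out = 400)  # 1001, 1011, ..., 4991
    stopifnot(max(keep) <= 5000)                         # all within the 5000 run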

b. Using shorter MCMC runs for CV.BART

The CV.BART procedure, described earlier in Section 2.4.1, is very computationally

intensive, requiring 120 runs of the BART algorithm, corresponding to 5 folds and 24

combinations of prior parameter choices. To speed up the CV.BART algorithm, we

shorten its MCMC runs to 100 burn-in iterations followed by 400 sampling iterations,

again keeping every 10th value. The best of the 24 parameter combinations in

CV.BART is selected by minimal cross-validated SSE. CGM (2010) observe that if low

predictive error is the objective, rather than accurate coverage, fewer MCMC

iterations are needed. Note that the final re-fitting of BART with all

training data and the optimal parameters uses the same MCMC settings as default

BART.

3.1.3 Response variables for the experiment

The goal of this thesis is to study the BART CI properties coverage (at levels 90%,

95%, and 99%), width (90%, 95%, and 99%), and predictive SSE, as defined in

Chapter 2. All these responses are calculated using a test data set with 10⁴

observations and error-free f(x) values, generated using the corresponding settings of

p, junk, correlation, and σ. If the results were evaluated using the f(x) values at the

training points, they would be less reliable: a model may overfit the training data,

while prediction on test data gives a more realistic assessment of predictive accuracy.

3.2 Choosing the number of MCMC iterations

Markov chain Monte Carlo (MCMC) samples are an important part of this research,

giving the posterior probability distribution for model predictions. Specifically, they

allow calculation of the quantiles of the posterior distribution of E(Y|X).

This section considers the choice of an appropriate number of MCMC iterations. A

larger number of iterations gives more accurate prediction since they cover the

posterior distribution more fully. However, a smaller number of iterations will give

results faster. The number of MCMC iterations needs to be chosen to balance accuracy

and speed.

For finding an adequate number of MCMC iterations, a small experiment was

conducted. We used the following test function: f(x) = 10 sin(πx₁x₂) + 20(x₃ − 0.5)² +

10x₄ + 5x₅ (Chipman, George and McCulloch 2010; Friedman 1991).

In this simulated data, observations were generated randomly with i.i.d. N(0,1)

residuals. Ten x variables are generated independently from a Uniform (0,1)

distribution. The first five x variables affect the response, and the rest are junk

variables. The values N = 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 10000 were used as

different numbers of MCMC iterations to fit the BART model. Each time the BART model

was estimated with a particular number of iterations and the prediction SSE, coverage

and width of 90% credible intervals were recorded. Two different training sample sizes

were selected to compare the behaviour of the Markov chain. These sample sizes were

n₁ = 1000 and n₂ = 10000. In all cases, a test set of 10000 observations was used.

An initial burn-in portion of the MCMC samples was discarded. In all cases, the

number of burn-in iterations was taken to be 25% of the number of the saved iterations

that we denote as “N”. For instance, if N=500 iterations were saved, then prior to that

125 burn-in iterations were discarded.

Table 3.2 gives the responses for the different training sample sizes and numbers of

MCMC iterations. All CIs were constructed at the 90% level. From

Table 3.2, it is obvious that coverage and width have a positive relation with the number

of MCMC iterations while there is a negative relationship between SSE and the number

of MCMC iterations. Figure 3.1 plots columns 2-7 of Table 3.2 as a function of the

number of MCMC iterations. The same trends identified in the table are visible in the

plots.

n=1000 n=10000

N coverage% width SSE coverage% width SSE

500 88.59 2.19 4600.23 76.69 0.744 945.38

1000 91.28 2.27 4367.53 77.53 0.743 945.15

1500 92.81 2.31 4132.30 79.29 0.760 900.11

2000 93.23 2.33 4039.64 81.87 0.787 852.72

2500 93.85 2.35 3955.37 83.88 0.808 833.77

3000 94.18 2.37 3940.56 83.98 0.813 833.60

4000 94.42 2.36 3902.00 85.48 0.831 814.12

5000 94.90 2.39 3814.29 85.31 0.836 815.50

10000 96.72 2.56 3385.59 86.23 0.839 793.16

Table 3.2: Predictive performance at various MCMC iterations (N) and sample sizes (n).

Figure 3.1: The relation between MCMC samples and different responses (width,

coverage, SSE).

The largest changes to SSE, width and coverage happen by 4000 MCMC iterations.

That suggests N=4000 as an adequate number of iterations. Choosing more MCMC

iterations gives results that are slightly more reliable but not very different from those at 4000 iterations. Thus,

4000 is a good number of iterations for this research, giving a good balance of

computing speed and accuracy.

When this experiment was repeated, there were only small changes in results for the

n=1000 training set. However, the same trends as in Table 3.2 and Figure 3.1 were

observed. For n=10000, the results did not change.

Figure 3.2 helps explain the different behaviour of the Markov chain at the two

training sample sizes. It plots the posterior samples of σ against the

MCMC iteration, with burn-in samples indicated in red and saved samples indicated in

black.

Figure 3.2: Posterior samples of σ versus MCMC iterations number, for training

samples of size 1000 and 10000. Burn-in samples are red.

The parameter σ is plotted instead of other model parameters because it summarizes

the fit of the BART model. The true value of σ is 1.0. The right plot of Figure 3.2 shows

more uniform MCMC iterations and less noise than in the plot for n=1000. This is to be

expected, since the larger training set will result in a posterior distribution with less

uncertainty. The plots indicate that the MCMC is mixing well among these iterations.

In conclusion, although a larger number of MCMC iterations gives more reliable

findings, smaller values do not give very different results. While 4000 MCMC

iterations is less than 10000, it leads to credible predictions and results similar to those at 10000.

3.3 The simulated function

To have more general results, for each of the 5120 different experimental runs in D, a

different f(x) is generated, instead of using just one population function. This function

could be called a mean function because it is equal to E(y|x), where y = f(x) + ε in the

population model. For each combination of the D factors, the function f(x) is chosen

randomly and used as the conditional mean in simulating the data set. This leads to a

specific data set for each row of D.

Friedman (2001) described an algorithm for generating this random function. The

thesis uses the implementation of that algorithm from Chipman (2011). This Friedman

function can be considered an uncontrollable factor of the experiment, since it varies

randomly from row to row of D. The generated function is given as

(3.1) f(x) = ∑_{j=1}^{20} a_j h_j = ∑_{j=1}^{20} a_j h_j(x_j; μ_j, V_j).

The coefficients a_j are scalars randomly generated as U(−1, 1). The function h_j is

an unnormalized multivariate Gaussian probability density function,

(3.2) h_j(x_j; μ_j, V_j) = exp( −(1/2) (x_j − μ_j)ᵀ V_j⁻¹ (x_j − μ_j) ).

The size “𝑝𝑗 " of each 𝑥𝑗 is taken randomly as floor [1.5 + r] where r is drawn from an

exponential distribution with mean λ=2. If r > p, then we set r = p. The vector 𝑥𝑗 has a

random subset of the p predictor variables of length 𝑝𝑗 where 0 < 𝑝𝑗 ≤ p. The

40
parameters of ℎ𝑗 are the mean vector “𝜇𝑗 ” and covariance matrix “𝑉𝑗 " which are not

constant. The vector { 𝜇𝑗 }120 is randomly generated from MVN(0, I). The

𝑝𝑗 x𝑝𝑗 matrix "𝑉𝑗 " is a random orthonormal matrix given as

(3.3) 𝑉𝐽 = 𝑈𝑗 𝑊𝑗 𝑈𝑗𝑇

where 𝑈𝑗 is a uniformly random orthonormal matrix and 𝑊𝑗 = diag {𝑤1 , … , 𝑤𝑝𝑗𝑗 }. The

𝑊𝑗 ’s elements are generated as the square of a 𝑈(𝑎 = 0.1, 𝑏 = 2. ).
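One way to generate such a V_j in R (our sketch; a uniformly random orthonormal U_j can be taken from the QR decomposition of a Gaussian matrix):

    make.V <- function(pj) {
      U <- qr.Q(qr(matrix(rnorm(pj^2), pj, pj)))  # random orthonormal U_j
      W <- diag(runif(pj, 0.1, 2)^2, nrow = pj)   # squared U(0.1, 2) entries
      U %*% W %*% t(U)                            # V_j = U_j W_j U_j^T
    }
    V <- make.V(2)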

Each h_j has its own μ_j and V_j; these two parameters control the shape of h_j,

which contributes to the shape of f(x). Figure 3.3 shows contour plots of generated

f(x) corresponding to different values of μ_j and V_j. It shows six simulated functions

with p = 2 and with 10 h_j terms instead of 20 as in (3.1). Each of the ten h_j is a

function of one or two variables chosen randomly from the two inputs in x.

Figure 3.3: Six realizations of randomly simulated f(x) in two dimensions, as described in

the text.

Each realization of f(x) has a different shape, with different peaks and surfaces. This

confirms the variety of functions produced, which makes the thesis results more general.

We made some small changes to the function generation mechanism: the number of

h_j terms is fixed at 20, and the random length p_j of x_j is bounded by p, the number

of active variables in the model specified by the design. In the Friedman algorithm and

the Chipman implementation, the number of h_j terms was random rather than fixed

at 20.

3.4 Parallel computing mechanism

In this thesis, there are computational challenges caused by the size of the

experiment, and the long execution time of BART. These challenges led us to use

parallel computing resources from ACENET.

This section describes the use of parallel computing to carry out the 5120 runs of the

experiment. All computing was carried out on ACENET parallel computers. ACENET is a

high performance computing consortium for academic research, consisting of several thousand individual computers connected together (http://www.ace-net.ca/wiki/Compute_Resources). The ACENET clusters are suitable for jobs that require large computing resources and that can be distributed among several computers: a large job is split into multiple smaller jobs, which are then executed on individual computers in the cluster. The clusters are also well suited to executing multiple runs of a process with different sets of data, for example running the same algorithm on several subsets of the given data.

This thesis has a large experiment where the final design matrix D has 5120 runs after

replication. To run such a large experiment on a personal computer, we would need to

spend months. To save time, we decided to run the thesis experiment on a cluster of

computers by dividing the entire experiment into smaller “jobs”. The technique used here is to break the 5120 rows of the D matrix into 640 independent jobs, which can be run on one or more ACENET clusters. Each job consists of 8 rows (runs) of the D matrix and is an independent task. Table 3.3 shows the first of the 640 jobs. We divide the 5120 runs among the 640 jobs so that execution time is similar from one job to another: less than 48 hours for jobs that include CV.BART, and less than 24 hours for jobs that contain the Default.BART method. In practice, each CV.BART job took approximately 16 to 29 hours to complete, while Default.BART jobs took about 4 to 10 hours.

This balancing of execution time was achieved by dividing the runs among jobs according to sample size. All 640 jobs have the same levels of sample size, in the same order, so the n column of Table 3.3 is exactly the same for every job.

row n p σ junk r error.dist method replicate

1 40 1 0.01 no 0 t cv 1

2 200 1 0.01 no 0 t cv 1

3 1000 1 0.01 no 0 t cv 1

4 10000 1 0.01 no 0 t cv 1

5 40 2 0.01 no 0 t cv 1

6 200 2 0.01 no 0 t cv 1

7 1000 2 0.01 no 0 t cv 1

8 10000 2 0.01 no 0 t cv 1

Table 3.3: The 8 rows of D in the first job.

Setting up the jobs this way was reasonable because the execution time of the BART algorithm is roughly proportional to the training set size n. In the MCMC experiment described in Section 3.2, we saw that MCMC runs take longer as n increases. Therefore, we arranged each job to contain the same combination of n levels.
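A minimal sketch of this row-to-job assignment in R, under the assumption (visible in Table 3.3) that consecutive rows of D cycle through the four n levels; the variable names are illustrative:

# Sketch: split the 5120 rows of D into 640 jobs of 8 rows each
njobs        <- 640
rows.per.job <- nrow(D) / njobs                  # 5120 / 640 = 8
job.rows     <- split(1:nrow(D), rep(1:njobs, each = rows.per.job))
# consecutive rows vary n fastest, so every job gets the same mix of n
D[job.rows[[1]], ]                               # the 8 runs of job 1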

To run these jobs on ACENET machines, we used a collection of R code for parallel computing (Chipman, 2011). It includes five main files: main.R, doit.R, setbatch.R, postprocess.R, and cleantemp.R. Each of these files has a particular task; briefly, their tasks are:

- doit.R has most of the code to execute the runs of this thesis experiment. It builds the D matrix; specifies the combinations of prior parameters, the number of replications, and the MCMC iteration parameters; loads all the R packages the program needs, such as “mvtnorm”; and splits the D matrix into small jobs, each containing 8 rows of D. It executes Default.BART or CV.BART and then saves the results in files to be read later. It sources some of the files below, such as setbatch.R.

- main.R is the first file of the group to be run. It prepares all other files before execution on the cluster and specifies each job’s maximum execution time in hours and minutes, so it can be thought of as a timer file that states how long a job may take. Executing main.R prepares separate files for the 640 different jobs to be run; for each job, an R file specifies which rows of D will be used by doit.R. Code is also generated to submit the jobs to the cluster queue for execution.

- setbatch.R is a function that sets up the parallel jobs in an ACENET queue. It assigns names to the files and parallel jobs so that they can be identified on ACENET.

- postprocess.R is a results-collection file, run after all parallel jobs have completed. It reads all results files and carries out the required analysis.

- cleantemp.R is a cleaning file, used to discard the temporary files produced by the computations.
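As a rough illustration of the per-job pattern that doit.R follows, a sketch of one job's work is given below. The helper run.one.experiment is hypothetical shorthand for fitting BART on one row's factor combination and computing the responses; the actual thesis code differs.

# Sketch: one parallel job processes its 8 assigned rows of D
job.id  <- as.integer(commandArgs(trailingOnly = TRUE)[1])    # job index
my.rows <- ((job.id - 1) * 8 + 1):(job.id * 8)                # its rows of D
for (i in my.rows) {
  result <- run.one.experiment(D[i, ])                # hypothetical helper
  saveRDS(result, sprintf("result_row%04d.rds", i))   # read by postprocess.R
}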

We made some minor modifications to this setbatch example to run our jobs on ACENET. For example, doit.R contains our main program and sources all files the program depends on, such as mainfunc.R, cvbart.R, and Friedman.R. The roles of these three programs are as follows:

- mainfunc.R is a complementary file to doit.R that contains the rest of the program settings. All the small changes to the function-generation settings, such as fixing the number of ℎ𝑗 terms at 20, were made here.

- cvbart.R includes the CV.BART method described in Section 2.4.1.

- Friedman.R defines the Chipman (2011) implementation of the Friedman function generator discussed in Section 3.3.

Chapter 4
Analysis

This thesis uses ANOVA to analyze the experimental results. ANOVA displays the main effects and interaction effects of the factors. The analysis focuses on main effects and two-way interactions and ignores three-way interactions, since including three-way interactions in the model does not lead to large increases in 𝑅². More details are given later in this chapter.

This experiment has seven responses to study: coverage90, coverage95, coverage99,

width90, width95, width99, and predictive SSE. However, this analysis concentrates on

the three responses coverage95, width95, and SSE. The definitions of the responses were given in Section 2.2. Results for the 90% and 99% levels are quite similar to those for the 95% level, as illustrated later in this chapter.

Below the factors’ main effects and interactions are explored via plots and ANOVA

tables.
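For reference, the two-way interaction ANOVAs reported below can be obtained with a call of the following form, assuming the 5120 results are collected in a data frame (here called results) with the seven factors coded as R factors; the column names are placeholders.

# Sketch: ANOVA with all main effects and two-way interactions
fit2 <- aov(coverage95 ~ (n + p + sigma + r + junk + error.dist + method)^2,
            data = results)
summary(fit2)    # effects as in Table 4.1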

4.1 Analysis of coverage

We have three possible coverage responses: coverage90, coverage95, and

coverage99. Figure 4.1 presents the main effects for all three responses. The effects

have the same relative size across different coverages, with a shift in mean level from

one response to another, and there is a high correlation between these responses (Section 4.6 contains the details). We choose to analyze only coverage95 because its results are similar to those of the 90% and 99% responses.

Figure 4.1: The main effects for coverage90, 95, and 99 respectively. The factors’ order

along the horizontal axis is n, dimension p, σ, predictor correlation r, junk variables,

error distribution, and BART method.

Figure 4.2: Main effects for coverage95.

Figure 4.2 displays the main effects for coverage95. We use it to examine the effect

sizes. The factor n has the largest effect on the coverage of BART credible intervals. The

factors σ and error distribution have the second and third largest main effects,

respectively. Then, junk predictor, dimension p, and BART method have small but

visible effects in Figure 4.2. The predictor correlation “r” appears to have little or no

impact on coverage. Overall, the mean level of coverage (about 80%, indicated by the

horizontal line in Figure 4.2) is considerably lower than the nominal 95% level.

For the most part there is an inverse relationship between the sample size n and coverage: coverage increases when n decreases, except at n = 40. Thus, for large sample sizes such as n = 10,000, the coverage of BART credible intervals is much lower than the desired 95%. The factor σ has a positive relationship with coverage, with very low coverage at σ = 0.01. Although junk predictors and error distribution have little effect on coverage, coverage is closer to the nominal 95% level when there are no junk variables and the errors are normally distributed. The difference between the cv and default versions of BART is very small.

We now give a more detailed analysis of coverage95, using an ANOVA table and interaction plots. Table 4.1 shows the ANOVA for the two-way interaction model. The rows of the ANOVA table are ordered by the mean square values (MS). F-tests and the corresponding p-values can be used to identify significant effects. The last column of the table indicates significance with stars: * for 0.01 < p ≤ 0.05, ** for 0.001 < p ≤ 0.01, and *** for p ≤ 0.001.

In the remainder of this section we examine some of the large effects. The ordering of effects for coverage95 shows that n, σ, n:σ, error distribution, p:σ, and n:method are the six most important effects for the coverage of BART credible intervals. All main effects except r are highly significant, with p-values less than 10⁻⁴, and seven two-way interactions are at least as significant.

Df SS MS F-value Pr(>F)
n 3 73.672 24.557 4168.230 <2.2e-16 ***
σ 3 21.214 7.071 1200.234 <2.2e-16 ***
n: σ 9 43.023 4.780 811.399 <2.2e-16 ***
error distribution 1 0.887 0.888 150.639 <2.2e-16 ***
p: σ 9 6.456 0.717 121.765 <2.2e-16 ***
n: method 3 1.678 0.559 94.945 <2.2e-16 ***
p: junk 3 1.264 0.421 71.529 <2.2e-16 ***
n: p 9 3.348 0.372 63.140 <2.2e-16 ***
junk 1 0.263 0.263 44.582 2.702e-11 ***
n: junk 3 0.769 0.256 43.526 <2.2e-16 ***
n: error distribution 3 0.410 0.137 23.176 6.791e-15 ***
method 1 0.091 0.091 15.390 8.861e-05 ***
p: method 3 0.171 0.057 9.655 2.369e-06 ***
p 3 0.089 0.030 5.046 0.001718 **
σ: error distribution 3 0.063 0.021 3.577 0.013338 *
σ: r 3 0.059 0.020 3.359 0.018007 *
σ: method 3 0.056 0.019 3.172 0.023263 *
σ: junk 3 0.036 0.012 2.011 0.0901220
r: method 1 0.009 0.009 1.574 0.209706
r: junk 1 0.007 0.007 1.115 0.291069
junk: method 1 0.005 0.005 0.780 0.377086
p: error distribution 3 0.012 0.004 0.682 0.562851
n: r 3 0.005 0.002 0.265 0.850610
p: r 3 0.003 0.001 0.169 0.917679
r 1 0.001 0.001 0.162 0.687229
error distribution: method 1 0.001 0.001 0.105 0.746514
junk: error distribution 1 0.000 0.000 0.044 0.833871
r: error distribution 1 0.000 0.000 0.002 0.967963
residual 5037 29.676 0.006

Table 4.1: ANOVA table for coverage95.

Earlier in this section, interpretations of the main effects for coverage95 were given.

Here we examine the three interactions among effects with the six largest MS values.

We first consider the n:σ interaction displayed as an interaction plot in Figure 4.3.

Figure 4.3: The n:σ effect on coverage95.

The n:σ interaction plot shows that when σ=0.01 and n=10000 coverage is

approximately 30%, well below the nominal 95% level. As a result, BART credible

intervals will be very inaccurate at very small values of σ and very large values of sample

size n. At other levels of σ (0.1, 0.25 and 0.5), coverage is less sensitive to the effect of

sample size n. Figure 4.3 shows coverage closest to the nominal 95% level when n has

small values (40 or 200) and σ is large (0.25 or 0.5).
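Plots like Figure 4.3 can be produced with base R, for example with interaction.plot, again assuming the placeholder results data frame sketched earlier:

# Sketch: interaction plot of mean coverage95 by n and sigma
with(results,
     interaction.plot(x.factor = n, trace.factor = sigma,
                      response = coverage95, ylab = "mean coverage95"))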

Figure 4.4 illustrates the p:σ interaction effect on coverage95. Coverage is much lower when σ = 0.01, and for other values of σ, coverage is similar. The factor p has a complicated relationship with coverage. For instance, p sometimes gives the best coverage at its lowest values, as at p = 1, 2 with σ = 0.25, 0.5. In other cases, p gives the worst coverage at p = 1 and 2, when σ equals 0.01.

Figure 4.4: The p:σ effect for coverage95.

Figure 4.5 shows the n:method interaction effect. This interaction appears smaller than the other two, since the two lines are reasonably close to each other. The coverage of Default.BART is more sensitive to n than the coverage of CV.BART. Both BART methods give their best coverage at the lowest levels of sample size and their worst coverage at the largest. At n = 40 and n = 10000, CV.BART has coverage closer to 95% than does Default.BART, and both methods give good coverage at n = 40 and 200.

Figure 4.5: The n: method effect for coverage95.

4.2 Analysis of width

Here we begin by comparing main effect plots of credible interval width at the three different nominal levels (Figure 4.6). Figure 4.6 suggests the factors’ effects have the same relative sizes across the widths corresponding to the 90%, 95% and 99% levels, apart from a shift in mean level from one response to another. Thus, we analyze only width95 (Figure 4.7).

Figure 4.6: The main effects for width90, 95, and 99. The factors order is n, p, σ,

predictor correlation r, junk variables, error distribution, and method.

Figure 4.7 shows the sizes of the different effects. The factor n has the largest impact on credible interval widths. The factors σ and junk variables have the second and third largest effects, while dimension p and error distribution have the fourth and fifth largest. The correlation “r” shows some impact on width, with slightly narrower intervals when predictors are correlated. The BART method appears to have little or no impact on width. Ordering some factor effects by size, such as junk and p, can be difficult because of differing numbers of factor levels; a more precise ordering is given later using ANOVA.

Figure 4.7: The main effects for width95.

The labelled effects in Figure 4.7 can be used to identify the relationship between each factor and interval width. There is a negative relationship between sample size n and width: reasonably, interval width decreases as n increases. The factors σ and p have a positive relationship with width, which increases as they increase. The categorical variables junk and error distribution give narrower intervals when there are no junk variables or when the response’s error has a t distribution.

The width95 ANOVA is illustrated in Table 4.2, with rows ordered according to the

size of MS.

The p-values in Table 4.2 indicate that all main effects except method are significant, and that r has a smaller effect than the other five main effects. The first four effects are considerably larger than the others based on their mean squares. The sample size n is the most significant factor: it has the largest MS, at least three times larger than that of any other factor. In addition, the main effects have the largest impact on width, holding the five largest MS values. Although the factor junk has a larger MS value than p, the ordering is reversed when considering SS. That is, the four levels of p account for more overall variation than the two levels of junk, but when standardized by degrees of freedom, junk has the larger mean effect.

Df SS MS F-value Pr(>F)
n 3 1369.23 456.41 6369.026 <2.2e-16 ***
σ 3 450.36 150.12 2094.890 <2.2e-16 ***
junk 1 114.38 114.38 1596.080 <2.2e-16 ***
p 3 219.22 73.07 1019.734 <2.2e-16 ***
error distribution 1 26.94 26.94 375.989 <2.2e-16 ***
n: junk 3 74.10 24.70 344.673 <2.2e-16 ***
n: method 3 40.79 13.60 189.729 <2.2e-16 ***
n: p 9 69.46 7.72 107.697 <2.2e-16 ***
σ: error distribution 3 18.94 6.31 88.122 <2.2e-16 ***
n: σ 9 44.51 4.95 69.010 <2.2e-16 ***
σ: junk 3 6.25 2.08 29.065 <2.2e-16 ***
r 1 1.92 1.92 26.820 2.320e-07 ***
junk: method 1 1.39 1.39 19.344 1.114e-05 ***
p: method 3 4.11 1.37 19.120 2.518e-12 ***
n: error distribution 3 3.32 1.11 15.442 5.350e-10 ***
r: junk 1 0.87 0.87 12.208 0.0004799 ***
p: σ 9 7.40 0.82 11.474 <2.2e-16 ***
junk: error distribution 1 0.75 0.75 10.497 0.0012035 **
error distribution: method 1 0.43 0.43 5.957 0.0146920 *
n: r 3 1.21 0.40 5.607 0.0007794 ***
p: r 3 0.79 0.26 3.694 0.0113546 *
σ: method 3 0.59 0.20 2.726 0.0426079 *
p: junk 3 0.55 0.18 2.567 0.0527259
r: method 1 0.16 0.16 2.174 0.1404399
r: error distribution 1 0.05 0.05 0.631 0.4268844
σ: r 3 0.12 0.04 0.541 0.6544282
p: error distribution 3 0.11 0.04 0.533 0.6596303
method 1 0.01 0.01 0.071 0.7897343
residual 5037 360.96 0.07

Table 4.2: The ANOVA table for width95.

We now focus on the interaction n:junk, which is the sixth largest effect. Figure 4.8 shows the n:junk interaction. When n increases, width decreases, confirming the inverse relationship between n and width noted previously. The n:junk interaction corresponds to the lines not being parallel: when there are junk variables, the effect of sample size on width is larger.

Figure 4.8: n: junk effect for width95.

4.3 Analysis of SSE

Figure 4.9 displays the main effects for the predictive SSE. All factors have a visible

effect on the sum of squared errors (SSE).

Figure 4.9: The main effects for SSE.

The factor n has the largest effect on SSE. The factors junk, σ, and p are the next largest main effects. As with the coverage95 response, the ranking of effect sizes is complicated by the varying number of levels across factors. The remaining main effects are smaller but still visible in Figure 4.9. SSE decreases as n increases, and as p and σ decrease. SSE is smaller with no junk variables, and slightly smaller with t errors, cross validation, and correlated predictors.

The ANOVA for the response “predictive SSE” is shown in Table 4.3. As in the earlier ANOVA tables, rows are ordered by the MS column.

Df SS MS F-value Pr(>F)
n 3 8260741213 2753580404 2529.662 <2.2e-16 ***
junk 1 1149670834 1149670834 1056.181 <2.2e-16 ***
σ 3 2876836357 958945452 880.965 <2.2e-16 ***
n: junk 3 1654503795 551501265 506.654 <2.2e-16 ***
p 3 1595725522 531908507 488.654 <2.2e-16 ***
n: p 9 1279609758 142178862 130.617 <2.2e-16 ***
n: σ 9 1019100689 113233410 104.0254 <2.2e-16 ***
error distribution 1 74723433 74723433 68.647 <2.2e-16 ***
p: junk 3 168766357 56255452 51.681 <2.2e-16 ***
method 1 52389336 52389336 48.129 4.495e-12 ***
junk: method 1 47800181 47800181 43.913 3.792e-11 ***
σ: error distribution 3 94367583 31455861 28.898 <2.2e-16 ***
p: σ 9 248525987 27613999 25.369 <2.2e-16 ***
n: method 3 77281290 25760430 23.666 3.322e-15 ***
σ: junk 3 62331280 20777093 19.088 2.640e-12 ***
r 1 12565677 12565677 11.544 0.0006850 ***
r: junk 1 10704712 10704712 9.834 0.0017228 **
n:r 3 19421017 6473672 5.947 0.0004809 ***
n: error distribution 3 13244277 4414759 4.056 0.0068745 **
p: r 3 10530418 3510139 3.2247 0.0216316 *
error distribution: method 1 1873202 1873202 1.721 0.1896402
σ: r 3 3744049 1248016 1.1465 0.3287928
p: method 3 3168696 1056232 0.9703 0.4056361
p: error distribution 3 1063607 354536 0.326 0.8067886
r: method 1 297215 297215 0.273 0.6013186
σ: method 3 566540 188847 0.174 0.9143665
r: error distribution 1 35733 35733 0.0328 0.8562321
junk: error distribution 1 2911 2911 0.003 0.9587577
residuals 5037 5482860544 1088517

Table 4.3: ANOVA table for SSE.

Overwhelmingly, n has the most significant effect: its MS is more than twice the next largest. After n, the effects junk, σ, n:junk, p, and n:p have the largest impact on SSE, in that order.

Now we examine interactions among the largest six effects. First, consider the n:

junk interaction displayed in Figure 4.10. At both levels of “junk”, there is an inverse

relationship between n and SSE. When there are junk variables, SSE is larger at small

values of n, and about the same as without junk variables, for large n.

Figure 4.10: The n: junk effect on SSE.

In Figure 4.11, the interaction between n and p is evident from the non-parallel lines. At all levels of n, there is a positive relationship between p and SSE: SSE is smaller when p is smaller. The decreasing relationship between SSE and n changes with the dimension p. As p increases, SSE values for small n become larger, resulting in a stronger relationship between SSE and sample size.

Figure 4.11: The n: p effect for predictive SSE.

4.4 Similarities and differences between responses

In this section similarities and differences in the separate analysis of the responses

“coverage, “width”, and “SSE” are summarized in the points below.

1- The effect of junk is slightly different for coverage99 than for coverage90 and coverage95: it has no effect on coverage99, while it has a small effect on both coverage90 and coverage95.

2- The effect sizes for each factor are approximately the same across the three “coverage” responses, and likewise across the 90, 95, and 99% “width” responses. For both coverage and width, only the overall mean level shifts across the three responses.

3- The factor p has very low impact on coverage while it has a large influence on

width and SSE responses.

4- At very low values of σ, such as 0.01, BART models have coverage far below the

nominal level.

5- Most of the factors have significant main effects for coverage except the

correlation “r”.

6- Most of the factors have significant main effects for width except BART method.

7- For SSE, all the seven factors have an impact.

8- The main effects for width have the same pattern as the main effects for SSE.

9- The main effects for coverage have a similar pattern as for SSE, with two

exceptions: junk and method. Their effects on coverage are the opposite from

their effects on SSE.

10- From points 8 and 9, we can see that although the three responses are related, the relationship between width and SSE is stronger than the relationship between coverage and SSE.

BART gives its worst coverage when n is very large and σ is small. This might occur because the MCMC iterations do not mix well enough to explore the whole posterior distribution. The experiment in Chapter 3 for choosing an adequate number of MCMC iterations suggests the same thing: coverage is better when BART is used on a smaller training data set.

Generally, the BART method has a small effect on the coverage and SSE responses, and almost no effect on width. Therefore, we recommend using Default.BART instead of CV.BART, which is much slower than the default version: CV.BART fits 120 BART models, corresponding to 24 combinations of prior parameters × 5 folds, and keeps the best cross-validation choice. In addition, we suggest increasing the number of MCMC iterations by a factor of two or more to increase the accuracy of Default.BART. Even though that will cause Default.BART to run more slowly, it is still much faster than using cross validation.
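As a sketch of this recommendation, assuming the BayesTree implementation of CGM's BART with its usual ndpost and nskip arguments, doubling the MCMC budget and extracting 95% credible intervals might look as follows; x.train, y.train and x.test are placeholders.

# Sketch: Default.BART with a doubled MCMC budget
library(BayesTree)
fit <- bart(x.train, y.train, x.test,
            ndpost = 2000, nskip = 200)    # twice the package defaults
# pointwise 95% credible intervals for f(x) at the test inputs
ci <- apply(fit$yhat.test, 2, quantile, probs = c(0.025, 0.975))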

4.5 Comparison between the two, three, and seven-way interaction models

The linear model for assessing the main predictive effect of seven independent

variables n, σ, p, r, junk, error distribution and method is

(4.1)  $y = \mu + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_{13} x_{13} + \varepsilon$

In (4.1), each four-level factor (n, p and σ) has three predictors 𝑥𝑖 and three parameters

𝑏𝑖 . For example, we can code 𝑥1 =1 if n=200 and 0 otherwise, 𝑥2 =1 if n=1000 and 0

otherwise, 𝑥3 =1 if n=10000 and 0 otherwise. Each one of the two-level factors such as

junk and method has only one predictor and its parameter. This gives a total of 3 + 3 + 3

+ 1 + 1 + 1 + 1 = 13 terms corresponding to main effects.
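The dummy coding just described is what R's model.matrix produces under the default treatment contrasts; a small sketch:

# Sketch: a four-level factor expands to three 0/1 predictors
n <- factor(c(40, 200, 1000, 10000))
model.matrix(~ n)   # intercept plus x1 = I(n = 200), x2 = I(n = 1000), x3 = I(n = 10000)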

We consider studying various interaction effects. It is easy to define the 2-way

interaction as a product of two main effect terms appearing in (4.1). Therefore, the

regression formula for a two-way interaction model contains all main effect terms in

(4.1) plus 69 additional terms. Each term consists of a product between two main

effects and a related parameter. Similarly, a three-way interaction effect can be defined

as a product of three main effect terms. The three-way regression model consists of the two-way regression model plus 193 extra terms. The 69 two-factor terms and the 193 three-factor terms correspond to 21 and 35 combinations of factors respectively, for a total of 56 interactions.
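These term counts follow from multiplying the degrees of freedom of the factors involved; the short R check below reproduces the 69 and 193 quoted above.

# Sketch: interaction degrees of freedom from the factor df
df <- c(n = 3, p = 3, sigma = 3, r = 1, junk = 1, error.dist = 1, method = 1)
sum(combn(df, 2, prod))   # 69 two-way interaction terms over 21 factor pairs
sum(combn(df, 3, prod))   # 193 three-way interaction terms over 35 triples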

We also consider a full model, in which all possible interaction effects are estimated.

This corresponds to a seven-way interaction model, with 1023 degrees of freedom for

effects. This full model is not considered for interpretation, but only to illustrate how

much variation is explained by the second and third order models. The seven-way linear model consists of the three-way linear model plus 748 terms and their corresponding parameters. These terms correspond to 35 + 21 + 7 + 1 = 64 combinations of factors, relating to the four-, five-, six-, and seven-way interactions respectively.

To decide whether to consider a model with just two-way interactions, both two-way

and three-way interactions, or all two, three, …, seven-way interactions, F tests were

conducted. These tests compare whether the full model with all interactions up to order

7 could be simplified to:

 a third order model (main effects, 2 way interactions and 3 way interactions)

 a second order model (main effects and 2 way interactions)

 a first order model (main effects)

For the test comparing the full model and third order model, the responses coverage95

and SSE had p-values of approximately 0.0005, while the response width95 had a p-

value of 0.82. All other comparisons between the full model and first or second order models had p-values of essentially 0. The tests indicate that for all responses, interactions up to the third order were significant, and for two of the three responses, higher order interactions were also significant.
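These comparisons can be carried out with nested-model F tests in R, for example (using the placeholder results data frame as before):

# Sketch: nested F tests of first-, second-, third- and seventh-order models
fit1 <- lm(coverage95 ~ n + p + sigma + r + junk + error.dist + method,
           data = results)
fit2 <- update(fit1, . ~ .^2)     # add two-way interactions
fit3 <- update(fit1, . ~ .^3)     # add up to three-way interactions
fit7 <- update(fit1, . ~ .^7)     # full model, all interactions
anova(fit1, fit2, fit3, fit7)     # sequential F tests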

Although third order and some higher order terms are significant, the largest effects

are main effects and two-way interactions. This is indicated by Table 4.4, which gives 𝑅 2

values for first, second, third and seventh order models. There is a large increase in 𝑅 2

when moving from a main effect model to a second order model. Although 𝑅 2 values

continue to increase for higher order models, the increases are smaller and correspond

to a large number of degrees of freedom. For instance, there are 193 additional degrees

of freedom associated with the three-way interactions. Considering the relatively

smaller increases in 𝑅 2 , analysis in this thesis focused on main effects and two-way

interactions, which represented the largest amount of variation in the responses.

Response main effects 2fi 3fi 7fi

coverage95 0.525 0.838 0.886 0.907

width95 0.774 0.872 0.895 0.910

SSE 0.579 0.774 0.797 0.833

Table 4.4: The 𝑅 2 of interaction models.

4.6 The correlation between the responses

In this section, we examine the correlation between our seven responses. This indicates whether it is necessary to analyze each response individually: if the correlation between responses of a certain type is high, it is enough to present the analysis for one response of that type. Table 4.5 shows the correlations between the seven responses.

response coverage90 coverage95 coverage99 width90 width95 width99 SSE

coverage90 1 0.985 0.919 0.379 0.380 0.382 0.032

coverage95 0.985 1 0.970 0.389 0.389 0.391 0.069

coverage99 0.919 0.970 1 0.373 0.373 0.374 0.114

width90 0.379 0.389 0.373 1 0.999 0.999 0.799

width95 0.380 0.389 0.373 0.999 1 0.999 0.798

width99 0.382 0.391 0.374 0.999 0.999 1 0.797

SSE 0.032 0.069 0.114 0.799 0.798 0.797 1

Table 4.5: The correlation between responses.

The correlations among the coverage responses are very strong, and coverage95 has the largest correlation with the other two coverages; choosing coverage95 for analysis was therefore reasonable. The correlation among all three widths is 0.999, so it was enough to present the analysis for only one width, and choosing the “middle” width, i.e. width95, seems reasonable. There is a moderate correlation between coverage and width, which does not exceed 0.4. SSE and the coverage variables have almost no correlation, while width and SSE have a strong correlation of approximately 0.8.
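Table 4.5 can be computed directly from the response columns, assuming they sit together in the placeholder results data frame used throughout:

# Sketch: correlation matrix of the seven responses
round(cor(results[, c("coverage90", "coverage95", "coverage99",
                      "width90", "width95", "width99", "SSE")]), 3)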

Chapter 5
Conclusion and future work

Supervised learning is a fundamental statistical problem that uses training data to make inferences about an unknown population function that predicts an output from predictor variables. In this thesis, we study the supervised learning model BART, which models that population function to give a numeric value of the response. It is a development of a Bayesian “sum of trees” model in which each tree is constrained to have a small effect by a regularization prior. This ensemble model uses MCMC to generate samples from the posterior distribution, enabling the calculation of credible intervals.

This thesis focuses on examining the accuracy of BART credible intervals across various factors such as the sample size “n”, the noise standard deviation “σ”, and the error distribution. To conduct the simulation study, we selected a full factorial design, which varies the levels of all factors systematically and is easy to analyze. ANOVA was used to analyze the seven responses studied here: three coverages (90, 95, 99%), three widths (90, 95, 99%), and the predictive SSE.

The analysis in Chapter 4 shows that the credible intervals from CV.BART and Default.BART are similar over all factor combinations. The effect of the type of BART method on the coverage and SSE responses is very small, and there is no effect on width. Thus, we suggest using Default.BART, since it is much faster than CV.BART: recall that for 5-fold CV with 24 combinations of prior parameters, CV.BART requires 120 runs of BART.

A very important conclusion is that the coverage of BART credible intervals was significantly affected by the noise standard deviation σ and the sample size n. Coverage was poor at very low values of σ and very high values of n. This may be because of poor mixing of the MCMC iterations at large values of n: they do not explore the entire posterior distribution well. The experiment in Chapter 3 for selecting an adequate number of MCMC iterations supports this explanation; that small experiment shows that coverage gets better when BART is used on a smaller training data set.

We recommend increasing the number of MCMC iterations, which could improve the coverage of BART credible intervals. Although this will make Default.BART take more time, it will still be less computationally intensive than CV.BART.

This research suggests multiple directions for future work. We could further examine

the effects of various different factors on BART CI’s. The first suggestion is studying the

influence of a larger number of MCMC iterations such as 4000, 10000, and 20000

iterations. In Chapter 4 we saw that coverage was not always good with MCMC

iterations = 4000. Increasing the MCMC iterations to larger values such as 10000 and 20000 should give better coverage and might also improve CI width and SSE. A second factor that could be examined is the prior parameters. The idea of using the prior parameters as a factor of the designed experiment would be to see if there is a better default than the one CGM give. Each row of the design matrix would have a particular combination of prior parameters, and by analyzing the results we could determine the best combination.

It would also be interesting to conduct a similar designed experiment using a completely different model from BART to estimate credible intervals. Treed Gaussian processes (Gramacy & Lee, 2008), Bayesian generalized additive models (Hastie & Tibshirani, 2000) and random forests (Breiman, 2001) are all examples of statistical learning models that give credible intervals. We could then gauge the accuracy of these models’ intervals across all the factors of the designed experiment. There are also newer versions of random forests and of the BART model. Wager, Hastie, & Efron (2014) built a random forests model based on the jackknife and infinitesimal jackknife, which uses variance estimates for bagging proposed by Efron (1992, 2014). Pratola (2013) created a version of BART with better MCMC mixing than CGM; it should show better interval coverage because its MCMC algorithm mixes better than the implementation of BART used in this thesis.

Another direction for future work is studying the influence of the seven factors on BART credible intervals, but with different factor levels, for instance larger values of the sample size and predictor dimension. This would consume more time, but it could give a more general conclusion. It might also be worthwhile to consider an error distribution that differs more from the Normal distribution, for example a t distribution with very low degrees of freedom, such as 1, or a skewed distribution such as the gamma.

A fractional factorial design is another possibility for future work, instead of a full factorial design. In a full factorial design, all possible combinations of the factor levels are run; a fractional factorial design is a fraction of the full factorial design, reducing its size. Fractional factorials of 2-level designs are well studied. Fractional factorials for factors with 3 or more levels can be more challenging to construct, although optimal design could be used to construct them. Fractional factorial designs will have some aliasing of estimated effects. Since this thesis is concerned with estimating two-way interaction effects, we would need a fractional factorial design with no aliasing or confounding between the main effects and two-factor interactions.

In the experiments described in Chapter 4, we noticed that some jobs with the same combination of factors varied in execution time. Usually this variation is between 1 and 1.5 hours, though it is around three hours in some extreme cases. For instance, the 4th and 132nd parallel jobs have the same combination of factors, but they finished in 19 and 22 hours respectively. Therefore, it would probably be worthwhile to add a job’s execution time as a new response. The effect of the factors on execution time, and also the random variation in execution time, might give insight into the use of BART.

Bibliography

[1] Anand, K. (2015) An Expected Improvement Criterion for the Global Optimization of a Noisy Computer Simulator. Unpublished M.Sc. thesis, Acadia University.

[2] Breiman, L. (2001) Random Forests. Machine Learning, 45, 5–32.

[3] Chipman, H. (2011) “Friedman Function Machine”, http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:friedman.function.machine, retrieved October 25, 2014.

[4] Chipman, H., George, E. & McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4, 266-298.

[5] Chipman, H. (2011) “setbatch”, http://florence.acadiau.ca/collab/hugh_public/index.php?title=R:setbatch, retrieved October 25, 2014.

[6] Efron, B. (1992) Jackknife-After-Bootstrap Standard Errors and Influence Functions.

Journal of the Royal Statistical Society, Series B, 54, 83–127.

[7] Efron, B. (2014) Estimation and Accuracy After Model Selection. Journal of the

American Statistical Association, 109, 991-1007.

[8] Friedman, J.H. (2001) Greedy Function Approximation: A Gradient Boosting Machine.

Annals of Statistics, 29, 1189-1232.

[9] Friedman, J.H. (1991) Multivariate Adaptive Regression Splines. The Annals of

Statistics, 19, 1–67.

[10] Gramacy, R. B., & Lee, H. K. H., (2008). Bayesian Treed Gaussian Process Models

With an Application to Computer Modeling. Journal of the American Statistical

Association, 103, 1119-1130.

[11] Hastie, T. & Tibshirani, R., (2000) Bayesian Backfitting, Statistical Science. 15, 196–

223.

[12] James, G., Witten, D., Hastie, T. & Tibshirani, R., (2013) An Introduction to Statistical

Learning with Applications in R, Springer, New York.

[13] Montgomery, D. C. (2013) Design and Analysis of Experiments. (8th ed). USA: John

Wiley & Sons, Inc.

[14] Oehlert, G. W. (2010) A First Course in Design and Analysis of Experiments.

Retrieved from http://users.stat.umn.edu/~gary/book/fcdae.pdf

[15] Pratola, M. T. (2013) Efficient Metropolis-Hastings Proposal Mechanisms for Bayesian Regression Tree Models, arXiv e-print 1312.1395.

[16] Wager, S., Hastie, T. & Efron, B., (2014) Confidence Intervals for Random Forests:

The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research. 15,

1625−1651.

