
Journal of Computational Physics 241 (2013) 212–239


Multi-output separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification
Ilias Bilionis a,b, Nicholas Zabaras a,b,*, Bledar A. Konomi c, Guang Lin c

a Center for Applied Mathematics, 657 Frank H.T. Rhodes Hall, Cornell University, Ithaca, NY 14853, USA
b Materials Process Design and Control Laboratory, Sibley School of Mechanical and Aerospace Engineering, 101 Frank H.T. Rhodes Hall, Cornell University, Ithaca, NY 14853-3801, USA
c Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999, MSIN K7-90, Richland, WA 99352, USA

Article info

Article history:
Received 29 July 2012
Accepted 10 January 2013
Available online 7 February 2013

Keywords:
Bayesian
Gaussian process
Uncertainty quantification
Separable covariance function
Surrogate models
Stochastic partial differential equations
Kronecker product

Abstract

Computer codes simulating physical systems usually have responses that consist of a set of distinct outputs (e.g., velocity and pressure) that evolve also in space and time and depend on many unknown input parameters (e.g., physical constants, initial/boundary conditions, etc.). Furthermore, essential engineering procedures such as uncertainty quantification, inverse problems or design are notoriously difficult to carry out, mostly due to the limited simulations available. The aim of this work is to introduce a fully Bayesian approach for treating these problems which accounts for the uncertainty induced by the finite number of observations. Our model is built on a multi-dimensional Gaussian process that explicitly treats correlations between distinct output variables as well as space and/or time. The proper use of a separable covariance function enables us to describe the huge covariance matrix as a Kronecker product of smaller matrices, leading to efficient algorithms for carrying out inference and predictions. The novelty of this work is the recognition that the Gaussian process model defines a posterior probability measure on the function space of possible surrogates for the computer code, and the derivation of an algorithmic procedure that allows us to sample it efficiently. We demonstrate how the scheme can be used in uncertainty quantification tasks in order to obtain error bars for the statistics of interest that account for the finite number of observations.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction
It is very common for a research group or a company to spend years developing sophisticated software in order to simulate realistically important physical phenomena. However, carrying out tasks like uncertainty quantification, model calibration or design using the full-fledged model is, in all but the simplest cases, a daunting task, since a single simulation might take days or even weeks to complete, even with state-of-the-art modern computing systems. One then has to resort to computationally inexpensive surrogates of the computer code. The idea is to run the solver on a small, well-selected set of inputs and then use these data to learn the response surface. The surrogate surface may subsequently be used to carry out any of the computationally intensive engineering tasks.

* Corresponding author at: Materials Process Design and Control Laboratory, Sibley School of Mechanical and Aerospace Engineering, 101 Frank H.T. Rhodes Hall, Cornell University, Ithaca, NY 14853-3801, USA. Tel.: +1 607 255 9104.
E-mail address: zabaras@cornell.edu (N. Zabaras).
URL: http://mpdc.mae.cornell.edu/ (N. Zabaras).
http://dx.doi.org/10.1016/j.jcp.2013.01.011


The engineering community and, in particular, the researchers in uncertainty quantification, have been making extensive use of surrogates, even though most times it is not explicitly stated. One example is the so-called stochastic collocation (SC) method (see [1] for a classic illustration), in which the response is modeled using a generalized Polynomial Chaos (gPC) basis [2] whose coefficients are approximated via a collocation scheme based on a tensor product rule of one-dimensional Gauss quadrature points. Of course, such approaches scale extremely badly with the number of input dimensions, since the number of required collocation points explodes quite rapidly. A partial remedy of the situation can be found by using sparse grids (SG) based on the Smolyak algorithm [3], which have a weaker dependence on the dimensionality of the problem (see [4–6] and the adaptive version developed by our group [7]). Despite the rigorous convergence results of all these methods, their applicability to the situation of very limited observations is questionable. In that case, a statistical approach seems more suitable.
To the best of our knowledge, the first attempt of the statistics community to build a computer surrogate starts with the seminal papers of Currin et al. [8] and, independently, Sacks et al. [9], both making use of Gaussian processes. In the same spirit are the subsequent paper by Currin et al. [10] and the work of Welch et al. [11]. One of the first applications to uncertainty quantification can be found in O'Hagan et al. [12] and Oakley and O'Hagan [13]. The problem of model calibration is considered in [14,15]. Refs. [16,17] model non-stationary responses, while [18,15] (in quite different ways) attempt to capture correlations between multiple outputs. Following these trends, we will consider a Bayesian scheme based on Gaussian processes.
Despite the simplistic nature of the surrogate idea, there are still many hidden obstacles. Firstly, the question arises of how to choose the design of inputs on which the full model is to be evaluated. It is generally accepted that a good starting point is a Latin hypercube design [19], because of its great coverage properties. However, it is more than obvious that this choice should be influenced by the task in which the surrogate will be used. For example, in uncertainty quantification, it makes sense to bias the design using the probability density of the inputs [17] so that highly probable regions are adequately explored. Furthermore, it also pays off to consider a sequential design that depends on what is already known about the surface. Such a procedure, known as active learning, is particularly suitable for Bayesian surrogates, since one may use the predictive variance as an indicative measure of the informational content of particular points in the input space [20]. Secondly, computer codes solving partial differential equations usually have responses that are multi-output functions of space and/or time. One can hope that explicitly modeling this fact may squeeze more information out of the observations. Finally, it is essential to be able to say something about the epistemic uncertainty induced by the limited number of observations and, in particular, about its effect on the task for which the surrogate is constructed.
In this work, we are mainly concerned with the second and the third points of the last paragraph. Even though we make use of active learning ideas in a completely different context (see Section 2.3), we will be assuming that the observations are simply given to us. Our first goal is the construction of a multi-output Gaussian process model that explicitly treats space and time (Section 2.1). This model, in its full generality, is extremely computationally demanding. In Section 2.2, we carefully develop the so-called separable model, which allows us to express the inference and prediction tasks using Kronecker products of small matrices. This, in turn, results in highly efficient computations. Finally, in Section 2.3, we apply our scheme to uncertainty quantification tasks. Contrary to other approaches, we recognize the fact that the predictive distribution of the Gaussian process conditional on the observations actually defines a probability measure on the function space of possible surrogates. The weight of this measure corresponds to the epistemic uncertainty induced by the limited data. Extending ideas of [13], we develop a procedure that allows us to approximately sample this probability space. Each sample is a kernel approximation of a candidate surrogate for the code. In the same section, we show how we can semi-analytically compute all the statistics of the candidate surrogate up to second order. Higher-order statistics or even probability densities may be calculated quite effortlessly via a Monte Carlo procedure. By repeatedly sampling the posterior surrogate space, we are able to provide error bars for practically anything that depends on it. In Section 3.1, we apply our scheme to a stochastic ordinary differential equation with three distinct outputs and two random variables. The purpose of this example is to demonstrate the validity of our approach in a simple problem. In Section 3.2, we consider the problem of flow through random porous media. There, we model the velocity field and the pressure as functions of space and 50 random variables. In this more challenging problem, we clearly see the advantages of a fully Bayesian approach to uncertainty quantification. Namely, the ability to say something about the statistics of a 50-dimensional stochastic problem with as few as 24 observations is intriguing. Finally, we conclude by noting the limitations of the approach and discussing the many possibilities for extension.

2. Methodology

We are interested in modeling computer codes returning a multi-output response $y \in \mathbb{R}^q$, where $q > 0$ is the number of distinct outputs, given an input $\xi \in \mathcal{X}_\xi \subset \mathbb{R}^{k_\xi}$, defined over a spatial domain $\mathcal{X}_s \subset \mathbb{R}^{k_s}$ and/or a time interval $\mathcal{X}_t = [0, T]$, where $k_\xi$ is the number of inputs to the code, $k_s$ the spatial dimension (either 1, 2 or 3) and $T > 0$ is the time horizon. Even though for a given input $\xi$ the code reports the response simultaneously at various spatial and time locations, we will be modeling it as a function $f : \mathcal{X}_\xi \times \mathcal{X}_s \times \mathcal{X}_t \to \mathbb{R}^q$. As an example, consider the problem of two-dimensional flow through random porous media. The input variables $\xi$ would represent the permeability field, while for a fixed $\xi$, $f(\xi, x_s, t)$ would give the velocity components as well as the pressure at the spatial location $x_s$ and time $t$ (here $q = 3$).


2.1. Multi-output Gaussian process regression

Throughout this work, we will collectively denote the inputs of $f(\cdot)$ by

$$x = (\xi, x_s, t)$$

and its input domain by $\mathcal{X} = \mathcal{X}_\xi \times \mathcal{X}_s \times \mathcal{X}_t \subset \mathbb{R}^k$, where $k = k_\xi + k_s + 1$. Following [18], we model $f(\cdot)$ as a $q$-dimensional Gaussian process:

$$f(\cdot) \mid B, \Sigma, \theta \sim \mathcal{N}_q\big(m(\cdot; B),\, c(\cdot, \cdot; \theta)\,\Sigma\big), \qquad (1)$$

conditional on hyper-parameters $B \in \mathbb{R}^{m \times q}$, $\Sigma \in \mathbb{R}^{q \times q}$ and $\theta$, with $\theta$ denoting the parameters of the correlation function $c(\cdot, \cdot; \theta)$. Notice that Eq. (1) essentially means that

$$\mathbb{E}\big[f(x) \mid B, \Sigma, \theta\big] = m(x; B)$$

and

$$\mathrm{Cov}\big[f(x_1), f(x_2) \mid B, \Sigma, \theta\big] = c(x_1, x_2; \theta)\,\Sigma.$$

It is apparent that $\Sigma$ must be a symmetric, positive definite matrix and that it models the linear part of the correlations between the $q$ distinct outputs. The mean is chosen to be a generalized linear model:

$$m(x; B) = B^T h(x), \qquad (2)$$

where $h : \mathcal{X} \to \mathbb{R}^m$, $h = (h_1, \ldots, h_m)$, are regression functions shared by each of the components of $f(\cdot)$.


2.1.1. Prior distributions

We are assuming that, a priori, the pair $(B, \Sigma)$ and $\theta$ are independent:

$$p(B, \Sigma, \theta) = p(B, \Sigma)\, p(\theta).$$

For the moment, we leave the prior $p(\theta)$ of $\theta$ unspecified. For $B$ and $\Sigma$, we choose to use the non-informative prior [18]:

$$p(B, \Sigma) \propto |\Sigma|^{-\frac{q+1}{2}}. \qquad (3)$$

Notice that the prior corresponding to $\Sigma$ is a pathological case of an Inverse Wishart distribution. The real reason for this choice is that it leads to an analytically tractable model, since, as is shown in what follows, both $B$ and $\Sigma$ can be integrated out. Of course, one might choose an alternative prior distribution that reflects her own beliefs, at the cost of having to numerically sample these two parameters.
2.1.2. The likelihood of the data

Given a data set of $n$ observations with inputs $X = (x_1, \ldots, x_n)^T \in \mathbb{R}^{n \times k}$ and outputs $Y = (f(x_1), \ldots, f(x_n))^T \in \mathbb{R}^{n \times q}$, it follows from Eq. (1) that the likelihood is given by the matrix-normal (see [21]):

$$Y \mid B, \Sigma, \theta \sim \mathcal{N}_{n \times q}\big(HB, \Sigma, A\big), \qquad (4)$$

where

$$H = \big(h(x_1), \ldots, h(x_n)\big)^T \in \mathbb{R}^{n \times m} \qquad (5)$$

is the design matrix and

$$A = \big[c(x_i, x_j; \theta)\big] \in \mathbb{R}^{n \times n} \qquad (6)$$

is the usual covariance matrix.


2.1.3. The predictive distribution

Using the vectorization operation [22] to cast the matrix-normal distribution to a simple multivariate normal and some trivial identities (for example, see Chapter 2.3.1 of [23]), it is easy to show that the predictive distribution is given by:

$$f(\cdot) \mid B, \Sigma, \theta, Y \sim \mathcal{N}_q\big(m^*(\cdot; B),\, c^*(\cdot, \cdot; \theta)\,\Sigma\big), \qquad (7)$$

where

$$m^*(x; B) = B^T h(x) + (Y - HB)^T A^{-1} a(x),$$

$$c^*(x_1, x_2; \theta) = c(x_1, x_2; \theta) - a^T(x_1)\, A^{-1}\, a(x_2),$$

with $a(x) = \big(c(x, x_1; \theta), \ldots, c(x, x_n; \theta)\big)^T \in \mathbb{R}^n$. If $n > m + q$ (so that all distributions involved are proper), it is possible to integrate out both $B$ and $\Sigma$, resulting in the predictive distribution of $f(\cdot)$ conditional only on $\theta$. It is a $q$-variate Student's-t process with $n - m$ degrees of freedom (see [18]):

$$f(\cdot) \mid \theta, Y \sim \mathcal{T}_q\big(m^{**}(\cdot),\, c^{**}(\cdot, \cdot; \theta)\,\hat{\Sigma},\, n - m\big), \qquad (8)$$

where

$$m^{**}(x) = \hat{B}^T h(x) + (Y - H\hat{B})^T A^{-1} a(x),$$

$$c^{**}(x_1, x_2; \theta) = c^*(x_1, x_2; \theta) + \big(h(x_1) - H^T A^{-1} a(x_1)\big)^T \big(H^T A^{-1} H\big)^{-1} \big(h(x_2) - H^T A^{-1} a(x_2)\big),$$

$$\hat{B} = \big(H^T A^{-1} H\big)^{-1} H^T A^{-1} Y,$$

$$\hat{\Sigma} = \frac{1}{n - m}\,(Y - H\hat{B})^T A^{-1} (Y - H\hat{B}).$$
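To make the quantities entering Eqs. (7)–(8) concrete, the following minimal sketch (not the authors' code; the toy one-dimensional input, two-output data set and the single constant regression function $h(x) = 1$ are assumptions made purely for illustration) computes $\hat{B}$, $\hat{\Sigma}$ and the predictive mean $m^{**}(x)$ with NumPy.

```python
# Sketch of B_hat, Sigma_hat and the predictive mean m**(x) for a toy data set.
import numpy as np

def se_corr(X1, X2, r=0.3):
    """Squared-exponential correlation between two sets of 1D inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / r) ** 2)

X = np.linspace(0.0, 1.0, 8)                                          # n = 8 training inputs
Y = np.column_stack([np.sin(2 * np.pi * X), np.cos(2 * np.pi * X)])   # q = 2 outputs
n, q, m = X.size, Y.shape[1], 1

A = se_corr(X, X) + 1e-6 * np.eye(n)            # correlation matrix (small nugget for stability)
H = np.ones((n, m))                             # design matrix for h(x) = 1

Ainv_H = np.linalg.solve(A, H)
B_hat = np.linalg.solve(H.T @ Ainv_H, H.T @ np.linalg.solve(A, Y))    # (H^T A^-1 H)^-1 H^T A^-1 Y
resid = Y - H @ B_hat
Sigma_hat = resid.T @ np.linalg.solve(A, resid) / (n - m)             # q x q between-output covariance

X_star = np.linspace(0.0, 1.0, 50)              # test inputs
a = se_corr(X, X_star)                          # n x n_star cross-correlations
H_star = np.ones((X_star.size, m))
m_star = H_star @ B_hat + a.T @ np.linalg.solve(A, resid)             # n_star x q predictive mean
```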

2.1.4. The posterior distribution of θ

Let us conclude this section by giving the posterior distribution of the hyper-parameters of the correlation function. Using Bayes' theorem to write down the joint posterior of $B$, $\Sigma$ and $\theta$ conditional on $Y$ (combining Eqs. (3) and (4)) and integrating out $B$ and $\Sigma$, we obtain:

$$p(\theta \mid Y) \propto p(\theta)\, |A|^{-\frac{q}{2}}\, |H^T A^{-1} H|^{-\frac{q}{2}}\, |\hat{\Sigma}|^{-\frac{n - m}{2}}. \qquad (9)$$

2.2. The separable model

It is apparent that the above-mentioned model becomes computationally intractable quite fast, due to the fact that a high-dimensional and dense covariance matrix has to be inverted. An important simplification can be achieved if the spatial and the temporal points at which the output is observed remain fixed independent of the input $\xi$, and if we assume that the correlation function is separable, i.e.:

$$c(x_1, x_2; \theta) := c_\xi(\xi_1, \xi_2; \theta_\xi)\, c_s(x_{s,1}, x_{s,2}; \theta_s)\, c_t(t_1, t_2; \theta_t), \qquad (10)$$

where $c_\xi(\cdot, \cdot; \theta_\xi)$, $c_s(\cdot, \cdot; \theta_s)$ and $c_t(\cdot, \cdot; \theta_t)$ are the correlation functions of the parameter space, the spatial domain and the time domain, respectively, and $\theta = (\theta_\xi, \theta_s, \theta_t)$. We will now show that, under these assumptions, the covariance matrix can be written as the Kronecker product of smaller covariance matrices. Using this observation, it is possible to carry out inference and make predictions without ever forming the full covariance matrix. Finally, we also assume that the hyper-parameters of the various covariance functions are a priori independent:

$$p(\theta) = p(\theta_\xi)\, p(\theta_s)\, p(\theta_t). \qquad (11)$$

Remark 1. Another, more general model for the covariance function is the linear model of coregionalization (LMC) [24–26]. The more general nature of this covariance function does not necessarily make it more attractive for the applications of interest. The introduction of such models is usually associated with higher computational cost, which we try to avoid in this paper.
2.2.1. Organizing the inputs

Let us consider how the data are collected from a computer code. For a parameter $\xi \in \mathcal{X}_\xi$, the computer code returns the (multi-output) response on a given (a priori known) set of $n_s$ spatial points $\mathbf{X}_s = (x_{s,1}, \ldots, x_{s,n_s})^T \in \mathbb{R}^{n_s \times k_s}$, where $k_s = 1$, 2 or 3 is the spatial dimension ($\mathcal{X}_s \subset \mathbb{R}^{k_s}$), at each one of the $n_t$ time steps $\mathbf{X}_t = (t_1, \ldots, t_{n_t})^T \in \mathbb{R}^{n_t \times 1}$. That is, a single choice of the parameter $\xi$ generates a total of $n_s n_t$ training samples. Therefore, the response of the code is a matrix $Y_\xi \in \mathbb{R}^{n_s n_t \times q}$, which we call the output matrix. The output matrix is assembled as follows:

$$Y_\xi = \big(y_{\xi,1}^T, \ldots, y_{\xi,n_s}^T\big)^T,$$

where each $y_{\xi,i} \in \mathbb{R}^{n_t \times q}$ is the response at the spatial point $x_{s,i}$ at each time step, that is:

$$y_{\xi,i} = \big(y_{\xi,i,1}, \ldots, y_{\xi,i,n_t}\big)^T,$$

where $y_{\xi,i,j} \in \mathbb{R}^q$ is the response at the spatial point $x_{s,i}$ at time $t_j$:

$$y_{\xi,i,j} = \big(y_{\xi,i,j,1}, \ldots, y_{\xi,i,j,q}\big)^T,$$

where, of course, $y_{\xi,i,j,l}$ is the $l$th output of the response at $x_{s,i}$ at $t_j$.
2.2.2. Separating the covariance matrices

If we take a total of $n_\xi$ samples of the parameters:

$$\mathbf{X}_\xi = (\xi_1, \ldots, \xi_{n_\xi})^T \in \mathbb{R}^{n_\xi \times k_\xi},$$

where $k_\xi$ is the dimension of the input variables $\xi$ ($\mathcal{X}_\xi \subset \mathbb{R}^{k_\xi}$), we will have a total of

$$n = n_\xi n_s n_t$$

training samples for our model. The covariance matrix $A \in \mathbb{R}^{n \times n}$ can now be written as:

$$A = A_\xi \otimes A_s \otimes A_t, \qquad (12)$$

where $A_\xi \in \mathbb{R}^{n_\xi \times n_\xi}$ is the covariance matrix generated by $\mathbf{X}_\xi$ and $c_\xi(\cdot, \cdot; \theta_\xi)$, $A_s \in \mathbb{R}^{n_s \times n_s}$ is the covariance matrix generated by $\mathbf{X}_s$ and $c_s(\cdot, \cdot; \theta_s)$, $A_t \in \mathbb{R}^{n_t \times n_t}$ is the covariance matrix generated by $\mathbf{X}_t$ and $c_t(\cdot, \cdot; \theta_t)$, and $\otimes$ denotes the Kronecker product.
2.2.3. Separating the design matrices

Now let us consider the basis functions used in the generalized linear model of Eq. (2). Suppose we wish to use $m_t$ basis functions to capture the time dependence of the mean:

$$\mathcal{H}_t = \{h_{t,1}(t), \ldots, h_{t,m_t}(t)\}.$$

We also choose $m_s$ basis functions to capture the spatial dependence of the mean:

$$\mathcal{H}_s = \{h_{s,1}(x_s), \ldots, h_{s,m_s}(x_s)\}.$$

These can be, for example, the finite element basis of the model or any other suitable functions. Finally, we choose $m_\xi$ basis functions to capture the dependence of the mean on the stochastic parameter:

$$\mathcal{H}_\xi = \{h_{\xi,1}(\xi), \ldots, h_{\xi,m_\xi}(\xi)\}.$$

For example, in an uncertainty quantification setting these could be a gPC basis as induced by the probability distribution of the $\xi$'s. The global basis functions are formed from the tensor product:

$$\mathcal{H} = \mathcal{H}_\xi \otimes \mathcal{H}_s \otimes \mathcal{H}_t.$$

Thus, the total number of basis functions present in the model is:

$$m = m_\xi m_s m_t.$$

In order to have a consistent enumeration, we proceed as follows:

$$h_1(x) := h_{\xi,1}(\xi)\, h_{s,1}(x_s)\, h_{t,1}(t),$$
$$h_2(x) := h_{\xi,1}(\xi)\, h_{s,1}(x_s)\, h_{t,2}(t),$$
$$\vdots$$
$$h_{m_t}(x) := h_{\xi,1}(\xi)\, h_{s,1}(x_s)\, h_{t,m_t}(t),$$
$$h_{m_t + 1}(x) := h_{\xi,1}(\xi)\, h_{s,2}(x_s)\, h_{t,1}(t),$$
$$\vdots$$
$$h_{m_s m_t (i-1) + m_t (j-1) + l}(x) := h_{\xi,i}(\xi)\, h_{s,j}(x_s)\, h_{t,l}(t),$$

where, at the last line, $i = 1, \ldots, m_\xi$, $j = 1, \ldots, m_s$ and $l = 1, \ldots, m_t$. With this enumeration, the design matrix $H$ defined in Eq. (5) breaks down as:

$$H = H_\xi \otimes H_s \otimes H_t, \qquad (13)$$

where $H_\xi \in \mathbb{R}^{n_\xi \times m_\xi}$ is given by $H_{\xi,ij} = h_{\xi,j}(\xi_i)$, $H_s \in \mathbb{R}^{n_s \times m_s}$ by $H_{s,ij} = h_{s,j}(x_{s,i})$, and $H_t \in \mathbb{R}^{n_t \times m_t}$ by $H_{t,ij} = h_{t,j}(t_i)$.
2.2.4. Efficient predictions and inference

Given a set of hyper-parameters $\theta$, all the statistics that are required to make predictions or to evaluate the posterior $p(\theta \mid Y)$ can be calculated efficiently by exploiting the properties of the Kronecker product. Its most important property is that various factorizations (e.g., Cholesky or QR) of a matrix formed by Kronecker products are given by the Kronecker products of the factorizations of the individual matrices [27]. Furthermore, matrix–vector multiplications, as well as solving linear systems when the matrices forming the Kronecker product are triangular, can be carried out without additional memory. Therefore, working consistently with the Cholesky decompositions of the covariance matrices leads to very efficient computations. All the linear algebra details pertaining to efficient computations with Kronecker products are documented in Appendices A and B.
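As a concrete illustration of the two Kronecker identities used above, the following sketch (a toy example under assumed random SPD factors, not the code of Appendices A and B) verifies that the Cholesky factor of $A_\xi \otimes A_s \otimes A_t$ is the Kronecker product of the individual Cholesky factors, and that matrix–vector products can be evaluated through tensor contractions without ever forming the full $n \times n$ matrix.

```python
# Sketch of the Kronecker identities exploited by the separable model (Eq. (12)).
import numpy as np

rng = np.random.default_rng(0)

def random_spd(k):
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

A_xi, A_s, A_t = random_spd(4), random_spd(3), random_spd(2)   # n = 4 * 3 * 2 = 24

# (i) Cholesky factor of the Kronecker product = Kronecker product of Cholesky factors
L = np.kron(np.kron(np.linalg.cholesky(A_xi), np.linalg.cholesky(A_s)),
            np.linalg.cholesky(A_t))
L_full = np.linalg.cholesky(np.kron(np.kron(A_xi, A_s), A_t))
assert np.allclose(L, L_full)

# (ii) (A_xi ⊗ A_s ⊗ A_t) y via tensor contractions, never forming the n x n matrix
y = rng.standard_normal(4 * 3 * 2)
Z = y.reshape(4, 3, 2)
out = np.einsum('ia,jb,kc,abc->ijk', A_xi, A_s, A_t, Z).reshape(-1)
assert np.allclose(out, np.kron(np.kron(A_xi, A_s), A_t) @ y)
```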
The posterior of the hyper-parameters (see Eq. (9)) can now be sampled efficiently via Gibbs sampling [28], as described in Algorithm 1. Each one of the steps can be carried out using MCMC [29,30]. The prior distributions as well as the proposal distributions for $\theta_\xi$, $\theta_s$ and $\theta_t$ are given in the next paragraph.
Algorithm 1. Sampling the posterior distribution

Require: Observed data $X$ and $Y$ and initial $\theta = (\theta_\xi, \theta_s, \theta_t)$.
Ensure: Repeated application ensures that $\theta = (\theta_\xi, \theta_s, \theta_t)$ is a sample from Eq. (9).

Sample:
$$\theta_\xi \sim p(\theta_\xi \mid Y, \theta_s, \theta_t) \propto p(\theta_\xi)\, |A_\xi|^{-\frac{qn}{2 n_\xi}}\, |H_\xi^T A_\xi^{-1} H_\xi|^{-\frac{qm}{2 m_\xi}}\, |\hat{\Sigma}|^{-\frac{n-m}{2}}.$$

Sample:
$$\theta_s \sim p(\theta_s \mid Y, \theta_\xi, \theta_t) \propto p(\theta_s)\, |A_s|^{-\frac{qn}{2 n_s}}\, |H_s^T A_s^{-1} H_s|^{-\frac{qm}{2 m_s}}\, |\hat{\Sigma}|^{-\frac{n-m}{2}}.$$

Sample:
$$\theta_t \sim p(\theta_t \mid Y, \theta_\xi, \theta_s) \propto p(\theta_t)\, |A_t|^{-\frac{qn}{2 n_t}}\, |H_t^T A_t^{-1} H_t|^{-\frac{qm}{2 m_t}}\, |\hat{\Sigma}|^{-\frac{n-m}{2}}.$$

2.2.5. Choice of the covariance function

The separable model described in this section requires the specification of three covariance functions, $c_\xi(\cdot, \cdot; \theta_\xi)$, $c_s(\cdot, \cdot; \theta_s)$ and $c_t(\cdot, \cdot; \theta_t)$. In this work, we choose to work with:

$$c_\xi(\xi_{n_1}, \xi_{n_2}; \theta_\xi) := \exp\left\{ -\frac{1}{2} \sum_{k=1}^{k_\xi} \frac{(\xi_{n_1,k} - \xi_{n_2,k})^2}{r_{\xi,k}^2} \right\} + g_\xi\, \delta_{n_1 n_2},$$

$$c_s(x_{s,n_1}, x_{s,n_2}; \theta_s) := \exp\left\{ -\frac{1}{2} \sum_{k=1}^{k_s} \frac{(x_{s,n_1,k} - x_{s,n_2,k})^2}{r_{s,k}^2} \right\} + g_s\, \delta_{n_1 n_2},$$

$$c_t(t_{n_1}, t_{n_2}; \theta_t) := \exp\left\{ -\frac{1}{2} \frac{(t_{n_1} - t_{n_2})^2}{r_t^2} \right\} + g_t\, \delta_{n_1 n_2},$$

with the hyper-parameters completely specified by:

$$\theta_\xi = (r_\xi, g_\xi), \quad \theta_s = (r_s, g_s) \quad \text{and} \quad \theta_t = (r_t, g_t).$$

The core part of the covariance functions is based on the Squared Exponential (SE) kernel, with the $r_a$, $a = \xi, s, t$, hyper-parameters being interpreted as the length scales of each input dimension. The $g_a$, $a = \xi, s, t$, are termed nuggets. The main purpose of the nuggets is to ensure the well-conditioning of the covariance matrices involved in the calculations, and they are expected to be typically small (of the order of $10^{-2}$). By looking at the full covariance function of the separable model and ignoring second-order products of the nuggets, we can see that $g_\xi + g_s + g_t$ can be interpreted as the measurement noise. Apart from improving the stability of the computations, one can argue that the presence of the nugget can also lead to better predictive accuracy [31].
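A minimal sketch of these kernels follows; the particular length scales, nugget values and design sizes are placeholder assumptions, not values used in the paper.

```python
# Sketch of the anisotropic SE kernels with a nugget g on the diagonal (Section 2.2.5).
import numpy as np

def se_kernel(X1, X2, r, g=0.0):
    """X1: (n1, k), X2: (n2, k), r: (k,) length scales, g: nugget."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) / r) ** 2
    C = np.exp(-0.5 * d2.sum(axis=-1))
    if X1 is X2:
        C = C + g * np.eye(X1.shape[0])     # nugget only on the diagonal of a square block
    return C

# example: build A_xi, A_s, A_t for a small (made-up) design
X_xi = np.random.rand(10, 2)                 # 10 stochastic inputs, k_xi = 2
X_s  = np.random.rand(16, 2)                 # 16 spatial points,    k_s  = 2
X_t  = np.linspace(0.0, 1.0, 5)[:, None]     # 5 time steps
A_xi = se_kernel(X_xi, X_xi, r=np.array([0.3, 0.3]), g=1e-2)
A_s  = se_kernel(X_s,  X_s,  r=np.array([0.1, 0.1]), g=1e-2)
A_t  = se_kernel(X_t,  X_t,  r=np.array([0.2]),      g=1e-2)
```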


The priors of the hyper-parameters $r_\xi$, $g_\xi$, $r_s$, $g_s$, $r_t$ and $g_t$ should be chosen to represent any prior knowledge about the computer code that might be available. In order to ensure positive support, we make the common choice, for $a = \xi, s$ and $t$:

$$p(r_{a,k} \mid c_a) = \mathcal{E}(r_{a,k} \mid c_a), \qquad (14)$$

$$p(g_a \mid \zeta_a) = \mathcal{E}(g_a \mid \zeta_a), \qquad (15)$$

where $\mathcal{E}(\cdot \mid \lambda)$ denotes the probability density of the exponential distribution with parameter $\lambda > 0$.

For the proposals required by the MCMC sampling schemes described in the previous section, we use a log-normal random walk for all hyper-parameters (again, because they are all positive). The step size of the random walk is selected so that the observed acceptance ratio of the MCMC is between 30% and 60%. The particular values of $c_a$ and $\zeta_a$ for $a = \xi, s$ and $t$ are specified in each numerical example.
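The following sketch shows what one Metropolis step with such a log-normal random-walk proposal could look like for a positive hyper-parameter vector; the `log_post` callable stands in for the relevant conditional of Algorithm 1 and is an assumption of the sketch, not the authors' implementation.

```python
# Sketch of a log-normal random-walk Metropolis step for positive hyper-parameters.
import numpy as np

def mh_step(theta, log_post, step=0.01, rng=np.random.default_rng()):
    """theta: positive hyper-parameter vector, e.g. (r_xi, g_xi)."""
    prop = theta * np.exp(step * rng.standard_normal(theta.shape))   # log-normal walk
    # the proposal is asymmetric in theta; the Jacobian contributes log(prop) - log(theta)
    log_alpha = (log_post(prop) - log_post(theta)
                 + np.sum(np.log(prop)) - np.sum(np.log(theta)))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True      # accepted
    return theta, False        # rejected
```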
2.3. Application to uncertainty quantification

In uncertainty quantification tasks, one specifies a probability density on the inputs $\xi$, $p(\xi)$, and tries to quantify the probability measure induced by it on the output field. In this work, we quantify this uncertainty by interrogating the surrogate built using the Gaussian process model introduced in the previous sections (see Eq. (8)). The whole process is complicated by the fact that our model in reality defines a probability measure over the function space of potential surrogates. This probability measure essentially quantifies the lack of information regarding the real response due to the finite number of observations. In a fully Bayesian setting, this probability measure will be reflected as a probability measure on the predicted statistics (e.g., mean, variance, PDFs, etc.). To the best of our knowledge, such ideas were introduced for the first time in the statistics literature [13], but were largely ignored by the UQ community. Inspired by the above-mentioned work, we describe in this section how our model can be used to essentially sample the posterior distribution of the induced statistics. The procedure is conceptually simple and is described in Algorithm 2. The key component of this algorithm is the ability to sample a response surface based on Eq. (8) that can be described analytically via a kernel representation. This is achieved through the generalization of the techniques discussed in [13]. The final component of the algorithm has to do with the evaluation of the statistics of interest induced by this response surface. We will show that our model allows for semi-analytic calculation of all statistics up to second order. Higher-order statistics, or full probability densities, have to be obtained using Monte Carlo techniques on the sampled surrogate surface.
Algorithm 2. Sampling the posterior of the statistics. By repeatedly calling this algorithm, error bars for the desired statistics may be obtained.

Require: Observed data $X$ and $Y$ and $\theta_0$ sampled from Eq. (9).
Ensure: $S$ is a sample from the statistic of interest.

Sample a new $\theta_1$ following the Gibbs procedure given in Section 2.2.
Sample a response surface using Algorithm 3.
Interrogate the obtained response surface (analytically or via MC) to obtain $S$.

2.3.1. Sampling a response surface

In order to obtain an analytical representation of the response surface, Ref. [13] suggests selecting a space-filling design of the input variables, using Eq. (8) to sample the outputs, and then augmenting the original data set with the new observations to derive an updated Eq. (8) with reduced variance. The mean of the updated posterior predictive distribution is an analytic function that can be thought of (if its variance is sufficiently small) as a sample from the predictive probability measure. Several problems arise if one follows this approach. To start with, one does not know a priori how many design points are required in order to reduce the predictive variance to a pre-specified tolerance. Furthermore, design points must be well placed and far away from the initial observations in order to avoid numerical instabilities. Finally, and this is particular to our model, including design points in all sets of different inputs ($\xi$, $x_s$ and $t$) breaks down the Kronecker product representation of the covariance and design matrices which, in turn, leads to a tremendous computational burden. In order to avoid the latter of these conundrums, we choose to work with the same spatial and time points as the ones included in the original data (namely $\mathbf{X}_s$ and $\mathbf{X}_t$). This approximation will ignore only a, hopefully small, part of the epistemic uncertainty due to the finite number of observations. That is, we only choose design points in $\mathcal{X}_\xi$. The former two problems are addressed by employing a sequential strategy in which the $\xi$'s are selected one by one by maximizing the predictive variance, until a specific tolerance is achieved. This approach is guaranteed to produce a space-filling design that is well separated from the original observations. In addition, the only covariance matrix that needs to be updated is the one pertaining to $\xi$. In Appendix C, we describe how the Cholesky decomposition of the covariance matrix, as well as solutions to the relevant linear systems, can be updated in quadratic time when new design points are added.


Consider $\theta$ fixed and let $\mathbf{X}_{\xi,d} \in \mathbb{R}^{(n_\xi + d) \times k_\xi}$ and $Y_d \in \mathbb{R}^{(n_\xi + d) n_s n_t \times q}$ denote the set of $\xi$'s and the corresponding outputs when $d$ design points have been observed. That is,

$$\mathbf{X}_{\xi,d} = \begin{pmatrix} \mathbf{X}_\xi \\ \xi_{n_\xi + 1} \\ \vdots \\ \xi_{n_\xi + d} \end{pmatrix}.$$

For $d = 0$, we obtain the observed data:

$$\mathbf{X}_{\xi,0} = \mathbf{X}_\xi \quad \text{and} \quad Y_0 = Y.$$

Define $\hat{B}_d \in \mathbb{R}^{m \times q}$, $H_{\xi,d} \in \mathbb{R}^{(n_\xi + d) \times m_\xi}$ and $A_{\xi,d} \in \mathbb{R}^{(n_\xi + d) \times (n_\xi + d)}$ to be the weight, design and covariance matrices pertaining to $\xi$, respectively, when $\mathbf{X}_{\xi,d}$ and $Y_d$ have been observed. In order to avoid cluttering the final formulas, let us also define:

$$a_{\xi,d}(\xi) = \begin{pmatrix} a_\xi(\xi) \\ c_\xi(\xi_{n_\xi + 1}, \xi; \theta) \\ \vdots \\ c_\xi(\xi_{n_\xi + d}, \xi; \theta) \end{pmatrix} \in \mathbb{R}^{n_\xi + d},$$

$$A_d = A_{\xi,d} \otimes A_s \otimes A_t$$

and

$$H_d = H_{\xi,d} \otimes H_s \otimes H_t.$$
Now, let $\xi \in \mathcal{X}_\xi$ and let $Z(\xi) \in \mathbb{R}^{n_s n_t \times q}$ be the output at $\xi$ and all spatial and time points in $\mathbf{X}_s$ and $\mathbf{X}_t$. By using Bayes' theorem and Eq. (8), we can show that:

$$Z(\xi) \mid Y_d, \theta \sim \mathcal{T}_{n_s n_t \times q}\big(M_d(\xi),\, C_d(\xi),\, \hat{\Sigma},\, n_d - m\big), \qquad (16)$$

where $n_d = (n_\xi + d)\, n_s n_t$ and the mean is given by:

$$M_d(\xi) = \big(h_\xi(\xi)^T \otimes H_s \otimes H_t\big) \hat{B}_d + \big(a_{\xi,d}(\xi)^T \otimes A_s \otimes A_t\big) A_d^{-1} \big(Y_d - H_d \hat{B}_d\big) \qquad (17)$$

and the covariance matrix by:

$$C_d(\xi) = c_\xi(\xi, \xi; \theta)\, A_s \otimes A_t - \big(a_{\xi,d}(\xi)^T \otimes A_s \otimes A_t\big) A_d^{-1} \big(a_{\xi,d}(\xi) \otimes A_s \otimes A_t\big)
+ \Big[\big(h_\xi(\xi)^T \otimes H_s \otimes H_t\big) - \big(a_{\xi,d}(\xi)^T \otimes A_s \otimes A_t\big) A_d^{-1} H_d\Big] \big(H_d^T A_d^{-1} H_d\big)^{-1} \Big[\big(h_\xi(\xi)^T \otimes H_s \otimes H_t\big) - \big(a_{\xi,d}(\xi)^T \otimes A_s \otimes A_t\big) A_d^{-1} H_d\Big]^T. \qquad (18)$$

In order to sample Eq. (16), we need to compute the Cholesky decomposition of $C_d(\xi)$. This is not trivial, since $C_d(\xi)$ does not have a particular structure. In the numerical examples, and in particular for the porous flow problem considered in Section 3.2, this matrix turned out to be extremely ill-conditioned. Even though, theoretically, $C_d(\xi)$ is guaranteed to be symmetric positive definite, numerically it must be treated as positive semi-definite. For this reason, one has to use a low-rank approximation of $C_d(\xi)$ using the pivoted Cholesky factorization [32]. This can be carried out using the LAPACK routine dpstrf. The tolerance we used for this approximation was $10^{-3}$ for all numerical examples we considered. We found no observable difference between samples obtained with the normal Cholesky factorization and this approach. Finally, let us mention that a scalar quantity that is associated directly with the uncertainty pertaining to $\xi$ is given by:

$$\sigma_d^2(\xi) = \frac{\mathrm{tr}\big(C_d(\xi)\big)\, \mathrm{tr}\big(\hat{\Sigma}\big)}{n_s n_t q}\, p(\xi). \qquad (19)$$

This is the sum of the variances of all outputs at all different spatial and time points, weighted by the input probability distribution $p(\xi)$. The idea is to sequentially augment the data set by including the design points from a dense subset $\widetilde{\mathcal{X}}_\xi$ of $\mathcal{X}_\xi$ that maximize Eq. (19), until a pre-specified tolerance is achieved. At that point, the joint predictive mean given by Eq. (17) may be used as an analytic sample surrogate surface. In general, we would like to evaluate the response surface on a denser spatial design $\mathbf{X}_s^* \in \mathbb{R}^{n_s^* \times k_s}$ and/or more time steps $\mathbf{X}_t^* \in \mathbb{R}^{n_t^*}$. The joint predictive mean at those points is given by:

$$M_d^*(\xi) = \big(h_\xi(\xi)^T \otimes H_s^* \otimes H_t^*\big) \hat{B}_d + \big(a_{\xi,d}(\xi)^T \otimes A_s^{*T} \otimes A_t^{*T}\big) A_d^{-1} \big(Y_d - H_d \hat{B}_d\big), \qquad (20)$$





where $H_s^* \in \mathbb{R}^{n_s^* \times m_s}$ and $H_t^* \in \mathbb{R}^{n_t^* \times m_t}$ are the design matrices that pertain to the test spatial and time points, respectively, while $A_s^* \in \mathbb{R}^{n_s \times n_s^*}$ and $A_t^* \in \mathbb{R}^{n_t \times n_t^*}$ are the corresponding cross-covariance matrices. We identify Eq. (20) as a sample response surface from the function space of possible surrogates. The complete algorithmic details are given in Algorithm 3.
Algorithm 3. Sample a response surface

Require: Observed data $X$ and $Y$, $\theta$ sampled from Eq. (9), a dense set of design points $\widetilde{\mathcal{X}}_\xi = \{\tilde{\xi}_1, \ldots, \tilde{\xi}_{\tilde{n}_\xi}\}$, the desired final tolerance $\delta > 0$ and dense spatial and time designs $\mathbf{X}_s^* \in \mathbb{R}^{n_s^* \times k_s}$ and $\mathbf{X}_t^* \in \mathbb{R}^{n_t^*}$ on which we wish to make predictions.
Ensure: After $d \geq 1$ steps, the uncertainty of Eq. (16), as captured by Eq. (19), is less than $\delta$, and $M_d^*(\xi)$ given by Eq. (20) can be used as an analytic representation of the sampled response surface.

Initialize $d \leftarrow 0$.
repeat
  Find the next design point:
  $$\xi_{n_\xi + d + 1} = \arg\max_{\xi \in \widetilde{\mathcal{X}}_\xi} \sigma_d^2(\xi). \qquad (21)$$
  Sample $Z(\xi_{n_\xi + d + 1})$ from Eq. (16).
  Augment the set of observations with the pair $\big(\xi_{n_\xi + d + 1}, Z(\xi_{n_\xi + d + 1})\big)$.
  $d \leftarrow d + 1$.
until $\sigma_d^2(\xi_{n_\xi + d}) < \delta$.
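A conceptual sketch of this loop is given below; `predict_var`, `sample_output` and `augment` are hypothetical callables standing in for Eq. (19), Eq. (16) and the rank-one updates of Appendix C, respectively, and are assumptions of this sketch.

```python
# Conceptual sketch of Algorithm 3 (greedy augmentation until the variance drops below tol).
import numpy as np

def sample_response_surface(Xi_candidates, predict_var, sample_output, augment, tol):
    """Augment the design until the weighted predictive variance (Eq. (19)) is below tol."""
    d = 0
    while True:
        s2 = np.array([predict_var(xi, d) for xi in Xi_candidates])   # sigma_d^2 on the dense set
        j = int(np.argmax(s2))
        if s2[j] < tol:
            break
        xi_new = Xi_candidates[j]
        Z_new = sample_output(xi_new, d)     # draw Z(xi_new) from the Student-t of Eq. (16)
        augment(xi_new, Z_new)               # update A_{xi,d}, H_{xi,d}, Y_d with the new pair
        d += 1
    return d                                 # M_d^*(xi) of Eq. (20) is now the sampled surrogate
```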

2.3.2. Analytic first-order and second-order statistics

In applications, we are usually interested in first- and second-order statistics. We can obtain a sample of the mean response by integrating out $\xi$ from Eq. (20):

$$\bar{M}_d^* := \int M_d^*(\xi)\, p(\xi)\, d\xi. \qquad (22)$$

It can be easily shown that:

$$\bar{M}_d^* = \big(\epsilon_h^T \otimes H_s^* \otimes H_t^*\big) \hat{B}_d + \big(\epsilon_{a,d}^T \otimes A_s^{*T} \otimes A_t^{*T}\big) A_d^{-1} \big(Y_d - H_d \hat{B}_d\big), \qquad (23)$$

where

$$\epsilon_h = \int h_\xi(\xi)\, p(\xi)\, d\xi \quad \text{and} \quad \epsilon_{a,d} = \int a_{\xi,d}(\xi)\, p(\xi)\, d\xi.$$

Now, let $i, j \in \{1, \ldots, q\}$ be two arbitrary outputs. The covariance matrix between all possible spatial and time test points is defined by:

$$C_d^{ij,*} := \int \big(M_{d,i}^*(\xi) - \bar{M}_{d,i}^*\big)\big(M_{d,j}^*(\xi) - \bar{M}_{d,j}^*\big)^T p(\xi)\, d\xi, \qquad (24)$$

where the subscripts $i$ and $j$ select columns of the associated matrices. This matrix contains all second-order statistics of the surrogate. For example, the variance of each output $i = 1, \ldots, q$ on all spatial and time locations $\mathbf{X}_s^*$ and $\mathbf{X}_t^*$, respectively, is given by:

$$V_d^{i,*} := \mathrm{diag}\big(C_d^{ii,*}\big). \qquad (25)$$

It can be shown, using tensorial notation, that $C_d^{ij,*}$ may be evaluated by:

$$C_d^{ij,*} = \big[(H_s^* \otimes H_t^*) \hat{B}_d^i\big]\, \epsilon_{hh}\, \big[(H_s^* \otimes H_t^*) \hat{B}_d^j\big]^T
+ \big[(A_s^{*T} \otimes A_t^{*T}) \tilde{Y}_d^i\big]\, \epsilon_{aa,d}\, \big[(A_s^{*T} \otimes A_t^{*T}) \tilde{Y}_d^j\big]^T
+ \big[(H_s^* \otimes H_t^*) \hat{B}_d^i\big]\, \epsilon_{ha,d}\, \big[(A_s^{*T} \otimes A_t^{*T}) \tilde{Y}_d^j\big]^T
+ \big[(A_s^{*T} \otimes A_t^{*T}) \tilde{Y}_d^i\big]\, \epsilon_{ah,d}\, \big[(H_s^* \otimes H_t^*) \hat{B}_d^j\big]^T, \qquad (26)$$

where $\hat{B}_d^i \in \mathbb{R}^{m_s m_t \times m_\xi}$ is such that $\mathrm{vec}(\hat{B}_d^i) = \hat{B}_{d,i}$ (i.e., the $i$th column of $\hat{B}_d$), $\tilde{Y}_d \in \mathbb{R}^{n_d \times q}$ is defined by:

$$\tilde{Y}_d = A_d^{-1}\big(Y_d - H_d \hat{B}_d\big),$$

and $\tilde{Y}_d^i \in \mathbb{R}^{n_s n_t \times (n_\xi + d)}$ is such that $\mathrm{vec}(\tilde{Y}_d^i) = \tilde{Y}_{d,i}$.

[Fig. 1. KO-2: the thick blue line is the mean of the statistic predicted by our model, while the gray area provides 95% confidence intervals. The first row ((a) and (b)) corresponds to the mean of the response as captured with $n_\xi = 100$. The second ((c) and (d)) and the last ((e) and (f)) rows show the variance of the response for $n_\xi = 100$ and $n_\xi = 150$, respectively.]

The matrix $\epsilon_{hh} \in \mathbb{R}^{m_\xi \times m_\xi}$ is given by:

$$\epsilon_{hh} = \int \big(h_\xi(\xi) - \epsilon_h\big)\big(h_\xi(\xi) - \epsilon_h\big)^T p(\xi)\, d\xi, \qquad (27)$$

$\epsilon_{aa,d} \in \mathbb{R}^{(n_\xi + d) \times (n_\xi + d)}$ by:

$$\epsilon_{aa,d} = \int \big(a_{\xi,d}(\xi) - \epsilon_{a,d}\big)\big(a_{\xi,d}(\xi) - \epsilon_{a,d}\big)^T p(\xi)\, d\xi, \qquad (28)$$

$\epsilon_{ha,d} \in \mathbb{R}^{m_\xi \times (n_\xi + d)}$ by:

$$\epsilon_{ha,d} = \int \big(h_\xi(\xi) - \epsilon_h\big)\big(a_{\xi,d}(\xi) - \epsilon_{a,d}\big)^T p(\xi)\, d\xi, \qquad (29)$$

and $\epsilon_{ah,d} = \epsilon_{ha,d}^T$.
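When these integrals are not available in closed form (e.g., for bases that are not gPC polynomials of the input density), they can be approximated by Monte Carlo over $p(\xi)$. The sketch below illustrates this; the callables `h_xi`, `a_xi_d` and `sample_p` are assumptions of the sketch standing in for $h_\xi(\cdot)$, $a_{\xi,d}(\cdot)$ and a sampler of $p(\xi)$.

```python
# Sketch: Monte Carlo estimates of the first and second moments in Eqs. (23), (27)-(29).
import numpy as np

def moment_integrals(h_xi, a_xi_d, sample_p, n_mc=10000):
    xis = [sample_p() for _ in range(n_mc)]
    H = np.array([h_xi(x) for x in xis])          # n_mc x m_xi
    A = np.array([a_xi_d(x) for x in xis])        # n_mc x (n_xi + d)
    eps_h, eps_a = H.mean(axis=0), A.mean(axis=0)
    Hc, Ac = H - eps_h, A - eps_a
    eps_hh = Hc.T @ Hc / n_mc                     # approximates Eq. (27)
    eps_aa = Ac.T @ Ac / n_mc                     # approximates Eq. (28)
    eps_ha = Hc.T @ Ac / n_mc                     # approximates Eq. (29)
    return eps_h, eps_a, eps_hh, eps_aa, eps_ha
```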

[Fig. 2. KO-2: the first column corresponds to $n_\xi = 70$, the second to $n_\xi = 100$ and the third to $n_\xi = 150$. Each row depicts the PDF of $y_2(t)$ for times $t = 4, 6, 8, 10$. The thick blue line is the mean of the PDF predicted by our model, while the gray area provides 95% confidence intervals.]

[Fig. 3. KO-2: the first column corresponds to $n_\xi = 70$, the second to $n_\xi = 100$ and the third to $n_\xi = 150$. Each row depicts the joint PDF of $y_2(t)$ and $y_3(t)$ for times $t = 4, 6, 8, 10$.]

3. Numerical examples

3.1. Kraichnan–Orszag three-mode problem

Consider the system of ordinary differential equations [33]:

[Fig. 4. KO-2: the first column corresponds to $n_\xi = 70$, the second to $n_\xi = 100$ and the third to $n_\xi = 150$. Each row depicts the predictive variance of the joint PDF of $y_2(t)$ and $y_3(t)$ for times $t = 4, 6, 8, 10$.]

$$\frac{dy_1}{dt} = y_1 y_3,$$
$$\frac{dy_2}{dt} = -y_2 y_3,$$

[Fig. 5. Porous flow: samples drawn from the posterior of the hyper-parameters ((a) for $r_\xi$, (b) for $r_s$ and (c) for the nuggets) for the case of 120 observations. It is apparent that the spatial length scales are clearly identified, while the hyper-parameters of the stochastic variables have a much wider posterior. Of course, this is expected given the limited number of observations.]

$$\frac{dy_3}{dt} = -y_1^2 + y_2^2,$$

subject to random initial conditions at $t = 0$. The stochastic initial conditions are defined by:

$$y_1(0) = 1, \quad y_2(0) = 0.1\,\xi_1, \quad y_3(0) = \xi_2,$$

where

$$\xi_i \sim \mathcal{U}(-1, 1), \quad i = 1, 2.$$
This dynamical system is particularly interesting because the response has a discontinuity at $\xi_1 = 0$. The deterministic solver we use is a 4th-order Runge–Kutta method, as implemented in the GNU Scientific Library [34].
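For reference, a minimal sketch of the deterministic solver follows; it uses SciPy's adaptive RK45 integrator as a stand-in for the fixed-step GSL routine used in the paper, and the specific initial condition in the usage line is arbitrary.

```python
# Sketch of the Kraichnan-Orszag three-mode system for one choice of the random inputs.
import numpy as np
from scipy.integrate import solve_ivp

def ko_rhs(t, y):
    y1, y2, y3 = y
    return [y1 * y3, -y2 * y3, -y1**2 + y2**2]

def ko_solve(xi1, xi2, t_grid):
    y0 = [1.0, 0.1 * xi1, xi2]                    # random initial conditions
    sol = solve_ivp(ko_rhs, (t_grid[0], t_grid[-1]), y0, t_eval=t_grid, rtol=1e-8)
    return sol.y.T                                # n_t x 3 response matrix

t_grid = np.linspace(0.0, 10.0, 20)               # 20 equidistant time steps on [0, 10]
Y = ko_solve(xi1=0.5, xi2=-0.3, t_grid=t_grid)
```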
As is apparent, the input variables $\xi$ represent the initial conditions. We will consider the case of two input dimensions, i.e., $k_\xi = 2$. The output consists of three distinct variables ($q = 3$) that are functions of time ($k_s = 0$). For convenience, we choose to work with a constant prior mean by selecting:

$$h_\xi(\xi) = 1 \quad \text{and} \quad h_t(t) = 1.$$

That is, $m_\xi = 1$, $m_t = 1$. We fix $n_\xi$ and gather the input data $\mathbf{X}_\xi \in \mathbb{R}^{n_\xi \times k_\xi}$ from a Latin hypercube design [35]. We solve the system for the time interval $[0, 10]$ and record the response at 20 equidistant time steps, i.e., $\mathbf{X}_t \in \mathbb{R}^{n_t}$ with $n_t = 20$. Both $\mathbf{X}_\xi$ and $\mathbf{X}_t$ are scaled to $[0, 1]$. The priors are specified by selecting:

$$c_a = 1/0.05 \quad \text{and} \quad \zeta_a = 10^6, \quad \text{for } a = \xi, t.$$

This means that we a priori assume that the mean of all length scales is 0.05 and the mean of all the nuggets is $10^{-6}$. We train our model for $n_\xi = 70$, 100 and 150 observations by sampling the posterior of $\theta = (r_\xi, g_\xi, r_t, g_t)$ given in Eq. (9), following the Gibbs–MCMC procedure described in Algorithm 1. To initialize the Markov chain, we sample the prior (Eqs. (14) and (15)) of the hyper-parameters 100 times and set $\theta_0$ equal to the sample with maximum posterior probability defined by Eq. (9). The proposals are selected to be log-normal random walks and the step size (the same for all types of inputs) is set to 0.01. The chain is well mixed after about 500 iterations of the Gibbs scheme.

After the Markov chain has been sufficiently mixed, we are ready to start making predictions. Predictions are made at 50 equidistant time steps in $[0, 10]$, i.e., $\mathbf{X}_t^* \in \mathbb{R}^{n_t^*}$ with $n_t^* = 50$. Then, we draw 100 samples from the posterior distribution of the statistics of interest, as described in Algorithm 2, with tolerance $\delta = 10^{-2}$. We plot the mean of the statistics as well as 95% error bars (2 times the standard deviation of the statistic). To calculate the mean of a sampled response surface, we use Eq. (23), while for the variance we use the diagonal of $C_d^{ii,*}$, $i = 1, \ldots, q$ (Eq. (26)).

[Fig. 6. Porous flow: mean of $u_x$. Subfigures (a)–(c) show the mean of the mean of $u_x$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the mean of $u_x$ for 120 observations. Finally, (e) shows the MC estimate of the same quantity using 108,000 observations.]

One- or two-dimensional probability densities for each sampled response surface are evaluated by the following MC procedure: (1) we draw 10,000 samples from $p(\xi)$; (2) we evaluate the sampled response (Eq. (20)) at each one of these $\xi$'s; (3) we use a one- or two-dimensional kernel density estimator [36] to approximate the desired PDF. The predicted means of all the statistics are practically identical to the ones obtained via a Monte Carlo estimate (not shown in the figures, see [17]). The first row of Fig. 1 shows the time evolution of the mean of $y_1(t)$ and $y_3(t)$ for $n_\xi = 100$. Notice that the error bars are very tight. The second and third rows of the same figure depict the variance of the same quantities for $n_\xi = 100$ and $n_\xi = 150$, respectively. We can see the width of the error bars decreasing as the number of observations is increased. Fig. 2 shows the time evolution of the probability density of $y_2(t)$. The four rows correspond to different time instants (specifically $t = 4, 6, 8$ and 10). The columns correspond to $n_\xi = 70$, 100 and 150, counting from the left. Fig. 3 shows the time evolution of the joint probability density of $y_2(t)$ and $y_3(t)$. The four rows correspond to different time instants (specifically $t = 4, 6, 8$ and 10). The columns correspond to $n_\xi = 70$, 100 and 150, counting from the left. The variance of the same joint probability density is shown in Fig. 4.
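The kernel-density step (1)–(3) above can be sketched as follows; the `surrogate_mean` callable stands in for $M_d^*(\xi)$ of Eq. (20), and the toy response and input density used in the usage lines are purely illustrative assumptions.

```python
# Sketch of the Monte Carlo / kernel-density step that turns one sampled surrogate into a PDF.
import numpy as np
from scipy.stats import gaussian_kde

def pdf_of_output(surrogate_mean, sample_p, output_index, n_mc=10000):
    xis = [sample_p() for _ in range(n_mc)]                          # (1) draw xi ~ p(xi)
    vals = np.array([surrogate_mean(x)[output_index] for x in xis])  # (2) evaluate the surrogate
    return gaussian_kde(vals)                                        # (3) 1D kernel density estimate

# usage with a toy "surrogate" and uniform inputs on [-1, 1]^2
rng = np.random.default_rng(0)
kde = pdf_of_output(lambda x: np.array([x[0] ** 2 + 0.1 * x[1]]),
                    lambda: rng.uniform(-1.0, 1.0, size=2), output_index=0)
grid = np.linspace(-0.5, 1.5, 200)
density = kde(grid)
```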
3.2. Flow through porous media

In this example, we study a two-dimensional, single-phase, steady-state flow through a random permeability field. A good review of the mathematical models of flow through porous media can be found in [37]. The spatial domain $\mathcal{X}_s$ is chosen to be the unit square $[0, 1]^2$, representing an idealized oil reservoir. Let us denote with $p$ and $\mathbf{u}$ the pressure and the velocity fields of the fluid, respectively. These are connected via Darcy's law:

$$\mathbf{u} = -K \nabla p, \quad \text{in } \mathcal{X}_s, \qquad (30)$$

[Fig. 7. Porous flow: mean of $u_y$. Subfigures (a)–(c) show the mean of the mean of $u_y$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the mean of $u_y$ for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 108,000 observations.]

where $K$ is the permeability tensor that models the ease with which the liquid flows through the reservoir. Combining Darcy's law with the continuity equation, it is easy to show that the governing PDE for the pressure is:

$$-\nabla \cdot (K \nabla p) = f, \quad \text{in } \mathcal{X}_s, \qquad (31)$$

where the source term $f$ may be used to model injection/production wells. We use two model square wells: an injection well on the left-bottom corner of $\mathcal{X}_s$ and a production well on the top-right corner. The particular mathematical form is as follows:

$$f(x_s) = \begin{cases} r, & \text{if } |x_{s,i} - \tfrac{1}{2}w| < \tfrac{1}{2}w, \text{ for } i = 1, 2, \\ -r, & \text{if } |x_{s,i} - 1 + \tfrac{1}{2}w| < \tfrac{1}{2}w, \text{ for } i = 1, 2, \\ 0, & \text{otherwise}, \end{cases} \qquad (32)$$

where $r$ specifies the rate of the wells and $w$ their size (chosen to be $r = 10$ and $w = 1/8$). Furthermore, we impose no-flux boundary conditions on the walls of the reservoir:

$$\mathbf{u} \cdot \hat{\mathbf{n}} = 0, \quad \text{on } \partial \mathcal{X}_s, \qquad (33)$$

where $\hat{\mathbf{n}}$ is the unit normal vector of the boundary. These boundary conditions specify the pressure $p$ only up to an additive constant. To assure uniqueness of the boundary value problem defined by Eqs. (30), (31) and (33), we impose the constraint [38]:

$$\int_{\mathcal{X}_s} p(x)\, dx = 0.$$
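A small sketch of the source term of Eq. (32) evaluated on a regular grid is given below; the grid resolution in the usage lines is arbitrary and only for illustration.

```python
# Sketch of the well source term of Eq. (32): injection at the bottom-left corner,
# production at the top-right corner of the unit square.
import numpy as np

def source(xs, r=10.0, w=1.0 / 8.0):
    xs = np.atleast_2d(xs)                                       # (n_s, 2) spatial points
    injection = np.all(np.abs(xs - 0.5 * w) < 0.5 * w, axis=1)
    production = np.all(np.abs(xs - (1.0 - 0.5 * w)) < 0.5 * w, axis=1)
    return r * injection.astype(float) - r * production.astype(float)

# example: evaluate f on a 32 x 32 grid
grid = np.linspace(0.0, 1.0, 32)
XS = np.array([[x, y] for x in grid for y in grid])
f_vals = source(XS)
```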

[Fig. 8. Porous flow: mean of $p$. Subfigures (a)–(c) show the mean of the mean of $p$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the mean of $p$ for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 108,000 observations.]

We restrict ourselves to an isotropic permeability tensor:

$$K_{ij} = K\, \delta_{ij}.$$

$K$ is modeled as

$$K(x_s) = \exp\{G(x_s)\},$$

where $G$ is a Gaussian random field:

$$G \sim \mathcal{N}\big(m, c_G(\cdot, \cdot)\big),$$

with constant mean $m$ and an exponential covariance function given by

$$c_G(x_{s,1}, x_{s,2}) = s_G^2 \exp\left\{ -\sum_{k=1}^{k_s} \frac{|x_{s,1,k} - x_{s,2,k}|}{\lambda_k} \right\}. \qquad (34)$$

The parameters $\lambda_k$ represent the correlation lengths of the field, while $s_G^2 > 0$ is its variance. The values we choose for the parameters are $m = 0$, $\lambda_k = 0.1$ and $s_G = 1$. In order to obtain a finite-dimensional representation of $G$, we employ the Karhunen–Loève expansion [39] and truncate it after $k_\xi = 50$ terms:

$$G(w; x_s) = m + \sum_{k=1}^{k_\xi} w_k\, \phi_k(x_s),$$

where $w = (w_1, \ldots, w_{k_\xi})$ is a vector of independent, zero-mean and unit-variance Gaussian random variables and $\phi_k(x_s)$ are the eigenfunctions of the exponential covariance given in Eq. (34) (suitably normalized, of course).

[Fig. 9. Porous flow: variance of $u_x$. Subfigures (a)–(c) show the mean of the variance of $u_x$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the variance of $u_x$ for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 108,000 observations.]

In order to guarantee the analytic calculation of the first-order statistics of $p$ and $\mathbf{u}$ of Section 2.3, we choose to work with the uniform random variables

$$\xi_k = \Phi(w_k) \sim \mathcal{U}(0, 1), \quad k = 1, \ldots, k_\xi,$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Putting it all together, the finite-dimensional stochastic representation of the permeability field is:

$$K(\xi; x_s) = \exp\left\{ m + \sum_{k=1}^{k_\xi} \Phi^{-1}(\xi_k)\, \phi_k(x_s) \right\}. \qquad (35)$$

In order to make the notational connection with the rest of the paper obvious, let us define the response of the physical model as

$$f : \mathcal{X}_\xi \times \mathcal{X}_s \to \mathbb{R}^q,$$

where, of course, $\mathcal{X}_\xi = [0, 1]^{k_\xi}$, $\mathcal{X}_s = [0, 1]^2$ and $q = 3$. That is,

$$f(\xi; x_s) = \big(p(\xi; x_s), \mathbf{u}(\xi; x_s)\big),$$

where $p(\xi; x_s)$ and $\mathbf{u}(\xi; x_s)$ are the solution of the boundary value problem defined by Eqs. (30), (31) and (33) at the spatial point $x_s$ for a permeability field given by Eq. (35). Our purpose is to learn this map and also to propagate the uncertainty of the stochastic variables through it by using a finite number of simulations.

The boundary value problem is solved using the mixed finite element formulation. We use first-order Raviart–Thomas elements for the velocity [40] and zero-order discontinuous elements for the pressure [41]. The spatial domain is discretized using a 64 × 64 triangular mesh.
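The permeability parameterization of Eq. (35) can be sketched as follows; the eigenfunctions are an assumption here (the paper computes them with Stokhos/Trilinos), so a generic callable filled with made-up smooth functions is used instead.

```python
# Sketch of Eq. (35): sampling the log-permeability from the truncated KL expansion with
# uniformly distributed inputs xi_k.
import numpy as np
from scipy.stats import norm

def permeability(xi, xs, phi, mean=0.0):
    """xi: (k_xi,) uniforms in (0, 1); xs: (n_s, 2) points; phi: callable -> (n_s, k_xi)."""
    w = norm.ppf(xi)                              # w_k = Phi^{-1}(xi_k), standard normal
    G = mean + phi(xs) @ w                        # truncated Karhunen-Loeve expansion
    return np.exp(G)                              # K(xi; x_s) = exp{G(x_s)}

# example with made-up smooth "eigenfunctions" on a coarse grid (purely illustrative)
xs = np.array([[x, y] for x in np.linspace(0.0, 1.0, 8) for y in np.linspace(0.0, 1.0, 8)])
phi = lambda xs: np.column_stack(
    [np.cos(np.pi * k * xs[:, 0]) * np.cos(np.pi * k * xs[:, 1]) / (k + 1) for k in range(50)])
K = permeability(np.random.rand(50), xs, phi)
```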

[Fig. 10. Porous flow: variance of $u_y$. Subfigures (a)–(c) show the mean of the variance of $u_y$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the variance of $u_y$ for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 108,000 observations.]

The solver was implemented using the DOLFIN C++ library [42]. The eigenfunctions of the exponential random field used to model the permeability were calculated via Stokhos, which is part of Trilinos [43].

For each stochastic input $\xi$, the response is observed on a regular 32 × 32 square spatial grid. Because of the regular nature of the spatial grid, as well as the separable nature of the Squared Exponential correlation function we use, it can be easily shown that the 1024 × 1024 spatial covariance matrix $A_s$ can be written as

$$A_s = A_{s,1} \otimes A_{s,2},$$

where $A_{s,i}$, $i = 1, 2$, are 32 × 32 covariance matrices pertaining to the horizontal and vertical spatial directions, respectively. Of course, it is vital to make use of this simplification. The data collected this way are used to train a 3-dimensional Gaussian process, which is then used to make predictions on the same spatial grid. We train our model using, in sequence, 24, 64 and 120 observations of the deterministic solver, in which the stochastic inputs are selected from a Latin hypercube design. The prior hyper-parameters are set to:

$$c_\xi = 1/3, \quad c_s = 1/0.01, \quad \zeta_a = 10^2, \quad \text{for } a = \xi, s.$$

The initial values of the hyper-parameters used to start the Gibbs procedure are chosen to be the means of the priors. For each training set, we sample the posterior of the hyper-parameters 100,000 times (see Fig. 5 for a representative example). Then, we draw 100 sample surrogates as described in Algorithm 3. For each sampled surrogate, we calculate the statistics of interest. Finally, we compute and report the mean and the standard deviation of these statistics. The results are compared to Monte Carlo estimates.

[Fig. 11. Porous flow: variance of $p$. Subfigures (a)–(c) show the mean of the variance of $p$ for 24, 64 and 120 observations, respectively. Subfigure (d) plots two standard deviations of the variance of $p$ for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 108,000 observations.]

Fig. 6 compares the mean of the mean of $u_x$ to a Monte Carlo estimate using 108,000 observations. Two standard deviations of the mean of $u_x$ for the case of 120 observations are shown in subfigure (d). The same statistic for $u_y$ and $p$ is reported in Figs. 7 and 8, respectively. Fig. 9 compares the mean of the variance of $u_x$ to a Monte Carlo estimate using 108,000 observations. Two standard deviations of the variance of $u_x$ for the case of 120 observations are shown in subfigure (d). The same statistic for $u_y$ and $p$ is reported in Figs. 10 and 11, respectively. We observe, especially for the cases of 24 and 64 observations, that the variance is underestimated. Of course, this is to be expected given the very limited set of data available. Fortunately, the error bars seem to compensate for this underestimation, with the exception of the case that corresponds to the variance of the pressure $p$.

Fig. 12 depicts the predicted probability densities of $u_x(0.5, 0.5)$, along with their error bars, for all available training sets. We see that the tails of the probability density are not estimated correctly. In particular, we observe two different types of potential problems. Firstly, the left hand side puts too much weight on negative values of $u_x(0.5, 0.5)$, even though it is quite clear (see subfigure (d)) that $u_x$ is always positive at that particular spatial location. However, our prior assumption is that the response is a sample from a Gaussian random field. Hence, negative values of $u_x(0.5, 0.5)$ are very plausible. The model can correct this belief only by observing an adequate number of data points. It is a fact that all 120 observations of $u_x$ near $(0.5, 0.5)$ are strictly positive. However, these observations are not enough to change the prior belief that $u_x(0.5, 0.5)$ might also take negative values.

[Fig. 12. Porous flow: the PDF of $u_x(0.5, 0.5)$. The blue lines show the average PDF over 100 sampled surrogates for the cases of 24 (a), 64 (b) and 120 (c) observations. The filled gray area corresponds to two standard deviations of the PDFs about the mean PDF. The solid red line of (d) is the Monte Carlo estimate using 10,000 observations.]

[Fig. 13. Porous flow: the PDF of $u_x(0.25, 0.25)$. The blue lines show the average PDF over 100 sampled surrogates for the cases of 24 (a), 64 (b) and 120 (c) observations. The filled gray area corresponds to two standard deviations of the PDFs about the mean PDF. The solid red line of (d) is the Monte Carlo estimate using 10,000 observations.]

Notice, though, that as we go from (a) to (c), the trend is gradually corrected. If one wanted to incorporate the fact that a quantity of interest is strictly positive, then it is usually recommended to observe the logarithm of the quantity instead. Let us now turn to the second problem, which has to do with the underestimation of the right tail of the distribution.

[Fig. 14. Porous flow: the PDF of $p(0.5, 0.5)$. The blue lines show the average PDF over 100 sampled surrogates for the cases of 24 (a), 64 (b) and 120 (c) observations. The filled gray area corresponds to two standard deviations of the PDFs about the mean PDF. The solid red line of (d) is the Monte Carlo estimate using 10,000 observations.]

[Fig. 15. Porous flow: the PDF of $p(0.25, 0.25)$. The blue lines show the average PDF over 100 sampled surrogates for the cases of 24 (a), 64 (b) and 120 (c) observations. The filled gray area corresponds to two standard deviations of the PDFs about the mean PDF. The solid red line of (d) is the Monte Carlo estimate using 10,000 observations.]

distribution. Let us start by noticing that cases (a) and (b) put enough weight on it. The reason for this is not that there are
observations close to this region. It is again a consequence of the Gaussian assumption, just like in the rst problem we
discussed. However, for the case of 120 observations, we see that the model signicantly underestimates the right tail.


Fig. 16. Porous flow: the joint PDF of ux(0.5, 0.5) and uy(0.5, 0.5). The contours show the average joint PDF over 100 sampled surrogates for the cases of 24 (a), 64 (b) and 120 (c) observations, respectively. Subfigure (d) plots two standard deviations of the joint PDF for 120 observations. Finally, subfigure (e) shows the MC estimate of the same quantity using 10,000 observations.

The reason, of course, is that there is not a single observation in the training set that takes values close to that region. One cannot possibly expect to capture a long tail without observing any events on it. The remedy here is a smarter choice of the observations, along the lines of the active learning techniques that we have investigated elsewhere [17]. Needless to say, if the purpose of the practitioner is the investigation of improbable events, then she should favor active learning schemes that have a bias for extreme values. This is clearly beyond the scope of the present work. Figs. 13-15 show the predicted probability densities of ux(0.25, 0.25), p(0.5, 0.5) and p(0.25, 0.25), respectively. The same comments as for the ux(0.5, 0.5) case are also applicable here. Finally, the joint probability density of ux(0.5, 0.5) and uy(0.5, 0.5) is shown in Fig. 16. We observe again the underestimation of the top-right long tail of the distribution and the broadening that occurs close to (0, 0).

4. Conclusions
We developed a multi-output Gaussian process model that explicitly models the linear part of the correlations between distinct outputs as well as space and/or time. By exploiting the static nature of the spatial/time inputs as well as the special nature of separable covariance functions, we were able to express all quantities required for inference and prediction in terms of Kronecker products. This led to highly efficient computations both in terms of memory and CPU time. We recognized that the posterior predictive distribution of the Gaussian process defines a probability measure on the function space of possible surrogates and we described an approximate method that yields kernel-based analytic sample surrogates. The scheme was applied to uncertainty quantification tasks in which we were able to quantify the epistemic uncertainty induced by the limited number of observations.


Despite the successes, we observe certain aspects that require further investigation. Firstly, we noticed a systematic underestimation of the tails of the predicted probability densities. Of course, this is expected in a limited-observations setting. However, we are confident that the model can be considerably improved, without losing efficiency, in several different ways. To start with, in the flow through porous media example, we can see that the assumption of stationarity in space is wrong. It is evident that the velocities vary more close to the wells than they do away from them. The stationary covariance in space forces the model to make a compromise in the spatial length scales: on one hand, regions close to the wells seem smoother than necessary while, on the other hand, regions away from them are wavier. Hence, we expect that using a non-stationary covariance or a tree-based model will improve the situation significantly [17]. Furthermore, it would be very interesting to see how the results would change if a sequential active learning approach were followed for the collection of the observations. It seems plausible that the most effective way to improve the tails of the distributions would be to select observations with extreme properties. A simple variance-based active learning scheme seems inadequate (of course, here we are talking about the case in which the observations are kept to a very small number). The particulars of an alternative are a very interesting research topic.
Acknowledgements
The research at Cornell was supported by an OSD/AFOSR MURI09 award on uncertainty quantification, the US Department of Energy, Office of Science, Advanced Scientific Computing Research and the Computational Mathematics program of the National Science Foundation (NSF) (award DMS-1214282). The research at Pacific Northwest National Laboratory (PNNL) was supported by the Applied Mathematics program of the US DOE Office of Advanced Scientific Computing Research. PNNL is operated by Battelle for the US Department of Energy under Contract DE-AC05-76RL01830. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Additional computing resources were provided by the NSF through TeraGrid resources provided by NCSA under Grant No. TG-DMS090007.
Appendix A. Kronecker product properties
A.1. Calculating matrix-vector and matrix-matrix products

A.1.1. Matrix-vector product
Let $A \in \mathbb{R}^{m_1 \times n_1}$, $B \in \mathbb{R}^{m_2 \times n_2}$ and $x \in \mathbb{R}^{n_1 n_2}$. We wish to calculate

$$y = (A \otimes B)x \in \mathbb{R}^{m_1 m_2},$$

without explicitly forming the Kronecker product. This may be easily achieved by exploiting the properties of the vectorization operation $\mathrm{vec}(\cdot)$ [22]. Let $X \in \mathbb{R}^{n_2 \times n_1}$ be the matrix formed by folding the vector $x$ so that $x = \mathrm{vec}(X)$. Then we obtain

$$y = \mathrm{vec}(B X A^T). \qquad (A.1)$$

So all we need to do is two matrix multiplications. Notice that for the case of triangular $A$ and $B$ no additional memory is required.
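The identity (A.1) is easy to verify numerically. The following is a minimal NumPy sketch (the sizes are illustrative and the column-major reshaping matches the column-stacking vec(·) convention used here), comparing the fast computation against the explicitly formed Kronecker product:

import numpy as np

# Illustrative sizes only.
m1, n1, m2, n2 = 3, 4, 5, 6
rng = np.random.default_rng(0)
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))
x = rng.standard_normal(n1 * n2)

# Fold x into X so that x = vec(X), with X of shape (n2, n1) and vec stacking columns.
X = x.reshape((n2, n1), order='F')
y_fast = (B @ X @ A.T).reshape(-1, order='F')   # Eq. (A.1): y = vec(B X A^T)

# Direct (memory-hungry) check against the explicit Kronecker product.
y_direct = np.kron(A, B) @ x
assert np.allclose(y_fast, y_direct)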
A.1.2. Matrix-matrix product
Let $A$ and $B$ be as before and $X \in \mathbb{R}^{n_1 n_2 \times s}$. We wish to calculate

$$Y = (A \otimes B)X \in \mathbb{R}^{m_1 m_2 \times s}.$$

This can be trivially calculated by working column by column using the results of the previous subsection.
A.1.3. Three Kronecker products
Let $C \in \mathbb{R}^{m_3 \times n_3}$ and $x \in \mathbb{R}^{n_1 n_2 n_3}$. Then the product

$$y = (A \otimes B \otimes C)x \in \mathbb{R}^{m_1 m_2 m_3}$$

can be calculated by observing that

$$y = \mathrm{vec}\big(C X (A \otimes B)^T\big),$$

where $X \in \mathbb{R}^{n_3 \times n_1 n_2}$ is such that $x = \mathrm{vec}(X)$. To simplify the expression inside the vectorization operator, let $Z = (A \otimes B)X^T \in \mathbb{R}^{m_1 m_2 \times n_3}$. This matrix can be calculated using the procedure described in the previous subsection. Finally, we obtain

$$y = \mathrm{vec}(C Z^T). \qquad (A.2)$$

Again, notice that if all matrices are triangular, all operations can be performed without additional memory.
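A minimal NumPy sketch of the three-factor recipe (again with illustrative sizes): the two-factor trick of Eq. (A.1) is applied column by column to form Z = (A ⊗ B)X^T, and Eq. (A.2) then gives y.

import numpy as np

m1, n1, m2, n2, m3, n3 = 2, 3, 4, 2, 3, 5
rng = np.random.default_rng(1)
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))
C = rng.standard_normal((m3, n3))
x = rng.standard_normal(n1 * n2 * n3)

# Fold x so that x = vec(X), with X of shape (n3, n1*n2).
X = x.reshape((n3, n1 * n2), order='F')

# Z = (A ⊗ B) X^T, one column at a time via Eq. (A.1); rows of X are the columns of X^T.
Z = np.column_stack([
    (B @ v.reshape((n2, n1), order='F') @ A.T).reshape(-1, order='F') for v in X
])
y_fast = (C @ Z.T).reshape(-1, order='F')       # Eq. (A.2): y = vec(C Z^T)

# Direct check against the explicit triple Kronecker product.
assert np.allclose(y_fast, np.kron(np.kron(A, B), C) @ x)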


A.2. Solving linear systems

Now let $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{n \times n}$ and $y \in \mathbb{R}^{mn}$. We wish to solve the linear system

$$(A \otimes B)x = y$$

for $x \in \mathbb{R}^{mn}$. Let $X \in \mathbb{R}^{n \times m}$ and $Y \in \mathbb{R}^{n \times m}$ be such that $x = \mathrm{vec}(X)$ and $y = \mathrm{vec}(Y)$, respectively. Using, again, the properties of the vectorization operator, we obtain

$$B X A^T = Y.$$

Therefore, we can find $X$ by solving two linear systems:

$$B Z = Y, \qquad A X^T = Z^T. \qquad (A.3)$$

If $A$ and $B$ are triangular matrices, then no additional memory is required.
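A minimal NumPy sketch of this two-step solve (sizes are illustrative; generic dense solves are used here, whereas in the applications above A and B would be triangular Cholesky factors):

import numpy as np

m, n = 3, 4
rng = np.random.default_rng(2)
A = rng.standard_normal((m, m)) + m * np.eye(m)   # nonsingular stand-ins for A and B
B = rng.standard_normal((n, n)) + n * np.eye(n)
y = rng.standard_normal(m * n)

Y = y.reshape((n, m), order='F')        # y = vec(Y)
Z = np.linalg.solve(B, Y)               # B Z = Y
X = np.linalg.solve(A, Z.T).T           # A X^T = Z^T
x = X.reshape(-1, order='F')

# Direct check against the explicit Kronecker system.
assert np.allclose(np.kron(A, B) @ x, y)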
Finally, let $C \in \mathbb{R}^{s \times s}$ be another matrix and $y \in \mathbb{R}^{nms}$. We wish to solve the linear system

$$(A \otimes B \otimes C)x = y$$

for $x \in \mathbb{R}^{nms}$. Let $X \in \mathbb{R}^{s \times mn}$ and $Y \in \mathbb{R}^{s \times mn}$ be such that $x = \mathrm{vec}(X)$ and $y = \mathrm{vec}(Y)$, respectively. Then

$$C X (A \otimes B)^T = Y.$$

We start by solving the system

$$C Z = Y$$

and then solve the system

$$(A \otimes B) X^T = Z^T,$$

using the results of the previous paragraph on each of the rows of $X$ and $Z$.
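A similar sketch for the three-factor solve (sizes illustrative only), composing the two steps above:

import numpy as np

m, n, s = 2, 3, 4
rng = np.random.default_rng(3)
A = rng.standard_normal((m, m)) + m * np.eye(m)
B = rng.standard_normal((n, n)) + n * np.eye(n)
C = rng.standard_normal((s, s)) + s * np.eye(s)
y = rng.standard_normal(m * n * s)

Y = y.reshape((s, m * n), order='F')    # y = vec(Y)
Z = np.linalg.solve(C, Y)               # C Z = Y

def kron2_solve(A, B, rhs):
    # Solve (A ⊗ B) v = rhs with the two-step recipe of Eq. (A.3).
    R = rhs.reshape((B.shape[0], A.shape[0]), order='F')
    W = np.linalg.solve(B, R)
    return np.linalg.solve(A, W.T).T.reshape(-1, order='F')

# (A ⊗ B) X^T = Z^T, one row of Z (= column of Z^T) at a time.
Xt = np.column_stack([kron2_solve(A, B, row) for row in Z])
x = Xt.T.reshape(-1, order='F')

# Direct check against the explicit triple Kronecker system.
assert np.allclose(np.kron(np.kron(A, B), C) @ x, y)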
Appendix B. Implementation details
Given a set of hyper-parameters h, the various statistics may be evaluated efficiently in the following sequence:

1. Compute the Cholesky decomposition of all covariance matrices:
$$A_n = L_n L_n^T, \qquad A_s = L_s L_s^T, \qquad A_t = L_t L_t^T.$$

2. Scale the outputs by solving in place the linear system:
$$(L_n \otimes L_s \otimes L_t)\tilde{Y} = Y.$$

3. Scale the design matrices by solving in place the linear systems:
$$L_n \tilde{H}_n = H_n, \qquad L_s \tilde{H}_s = H_s, \qquad L_t \tilde{H}_t = H_t.$$

4. Calculate the QR factorizations of the scaled design matrices:
$$\tilde{H}_n = Q_n R_n, \quad Q_n = [Q_{n,1} \; Q_{n,2}], \quad R_n = [R_{n,1}^T \; 0]^T,$$
$$\tilde{H}_s = Q_s R_s, \quad Q_s = [Q_{s,1} \; Q_{s,2}], \quad R_s = [R_{s,1}^T \; 0]^T,$$
$$\tilde{H}_t = Q_t R_t, \quad Q_t = [Q_{t,1} \; Q_{t,2}], \quad R_t = [R_{t,1}^T \; 0]^T,$$
where, for $a = n, s, t$, $Q_{a,1} \in \mathbb{R}^{n_a \times m_a}$, $Q_{a,2} \in \mathbb{R}^{n_a \times (n_a - m_a)}$ and $R_{a,1} \in \mathbb{R}^{m_a \times m_a}$ is upper triangular.

5. Find $\hat{B}$ by solving in place the upper triangular system:
$$(R_{n,1} \otimes R_{s,1} \otimes R_{t,1})\hat{B} = (Q_{n,1} \otimes Q_{s,1} \otimes Q_{t,1})^T \tilde{Y}. \qquad (A.4)$$

6. Calculate (by doing $n$ rank-1 updates) $\hat{R}$:
$$\hat{R} = \frac{1}{n - m}\left[\tilde{Y} - (\tilde{H}_n \otimes \tilde{H}_s \otimes \tilde{H}_t)\hat{B}\right]^T\left[\tilde{Y} - (\tilde{H}_n \otimes \tilde{H}_s \otimes \tilde{H}_t)\hat{B}\right].$$

7. Calculate the Cholesky decomposition of $\hat{R}$:
$$\hat{R} = L_R L_R^T.$$
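The sequence of steps 1-5 can be checked numerically. The following NumPy sketch uses small random stand-ins for the covariance and design matrices (all names and sizes here are illustrative, not the quantities of the actual model) and verifies, under the assumption that B-hat coincides with the generalized least-squares estimate, that the Kronecker-structured recipe reproduces the direct formula B = (H^T A^{-1} H)^{-1} H^T A^{-1} Y:

import numpy as np
from scipy.linalg import cholesky, solve_triangular, qr

rng = np.random.default_rng(1)

def random_spd(k):
    # Random symmetric positive definite stand-in for a covariance matrix.
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

nn, ns, nt = 4, 3, 2          # sizes of the three covariance factors (illustrative)
mn, ms, mt = 2, 2, 1          # numbers of basis functions per factor (illustrative)
q = 2                         # number of outputs (illustrative)
A = {a: random_spd(k) for a, k in zip("nst", (nn, ns, nt))}
H = {a: rng.standard_normal((k, m)) for a, k, m in zip("nst", (nn, ns, nt), (mn, ms, mt))}
Y = rng.standard_normal((nn * ns * nt, q))

# Steps 1-4: Cholesky factors, whitened outputs/designs and their thin QR factorizations.
L = {a: cholesky(A[a], lower=True) for a in "nst"}
Yt = solve_triangular(np.kron(np.kron(L["n"], L["s"]), L["t"]), Y, lower=True)
Ht = {a: solve_triangular(L[a], H[a], lower=True) for a in "nst"}
Q1, R1 = {}, {}
for a in "nst":
    Q1[a], R1[a] = qr(Ht[a], mode="economic")

# Step 5: solve the small upper triangular Kronecker system for B_hat.
rhs = np.kron(np.kron(Q1["n"], Q1["s"]), Q1["t"]).T @ Yt
B_hat = solve_triangular(np.kron(np.kron(R1["n"], R1["s"]), R1["t"]), rhs, lower=False)

# Direct check against the generalized least-squares formula (memory-hungry on purpose).
Hbig = np.kron(np.kron(H["n"], H["s"]), H["t"])
Abig = np.kron(np.kron(A["n"], A["s"]), A["t"])
B_gls = np.linalg.solve(Hbig.T @ np.linalg.solve(Abig, Hbig), Hbig.T @ np.linalg.solve(Abig, Y))
assert np.allclose(B_hat, B_gls)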
8. Now we can evaluate all the determinants involved in the posterior of h:

$$\log|A_n| = 2\sum_{i=1}^{n_n}\log L_{n,ii}, \qquad \log|A_s| = 2\sum_{i=1}^{n_s}\log L_{s,ii}, \qquad \log|A_t| = 2\sum_{i=1}^{n_t}\log L_{t,ii},$$

$$\log|H_n^T A_n^{-1} H_n| = 2\sum_{i=1}^{m_n}\log|R_{n,1,ii}|, \qquad \log|H_s^T A_s^{-1} H_s| = 2\sum_{i=1}^{m_s}\log|R_{s,1,ii}|, \qquad \log|H_t^T A_t^{-1} H_t| = 2\sum_{i=1}^{m_t}\log|R_{t,1,ii}|,$$

$$\log|\hat{R}| = 2\sum_{i=1}^{q}\log L_{R,ii},$$

$$\log|A| = \frac{n}{n_n}\log|A_n| + \frac{n}{n_s}\log|A_s| + \frac{n}{n_t}\log|A_t|,$$

$$\log|H^T A^{-1} H| = \frac{m}{m_n}\log|H_n^T A_n^{-1} H_n| + \frac{m}{m_s}\log|H_s^T A_s^{-1} H_s| + \frac{m}{m_t}\log|H_t^T A_t^{-1} H_t|.$$
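The determinant identities of step 8 follow from the Kronecker structure alone and are easy to check; a minimal NumPy sketch with illustrative sizes:

import numpy as np

rng = np.random.default_rng(2)

def random_spd(k):
    # Random symmetric positive definite stand-in for a covariance matrix.
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

nn, ns, nt = 4, 3, 2
A = {a: random_spd(k) for a, k in zip("nst", (nn, ns, nt))}
L = {a: np.linalg.cholesky(A[a]) for a in "nst"}

# log|A_a| = 2 * sum_i log L_{a,ii}
logdet = {a: 2.0 * np.sum(np.log(np.diag(L[a]))) for a in "nst"}

# log|A| = (n/n_n) log|A_n| + (n/n_s) log|A_s| + (n/n_t) log|A_t|, with n = n_n n_s n_t.
n = nn * ns * nt
logdet_A = (n / nn) * logdet["n"] + (n / ns) * logdet["s"] + (n / nt) * logdet["t"]

# Direct (memory-hungry) check against the explicit Kronecker product.
Abig = np.kron(np.kron(A["n"], A["s"]), A["t"])
sign, logdet_direct = np.linalg.slogdet(Abig)
assert sign > 0 and np.isclose(logdet_A, logdet_direct)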

Appendix C. Fast Cholesky updates


This section is concerned with updating the Cholesky decomposition of a covariance matrix in O(n²) time when a new data point is observed. The updates are useful on two occasions:
1. When we are doing active learning without updating the hyper-parameters.
2. When we wish to sample the joint distribution sequentially in order to obtain a response surface (see Section 2.3).
Let $A_n \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix and assume that we have already calculated its Cholesky decomposition $L_n \in \mathbb{R}^{n \times n}$ (lower triangular). Now let $A_{n+m} \in \mathbb{R}^{(n+m) \times (n+m)}$ be another symmetric positive definite matrix (e.g. the one we obtain if we observe $m$ new data points). In particular, let it be given by

$$A_{n+m} = \begin{pmatrix} A_n & B \\ B^T & C \end{pmatrix},$$

where $B \in \mathbb{R}^{n \times m}$ (e.g. cross covariance) and $C \in \mathbb{R}^{m \times m}$ (e.g. covariance matrix of the new data). Let $L_{n+m} \in \mathbb{R}^{(n+m) \times (n+m)}$ be the lower triangular Cholesky factor of $A_{n+m}$ (i.e. $A_{n+m} = L_{n+m}L_{n+m}^T$). It is convenient to represent it in the matrix block form

$$L_{n+m} = \begin{pmatrix} D_{11} & 0_{n \times m} \\ D_{21} & D_{22} \end{pmatrix},$$

where $D_{11} \in \mathbb{R}^{n \times n}$, $D_{21} \in \mathbb{R}^{m \times n}$ and $D_{22} \in \mathbb{R}^{m \times m}$. We will derive formulas for the efficient calculation of the $D_{ij}$ based on the Cholesky decomposition of $A_n$. Notice that

$$A_{n+m} = L_{n+m}L_{n+m}^T \;\Longrightarrow\; \begin{pmatrix} A_n & B \\ B^T & C \end{pmatrix} = \begin{pmatrix} D_{11} & 0_{n \times m} \\ D_{21} & D_{22} \end{pmatrix}\begin{pmatrix} D_{11} & 0_{n \times m} \\ D_{21} & D_{22} \end{pmatrix}^T = \begin{pmatrix} D_{11}D_{11}^T & D_{11}D_{21}^T \\ D_{21}D_{11}^T & D_{21}D_{21}^T + D_{22}D_{22}^T \end{pmatrix}.$$

From the above equation, we see right away that $D_{11} = L_n$. $D_{21}$ can be found by solving the triangular system $L_n D_{21}^T = B$ and, finally, $D_{22}$ is the lower triangular Cholesky factor of $C - D_{21}D_{21}^T$. To wrap it up, here is how the update should be performed:
1. Set $D_{11} = L_n$.
2. Solve the following system for $D_{21}$:
$$L_n D_{21}^T = B.$$
3. Compute the Cholesky decomposition of
$$D_{22}D_{22}^T = C - D_{21}D_{21}^T$$
to find $D_{22}$.
For the special (but common in practice) case where $m = 1$, $D_{21}$ is a vector and $C$ and $D_{22}$ are numbers, so step 3 can be replaced by

$$D_{22} = \sqrt{C - D_{21}D_{21}^T}.$$

We can also update solutions of linear systems. Suppose that we have already solved the system

$$L_n x_n = y_n$$

and, after observing $m$ more data points, we need to solve

$$L_{n+m} x_{n+m} = y_{n+m}.$$

If we let

$$y_{n+m} = \begin{pmatrix} y_n \\ z \end{pmatrix},$$

it is trivial to show that $x_{n+m}$ is given by

$$x_{n+m} = \begin{pmatrix} x_n \\ D_{22}^{-1}(z - D_{21}x_n) \end{pmatrix}.$$
References
[1] I. Babuška, F. Nobile, R. Tempone, A stochastic collocation method for elliptic partial differential equations with random input data, SIAM Journal of Numerical Analysis 45 (3) (2007) 1005-1034.
[2] D. Xiu, G.E. Karniadakis, The Wiener-Askey polynomial chaos for stochastic differential equations, Journal of Scientific Computing (24) (2002) 619-644.
[3] S.A. Smolyak, Quadrature and interpolation formulas for tensor products of certain classes of functions, in: Dokl. Akad. Nauk SSSR, vol. 4, 1963, p. 123.
[4] D. Xiu, J.S. Hesthaven, High-order collocation methods for differential equations with random inputs, SIAM Journal on Scientific Computing 27 (3) (2005) 1118-1139.
[5] D. Xiu, Efficient collocational approach for parametric uncertainty analysis, Communications in Computational Physics 2 (2) (2007) 293-309.
[6] F. Nobile, R. Tempone, C. Webster, A sparse grid collocation method for elliptic partial differential equations with random input data, SIAM Journal of Numerical Analysis 45 (2008) 2309-2345.
[7] X. Ma, N. Zabaras, An adaptive hierarchical sparse grid collocation algorithm for the solution of stochastic differential equations, Journal of Computational Physics 228 (8) (2009) 3084-3113.
[8] C. Currin, T. Mitchell, M. Morris, D. Ylvisaker, A Bayesian approach to the design and analysis of computer experiments, Tech. Rep. ORNL-6498, Oak Ridge Laboratory, 1988.
[9] J. Sacks, W.J. Welch, T.J. Mitchell, H.P. Wynn, Design and analysis of computer experiments, Statistical Science 4 (4) (1989) 409-435.
[10] C. Currin, T. Mitchell, M. Morris, D. Ylvisaker, Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments, Journal of the American Statistical Association 86 (416) (1991) 953-963.
[11] W.J. Welch, R.J. Buck, J. Sacks, H.P. Wynn, T.J. Mitchell, M.D. Morris, Screening, predicting, and computer experiments, Technometrics 34 (1) (1992) 15-25.
[12] A. O'Hagan, M.C. Kennedy, J.E. Oakley, Uncertainty analysis and other inference tools for complex computer codes, in: J.M. Bernardo et al. (Eds.), Bayesian Statistics 6, Oxford University Press, 1999, pp. 503-524.
[13] J. Oakley, A. O'Hagan, Bayesian inference for the uncertainty distribution of computer model outputs, Biometrika 89 (4) (2002) 769-784.
[14] M.C. Kennedy, A. O'Hagan, Bayesian calibration of computer models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (3) (2001) 425-464.
[15] D. Higdon, J. Gattiker, B. Williams, M. Rightley, Computer model calibration using high-dimensional output, Journal of the American Statistical Association 103 (482) (2008) 570-583.
[16] R.B. Gramacy, H.K.H. Lee, Bayesian treed Gaussian process models with an application to computer modeling, Journal of the American Statistical Association 103 (483) (2008) 1119-1130.
[17] I. Bilionis, N. Zabaras, Multi-output local Gaussian process regression: applications to uncertainty quantification, Journal of Computational Physics 231 (17) (2012) 5718-5746.
[18] S. Conti, A. O'Hagan, Bayesian emulation of complex multi-output and dynamic computer models, Journal of Statistical Planning and Inference 140 (3) (2010) 640-651.
[19] M.D. McKay, R.J. Beckman, W.J. Conover, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21 (2) (1979) 239-245.
[20] D.J. MacKay, Information-based objective functions for active data selection, Neural Computation 4 (1992) 590-604.
[21] A.P. Dawid, Some matrix-variate distribution theory: notational considerations and a Bayesian application, Biometrika 68 (1) (1981) 265-274.
[22] J. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, Wiley Series in Probability and Statistics, John Wiley, 1999.
[23] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[24] M. Goulard, M. Voltz, Linear coregionalization model: tools for estimation and choice of cross-variogram matrix, Mathematical Geology 24 (1992) 269-286.
[25] G. Bourgault, D. Marcotte, Multivariable variogram and its application to the linear model of coregionalization, Mathematical Geology 23 (1991) 899-928.
[26] S. De Iaco, D. Myers, D. Posa, The linear coregionalization model and the product-sum space-time variogram, Mathematical Geology 35 (2003) 25-38.
[27] C.F. van Loan, The ubiquitous Kronecker product, Journal of Computational and Applied Mathematics 123 (1-2) (2000) 85-100.
[28] G. Casella, E.I. George, Explaining the Gibbs sampler, The American Statistician 46 (3) (1992) 167-174.
[29] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (6) (1953) 1087-1092.
[30] W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1) (1970) 97-109.
[31] R.B. Gramacy, H.K. Lee, Cases for the nugget in modeling computer experiments, Statistics and Computing 22 (3) (2012) 713-722.
[32] H. Harbrecht, M. Peters, R. Schneider, On the low-rank approximation by the pivoted Cholesky decomposition, Applied Numerical Mathematics 62 (4) (2012) 428-440.
[33] X. Wan, G.E. Karniadakis, An adaptive multi-element generalized polynomial chaos method for stochastic differential equations, Journal of Computational Physics 209 (2005) 617-642.
[34] M. Galassi, J. Davies, J. Theiler, B. Gough, G. Jungman, P. Alken, M. Booth, F. Rossi, GNU Scientific Library Reference Manual, 2009.
[35] M.D. McKay, R.J. Beckman, W.J. Conover, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 42 (1) (2000) 202-208.
[36] A.W. Bowman, A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, Oxford Statistical Science Series, Oxford University Press, USA, 1997.
[37] J.E. Aarnes, V. Kippe, K.-A. Lie, A.B. Rustad, Modelling of multiscale structures in flow simulations for petroleum reservoirs, in: G. Hasle, K.-A. Lie, E. Quak (Eds.), Geometric Modelling, Numerical Simulation, and Optimization, Springer, Berlin, Heidelberg, 2007, pp. 307-360 (Ch. 10).
[38] P. Bochev, R.B. Lehoucq, On finite element solution of the pure Neumann problem, SIAM Review 47 (2001) 50-66.
[39] R.G. Ghanem, P.D. Spanos, Stochastic Finite Elements: A Spectral Approach, Springer-Verlag, New York, 1991.
[40] P. Raviart, J. Thomas, A mixed finite element method for 2nd order elliptic problems, in: I. Galligani, E. Magenes (Eds.), Mathematical Aspects of Finite Element Methods, Lecture Notes in Mathematics, vol. 606, Springer, Berlin/Heidelberg, 1977, pp. 292-315 (Ch. 19).
[41] F. Brezzi, T.J.R. Hughes, L.D. Marini, A. Masud, Mixed discontinuous Galerkin methods for Darcy flow, Journal of Scientific Computing 22-23 (1) (2005) 119-145.
[42] A. Logg, G.N. Wells, J. Hake, DOLFIN: a C++/Python Finite Element Library, Springer, 2012 (Ch. 10).
[43] M.A. Heroux, R.A. Bartlett, V.E. Howle, R.J. Hoekstra, J.J. Hu, T.G. Kolda, R.B. Lehoucq, K.R. Long, R.P. Pawlowski, E.T. Phipps, A.G. Salinger, H.K. Thornquist, R.S. Tuminaro, J.M. Willenbring, A. Williams, K.S. Stanley, An overview of the Trilinos project, ACM Transactions on Mathematical Software 31 (3) (2005) 397-423.
