Wilson2020 Part1
Andrew Gordon Wilson
https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University
Gaussian processes
Definition
A Gaussian process (GP) is a collection of random variables, any finite number of
which have a joint Gaussian distribution.
Nonparametric Regression Model
▶ Prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, meaning $(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}(\mu, K)$,
with $\mu_i = m(x_i)$ and $K_{ij} = \mathrm{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.

$$\underbrace{p(f(x) \mid \mathcal{D})}_{\text{GP posterior}} \;\propto\; \underbrace{p(\mathcal{D} \mid f(x))}_{\text{Likelihood}} \;\underbrace{p(f(x))}_{\text{GP prior}}$$
[Figure: samples from the GP prior (left) and from the GP posterior after conditioning on data (right); outputs f(x) plotted against inputs x.]
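To make the prior-to-posterior picture concrete, here is a minimal numpy sketch (my own illustration, not from the slides): it draws functions from a zero-mean GP prior with an RBF kernel and conditions on a few made-up noisy observations. The kernel, data, and noise level are all assumed for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 * lengthscale^2))
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

x_star = np.linspace(-10, 10, 200)            # test inputs
x_obs = np.array([-4.0, 0.0, 3.0])            # observed inputs (made up)
y_obs = np.array([-1.0, 1.5, 0.5])            # observed outputs (made up)
noise = 1e-2

# Samples from the GP prior: f(x*) ~ N(0, K(x*, x*))
K_ss = rbf_kernel(x_star, x_star) + 1e-8 * np.eye(len(x_star))
prior_samples = np.random.multivariate_normal(np.zeros(len(x_star)), K_ss, size=3)

# GP posterior p(f(x*) | D) by Gaussian conditioning
K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_s = rbf_kernel(x_star, x_obs)
post_mean = K_s @ np.linalg.solve(K, y_obs)
post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
```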
Example: RBF Kernel
The Power of Non-Parametric Representations
[Figure: radial basis functions centered at $c_{i-1}$, $c_i$, $c_{i+1}$, whose weighted sum forms $f(x)$.]

$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (1)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (2)$$
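The finite basis construction above is easy to probe numerically. The sketch below (an illustration with made-up centers and hyperparameters, not from the slides) builds the kernel of Eq. (2) and shows that, as the centers become dense, it approaches a smooth RBF-like form.

```python
import numpy as np

def finite_basis_kernel(x1, x2, centers, sigma2=1.0, ell=1.0):
    # k(x, x') = (sigma^2 / J) * sum_i phi_i(x) phi_i(x'), as in Eq. (2)
    phi1 = np.exp(-(x1[:, None] - centers[None, :]) ** 2 / (2 * ell ** 2))
    phi2 = np.exp(-(x2[:, None] - centers[None, :]) ** 2 / (2 * ell ** 2))
    return (sigma2 / len(centers)) * phi1 @ phi2.T

x = np.linspace(-5, 5, 50)
for J in (5, 50, 500):
    centers = np.linspace(-10, 10, J)   # basis centers covering the input range
    K = finite_basis_kernel(x, x, centers)
    # As J grows, the normalized kernel approaches exp(-(x - x')^2 / (4 ell^2))
    # up to a constant factor, i.e. an RBF kernel with a rescaled lengthscale.
    print(J, np.round(K[0, :4] / K[0, 0], 3))
```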
Deriving the RBF Kernel
[Figure: radial basis functions centered at $c_{i-1}$, $c_i$, $c_{i+1}$, whose weighted sum forms $f(x)$.]

$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (3)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (4)$$
Deriving the RBF Kernel
$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (6)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (7)$$
[Figure: three panels of sample functions drawn from the model, output f(x) plotted against input x.]
Learning and Model Selection
$$p(\mathcal{M}_i \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{p(\mathbf{y})} \qquad (11)$$
We can write the evidence of the model as
$$p(\mathbf{y} \mid \mathcal{M}_i) = \int p(\mathbf{y} \mid \mathbf{f}, \mathcal{M}_i)\, p(\mathbf{f})\, d\mathbf{f} \qquad (12)$$
[Figure: the evidence p(y|M) as a normalized distribution over all possible datasets y, alongside a plot of output f(x) against input x.]
Gaussian Processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006.
Bayesian Methods for Adaptive Models. MacKay, D.J. PhD Thesis, 1992.
Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. Wilson, A.G. PhD Thesis, 2014.
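For GP regression with Gaussian noise, the evidence integral (12) is available in closed form, p(y | M) = N(y; 0, K_M + σ²I), so candidate models can be compared directly. A small illustrative sketch (the toy data, kernels, and noise level are assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(x1, x2, ell=1.0, sigma2=1.0):
    return sigma2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def linear_kernel(x1, x2, sigma2=1.0):
    return sigma2 * np.outer(x1, x2)

x = np.linspace(-3, 3, 20)
y = np.sin(x) + 0.1 * np.random.randn(len(x))   # toy data (made up)
noise = 0.1 ** 2

# With equal priors p(M_i), the posterior over models (11) follows the evidence
for name, K in [("RBF", rbf_kernel(x, x)), ("linear", linear_kernel(x, x))]:
    log_evidence = multivariate_normal(
        mean=np.zeros(len(x)), cov=K + noise * np.eye(len(x))).logpdf(y)
    print(name, log_evidence)
```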
Machine Learning for Econometrics
(The Start of My Journey...)
Machine Learning for Econometrics
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics
But what about the kernel?
Gaussian Processes and Neural Networks
Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine
Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.
Gaussian Process Regression Networks
[Figure: left, the Gaussian process regression network architecture, in which inputs x feed latent GP node functions $\hat{f}_1(x), \ldots, \hat{f}_q(x)$ that are mixed by input-dependent weights $W_{11}(x), \ldots, W_{pq}(x)$ to produce outputs $y_1(x), \ldots, y_p(x)$; right, a predictive spatial field plotted over longitude and latitude.]
Gaussian process regression networks. Wilson, A.G., Knowles, D.A., Ghahramani, Z. ICML 2012.
The Inverse Wishart Process
Introduce a completely flexible Bayesian non-parametric distribution over kernels:

$$c \sim \mathcal{IWP}(\nu, k_\theta) \qquad (13)$$
$$y \mid c \sim \mathcal{GP}(\phi, (\nu - 2)\, c) \qquad (14)$$

Student-t Processes as Alternatives to Gaussian Processes. Shah, A., Wilson, A.G., Ghahramani, Z. AISTATS 2014.
Heteroscedasticity revisited...
[Figure: two modeling choices shown side by side, "Choice 1" and "Choice 2".]
Inductive Biases
▶ While the inverse Wishart process is flexible, its inductive biases are wrong.
▶ It is challenging to say anything about the covariance function of a stochastic process from a single draw if no assumptions are made.
▶ If we allow the covariance between any two points in the input space to arise from any positive definite function, with equal probability, then we gain essentially no information from a single realization.
▶ Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.
Spectral Mixture Kernels for Pattern Discovery
Let $\tau = x - x' \in \mathbb{R}^P$. From Bochner's theorem,

$$k(\tau) = \int_{\mathbb{R}^P} S(s)\, e^{2\pi i s^\top \tau}\, ds \qquad (32)$$

Then

$$k(\tau) = \exp\{-2\pi^2 \tau^2 \sigma^2\} \cos(2\pi \tau \mu)\,. \qquad (34)$$

$$k(\tau) = \sum_{q=1}^{Q} w_q \cos(2\pi \tau^\top \mu_q) \prod_{p=1}^{P} \exp\{-2\pi^2 \tau_p^2\, v_q^{(p)}\}\,. \qquad (35)$$
Gaussian Process Kernels for Pattern Discovery and Extrapolation. Wilson, A.G. and Adams, R.P. ICML, 2013.
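For reference, a minimal one-dimensional implementation of the spectral mixture kernel of Eq. (35); the hyperparameter values below are made up purely for illustration.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, means, variances):
    # 1-D SM kernel, Eq. (35):
    # k(tau) = sum_q w_q * cos(2 pi tau mu_q) * exp(-2 pi^2 tau^2 v_q)
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.cos(2 * np.pi * tau * mu) * np.exp(-2 * np.pi ** 2 * tau ** 2 * v)
    return k

x = np.linspace(0, 10, 100)
K = spectral_mixture_kernel(x, x, weights=[1.0, 0.5],
                            means=[0.1, 0.35], variances=[0.01, 0.02])
```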
GP Model for Pattern Extrapolation
▶ Observations $y(x) \sim \mathcal{N}(y(x); f(x), \sigma^2)$ (can easily be relaxed).
▶ $f(x) \sim \mathcal{GP}(0, k_{\mathrm{SM}}(x, x' \mid \theta))$ ($f(x)$ is a GP with SM kernel).
▶ $k_{\mathrm{SM}}(x, x' \mid \theta)$ can approximate many different kernels with different settings of its hyperparameters $\theta$.
▶ Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS); a sketch follows the equation below.

$$\log p(\mathbf{y} \mid \theta, X) = \underbrace{-\tfrac{1}{2} \mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1} \mathbf{y}}_{\text{model fit}} \; \underbrace{-\, \tfrac{1}{2} \log |K_\theta + \sigma^2 I|}_{\text{complexity penalty}} \; -\, \tfrac{N}{2} \log(2\pi)\,. \qquad (36)$$
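A minimal sketch of this training procedure (my own illustration): it maximizes Eq. (36) with scipy's L-BFGS-B, using a plain RBF kernel in place of the SM kernel for brevity, on made-up data.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    ell, sf2, sn2 = np.exp(log_params)   # lengthscale, signal var, noise var
    K = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    L = np.linalg.cholesky(K + sn2 * np.eye(len(x)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Negative of Eq. (36): model fit + complexity penalty + constant
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2 * np.pi))

x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.2 * np.random.randn(len(x))        # toy data (made up)
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3),
                  args=(x, y), method="L-BFGS-B")
lengthscale, signal_var, noise_var = np.exp(result.x)
```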
Function Learning Example
[Figure: airline passengers (thousands) plotted against year, 1949–1961; training data shown.]
Function Learning Example
[Figure: the airline passenger training data (thousands) against year, together with a hypothesized human extrapolation ("Human?").]
Results of Representation Learning
[Figure: airline passenger data with the training region, the withheld test data, a hypothesized human extrapolation ("Human?"), and the learned extrapolation ("Alien?").]
Results, Airline Passengers
[Figure: the learned spectral mixture (SM) spectral density plotted against frequency (1/month) for the airline data.]
Results, CO2
[Figure: left, CO2 concentration (ppm) against year (1968–2004) with training data, the SM extrapolation, and a 95% credible region (CR); right, learned spectral densities for the SM and SE kernels against frequency (1/month).]
What do we need for large scale pattern extrapolation?
[Figure: image extrapolation panels comparing Train, Test, and Full images with reconstructions from GPatt, SSGP, FITC, and GP-MA.]
Fast kernel learning for multidimensional pattern extrapolation. Wilson et al., NeurIPS 2014.
More Patterns
Scalable Exact Gaussian Processes
Exact Gaussian Processes on a Million Data Points
Deep Kernel Learning
Deep kernel learning combines the inductive biases of deep learning architectures
with the non-parametric flexibility of Gaussian processes.
[Figure: the deep kernel learning architecture. Inputs $x_1, \ldots, x_D$ pass through hidden layers $h^{(1)}, \ldots, h^{(L)}$ with weights $W^{(1)}, \ldots, W^{(L)}$; the final hidden representation feeds an infinite basis function layer $h_\infty(\theta)$ (the GP), producing outputs $y_1, \ldots, y_P$.]
Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016
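A minimal numpy sketch of the idea, not the authors' implementation: a small network g(x; w) extracts features and a base RBF kernel is applied to those features, so the network weights become kernel hyperparameters. The architecture, dimensions, and initialization below are illustrative assumptions; in practice all parameters are learned jointly by maximizing the marginal likelihood (36).

```python
import numpy as np

def mlp_features(x, W1, b1, W2, b2):
    # A small feature extractor g(x; w): two tanh layers (illustrative only)
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def rbf_kernel(z1, z2, ell=1.0, sigma2=1.0):
    d2 = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / ell ** 2)

def deep_kernel(x1, x2, params):
    # k_deep(x, x') = k_base(g(x; w), g(x'; w)): the base kernel acts on learned
    # features, combining the network's inductive biases with the GP's flexibility
    return rbf_kernel(mlp_features(x1, *params), mlp_features(x2, *params))

rng = np.random.default_rng(0)
D, H, F = 5, 16, 2                                   # input, hidden, feature dims
params = (rng.normal(size=(D, H)), np.zeros(H),
          rng.normal(size=(H, F)), np.zeros(F))
x = rng.normal(size=(10, D))
K = deep_kernel(x, x, params)                        # a valid Gram matrix on inputs
```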
Face Orientation Extraction
Figure: Top: Randomly sampled examples of the training and test data. Bottom: The
two dimensional outputs of the convolutional network on a set of test cases. Each
point is shown using a line segment that has the same orientation as the input face.
Learning Flexible Non-Euclidean Similarity Metrics
Figure: Left: The induced covariance matrix using DKL-SM (spectral mixture)
kernel on a set of test cases, where the test samples are ordered according to the
orientations of the input faces. Middle: The respective covariance matrix using
DKL-RBF kernel. Right: The respective covariance matrix using regular RBF
kernel. The models are trained with n = 12, 000.
Step Function
Figure: Recovering a step function. We show the predictive mean and 95% of the
predictive probability mass for regular GPs with RBF and SM kernels, and DKL with
SM base kernel.
Deep Kernel Learning for Autonomous Driving
[Figure: top, driving trajectories plotted as north (mi) against east (mi), colored by speed; bottom, two rows of panels showing front distance (m) against side distance (m).]
▶ $b$ is a bias, $v_i$ are the hidden-to-output weights, $h$ is any bounded hidden unit transfer function, $u_i$ are the input-to-hidden weights, and $J$ is the number of hidden units. Let $b$ and $v_i$ be independent with zero mean and variances $\sigma_b^2$ and $\sigma_v^2 / J$, respectively, and let the $u_i$ have independent identical distributions. Collecting all free parameters into the weight vector $\mathbf{w}$,

$$\mathbb{E}_{\mathbf{w}}[f(x)] = 0\,, \qquad (38)$$

$$\mathrm{cov}[f(x), f(x')] = \mathbb{E}_{\mathbf{w}}[f(x)\, f(x')] = \sigma_b^2 + \frac{\sigma_v^2}{J} \sum_{i=1}^{J} \mathbb{E}_{\mathbf{u}}[h_i(x; u_i)\, h_i(x'; u_i)]\,, \qquad (39)$$

$$= \sigma_b^2 + \sigma_v^2\, \mathbb{E}_{\mathbf{u}}[h(x; u)\, h(x'; u)]\,. \qquad (40)$$

We can show any collection of values $f(x_1), \ldots, f(x_N)$ must have a joint Gaussian distribution using the central limit theorem.
Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
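This limit is easy to probe empirically. The sketch below (my own illustration, with assumed prior variances) samples many finite but wide networks with erf hidden units and estimates the covariance of f at two inputs, which should approach the form of Eq. (40).

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
J, n_nets = 2000, 2000                        # hidden units per network, networks sampled
sigma_b2, sigma_v2, sigma_u = 1.0, 1.0, 2.0   # assumed prior variances (illustrative)

x = np.array([0.5, -1.0])                     # two inputs at which to compare f
x_tilde = np.stack([np.ones_like(x), x])      # augmented inputs (1, x)

# f(x) = b + sum_i v_i * erf(u_i . (1, x)), with priors as described above
f = np.zeros((n_nets, 2))
for n in range(n_nets):
    b = rng.normal(0.0, np.sqrt(sigma_b2))
    v = rng.normal(0.0, np.sqrt(sigma_v2 / J), size=J)
    u = rng.normal(0.0, sigma_u, size=(J, 2))
    f[n] = b + v @ erf(u @ x_tilde)

print(np.cov(f.T))   # approaches sigma_b^2 + sigma_v^2 * E_u[h(x; u) h(x'; u)]
```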
Neural Network Kernel
$$f(x) = b + \sum_{i=1}^{J} v_i\, h(x; u_i)\,. \qquad (41)$$

▶ Let $h(x; u) = \mathrm{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)$, where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} dt$
▶ Choose $u \sim \mathcal{N}(0, \Sigma)$

Then we obtain

$$k_{\mathrm{NN}}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left(\frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}')}}\right), \qquad (42)$$

where $\tilde{x} = (1, x_1, \ldots, x_P)^\top$ is the augmented input vector.
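For reference, a small numpy implementation of Eq. (42) for one-dimensional inputs, using the augmented input x̃ = (1, x); the Σ below is an illustrative choice.

```python
import numpy as np

def nn_kernel(x1, x2, Sigma):
    # Neural network (arcsine) kernel of Eq. (42) for 1-D inputs
    x1t = np.stack([np.ones_like(x1), x1], axis=1)   # augmented inputs (1, x)
    x2t = np.stack([np.ones_like(x2), x2], axis=1)
    cross = 2.0 * x1t @ Sigma @ x2t.T
    d1 = 1.0 + 2.0 * np.sum((x1t @ Sigma) * x1t, axis=1)
    d2 = 1.0 + 2.0 * np.sum((x2t @ Sigma) * x2t, axis=1)
    return (2.0 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(d1, d2)))

x = np.linspace(-5, 5, 200)
Sigma = np.diag([1.0, 10.0])              # Sigma = diag(sigma_0^2, sigma^2)
K = nn_kernel(x, x, Sigma)
samples = np.random.multivariate_normal(
    np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```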
Neural Network Kernel

$$k_{\mathrm{NN}}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left(\frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}')}}\right) \qquad (43)$$

Set $\Sigma = \mathrm{diag}(\sigma_0^2, \sigma^2)$. Draws from a GP with a neural network kernel with varying $\sigma$:

[Figure: GP sample draws for several values of $\sigma$.]

Gaussian Processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006.
Neural Tangent Kernels
▶ Recent work [e.g., 1, 2, 3] derives neural tangent kernels from infinite neural network limits, with promising results.
▶ Closely related to Radford Neal's [4] result showing a Bayesian neural network with infinitely many hidden units converges to a Gaussian process.
▶ Note that most kernels from infinite neural network limits have a fixed structure. On the other hand, standard neural networks essentially learn a similarity metric (kernel) for the data. Learning a kernel amounts to representation learning. Bridging this gap is interesting future work.
[1] Neural tangent kernel: convergence and generalization in neural networks. Jacot et al., NeurIPS 2018.
[2] On exact computation with an infinitely wide neural net. Arora et al., NeurIPS 2019.
[3] Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. Arora et al., arXiv 2019.
[4] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
Functional Kernel Learning
See, e.g.,
[1] Function-Space Distributions over Kernels.
Benton, G., Maddox, W. J., Salkey, J., Wilson, A. G. NeurIPS, 2019.
[2] Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes.
Wilson, 2014.
[3] Bayesian Nonparametric Spectral Estimation. Tobar, F. NeurIPS, 2018.
Functional Kernel Learning (Model Specification)
Results (Sinc)
Texture Extrapolation
Multi-Task Learning