
Representation Learning with Gaussian Processes

My journey in working with Gaussian processes

Andrew Gordon Wilson

https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University

Gaussian Process Summer School


September 14, 2020

1 / 52
Gaussian processes
Definition
A Gaussian process (GP) is a collection of random variables, any finite number of
which have a joint Gaussian distribution.
Nonparametric Regression Model
▶ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x_1), …, f(x_N)) ∼ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).

GP posterior ∝ likelihood × GP prior:

p(f(x) | D) ∝ p(D | f(x)) p(f(x))

[Figure: sample prior functions (left) and sample posterior functions (right); outputs f(x) against inputs x.]
2 / 52
Example: RBF Kernel

k_RBF(x, x′) = cov(f(x), f(x′)) = a² exp(−‖x − x′‖² / (2ℓ²))
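For concreteness, a small sketch of this kernel and of drawing functions from a GP prior that uses it (Python/NumPy; the amplitude a and length-scale ℓ values are arbitrary illustrative choices):

```python
import numpy as np

def k_rbf(x1, x2, a=1.0, ell=1.0):
    """RBF kernel k(x, x') = a^2 exp(-||x - x'||^2 / (2 ell^2)) on 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return a**2 * np.exp(-sqdist / (2 * ell**2))

# Draw three functions from the GP prior f ~ GP(0, k_RBF), evaluated on a
# finite grid, where the draw is just a multivariate Gaussian sample.
x = np.linspace(-10, 10, 200)
K = k_rbf(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
f_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```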

3 / 52
The Power of Non-Parametric Representations

Deriving the RBF Kernel

[Figure: radial basis functions φ_i(x) with centres c_{i−1}, c_i, c_{i+1} along the x-axis.]

f(x) = Σ_{i=1}^{J} w_i φ_i(x),   w_i ∼ N(0, σ²/J),   φ_i(x) = exp(−(x − c_i)² / (2ℓ²))   (1)

⇒ k(x, x′) = (σ²/J) Σ_{i=1}^{J} φ_i(x) φ_i(x′)   (2)

4 / 52
Deriving the RBF Kernel
[Figure: radial basis functions φ_i(x) with centres c_{i−1}, c_i, c_{i+1} along the x-axis.]

f(x) = Σ_{i=1}^{J} w_i φ_i(x),   w_i ∼ N(0, σ²/J),   φ_i(x) = exp(−(x − c_i)² / (2ℓ²))   (3)

⇒ k(x, x′) = (σ²/J) Σ_{i=1}^{J} φ_i(x) φ_i(x′)   (4)

▶ Let c_J = log J, c_1 = −log J, and c_{i+1} − c_i = Δc = (2 log J)/J, and let J → ∞; the kernel in Eq. (4) becomes a Riemann sum:

k(x, x′) = lim_{J→∞} (σ²/J) Σ_{i=1}^{J} φ_i(x) φ_i(x′) = ∫_{c_0}^{c_1} σ² φ_c(x) φ_c(x′) dc   (5)

5 / 52
Deriving the RBF Kernel

f(x) = Σ_{i=1}^{J} w_i φ_i(x),   w_i ∼ N(0, σ²/J),   φ_i(x) = exp(−(x − c_i)² / (2ℓ²))   (6)

⇒ k(x, x′) = (σ²/J) Σ_{i=1}^{J} φ_i(x) φ_i(x′)   (7)

▶ Let c_J = log J, c_1 = −log J, and c_{i+1} − c_i = Δc = (2 log J)/J, and let J → ∞; the kernel in Eq. (7) becomes a Riemann sum:

k(x, x′) = lim_{J→∞} (σ²/J) Σ_{i=1}^{J} φ_i(x) φ_i(x′) = ∫_{c_0}^{c_1} σ² φ_c(x) φ_c(x′) dc   (8)

▶ By setting c_0 = −∞ and c_1 = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:

k(x, x′) = σ² ∫_{−∞}^{∞} exp(−(x − c)² / (2ℓ²)) exp(−(x′ − c)² / (2ℓ²)) dc   (9)
        = √π ℓ σ² exp(−(x − x′)² / (2(√2 ℓ)²)) ∝ k_RBF(x, x′) .   (10)
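A quick numerical check of this limit under the same assumptions (NumPy; σ, ℓ, and the grid of centres are illustrative): with many evenly spaced basis functions, the weight-space kernel (σ²/J) Σ φ_i(x) φ_i(x′) is proportional to an RBF kernel with length-scale √2 ℓ.

```python
import numpy as np

sigma, ell, J = 1.0, 1.0, 2000
centers = np.linspace(-20, 20, J)                 # basis-function centres c_i

def phi(x, c, ell):
    return np.exp(-(x - c) ** 2 / (2 * ell**2))

def k_basis(x, xp):
    # k(x, x') = (sigma^2 / J) * sum_i phi_i(x) phi_i(x')
    return sigma**2 / J * np.sum(phi(x, centers, ell) * phi(xp, centers, ell))

def k_rbf_sqrt2(x, xp):
    # Unnormalised RBF kernel with length-scale sqrt(2) * ell
    return np.exp(-(x - xp) ** 2 / (2 * (np.sqrt(2) * ell) ** 2))

# The ratio k_basis / k_rbf_sqrt2 should be (nearly) the same constant for all pairs.
pairs = [(0.0, 0.5), (0.0, 1.0), (-2.0, 1.0)]
print([k_basis(x, xp) / k_rbf_sqrt2(x, xp) for x, xp in pairs])
```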
6 / 52
Inference using an RBF kernel
▶ Specify f(x) ∼ GP(0, k).
▶ Choose k_RBF(x, x′) = a_0² exp(−‖x − x′‖² / (2ℓ_0²)). Choose values for a_0 and ℓ_0.
▶ Observe data, look at the prior and posterior over functions.

[Figure: samples from the GP prior (left) and GP posterior (right); output f(x) against input x.]

▶ Does something look strange about these functions?
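For reference, the posterior over functions used in plots like these has a closed form. A minimal sketch of the standard GP regression equations (NumPy; the observations, noise level, and hyperparameters are placeholders):

```python
import numpy as np

def k_rbf(x1, x2, a=1.0, ell=1.0):
    return a**2 * np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell**2))

X = np.array([-6.0, -3.0, 0.0, 2.0, 5.0])   # toy training inputs
y = np.sin(X)                                # toy training targets
Xs = np.linspace(-10, 10, 200)               # test inputs
noise = 1e-2

# Posterior mean = K_*x (K_xx + noise I)^-1 y
# Posterior cov  = K_** - K_*x (K_xx + noise I)^-1 K_x*
Kxx = k_rbf(X, X) + noise * np.eye(len(X))
Ksx = k_rbf(Xs, X)
Kss = k_rbf(Xs, Xs)
mean = Ksx @ np.linalg.solve(Kxx, y)
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
post_samples = np.random.multivariate_normal(mean, cov + 1e-8 * np.eye(len(Xs)), size=3)
```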


7 / 52
Inference using an RBF kernel

Increase the length-scale ℓ.

[Figure: samples from the GP prior with the longer length-scale; output f(x) against input x.]

▶ Does something look strange about these functions?

8 / 52
Learning and Model Selection
p(M_i | y) = p(y | M_i) p(M_i) / p(y)   (11)

We can write the evidence of the model as

p(y | M_i) = ∫ p(y | f, M_i) p(f) df   (12)

[Figure: Bayesian Occam's razor. Left: the evidence p(y | M) over all possible datasets for simple, complex, and appropriate models. Right: example data with fits from the simple, complex, and appropriate models.]
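For GP regression with Gaussian noise this evidence is available in closed form (it is the log marginal likelihood that reappears in Eq. (36) later in the deck), so candidate models can be scored directly. A sketch with illustrative kernels and synthetic data:

```python
import numpy as np

def log_evidence(K, y, noise=0.1):
    """log p(y | M) = -1/2 y^T C^-1 y - 1/2 log|C| - N/2 log(2 pi), with C = K + noise I."""
    C = K + noise * np.eye(len(y))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

# Compare a "simple" (long length-scale) and a "complex" (short length-scale) model.
x = np.linspace(-5, 5, 40)
y = np.sin(x) + 0.1 * np.random.randn(len(x))
k = lambda ell: np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))
print(log_evidence(k(2.0), y), log_evidence(k(0.1), y))   # higher value = preferred model
```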

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006.
Bayesian Methods for Adaptive Models. MacKay, D.J. PhD Thesis, 1992.
Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes.
Wilson, A.G. PhD Thesis, 2014.
9 / 52
Machine Learning for Econometrics
(The Start of My Journey...)

Autoregressive Conditional Heteroscedasticity (ARCH)


2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a_0 + a_1 y(t−1)²)

10 / 52
Machine Learning for Econometrics

Autoregressive Conditional Heteroscedasticity (ARCH)


2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a_0 + a_1 y(t−1)²)

Gaussian Copula Process Volatility (GCPV)


(My First PhD Project)

y(x) = N(y(x); 0, f(x)²)

f(x) ∼ GP(m(x), k(x, x′))

Copula processes. Wilson, A.G., Ghahramani, Z. NeurIPS 2010.

11 / 52
Machine Learning for Econometrics

Autoregressive Conditional Heteroscedasticity (ARCH)


2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a_0 + a_1 y(t−1)²)

Gaussian Copula Process Volatility (GCPV)


(My First PhD Project)

y(x) = N(y(x); 0, f(x)²)

f(x) ∼ GP(m(x), k(x, x′))

▶ Can approximate a much greater range of variance functions
▶ Operates on continuous inputs x
▶ Can effortlessly handle missing data
▶ Can effortlessly accommodate multivariate inputs x (covariates other than time)

12 / 52
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a_0 + a_1 y(t−1)²)

Gaussian Copula Process Volatility (GCPV)


(My First PhD Project)

y(x) = N(y(x); 0, f(x)²)

f(x) ∼ GP(m(x), k(x, x′))

▶ Can approximate a much greater range of variance functions
▶ Operates on continuous inputs x
▶ Can effortlessly handle missing data
▶ Can effortlessly accommodate multivariate inputs x (covariates other than time)

13 / 52
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a_0 + a_1 y(t−1)²)

Gaussian Copula Process Volatility (GCPV)


(My First PhD Project)

y(x) = N(y(x); 0, f(x)²)

f(x) ∼ GP(m(x), k(x, x′))

▶ Can approximate a much greater range of variance functions
▶ Operates on continuous inputs x
▶ Can effortlessly handle missing data
▶ Can effortlessly accommodate multivariate inputs x (covariates other than time)
▶ Observation: performance is extremely sensitive to even small changes in kernel hyperparameters
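A sketch of the generative model as written above (NumPy; the kernel and hyperparameters are illustrative, and this follows only the simplified specification on the slide, not the full copula-process construction): draw the latent function f from a GP, then draw each observation with standard deviation |f(x)|.

```python
import numpy as np

x = np.linspace(0, 10, 300)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 1.0**2)) + 1e-8 * np.eye(len(x))

# Latent function f ~ GP(0, k); its square acts as an input-dependent variance.
f = np.random.multivariate_normal(np.zeros(len(x)), K)

# Observations y(x) ~ N(0, f(x)^2): heteroscedastic noise driven by the GP.
y = np.abs(f) * np.random.randn(len(x))
```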

14 / 52
But what about the kernel?

▶ If hypers like length-scale have a substantial effect on performance, then surely kernel selection is practically crucial!
▶ But many papers default to the RBF kernel, without even treating the kernel as a major design decision.
▶ Aren't these supposed to be principled models without challenging design decisions, unlike neural networks? What is going on here???
▶ We need to automate kernel selection!

15 / 52
Gaussian Processes and Neural Networks

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” (MacKay, 1998)

Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine
Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.

16 / 52
Gaussian Process Regression Networks

[Figure: Gaussian process regression network. Left: inputs x feed latent GP node functions f̂_1(x), …, f̂_q(x), which are combined through input-dependent weight functions W_ij(x) to produce outputs y_1(x), …, y_p(x). Right: a learned spatial field over longitude and latitude (colour scale 0.00–0.90).]

Gaussian process regression networks. Wilson, A.G., Knowles, D.A., Ghahramani, Z. ICML 2012.

17 / 52
The Inverse Wishart Process

Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (13)
y | c ∼ GP(φ, (ν − 2)c)   (14)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

Student-t processes as alternatives to GPs. Shah, A., Wilson, A.G., Ghahramani, Z. AISTATS 2014.

18 / 52
The Inverse Wishart Process

Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (15)
y | c ∼ GP(φ, (ν − 2)c)   (16)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

▶ The predictive distribution is

p(y | x, D) = ∫ p(y | x, c) p(c | D) dc   (17)

19 / 52
The Inverse Wishart Process

Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (18)
y | c ∼ GP(φ, (ν − 2)c)   (19)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

▶ The predictive distribution is

p(y | x, D) = ∫ p(y | x, c) p(c | D) dc   (20)

▶ Perhaps surprisingly, we can analytically marginalize this extremely flexible posterior over kernels.

20 / 52
The Inverse Wishart Process

Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (21)
y | c ∼ GP(φ, (ν − 2)c)   (22)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

▶ The predictive distribution is

p(y | x, D) = ∫ p(y | x, c) p(c | D) dc   (23)

▶ Perhaps surprisingly, we can analytically marginalize this extremely flexible posterior over kernels.
▶ The result is a Student-t distribution, with a predictive mean exactly equal to what we would get using a GP with kernel k_θ.

21 / 52
The Inverse Wishart Process

Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (24)
y | c ∼ GP(φ, (ν − 2)c)   (25)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

▶ The predictive distribution is

p(y | x, D) = ∫ p(y | x, c) p(c | D) dc   (26)

▶ Perhaps surprisingly, we can analytically marginalize this extremely flexible posterior over kernels.
▶ The result is a Student-t distribution, with a predictive mean exactly equal to what we would get using a GP with kernel k_θ.
▶ But the predictive uncertainty does depend on the data.

22 / 52
The Inverse Wishart Process
Introduce a completely flexible Bayesian non-parametric distribution over kernels:

c ∼ IWP(ν, k_θ)   (27)
y | c ∼ GP(φ, (ν − 2)c)   (28)

c is an infinite-dimensional matrix, with any finite-dimensional covariance matrix being a marginal of this matrix, with an inverse Wishart distribution. This prior has support for all positive definite kernels, with mean equal to k_θ.

▶ The predictive distribution is

p(y | x, D) = ∫ p(y | x, c) p(c | D) dc   (29)

▶ The result is a Student-t distribution, with a predictive mean exactly equal to what we would get using a GP with kernel k_θ.
▶ Shockingly, the marginal predictive distribution is the same as we get for the wildly different generative model:

a⁻¹ ∼ Gamma(ν/2, ρ/2)   (30)
y | a ∼ GP(φ, a(ν − 2)k_θ / ρ)   (31)

23 / 52
Heteroscedasticity revisited...

Which of these models do you prefer, and why?

Choice 1

y(x) | f(x), g(x) ∼ N(y(x); f(x), g(x)²)

f(x) ∼ GP,   g(x) ∼ GP

Choice 2

y(x) | f(x), g(x) ∼ N(y(x); f(x)g(x), g(x)²)

f(x) ∼ GP,   g(x) ∼ GP

24 / 52
Inductive Biases

▶ While the inverse Wishart process is flexible, its inductive biases are wrong.
▶ It is challenging to say anything about the covariance function of a stochastic process from a single draw if no assumptions are made.
▶ If we allow the covariance between any two points in the input space to arise from any positive definite function, with equal probability, then we gain essentially no information from a single realization.
▶ Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.

25 / 52
Spectral Mixture Kernels for Pattern Discovery
Let τ = x − x′ ∈ R^P. From Bochner's Theorem,

k(τ) = ∫_{R^P} S(s) e^{2πi sᵀτ} ds   (32)

For simplicity, assume τ ∈ R¹ and let

S(s) = [N(s; µ, σ²) + N(−s; µ, σ²)] / 2 .   (33)

Then

k(τ) = exp{−2π²τ²σ²} cos(2πτµ) .   (34)

More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with covariance matrix M_q = diag(v_q^(1), …, v_q^(P)), then

k(τ) = Σ_{q=1}^{Q} w_q cos(2π τᵀµ_q) Π_{p=1}^{P} exp{−2π² τ_p² v_q^(p)} .   (35)
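A direct transcription of Eq. (35) in one dimension (NumPy; the weights, spectral means, and variances are placeholder values that would normally be learned from data):

```python
import numpy as np

def k_sm(tau, weights, means, variances):
    """1-D spectral mixture kernel:
    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)."""
    tau = np.asarray(tau)[..., None]              # broadcast over mixture components
    return np.sum(weights * np.exp(-2 * np.pi**2 * tau**2 * variances)
                  * np.cos(2 * np.pi * tau * means), axis=-1)

# Two components: a quasi-periodic pattern plus a broad low-frequency trend.
tau = np.linspace(-5, 5, 500)
k_vals = k_sm(tau,
              weights=np.array([1.0, 0.5]),
              means=np.array([0.8, 0.0]),
              variances=np.array([0.01, 0.2]))
```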

Gaussian Process Kernels for Pattern Discovery and Extrapolation. Wilson, A.G., Adams, R.P. ICML 2013.

26 / 52
GP Model for Pattern Extrapolation

▶ Observations y(x) ∼ N(y(x); f(x), σ²) (can easily be relaxed).
▶ f(x) ∼ GP(0, k_SM(x, x′ | θ)) (f(x) is a GP with SM kernel).
▶ k_SM(x, x′ | θ) can approximate many different kernels with different settings of its hyperparameters θ.
▶ Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):

log p(y | θ, X) = −(1/2) yᵀ(K_θ + σ²I)⁻¹y  [model fit]  −  (1/2) log |K_θ + σ²I|  [complexity penalty]  −  (N/2) log(2π) .   (36)

▶ Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X_*, θ̂), which can be expressed in closed form.
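A compact sketch of this procedure, with an RBF kernel standing in for the SM kernel (NumPy/SciPy; hyperparameters are optimised over their logarithms with L-BFGS-B, and the data are synthetic):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """-log p(y | theta, X) for a zero-mean GP with RBF kernel and noise variance."""
    a, ell, noise = np.exp(log_params)            # positivity via log parameterisation
    K = a**2 * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))
    C = K + (noise + 1e-8) * np.eye(len(x))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                       # model fit term
            + np.sum(np.log(np.diag(L)))          # complexity penalty, 1/2 log|C|
            + 0.5 * len(y) * np.log(2 * np.pi))

x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * np.random.randn(len(x))
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(x, y), method="L-BFGS-B")
a_hat, ell_hat, noise_hat = np.exp(res.x)         # trained hyperparameters
```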

27 / 52
Function Learning Example

[Figure: airline passenger numbers (thousands) against year, 1949–1961; training data shown.]

28 / 52
Function Learning Example

[Figure: airline passenger numbers (thousands) against year, with the training data and a candidate “Human?” extrapolation.]

29 / 52
Function Learning Example

[Figure: airline passenger numbers (thousands) against year, with the training data and a candidate “Human?” extrapolation.]

30 / 52
Results of Representation Learning

[Figure: airline passenger numbers (thousands) against year, with training data, withheld test data, and candidate “Human?” and “Alien?” extrapolations.]

31 / 52
Results, Airline Passengers

[Figure: log spectral density against frequency (1/month) for the airline data: the learned spectral mixture (SM) kernel, the squared exponential (SE) kernel, and the empirical spectrum.]

32 / 52
Results, CO2

[Figure: Left: CO₂ concentration (ppm) against year, showing training data, withheld test data, the SM kernel extrapolation with 95% credible region, and extrapolations from MA, RQ, PER, and SE kernels. Right: log spectral densities against frequency (1/month) for the SM and SE kernels and the empirical spectrum.]

33 / 52
What do we need for large scale pattern extrapolation?

[Figure panels: (a) Train, (b) Test, (c) Full, (d) GPatt, (e) SSGP, (f) FITC, (g) GP-RBF, (h) GP-MA, (i) GP-RQ, (j) GPatt initialisation; (k)–(p) further Train / GPatt / GP-MA comparisons.]

Fast kernel learning for multidimensional pattern extrapolation. Wilson et al., NeurIPS 2014.

34 / 52
More Patterns

[Figure panels: (a) rubber mat, (b) tread plate, (c) pores, (d) wood, (e) chain mail, (f) cone.]

35 / 52
Scalable Exact Gaussian Processes

▶ By developing stochastic Krylov methods that exploit parallel cores in GPUs, we can now run exact GPs on millions of points.
▶ Implemented in the new library GPyTorch: gpytorch.ai
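A minimal exact GP in GPyTorch, following the pattern from its documentation (the data, kernel, and training loop here are illustrative; placing the model and data on a GPU is what lets the library's matrix-matrix inference scale to much larger datasets):

```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(6 * train_x) + 0.1 * torch.randn(1000)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Train hyperparameters by maximising the exact marginal likelihood.
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
model.train(); likelihood.train()
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```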

GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration.


Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., Wilson, A. G. NeurIPS, 2018.

36 / 52
Exact Gaussian Processes on a Million Data Points

▶ Results show the benefit of retaining a non-parametric representation.

Exact Gaussian processes on a million data points.


Wang, K.A., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K., Wilson, A.G. NeurIPS, 2019.

37 / 52
Deep Kernel Learning

Deep kernel learning combines the inductive biases of deep learning architectures
with the non-parametric flexibility of Gaussian processes.

[Figure: deep kernel learning architecture. Inputs x_1, …, x_D pass through hidden layers with weights W^(1), …, W^(L); the top-layer features feed a GP layer governed by base kernel hyperparameters θ, producing outputs y_1, …, y_P.]

Base kernel hyperparameters θ and deep network hyperparameters w are jointly trained through the marginal likelihood objective.
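A sketch of the same idea in GPyTorch (the feature extractor, base kernel, and sizes are illustrative, not those from the paper): the network g(x; w) maps inputs to features, the base kernel with hyperparameters θ acts on those features, and both sets of parameters sit in one model whose marginal likelihood can be maximised exactly as in the previous sketch.

```python
import torch
import gpytorch

class DeepKernelGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, input_dim):
        super().__init__(train_x, train_y, likelihood)
        # Deep network g(x; w): its weights w become extra kernel hyperparameters.
        self.feature_extractor = torch.nn.Sequential(
            torch.nn.Linear(input_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 2))
        self.mean_module = gpytorch.means.ConstantMean()
        # Base kernel on learned features: k_deep(x, x') = k_base(g(x; w), g(x'; w) | theta)
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        z = self.feature_extractor(x)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

train_x = torch.randn(500, 10)
train_y = torch.sin(train_x[:, 0]) + 0.1 * torch.randn(500)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DeepKernelGP(train_x, train_y, likelihood, input_dim=10)
```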

Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016

38 / 52
Face Orientation Extraction

[Figure: example faces with orientation labels (−43.10, 36.15, 17.35, −3.49, −19.81); rows of training data and test data.]

Figure: Top: Randomly sampled examples of the training and test data. Bottom: The
two dimensional outputs of the convolutional network on a set of test cases. Each
point is shown using a line segment that has the same orientation as the input face.

39 / 52
Learning Flexible Non-Euclidean Similarity Metrics


Figure: Left: The induced covariance matrix using DKL-SM (spectral mixture)
kernel on a set of test cases, where the test samples are ordered according to the
orientations of the input faces. Middle: The respective covariance matrix using
DKL-RBF kernel. Right: The respective covariance matrix using regular RBF
kernel. The models are trained with n = 12, 000.

40 / 52
Step Function


Figure: Recovering a step function. We show the predictive mean and 95% of the
predictive probability mass for regular GPs with RBF and SM kernels, and DKL with
SM base kernel.

41 / 52
Deep Kernel Learning for Autonomous Driving
[Figure: autonomous driving data. Top: vehicle trajectories (North vs. East, miles) coloured by speed. Middle and bottom: panels of front distance (m) against side distance (m) across a sequence of frames.]

Learning scalable deep kernels with recurrent structure.


Al-Shedivat, M., Wilson, A.G., Saatchi, Y., Hu, Z., Xing, E.P. JMLR, 2017.
42 / 52
Kernels from Infinite Bayesian Neural Networks
▶ The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.

Consider a neural network with one hidden layer:

f(x) = b + Σ_{i=1}^{J} v_i h(x; u_i) .   (37)

▶ b is a bias, v_i are the hidden-to-output weights, h is any bounded hidden unit transfer function, u_i are the input-to-hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b² and σ_v²/J, respectively, and let the u_i have independent identical distributions. Collecting all free parameters into the weight vector w,

E_w[f(x)] = 0 ,   (38)

cov[f(x), f(x′)] = E_w[f(x)f(x′)] = σ_b² + (σ_v²/J) Σ_{i=1}^{J} E_u[h_i(x; u_i) h_i(x′; u_i)] ,   (39)
                 = σ_b² + σ_v² E_u[h(x; u) h(x′; u)] .   (40)

We can show any collection of values f(x_1), …, f(x_N) must have a joint Gaussian distribution using the central limit theorem.
Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
43 / 52
Neural Network Kernel

f(x) = b + Σ_{i=1}^{J} v_i h(x; u_i) .   (41)

▶ Let h(x; u) = erf(u_0 + Σ_{j=1}^{P} u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{−t²} dt
▶ Choose u ∼ N(0, Σ)

Then we obtain

k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)) ) ,   (42)

where x ∈ R^P and x̃ = (1, xᵀ)ᵀ.
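A sketch of this kernel, together with a Monte Carlo check against the erf hidden unit defined above (NumPy/SciPy; the choice of Σ and the input points are arbitrary):

```python
import numpy as np
from scipy.special import erf

def k_nn(x, xp, Sigma):
    """Neural network (arcsine) kernel of Eq. (42), with x_tilde = (1, x)."""
    xt, xpt = np.concatenate(([1.0], x)), np.concatenate(([1.0], xp))
    num = 2 * xt @ Sigma @ xpt
    den = np.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xpt @ Sigma @ xpt))
    return 2 / np.pi * np.arcsin(num / den)

# Monte Carlo check: for u ~ N(0, Sigma), E[ erf(u^T x_tilde) erf(u^T x'_tilde) ]
# should match the closed form above (cf. the expectation in Eq. (40)).
rng = np.random.default_rng(0)
Sigma = np.diag([1.0, 4.0])                       # Sigma = diag(sigma_0^2, sigma^2), P = 1
x, xp = np.array([0.3]), np.array([-1.2])
xt, xpt = np.concatenate(([1.0], x)), np.concatenate(([1.0], xp))
u = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
mc = np.mean(erf(u @ xt) * erf(u @ xpt))
print(mc, k_nn(x, xp, Sigma))                     # the two values should nearly agree
```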

44 / 52
Neural Network Kernel

k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)) )   (43)

Set Σ = diag(σ_0², σ²). Draws from a GP with a neural network kernel with varying σ:

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

45 / 52
Neural Network Kernel

k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)) )   (44)

Set Σ = diag(σ_0², σ²). Draws from a GP with a neural network kernel with varying σ:

Question: Is a GP with this kernel doing representation learning?

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

46 / 52
Neural Tangent Kernels

▶ Recent work [e.g., 1, 2, 3] derives neural tangent kernels from infinite neural network limits, with promising results.
▶ Closely related to Radford Neal's [4] result showing a Bayesian neural network with infinitely many hidden units converges to a Gaussian process.
▶ Note that most kernels from infinite neural network limits have a fixed structure. On the other hand, standard neural networks essentially learn a similarity metric (kernel) for the data. Learning a kernel amounts to representation learning. Bridging this gap is interesting future work.

[1] Neural tangent kernel: convergence and generalization in neural networks. Jacot et al., NeurIPS 2018.
[2] On exact computation with an infinitely wide neural net. Arora et al., NeurIPS 2019.
[3] Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. Arora et al., arXiv 2019.
[4] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.

47 / 52
Functional Kernel Learning

See, e.g.,
[1] Function-Space Distributions over Kernels.
Benton, G., Maddox, W. J., Salkey, J., Wilson, A. G. NeurIPS, 2019.
[2] Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes.
Wilson, 2014.
[3] Bayesian Nonparametric Spectral Estimation. Tobar, F. NeurIPS, 2018.
48 / 52
Functional Kernel Learning (Model Specification)

Function-Space Distributions over Kernels.


Benton, G., Maddox, W. J., Salkey, J., Wilson, A. G. NeurIPS, 2019.

49 / 52
Results (Sinc)

50 / 52
Texture Extrapolation

51 / 52
Multi-Task Learning

52 / 52
