
Variational Gaussian Processes

Zhenwen Dai
Spotify

15 September 2020 @GPSS 2020

Gaussian process
Input and Output Data:
y = (y1 , . . . , yN ), X = (x1 , . . . , xN )>

p(y|f) = N(y | f, σ²I),

p(f|X) = N(f | 0, K(X, X))

[Figure: GP regression fit, showing the posterior mean, data points, and confidence band.]
Behind a Gaussian process fit

Maximum likelihood estimate of the hyper-parameters.

θ∗ = arg max_θ log p(y|X, θ) = arg max_θ log N(y | 0, K + σ²I)

Prediction on a test point given the observed data and the optimized
hyper-parameters.

p(f∗|X∗, y, X, θ) = N(f∗ | K∗(K + σ²I)⁻¹y, K∗∗ − K∗(K + σ²I)⁻¹K∗>)

How to implement the log-likelihood (1)

Compute the covariance matrix K:


 
K = [ k(x1, x1)   · · ·   k(x1, xN) ]
    [    ...       ...       ...    ]
    [ k(xN, x1)   · · ·   k(xN, xN) ]

where k(xi, xj) = γ exp(−(1/(2l²)) (xi − xj)>(xi − xj))




The complexity is O(N²Q).
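A minimal sketch of this computation in NumPy (my own illustration, not the code used for these slides); rbf_kernel, gamma and lengthscale are names I introduce for the kernel, γ and l:

import numpy as np

def rbf_kernel(X, Z=None, gamma=1.0, lengthscale=1.0):
    """RBF covariance K(X, Z); with Z=None it returns the N x N matrix K(X, X)."""
    if Z is None:
        Z = X
    # Squared Euclidean distances between all pairs of rows.
    sq_dist = (np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
               - 2.0 * X @ Z.T)
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against tiny negative values
    return gamma * np.exp(-0.5 * sq_dist / lengthscale**2)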

How to implement the log-likelihood (2)

Plug in the log-pdf of the multivariate normal distribution:

log p(y|X) = log N(y | 0, K + σ²I)
           = −(1/2) log |2π(K + σ²I)| − (1/2) y>(K + σ²I)⁻¹y
           = −(1/2) (||L⁻¹y||² + N log 2π) − Σ_i log L_ii

Take a Cholesky decomposition: L = chol(K + σ²I).


The computational complexity is O(N³ + N² + N). Therefore, the overall complexity, including the computation of K, is O(N³).

A quick profiling (N =1000, Q=10)

Line #   Time(µs)   % Time   Line Contents


2 def log_likelihood(kern, X, Y, sigma2):
3 6.0 0.0 N = X.shape[0]
4 55595.0 58.7 K = kern.K(X)
5 4369.0 4.6 Ky = K + np.eye(N)*sigma2
6 30012.0 31.7 L = np.linalg.cholesky(Ky)
7 4361.0 4.6 LinvY = dtrtrs(L, Y, lower=1)[0]
8 49.0 0.1 logL = N*np.log(2*np.pi)/-2.
9 82.0 0.1 logL += np.square(LinvY).sum()/-2.
10 208.0 0.2 logL += -np.log(np.diag(L)).sum()
11 2.0 0.0 return logL
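For completeness, a self-contained sketch of the same computation, with scipy.linalg.solve_triangular standing in for GPy's dtrtrs helper (an assumption; the profiled code above uses GPy's linear algebra utilities):

import numpy as np
from scipy.linalg import solve_triangular

def log_likelihood(K, Y, sigma2):
    """GP log-marginal likelihood log N(Y | 0, K + sigma2*I) via Cholesky."""
    N = K.shape[0]
    Ky = K + np.eye(N) * sigma2                    # O(N^2)
    L = np.linalg.cholesky(Ky)                     # O(N^3)
    LinvY = solve_triangular(L, Y, lower=True)     # O(N^2)
    logL = -0.5 * N * np.log(2 * np.pi)
    logL -= 0.5 * np.square(LinvY).sum()
    logL -= np.log(np.diag(L)).sum()               # -0.5*log|Ky| = -sum(log diag(L))
    return logL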

Empirical analysis of computational time
I collect the run time for N = {10, 100, 500, 1000, 1500, 2000}.
They take 1.3 ms, 8.5 ms, 28 ms, 0.12 s, 0.29 s and 0.76 s, respectively.

[Figure: measured run time (seconds) versus data size N, with a GP fit (mean, data, and confidence band).]
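A sketch of how such timings could be collected, assuming the rbf_kernel and log_likelihood functions sketched on the previous slides (names are mine, not from the slides):

import time
import numpy as np

def time_log_likelihood(N, Q=10, sigma2=0.1, repeats=3):
    """Best-of-`repeats` wall-clock time of one log-likelihood evaluation."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((N, Q))
    Y = rng.standard_normal(N)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        K = rbf_kernel(X)          # kernel computation included, as in the profiling above
        log_likelihood(K, Y, sigma2)
        best = min(best, time.perf_counter() - start)
    return best

for N in [10, 100, 500, 1000, 1500, 2000]:
    print(N, time_log_likelihood(N))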

What if we have 1 million data points?
The mean of the predicted computational time is 9.4 × 10⁷ seconds ≈ 2.98 years.

What about waiting for faster computers?

Computational time = amount of work / computer speed.


If computer speed increases at a pace of 20% year over year:
After 10 years, it will take about 176 days.
After 50 years, it will take about 2.9 hours.

What about parallel computing / GPU?

Ongoing work on speeding up the Cholesky decomposition with multi-core CPU or GPU.
Main limitations:
heavy communication and shared memory
O(N²) memory consumption

[Figure: runtime of the variational sparse GP criterion and gradient computation on the "power" dataset, comparing an MXNet linalg implementation (on CPU and GPU) with the GPy reference implementation (same CPU instance), for varying numbers U of inducing points. For U = 50, MXNet with GPU takes 0.015 seconds and GPy 1.070 seconds; for U = 3200, MXNet with GPU takes 1.49 seconds.]
Other approaches

Apart from speeding up the exact computation, there has been a lot of work on approximate GP inference.
These methods often target a specific scenario and provide a good approximation for it.
The rest of this talk provides an overview of common approximations.

Big data (?)
lots of data ≠ complex function
In real-world problems, we often collect a lot of data for modelling relatively simple relations.

[Figure: GP fit (mean, data, and confidence band) on a large data set with a relatively simple underlying function.]
Data subsampling?

Real data are often not evenly distributed.


We tend to get a lot of data on common cases and very few data on rare cases.

[Figure: GP fit (mean, data, and confidence band) on unevenly distributed data.]
Covariance matrix of redundant data
With redundant data, the covariance matrix becomes low rank.
What about low rank approximation?

[Figure: heat map of the covariance matrix computed from the redundant data.]
Low-rank approximation

Let’s recall the log-likelihood of GP:

log p(y|X) = log N(y | 0, K + σ²I),

where K is the covariance matrix computed from X according to the kernel function k(·, ·), and σ² is the variance of the Gaussian noise distribution.
Assume K to be low rank.
This leads to the Nyström approximation of Williams and Seeger [2001].

Approximation by subset

Let's randomly pick a subset from the training data: Z ∈ R^(M×Q).


Approximate the covariance matrix K by K̃:

K̃ = Kz Kzz⁻¹ Kz>,   where Kz = K(X, Z) and Kzz = K(Z, Z).

Note that K̃ ∈ R^(N×N), Kz ∈ R^(N×M) and Kzz ∈ R^(M×M).


The log-likelihood is approximated by

log p(y|X, θ) ≈ log N(y | 0, Kz Kzz⁻¹ Kz> + σ²I).
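A sketch of the Nyström construction, reusing the hypothetical rbf_kernel helper from the earlier sketch; it builds K̃ = Kz Kzz⁻¹ Kz> from a random subset and reports the trace error tr(K − K̃) shown on the next slides:

import numpy as np

def nystrom_trace_error(X, M, jitter=1e-8, seed=0):
    """Build the Nystrom approximation from M random points and return tr(K - K_tilde)."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(X.shape[0], size=M, replace=False)]
    K = rbf_kernel(X)                              # N x N (only for measuring the error)
    Kz = rbf_kernel(X, Z)                          # N x M
    Kzz = rbf_kernel(Z) + jitter * np.eye(M)       # M x M, jitter for numerical stability
    K_tilde = Kz @ np.linalg.solve(Kzz, Kz.T)      # K_tilde = Kz Kzz^{-1} Kz^T
    return np.trace(K - K_tilde)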

Nyström approximation example

The covariance matrix with Nyström approximation using 5 random data points:

[Figure: heat map of the covariance matrix under the Nyström approximation with 5 random data points.]
Nyström approximation example
 
Compute tr(K − K̃) for different M.

[Figure: tr(K − K̃) as a function of M; the trace decreases towards zero as M increases.]
Efficient computation using Woodbury formula

The naive formulation does not bring any computational benefits.


L̃ = −(1/2) log |2π(K̃ + σ²I)| − (1/2) y>(K̃ + σ²I)⁻¹y
Apply the Woodbury formula:

(Kz Kzz⁻¹ Kz> + σ²I)⁻¹ = σ⁻²I − σ⁻⁴ Kz (Kzz + σ⁻² Kz> Kz)⁻¹ Kz>

Note that (Kzz + σ⁻² Kz> Kz) ∈ R^(M×M).
The computational complexity reduces to O(NM²).
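A sketch of how the approximate log-likelihood can be evaluated in O(NM²), using the Woodbury identity for the quadratic term and the matrix determinant lemma for the log-determinant (my own arrangement, not code from the slides):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def nystrom_log_likelihood(Kz, Kzz, y, sigma2, jitter=1e-8):
    """log N(y | 0, Kz Kzz^{-1} Kz^T + sigma2*I) in O(N M^2) via the Woodbury identity."""
    N, M = Kz.shape
    Kzz = Kzz + jitter * np.eye(M)
    A = Kzz + Kz.T @ Kz / sigma2                   # M x M
    A_cho = cho_factor(A)
    Kzz_cho = cho_factor(Kzz)

    # Quadratic term: y^T (K_tilde + sigma2*I)^{-1} y
    Kzy = Kz.T @ y
    quad = (y @ y) / sigma2 - Kzy @ cho_solve(A_cho, Kzy) / sigma2**2

    # Log-determinant via the matrix determinant lemma:
    # log|K_tilde + sigma2*I| = log|A| - log|Kzz| + N*log(sigma2)
    logdet = (2 * np.log(np.diag(A_cho[0])).sum()
              - 2 * np.log(np.diag(Kzz_cho[0])).sum()
              + N * np.log(sigma2))

    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)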

Nyström approximation

The approximation is directly done on the covariance matrix without the concept of
pseudo data.
The approximation becomes exact if the whole data set is taken, i.e.,
K K⁻¹ K> = K.
The subset selection is done randomly.

Gaussian process with Pseudo Data (1)

Snelson and Ghahramani [2006] propose the idea of having pseudo data, which is later referred to as the fully independent training conditional (FITC).
Augment the training data (X, y) with pseudo data u at location Z.

p( [y; u] | [X; Z] ) = N( [y; u] | 0, [ Kff + σ²I   Kfu ;  Kfu>   Kuu ] )

where Kff = K(X, X), Kfu = K(X, Z) and Kuu = K(Z, Z).

Gaussian process with Pseudo Data (2)

Thanks to the marginalization property of the Gaussian distribution,

p(y|X) = ∫ p(y, u | X, Z) du.

Further re-arrange the notation:

p(y, u|X, Z) = p(y|u, X, Z)p(u|Z)

where p(u|Z) = N(u | 0, Kuu),

p(y|u, X, Z) = N(y | Kfu Kuu⁻¹ u, Kff − Kfu Kuu⁻¹ Kfu> + σ²I).

FITC approximation (1)

So far, p(y|X) has not been changed, but there is no speed-up.


Kff ∈ R^(N×N) appears in Kff − Kfu Kuu⁻¹ Kfu> + σ²I.
The FITC approximation assumes

p̃(y|u, X, Z) = N(y | Kfu Kuu⁻¹ u, Λ + σ²I),

where Λ = (Kff − Kfu Kuu⁻¹ Kfu>) ◦ I.
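A small sketch of the diagonal correction Λ; only the diagonals are needed, so no N × N matrix is formed (it reuses the hypothetical rbf_kernel helper from earlier):

import numpy as np

def fitc_lambda_diag(X, Z, gamma=1.0, lengthscale=1.0, jitter=1e-8):
    """Diagonal of Lambda = (Kff - Kfu Kuu^{-1} Kfu^T) o I as a length-N vector."""
    M = Z.shape[0]
    Kfu = rbf_kernel(X, Z, gamma, lengthscale)                              # N x M
    Kuu = rbf_kernel(Z, gamma=gamma, lengthscale=lengthscale) + jitter * np.eye(M)
    # diag(Kfu Kuu^{-1} Kfu^T), computed row-wise without forming the N x N matrix
    Qff_diag = np.sum(Kfu * np.linalg.solve(Kuu, Kfu.T).T, axis=1)
    Kff_diag = gamma * np.ones(X.shape[0])      # k(x, x) = gamma for the RBF kernel
    return Kff_diag - Qff_diag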

FITC approximation (2)

Marginalize u from the model definition:

p̃(y|X, Z) = N(y | 0, Kfu Kuu⁻¹ Kfu> + Λ + σ²I)

The Woodbury formula can be applied in the same way as in the Nyström approximation:

(Kz Kzz⁻¹ Kz> + Λ + σ²I)⁻¹ = A − A Kz (Kzz + Kz> A Kz)⁻¹ Kz> A,

where A = (Λ + σ²I)⁻¹.

FITC approximation (3)

FITC allows the pseudo data to not be a subset of the training data.
The inducing inputs Z can be optimized via gradient optimization.
Like Nyström approximation, when taking all the training data as inducing inputs,
the FITC approximation is equivalent to the original GP:

p̃(y|X, Z = X) = N(y | 0, Kff + σ²I)


FITC can be combined easily with expectation propagation (EP).


Bui et al. [2017] provide an overview and a nice connection with variational sparse GP.

Model Approximation vs. Approximate Inference
FITC approximation changes the model definition.
A better objective under FITC does not necessarily correspond to a better approximation to the original GP.
In fact, optimizing Z can lead to overfitting. [Quiñonero-Candela and Rasmussen, 2005, Bauer et al., 2016]

[Figure: FITC (nlml = 23.16, σn = 1.93 · 10⁻⁴) vs. VFE (nlml = 38.86) on a subset of 100 data points of the Snelson dataset. [Bauer et al., 2016]]

Optimal values for the exact GP: nlml = 34.15, σn = 0.274. [Bauer et al., 2016]
Model Approximation vs. Approximate Inference

Variational inference (VI) takes a different approach.

VI keeps the model definition untouched.


VI derives a lower bound of the log-marginal likelihood:

log p(y) ≥ ∫ q(x) log ( p(y, x) / q(x) ) dx = L

Alternatively, it can be written as

KL (q(x) ‖ p(x|y)) = log p(y) − L.

Variational Sparse Gaussian Process (1)

Titsias [2009] introduces a variational approach for sparse GP.


It follows the same concept of pseudo data:

p(y|X) = ∫ p(y|f) p(f|u, X, Z) p(u|Z) df du

where p(u|Z) = N(u | 0, Kuu),

p(f|u, X, Z) = N(f | Kfu Kuu⁻¹ u, Kff − Kfu Kuu⁻¹ Kfu>).

Variational Sparse Gaussian Process (2)

Instead of approximating the model, Titsias [2009] derives a variational lower bound.
Normally, a variational lower bound of a marginal likelihood looks like

log p(y|X) = log ∫ p(y|f) p(f|u, X, Z) p(u|Z) df du

           ≥ ∫ q(f, u) log ( p(y|f) p(f|u, X, Z) p(u|Z) / q(f, u) ) df du.

Special Variational Posterior

Titsias [2009] defines an unusual variational posterior:

q(f, u) = p(f|u, X, Z) q(u),   where q(u) = N(u | µ, Σ).

Plug it into the lower bound:

L = ∫ p(f|u, X, Z) q(u) log ( p(y|f) p(f|u, X, Z) p(u|Z) / (p(f|u, X, Z) q(u)) ) df du   (the p(f|u, X, Z) factors cancel)

  = ⟨log p(y|f)⟩_{p(f|u,X,Z) q(u)} − KL (q(u) ‖ p(u|Z))

  = ⟨log N(y | Kfu Kuu⁻¹ u, σ²I)⟩_{q(u)} − KL (q(u) ‖ p(u|Z))

Special Variational Posterior

There is no inversion of any big covariance matrices in the first term:


N 1
− log 2πσ 2 − 2 (Kf u K−1 > −1
uu u − y) (Kf u Kuu u − y) q(u)
2 2σ
The overall complexity of the lower bound is O(N M 2 ).

Tighten the Bound

Find the optimal parameters of q(u):

µ∗, Σ∗ = arg max_{µ,Σ} L(µ, Σ).

Make the bound as tight as possible by plugging in µ∗ and Σ∗ :


L = log N(y | 0, Kfu Kuu⁻¹ Kfu> + σ²I) − (1/(2σ²)) tr(Kff − Kfu Kuu⁻¹ Kfu>).
The 1st term is the same as in the Nyström approximation.
The overall complexity of the lower bound remains O(NM²).
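A sketch of this collapsed bound, reusing the nystrom_log_likelihood and rbf_kernel helpers sketched earlier (hypothetical names, not from the slides) for the first term, plus the trace regulariser computed from diagonals only:

import numpy as np

def titsias_collapsed_bound(X, Z, y, sigma2, gamma=1.0, lengthscale=1.0, jitter=1e-8):
    """Collapsed variational lower bound of Titsias [2009] for a Gaussian likelihood."""
    M = Z.shape[0]
    Kfu = rbf_kernel(X, Z, gamma, lengthscale)
    Kuu = rbf_kernel(Z, gamma=gamma, lengthscale=lengthscale) + jitter * np.eye(M)
    # First term: log N(y | 0, Kfu Kuu^{-1} Kfu^T + sigma2*I), evaluated in O(N M^2)
    bound = nystrom_log_likelihood(Kfu, Kuu, y, sigma2)
    # Trace term: -1/(2 sigma2) * tr(Kff - Kfu Kuu^{-1} Kfu^T), using only diagonals
    Qff_diag = np.sum(Kfu * np.linalg.solve(Kuu, Kfu.T).T, axis=1)
    Kff_diag = gamma * np.ones(X.shape[0])
    bound -= 0.5 * np.sum(Kff_diag - Qff_diag) / sigma2
    return bound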

Variational sparse GP
Note that L is not a valid log-pdf: ∫ exp(L(y)) dy ≤ 1, due to the trace term.
As inducing points are variational parameters, optimizing the inducing inputs Z
always leads to a better bound.
The model does not “overfit” with too many inducing points.

[Figure: sparse GP fit, showing the mean, inducing points, data, and confidence band.]
FITC vs. Variational sparse GP
model approximation vs. approximate inference (see [Bauer et al., 2016])
Note that, when point estimating the hyper-parameters, if the number of inducing points is too small, the model may "under-fit":

L = log p(y) − KL (q(x) ‖ p(x|y)).

[Figure: behaviour of FITC (nlml = 23.16, σn = 1.93 · 10⁻⁴) and VFE (nlml = 38.86, σn = 0.286) on a subset of 100 data points of the Snelson dataset for 8 inducing inputs, compared with the full GP (nlml = 34.15, σn = 0.274). FITC shrinks the noise variance to practically zero, while VFE approximates the posterior predictive distribution almost exactly. [Bauer et al., 2016]]
Limitations of Sparse GP

Variational sparse GP has computational complexity O(NM²).

The computation becomes infeasible under two scenarios:

The number of data points N is very high, e.g., millions of data points.
The function is very complex, which requires tens of thousands of inducing points.

Mini-batch Learning (1)

Mini-batch learning allows DNNs to be trained on millions of data points.


Given a set of inputs and labels, D = {xi, yi}_{i=1}^N, (xi, yi) ∼ p(x, y), the true loss function is defined as

c_true = ∫ l(fθ(x), y) p(x, y) dx dy ≈ (1/N) Σ_{i=1}^N l(fθ(xi), yi) = c,

where fθ(·) is a DNN and l(·, ·) is the loss function.


Gradient descent (GD) updates the parameters by

θ_{t+1} = θ_t − η dc/dθ.

Mini-batch Learning (2)

Mini-batch learning approximates the loss by subsampling the data,


c_MB = (1/B) Σ_{(xi, yi) ∼ p̃(x,y)} l(fθ(xi), yi).

Stochastic gradient descent (SGD) updates the parameters by

θ_{t+1} = θ_t − η dc_MB/dθ.

Can mini-batch learning be applied to GPs as well?

Mini-batch Learning for GPs

Mini-batch learning relies on the objective being an expectation w.r.t. the data, i.e., ⟨l(fθ(x), y)⟩_{p(x,y)}.
The log-marginal likelihood of GP:

log N(y | 0, K + σ²I)

The variational lower bound of sparse GP:

log N(y | 0, Kfu Kuu⁻¹ Kfu> + σ²I) − (1/(2σ²)) tr(Kff − Kfu Kuu⁻¹ Kfu>)

Neither of them is a sum of per-data-point terms.

“Uncollapsed” Lower Bound

Hensman et al. [2013] discover that the "uncollapsed" variational lower bound of sparse GP can be used for mini-batch learning.
The "uncollapsed" variational lower bound of sparse GP:

L = ⟨log p(y|f)⟩_{p(f|u,X,Z) q(u)} − KL (q(u) ‖ p(u))

The 2nd term, KL (q(u) ‖ p(u)), does not depend on the data.

“Uncollapsed” Lower Bound

In the 1st term, as p(y|f) = N(y | f, σ²I),

log p(y|f) = Σ_{n=1}^N log N(yn | fn, σ²)

Denote q(f|X, Z) = ∫ p(f|u, X, Z) q(u) du.

⟨log p(y|f)⟩_{q(f|X,Z)} = ⟨ Σ_{n=1}^N log N(yn | fn, σ²) ⟩_{q(f|X,Z)}
                        = Σ_{n=1}^N ⟨log N(yn | fn, σ²)⟩_{q(fn|xn,Z)}

Stochastic Variational GP (SVGP)

The resulting lower bound can be written as the sum over the data,
L = Σ_{n=1}^N ⟨log N(yn | fn, σ²)⟩_{q(fn|xn,Z)} − KL (q(u) ‖ p(u))

  ≈ (N/B) Σ_{(xi, yi) ∼ p̃(x,y)} ⟨log N(yi | fi, σ²)⟩_{q(fi|xi,Z)} − KL (q(u) ‖ p(u)) = L_MB

This allows us to do mini-batch learning with SGD,

θ_{t+1} = θ_t − η dL_MB/dθ.
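A sketch of the unbiased mini-batch estimate L_MB for the Gaussian-likelihood case, with q(u) = N(µ, Σ) and the hypothetical rbf_kernel helper from earlier; it uses the standard closed form of ⟨log N(yn | fn, σ²)⟩ under a Gaussian q(fn), and in practice the gradients for the SGD update would come from an automatic-differentiation framework:

import numpy as np

def svgp_minibatch_bound(X_batch, y_batch, N, Z, mu, Sigma, sigma2,
                         gamma=1.0, lengthscale=1.0, jitter=1e-8):
    """Unbiased mini-batch estimate of the uncollapsed SVGP bound (Gaussian likelihood)."""
    B, M = X_batch.shape[0], Z.shape[0]
    Kfu = rbf_kernel(X_batch, Z, gamma, lengthscale)                          # B x M
    Kuu = rbf_kernel(Z, gamma=gamma, lengthscale=lengthscale) + jitter * np.eye(M)
    A = np.linalg.solve(Kuu, Kfu.T).T              # rows a_n^T = k_n^T Kuu^{-1}
    # Marginals of q(f_n) = N(m_n, v_n)
    m = A @ mu
    v = gamma - np.sum(A * Kfu, axis=1) + np.sum((A @ Sigma) * A, axis=1)
    # <log N(y_n | f_n, sigma2)>_{q(f_n)} = log N(y_n | m_n, sigma2) - v_n / (2 sigma2)
    ell = (-0.5 * np.log(2 * np.pi * sigma2)
           - 0.5 * (y_batch - m) ** 2 / sigma2
           - 0.5 * v / sigma2)
    # KL(q(u) || p(u)) between the two Gaussians
    kl = 0.5 * (np.trace(np.linalg.solve(Kuu, Sigma))
                + mu @ np.linalg.solve(Kuu, mu) - M
                + np.linalg.slogdet(Kuu)[1] - np.linalg.slogdet(Sigma)[1])
    return (N / B) * np.sum(ell) - kl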

2D Synthetic Data

[Figure: convergence of the SVIGP algorithm on the two-dimensional toy data.]
Airline Delay Data

Flight delays for every commercial flight in the USA from January to April 2008.
700,000 train, 100,000 test

[Figure: root mean squared errors in predicting flight delays using information about the flight, and root mean squared errors for models with different numbers of inducing variables. [Hensman et al., 2013]]
The pros and cons of SVGP

Pros

With mini-batch learning, the computational complexity reduces from O(NM²) to O(M³).

Cons

The variational distribution q(u) needs to be explicitly optimized.
The number of variational parameters increases from MQ to (2M + M²)Q.
Optimization relies on SGD methods, and methods like L-BFGS are no longer applicable.
It can be challenging to initialize q(u).

Non-Gaussian likelihood

So far, we have only discussed GP regression with Gaussian noise distribution.

In practice, many different noise distributions are used for modelling real data, e.g.,

Student-t distribution: data with outliers
Poisson / Multinomial distribution: integer counts
Beta distribution: bounded real values
Bernoulli / Categorical distribution: classification labels

Common approaches

Common choices for approximate inference:

Exact GP with Laplace approximation


Expectation Propagation (EP) with sparse GP

Both of them are quite complex to implement and difficult to scale.

SVGP with non-Gaussian likelihood

Let's use binary classification as an example.

The outputs are binary: y = (y1, . . . , yN), yi ∈ {0, 1}.

The likelihood is a Bernoulli distribution with a sigmoid link function:

p(yi|fi) = σ(fi)^{yi} (1 − σ(fi))^{1−yi}

SVGP with non-Gaussian likelihood

The lower bound of SVGP is


L = Σ_{n=1}^N ⟨log p(yn|fn)⟩_{q(fn|xn,Z)} − KL (q(u) ‖ p(u)).

The 2nd term, KL (q(u) ‖ p(u)), has a closed form.

The 1st term, Σ_{n=1}^N ⟨log p(yn|fn)⟩_{q(fn|xn,Z)}, is a sum of N 1D integrals.
Those integrals are intractable.

Gauss-Hermite Quadrature
Regarding those 1D integrals:
q(fn|xn, Z) is a 1D Gaussian distribution. See the definition:

q(fn|xn, Z) = ∫ p(fn|u, xn, Z) q(u) du.

Gauss-Hermite quadrature can be applied,

⟨log p(yn|fn)⟩_{q(fn|xn,Z)} ≈ Σ_{j=1}^C wj log p(yn|fj),

wj = 2^(C−1) C! √π / (C² [H_{C−1}(fj)]²).

The quadrature result is exact if log p(yn|fn) is a polynomial with order less than C.

[Figure: Gauss-Hermite quadrature for n = 5, 10, 15, 20.]
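A sketch of this quadrature rule using numpy.polynomial.hermite.hermgauss; after the change of variables f = m + √(2v)·x, the weights are rescaled by 1/√π (standard Gauss-Hermite practice, not spelled out on the slide):

import numpy as np

def expected_log_lik_gh(log_lik, m, v, C=20):
    """Approximate <log p(y_n | f_n)> under q(f_n) = N(m, v) with C Gauss-Hermite points."""
    x, w = np.polynomial.hermite.hermgauss(C)      # nodes/weights for weight exp(-x^2)
    f = m + np.sqrt(2.0 * v) * x                   # change of variable to N(m, v)
    return np.sum(w * log_lik(f)) / np.sqrt(np.pi)

# Example: Bernoulli likelihood with a sigmoid link, for y_n = 1
sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))
log_lik = lambda f: np.log(sigmoid(f))             # log p(y_n = 1 | f_n)
print(expected_log_lik_gh(log_lik, m=0.5, v=2.0))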

Binary Classification Example
[Figure: the effect of increasing the number of inducing points (M = 4, 8, 16, 32, 64, Full) for the banana dataset. Rows represent the KL method, the mean field method and Generalized FITC; columns show increasing numbers of inducing points. In each pane, the coloured points are training data, the inducing inputs are black dots, and the decision boundaries are black lines. The rightmost column shows the result of the equivalent non-sparse methods. [Hensman et al., 2015]]
Beyond 1D GPs

Multi-class classification is a common example.


For a C-class classification, y ∈ {1, . . . , C}, a GP is used to model each class:

f1, . . . , fC ∼ GP(0, K(X, X)).

The common likelihood is a categorical distribution with a soft-max function,

p(yn | fn1, . . . , fnC) = Π_{j=1}^C g(fnj)^{δ[yn−j]},   g(fnj) = e^{fnj} / Σ_{j'=1}^C e^{fnj'}

Gauss-Hermite quadrature is not a good choice due to high dimensionality.

Monte Carlo Sampling

Monte Carlo sampling can approximate the multi-dimensional integral:


⟨log p(yn|fn)⟩_{q(fn|xn,Z)} = ∫ q(fn|xn, Z) Σ_{j=1}^C δ[yn − j] log g(fnj) dfn

                            ≈ (1/T) Σ_{t=1}^T Σ_{j=1}^C δ[yn − j] log g(ftnj)

where ftn ∼ q(fn|xn, Z) and ftn = (ftn1, . . . , ftnC).

The reparameterization trick can be used to reduce the variance of the gradient.
Denote q(fnj|xn, Z) = N(mnj, σnj²). A sample can be rewritten as

ftnj = mnj + σnj εt,   εt ∼ N(0, 1).
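A sketch of the Monte Carlo estimate with reparameterized samples for a single data point; m and s stand for the marginal means and standard deviations of q(fn | xn, Z) over the C latent functions and are assumed to be given:

import numpy as np

def expected_log_softmax_mc(y_n, m, s, T=64, seed=0):
    """MC estimate of <log p(y_n | f_n)> for a softmax likelihood, via reparameterization.

    m, s: length-C arrays of marginal means and std devs of q(f_n); y_n in {0, ..., C-1}.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((T, len(m)))          # eps_t ~ N(0, I)
    f = m + s * eps                                 # reparameterized samples, T x C
    # log softmax, computed stably: log g(f_j) = f_j - logsumexp(f)
    fmax = f.max(axis=1, keepdims=True)
    log_g = f - fmax - np.log(np.sum(np.exp(f - fmax), axis=1, keepdims=True))
    return np.mean(log_g[:, y_n])

print(expected_log_softmax_mc(y_n=2, m=np.array([0.1, -0.3, 1.2]),
                              s=np.array([0.5, 0.5, 0.5])))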

Are big covariance matrices always (almost) low-rank?

Of course not.
A time series example:

y = f(t) + ε

The data are collected continuously at even time intervals.

A time series example: 10 data points
When we observe until t = 1.0:

A time series example: 100 data points
When we observe until t = 10.0:

A time series example: 1000 data points
When we observe until t = 100.0:

Banded precision matrix
For kernels like the Matérn family, the precision matrix is banded.
For example, the Matérn 1/2 kernel, also known as the exponential kernel:

k(x, x′) = σ² exp(−|x − x′| / ℓ).

This slide is taken from Nicolas Durrande [Durrande et al., 2019].
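A quick numerical check of this claim (my own sketch, not from the slides): for the exponential kernel on evenly spaced 1D inputs, the entries of the precision matrix K⁻¹ outside the tridiagonal band are numerically zero.

import numpy as np

def exponential_kernel(t, variance=1.0, lengthscale=1.0):
    """Matern-1/2 (exponential) covariance matrix for 1D inputs t."""
    return variance * np.exp(-np.abs(t[:, None] - t[None, :]) / lengthscale)

t = np.linspace(0.0, 10.0, 200)                 # evenly spaced time points
K = exponential_kernel(t)
Q = np.linalg.inv(K)                            # precision matrix

off_band = np.triu(np.abs(Q), k=2).max()        # largest entry outside the tridiagonal band
print(off_band, np.abs(np.diag(Q)).max())       # off_band is tiny compared with the diagonal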

Closed form precision matrix

The precision matrix of Matérn kernels can be computed in closed form.


The lower triangular matrix from the Cholesky decomposition of the precision
matrix is banded as well.
log p(y|X) = −(1/2) log |2π(LL>)⁻¹| − (1/2) tr(y y> L L>)

where L is the lower triangular matrix from the Cholesky decomposition of the precision matrix Q, Q = LL>.
The computational complexity becomes O(N).

Other approximations

deterministic/stochastic frequency approximation


distributed approximation
conjugate gradient methods for covariance matrix inversion

Q & A!

Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pages 1533–1541, 2016.
Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18:3649–3720, 2017.
Nicolas Durrande, Vincent Adam, Lucas Bordeaux, Stefanos Eleftheriadis, and James Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 2780–2789, 2019.
James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.
James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 351–360, 2015.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

