Variational Gaussian Processes
Zhenwen Dai (Spotify)
GPSS 2020, 15 September 2020
Gaussian process
Input and Output Data:
y = (y1, . . . , yN), X = (x1, . . . , xN)^⊤
p(y|f) = N(y | f, σ²I),   p(f|X) = N(f | 0, K(X, X))
[Figure: GP regression fit showing the mean, confidence interval and data.]
Behind a Gaussian process fit
Prediction on a test point given the observed data and the optimized
hyper-parameters.
p(f∗ | X∗, y, X, θ) = N( f∗ | K∗(K + σ²I)^{-1} y,  K∗∗ − K∗(K + σ²I)^{-1} K∗^⊤ )
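A minimal NumPy sketch of this predictive equation; the squared-exponential kernel, its hyper-parameters and the noise variance are illustrative assumptions rather than anything fixed by the slides:

```python
import numpy as np

def rbf_kernel(X, X2, variance=1.0, lengthscale=1.0):
    # squared-exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X_star, X, y, noise_var=0.1):
    # p(f*|X*, y, X) = N(f* | K* (K + s2 I)^-1 y, K** - K* (K + s2 I)^-1 K*^T)
    K = rbf_kernel(X, X) + noise_var * np.eye(X.shape[0])
    K_star = rbf_kernel(X_star, X)                         # K*
    K_ss = rbf_kernel(X_star, X_star)                      # K**
    L = np.linalg.cholesky(K)                              # K + s2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + s2 I)^-1 y
    V = np.linalg.solve(L, K_star.T)                       # L^-1 K*^T
    return K_star @ alpha, K_ss - V.T @ V                  # predictive mean and covariance
```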
How to implement the log-likelihood (1)
How to implement the log-likelihood (2)
log N(y | 0, K + σ²I)
  = −(1/2) log |2π(K + σ²I)| − (1/2) y^⊤ (K + σ²I)^{-1} y
  = −(1/2) ( ‖L^{-1}y‖² + N log 2π ) − Σᵢ log Lᵢᵢ,
where L is the Cholesky factor, LL^⊤ = K + σ²I.
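A NumPy sketch of this Cholesky-based computation, reusing the rbf_kernel function from the sketch above (the kernel and noise variance remain illustrative assumptions):

```python
import numpy as np

def gp_log_marginal(X, y, noise_var=0.1):
    # log N(y | 0, K + s2 I) via the Cholesky factor L L^T = K + s2 I
    N = X.shape[0]
    K = rbf_kernel(X, X) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, y)                                  # L^-1 y
    return -0.5 * (a @ a + N * np.log(2.0 * np.pi)) - np.sum(np.log(np.diag(L)))
```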
A quick profiling (N =1000, Q=10)
Empirical analysis of computational time
I collect the run time for N = {10, 100, 500, 1000, 1500, 2000}.
They take 1.3 ms, 8.5 ms, 28 ms, 0.12 s, 0.29 s and 0.76 s, respectively.
[Figure: GP fit (mean, data, confidence) to the measured run times; y-axis: time in seconds.]
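The timings above could be collected with a loop along these lines (a sketch; absolute numbers depend on the machine and BLAS library, and gp_log_marginal is the sketch from the earlier slide):

```python
import time
import numpy as np

Q = 10
for N in [10, 100, 500, 1000, 1500, 2000]:
    X = np.random.randn(N, Q)
    y = np.random.randn(N)
    start = time.perf_counter()
    gp_log_marginal(X, y)                      # O(N^3) exact computation
    print(N, time.perf_counter() - start, "seconds")
```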
What if we have 1 million data points?
The mean of predicted computational time is 9.4 × 10⁷ seconds ≈ 2.98 years.
What about waiting for faster computers?
What about parallel computing / GPU?
Apart from speeding up the exact computation, there has been a lot of work on approximate GP inference.
These methods often target a specific scenario and provide a good approximation for that scenario.
This lecture provides an overview of common approximations.
Big data (?)
lots of data ≠ complex function
In real world problems, we often collect a lot of data for modeling relatively simple
relations.
[Figure: GP fit (mean, data, confidence) to a large dataset with a relatively simple underlying function.]
Data subsampling?
[Figure: GP fit (mean, data, confidence) to a random subsample of the data.]
Covariance matrix of redundant data
With redundant data, the covariance matrix becomes low rank.
What about low rank approximation?
[Figure: visualisation of the covariance matrix of redundant data, which is visibly close to low rank.]
Low-rank approximation
Approximation by subset
K̃ = K_z K_zz^{-1} K_z^⊤, where K_z = K(X, Z) and K_zz = K(Z, Z).
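A NumPy sketch of this subset-based approximation, reusing rbf_kernel from earlier; the jitter added to K_zz for numerical stability is an implementation detail, not part of the slide:

```python
import numpy as np

def nystrom_approx(X, M, jitter=1e-8):
    # K_tilde = K_z K_zz^-1 K_z^T with Z an M-point random subset of X
    Z = X[np.random.choice(X.shape[0], M, replace=False)]
    K_z = rbf_kernel(X, Z)                         # K(X, Z)
    K_zz = rbf_kernel(Z, Z) + jitter * np.eye(M)   # K(Z, Z)
    return K_z @ np.linalg.solve(K_zz, K_z.T)
```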
Nyström approximation example
The covariance matrix with Nyström approximation using 5 random data points:
[Figure: the original covariance matrix and its Nyström approximation using 5 random data points.]
Nyström approximation example
Compute tr(K − K̃) with different M.
[Figure: tr(K − K̃) as a function of the number of inducing points M, for M from 2 to 20.]
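The trace comparison could be reproduced roughly as follows, using the nystrom_approx sketch above (the inputs are synthetic and only for illustration):

```python
import numpy as np

X = np.random.rand(100, 1)            # illustrative inputs
K = rbf_kernel(X, X)
for M in range(2, 21, 2):
    K_tilde = nystrom_approx(X, M)
    print(M, np.trace(K - K_tilde))   # approximation error shrinks as M grows
```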
Efficient computation using Woodbury formula
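The slide body is not reproduced in this extraction. For reference, applying the matrix inversion and determinant lemmas to the Nyström-approximated covariance gives the standard identities (the exact form on the slide may differ):

\[
\big(K_z K_{zz}^{-1} K_z^\top + \sigma^2 I\big)^{-1}
  = \sigma^{-2} I \;-\; \sigma^{-4}\, K_z \big(K_{zz} + \sigma^{-2} K_z^\top K_z\big)^{-1} K_z^\top ,
\]
\[
\log\big|K_z K_{zz}^{-1} K_z^\top + \sigma^2 I\big|
  = \log\big|K_{zz} + \sigma^{-2} K_z^\top K_z\big| \;-\; \log|K_{zz}| \;+\; N \log \sigma^2 ,
\]

so both the linear solve and the log-determinant in the log-likelihood cost O(NM²) rather than O(N³).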
Nyström approximation
The approximation is applied directly to the covariance matrix, without the concept of pseudo data.
The approximation becomes exact if the whole data set is taken as the subset, i.e., K K^{-1} K^⊤ = K.
The subset selection is done randomly.
Gaussian process with Pseudo Data (1)
Snelson and Ghahramani [2006] propose the idea of having pseudo data, which is later referred to as the fully independent training conditional (FITC) approximation.
Augment the training data (X, y) with pseudo data u at locations Z.
p( [y; u] | [X; Z] ) = N( [y; u] | 0, [ K_ff + σ²I, K_fu ; K_fu^⊤, K_uu ] )
Gaussian process with Pseudo Data (2)
FITC approximation (1)
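The body of this slide is not reproduced in the extraction. For reference, FITC (Snelson and Ghahramani [2006]; see also Quiñonero-Candela and Rasmussen [2005]) replaces the exact conditional p(f|u) by a factorised one, which leads to the approximate marginal

\[
\tilde p(\mathbf{y}\mid X, Z)
  = \mathcal{N}\big(\mathbf{y}\mid \mathbf{0},\; K_{fu} K_{uu}^{-1} K_{uf} + \Lambda + \sigma^2 I\big),
\qquad
\Lambda = \operatorname{diag}\big(K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}\big),
\]

which is consistent with the A = (Λ + σ²I)^{-1} that appears in part (2) below.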
FITC approximation (2)
where A = (Λ + σ²I)^{-1}.
FITC approximation (3)
FITC allows the pseudo data to not be a subset of the training data.
The inducing inputs Z can be optimized via gradient-based optimization.
Like the Nyström approximation, when taking all the training data as inducing inputs, the FITC approximation is equivalent to the original GP:
p̃(y | X, Z = X) = N(y | 0, K_ff + σ²I)
Model Approximation vs. Approximate Inference
FITC approximation changes the model definition.
Variational Sparse Gaussian Process (1)
Variational Sparse Gaussian Process (2)
Instead of approximating the model, Titsias [2009] derives a variational lower bound.
Normally, a variational lower bound of a marginal likelihood looks like
log p(y|X) = log ∫_{f,u} p(y|f) p(f|u, X, Z) p(u|Z)
           ≥ ∫_{f,u} q(f, u) log [ p(y|f) p(f|u, X, Z) p(u|Z) / q(f, u) ].
Special Variational Posterior
Titsias [2009] chooses the variational posterior q(f, u) = p(f|u, X, Z) q(u), so that the conditional p(f|u, X, Z) cancels inside the bound:

L = ∫_{f,u} p(f|u, X, Z) q(u) log [ p(y|f) p(u|Z) / q(u) ]
  = ⟨log p(y|f)⟩_{p(f|u,X,Z) q(u)} − KL( q(u) ‖ p(u|Z) )
  = ⟨log N(y | K_fu K_uu^{-1} u, σ²I)⟩_{q(u)} − (1/(2σ²)) tr( K_ff − K_fu K_uu^{-1} K_uf ) − KL( q(u) ‖ p(u|Z) )
Special Variational Posterior
Tighten the Bound
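The slide body is not reproduced here. The bound is tightened by maximising over q(u) in closed form, which gives the collapsed bound of Titsias [2009]:

\[
L = \log \mathcal{N}\big(\mathbf{y}\mid \mathbf{0},\; K_{fu} K_{uu}^{-1} K_{uf} + \sigma^2 I\big)
    \;-\; \frac{1}{2\sigma^2} \operatorname{tr}\big(K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}\big).
\]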
Variational sparse GP
Note that L is not a valid log-pdf, ∫_y exp(L(y)) dy ≤ 1, due to the trace term.
As the inducing inputs are variational parameters, optimizing the inducing inputs Z always leads to a better bound.
The model does not “overfit” with too many inducing points.
[Figure: variational sparse GP fit showing the mean, confidence interval, data and inducing-point locations.]
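A NumPy sketch of the collapsed bound above, computed in O(NM²) with the Woodbury identities from earlier (rbf_kernel as before; Z is picked as a random subset purely for illustration, whereas on the slides Z is optimised):

```python
import numpy as np

def collapsed_bound(X, y, Z, noise_var=0.1, jitter=1e-8):
    # Titsias (2009) collapsed bound, O(N M^2):
    #   L = log N(y | 0, Q_ff + s2 I) - tr(K_ff - Q_ff) / (2 s2),  Q_ff = K_fz K_zz^-1 K_zf
    N, M = X.shape[0], Z.shape[0]
    K_fz = rbf_kernel(X, Z)
    K_zz = rbf_kernel(Z, Z) + jitter * np.eye(M)
    L_zz = np.linalg.cholesky(K_zz)
    V = np.linalg.solve(L_zz, K_fz.T)              # M x N, with V^T V = Q_ff
    B = np.eye(M) + V @ V.T / noise_var            # B = I + s^-2 V V^T
    L_B = np.linalg.cholesky(B)
    c = np.linalg.solve(L_B, V @ y) / noise_var
    log_det = N * np.log(noise_var) + 2.0 * np.sum(np.log(np.diag(L_B)))
    quad = y @ y / noise_var - c @ c               # y^T (Q_ff + s2 I)^-1 y
    kff_diag = np.full(N, 1.0)                     # k(x_n, x_n) = variance (1.0) for rbf_kernel above
    trace_term = (np.sum(kff_diag) - np.sum(V**2)) / (2.0 * noise_var)
    return -0.5 * (N * np.log(2.0 * np.pi) + log_det + quad) - trace_term

# illustrative usage with randomly chosen inducing inputs
X = np.random.rand(500, 1); y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(500)
Z = X[np.random.choice(500, 20, replace=False)]
print(collapsed_bound(X, y, Z))
```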
FITC vs. Variational sparse GP
model approximation vs. approximate inference (see [Bauer et al., 2016])
Note that, when estimating hyper-parameters, if the number of inducing points is too small, the model may “under-fit”:

L = log p(y) − KL( q(x) ‖ p(x|y) )

[Figure 1 from Bauer et al., 2016: behaviour of FITC and VFE on a subset of 100 data points of the Snelson dataset for 8 inducing inputs (red crosses indicate inducing inputs; red lines indicate mean and 2σ), compared to the prediction of the full GP in grey. Optimised values for the full GP: nlml = 34.15, σ_n = 0.274. With FITC, the noise variance is shrunk to practically zero, even though the mean prediction does not go through every data point; the mean still behaves well and the training data lie well within the predictive variance, so only predictive probabilities suffer. VFE, on the other hand, is able to approximate the posterior predictive distribution almost exactly.]
The sparse GP approximation may still not be enough when:
The number of data points N is very high, e.g., millions of data points.
The function is very complex, which requires tens of thousands of inducing points.
Mini-batch Learning (1)
θ_{t+1} = θ_t − η dc/dθ.
Mini-batch Learning (2)
θ_{t+1} = θ_t − η dc_MB/dθ.
Can mini-batch learning be applied to GPs as well?
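A generic sketch of the mini-batch update rule above (the quadratic toy objective, learning rate and batch size are illustrative assumptions):

```python
import numpy as np

# toy regression objective: c(theta) = (1/N) sum_n 0.5 * (theta^T x_n - y_n)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(10000)

theta, eta, B = np.zeros(3), 0.1, 64
for t in range(2000):
    idx = rng.choice(X.shape[0], B, replace=False)        # draw a mini-batch
    grad_mb = X[idx].T @ (X[idx] @ theta - Y[idx]) / B    # d c_MB / d theta
    theta = theta - eta * grad_mb                         # theta_{t+1} = theta_t - eta * grad
print(theta)
```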
Mini-batch Learning for GPs
Mini-batch learning relies on the objective being an expectation w.r.t. the data, i.e., ⟨l(f_θ(x), y)⟩_{p(x,y)}.
The log-marginal likelihood of a GP, log N(y | 0, K + σ²I), does not decompose into such an expectation over individual data points.
“Uncollapsed” Lower Bound
Hensman et al. [2013] show that the “uncollapsed” variational lower bound of the sparse GP can be used for mini-batch learning.
The “uncollapsed” variational lower bound of the sparse GP keeps q(u) explicit:
L = ⟨log p(y|f)⟩_{p(f|u,X,Z) q(u)} − KL( q(u) ‖ p(u) )
The 2nd term, KL( q(u) ‖ p(u) ), does not depend on the data.
“Uncollapsed” Lower Bound
Stochastic Variational GP (SVGP)
The resulting lower bound can be written as a sum over the data,

L = Σ_{n=1}^{N} ⟨log N(y_n | f_n, σ²)⟩_{q(f_n|x_n,Z)} − KL( q(u) ‖ p(u) )
  ≈ (N/B) Σ_{i=1}^{B} ⟨log N(y_i | f_i, σ²)⟩_{q(f_i|x_i,Z)} − KL( q(u) ‖ p(u) ) = L_MB,

with the mini-batch (x_i, y_i) ∼ p̃(x, y).
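A NumPy sketch of the mini-batch estimate L_MB for a Gaussian likelihood with an explicit q(u) = N(m, S) (rbf_kernel as in the earlier sketches; in practice m, S, Z and the hyper-parameters would be updated with a stochastic optimiser such as Adam, which is omitted here):

```python
import numpy as np

def svgp_minibatch_elbo(X_batch, y_batch, Z, m, L_s, N, noise_var=0.1, jitter=1e-8):
    # Unbiased mini-batch estimate of the uncollapsed bound (Hensman et al., 2013)
    # with q(u) = N(m, S), S = L_s L_s^T, and a Gaussian likelihood.
    B, M = X_batch.shape[0], Z.shape[0]
    S = L_s @ L_s.T
    K_zz = rbf_kernel(Z, Z) + jitter * np.eye(M)
    K_zx = rbf_kernel(Z, X_batch)                             # M x B
    A = np.linalg.solve(K_zz, K_zx)                           # columns a_n = K_zz^-1 k_n
    mu = A.T @ m                                              # means of q(f_n)
    var = 1.0 - np.sum(A * K_zx, axis=0) + np.sum(A * (S @ A), axis=0)  # k(x_n, x_n) = 1.0 here
    # <log N(y_n | f_n, s2)>_{q(f_n)} for each point in the batch
    ell = (-0.5 * np.log(2.0 * np.pi * noise_var)
           - 0.5 * (y_batch - mu) ** 2 / noise_var
           - 0.5 * var / noise_var)
    # KL(q(u) || p(u)) with p(u) = N(0, K_zz)
    L_zz = np.linalg.cholesky(K_zz)
    kl = 0.5 * (np.trace(np.linalg.solve(K_zz, S)) + m @ np.linalg.solve(K_zz, m) - M
                + 2.0 * np.sum(np.log(np.diag(L_zz)))
                - 2.0 * np.sum(np.log(np.abs(np.diag(L_s)))))
    return N / B * np.sum(ell) - kl
```

Each stochastic gradient step evaluates this estimate on a freshly sampled mini-batch before updating the variational and kernel parameters.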
2D Synthetic Data
Flight delays for every commercial flight in the USA from January to April 2008.
700,000 train, 100,000 test
[Figures from Hensman et al., 2013: root mean squared errors in predicting flight delays using information about the flight, and root mean squared errors for models with different numbers of inducing variables.]
The pros and cons of SVGP
Pros
Cons
Non-Gaussian likelihood
In practice, many different noise distributions are used for modelling real data, e.g.:
Common approaches
SVGP with non-Gaussian likelihood
SVGP with non-Gaussian likelihood
Gauss-Hermite Quadrature
Regarding those 1D integrals, q(f_n|x_n, Z) is a 1D Gaussian distribution. See the definition:

q(f_n | x_n, Z) = ∫ p(f_n | u, x_n, Z) q(u) du.

Gauss-Hermite quadrature approximates each expectation with a weighted sum over C nodes f_j:

⟨log p(y_n|f_n)⟩_{q(f_n|x_n,Z)} ≈ Σ_{j=1}^{C} w_j log p(y_n|f_j),   w_j = 2^{C−1} C! √π / ( C² [H_{C−1}(f_j)]² ).

The quadrature result is exact if log p(y_n|f_n) is a polynomial (of degree up to 2C − 1).

[Figure: Gauss-Hermite nodes/weights for C = 5, 10, 15, 20.]
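A sketch of how such a quadrature can be implemented with NumPy's Hermite-Gauss nodes (the Bernoulli likelihood used here is only an example):

```python
import numpy as np

def gauss_hermite_expectation(log_lik, mu, var, C=20):
    # approximate E_{f ~ N(mu, var)}[log_lik(f)] with C Gauss-Hermite nodes
    x, w = np.polynomial.hermite.hermgauss(C)      # nodes/weights for the weight function exp(-x^2)
    f = mu + np.sqrt(2.0 * var) * x                # change of variables to N(mu, var)
    return np.sum(w * log_lik(f)) / np.sqrt(np.pi)

# example with a Bernoulli (logistic) likelihood for y = 1 -- purely illustrative
log_bernoulli = lambda f: -np.log1p(np.exp(-f))    # log sigmoid(f)
print(gauss_hermite_expectation(log_bernoulli, mu=0.3, var=0.5))
```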
Binary Classification Example
[Figure 1 from Hensman et al., 2015, “Scalable Variational Gaussian Process Classification”: the effect of increasing the number of inducing points on the banana dataset. Rows represent the KL method, the mean-field method and generalized FITC, whilst columns show increasing numbers of inducing points. In each pane, the coloured points represent training data, the inducing inputs are black dots and the decision boundaries are black lines. The rightmost column shows the result of the equivalent non-sparse methods.]
Beyond 1D GPs
Monte Carlo Sampling
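The slide body is not reproduced; a minimal sketch of the Monte Carlo alternative, which works for likelihoods of any dimensionality by sampling from q(f_n) (reparameterised so that gradients could also flow through the samples):

```python
import numpy as np

def mc_expectation(log_lik, mu, var, num_samples=10000, seed=0):
    # estimate E_{f ~ N(mu, var)}[log_lik(f)] by Monte Carlo with the reparameterisation trick
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(num_samples)
    f = mu + np.sqrt(var) * eps                    # f = mu + sqrt(var) * eps, eps ~ N(0, 1)
    return np.mean(log_lik(f))

log_bernoulli = lambda f: -np.log1p(np.exp(-f))    # same illustrative likelihood as above
print(mc_expectation(log_bernoulli, mu=0.3, var=0.5))
```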
Are big covariance matrices always (almost) low-rank?
Of course not.
A time series example:
y = f(t) + ε
The data are collected continuously at even time intervals.
A time series example: 10 data points
When we observe until t = 1.0:
A time series example: 100 data points
When we observe until t = 10.0:
A time series example: 1000 data points
When we observe until t = 100.0:
Banded precision matrix
For kernels like the Matérn family, the precision matrix is banded.
For example, the Matérn-1/2 kernel, also known as the exponential kernel:
k(x, x′) = σ² exp( −|x − x′| / ℓ ).
Closed form precision matrix
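The slide body is not reproduced in the extraction. For reference, the exponential kernel observed on an even grid with spacing Δ corresponds to a first-order autoregression with ρ = exp(−Δ/ℓ), whose precision matrix is tridiagonal (a standard result, stated here as a sketch):

\[
\big(K^{-1}\big)_{ij} \;=\; \frac{1}{\sigma^2 (1-\rho^2)}
\begin{cases}
1, & i = j \in \{1, N\},\\
1 + \rho^2, & 1 < i = j < N,\\
-\rho, & |i - j| = 1,\\
0, & \text{otherwise},
\end{cases}
\qquad \rho = \exp(-\Delta/\ell),
\]

i.e., only the main diagonal and the first off-diagonals are non-zero, so linear algebra with K^{-1} costs O(N).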
Other approximations
Q & A!
Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pages 1533–1541, 2016.
Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18:3649–3720, 2017.
Nicolas Durrande, Vincent Adam, Lucas Bordeaux, Stefanos Eleftheriadis, and James Hensman. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 2780–2789, 2019.
James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.
James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 351–360, 2015.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.