Lecture 09
Gaussian Processes
Philipp Hennig
17 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                        Ex |  #  date    content                        Ex
 1  20.04.  Introduction                    1 | 14  09.06.  Logistic Regression             8
 2  21.04.  Reasoning under Uncertainty       | 15  15.06.  Exponential Families
 3  27.04.  Continuous Variables            2 | 16  16.06.  Graphical Models                9
 4  28.04.  Monte Carlo                        | 17  22.06.  Factor Graphs
 5  04.05.  Markov Chain Monte Carlo        3 | 18  23.06.  The Sum-Product Algorithm      10
 6  05.05.  Gaussian Distributions             | 19  29.06.  Example: Topic Models
 7  11.05.  Parametric Regression           4 | 20  30.06.  Mixture Models                 11
 8  12.05.  Learning Representations          | 21  06.07.  EM
 9  18.05.  Gaussian Processes              5 | 22  07.07.  Variational Inference          12
10  19.05.  An Example for GP Regression      | 23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels           6 | 24  14.07.  Example: Inferring Topics      13
12  26.05.  Gauss-Markov Models               | 25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification               7 | 26  21.07.  Revision
Probabilistic ML — P. Hennig, SS 2021 — Lecture 09: Gaussian Processes — © Philipp Hennig, 2021, CC BY-NC-SA 3.0
Summary:
▶ Gaussian distributions can be used to learn functions
▶ Analytical inference is possible using general linear models
Reminder: Where we left off last time
Probabilistic Parametric Regression
[Figure: probabilistic parametric regression posterior; f(x) plotted over x from −8 to 8, vertical range −10 to 10.]
Reminder: Where we left off last time
Probabilistic Parametric Regression
ϕ(x) = [ exp(−(x − 8)²/(2λ²)),  exp(−(x − 7)²/(2λ²)),  exp(−(x − 6)²/(2λ²)),  … ]⊺
     = [ exp(−(x − c_i)²/(2λ²)) ]_{i=1,…,F}

[Figure: the Gaussian features ϕ_i and the resulting parametric regression posterior; f(x) over x from −8 to 8, vertical range −10 to 10.]
A Structural Observation
Graphical Model
y output
weights w1 w2 w3 w4 w5 w6 w7 w8 w9
features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9
parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9
x input
A linear Gaussian regressor is a single-hidden-layer neural network with quadratic output loss and a fixed
input layer. Hyperparameter fitting corresponds to training the input layer. The usual way to train such a
network, however, does not include the Occam factor.
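To make the analogy concrete, here is a minimal NumPy sketch; the centres, feature width, and data below are illustrative placeholders, not values from the lecture. The hidden layer is a fixed set of Gaussian features, and only the readout weights are trained under quadratic loss.

import numpy as np

# one-hidden-layer view of a linear Gaussian regressor:
# fixed "input layer" = Gaussian features at fixed centres (the parameters θ),
# trained "output layer" = least-squares fit of the readout weights w
centres = np.linspace(-8, 8, 9)
phi = lambda x: np.exp(-(x[:, None] - centres[None, :]) ** 2 / 2)

X = np.array([-4.0, -1.0, 2.0, 5.0])   # made-up training inputs
y = np.sin(X)                          # made-up training targets
w = np.linalg.lstsq(phi(X), y, rcond=None)[0]   # quadratic-loss training of w only
f = lambda x: phi(x) @ w               # the trained "network"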
What are we actually doing with those features?
let’s look at that algebra again
p(f_x | y, ϕ_X) = N( f_x ;  ϕ_x⊺µ + ϕ_x⊺Σϕ_X (ϕ_X⊺Σϕ_X + σ²I)⁻¹ (y − ϕ_X⊺µ),
                           ϕ_x⊺Σϕ_x − ϕ_x⊺Σϕ_X (ϕ_X⊺Σϕ_X + σ²I)⁻¹ ϕ_X⊺Σϕ_x )
                = N( f_x ;  m_x + k_xX (k_XX + σ²I)⁻¹ (y − m_X),
                           k_xx − k_xX (k_XX + σ²I)⁻¹ k_Xx )

where m_x := ϕ_x⊺µ, m_X := ϕ_X⊺µ and k_ab := ϕ_a⊺Σϕ_b. The features only ever enter through these inner products.
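A minimal NumPy sketch of this kernelized posterior; the kernel choice, data, and noise level are illustrative placeholders, independent of the accompanying notebook.

import numpy as np

def rbf_kernel(a, b, lam=1.0, sigma2=1.0):
    # Gaussian (RBF) kernel matrix k(a_i, b_j) for 1-D inputs
    d = a[:, None] - b[None, :]
    return sigma2 * np.exp(-d ** 2 / (2 * lam ** 2))

def gp_posterior(x_star, X, y, k=rbf_kernel, m=np.zeros_like, noise=0.1):
    # posterior mean and covariance of f at x_star, given data (X, y)
    G = k(X, X) + noise ** 2 * np.eye(len(X))       # k_XX + σ² I
    K_sX = k(x_star, X)                             # k_xX
    alpha = np.linalg.solve(G, y - m(X))            # (k_XX + σ² I)^{-1} (y − m_X)
    mean = m(x_star) + K_sX @ alpha
    cov = k(x_star, x_star) - K_sX @ np.linalg.solve(G, K_sX.T)
    return mean, cov

# toy usage with made-up data
X = np.array([-4.0, -1.0, 2.0, 5.0])
y = np.sin(X)
xs = np.linspace(-8, 8, 200)
mu, Sigma = gp_posterior(xs, X, y)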
Code
Gaussian_Process_Regression.ipynb
Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]
▶ For simplicity, fix Σ = σ²(c_max − c_min)/F · I, thus

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^{F} ϕ_ℓ(x_i) ϕ_ℓ(x_j)

▶ in particular, for ϕ_ℓ(x) = exp(−(x − c_ℓ)²/(2λ²)):

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^{F} exp(−(x_i − c_ℓ)²/(2λ²)) · exp(−(x_j − c_ℓ)²/(2λ²))

                   = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^{F} exp(−(c_ℓ − ½(x_i + x_j))²/λ²)
Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]
ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^{F} exp(−(c_ℓ − ½(x_i + x_j))²/λ²)

now increase F so that the number of features in any interval δc approaches F·δc/(c_max − c_min):

ϕ(x_i)⊺ Σ ϕ(x_j) → σ² exp(−(x_i − x_j)²/(4λ²)) · ∫_{c_min}^{c_max} exp(−(c − ½(x_i + x_j))²/λ²) dc

let c_min → −∞, c_max → ∞:

k(x_i, x_j) := ϕ(x_i)⊺ Σ ϕ(x_j) → √π λ σ² exp(−(x_i − x_j)²/(4λ²))
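A quick numerical check of this limit; the grid of centres, test inputs, and parameters are illustrative placeholders. The finite feature sum approaches the closed-form Gaussian kernel as F grows.

import numpy as np

def feature_kernel(xi, xj, F, cmin=-8.0, cmax=8.0, lam=1.0, sigma2=1.0):
    # finite-feature kernel: σ² (c_max − c_min)/F · Σ_ℓ ϕ_ℓ(x_i) ϕ_ℓ(x_j)
    c = np.linspace(cmin, cmax, F)
    phi_i = np.exp(-(xi - c) ** 2 / (2 * lam ** 2))
    phi_j = np.exp(-(xj - c) ** 2 / (2 * lam ** 2))
    return sigma2 * (cmax - cmin) / F * np.sum(phi_i * phi_j)

def gaussian_kernel(xi, xj, lam=1.0, sigma2=1.0):
    # closed-form limit: √π λ σ² exp(−(x_i − x_j)² / (4λ²))
    return np.sqrt(np.pi) * lam * sigma2 * np.exp(-(xi - xj) ** 2 / (4 * lam ** 2))

for F in (10, 100, 1000, 10000):
    print(F, feature_kernel(0.5, 2.0, F), gaussian_kernel(0.5, 2.0))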
In pictures
convergence to Gaussian kernel
[Figure sequence: regression with an increasing number of Gaussian features, converging to the Gaussian-kernel GP; f(x) over x from −8 to 8, vertical range −10 to 10.]
Code
Gaussian_Process_Regression.ipynb
What just happened?
kernelization and Gaussian processes
Definition (kernel)
k : X × X → R is a (Mercer / positive definite) kernel if, for any finite collection X = [x_1, …, x_N], the
matrix k_XX ∈ R^{N×N} with [k_XX]_ij = k(x_i, x_j) is positive semidefinite.
def kernel(f): λ (a, b) ↦ [[f(a[i], b[j]) for j = 1:length(b)] for i = 1:length(a)]
actually, in Python (with numpy imported as np):
def kernel(f):
    return lambda a, b: np.array([[np.float64(f(a[i], b[j])) for j in range(b.size)]
                                  for i in range(a.size)])
Here a symmetric matrix A ∈ R^{N×N} is positive semidefinite if v⊺Av ≥ 0 for all v ∈ R^N.
Equivalently:
▶ All eigenvalues of the symmetric matrix A are non-negative
▶ A is a Gram matrix: [A]_ij = ϕ_i⊺ϕ_j for some collection of N vectors [ϕ_i]_{i=1,…,N}
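A small numerical sanity check of this definition; the kernel and inputs are arbitrary illustrative choices, and the kernel helper from above is restated so the snippet is self-contained. The Gram matrix of a Gaussian kernel on a finite input set has no negative eigenvalues, up to floating-point round-off.

import numpy as np

def kernel(f):
    # Gram-matrix constructor, as above
    return lambda a, b: np.array([[np.float64(f(a[i], b[j])) for j in range(b.size)]
                                  for i in range(a.size)])

k = kernel(lambda x, y: np.exp(-(x - y) ** 2 / 2))   # Gaussian kernel
X = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])            # arbitrary finite collection
K = k(X, X)

eigvals = np.linalg.eigvalsh(K)          # eigenvalues of the symmetric matrix
assert np.all(eigvals >= -1e-9)          # positive semidefinite up to round-off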
Kernels
It is fine to think of them as inner products
Lemma:
k is a Mercer kernel if it can be written as
    k(x, x′) = Σ_{ℓ∈L} ϕ_ℓ(x) ϕ_ℓ(x′)      or      k(x, x′) = ∫_L ϕ_ℓ(x) ϕ_ℓ(x′) dν(ℓ)
for some features ϕ_ℓ and a positive measure ν on the index set L.
Proof: for all X ∈ X^N and v ∈ R^N,
    v⊺ k_XX v = ∫ ( Σ_i v_i ϕ_ℓ(x_i) ) ( Σ_j v_j ϕ_ℓ(x_j) ) dν(ℓ) = ∫ ( Σ_i v_i ϕ_ℓ(x_i) )² dν(ℓ) ≥ 0   □
Gaussian processes
are programs
Definition
Let µ : X → R be any function and k : X × X → R a Mercer kernel. A Gaussian process
p(f) = GP(f; µ, k) is a probability distribution over functions f : X → R such that every finite
restriction to function values f_X := [f_{x_1}, …, f_{x_N}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
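This definition translates directly into code: to sample a finite restriction, evaluate the mean and kernel on the chosen inputs and draw from the resulting multivariate Gaussian. A minimal sketch, where the mean, kernel, and grid are illustrative choices:

import numpy as np

def sample_gp(mu, k, x, n_samples=3, jitter=1e-9):
    # draw samples of the finite restriction f_X ~ N(mu_X, k_XX)
    mX = np.array([mu(xi) for xi in x])
    kXX = np.array([[k(a, b) for b in x] for a in x])
    kXX += jitter * np.eye(len(x))                    # numerical stabilisation
    rng = np.random.default_rng(0)
    return rng.multivariate_normal(mX, kXX, size=n_samples)

x = np.linspace(-8, 8, 100)
samples = sample_gp(lambda t: 0.0, lambda a, b: np.exp(-(a - b) ** 2 / 4), x)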
Another kernel
The Wiener process
[Figure: regression with step features; f(x) over x from −8 to 8, vertical range −10 to 10.]

ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else }
Another Kernel
The Wiener process
ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else },        Σ = σ²(c_max − c_0)/F · I

⇒  k(x_i, x_j) = ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_0)/F · Σ_{ℓ=1}^{F} θ(min(x_i, x_j) − c_ℓ)

now increase F so that the number of features in any interval δc approaches F·δc/(c_max − c_0):

k(x_i, x_j) → σ² ∫_{c_0}^{c_max} θ(min(x_i, x_j) − c) dc = σ² [c]_{c_0}^{min(x_i, x_j)} = σ² (min(x_i, x_j) − c_0)
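A minimal sketch of sampling from this limit process, where the domain, σ², and c_0 are illustrative placeholders: evaluate the Wiener kernel on a grid and draw from the corresponding Gaussian.

import numpy as np

def wiener_kernel(a, b, c0=-8.0, sigma2=1.0):
    # limit kernel of the step features: k(x_i, x_j) = σ² (min(x_i, x_j) − c_0)
    return sigma2 * (np.minimum(a, b) - c0)

x = np.linspace(-8, 8, 200)
K = wiener_kernel(x[:, None], x[None, :])
rng = np.random.default_rng(1)
paths = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)), size=3)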
Graphical View
Convergence to Wiener process
[Figure sequence: regression with an increasing number of step features, converging to the Wiener-process GP; f(x) over x from −8 to 8, vertical range −10 to 10.]
The Oldest Gaussian Process
Brownian motion and the Wiener process [image: F. Schmutzer, 1921, Austrian National Library]

[Figure: excerpt from A. Einstein, Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen, Ann. Physik, 1905]
Just one more!
Splines [G. Wahba, Spline Models for Observational Data, 1990]
Cubic Splines
from rectified linear units
[Figure sequence: regression with an increasing number of rectified-linear features, converging to the cubic-spline GP; f(x) over x from −8 to 8, vertical range −10 to 10.]

k(x_i, x_j) = σ² ( (1/3) min(x_i − x_0, x_j − x_0)³ + (1/2) |x_i − x_j| min(x_i − x_0, x_j − x_0)² )
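A minimal sketch of this spline kernel and of drawing samples from the resulting GP; x_0, σ², and the grid are illustrative placeholders.

import numpy as np

def cubic_spline_kernel(a, b, x0=-8.0, sigma2=1.0):
    # k(x_i, x_j) = σ² ( m³/3 + |x_i − x_j| m²/2 ),  with m = min(x_i − x_0, x_j − x_0)
    m = np.minimum(a - x0, b - x0)
    return sigma2 * (m ** 3 / 3.0 + 0.5 * np.abs(a - b) * m ** 2)

x = np.linspace(-8, 8, 200)
K = cubic_spline_kernel(x[:, None], x[None, :])
rng = np.random.default_rng(2)
samples = rng.multivariate_normal(np.zeros_like(x), K + 1e-6 * np.eye(len(x)), size=3)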
Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral. This requires some regularity assumption about the features’ locations, shape, etc.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is no unique kernel. In fact, there are quite a few, as the following examples show.
How many kernels are there?
A Lot!
because the space of kernels is large
Theorem:
Let X, Y be index sets and ϕ : Y → X. If k_1, k_2 : X × X → R are Mercer kernels, then the following
functions are also Mercer kernels (up to minor regularity assumptions):
▶ α · k_1(a, b) for α ∈ R_+ (proof: trivial)
▶ k_1(ϕ(c), ϕ(d)) for c, d ∈ Y (proof: by Mercer's theorem, next lecture)
▶ k_1(a, b) + k_2(a, b) (proof: trivial)
▶ k_1(a, b) · k_2(a, b) (Schur product theorem; the proof is more involved, see e.g. Bapat, 1997; Million, 2007)
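These closure operations compose directly in code. A small sketch, where the base kernels and the particular combination are illustrative choices mirroring the examples on the following slides:

import numpy as np

# base kernels as scalar functions of two inputs
k_rbf = lambda a, b: np.exp(-(a - b) ** 2 / 2)
k_poly = lambda a, b: 1 + a * b + (a * b) ** 2      # inner product of ϕ(x) = [1, x, x²]

# the four closure operations from the theorem
scale = lambda k, alpha: (lambda a, b: alpha * k(a, b))
warp = lambda k, phi: (lambda a, b: k(phi(a), phi(b)))
add = lambda k1, k2: (lambda a, b: k1(a, b) + k2(a, b))
multiply = lambda k1, k2: (lambda a, b: k1(a, b) * k2(a, b))

# e.g. a scaled RBF with length-scale 5 plus a polynomial part
k_new = add(scale(warp(k_rbf, lambda x: x / 5.0), 20.0), k_poly)
print(k_new(0.0, 1.0))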
Scaled Kernels are Kernels
(this affects the effect of noise)
k(a, b) = 1² · exp(−(a − b)²/2)

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Scaled Kernels are Kernels
(this affects the effect of noise)
k(a, b) = 10² · exp(−(a − b)²/2)

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Kernels over Transformed Inputs are Kernels
this can be used to define length-scales
k(a, b) = 20 · exp(−(a − b)²/(2 · 5²))

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Kernels over Transformed Inputs are Kernels
this can be used to define length-scales
k(a, b) = 20 · exp(−(a − b)²/(2 · 0.5²))

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Kernels over Transformed Inputs are Kernels
nonlinear transformations (also at the heart of deep learning)
k(a, b) = 20 · exp(−(ϕ(a) − ϕ(b))²/2),        ϕ(a) = ((a + 8)/5)³

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Sums of Kernels are Kernels
If f ∼ GP(0, k_1) and g ∼ GP(0, k_2) are independent, then h = f + g ∼ GP(0, k_1 + k_2)
k(a, b) = 20 · exp(−(a − b)²/2) + ϕ(a)⊺ϕ(b),        ϕ(x) := [1, x, x²]

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
Products of Kernels are Kernels
Schur product theorem
k(a, b) = 20 · exp(−(a − b)²/2) · ϕ(a)⊺ϕ(b),        ϕ(x) := ((x + 8)/5)²

[Figure: two panels showing GP samples and regression under this kernel; x from −5 to 5, vertical range −10 to 10.]
If you think your model has no parameters, you just haven’t found them yet!
Can We Learn the Kernel?
Bayesian Hierarchical Inference, vol. 2
p(f | θ) = GP(f; m_θ, k_θ),     e.g.  m_θ(x) = ϕ(x)⊺θ,  and/or  k_θ(a, b) = θ_1 exp(−(a − b)²/(2θ_2²))
▶ recall from last lecture: the evidence in Bayes’ theorem is the marginal likelihood for the model
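Concretely, type-II inference evaluates the (log) evidence as a function of θ and maximises it, or integrates over it. A minimal sketch of the log marginal likelihood for a zero-mean GP with an RBF kernel, with made-up data and a made-up hyperparameter grid:

import numpy as np

def log_evidence(theta, X, y, noise=0.1):
    # log p(y | X, θ) = −½ yᵀG⁻¹y − ½ log|G| − (N/2) log 2π,  with G = k_θ(X, X) + σ² I
    out_scale, length = theta
    d = X[:, None] - X[None, :]
    G = out_scale * np.exp(-d ** 2 / (2 * length ** 2)) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(G)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

X = np.array([-4.0, -1.0, 2.0, 5.0])
y = np.sin(X)
grid = [(s, l) for s in (1.0, 10.0, 100.0) for l in (0.5, 1.0, 2.0, 5.0)]
best = max(grid, key=lambda th: log_evidence(th, X, y))   # type-II maximum likelihood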
Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View
output
weights w1 w2 w3 w4 w5 w6 w7 w8 w9
features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9
parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9
x
input
Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View
output
function f
parameters θ
x
input
Summary: Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is an entire semi-ring of kernels: If k1 , k2 are kernels, ϕ any function, then so are
▶ α · k1 (a, b) for α ∈ R+
▶ k1 (ϕ(c), ϕ(d)) for c, d ∈ Y
▶ k1 (a, b) + k2 (a, b)
▶ k1 (a, b) · k2 (a, b)
▶ Kernels can be learned just like features, e.g. by type-II MAP/ML or numerical integration
Gaussian processes are an extremely important basic model for supervised regression. They have been
used in many practical applications, but also provide a theoretical foundation for more complicated
models, e.g. in classification, unsupervised learning, and deep learning. More later.
Next lecture: Some theory for Gaussian processes. If you don’t like pretty pictures, wait for it!
The Toolbox
Framework:
∫ p(x1, x2) dx2 = p(x1)        p(x1, x2) = p(x1 | x2) p(x2)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ Kernels

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP