
Probabilistic Inference and Learning

Lecture 09
Gaussian Processes

Philipp Hennig
17 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                          Ex
 1  20.04.  Introduction                      1
 2  21.04.  Reasoning under Uncertainty
 3  27.04.  Continuous Variables              2
 4  28.04.  Monte Carlo
 5  04.05.  Markov Chain Monte Carlo          3
 6  05.05.  Gaussian Distributions
 7  11.05.  Parametric Regression             4
 8  12.05.  Learning Representations
 9  18.05.  Gaussian Processes                5
10  19.05.  An Example for GP Regression
11  25.05.  Understanding Kernels             6
12  26.05.  Gauss-Markov Models
13  08.06.  GP Classification                 7
14  09.06.  Logistic Regression               8
15  15.06.  Exponential Families
16  16.06.  Graphical Models                  9
17  22.06.  Factor Graphs
18  23.06.  The Sum-Product Algorithm        10
19  29.06.  Example: Topic Models
20  30.06.  Mixture Models                   11
21  06.07.  EM
22  07.07.  Variational Inference            12
23  13.07.  Example: Topic Models
24  14.07.  Example: Inferring Topics        13
25  20.07.  Example: Kernel Topic Models
26  21.07.  Revision

Probabilistic ML — P. Hennig, SS 2021 — Lecture 09: Gaussian Processes— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1
Summary:
▶ Gaussian distributions can be used to learn functions
▶ Analytical inference is possible using general linear models

f(x) = ϕ(x)⊺ w = ϕ_x⊺ w

▶ The choice of features ϕ : X → R is essentially unconstrained


▶ with parametrized feature families, hierarchical inference allows fitting features
▶ Gaussian linear regressors are single-layer neural networks with quadratic loss, aka. least-squares regression (cf. next week)


Gaussian parametric regression
  → optimize features: Deep Learning
  → infinitely many features: Nonparametric Learning

Reminder: Where we left off last time
Probabilistic Parametric Regression

f(x) := ϕ_x⊺ w        p(w) = N(w; µ, Σ)        p(y | f) = N(y; f_X, σ² I)


p(w | y, ϕ_X)  = N( w;  µ + Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),  Σ − Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ )
p(f_x | y, ϕ_X) = N( f_x;  ϕ_x⊺ µ + ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),  ϕ_x⊺ Σ ϕ_x − ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ ϕ_x )

[Figure: f(x) plotted over x ∈ [−8, 8]]

Reminder: Where we left off last time
Probabilistic Parametric Regression
ϕ(x) = [ e^(−(x−8)²/(2λ²)),  e^(−(x−7)²/(2λ²)),  e^(−(x−6)²/(2λ²)),  ... ]⊺
     = [ e^(−(x−c_i)²/(2λ²)) ]_{i=1,...,F}

[Figure: f(x) plotted over x ∈ [−8, 8]]

A Structural Observation
Graphical Model

y output

weights w1 w2 w3 w4 w5 w6 w7 w8 w9

features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9

parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9

x input

A linear Gaussian regressor is a single-hidden-layer neural network with quadratic output loss and a fixed
input layer. Hyperparameter fitting corresponds to training the input layer. The usual way to train such a
network, however, does not include the Occam factor.

What are we actually doing with those features?
let’s look at that algebra again

p(f_x | y, ϕ_X) = N( f_x;  ϕ_x⊺ µ + ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),
                           ϕ_x⊺ Σ ϕ_x − ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ ϕ_x )
                = N( f_x;  m_x + k_xX (k_XX + σ² I)⁻¹ (y − m_X),
                           k_xx − k_xX (k_XX + σ² I)⁻¹ k_Xx )

using the abstraction / encapsulation

m_x := ϕ_x⊺ µ          m : X → R          mean function

k_ab := ϕ_a⊺ Σ ϕ_b     k : X × X → R      covariance function, aka. kernel

Code
Gaussian_Process_Regression.ipynb
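The notebook itself is not reproduced here. As a minimal sketch of what the kernelised regression of the previous slide might look like in code (the squared-exponential kernel, zero prior mean, noise level σ = 0.1, and all function names are illustrative assumptions, not the notebook's contents):

import numpy as np

def k_se(a, b, theta=1.0, lam=1.0):
    # squared-exponential kernel k(a, b) = theta^2 exp(-(a - b)^2 / (2 lam^2))
    return theta**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_posterior(x_pred, X, y, k=k_se, sigma=0.1):
    # posterior mean and covariance of f at x_pred given data (X, y), zero prior mean
    G = k(X, X) + sigma**2 * np.eye(len(X))             # k_XX + sigma^2 I
    kxX = k(x_pred, X)                                  # k_xX
    mean = kxX @ np.linalg.solve(G, y)                  # k_xX (k_XX + sigma^2 I)^-1 y
    cov = k(x_pred, x_pred) - kxX @ np.linalg.solve(G, kxX.T)
    return mean, cov

# example use (made-up data):
# X, y = np.array([-3.0, 0.0, 2.0]), np.array([0.5, -1.0, 0.3])
# mean, cov = gp_posterior(np.linspace(-8, 8, 200), X, y)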

Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]

▶ For simplicity, let's fix Σ = σ²(c_max − c_min)/F · I, thus:

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^F ϕ_ℓ(x_i) ϕ_ℓ(x_j)

▶ especially, for ϕ_ℓ(x) = exp(−(x − c_ℓ)²/(2λ²)):

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^F exp(−(x_i − c_ℓ)²/(2λ²)) · exp(−(x_j − c_ℓ)²/(2λ²))
                   = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^F exp(−(c_ℓ − ½(x_i + x_j))²/λ²)

Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]

ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^F exp(−(c_ℓ − ½(x_i + x_j))²/λ²)

now increase F so that the number of features in δc approaches F·δc/(c_max − c_min):

ϕ(x_i)⊺ Σ ϕ(x_j) → σ² exp(−(x_i − x_j)²/(4λ²)) · ∫_{c_min}^{c_max} exp(−(c − ½(x_i + x_j))²/λ²) dc

let c_min → −∞, c_max → ∞:

k(x_i, x_j) := ϕ(x_i)⊺ Σ ϕ(x_j) → √π λ σ² exp(−(x_i − x_j)²/(4λ²))

i.e.  k_{x_i x_j} = √π λ σ² exp(−(x_i − x_j)²/(4λ²))
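A quick numerical sanity check of this limit (a sketch only; the values of σ, λ, F and the grid of centres c_ℓ are arbitrary choices):

import numpy as np

sigma, lam, F = 1.0, 1.0, 2000             # arbitrary choices for the check
cmin, cmax = -20.0, 20.0
c = np.linspace(cmin, cmax, F)             # feature centres c_l

def k_finite(xi, xj):                      # finite-F feature covariance phi(xi)^T Sigma phi(xj)
    phi_i = np.exp(-(xi - c)**2 / (2 * lam**2))
    phi_j = np.exp(-(xj - c)**2 / (2 * lam**2))
    return sigma**2 * (cmax - cmin) / F * phi_i @ phi_j

def k_limit(xi, xj):                       # the infinite-feature limit derived above
    return np.sqrt(np.pi) * lam * sigma**2 * np.exp(-(xi - xj)**2 / (4 * lam**2))

print(k_finite(0.3, 1.7), k_limit(0.3, 1.7))   # the two values should nearly coincide for large F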

In pictures
convergence to Gaussian kernel

[Figure: f(x) plotted over x ∈ [−8, 8]]

▶ aka. radial basis function, square(d)-exponential kernel

Code
Gaussian_Process_Regression.ipynb

What just happened?
kernelization and Gaussian processes

Definition (kernel)
k : X × X _ R is a (Mercer / positive definite) kernel if, for any finite collection X = [x1 , . . . , xN ], the
matrix kXX ∈ RN×N with [kXX ]ij = k(xi , xj ) is positive semidefinite.

kernel(f) := λ(a, b) → [[ f(a[i], b[j]) for j = 1:length(b) ] for i = 1:length(a) ]
actually, in Python:
import numpy as np
def kernel(f):
    # wrap a scalar covariance function f into a Gram-matrix builder
    return lambda a, b: np.array([[f(a[i], b[j]) for j in range(b.size)] for i in range(a.size)])

Definition (positive definite matrix)


A symmetric matrix A ∈ R^{N×N} is called positive (semi-) definite if

  v⊺ A v ≥ 0   ∀ v ∈ R^N.

Equivalently:
▶ All eigenvalues of the symmetric matrix A are non-negative
▶ A is a Gram matrix, i.e. A_ij = ϕ_i⊺ ϕ_j for some set of vectors [ϕ_i]_{i=1,...,N}
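As a sanity check of these definitions, one can verify numerically (up to round-off) that a Gram matrix built with the kernel helper above is positive semidefinite; a sketch, using the squared-exponential kernel as an arbitrary example:

import numpy as np

X = np.random.randn(50)                                   # arbitrary test inputs
k = kernel(lambda a, b: np.exp(-(a - b)**2))              # SE kernel via the helper above
K = k(X, X)                                               # Gram matrix k_XX
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # eigenvalues non-negative up to round-off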

Kernels
It is fine to think of them as inner products

Lemma:
k is a Mercer kernel if it can be written as (for an index set L with a positive measure ν)

  k(x, x′) = Σ_{ℓ∈L} ϕ_ℓ(x) ϕ_ℓ(x′)     or     k(x, x′) = ∫_L ϕ_ℓ(x) ϕ_ℓ(x′) dν(ℓ)
Proof: ∀ X ∈ X^N, v ∈ R^N:
  v⊺ k_XX v = ∫ Σ_{i=1}^N Σ_{j=1}^N v_i ϕ_ℓ(x_i) v_j ϕ_ℓ(x_j) dν(ℓ) = ∫ [ Σ_{i=1}^N v_i ϕ_ℓ(x_i) ]² dν(ℓ) ≥ 0   □

Gaussian processes
are programs

Definition
Let µ : X → R be any function, k : X × X → R be a Mercer kernel. A Gaussian process
p(f) = GP(f; µ, k) is a probability distribution over the function f : X → R, such that every finite
restriction to function values f_X := [f_{x_1}, . . . , f_{x_N}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
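The definition translates directly into code: to sample from a GP prior, one only ever evaluates a finite restriction. A minimal sketch (the function name, the jitter term and the example kernel are illustrative assumptions):

import numpy as np

def gp_prior_samples(mu, k, x, n_samples=3, jitter=1e-9):
    # draw samples of the finite restriction f_X ~ N(mu_X, k_XX)
    mX = np.array([mu(xi) for xi in x])
    KXX = np.array([[k(a, b) for b in x] for a in x]) + jitter * np.eye(len(x))
    return np.random.multivariate_normal(mX, KXX, size=n_samples)

# e.g. gp_prior_samples(lambda x: 0.0, lambda a, b: np.exp(-(a - b)**2), np.linspace(-8, 8, 100))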

Another kernel
The Wiener process

[Figure: f(x) plotted over x ∈ [−8, 8]]

ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else }
Another Kernel
The Wiener process

ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else }        Σ = σ²(c_max − c_0)/F · I

⇒ k(x_i, x_j) = ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_0)/F · Σ_{ℓ=1}^F θ(min(x_i, x_j) − c_ℓ)

now increase F so that the number of features in δc approaches F·δc/(c_max − c_0):

k(x_i, x_j) → σ² ∫_{c_0}^{c_max} θ(min(x_i, x_j) − c) dc = σ² [c]_{c_0}^{min(x_i, x_j)}
            = σ² (min(x_i, x_j) − c_0)
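Again, the limit can be checked numerically; a small sketch with arbitrary constants:

import numpy as np

sigma, F, c0, cmax = 1.0, 5000, 0.0, 10.0        # arbitrary choices for the check
c = np.linspace(c0, cmax, F)                     # step locations c_l

def k_finite(xi, xj):                            # step features theta(x - c_l)
    return sigma**2 * (cmax - c0) / F * np.sum(min(xi, xj) >= c)

def k_wiener(xi, xj):
    return sigma**2 * (min(xi, xj) - c0)

print(k_finite(3.2, 7.5), k_wiener(3.2, 7.5))    # should nearly coincide for large F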

Graphical View
Convergence to Wiener process

[Figure: f(x) plotted over x ∈ [−8, 8]]

k(x_i, x_j) = σ² (min(x_i, x_j) − c_0)        var f(x) = σ² (x − c_0)

The Oldest Gaussian Process
Brownian Motion & Wiener process [image: F. Schmutzer, 1921, Austrian National Library]

[Figure: scan of a page from Einstein's 1905 paper, deriving the Gaussian solution of the diffusion equation for Brownian motion]

[A. Einstein, Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen, Ann. Physik, 1905]
Just one more!
Splines [G. Wahba, Spline Models for Observational Data, 1990]

remember Wiener:  ϕ_i(x) = θ(x − c_i),    f(x) = ϕ(x)⊺ w

consider:         ϕ̃_i(x) = ∫_{−∞}^x θ(x̃ − c_i) dx̃        ("ReLU")

equivalent to:    f̃(x) = ∫_{−∞}^x f(x̃) dx̃

⇒ k̃(x_i, x_j) = ϕ̃(x_i)⊺ Σ ϕ̃(x_j) = ∫^{x_i} ∫^{x_j} k(x̃_i, x̃_j) dx̃_j dx̃_i
              = σ² ( (1/3) min(x_i − x_0, x_j − x_0)³ + (1/2) |x_i − x_j| min(x_i − x_0, x_j − x_0)² )

Nb: The integral of a Gaussian process is also a Gaussian process!
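In code, the resulting cubic-spline kernel might be written as follows (a sketch; the defaults for σ and x_0 are arbitrary, and the formula is valid for inputs ≥ x_0):

import numpy as np

def k_cubic_spline(a, b, sigma=1.0, x0=-8.0):
    # cubic spline kernel from integrated step ("ReLU") features; valid for a, b >= x0
    m = np.minimum(a - x0, b - x0)
    return sigma**2 * (m**3 / 3.0 + np.abs(a - b) * m**2 / 2.0)

# e.g. x = np.linspace(-8, 8, 100); K = k_cubic_spline(x[:, None], x[None, :])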

Cubic Splines
from rectified linear units

[Figure: f(x) under the cubic-spline kernel, x ∈ [−8, 8]]

k(x_i, x_j) = σ² ( (1/3) min(x_i − x_0, x_j − x_0)³ + (1/2) |x_i − x_j| min(x_i − x_0, x_j − x_0)² )
Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral. This requires some regularity assumption about the features’ locations, shape, etc.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is no unique kernel. In fact, there are quite a few! E.g.

  k(a, b) = exp(−(a − b)²)                                                   Gaussian / Square Exponential / RBF kernel

  k(a, b) = min(a − t_0, b − t_0)                                            Wiener process

  k(a, b) = (1/3) min(a − t_0, b − t_0)³ + (1/2) |a − b| · min(a − t_0, b − t_0)²    cubic spline kernel

  k(a, b) = (2/π) sin⁻¹( 2 a⊺b / √((1 + 2 a⊺a)(1 + 2 b⊺b)) )                 Neural Network kernel (Williams, 1998)
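For concreteness, the kernels listed above might be written in code as follows (a sketch; t_0 is an arbitrary reference point, and inputs are assumed scalar except for the neural-network kernel):

import numpy as np

k_se     = lambda a, b: np.exp(-(a - b)**2)                    # Gaussian / SE / RBF kernel
k_wiener = lambda a, b, t0=0.0: np.minimum(a - t0, b - t0)     # Wiener process (a, b >= t0)
def k_spline(a, b, t0=0.0):                                    # cubic spline kernel (a, b >= t0)
    m = np.minimum(a - t0, b - t0)
    return m**3 / 3.0 + np.abs(a - b) * m**2 / 2.0
def k_nn(a, b):                                                # neural network kernel (Williams, 1998)
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return 2 / np.pi * np.arcsin(2 * (a @ b) / np.sqrt((1 + 2 * a @ a) * (1 + 2 * b @ b)))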

How many kernels are there?

A Lot!
because the space of kernels is large

Theorem:
Let X, Y be index sets and ϕ : Y → X. If k1, k2 : X × X → R are Mercer kernels, then the following
functions are also Mercer kernels (up to minor regularity assumptions):
▶ α · k1(a, b) for α ∈ R+ (proof: trivial)
▶ k1(ϕ(c), ϕ(d)) for c, d ∈ Y (proof: by Mercer's theorem, next lecture)
▶ k1(a, b) + k2(a, b) (proof: trivial)
▶ k1(a, b) · k2(a, b) — Schur product theorem (proof more involved; e.g. Bapat, 1997; Million, 2007)
A numerical sanity check of these closure properties is sketched below.
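A minimal numerical illustration of these closure properties, checking positive semidefiniteness of scaled, warped, summed and multiplied kernels on random inputs (the base kernels k1, k2 and the warping ϕ are arbitrary choices):

import numpy as np

def gram(k, x):                                    # Gram matrix k_XX for inputs x
    return np.array([[k(a, b) for b in x] for a in x])

def is_psd(K, tol=1e-9):                           # positive semidefinite up to round-off
    return np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol

k1  = lambda a, b: np.exp(-(a - b)**2)             # SE kernel
k2  = lambda a, b: (1 + a * b)**2                  # polynomial kernel
phi = np.sin                                       # some input transformation

x = np.random.randn(40)
for k in (lambda a, b: 3.0 * k1(a, b),             # scaling
          lambda a, b: k1(phi(a), phi(b)),         # input warping
          lambda a, b: k1(a, b) + k2(a, b),        # sum
          lambda a, b: k1(a, b) * k2(a, b)):       # product (Schur)
    print(is_psd(gram(k, x)))                      # expected: True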

Scaled Kernels are Kernels
(this affects the effect of noise)

 
k(a, b) = 1² · exp(−(a − b)²/2)

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Scaled Kernels are Kernels
(this affects the effect of noise)

 
k(a, b) = 10² · exp(−(a − b)²/2)

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
this can be used to define length-scales

 
k(a, b) = 20 · exp(−(a − b)²/(2 · 5²))

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
this can be used to define length-scales

 
k(a, b) = 20 · exp(−(a − b)²/(2 · 0.5²))

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
nonlinear transformations (also at the heart of deep learning)

k(a, b) = 20 · exp(−(ϕ(a) − ϕ(b))²/2)        ϕ(a) = ((a + 8)/5)³

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Sums of Kernels are Kernels
f ∼ GP(0, k1 ) and g ∼ GP(0, k2 ) ⇔ h = f + g ∼ GP(0, k1 + k2 )

 
k(a, b) = 20 · exp(−(a − b)²/2) + ϕ(a)⊺ ϕ(b)        ϕ(x) := [1, x, x²]

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Products of Kernels are Kernels
Schur product theorem

k(a, b) = 20 · exp(−(a − b)²/2) · ϕ(a)⊺ ϕ(b)        ϕ(x) := ((x + 8)/5)²

[Figure: two panels of f(x) over x ∈ [−5, 5]]

If you think your model has no parameters, you just haven’t found them yet!
Can We Learn the Kernel?
Bayesian Hierarchical Inference, vol. 2

p(f | θ) = GP(f; m_θ, k_θ)        e.g. m_θ(x) = ϕ(x)⊺ θ,  and/or  k_θ(a, b) = θ₁ exp(−(a − b)²/(2θ₂²))

▶ recall from last lecture: the evidence in Bayes’ theorem is the marginal likelihood for the model

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ For Gaussians, the evidence has analytic form:


N(y; (ϕ_X^θ)⊺ w + b, Λ) · N(w; µ, Σ)  =  N(w; m_post^θ, V_post^θ) · N(y; (ϕ_X^θ)⊺ µ + b, (ϕ_X^θ)⊺ Σ ϕ_X^θ + Λ)
(likelihood p(y | f, x, θ)  ·  prior p(f)   =   posterior p(f | y, x, θ)  ·  evidence p(y | θ, x))
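Selecting θ by type-II maximum likelihood then amounts to maximising this (log) evidence. A minimal sketch for a zero-mean GP with squared-exponential kernel (the parametrisation, the Cholesky-based evaluation and the grid search are illustrative choices, not the lecture's implementation):

import numpy as np

def log_evidence(X, y, theta1, theta2, sigma=0.1):
    # log p(y | theta, X) = log N(y; 0, k_XX + sigma^2 I) for k(a, b) = theta1 exp(-(a - b)^2 / (2 theta2^2))
    K = theta1 * np.exp(-(X[:, None] - X[None, :])**2 / (2 * theta2**2))
    G = K + sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(G)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

# crude type-II ML: pick the hyperparameters with the largest evidence on a grid (illustrative)
# best = max(((t1, t2) for t1 in (1.0, 10.0, 100.0) for t2 in (0.5, 1.0, 2.0)),
#            key=lambda t: log_evidence(X, y, *t))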

Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View

output

weights w1 w2 w3 w4 w5 w6 w7 w8 w9

features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9

parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9

x
input

Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View

output

function f

parameters θ

x
input

Summary: Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is an entire semi-ring of kernels: If k1 , k2 are kernels, ϕ any function, then so are
▶ α · k1 (a, b) for α ∈ R+
▶ k1 (ϕ(c), ϕ(d)) for c, d ∈ Y
▶ k1 (a, b) + k2 (a, b)
▶ k1 (a, b) · k2 (a, b)
▶ Kernels can be learned just like features, e.g. by type-II MAP/ML or numerical integration
Gaussian processes are an extremely important basic model for supervised regression. They have been
used in many practical applications, but also provide a theoretical foundation for more complicated models,
e.g. in classification, unsupervised learning, and deep learning. More later.

Next lecture: Some theory for Gaussian processes. If you don’t like pretty pictures, wait for it!

The Toolbox

Framework:
∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ Kernels

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP

