
Probabilistic Inference and Learning

Lecture 09
Gaussian Processes

Philipp Hennig
17 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                          Ex
 1  20.04.  Introduction                      1
 2  21.04.  Reasoning under Uncertainty
 3  27.04.  Continuous Variables              2
 4  28.04.  Monte Carlo
 5  04.05.  Markov Chain Monte Carlo          3
 6  05.05.  Gaussian Distributions
 7  11.05.  Parametric Regression             4
 8  12.05.  Learning Representations
 9  18.05.  Gaussian Processes                5
10  19.05.  An Example for GP Regression
11  25.05.  Understanding Kernels             6
12  26.05.  Gauss-Markov Models
13  08.06.  GP Classification                 7
14  09.06.  Logistic Regression               8
15  15.06.  Exponential Families
16  16.06.  Graphical Models                  9
17  22.06.  Factor Graphs
18  23.06.  The Sum-Product Algorithm        10
19  29.06.  Example: Topic Models
20  30.06.  Mixture Models                   11
21  06.07.  EM
22  07.07.  Variational Inference            12
23  13.07.  Example: Topic Models
24  14.07.  Example: Inferring Topics        13
25  20.07.  Example: Kernel Topic Models
26  21.07.  Revision

Probabilistic ML — P. Hennig, SS 2021 — Lecture 09: Gaussian Processes— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1
Summary:
▶ Gaussian distributions can be used to learn functions
▶ Analytical inference is possible using general linear models

f(x) = ϕ(x)⊺ w = ϕ_x⊺ w

▶ The choice of features ϕ : X → R is essentially unconstrained


▶ with parametrized feature families, hierarchical inference allows fitting features
▶ Gaussian linear regressors are single-layer neural networks with quadratic loss, aka. least-squares regression (cf. next week)


Gaussian parametric regression
  → optimize features: Deep Learning
  → infinitely many features: Nonparametric Learning

Reminder: Where we left off last time
Probabilistic Parametric Regression

f(x) := ϕ_x⊺ w        p(w) = N(w; µ, Σ)        p(y | f) = N(y; f_X, σ² I)


p(w | y, ϕ_X)  = N( w;  µ + Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),  Σ − Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ )
p(f_x | y, ϕ_X) = N( f_x;  ϕ_x⊺ µ + ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),  ϕ_x⊺ Σ ϕ_x − ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ ϕ_x )

[Figure: f(x) plotted over x ∈ [−8, 8]]

Reminder: Where we left off last time
Probabilistic Parametric Regression
ϕ(x) = [ e^(−(x−8)²/(2λ²)),  e^(−(x−7)²/(2λ²)),  e^(−(x−6)²/(2λ²)),  ... ]⊺
     = [ e^(−(x−c_i)²/(2λ²)) ]_{i=1,...,F}

[Figure: f(x) plotted over x ∈ [−8, 8]]

A Structural Observation
Graphical Model

y output

weights w1 w2 w3 w4 w5 w6 w7 w8 w9

features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9

parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9

x input

A linear Gaussian regressor is a single-hidden-layer neural network with quadratic output loss and a fixed
input layer. Hyperparameter fitting corresponds to training the input layer. The usual way to train such a
network, however, does not include the Occam factor.

What are we actually doing with those features?
let’s look at that algebra again

p(f_x | y, ϕ_X) = N( f_x;  ϕ_x⊺ µ + ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ (y − ϕ_X⊺ µ),
                           ϕ_x⊺ Σ ϕ_x − ϕ_x⊺ Σ ϕ_X (ϕ_X⊺ Σ ϕ_X + σ² I)⁻¹ ϕ_X⊺ Σ ϕ_x )
                = N( f_x;  m_x + k_xX (k_XX + σ² I)⁻¹ (y − m_X),
                           k_xx − k_xX (k_XX + σ² I)⁻¹ k_Xx )

using the abstraction / encapsulation

m_x := ϕ_x⊺ µ          m : X → R          mean function

k_ab := ϕ_a⊺ Σ ϕ_b     k : X × X → R      covariance function, aka. kernel

Code
Gaussian_Process_Regression.ipynb
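The notebook itself is not reproduced here. As a minimal sketch of what the kernelised regression of the previous slide might look like in code (the squared-exponential kernel, zero prior mean, noise level σ = 0.1, and all function names are illustrative assumptions, not the notebook's contents):

import numpy as np

def k_se(a, b, theta=1.0, lam=1.0):
    # squared-exponential kernel k(a, b) = theta^2 exp(-(a - b)^2 / (2 lam^2))
    return theta**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

def gp_posterior(x_pred, X, y, k=k_se, sigma=0.1):
    # posterior mean and covariance of f at x_pred given data (X, y), zero prior mean
    G = k(X, X) + sigma**2 * np.eye(len(X))             # k_XX + sigma^2 I
    kxX = k(x_pred, X)                                  # k_xX
    mean = kxX @ np.linalg.solve(G, y)                  # k_xX (k_XX + sigma^2 I)^-1 y
    cov = k(x_pred, x_pred) - kxX @ np.linalg.solve(G, kxX.T)
    return mean, cov

# example use (made-up data):
# X, y = np.array([-3.0, 0.0, 2.0]), np.array([0.5, -1.0, 0.3])
# mean, cov = gp_posterior(np.linspace(-8, 8, 200), X, y)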

Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]

▶ For simplicity, let's fix Σ = σ²(c_max − c_min)/F · I, thus:

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^F ϕ_ℓ(x_i) ϕ_ℓ(x_j)

▶ especially, for ϕ_ℓ(x) = exp(−(x − c_ℓ)²/(2λ²)):

  ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · Σ_{ℓ=1}^F exp(−(x_i − c_ℓ)²/(2λ²)) · exp(−(x_j − c_ℓ)²/(2λ²))
                   = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^F exp(−(c_ℓ − ½(x_i + x_j))²/λ²)

Sometimes, more features make things cheaper
an example [DJC MacKay, 1998]

ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_min)/F · exp(−(x_i − x_j)²/(4λ²)) · Σ_{ℓ=1}^F exp(−(c_ℓ − ½(x_i + x_j))²/λ²)

now increase F so that the number of features in δc approaches F·δc/(c_max − c_min):

ϕ(x_i)⊺ Σ ϕ(x_j) → σ² exp(−(x_i − x_j)²/(4λ²)) · ∫_{c_min}^{c_max} exp(−(c − ½(x_i + x_j))²/λ²) dc

let c_min → −∞, c_max → ∞:

k(x_i, x_j) := ϕ(x_i)⊺ Σ ϕ(x_j) → √π λ σ² exp(−(x_i − x_j)²/(4λ²))

i.e.  k_{x_i x_j} = √π λ σ² exp(−(x_i − x_j)²/(4λ²))
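A quick numerical sanity check of this limit (a sketch only; the values of σ, λ, F and the grid of centres c_ℓ are arbitrary choices):

import numpy as np

sigma, lam, F = 1.0, 1.0, 2000             # arbitrary choices for the check
cmin, cmax = -20.0, 20.0
c = np.linspace(cmin, cmax, F)             # feature centres c_l

def k_finite(xi, xj):                      # finite-F feature covariance phi(xi)^T Sigma phi(xj)
    phi_i = np.exp(-(xi - c)**2 / (2 * lam**2))
    phi_j = np.exp(-(xj - c)**2 / (2 * lam**2))
    return sigma**2 * (cmax - cmin) / F * phi_i @ phi_j

def k_limit(xi, xj):                       # the infinite-feature limit derived above
    return np.sqrt(np.pi) * lam * sigma**2 * np.exp(-(xi - xj)**2 / (4 * lam**2))

print(k_finite(0.3, 1.7), k_limit(0.3, 1.7))   # the two values should nearly coincide for large F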

In pictures
convergence to Gaussian kernel

[Figure: f(x) plotted over x ∈ [−8, 8]]

▶ aka. radial basis function, square(d)-exponential kernel

Code
Gaussian_Process_Regression.ipynb

What just happened?
kernelization and Gaussian processes

Definition (kernel)
k : X × X _ R is a (Mercer / positive definite) kernel if, for any finite collection X = [x1 , . . . , xN ], the
matrix kXX ∈ RN×N with [kXX ]ij = k(xi , xj ) is positive semidefinite.

kernel(f) := λ(a, b) → [[ f(a[i], b[j]) for j = 1:length(b) ] for i = 1:length(a) ]
actually, in Python:
import numpy as np
def kernel(f):
    # wrap a scalar covariance function f into a Gram-matrix builder
    return lambda a, b: np.array([[f(a[i], b[j]) for j in range(b.size)] for i in range(a.size)])

Definition (positive definite matrix)


A symmetric matrix A ∈ R^{N×N} is called positive (semi-) definite if

  v⊺ A v ≥ 0   ∀ v ∈ R^N.

Equivalently:
▶ All eigenvalues of the symmetric matrix A are non-negative
▶ A is a Gram matrix, i.e. A_ij = ϕ_i⊺ ϕ_j for some set of vectors [ϕ_i]_{i=1,...,N}
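As a sanity check of these definitions, one can verify numerically (up to round-off) that a Gram matrix built with the kernel helper above is positive semidefinite; a sketch, using the squared-exponential kernel as an arbitrary example:

import numpy as np

X = np.random.randn(50)                                   # arbitrary test inputs
k = kernel(lambda a, b: np.exp(-(a - b)**2))              # SE kernel via the helper above
K = k(X, X)                                               # Gram matrix k_XX
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # eigenvalues non-negative up to round-off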

Kernels
It is fine to think of them as inner products

Lemma:
k is a Mercer kernel if it can be written as (for an index set L with a positive measure ν)

  k(x, x′) = Σ_{ℓ∈L} ϕ_ℓ(x) ϕ_ℓ(x′)     or     k(x, x′) = ∫_L ϕ_ℓ(x) ϕ_ℓ(x′) dν(ℓ)
Proof: ∀ X ∈ X^N, v ∈ R^N:
  v⊺ k_XX v = ∫ Σ_{i=1}^N Σ_{j=1}^N v_i ϕ_ℓ(x_i) v_j ϕ_ℓ(x_j) dν(ℓ) = ∫ [ Σ_{i=1}^N v_i ϕ_ℓ(x_i) ]² dν(ℓ) ≥ 0   □

Gaussian processes
are programs

Definition
Let µ : X → R be any function, k : X × X → R be a Mercer kernel. A Gaussian process
p(f) = GP(f; µ, k) is a probability distribution over the function f : X → R, such that every finite
restriction to function values f_X := [f_{x_1}, . . . , f_{x_N}] is a Gaussian distribution p(f_X) = N(f_X; µ_X, k_XX).
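The definition translates directly into code: to sample from a GP prior, one only ever evaluates a finite restriction. A minimal sketch (the function name, the jitter term and the example kernel are illustrative assumptions):

import numpy as np

def gp_prior_samples(mu, k, x, n_samples=3, jitter=1e-9):
    # draw samples of the finite restriction f_X ~ N(mu_X, k_XX)
    mX = np.array([mu(xi) for xi in x])
    KXX = np.array([[k(a, b) for b in x] for a in x]) + jitter * np.eye(len(x))
    return np.random.multivariate_normal(mX, KXX, size=n_samples)

# e.g. gp_prior_samples(lambda x: 0.0, lambda a, b: np.exp(-(a - b)**2), np.linspace(-8, 8, 100))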

Another kernel
The Wiener process

[Figure: f(x) plotted over x ∈ [−8, 8]]

ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else }
Another Kernel
The Wiener process

ϕ_i(x) = θ(x − c_i) = { 1 if x ≥ c_i,  0 else }        Σ = σ²(c_max − c_0)/F · I

⇒ k(x_i, x_j) = ϕ(x_i)⊺ Σ ϕ(x_j) = σ²(c_max − c_0)/F · Σ_{ℓ=1}^F θ(min(x_i, x_j) − c_ℓ)

now increase F so that the number of features in δc approaches F·δc/(c_max − c_0):

k(x_i, x_j) → σ² ∫_{c_0}^{c_max} θ(min(x_i, x_j) − c) dc = σ² [c]_{c_0}^{min(x_i, x_j)}
            = σ² (min(x_i, x_j) − c_0)
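Again, the limit can be checked numerically; a small sketch with arbitrary constants:

import numpy as np

sigma, F, c0, cmax = 1.0, 5000, 0.0, 10.0        # arbitrary choices for the check
c = np.linspace(c0, cmax, F)                     # step locations c_l

def k_finite(xi, xj):                            # step features theta(x - c_l)
    return sigma**2 * (cmax - c0) / F * np.sum(min(xi, xj) >= c)

def k_wiener(xi, xj):
    return sigma**2 * (min(xi, xj) - c0)

print(k_finite(3.2, 7.5), k_wiener(3.2, 7.5))    # should nearly coincide for large F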

Graphical View
Convergence to Wiener process

[Figure: f(x) plotted over x ∈ [−8, 8]]

k(x_i, x_j) = σ² (min(x_i, x_j) − c_0)        var f(x) = σ² (x − c_0)

The Oldest Gaussian Process
Brownian Motion & Wiener process [image: F. Schmutzer, 1921, Austrian National Library]

[Figure: scan of a page from Einstein's 1905 paper, deriving the Gaussian solution of the diffusion equation for Brownian motion]

[A. Einstein, Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen, Ann. Physik, 1905]
Just one more!
Splines [G. Wahba, Spline Models for Observational Data, 1990]

remember Wiener:  ϕ_i(x) = θ(x − c_i),    f(x) = ϕ(x)⊺ w

consider:         ϕ̃_i(x) = ∫_{−∞}^x θ(x̃ − c_i) dx̃        ("ReLU")

equivalent to:    f̃(x) = ∫_{−∞}^x f(x̃) dx̃

⇒ k̃(x_i, x_j) = ϕ̃(x_i)⊺ Σ ϕ̃(x_j) = ∫^{x_i} ∫^{x_j} k(x̃_i, x̃_j) dx̃_j dx̃_i
              = σ² ( (1/3) min(x_i − x_0, x_j − x_0)³ + (1/2) |x_i − x_j| min(x_i − x_0, x_j − x_0)² )

Nb: The integral of a Gaussian process is also a Gaussian process!
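In code, the resulting cubic-spline kernel might be written as follows (a sketch; the defaults for σ and x_0 are arbitrary, and the formula is valid for inputs ≥ x_0):

import numpy as np

def k_cubic_spline(a, b, sigma=1.0, x0=-8.0):
    # cubic spline kernel from integrated step ("ReLU") features; valid for a, b >= x0
    m = np.minimum(a - x0, b - x0)
    return sigma**2 * (m**3 / 3.0 + np.abs(a - b) * m**2 / 2.0)

# e.g. x = np.linspace(-8, 8, 100); K = k_cubic_spline(x[:, None], x[None, :])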

Cubic Splines
from rectified linear units

[Figure: f(x) under the cubic-spline kernel, x ∈ [−8, 8]]

k(x_i, x_j) = σ² ( (1/3) min(x_i − x_0, x_j − x_0)³ + (1/2) |x_i − x_j| min(x_i − x_0, x_j − x_0)² )
Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral. This requires some regularity assumption about the features’ locations, shape, etc.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is no unique kernel. In fact, there are quite a few! E.g.

  k(a, b) = exp(−(a − b)²)                                                   Gaussian / Square Exponential / RBF kernel

  k(a, b) = min(a − t_0, b − t_0)                                            Wiener process

  k(a, b) = (1/3) min(a − t_0, b − t_0)³ + (1/2) |a − b| · min(a − t_0, b − t_0)²    cubic spline kernel

  k(a, b) = (2/π) sin⁻¹( 2 a⊺b / √((1 + 2 a⊺a)(1 + 2 b⊺b)) )                 Neural Network kernel (Williams, 1998)
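For concreteness, the kernels listed above might be written in code as follows (a sketch; t_0 is an arbitrary reference point, and inputs are assumed scalar except for the neural-network kernel):

import numpy as np

k_se     = lambda a, b: np.exp(-(a - b)**2)                    # Gaussian / SE / RBF kernel
k_wiener = lambda a, b, t0=0.0: np.minimum(a - t0, b - t0)     # Wiener process (a, b >= t0)
def k_spline(a, b, t0=0.0):                                    # cubic spline kernel (a, b >= t0)
    m = np.minimum(a - t0, b - t0)
    return m**3 / 3.0 + np.abs(a - b) * m**2 / 2.0
def k_nn(a, b):                                                # neural network kernel (Williams, 1998)
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return 2 / np.pi * np.arcsin(2 * (a @ b) / np.sqrt((1 + 2 * a @ a) * (1 + 2 * b @ b)))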

How many kernels are there?

A Lot!
because the space of kernels is large

Theorem:
Let X, Y be index sets and ϕ : Y → X. If k1, k2 : X × X → R are Mercer kernels, then the following
functions are also Mercer kernels (up to minor regularity assumptions):
▶ α · k1(a, b) for α ∈ R+ (proof: trivial)
▶ k1(ϕ(c), ϕ(d)) for c, d ∈ Y (proof: by Mercer's theorem, next lecture)
▶ k1(a, b) + k2(a, b) (proof: trivial)
▶ k1(a, b) · k2(a, b) — Schur product theorem (proof more involved; e.g. Bapat, 1997; Million, 2007)
A numerical sanity check of these closure properties is sketched below.
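A minimal numerical illustration of these closure properties, checking positive semidefiniteness of scaled, warped, summed and multiplied kernels on random inputs (the base kernels k1, k2 and the warping ϕ are arbitrary choices):

import numpy as np

def gram(k, x):                                    # Gram matrix k_XX for inputs x
    return np.array([[k(a, b) for b in x] for a in x])

def is_psd(K, tol=1e-9):                           # positive semidefinite up to round-off
    return np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol

k1  = lambda a, b: np.exp(-(a - b)**2)             # SE kernel
k2  = lambda a, b: (1 + a * b)**2                  # polynomial kernel
phi = np.sin                                       # some input transformation

x = np.random.randn(40)
for k in (lambda a, b: 3.0 * k1(a, b),             # scaling
          lambda a, b: k1(phi(a), phi(b)),         # input warping
          lambda a, b: k1(a, b) + k2(a, b),        # sum
          lambda a, b: k1(a, b) * k2(a, b)):       # product (Schur)
    print(is_psd(gram(k, x)))                      # expected: True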

Scaled Kernels are Kernels
(this affects the effect of noise)

 
k(a, b) = 1² · exp(−(a − b)²/2)

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Scaled Kernels are Kernels
(this affects the effect of noise)

 
k(a, b) = 10² · exp(−(a − b)²/2)

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
this can be used to define length-scales

 
k(a, b) = 20 · exp(−(a − b)²/(2 · 5²))

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
this can be used to define length-scales

 
k(a, b) = 20 · exp(−(a − b)²/(2 · 0.5²))

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Kernels over Transformed Inputs are Kernels
nonlinear transformations (also at the heart of deep learning)

k(a, b) = 20 · exp(−(ϕ(a) − ϕ(b))²/2)        ϕ(a) = ((a + 8)/5)³

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Sums of Kernels are Kernels
f ∼ GP(0, k1 ) and g ∼ GP(0, k2 ) ⇔ h = f + g ∼ GP(0, k1 + k2 )

 
k(a, b) = 20 · exp(−(a − b)²/2) + ϕ(a)⊺ ϕ(b)        ϕ(x) := [1, x, x²]

[Figure: two panels of f(x) over x ∈ [−5, 5]]

Products of Kernels are Kernels
Schur product theorem

k(a, b) = 20 · exp(−(a − b)²/2) · ϕ(a)⊺ ϕ(b)        ϕ(x) := ((x + 8)/5)²

[Figure: two panels of f(x) over x ∈ [−5, 5]]

If you think your model has no parameters, you just haven’t found them yet!
Can We Learn the Kernel?
Bayesian Hierarchical Inference, vol. 2

p(f | θ) = GP(f; m_θ, k_θ)        e.g. m_θ(x) = ϕ(x)⊺ θ,  and/or  k_θ(a, b) = θ₁ exp(−(a − b)²/(2θ₂²))

▶ recall from last lecture: the evidence in Bayes’ theorem is the marginal likelihood for the model

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ For Gaussians, the evidence has analytic form:


N(y; (ϕ_X^θ)⊺ w + b, Λ) · N(w; µ, Σ)  =  N(w; m_post^θ, V_post^θ) · N(y; (ϕ_X^θ)⊺ µ + b, (ϕ_X^θ)⊺ Σ ϕ_X^θ + Λ)
(likelihood p(y | f, x, θ)  ·  prior p(f)   =   posterior p(f | y, x, θ)  ·  evidence p(y | θ, x))
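Selecting θ by type-II maximum likelihood then amounts to maximising this (log) evidence. A minimal sketch for a zero-mean GP with squared-exponential kernel (the parametrisation, the Cholesky-based evaluation and the grid search are illustrative choices, not the lecture's implementation):

import numpy as np

def log_evidence(X, y, theta1, theta2, sigma=0.1):
    # log p(y | theta, X) = log N(y; 0, k_XX + sigma^2 I) for k(a, b) = theta1 exp(-(a - b)^2 / (2 theta2^2))
    K = theta1 * np.exp(-(X[:, None] - X[None, :])**2 / (2 * theta2**2))
    G = K + sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(G)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

# crude type-II ML: pick the hyperparameters with the largest evidence on a grid (illustrative)
# best = max(((t1, t2) for t1 in (1.0, 10.0, 100.0) for t2 in (0.5, 1.0, 2.0)),
#            key=lambda t: log_evidence(X, y, *t))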

Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View

output

weights w1 w2 w3 w4 w5 w6 w7 w8 w9

features [ϕx ]1 [ϕx ]2 [ϕx ]3 [ϕx ]4 [ϕx ]5 [ϕx ]6 [ϕx ]7 [ϕx ]8 [ϕx ]9

parameters θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9

x
input

Kernel Learning is Fitting an Entire Population of Features
The Graphical Model View

output

function f

parameters θ

x
input

Summary: Gaussian Processes
▶ Sometimes it is possible to consider infinitely many features at once, by extending from a sum to
an integral.
▶ The resulting nonparametric model is known as a Gaussian process
▶ Inference in GPs is tractable (though at polynomial cost O(N³) in the number N of datapoints)
▶ There is an entire semi-ring of kernels: If k1 , k2 are kernels, ϕ any function, then so are
▶ α · k1 (a, b) for α ∈ R+
▶ k1 (ϕ(c), ϕ(d)) for c, d ∈ Y
▶ k1 (a, b) + k2 (a, b)
▶ k1 (a, b) · k2 (a, b)
▶ Kernels can be learned just like features, e.g. by type-II MAP/ML or numerical integration
Gaussian processes are an extremely important basic model for supervised regression. They have been
used in many practical applications, but also provide a theoretical foundation for more complicated models,
e.g. in classification, unsupervised learning, and deep learning. More later.

Next lecture: Some theory for Gaussian processes. If you don’t like pretty pictures, wait for it!

The Toolbox

Framework:
∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ Kernels

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP

