
Probabilistic Inference and Learning

Lecture 04
Sampling

Philipp Hennig
27 April 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   date    content                         Ex
1   20.04.  Introduction                    1
2   21.04.  Reasoning under Uncertainty
3   27.04.  Continuous Variables            2
4   28.04.  Monte Carlo
5   04.05.  Markov Chain Monte Carlo        3
6   05.05.  Gaussian Distributions
7   11.05.  Parametric Regression           4
8   12.05.  Understanding Deep Learning
9   18.05.  Gaussian Processes              5
10  19.05.  An Example for GP Regression
11  25.05.  Understanding Kernels           6
12  26.05.  Gauss-Markov Models
13  08.06.  GP Classification               7
14  09.06.  Logistic Regression             8
15  15.06.  Exponential Families
16  16.06.  Graphical Models                9
17  22.06.  Factor Graphs
18  23.06.  The Sum-Product Algorithm       10
19  29.06.  Example: Topic Models
20  30.06.  Mixture Models                  11
21  06.07.  EM
22  07.07.  Variational Inference           12
23  13.07.  Example: Topic Models
24  14.07.  Example: Inferring Topics       13
25  20.07.  Example: Kernel Topic Models
26  21.07.  Revision

A Computational Challenge
Integration is the core computation of probabilistic inference

Probabilistic inference requires integrals:


▶ Evidences. Example from Lecture 3:
p(π | x1, . . . , xN) = [ ∏_{i=1}^{N} p(xi | π) p(π) ] / [ ∫₀¹ ∏_{i=1}^{N} p(xi | π) p(π) dπ ] = [ π^n (1 − π)^{N−n} ] / [ ∫₀¹ π^n (1 − π)^{N−n} dπ ]
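For reference (assuming, as in Lecture 3, a uniform prior p(π) = 1 on [0, 1] and n successes among the N binary observations xi), this particular normalization integral even has a closed form,

∫₀¹ π^n (1 − π)^{N−n} dπ = Γ(n + 1) Γ(N − n + 1) / Γ(N + 2) = n! (N − n)! / (N + 1)!,

so the posterior is the Beta distribution B(π; n + 1, N − n + 1). In general, though, such evidence integrals have no closed form, which is what motivates the methods of this lecture.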

▶ Expectations (actually, evidences are expectations, too)


⟨f⟩p := Ep[f] := ∫ f(x) p(x) dx        “Expectation of f under p”

f(x) = x                    mean
f(x) = (x − Ep(x))²         variance
f(x) = x^p                  p-th moment
f(x) = − log p(x)           entropy
...

The Toolbox

Framework:
∫ p(x1, x2) dx2 = p(x1)        p(x1, x2) = p(x1 | x2) p(x2)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models

Computation:
▶ Monte Carlo

Randomized Methods — Monte Carlo
the idea

▶ the “simplest thing to do”: replace integral with sum:


∫ f(x) p(x) dx ≈ (1/S) Σ_{i=1}^{S} f(xi);        ∫ p(x, y) dx ≈ (1/S) Σ_i p(y | xi);        if xi ∼ p(x)

▶ this requires being able to sample xi ∼ p(x)

Definition (Monte Carlo method)


Algorithms that compute expectations in the above way, using samples xi ∼ p(x), are called Monte Carlo
methods (Stanisław Ulam, John von Neumann).

A method from a different age
Monte Carlo Methods and the Manhattan Project images: Los Alamos National Laboratory / wikipedia

Stanisław Ulam (1909–1984)        Nicholas Metropolis (1915–1999)        John von Neumann (1903–1957)

The FERMIAC
analog Monte Carlo computer images: wikipedia

Example
a dumb way to compute π

▶ ratio of quarter-circle to square: π/4
▶ π = 4 ∫ I(x⊺x < 1) u(x) dx
▶ draw x ∼ u(x), check x⊺x < 1, count

from numpy.random import rand
S = 100000
# fraction of uniform points falling inside the quarter circle, times 4
sum((rand(S, 2)**2).sum(axis=1) < 1) / S * 4

> 3.13708
> 3.14276

[Figure: uniform samples in the unit square [0, 1]²; the quarter circle x⊺x < 1 covers a fraction π/4 of it.]

Monte Carlo works on every Integrable Function
is this a good thing?

ϕ := ∫ f(x) p(x) dx = Ep(f)

▶ Let xs ∼ p, s = 1, . . . , S, iid. (i.e. p(xs = x) = p(x) and p(xs, xt) = p(xs) p(xt) ∀ s, t)

The Monte Carlo estimator is

ϕ̂ := (1/S) Σ_{s=1}^{S} f(xs),

and its expectation is

E(ϕ̂) = (1/S) Σ_{s=1}^{S} ∫ f(xs) p(xs) dxs = (1/S) Σ_{s=1}^{S} E(f(xs)) = ϕ,

so it is an unbiased estimator!

▶ the only requirement for this is that ∫ f(x) p(x) dx exists (i.e. f must be Lebesgue-integrable
relative to p). Monte Carlo integration can even work on discontinuous functions.
Sampling converges slowly
expected square error

▶ The expected square error (variance) drops as O(S^{−1}):

E[(ϕ̂ − E(ϕ̂))²] = E[ ( (1/S) Σ_{s=1}^{S} (f(xs) − ϕ) )² ]

= (1/S²) Σ_{s=1}^{S} Σ_{r=1}^{S} [ E(f(xs) f(xr)) − ϕ E(f(xs)) − E(f(xr)) ϕ + ϕ² ]

= (1/S²) Σ_{s=1}^{S} [ Σ_{r≠s} (ϕ² − 2ϕ² + ϕ²) + E(f²) − ϕ² ]        (the inner sum over r ≠ s vanishes; the remaining term is var(f))

= (1/S) var(f) = O(S^{−1})

▶ Thus, the expected error (the square root of the expected square error) drops as O(S^{−1/2}) (a quick numerical check follows below).
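A quick empirical check of this rate, reusing the π example from above (a minimal sketch; the sample sizes, seed, and number of repetitions are arbitrary choices): the variance of ϕ̂ across repetitions should shrink by roughly a factor of 10 whenever S grows by a factor of 10. Here var(f) = 4π − π² ≈ 2.7, so the printed values should be close to 2.7/S.

import numpy as np

rng = np.random.default_rng(0)

def pi_hat(S):
    # Monte Carlo estimate of pi from S uniform samples in the unit square
    x = rng.random((S, 2))
    return 4.0 * np.mean((x ** 2).sum(axis=1) < 1)

for S in [100, 1000, 10000]:
    estimates = [pi_hat(S) for _ in range(200)]
    # empirical variance of the estimator, roughly var(f)/S
    print(S, np.var(estimates))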

sampling is for rough guesses
recall example computation for π

[Figure: the running Monte Carlo estimate ϕ̂ (MC) together with the true value π, and the expected square error var(f)/S, both plotted against the number of samples (10⁰ to 10⁵).]
▶ need only ∼ 9 samples to get the order of magnitude right (std(ϕ̂) = std(f)/√S ≈ std(f)/3)
▶ need ∼ 10¹⁴ samples for single-precision (∼ 10⁻⁷) calculations!
▶ sampling is good for rough estimates, not for precise calculations
▶ Always think of other options before trying to sample!
▶ samples from a probability distribution can be used to estimate expectations, roughly, without
having to design an elaborate integration algorithm
▶ The error of the estimate is independent of the dimensionality of the input domain!

How do we generate random samples from p(x)?

Reminder: Change of Measure
The transformation law

Theorem (Change of Variable for Probability Density Functions)


Let X be a continuous random variable with PDF pX(x) over c1 < x < c2, and let Y = u(X) be a
monotonic differentiable function with inverse X = v(Y). Then the PDF of Y is

pY(y) = pX(v(y)) · |dv(y)/dy| = pX(v(y)) · |du(x)/dx|⁻¹.

Let X = (X1, . . . , Xd) have a joint density pX. Let g : R^d → R^d be continuously differentiable and injective,
with non-vanishing Jacobian Jg. Then Y = g(X) has density

pY(y) = pX(g⁻¹(y)) · |J_{g⁻¹}(y)|   if y is in the range of g,   and   pY(y) = 0   otherwise.

The Jacobian Jg is the d × d matrix with [Jg(x)]_ij = ∂gi(x)/∂xj.

Some special cases
sampling from an exponential distribution is analytic

[Figure: the exponential density p(x) for x ∈ [0, 4].]

p(x) = (1/λ) e^{−x/λ},        F(x) = ∫₀ˣ p(x′) dx′ = 1 − e^{−x/λ}

Setting 1 − u = 1 − e^{−x/λ} for u ∼ U[0, 1] (so that u = e^{−x/λ}) gives x = −λ log(u).
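A minimal sketch of this inverse-CDF recipe in code (the scale λ, the seed, and the sample count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                     # scale parameter lambda
u = rng.random(100_000)       # u ~ Uniform[0, 1]
x = -lam * np.log(u)          # inverse CDF: x ~ Exponential with scale lambda
print(x.mean(), lam)          # the sample mean should be close to lambda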
Example: Sampling from a Beta Distribution
uniform variables
Consider u ∼ U[0, 1] (i.e. u ∈ [0, 1], and p(u) = 1). The variable x = u^{1/α} has the Beta density

px(x) = pu(u(x)) · |∂u(x)/∂x| = α · x^{α−1} = B(x; α, 1).
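A minimal numerical check of this transformation (α, the seed, and the sample count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
alpha = 3.0
u = rng.random(100_000)                 # u ~ Uniform[0, 1]
x = u ** (1.0 / alpha)                  # x ~ B(x; alpha, 1)
print(x.mean(), alpha / (alpha + 1))    # B(x; a, 1) has mean a / (a + 1)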


Homework:
Consider two independent variables

X ∼ G(α, θ),        Y ∼ G(β, θ),

where G(ξ; α, θ) = 1/(Γ(α) θ^α) · ξ^{α−1} e^{−ξ/θ} is the Gamma distribution. Show that the random variable
Z = X/(X + Y) is Beta distributed, with the density

p(Z = z) = B(z; α, β) = Γ(α + β)/(Γ(α) Γ(β)) · z^{α−1} (1 − z)^{β−1}.
▶ samples from a probability distribution can be used to estimate expectations, roughly
▶ ‘random numbers’ don’t really need to be unpredictable, as long as they have as little structure as
possible
▶ uniformly distributed random numbers can be transformed into other distributions. In some cases this
can be done efficiently, and it is worth checking whether such a transformation exists
What do we do if we don’t know a good transformation?

Why is sampling hard?
Sampling is harder than global optimization

To produce exact samples:


▶ need to know cumulative density everywhere
▶ need to know regions of high density (not just local maxima!)
▶ a global description of the entire function
Practical Monte Carlo Methods aim to construct samples from

p(x) = p̃(x) / Z
assuming that it is possible to evaluate the unnormalized density p̃ (but not p) at arbitrary points.
Typical example: Compute moments of a posterior

p(x | D) = p(D | x) p(x) / ∫ p(D, x) dx        as        E_{p(x|D)}(x^n) ≈ (1/S) Σ_s xs^n    with xs ∼ p(x | D)

Rejection Sampling
a simple method [Georges-Louis Leclerc, Comte de Buffon, 1707–1788]

[Figure: an unnormalized density p̃(x) beneath a scaled envelope c q(x); proposals falling above p̃ are rejected.]

▶ for any p(x) = p̃(x)/Z (normalizer Z not required)


▶ choose q(x) s.t. cq(x) ≥ p̃(x)
▶ draw s ∼ q(x), u ∼ Uniform[0, cq(s)]
▶ reject if u > p̃(s), otherwise accept s as a sample from p(x) (a minimal code sketch follows below)
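A minimal sketch of this procedure (the target p̃(x) = x² on [0, 1], i.e. an unnormalized B(x; 3, 1), the uniform proposal q, and the constant c = 1 are illustrative choices, not taken from the slide):

import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    # unnormalized target on [0, 1]: proportional to x^2, i.e. B(x; 3, 1)
    return x ** 2

# proposal q = Uniform[0, 1]; with c = 1 we have c * q(x) = 1 >= p_tilde(x) on [0, 1]
c, S = 1.0, 100_000
s = rng.random(S)                       # s ~ q
u = rng.uniform(0.0, c, size=S)         # u ~ Uniform[0, c * q(s)]  (q(s) = 1 here)
samples = s[u <= p_tilde(s)]            # reject if u > p_tilde(s), keep the rest
print(samples.mean(), 3 / 4)            # B(x; 3, 1) has mean 3/4
print(len(samples) / S)                 # acceptance rate, about 1/3 here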

The Problem with Rejection Sampling
the curse of dimensionality [MacKay, §29.3]

[Figure: p(x) and the scaled proposal cq(x), both Gaussian, plotted against x.]

Example:
▶ p(x) = N(x; 0, σp²)
▶ q(x) = N(x; 0, σq²)
▶ σq > σp
▶ the optimal c is given by

c = (2πσq²)^{D/2} / (2πσp²)^{D/2} = (σq/σp)^D = exp(D ln(σq/σp))

▶ acceptance rate is the ratio of volumes: 1/c
▶ rejection rate rises exponentially in D
▶ for σq/σp = 1.1 and D = 100: 1/c < 10⁻⁴ (checked numerically below)
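A quick check of this last number, just evaluating the expression above:

import numpy as np

D, ratio = 100, 1.1             # dimension D and the ratio sigma_q / sigma_p
c = np.exp(D * np.log(ratio))   # optimal envelope constant c = (sigma_q / sigma_p)^D
print(1 / c)                    # acceptance rate 1/c, about 7e-05 < 1e-4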

Importance Sampling
a slightly less simple method

▶ computing p̃(x), q(x), then throwing them away seems wasteful


▶ instead, rewrite (assume q(x) > 0 if p(x) > 0)
ϕ = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx
  ≈ (1/S) Σ_s f(xs) p(xs)/q(xs) =: (1/S) Σ_s f(xs) ws        if xs ∼ q(x)
▶ this is just using a new function g(x) = f(x)p(x)/q(x), so it is an unbiased estimator
▶ ws is known as the importance (weight) of sample s
▶ if normalization unknown, can also use p̃(x) = Zp(x)
∫ f(x) p(x) dx ≈ (1/Z) · (1/S) Σ_s f(xs) p̃(xs)/q(xs)
             ≈ (1/S) Σ_s f(xs) · [p̃(xs)/q(xs)] / [(1/S) Σ_t p̃(xt)/q(xt)] =: Σ_s f(xs) w̃s,

using Z ≈ (1/S) Σ_t p̃(xt)/q(xt)

▶ this is consistent, but biased (a code sketch of both estimators follows below)
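A minimal sketch of both estimators, with illustrative choices that are not from the slide: target p = N(0, 1), proposal q = N(0, 2²), and f(x) = x², so that the true value is Ep(f) = 1.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
S = 100_000

def f(x):
    return x ** 2

xs = rng.normal(0.0, 2.0, size=S)              # samples from the proposal q = N(0, 2^2)
w = norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 2)    # importance weights ws = p(xs) / q(xs)
print(np.mean(f(xs) * w))                      # unbiased estimate of E_p[x^2] = 1

# self-normalized variant: pretend p is only known up to a constant, p_tilde = Z * p
Z = 7.0                                        # stands in for the unknown normalizer
w_tilde = Z * norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 2)
w_tilde = w_tilde / w_tilde.sum()              # normalized weights; Z cancels out
print(np.sum(f(xs) * w_tilde))                 # consistent, but biased estimate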


What’s wrong with Importance Sampling?
the curse of dimensionality, revisited
 
▶ recall that var(ϕ̂) = var(f)/S; importance sampling replaces var(f) with var(g) = var(f · p/q)
▶ var(f · p/q) can be very large if q ≪ p somewhere, and in many dimensions this is the case almost everywhere!
▶ if p has “undiscovered islands”, some samples have p(x)/q(x) → ∞ (see the small demonstration below the figure)
[Figure: the densities p(x) and q(x) together with the weight function w(x), and a histogram (log₁₀ sample count) of the values f(x) and g(x).]
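A tiny demonstration of this blow-up, again with toy choices that are not from the slide: target p = N(0, 1) and a proposal q = N(0, 0.5²) that is too narrow. The weights have mean 1 in expectation, but the largest ones are orders of magnitude bigger, so a few tail samples dominate the variance of the estimator.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
xs = rng.normal(0.0, 0.5, size=100_000)          # proposal q = N(0, 0.5^2), narrower than p
w = norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 0.5)    # weights p(x)/q(x) = 0.5 * exp(1.5 * x^2)
print(w.mean(), w.max())                         # mean weight near 1, maximum several hundred times larger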
Sampling (Monte Carlo) Methods
Sampling is a way of performing rough probabilistic computations, in particular for expectations
(including marginalization).
▶ samples from a probability distribution can be used to estimate expectations, roughly
▶ uniformly distributed random numbers can be transformed into other distributions. In some cases this
can be done efficiently, and it is worth checking whether such a transformation exists
▶ Rejection sampling is a primitive but exact method that works with intractable models
▶ Importance sampling makes more efficient use of samples, but can have high variance (and this
may not be obvious)
Next Lecture:
▶ Markov Chain Monte Carlo methods are more elaborate ways of getting approximate answers to
intractable problems.

Probabilistic ML — P. Hennig, SS 2021 — Lecture 04: Sampling— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 22
