
Probabilistic Inference and Learning

Lecture 04
Sampling

Philipp Hennig
27 April 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   date    content                         Ex
1   20.04.  Introduction                    1
2   21.04.  Reasoning under Uncertainty
3   27.04.  Continuous Variables            2
4   28.04.  Monte Carlo
5   04.05.  Markov Chain Monte Carlo        3
6   05.05.  Gaussian Distributions
7   11.05.  Parametric Regression           4
8   12.05.  Understanding Deep Learning
9   18.05.  Gaussian Processes              5
10  19.05.  An Example for GP Regression
11  25.05.  Understanding Kernels           6
12  26.05.  Gauss-Markov Models
13  08.06.  GP Classification               7
14  09.06.  Logistic Regression             8
15  15.06.  Exponential Families
16  16.06.  Graphical Models                9
17  22.06.  Factor Graphs
18  23.06.  The Sum-Product Algorithm       10
19  29.06.  Example: Topic Models
20  30.06.  Mixture Models                  11
21  06.07.  EM
22  07.07.  Variational Inference           12
23  13.07.  Example: Topic Models
24  14.07.  Example: Inferring Topics       13
25  20.07.  Example: Kernel Topic Models
26  21.07.  Revision

A Computational Challenge
Integration is the core computation of probabilistic inference

Probabilistic inference requires integrals:


▶ Evidences. Example from Lecture 3:
p(π | x1, . . . , xN) = [ ∏_{i=1}^{N} p(xi | π) p(π) ] / [ ∫₀¹ ∏_{i=1}^{N} p(xi | π) p(π) dπ ] = [ π^n (1 − π)^{N−n} ] / [ ∫₀¹ π^n (1 − π)^{N−n} dπ ]
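For reference (assuming, as in Lecture 3, a uniform prior p(π) = 1 on [0, 1] and n successes among the N binary observations xi), this particular normalization integral even has a closed form,

∫₀¹ π^n (1 − π)^{N−n} dπ = Γ(n + 1) Γ(N − n + 1) / Γ(N + 2) = n! (N − n)! / (N + 1)!,

so the posterior is the Beta distribution B(π; n + 1, N − n + 1). In general, though, such evidence integrals have no closed form, which is what motivates the methods of this lecture.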

▶ Expectations (actually, evidences are expectations, too)


⟨f⟩p := Ep[f] := ∫ f(x) p(x) dx        “Expectation of f under p”

f(x) = x                    mean
f(x) = (x − Ep(x))²         variance
f(x) = x^p                  p-th moment
f(x) = − log p(x)           entropy
...

The Toolbox

Framework:
∫ p(x1, x2) dx2 = p(x1)        p(x1, x2) = p(x1 | x2) p(x2)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models

Computation:
▶ Monte Carlo

Randomized Methods — Monte Carlo
the idea

▶ the “simplest thing to do”: replace integral with sum:


∫ f(x) p(x) dx ≈ (1/S) Σ_{i=1}^{S} f(xi);        ∫ p(x, y) dx ≈ (1/S) Σ_i p(y | xi);        if xi ∼ p(x)

▶ this requires being able to sample xi ∼ p(x)

Definition (Monte Carlo method)


Algorithms that compute expectations in the above way, using samples xi ∼ p(x), are called Monte Carlo
methods (Stanisław Ulam, John von Neumann).

A method from a different age
Monte Carlo Methods and the Manhattan Project images: Los Alamos National Laboratory / wikipedia

Stanisław Ulam (1909–1984)        Nicholas Metropolis (1915–1999)        John von Neumann (1903–1957)

The FERMIAC
analog Monte Carlo computer images: wikipedia

Example
a dumb way to compute π

▶ ratio of quarter-circle to square: π/4
▶ π = 4 ∫ I(x⊺x < 1) u(x) dx
▶ draw x ∼ u(x), check x⊺x < 1, count

from numpy.random import rand
S = 100000
# fraction of uniform points falling inside the quarter circle, times 4
sum((rand(S, 2)**2).sum(axis=1) < 1) / S * 4

> 3.13708
> 3.14276

[Figure: uniform samples in the unit square [0, 1]²; the quarter circle x⊺x < 1 covers a fraction π/4 of it.]

Monte Carlo works on every Integrable Function
is this a good thing?

ϕ := ∫ f(x) p(x) dx = Ep(f)

▶ Let xs ∼ p, s = 1, . . . , S, iid. (i.e. p(xs = x) = p(x) and p(xs, xt) = p(xs) p(xt) ∀ s, t)

The Monte Carlo estimator is

ϕ̂ := (1/S) Σ_{s=1}^{S} f(xs),

and its expectation is

E(ϕ̂) = (1/S) Σ_{s=1}^{S} ∫ f(xs) p(xs) dxs = (1/S) Σ_{s=1}^{S} E(f(xs)) = ϕ,

so it is an unbiased estimator!

▶ the only requirement for this is that ∫ f(x) p(x) dx exists (i.e. f must be Lebesgue-integrable
relative to p). Monte Carlo integration can even work on discontinuous functions.
Sampling converges slowly
expected square error

▶ The expected square error (variance) drops as O(S^{−1}):

E[(ϕ̂ − E(ϕ̂))²] = E[ ( (1/S) Σ_{s=1}^{S} (f(xs) − ϕ) )² ]

= (1/S²) Σ_{s=1}^{S} Σ_{r=1}^{S} [ E(f(xs) f(xr)) − ϕ E(f(xs)) − E(f(xr)) ϕ + ϕ² ]

= (1/S²) Σ_{s=1}^{S} [ Σ_{r≠s} (ϕ² − 2ϕ² + ϕ²) + E(f²) − ϕ² ]        (the inner sum over r ≠ s vanishes; the remaining term is var(f))

= (1/S) var(f) = O(S^{−1})

▶ Thus, the expected error (the square root of the expected square error) drops as O(S^{−1/2}) (a quick numerical check follows below).
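A quick empirical check of this rate, reusing the π example from above (a minimal sketch; the sample sizes, seed, and number of repetitions are arbitrary choices): the variance of ϕ̂ across repetitions should shrink by roughly a factor of 10 whenever S grows by a factor of 10. Here var(f) = 4π − π² ≈ 2.7, so the printed values should be close to 2.7/S.

import numpy as np

rng = np.random.default_rng(0)

def pi_hat(S):
    # Monte Carlo estimate of pi from S uniform samples in the unit square
    x = rng.random((S, 2))
    return 4.0 * np.mean((x ** 2).sum(axis=1) < 1)

for S in [100, 1000, 10000]:
    estimates = [pi_hat(S) for _ in range(200)]
    # empirical variance of the estimator, roughly var(f)/S
    print(S, np.var(estimates))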

sampling is for rough guesses
recall example computation for π

[Figure: the running Monte Carlo estimate ϕ̂ (MC) together with the true value π, and the expected square error var(f)/S, both plotted against the number of samples (10⁰ to 10⁵).]
▶ need only ∼ 9 samples to get the order of magnitude right (std(ϕ̂) = std(f)/√S ≈ std(f)/3)
▶ need ∼ 10¹⁴ samples for single-precision (∼ 10⁻⁷) calculations!
▶ sampling is good for rough estimates, not for precise calculations
▶ Always think of other options before trying to sample!
▶ samples from a probability distribution can be used to estimate expectations, roughly, without
having to design an elaborate integration algorithm
▶ The error of the estimate is independent of the dimensionality of the input domain!

How do we generate random samples from p(x)?

Reminder: Change of Measure
The transformation law

Theorem (Change of Variable for Probability Density Functions)


Let X be a continuous random variable with PDF pX(x) over c1 < x < c2, and let Y = u(X) be a
monotonic differentiable function with inverse X = v(Y). Then the PDF of Y is

pY(y) = pX(v(y)) · |dv(y)/dy| = pX(v(y)) · |du(x)/dx|⁻¹.

Let X = (X1, . . . , Xd) have a joint density pX. Let g : R^d → R^d be continuously differentiable and injective,
with non-vanishing Jacobian Jg. Then Y = g(X) has density

pY(y) = pX(g⁻¹(y)) · |J_{g⁻¹}(y)|   if y is in the range of g,   and   pY(y) = 0   otherwise.

The Jacobian Jg is the d × d matrix with [Jg(x)]_ij = ∂gi(x)/∂xj.

Some special cases
sampling from an exponential distribution is analytic

[Figure: the exponential density p(x) for x ∈ [0, 4].]

p(x) = (1/λ) e^{−x/λ},        F(x) = ∫₀ˣ p(x′) dx′ = 1 − e^{−x/λ}

Setting 1 − u = 1 − e^{−x/λ} for u ∼ U[0, 1] (so that u = e^{−x/λ}) gives x = −λ log(u).
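A minimal sketch of this inverse-CDF recipe in code (the scale λ, the seed, and the sample count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                     # scale parameter lambda
u = rng.random(100_000)       # u ~ Uniform[0, 1]
x = -lam * np.log(u)          # inverse CDF: x ~ Exponential with scale lambda
print(x.mean(), lam)          # the sample mean should be close to lambda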
Example: Sampling from a Beta Distribution
uniform variables
Consider u ∼ U[0, 1] (i.e. u ∈ [0, 1], and p(u) = 1). The variable x = u^{1/α} has the Beta density

px(x) = pu(u(x)) · |∂u(x)/∂x| = α · x^{α−1} = B(x; α, 1).
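A minimal numerical check of this transformation (α, the seed, and the sample count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
alpha = 3.0
u = rng.random(100_000)                 # u ~ Uniform[0, 1]
x = u ** (1.0 / alpha)                  # x ~ B(x; alpha, 1)
print(x.mean(), alpha / (alpha + 1))    # B(x; a, 1) has mean a / (a + 1)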


Homework:
Consider two independent variables

X ∼ G(α, θ),        Y ∼ G(β, θ),

where G(ξ; α, θ) = 1/(Γ(α) θ^α) · ξ^{α−1} e^{−ξ/θ} is the Gamma distribution. Show that the random variable
Z = X/(X + Y) is Beta distributed, with the density

p(Z = z) = B(z; α, β) = Γ(α + β)/(Γ(α) Γ(β)) · z^{α−1} (1 − z)^{β−1}.
▶ samples from a probability distribution can be used to estimate expectations, roughly
▶ ‘random numbers’ don’t really need to be unpredictable, as long as they have as little structure as
possible
▶ uniformly distributed random numbers can be transformed into other distributions. In some cases this
can be done efficiently, and it is worth checking whether such a transformation exists
What do we do if we don’t know a good transformation?

Why is sampling hard?
Sampling is harder than global optimization

To produce exact samples:


▶ need to know cumulative density everywhere
▶ need to know regions of high density (not just local maxima!)
▶ a global description of the entire function
Practical Monte Carlo Methods aim to construct samples from

p(x) = p̃(x) / Z
assuming that it is possible to evaluate the unnormalized density p̃ (but not p) at arbitrary points.
Typical example: Compute moments of a posterior

p(x | D) = p(D | x) p(x) / ∫ p(D, x) dx        as        E_{p(x|D)}(x^n) ≈ (1/S) Σ_s xs^n    with xs ∼ p(x | D)

Rejection Sampling
a simple method [Georges-Louis Leclerc, Comte de Buffon, 1707–1788]

[Figure: an unnormalized density p̃(x) beneath a scaled envelope c q(x); proposals falling above p̃ are rejected.]

▶ for any p(x) = p̃(x)/Z (normalizer Z not required)


▶ choose q(x) s.t. cq(x) ≥ p̃(x)
▶ draw s ∼ q(x), u ∼ Uniform[0, cq(s)]
▶ reject if u > p̃(s), otherwise accept s as a sample from p(x) (a minimal code sketch follows below)
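A minimal sketch of this procedure (the target p̃(x) = x² on [0, 1], i.e. an unnormalized B(x; 3, 1), the uniform proposal q, and the constant c = 1 are illustrative choices, not taken from the slide):

import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    # unnormalized target on [0, 1]: proportional to x^2, i.e. B(x; 3, 1)
    return x ** 2

# proposal q = Uniform[0, 1]; with c = 1 we have c * q(x) = 1 >= p_tilde(x) on [0, 1]
c, S = 1.0, 100_000
s = rng.random(S)                       # s ~ q
u = rng.uniform(0.0, c, size=S)         # u ~ Uniform[0, c * q(s)]  (q(s) = 1 here)
samples = s[u <= p_tilde(s)]            # reject if u > p_tilde(s), keep the rest
print(samples.mean(), 3 / 4)            # B(x; 3, 1) has mean 3/4
print(len(samples) / S)                 # acceptance rate, about 1/3 here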

The Problem with Rejection Sampling
the curse of dimensionality [MacKay, §29.3]

[Figure: p(x) and the scaled proposal cq(x), both Gaussian, plotted against x.]

Example:
▶ p(x) = N(x; 0, σp²)
▶ q(x) = N(x; 0, σq²)
▶ σq > σp
▶ the optimal c is given by

c = (2πσq²)^{D/2} / (2πσp²)^{D/2} = (σq/σp)^D = exp(D ln(σq/σp))

▶ acceptance rate is the ratio of volumes: 1/c
▶ rejection rate rises exponentially in D
▶ for σq/σp = 1.1 and D = 100: 1/c < 10⁻⁴ (checked numerically below)
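A quick check of this last number, just evaluating the expression above:

import numpy as np

D, ratio = 100, 1.1             # dimension D and the ratio sigma_q / sigma_p
c = np.exp(D * np.log(ratio))   # optimal envelope constant c = (sigma_q / sigma_p)^D
print(1 / c)                    # acceptance rate 1/c, about 7e-05 < 1e-4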

Importance Sampling
a slightly less simple method

▶ computing p̃(x), q(x), then throwing them away seems wasteful


▶ instead, rewrite (assume q(x) > 0 if p(x) > 0)
ϕ = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx
  ≈ (1/S) Σ_s f(xs) p(xs)/q(xs) =: (1/S) Σ_s f(xs) ws        if xs ∼ q(x)
▶ this is just using a new function g(x) = f(x)p(x)/q(x), so it is an unbiased estimator
▶ ws is known as the importance (weight) of sample s
▶ if normalization unknown, can also use p̃(x) = Zp(x)
∫ f(x) p(x) dx ≈ (1/Z) · (1/S) Σ_s f(xs) p̃(xs)/q(xs)
             ≈ (1/S) Σ_s f(xs) · [p̃(xs)/q(xs)] / [(1/S) Σ_t p̃(xt)/q(xt)] =: Σ_s f(xs) w̃s,

using Z ≈ (1/S) Σ_t p̃(xt)/q(xt)

▶ this is consistent, but biased (a code sketch of both estimators follows below)
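A minimal sketch of both estimators, with illustrative choices that are not from the slide: target p = N(0, 1), proposal q = N(0, 2²), and f(x) = x², so that the true value is Ep(f) = 1.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
S = 100_000

def f(x):
    return x ** 2

xs = rng.normal(0.0, 2.0, size=S)              # samples from the proposal q = N(0, 2^2)
w = norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 2)    # importance weights ws = p(xs) / q(xs)
print(np.mean(f(xs) * w))                      # unbiased estimate of E_p[x^2] = 1

# self-normalized variant: pretend p is only known up to a constant, p_tilde = Z * p
Z = 7.0                                        # stands in for the unknown normalizer
w_tilde = Z * norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 2)
w_tilde = w_tilde / w_tilde.sum()              # normalized weights; Z cancels out
print(np.sum(f(xs) * w_tilde))                 # consistent, but biased estimate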


What’s wrong with Importance Sampling?
the curse of dimensionality, revisited
 
▶ recall that var(ϕ̂) = var(f)/S; importance sampling replaces var(f) with var(g) = var(f · p/q)
▶ var(f · p/q) can be very large if q ≪ p somewhere, and in many dimensions this is the case almost everywhere!
▶ if p has “undiscovered islands”, some samples have p(x)/q(x) → ∞ (see the small demonstration below the figure)
[Figure: the densities p(x) and q(x) together with the weight function w(x), and a histogram (log₁₀ sample count) of the values f(x) and g(x).]
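A tiny demonstration of this blow-up, again with toy choices that are not from the slide: target p = N(0, 1) and a proposal q = N(0, 0.5²) that is too narrow. The weights have mean 1 in expectation, but the largest ones are orders of magnitude bigger, so a few tail samples dominate the variance of the estimator.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
xs = rng.normal(0.0, 0.5, size=100_000)          # proposal q = N(0, 0.5^2), narrower than p
w = norm.pdf(xs, 0, 1) / norm.pdf(xs, 0, 0.5)    # weights p(x)/q(x) = 0.5 * exp(1.5 * x^2)
print(w.mean(), w.max())                         # mean weight near 1, maximum several hundred times larger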
Sampling (Monte Carlo) Methods
Sampling is a way of performing rough probabilistic computations, in particular for expectations
(including marginalization).
▶ samples from a probability distribution can be used to estimate expectations, roughly
▶ uniformly distributed random numbers can be transformed into other distributions. In some cases this
can be done efficiently, and it is worth checking whether such a transformation exists
▶ Rejection sampling is a primitive but exact method that works with intractable models
▶ Importance sampling makes more efficient use of samples, but can have high variance (and this
may not be obvious)
Next Lecture:
▶ Markov Chain Monte Carlo methods are more elaborate ways of getting approximate answers to
intractable problems.

Probabilistic ML — P. Hennig, SS 2021 — Lecture 04: Sampling— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 22
