ML Section 19: Sampling / MCMC
Stefan Harmeling
E φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i)
Machine Learning / Stefan Harmeling / 26./31. January 2022 (WS 2021/22) 2
Properties of the estimator
Monte Carlo (MC) estimator:
φ̂(x_1, …, x_n) ∶= (1/n) ∑_{i=1}^n φ(x_i)
Mean
The MC estimator is unbiased, i.e.
E φ̂(X_1, …, X_n) = E φ(X)
Variance
The variance of the MC estimator decreases like 1/n, more precisely,
Var φ̂(X_1, …, X_n) = (1/n) Var φ(X)
With φ̂(X_1, …, X_n) = (1/n) ∑_{i=1}^n φ(X_i) and writing φ̄ ∶= E φ(X), the mean follows from linearity:
E φ̂ = E (1/n) ∑_i φ(X_i) = (1/n) ∑_i E φ(X_i) = (1/n) ∑_i φ̄ = φ̄.
For the variance, independence of the X_i gives E φ̂² = (n/n²) E φ(X)² + ((n² − n)/n²) φ̄², so
Var φ̂ = E φ̂² − (E φ̂)²
      = (n/n²) E φ(X)² + ((n² − n)/n²) φ̄² − φ̄²
      = (1/n) E φ(X)² − (1/n) φ̄²
      = (1/n) (E φ(X)² − (E φ(X))²)
      = (1/n) Var φ(X).
Simple facts:
▸ A quarter circle (with radius one) has area π/4.
▸ The uniform PDF can be written as u(x) = [0 ≤ x ≤ 1].
Writing π as an expectation:
X_1 ∼ Uniform(0, 1)
X_2 ∼ Uniform(0, 1)
π/4 = E [X_1² + X_2² < 1] = ∫ [x_1² + x_2² < 1] u(x_1) u(x_2) dx_1 dx_2
Estimate π by sampling:
▸ Sample n uniform pairs (x_1, x_2) and average the indicator,
π ≈ (4/n) ∑_{i=1}^n [x_{1,i}² + x_{2,i}² < 1]
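The π estimate above can be sketched directly in code; `estimate_pi` is a hypothetical helper name, and the factor 4 converts the quarter-circle hit rate into an estimate of π.

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points in
    the unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 < 1)
    return 4 * hits / n

print(estimate_pi(100_000))  # close to 3.14159
```

The estimator's standard deviation shrinks like 1/√n, consistent with the variance result above: quadrupling n only halves the typical error.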
Goal
▸ sample from a given PDF p(x) or from some unnormalized PDF p∗(x)
Intuitive recipe
▸ discretize the range of x into a finite number of regions and calculate
the local probability of each region
▸ sample from the resulting finite distribution
Possible problems
▸ in many dimensions we need exponentially many regions
▸ some regions of high density are more important, but how can we
find them?
A useful analogy from David MacKay’s Chapter 29.
Task
▸ estimate average plankton concentration in some lake
▸ the depth of the lake at location x is given by some unnormalized
density p∗ (x) = p(x)Zp
▸ the plankton concentration at x is φ(x)
▸ we want to estimate EX ∼p φ(X )
Approach
▸ drive around with boat to n locations xi and measure the depth
p∗ (xi ) and plankton concentration φ(xi )
▸ use some nice formula to estimate the overall plankton
concentration
Problems
▸ even while measuring p∗(x_i) you never know whether you have
reached the important points where the lake is really deep, or
whether you are missing some rare but much deeper areas
X ∼ pX (x)
Y = f (X )
pX (x) dx = pY (y ) dy
Deep insight
Let
X ∼ pX (x)
Y = FX (X )
Then
pY (y ) = [0 ≤ y ≤ 1]
pY(y) = pX(FX⁻¹(y)) ∣(FX⁻¹)′(y)∣
      = pX(FX⁻¹(y)) (FX⁻¹)′(y)
      = pX(FX⁻¹(y)) ⋅ 1/FX′(FX⁻¹(y))
      = pX(FX⁻¹(y)) / pX(FX⁻¹(y))
      = 1
where we used
▸ that the CDF is monotonically increasing, i.e. (FX⁻¹)′(y) ≥ 0
▸ the inverse function theorem, i.e. (FX−1 )′ (y ) = 1/FX′ (FX−1 (y ))
▸ and that the derivative of the CDF is the PDF, i.e. FX′ (x) = pX (x)
X ∼ pX (x) Ô⇒ Y = FX (X ) ∼ Uniform(0, 1)
Backward
Given some PDF pX, sample Y ∼ Uniform(0, 1) and set X = FX⁻¹(Y); then X ∼ pX.
Example
▸ How can we sample from an exponential distribution with PDF
pX(x) = λ exp(−λx)?
▸ Calculate the CDF: FX (x) = . . . = 1 − exp(−λx)
▸ Derive the inverse CDF: FX−1 (y ) = − log(1 − y )/λ
▸ Sample Y from a uniform distribution and transform with the
inverse CDF.
▸ Note that it does not matter whether we use − log(y)/λ or
− log(1 − y)/λ, since 1 − Y ∼ Uniform(0, 1) as well.
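The steps of the example can be sketched as follows; `sample_exponential` is a hypothetical helper name.

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-CDF sampling: draw Y ~ Uniform(0,1) and transform with
    the inverse CDF, X = -log(1 - Y)/lam.  Since 1 - Y is also
    uniform, -log(Y)/lam would work equally well."""
    rng = random.Random(seed)
    return [-math.log(1 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
print(sum(samples) / len(samples))  # close to the mean 1/lam = 0.5
```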
Box-Müller method to sample from a Gaussian
Goal
▸ sample uniformly below a PDF p(x)
▸ i.e. sample a location x0 according to p(x) and uniformly a value
y between 0 and p(x0 ) (for two-dimensional visualization)
X ∼ p(x)
Y ∣X ∼ Uniform(0, p(X ))
Note
▸ Note that the PDF of Y given X = x₀ can be written as:
p(y ∣ x₀) = [0 ≤ y ≤ p(x₀)] / p(x₀)
Goal
▸ sample from unnormalized density p∗
Trick
▸ find a density q(x) from which we can sample and a constant c > 0
such that c q(x) majorizes p∗(x), i.e.
c q(x) ≥ p∗(x) for all x.
q(x) = (1/√(2π ⋅ 4)) exp(−(x + 1)²/(2 ⋅ 4))
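As a concrete sketch of the trick (hypothetical example, not from the slides): take the unnormalized target p∗(x) = exp(−x²/2) and the Gaussian proposal q above (mean −1, variance 4); one can check that c = 6 satisfies c q(x) ≥ p∗(x) for all x, since the ratio p∗/q is maximized at x = 1/3.

```python
import math
import random

def p_star(x):
    """Unnormalized target: a standard Gaussian without its constant."""
    return math.exp(-x * x / 2)

def q_pdf(x):
    """Proposal density: Gaussian with mean -1 and variance 4."""
    return math.exp(-(x + 1) ** 2 / 8) / math.sqrt(2 * math.pi * 4)

def rejection_sample(n, c=6.0, seed=0):
    """Draw x ~ q and accept with probability p*(x) / (c q(x))."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.gauss(-1.0, 2.0)                 # sample from q
        if rng.random() * c * q_pdf(x) < p_star(x):
            samples.append(x)                    # accepted
    return samples

xs = rejection_sample(20_000)
print(sum(xs) / len(xs))  # close to 0, the mean of the target
```

Note that the acceptance rate is roughly Zp/c, so a loose bound c wastes proposals.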
Simple example
▸ consider two d dimensional Gaussians p(x) and q(x) with mean
zero and standard deviations σp and σq , with σq > σp
▸ assume we want to sample from p(x) using rejection sampling
applied to the wider Gaussian q(x)
▸ the optimal constant c, such that c q(x) ≥ p(x), can be calculated
at the origin as
c = p(0)/q(0) = (2πσq²)^{d/2} / (2πσp²)^{d/2} = (σq/σp)^d = exp(d ln(σq/σp))
EX∼p φ(X) = ∫ φ(x) p(x) dx
          = ∫ φ(x) (p(x)/q(x)) q(x) dx
          = EX∼q [φ(X) p(X)/q(X)]
▸ change the importance of the samples with weights w(x_i) = p(x_i)/q(x_i).
EX∼p φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i) p(x_i)/q(x_i) = (1/n) ∑_{i=1}^n φ(x_i) w(x_i)
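A minimal sketch of the weighted estimator, assuming the hypothetical choice p = N(0, 1) and a wider proposal q, a zero-mean Gaussian with standard deviation 2:

```python
import math
import random

def importance_estimate(phi, n, seed=0):
    """Estimate E_{X~p} phi(X) for p = N(0,1) using samples from the
    wider proposal q = N(0, sd=2), weighted by w(x) = p(x)/q(x)."""
    rng = random.Random(seed)
    def p(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    def q(x): return math.exp(-x * x / 8) / math.sqrt(2 * math.pi * 4)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)            # x ~ q
        total += phi(x) * p(x) / q(x)      # phi(x) * w(x)
    return total / n

print(importance_estimate(lambda x: x * x, 100_000))  # E X^2 = 1 under p
```

A proposal wider than the target keeps the weights bounded; the reverse choice can make the estimator's variance blow up.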
then one can show. . .
▸ in the limit n → ∞ the estimator converges to the true
expectation, i.e.
(1/n) ∑_{i=1}^n φ(x_i) w(x_i) → EX∼p φ(X)   as n → ∞
▸ in particular, for φ ≡ 1 the weights average to one:
(1/n) ∑_{i=1}^n 1 ⋅ w(x_i) = (1/n) ∑_{i=1}^n w(x_i) → EX∼p 1 = 1   as n → ∞
(1/n) ∑_{i=1}^n φ(x_i) w∗(x_i) / ((1/n) ∑_{j=1}^n w∗(x_j)) = (1/n) ∑_{i=1}^n φ(x_i) w(x_i) / ((1/n) ∑_{j=1}^n w(x_j))
Note
▸ in the first step we can replace the starred (unnormalized) weights w∗
by the weights w, since the unknown constant cancels in the ratio
▸ the numerator converges to the expectation EX∼p φ(X)
▸ the denominator converges to 1 (as shown previously)
EX∼p φ(X) = EX∼q [φ(X) p(X)/q(X)]
The variance of the corresponding importance sampling estimator is
(1/n) VarX∼q (φ(X) p(X)/q(X))
Goals
1. generate samples x1 , . . . , xn from some PDF p(x)
2. estimate the expected value:
EX∼p φ(X) = ∫ φ(x) p(x) dx ≈ (1/n) ∑_{i=1}^n φ(x_i)
Markov chain:
p(x_1, x_2, …, x_n) = p(x_1) ∏_{i=2}^n p(x_i ∣ x_{i−1})
Metropolis acceptance probability (symmetric proposal):
α(x′, x_t) = min (1, p∗(x′) / p∗(x_t))
Metropolis–Hastings acceptance probability (general proposal q):
α(x′, x_t) = min (1, (p∗(x′) q(x_t ∣ x′)) / (p∗(x_t) q(x′ ∣ x_t)))
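Both acceptance rules can be sketched in code. With a symmetric random-walk proposal the q-ratio cancels and the first rule applies; `metropolis` and the standard-Gaussian target are hypothetical example names.

```python
import math
import random

def metropolis(p_star, x0, n, step=1.0, seed=0):
    """Random-walk Metropolis: propose x' = x + noise (symmetric q,
    so the q-ratio cancels) and accept with min(1, p*(x')/p*(x))."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        x_new = x + rng.gauss(0.0, step)
        if rng.random() < min(1.0, p_star(x_new) / p_star(x)):
            x = x_new                     # accept; otherwise keep x
        chain.append(x)
    return chain

# target: unnormalized standard Gaussian
chain = metropolis(lambda x: math.exp(-x * x / 2), x0=0.0, n=100_000)
print(sum(chain) / len(chain))  # close to 0
```

Only the ratio p∗(x′)/p∗(x_t) is ever evaluated, which is why the normalization constant Zp is never needed.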
▸ assume x ≠ x ′ ,
Setup
▸ let’s consider multi-dimensional samples x = (x1 , x2 , . . . , xd )
▸ let p(x1 , x2 , . . . , xd ) be a joint probability distribution, from which it
is hard to sample
▸ however, assume that sampling from the full conditionals
p(x_1 ∣ x_2, …, x_d), etc., is easy
▸ there are d proposal distributions:
q_1(x′ ∣ x) = [x_d′ = x_d] ⋯ [x_2′ = x_2] p(x_1′ ∣ x_2, …, x_d)   (update only the first entry)
⋮
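The scheme can be sketched on a hypothetical example (not from the slides): a 2-d Gaussian with unit variances and correlation ρ, whose full conditionals are themselves Gaussian, x_1 ∣ x_2 ∼ N(ρ x_2, 1 − ρ²) and symmetrically for x_2.

```python
import math
import random

def gibbs_bivariate_gaussian(rho, n, seed=0):
    """Gibbs sampling for a 2-d Gaussian with unit variances and
    correlation rho: alternately resample each coordinate from its
    full conditional, x1 | x2 ~ N(rho*x2, 1 - rho^2), and vice versa."""
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1 - rho * rho)
    chain = []
    for _ in range(n):
        x1 = rng.gauss(rho * x2, sd)   # update first coordinate
        x2 = rng.gauss(rho * x1, sd)   # update second coordinate
        chain.append((x1, x2))
    return chain

chain = gibbs_bivariate_gaussian(rho=0.8, n=100_000)
print(sum(a * b for a, b in chain) / len(chain))  # close to rho = 0.8
```

Every proposal is accepted; the price is slow mixing when the coordinates are strongly correlated.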
Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Steps: (code box copied from MacKay’s book page 375)
“Stepping out” (code box copied from MacKay’s book page 375)
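Since the code boxes themselves did not survive extraction, here is a hedged sketch of a one-dimensional slice sampler with "stepping out", after MacKay's Chapter 29; the shrinking rule follows the standard algorithm, and `slice_sample` is a hypothetical name, not the exact box from the book.

```python
import math
import random

def slice_sample(p_star, x0, n, w=1.0, seed=0):
    """1-d slice sampling with 'stepping out': draw a height
    u ~ Uniform(0, p*(x)), widen an interval of width w until both
    ends fall below the slice, then sample within it, shrinking the
    interval on each rejection."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        u = rng.random() * p_star(x)      # uniform height under p*
        left = x - rng.random() * w       # randomly placed interval
        right = left + w
        while p_star(left) > u:           # step out to the left
            left -= w
        while p_star(right) > u:          # step out to the right
            right += w
        while True:                       # sample and shrink
            x_new = rng.uniform(left, right)
            if p_star(x_new) > u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        chain.append(x)
    return chain

chain = slice_sample(lambda x: math.exp(-x * x / 2), x0=0.0, n=50_000)
print(sum(chain) / len(chain))  # close to 0
```

Unlike Metropolis, no step size has to be tuned carefully: the interval adapts via stepping out and shrinking.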
▸ Not sure why there is the following condition: "If there is n₀ so that
Kⁿ(x, y) ≥ 0 for all n ≥ n₀ . . . ". If you know why, please let me
know, and I will include it here!
▸ This might be an instance of https://de.wikipedia.org/wiki/Satz_von_Perron-Frobenius
Metropolis algorithm
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ X finite state space
▸ π(x) a probability on X , (possibly unnormalized)
▸ J(x, y ) a transition matrix defining some Markov chain on X with
J(x, y ) > 0 ⇐⇒ J(y , x) > 0.
▸ J and π can be unrelated.
▸ The Metropolis algorithm transforms J to new transition
probabilities K such that the Markov chain defined by K has π as
its stationary distribution (where A(x, y ) = π(y )J(y , x)/(π(x)J(x, y ))):
K(x, y) = J(x, y)                                              if x ≠ y and A(x, y) ≥ 1
K(x, y) = J(x, y) A(x, y)                                      if x ≠ y and A(x, y) < 1
K(x, y) = J(x, y) + ∑_{z ∶ A(x,z)<1} J(x, z)(1 − A(x, z))      if x = y