
Machine Learning

Section 19: Sampling and MCMC

Stefan Harmeling

26./31. January 2022 (WS 2021/22)



What is sampling?
following David MacKay, Chapter 29.1

Monte Carlo methods


“Monte Carlo methods are computational techniques that make use of
random numbers.” (quoting David MacKay, Chapter 29.1)

Goals of Monte Carlo methods


Solve the following two problems:
1. generate samples x1 , . . . , xn from some PDF p(x)
2. estimate the expected value of some function φ(x) for a certain
PDF p(x), i.e.

E φ(X) = E_{X∼p} φ(X) = ∫ φ(x) p(x) dx

by replacing the integral with a sum over samples:

E φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i)
Properties of the estimator
Monte Carlo (MC) estimator:

φ̂(x₁, …, x_n) := (1/n) ∑_{i=1}^n φ(x_i)

Mean
The MC estimator is unbiased, i.e.

E φ̂(X₁, …, X_n) = E φ(X)

Variance
The variance of the MC estimator decreases like 1/n, more precisely,

Var φ̂(X₁, …, X_n) = (1/n) Var φ(X)
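
The following minimal numpy sketch checks both properties empirically (the target E φ(X) = E X² = 1 for X ∼ N(0, 1), the sample sizes, and the number of repetitions are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.square  # estimate E φ(X) = E X² = 1 for X ~ N(0, 1)

for n in (100, 10_000):
    # 1000 independent MC estimates, each averaging n samples
    estimates = phi(rng.standard_normal((1000, n))).mean(axis=1)
    print(n, estimates.mean(), estimates.var())  # mean ≈ 1, variance shrinks like 1/n
```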



More details:

▸ The estimator can be seen as a function of random variables:

φ̂(X₁, …, X_n) = (1/n) ∑_{i=1}^n φ(X_i)

▸ Calculating the expected value yields

E φ̂ = E φ̂(X₁, …, X_n) = E (1/n) ∑_i φ(X_i) = (1/n) ∑_i E φ(X_i) = (1/n) ∑_i φ̄ = φ̄,

where we abbreviate the mean as φ̄ = E φ(X).



More details:

Calculating the variance yields

E(φ̂ − φ̄)² = E(φ̂² − φ̂ φ̄ − φ̄ φ̂ + φ̄²) = E φ̂² − 2 φ̄ E φ̂ + φ̄²
          = E φ̂² − φ̄²
          = E (1/n²) ∑_{i,j} φ(X_i) φ(X_j) − φ̄²
          = (1/n²) E ∑_i φ(X_i)² + (1/n²) E ∑_{i≠j} φ(X_i) φ(X_j) − φ̄²
          = (1/n²) ∑_i E φ(X_i)² + (1/n²) ∑_{i≠j} φ̄² − φ̄²
          = (n/n²) E φ(X)² + ((n² − n)/n² − 1) φ̄²
          = (1/n) E φ(X)² − (1/n) φ̄²
          = (1/n) (E φ(X)² − (E φ(X))²) = (1/n) Var φ(X).



More notes:

▸ E φ̄ = φ̄ since φ̄ is a constant and not random.
▸ φ̄ E φ̂ = φ̄²
▸ Since X_i and X_j are independent for i ≠ j, we have
  E φ(X_i) φ(X_j) = E φ(X_i) E φ(X_j)
▸ Var(X) = E X² − (E X)²



Sometimes sampling is easy...



Estimating the value of π

Simple facts:
▸ A quarter circle (with radius one) has area π/4.
▸ The uniform PDF can be written as u(x) = [0 ≤ x ≤ 1].
Writing π as an expectation:

X1 ∼ Uniform(0, 1)
X2 ∼ Uniform(0, 1)
π = 4 E [X₁² + X₂² < 1] = 4 ∫∫ [x₁² + x₂² < 1] u(x₁) u(x₂) dx₁ dx₂

Estimate π by sampling:
▸ Sample n uniform pairs (x₁, x₂) to estimate π,

  π ≈ (4/n) ∑_{i=1}^n [x_{1,i}² + x_{2,i}² < 1]
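
A minimal numpy sketch of this estimator (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x1, x2 = rng.uniform(0, 1, size=(2, n))   # n uniform pairs in the unit square
pi_hat = 4 * np.mean(x1**2 + x2**2 < 1)   # fraction in the quarter circle, times 4
print(pi_hat)                             # ≈ 3.14
```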



In general sampling is hard...



General sampling can be difficult

Goal
▸ sample from a given PDF p(x) or from some unnormalized PDF
p∗ (x)
Intuitive recipe
▸ discretize the range of x into a finite number of regions and calculate the local probabilities of those regions
▸ sample from the finite distribution
Possible problems
▸ in many dimensions we need exponentially many regions
▸ some regions of high density matter more for the result, but how can we find them?
A useful analogy from David MacKay’s Chapter 29.



Lake analogy from David MacKay
see “A useful analogy” in David MacKay’s book, page 359

Task
▸ estimate average plankton concentration in some lake
▸ the depth of the lake at location x is given by some unnormalized
density p∗ (x) = p(x)Zp
▸ the plankton concentration at x is φ(x)
▸ we want to estimate EX ∼p φ(X )
Approach
▸ drive around with boat to n locations xi and measure the depth
p∗ (xi ) and plankton concentration φ(xi )
▸ use some nice formula to estimate the overall plankton
concentration
Problems
▸ even while measuring p∗(x_i) you never know whether you have reached the important points where the lake is really deep, or whether you are missing some rare but much deeper areas



The lake from David MacKay’s book
Figure 29.3 copied from his book



If the PDF has a nice form and we are good at integration, we can use the transformation-of-variables trick. . .



Recall
Transformation of variables
For an invertible function f

X ∼ pX (x)
Y = f (X )

the PDF of the transformed random variable Y is


p_Y(y) = p_X(f⁻¹(y)) |(dx/dy)(y)|

where (dx/dy)(y) = (f⁻¹)′(y) is the derivative of the inverse of f.


Note that under transformations the probabilities should stay the same,
i.e. informally this can be written as

pX (x) dx = pY (y ) dy

to easily memorize the formula.


A most useful example

Given a PDF pX (x) let FX (x) be the corresponding CDF, i.e.


F_X(x) = ∫_{−∞}^x p_X(x′) dx′

Deep insight
Let

X ∼ pX (x)
Y = FX (X )

Then

pY (y ) = [0 ≤ y ≤ 1]

i.e. Y ∼ Uniform(0, 1) is uniformly distributed between zero and one.



Details:

Applying the transformation of variables for y ∈ [0, 1]:

p_Y(y) = p_X(F_X⁻¹(y)) |(F_X⁻¹)′(y)|
       = p_X(F_X⁻¹(y)) (F_X⁻¹)′(y)
       = p_X(F_X⁻¹(y)) / F_X′(F_X⁻¹(y))
       = p_X(F_X⁻¹(y)) / p_X(F_X⁻¹(y))
       = 1

where we used
▸ that the CDF is monotonically increasing, i.e. (F_X⁻¹)′(y) ≥ 0
▸ the inverse function theorem, i.e. (FX−1 )′ (y ) = 1/FX′ (FX−1 (y ))
▸ and that the derivative of the CDF is the PDF, i.e. FX′ (x) = pX (x)



Transforming uniform into anything
Forward

X ∼ p_X(x) ⟹ Y = F_X(X) ∼ Uniform(0, 1)

Backward
Given some PDF pX

Y ∼ Uniform(0, 1) ⟹ X = F_X⁻¹(Y) ∼ p_X(x)

Example
▸ How can we sample from an exponential distribution with PDF p_X(x) = λ exp(−λx)?
▸ Calculate the CDF: FX (x) = . . . = 1 − exp(−λx)
▸ Derive the inverse CDF: FX−1 (y ) = − log(1 − y )/λ
▸ Sample Y from a uniform distribution and transform with the
inverse CDF.
▸ Note that it does not matter whether we use − log(y)/λ or − log(1 − y)/λ, since 1 − Y is also uniform on (0, 1).
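
A minimal sketch of inverse-transform sampling for this exponential example (λ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                        # rate λ (arbitrary illustrative choice)
y = rng.uniform(0, 1, 100_000)
x = -np.log(1 - y) / lam         # inverse CDF; -np.log(y)/lam works equally well
print(x.mean())                  # ≈ 1/λ = 0.5
```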
Box–Muller method to sample from a Gaussian

Recipe: Sample polar coordinates and transform to Cartesian:

R ∼ Exponential(1/2)      (squared magnitude)
φ ∼ Uniform(0, 2π)
X₁ = √R cos φ
X₂ = √R sin φ

It can be shown that:


▸ X1 and X2 are both Gaussian distributed
▸ for the proof use the transformation of variables formula (this is not
so easy, since we have to use a two-dimensional transformation
formula)
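
A minimal sketch of the recipe (note that numpy parametrizes the exponential by the scale 1/rate = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r = rng.exponential(scale=2.0, size=n)   # Exponential(1/2): rate 1/2 = scale 2
phi = rng.uniform(0, 2 * np.pi, size=n)
x1 = np.sqrt(r) * np.cos(phi)            # square root of the squared magnitude
x2 = np.sqrt(r) * np.sin(phi)
print(x1.mean(), x1.std(), x2.std())     # ≈ 0, 1, 1
```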



Other general methods for sampling



Graphically sampling under some density

Goal
▸ sample uniformly below a PDF p(x)
▸ i.e. sample a location x0 according to p(x) and uniformly a value
y between 0 and p(x0 ) (for two-dimensional visualization)

X ∼ p(x)
Y ∣X ∼ Uniform(0, p(X ))

Note
▸ Note that the PDF of Y given X = x₀ can be written as:

  p(y ∣ x₀) = u(y/p(x₀)) / p(x₀) = [0 ≤ y ≤ p(x₀)] / p(x₀)

  so the joint density p(x₀) p(y ∣ x₀) = [0 ≤ y ≤ p(x₀)] is uniform under the graph of p



As an example let's sample from a standard normal Gaussian, i.e. with mean zero and standard deviation one:

p(x) = (1/√(2π)) exp(−x²/2)
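
A minimal sketch of this two-dimensional view; a scatter plot of the resulting (x, y) pairs fills the area under the Gaussian curve uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal PDF
x = rng.standard_normal(10_000)   # X ~ p(x)
y = rng.uniform(0, p(x))          # Y | X ~ Uniform(0, p(X))
# plotting (x, y), e.g. with matplotlib, shows points uniform under the curve
```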



Rejection sampling



Rejection sampling
from David MacKay’s Chapter 29

Goal
▸ sample from unnormalized density p∗
Trick
▸ find a density q(x) from which we can sample and a constant c > 0 such that c q(x) majorizes p∗(x), i.e.

  c q(x) ≥ p∗(x) for all x

▸ sample uniformly under c q(x) in two dimensions (see previous slide) and keep only those samples that fall below p∗(x)



Rejection sampling
from David MacKay’s Chapter 29
Goal
▸ sample from unnormalized density p∗
Example
▸ we want to get samples from the unnormalized PDF

p∗ (x) = exp(0.4(x − 0.4)2 − 0.08x 4 )

▸ the Gaussian PDF

  q(x) = (1/√(2π · 4)) exp(−(x + 1)²/(2 · 4))

  majorizes p∗: for c = 17 we have c q(x) ≥ p∗(x)


Steps
1. sample x1 from q(x)
2. sample y1 from Uniform(0, c q(x1 ))
3. if y1 < p∗ (x1 ) then x1 is a sample from p∗ else goto step 1
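
A minimal sketch of these three steps for the example above (the number of retained samples is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4)**2 - 0.08 * x**4)   # unnormalized target
mu, sigma, c = -1.0, 2.0, 17.0                                # proposal N(-1, 4), c from the slide
q = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

samples = []
while len(samples) < 10_000:
    x1 = rng.normal(mu, sigma)        # step 1: sample from q
    y1 = rng.uniform(0, c * q(x1))    # step 2: uniform height under c q
    if y1 < p_star(x1):               # step 3: keep only points under p*
        samples.append(x1)
```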
Problems with rejection sampling in many dimensions

Simple example
▸ consider two d-dimensional Gaussians p(x) and q(x) with mean zero and standard deviations σ_p and σ_q, with σ_q > σ_p
▸ assume we want to sample from p(x) using rejection sampling
applied to the wider Gaussian q(x)
▸ the optimal constant c, such that c q(x) ≥ p(x), can be calculated at the origin as

  c = p(0)/q(0) = (2π σ_q²)^{d/2} / (2π σ_p²)^{d/2} = (σ_q/σ_p)^d = exp(d ln(σ_q/σ_p))

▸ c is also the volume under c q(x), so the acceptance rate will be 1/c, which can get arbitrarily small for large d



Importance sampling



Basic importance sampling
Goal
▸ estimate the expected value of φ(x) for X ∼ p(x), i.e. EX ∼p φ(X )
Reweighting solution
▸ generate samples from another density q(x)
▸ use a weighted average that adjusts the importance of the
samples
Derive the weights
▸ extend the expectation with q(x)/q(x):

  E_{X∼p} φ(X) = ∫ φ(x) p(x) dx = ∫ φ(x) (p(x)/q(x)) q(x) dx = E_{X∼q} [φ(X) p(X)/q(X)]

▸ change the importance of the samples x_i ∼ q with weights w(x_i) = p(x_i)/q(x_i):

  E_{X∼p} φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i) p(x_i)/q(x_i) = (1/n) ∑_{i=1}^n φ(x_i) w(x_i)
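
A minimal sketch of basic importance sampling (the target p = N(0, 1), proposal q = N(0, 4), and test function φ(x) = x² are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.square                                          # estimate E[X²] = 1 under p
log_p = lambda x: -x**2 / 2 - 0.5 * np.log(2 * np.pi)    # target p = N(0, 1)
log_q = lambda x: -x**2 / 8 - 0.5 * np.log(8 * np.pi)    # proposal q = N(0, 4)

x = rng.normal(0, 2, 100_000)      # samples from q
w = np.exp(log_p(x) - log_q(x))    # importance weights w(x_i) = p(x_i)/q(x_i)
print(np.mean(phi(x) * w))         # ≈ 1
```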
then one can show. . .
▸ in the limit n → ∞ the estimator converges to the true expectation, i.e.

  (1/n) ∑_{i=1}^n φ(x_i) w(x_i) → E_{X∼p} φ(X)

▸ for self-normalized importance sampling (coming up) we need this result also for the constant function φ(x) = 1:

  (1/n) ∑_{i=1}^n 1 · w(x_i) = (1/n) ∑_{i=1}^n w(x_i) → E_{X∼p} 1 = 1



Self-normalized importance sampling
Problem
▸ what can we do if we only have access to the unnormalized
density p∗ (x) = p(x)Zp and to samples
x1 , . . . , xn
from the unnormalized density q ∗ (x) = q(x)Zq ?
Solution
▸ define the ratio of the unnormalized PDFs,

  w∗(x) = p∗(x)/q∗(x) = (p(x) Z_p)/(q(x) Z_q)

▸ define the ratio of the normalized PDFs as well:

  w(x) = p(x)/q(x)

▸ after evaluating these weights on the random samples x₁, …, x_n, notice that for all i the quotients agree:

  w(x_i) / ((1/n) ∑_{j=1}^n w(x_j)) = w∗(x_i) / ((1/n) ∑_{j=1}^n w∗(x_j))   since the constant Z_p/Z_q cancels
Self-normalized importance sampling - cont.

Thus the estimator based on the unnormalized PDFs can be shown to estimate the correct expectation:

  (1/n) ∑_{i=1}^n φ(x_i) w∗(x_i) / ((1/n) ∑_{j=1}^n w∗(x_j))
    = (1/n) ∑_{i=1}^n φ(x_i) w(x_i) / ((1/n) ∑_{j=1}^n w(x_j))
    = ((1/n) ∑_{i=1}^n φ(x_i) w(x_i)) / ((1/n) ∑_{j=1}^n w(x_j))
    → E_{X∼p} φ(X) / 1 = E_{X∼p} φ(X)   as n → ∞

Note
▸ in the first step we replace the starred weights by the unstarred ones
▸ the numerator converges to the expectation
▸ the denominator converges to 1 (as shown previously)
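
A minimal sketch of the self-normalized estimator, using only unnormalized densities (target and proposal are the illustrative Gaussians from the basic importance-sampling sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
log_p_star = lambda x: -x**2 / 2   # unnormalized target p*: N(0, 1) without its constant
log_q_star = lambda x: -x**2 / 8   # unnormalized proposal q*: N(0, 4) without its constant

x = rng.normal(0, 2, 100_000)                   # samples from q
w_star = np.exp(log_p_star(x) - log_q_star(x))  # unnormalized weights w*(x_i)
print(np.sum(x**2 * w_star) / np.sum(w_star))   # self-normalized estimate of E[X²] ≈ 1
```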



Problems of importance sampling

E_{X∼p} φ(X) = E_{X∼q} [φ(X) p(X)/q(X)]

What can happen to the variance?


▸ instead of φ(x) we have to consider φ(x) p(x)/q(x)
▸ assume that there are x with q(x) ≪ p(x)
▸ then the estimator variance

  (1/n) Var_{X∼q}(φ(X) p(X)/q(X))

  can get arbitrarily large. . .



Markov Chain Monte Carlo (MCMC) methods



Monte Carlo methods

Goals
1. generate samples x1 , . . . , xn from some PDF p(x)
2. estimate the expected value:

E_{X∼p} φ(X) = ∫ φ(x) p(x) dx ≈ (1/n) ∑_{i=1}^n φ(x_i)

Generate IID samples (last time)


▸ Transformation of samples (e.g. uniform): mathematically
challenging to derive inverse CDF (aka quantile function)
▸ Rejection sampling, importance sampling: proposal q(x) must be
very similar to p(x)
Generate dependent samples
▸ given a current sample x, let the proposal q(x ′ ∣x) depend on x
▸ hence the name: Markov chain Monte Carlo (or MCMC)



What is a Markov chain?

Sequence of independent variables:

  p(x₁, x₂, …, x_n) = ∏_{i=1}^n p(x_i)

General sequence of random variables:

  p(x₁, x₂, …, x_n) = p(x₁) ∏_{i=2}^n p(x_i ∣ x₁, …, x_{i−1})

Markov chain:

  p(x₁, x₂, …, x_n) = p(x₁) ∏_{i=2}^n p(x_i ∣ x_{i−1})



Metropolis method
Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Iterate:
1. given xt , sample candidate x ′ from proposal distribution q(x ′ ∣xt )
(must be symmetric, i.e. q(x ′ ∣x) = q(x∣x ′ ))
2. calculate acceptance ratio:

α(x′, x_t) = min(1, p∗(x′)/p∗(x_t))

3. accept x′ with probability α, i.e. set x_{t+1} = x′
4. otherwise stay, i.e. set x_{t+1} = x_t
Notes
▸ initialize x0 arbitrarily
▸ every iteration generates a sample
▸ discard initial samples (burn-in phase)
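
A minimal sketch of the Metropolis iteration, reusing the unnormalized target from the rejection-sampling example (proposal width, chain length, and burn-in are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4)**2 - 0.08 * x**4)  # unnormalized target

x = 0.0                                           # arbitrary initialization
chain = []
for t in range(50_000):
    x_cand = x + rng.normal(0, 1)                 # symmetric Gaussian proposal q(x'|x_t)
    alpha = min(1.0, p_star(x_cand) / p_star(x))  # acceptance ratio
    if rng.uniform() < alpha:
        x = x_cand                                # accept, otherwise stay
    chain.append(x)
samples = chain[1_000:]                           # discard burn-in
```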



Metropolis-Hastings method
Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Iterate:
1. given xt , sample candidate x ′ from proposal distribution q(x ′ ∣xt )
(needn’t be symmetric, i.e. possibly q(x ′ ∣x) ≠ q(x∣x ′ ))
2. calculate acceptance ratio:

α(x′, x_t) = min(1, (p∗(x′) q(x_t ∣ x′)) / (p∗(x_t) q(x′ ∣ x_t)))

3. accept x′ with probability α, i.e. set x_{t+1} = x′
4. otherwise stay, i.e. set x_{t+1} = x_t
Notes
▸ initialize x0 arbitrarily
▸ every iteration generates a sample
▸ discard initial samples (burn-in phase)



Of course Metropolis is a special case of Metropolis-Hastings. Why?



Metropolis-Hastings method: why does it work?
Intuition
▸ the sampler should find x′ where p∗(x′) is large
▸ it will accept the candidate for sure if p∗(x′) is at least as large as p∗(x_t)
▸ otherwise, it still accepts x′ with probability α(x′, x_t) even though p∗(x′) is smaller (to keep enough variability)
Formally
▸ given p∗(x) and q(x′∣x) the Metropolis-Hastings method implies the transition probabilities

  h(x_{t+1} ∣ x_t) = { q(x_{t+1} ∣ x_t) α(x_{t+1}, x_t)                       for x_{t+1} ≠ x_t
                    { q(x_{t+1} ∣ x_t) + ∫ q(x′ ∣ x_t) (1 − α(x′, x_t)) dx′   for x_{t+1} = x_t

▸ the first case is accepting a new point; the second case is either sampling the previous point again (the first summand) or rejecting a candidate x′ (the integral)



Detailed balance
Notice
▸ there are three probability distributions:
1. the wanted distribution p∗(x) (which can also be evaluated at x′)
2. the proposal distribution q(x ′ ∣x)
3. the transition distribution h(x ′ ∣x), which is also a conditional
distribution defined by the Metropolis-Hastings method for p∗ (x) and
q(x ′ ∣x)
▸ p(x) (the normalized PDF) and h(x ′ ∣x) fulfill the detailed balance
condition, i.e. for all x and x ′

p(x) h(x ′ ∣x) = p(x ′ ) h(x∣x ′ )

▸ the latter implies that p(x) is a stationary distribution of the Markov chain defined by the transition probability h(x′∣x), because

  ∫ p(x) h(x′∣x) dx = ∫ p(x′) h(x∣x′) dx = p(x′) ∫ h(x∣x′) dx = p(x′)

▸ To show convergence of the Markov chain, we need ergodicity (which is beyond the scope of this lecture).
Proof of detailed balance

▸ for x = x′ detailed balance holds trivially,

p(x) h(x ′ ∣x) = p(x ′ ) h(x∣x ′ )

▸ assume x ≠ x′, then

  p(x) h(x′∣x) = p(x) q(x′∣x) α(x′, x)
               = p(x) q(x′∣x) min(1, (p(x′) q(x∣x′)) / (p(x) q(x′∣x)))
               = min(p(x) q(x′∣x), p(x′) q(x∣x′))
               = p(x′) q(x∣x′) min((p(x) q(x′∣x)) / (p(x′) q(x∣x′)), 1)
               = p(x′) h(x∣x′)

▸ we can replace p∗ with p in α, since Zp cancels



Examples of MCMC
▸ Gibbs sampling
▸ Slice sampling



Gibbs sampling - two variables
Setup
▸ let’s consider two-dimensional samples x = (x1 , x2 )
▸ let p(x1 , x2 ) be a joint probability distribution, from which it is
difficult to sample
▸ however, assume that sampling from the conditional distributions
p(x1 ∣x2 ) and p(x2 ∣x1 ) is easy
▸ there are two proposal distributions:

  q₁(x′∣x) = [x₂ = x₂′] p(x₁′ ∣ x₂)   (update only the first entry)
  q₂(x′∣x) = [x₁ = x₁′] p(x₂′ ∣ x₁)   (update only the second entry)

▸ this implies acceptance ratios of one (use [x₂ = x₂′] = 1):

  (p(x₁, x₂) q₁(x′∣x)) / (p(x₁′, x₂′) q₁(x∣x′)) = (p(x₂) p(x₁∣x₂) p(x₁′∣x₂)) / (p(x₂′) p(x₁′∣x₂′) p(x₁∣x₂′)) = 1

  since x₂′ = x₂; similarly for q₂(x′∣x)


▸ thus Gibbs sampling is Metropolis-Hastings that always accepts
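
A minimal sketch of this two-variable Gibbs sampler for an illustrative bivariate Gaussian, whose full conditionals are one-dimensional Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the illustrative bivariate Gaussian
x1, x2 = 0.0, 0.0
chain = []
for t in range(20_000):
    # sample each coordinate from its conditional given the current other one
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # p(x2 | x1)
    chain.append((x1, x2))
samples = np.array(chain[1_000:])                    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])                  # ≈ rho
```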
Gibbs sampling - several variables

Setup
▸ let’s consider multi-dimensional samples x = (x1 , x2 , . . . , xd )
▸ let p(x1 , x2 , . . . , xd ) be a joint probability distribution, from which it
is hard to sample
▸ however, assume that sampling from the full conditionals p(x₁ ∣ x₂, …, x_d), etc. is easy
▸ there are d proposal distributions:

  q₁(x′∣x) = [x_d = x_d′] ⋯ [x₂ = x₂′] p(x₁′ ∣ x₂, …, x_d)   (update only the first entry)

▸ again all acceptance ratios are one


▸ thus Gibbs sampling is Metropolis-Hastings that always accepts



Slice sampling



Slice sampling - one variable (1D)
closely following David MacKay, Sec. 29.7

Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Steps: (code box copied from MacKay’s book page 375)



Slice sampling - one variable (1D)
closely following David MacKay, Sec. 29.7

“Stepping out” (code box copied from MacKay’s book page 375)

“Shrinking” (code box copied from MacKay’s book page 375)
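
A minimal Python sketch of the procedure in these code boxes (the step size w and the initial interval placement are simplifications of MacKay's version):

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sample_step(x, p_star, w=1.0):
    """One slice-sampling update for a 1D unnormalized density p_star."""
    y = rng.uniform(0, p_star(x))      # uniform height under the curve at x
    # "stepping out": enlarge the interval until both ends leave the slice
    left = x - rng.uniform(0, w)
    right = left + w
    while p_star(left) > y:
        left -= w
    while p_star(right) > y:
        right += w
    # "shrinking": propose in the interval, shrink it towards x on rejection
    while True:
        x_new = rng.uniform(left, right)
        if p_star(x_new) > y:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new
```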



Finally: an interesting MCMC paper

▸ paper: Persi Diaconis, The MCMC revolution, 2008


▸ see code 18-MCMC-demo-Diaconis.ipynb
▸ Example of discrete MCMC



Persi Diaconis, from http://statweb.stanford.edu/~cgates/PERSI/ or
https://diaconis.ckirby.su.domains/



Diaconis “brief treatise on Markov Chains”
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ X a set of finitely many states
▸ a Markov chain is defined by a matrix K with entries K(x, y), s.t.

  K(x, y) ≥ 0 and ∑_y K(x, y) = 1 for all x

▸ each row of K defines a probability distribution; the entry K(x, y) is the probability of a transition from x to y
▸ this defines a chain X₀ = x, X₁ = y, X₂ = z, …
▸ written as probabilities:

  p(X₁ = y ∣ X₀ = x) = K(x, y)
  p(X₁ = y, X₂ = z ∣ X₀ = x) = K(x, y) K(y, z)
  p(X₂ = z ∣ X₀ = x) = ∑_y K(x, y) K(y, z)

▸ n-th power of the matrix:

  K^n(x, y) = p(X_n = y ∣ X₀ = x)
Stationary distribution
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ A Markov chain defined by K(x, y) has stationary distribution π(x) > 0, ∑_x π(x) = 1, if

  ∑_x π(x) K(x, y) = π(y)

▸ π is a left eigenvector of K with eigenvalue 1.

Theorem 1 (Fundamental Theorem of Markov Chains)
Let X be a finite set of states and K(x, y) the transition matrix of a Markov chain on X. If there is an n₀ such that K^n(x, y) > 0 for all n ≥ n₀, then K has a unique stationary distribution π and, for n → ∞,

  K^n(x, y) → π(y) for each x, y ∈ X

▸ The condition that K^n(x, y) > 0 for all n ≥ n₀ makes the chain irreducible and aperiodic; without it the limit need not exist or the stationary distribution need not be unique.
▸ This is an instance of the Perron–Frobenius theorem:
  https://de.wikipedia.org/wiki/Satz_von_Perron-Frobenius
Metropolis algorithm
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ X finite state space
▸ π(x) a probability on X (possibly unnormalized)
▸ J(x, y ) a transition matrix defining some Markov chain on X with
J(x, y ) > 0 ⇐⇒ J(y , x) > 0.
▸ J and π can be unrelated.
▸ The Metropolis algorithm transforms J to new transition
probabilities K such that the Markov chain defined by K has π as
its stationary distribution (where A(x, y ) = π(y )J(y , x)/(π(x)J(x, y ))):

          ⎧ J(x, y)                                           if x ≠ y, A(x, y) ≥ 1
K(x, y) = ⎨ J(x, y) A(x, y)                                   if x ≠ y, A(x, y) < 1
          ⎩ J(x, y) + ∑_{z: A(x,z)<1} J(x, z) (1 − A(x, z))   if x = y

▸ now one can show detailed balance for π and K:

  π(x) K(x, y) = π(y) K(y, x)

▸ which implies that π is a stationary distribution of K:

  ∑_x π(x) K(x, y) = ∑_x π(y) K(y, x) = π(y) ∑_x K(y, x) = π(y)
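
A minimal numpy sketch of this construction (the 3-state example with a uniform proposal is illustrative):

```python
import numpy as np

def metropolize(J, pi):
    """Turn a proposal matrix J (rows sum to 1, all entries > 0) into the
    Metropolis chain K that has pi as its stationary distribution."""
    A = (pi[None, :] * J.T) / (pi[:, None] * J)   # A[x, y] = pi(y)J(y,x)/(pi(x)J(x,y))
    K = np.where(A >= 1, J, J * A)                # accept or thin the off-diagonal moves
    np.fill_diagonal(K, 0)
    np.fill_diagonal(K, 1 - K.sum(axis=1))        # rejected mass stays at x
    return K

J = np.full((3, 3), 1 / 3)              # uniform proposal on 3 states
pi = np.array([1.0, 2.0, 3.0]) / 6.0    # illustrative target distribution
K = metropolize(J, pi)
print(np.allclose(pi @ K, pi))          # True: pi is stationary for K
```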
Operator view

▸ The Metropolis algorithm transforms J into new transition probabilities K such that the Markov chain defined by K has π as its stationary distribution:

            ⎧ J(x, y)                                           if x ≠ y, A(x, y) ≥ 1
  K(x, y) = ⎨ J(x, y) A(x, y)                                   if x ≠ y, A(x, y) < 1
            ⎩ J(x, y) + ∑_{z: A(x,z)<1} J(x, z) (1 − A(x, z))   if x = y

▸ This defines an operator (function) on Markov chains.
▸ What happens if we apply it twice or three times?
▸ Does the Markov chain converge faster?



18-MCMC-demo-Diaconis.ipynb



END of Sampling/MCMC section.

However, this is also the

END of the Machine Learning lecture!

THANK YOU FOR YOUR ATTENTION!

