
Machine Learning

Section 19: Sampling and MCMC

Stefan Harmeling

26./31. January 2022 (WS 2021/22)



What is sampling?
following David MacKay, Chapter 29.1

Monte Carlo methods


“Monte Carlo methods are computational techniques that make use of
random numbers.” (quoting David MacKay, Chapter 29.1)

Goals of Monte Carlo methods


Solve the following two problems:
1. generate samples x1 , . . . , xn from some PDF p(x)
2. estimate the expected value of some function φ(x) for a certain
PDF p(x), i.e.

E φ(X) = E_{X∼p} φ(X) = ∫ φ(x) p(x) dx

by replacing the integral with a sum over samples:

E φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i)
Properties of the estimator
Monte Carlo (MC) estimator:

φ̂(x₁, …, x_n) := (1/n) ∑_{i=1}^n φ(x_i)

Mean
The MC estimator is unbiased, i.e.

E φ̂(X₁, …, X_n) = E φ(X)

Variance
The variance of the MC estimator decreases like 1/n, more precisely,

Var φ̂(X₁, …, X_n) = (1/n) Var φ(X)
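
The following minimal numpy sketch checks both properties empirically (the target E φ(X) = E X² = 1 for X ∼ N(0, 1), the sample sizes, and the number of repetitions are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.square  # estimate E φ(X) = E X² = 1 for X ~ N(0, 1)

for n in (100, 10_000):
    # 1000 independent MC estimates, each averaging n samples
    estimates = phi(rng.standard_normal((1000, n))).mean(axis=1)
    print(n, estimates.mean(), estimates.var())  # mean ≈ 1, variance shrinks like 1/n
```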



More details:

▸ The estimator can be seen as a function of random variables:

φ̂(X₁, …, X_n) = (1/n) ∑_{i=1}^n φ(X_i)

▸ Calculating the expected value yields

E φ̂ = E φ̂(X₁, …, X_n) = E (1/n) ∑_i φ(X_i) = (1/n) ∑_i E φ(X_i) = (1/n) ∑_i φ̄ = φ̄,

where we abbreviate the mean as φ̄ = E φ(X).



More details:

Calculating the variance yields

E(φ̂ − φ̄)² = E(φ̂² − φ̂ φ̄ − φ̄ φ̂ + φ̄²) = E φ̂² − 2 φ̄ E φ̂ + φ̄²
          = E φ̂² − φ̄²
          = E (1/n²) ∑_{i,j} φ(X_i) φ(X_j) − φ̄²
          = (1/n²) E ∑_i φ(X_i)² + (1/n²) E ∑_{i≠j} φ(X_i) φ(X_j) − φ̄²
          = (1/n²) ∑_i E φ(X_i)² + (1/n²) ∑_{i≠j} φ̄² − φ̄²
          = (n/n²) E φ(X)² + ((n² − n)/n² − 1) φ̄²
          = (1/n) E φ(X)² − (1/n) φ̄²
          = (1/n) (E φ(X)² − (E φ(X))²) = (1/n) Var φ(X).



More notes:

▸ E φ̄ = φ̄ since φ̄ is a constant and not random.
▸ φ̄ E φ̂ = φ̄²
▸ Since X_i and X_j are independent for i ≠ j, we have
  E φ(X_i) φ(X_j) = E φ(X_i) E φ(X_j)
▸ Var(X) = E X² − (E X)²



Sometimes sampling is easy...



Estimating the value of π

Simple facts:
▸ A quarter circle (with radius one) has area π/4.
▸ The uniform PDF can be written as u(x) = [0 ≤ x ≤ 1].
Writing π as an expectation:

X1 ∼ Uniform(0, 1)
X2 ∼ Uniform(0, 1)
π = 4 E [X₁² + X₂² < 1] = 4 ∫∫ [x₁² + x₂² < 1] u(x₁) u(x₂) dx₁ dx₂

Estimate π by sampling:
▸ Sample n uniform pairs (x₁, x₂) to estimate π,

  π ≈ (4/n) ∑_{i=1}^n [x_{1,i}² + x_{2,i}² < 1]
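
A minimal numpy sketch of this estimator (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x1, x2 = rng.uniform(0, 1, size=(2, n))   # n uniform pairs in the unit square
pi_hat = 4 * np.mean(x1**2 + x2**2 < 1)   # fraction in the quarter circle, times 4
print(pi_hat)                             # ≈ 3.14
```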



In general sampling is hard...



General sampling can be difficult

Goal
▸ sample from a given PDF p(x) or from some unnormalized PDF
p∗ (x)
Intuitive recipe
▸ discretize the range of x into a finite number of regions and calculate the local probabilities of those regions
▸ sample from the finite distribution
Possible problems
▸ in many dimensions we need exponentially many regions
▸ some regions of high density matter more for the result, but how can we find them?
A useful analogy from David MacKay’s Chapter 29.



Lake analogy from David MacKay
see “A useful analogy” in David MacKay’s book, page 359

Task
▸ estimate average plankton concentration in some lake
▸ the depth of the lake at location x is given by some unnormalized
density p∗ (x) = p(x)Zp
▸ the plankton concentration at x is φ(x)
▸ we want to estimate EX ∼p φ(X )
Approach
▸ drive around with boat to n locations xi and measure the depth
p∗ (xi ) and plankton concentration φ(xi )
▸ use some nice formula to estimate the overall plankton
concentration
Problems
▸ even while measuring p∗(x_i) you never know whether you have reached the important points where the lake is really deep, or whether you are missing some rare but much deeper areas



The lake from David MacKay’s book
Figure 29.3 copied from his book



If the PDF has a nice form and we are good at integration, we can use the transformation-of-variables trick. . .



Recall
Transformation of variables
For an invertible function f

X ∼ pX (x)
Y = f (X )

the PDF of the transformed random variable Y is


p_Y(y) = p_X(f⁻¹(y)) |(dx/dy)(y)|

where (dx/dy)(y) = (f⁻¹)′(y) is the derivative of the inverse of f.


Note that under transformations the probabilities should stay the same,
i.e. informally this can be written as

pX (x) dx = pY (y ) dy

to easily memorize the formula.


A most useful example

Given a PDF pX (x) let FX (x) be the corresponding CDF, i.e.


F_X(x) = ∫_{−∞}^x p_X(x′) dx′

Deep insight
Let

X ∼ pX (x)
Y = FX (X )

Then

pY (y ) = [0 ≤ y ≤ 1]

i.e. Y ∼ Uniform(0, 1) is uniformly distributed between zero and one.



Details:

Applying the transformation of variables for y ∈ [0, 1]:

p_Y(y) = p_X(F_X⁻¹(y)) |(F_X⁻¹)′(y)|
       = p_X(F_X⁻¹(y)) (F_X⁻¹)′(y)
       = p_X(F_X⁻¹(y)) / F_X′(F_X⁻¹(y))
       = p_X(F_X⁻¹(y)) / p_X(F_X⁻¹(y))
       = 1

where we used
▸ that the CDF is monotonically increasing, i.e. (F_X⁻¹)′(y) ≥ 0
▸ the inverse function theorem, i.e. (FX−1 )′ (y ) = 1/FX′ (FX−1 (y ))
▸ and that the derivative of the CDF is the PDF, i.e. FX′ (x) = pX (x)



Transforming uniform into anything
Forward

X ∼ p_X(x) ⟹ Y = F_X(X) ∼ Uniform(0, 1)

Backward
Given some PDF pX

Y ∼ Uniform(0, 1) ⟹ X = F_X⁻¹(Y) ∼ p_X(x)

Example
▸ How can we sample from an exponential distribution with PDF p_X(x) = λ exp(−λx)?
▸ Calculate the CDF: FX (x) = . . . = 1 − exp(−λx)
▸ Derive the inverse CDF: FX−1 (y ) = − log(1 − y )/λ
▸ Sample Y from a uniform distribution and transform with the
inverse CDF.
▸ Note that it does not matter whether we use − log(y)/λ or − log(1 − y)/λ, since 1 − Y is also uniform on (0, 1).
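
A minimal sketch of inverse-transform sampling for this exponential example (λ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                        # rate λ (arbitrary illustrative choice)
y = rng.uniform(0, 1, 100_000)
x = -np.log(1 - y) / lam         # inverse CDF; -np.log(y)/lam works equally well
print(x.mean())                  # ≈ 1/λ = 0.5
```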
Box–Muller method to sample from a Gaussian

Recipe: Sample polar coordinates and transform to Cartesian:

R ∼ Exponential(1/2)      (squared magnitude)
φ ∼ Uniform(0, 2π)
X₁ = √R cos φ
X₂ = √R sin φ

It can be shown that:


▸ X1 and X2 are both Gaussian distributed
▸ for the proof use the transformation of variables formula (this is not
so easy, since we have to use a two-dimensional transformation
formula)
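
A minimal sketch of the recipe (note that numpy parametrizes the exponential by the scale 1/rate = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
r = rng.exponential(scale=2.0, size=n)   # Exponential(1/2): rate 1/2 = scale 2
phi = rng.uniform(0, 2 * np.pi, size=n)
x1 = np.sqrt(r) * np.cos(phi)            # square root of the squared magnitude
x2 = np.sqrt(r) * np.sin(phi)
print(x1.mean(), x1.std(), x2.std())     # ≈ 0, 1, 1
```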



Other general methods for sampling



Graphically sampling under some density

Goal
▸ sample uniformly below a PDF p(x)
▸ i.e. sample a location x0 according to p(x) and uniformly a value
y between 0 and p(x0 ) (for two-dimensional visualization)

X ∼ p(x)
Y ∣X ∼ Uniform(0, p(X ))

Note
▸ Note that the PDF of Y given X = x₀ can be written as:

  p(y ∣ x₀) = u(y/p(x₀)) / p(x₀) = [0 ≤ y ≤ p(x₀)] / p(x₀)

  so the joint density p(x₀) p(y ∣ x₀) = [0 ≤ y ≤ p(x₀)] is uniform under the graph of p



As an example let's sample from a standard normal Gaussian, i.e. with mean zero and standard deviation one:

p(x) = (1/√(2π)) exp(−x²/2)
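
A minimal sketch of this two-dimensional view; a scatter plot of the resulting (x, y) pairs fills the area under the Gaussian curve uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal PDF
x = rng.standard_normal(10_000)   # X ~ p(x)
y = rng.uniform(0, p(x))          # Y | X ~ Uniform(0, p(X))
# plotting (x, y), e.g. with matplotlib, shows points uniform under the curve
```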



Rejection sampling



Rejection sampling
from David MacKay’s Chapter 29

Goal
▸ sample from unnormalized density p∗
Trick
▸ find a density q(x) from which we can sample and a constant c > 0 such that c q(x) majorizes p∗(x), i.e.

  c q(x) ≥ p∗(x) for all x

▸ sample uniformly under c q(x) in two dimensions (see previous slide) and keep only those samples that fall below p∗(x)



Rejection sampling
from David MacKay’s Chapter 29
Goal
▸ sample from unnormalized density p∗
Example
▸ we want to get samples from the unnormalized PDF

p∗ (x) = exp(0.4(x − 0.4)2 − 0.08x 4 )

▸ the Gaussian PDF

  q(x) = (1/√(2π · 4)) exp(−(x + 1)²/(2 · 4))

  majorizes p∗: for c = 17 we have c q(x) ≥ p∗(x)


Steps
1. sample x1 from q(x)
2. sample y1 from Uniform(0, c q(x1 ))
3. if y1 < p∗ (x1 ) then x1 is a sample from p∗ else goto step 1
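
A minimal sketch of these three steps for the example above (the number of retained samples is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4)**2 - 0.08 * x**4)   # unnormalized target
mu, sigma, c = -1.0, 2.0, 17.0                                # proposal N(-1, 4), c from the slide
q = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

samples = []
while len(samples) < 10_000:
    x1 = rng.normal(mu, sigma)        # step 1: sample from q
    y1 = rng.uniform(0, c * q(x1))    # step 2: uniform height under c q
    if y1 < p_star(x1):               # step 3: keep only points under p*
        samples.append(x1)
```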
Problems with rejection sampling in many dimensions

Simple example
▸ consider two d-dimensional Gaussians p(x) and q(x) with mean zero and standard deviations σ_p and σ_q, with σ_q > σ_p
▸ assume we want to sample from p(x) using rejection sampling
applied to the wider Gaussian q(x)
▸ the optimal constant c, such that c q(x) ≥ p(x), can be calculated at the origin as

  c = p(0)/q(0) = (2π σ_q²)^{d/2} / (2π σ_p²)^{d/2} = (σ_q/σ_p)^d = exp(d ln(σ_q/σ_p))

▸ c is also the volume under c q(x), so the acceptance rate will be 1/c, which can get arbitrarily small for large d



Importance sampling



Basic importance sampling
Goal
▸ estimate the expected value of φ(x) for X ∼ p(x), i.e. EX ∼p φ(X )
Reweighting solution
▸ generate samples from another density q(x)
▸ use a weighted average that adjusts the importance of the
samples
Derive the weights
▸ extend the expectation with q(x)/q(x):

  E_{X∼p} φ(X) = ∫ φ(x) p(x) dx = ∫ φ(x) (p(x)/q(x)) q(x) dx = E_{X∼q} [φ(X) p(X)/q(X)]

▸ change the importance of the samples x_i ∼ q with weights w(x_i) = p(x_i)/q(x_i):

  E_{X∼p} φ(X) ≈ (1/n) ∑_{i=1}^n φ(x_i) p(x_i)/q(x_i) = (1/n) ∑_{i=1}^n φ(x_i) w(x_i)
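
A minimal sketch of basic importance sampling (the target p = N(0, 1), proposal q = N(0, 4), and test function φ(x) = x² are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.square                                          # estimate E[X²] = 1 under p
log_p = lambda x: -x**2 / 2 - 0.5 * np.log(2 * np.pi)    # target p = N(0, 1)
log_q = lambda x: -x**2 / 8 - 0.5 * np.log(8 * np.pi)    # proposal q = N(0, 4)

x = rng.normal(0, 2, 100_000)      # samples from q
w = np.exp(log_p(x) - log_q(x))    # importance weights w(x_i) = p(x_i)/q(x_i)
print(np.mean(phi(x) * w))         # ≈ 1
```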
then one can show. . .
▸ in the limit n → ∞ the estimator converges to the true expectation, i.e.

  (1/n) ∑_{i=1}^n φ(x_i) w(x_i) → E_{X∼p} φ(X)

▸ for self-normalized importance sampling (coming up) we need this result also for the constant function φ(x) = 1:

  (1/n) ∑_{i=1}^n 1 · w(x_i) = (1/n) ∑_{i=1}^n w(x_i) → E_{X∼p} 1 = 1



Self-normalized importance sampling
Problem
▸ what can we do if we only have access to the unnormalized
density p∗ (x) = p(x)Zp and to samples
x1 , . . . , xn
from the unnormalized density q ∗ (x) = q(x)Zq ?
Solution
▸ define the ratio of the unnormalized PDFs,

  w∗(x) = p∗(x)/q∗(x) = (p(x) Z_p)/(q(x) Z_q)

▸ define the ratio of the normalized PDFs as well:

  w(x) = p(x)/q(x)

▸ after evaluating these weights on the random samples x₁, …, x_n, notice that for all i the quotients agree:

  w(x_i) / ((1/n) ∑_{j=1}^n w(x_j)) = w∗(x_i) / ((1/n) ∑_{j=1}^n w∗(x_j))   since the constant Z_p/Z_q cancels
Self-normalized importance sampling - cont.

Thus the estimator based on the unnormalized PDFs can be shown to estimate the correct expectation:

  (1/n) ∑_{i=1}^n φ(x_i) w∗(x_i) / ((1/n) ∑_{j=1}^n w∗(x_j))
    = (1/n) ∑_{i=1}^n φ(x_i) w(x_i) / ((1/n) ∑_{j=1}^n w(x_j))
    = ((1/n) ∑_{i=1}^n φ(x_i) w(x_i)) / ((1/n) ∑_{j=1}^n w(x_j))
    → E_{X∼p} φ(X) / 1 = E_{X∼p} φ(X)   as n → ∞

Note
▸ in the first step we replace the starred weights by the unstarred ones
▸ the numerator converges to the expectation
▸ the denominator converges to 1 (as shown previously)
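
A minimal sketch of the self-normalized estimator, using only unnormalized densities (target and proposal are the illustrative Gaussians from the basic importance-sampling sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
log_p_star = lambda x: -x**2 / 2   # unnormalized target p*: N(0, 1) without its constant
log_q_star = lambda x: -x**2 / 8   # unnormalized proposal q*: N(0, 4) without its constant

x = rng.normal(0, 2, 100_000)                   # samples from q
w_star = np.exp(log_p_star(x) - log_q_star(x))  # unnormalized weights w*(x_i)
print(np.sum(x**2 * w_star) / np.sum(w_star))   # self-normalized estimate of E[X²] ≈ 1
```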



Problems of importance sampling

E_{X∼p} φ(X) = E_{X∼q} [φ(X) p(X)/q(X)]

What can happen to the variance?


▸ instead of φ(x) we have to consider φ(x) p(x)/q(x)
▸ assume that there are x with q(x) ≪ p(x)
▸ then the estimator variance

  (1/n) Var_{X∼q}(φ(X) p(X)/q(X))

  can get arbitrarily large. . .



Markov Chain Monte Carlo (MCMC) methods



Monte Carlo methods

Goals
1. generate samples x1 , . . . , xn from some PDF p(x)
2. estimate the expected value:

E_{X∼p} φ(X) = ∫ φ(x) p(x) dx ≈ (1/n) ∑_{i=1}^n φ(x_i)

Generate IID samples (last time)


▸ Transformation of samples (e.g. uniform): mathematically
challenging to derive inverse CDF (aka quantile function)
▸ Rejection sampling, importance sampling: proposal q(x) must be
very similar to p(x)
Generate dependent samples
▸ given a current sample x, let the proposal q(x ′ ∣x) depend on x
▸ hence the name: Markov chain Monte Carlo (or MCMC)



What is a Markov chain?

Sequence of independent variables:

  p(x₁, x₂, …, x_n) = ∏_{i=1}^n p(x_i)

General sequence of random variables:

  p(x₁, x₂, …, x_n) = p(x₁) ∏_{i=2}^n p(x_i ∣ x₁, …, x_{i−1})

Markov chain:

  p(x₁, x₂, …, x_n) = p(x₁) ∏_{i=2}^n p(x_i ∣ x_{i−1})



Metropolis method
Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Iterate:
1. given xt , sample candidate x ′ from proposal distribution q(x ′ ∣xt )
(must be symmetric, i.e. q(x ′ ∣x) = q(x∣x ′ ))
2. calculate acceptance ratio:

α(x′, x_t) = min(1, p∗(x′)/p∗(x_t))

3. accept x′ with probability α, i.e. set x_{t+1} = x′
4. otherwise stay, i.e. set x_{t+1} = x_t
Notes
▸ initialize x0 arbitrarily
▸ every iteration generates a sample
▸ discard initial samples (burn-in phase)
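
A minimal sketch of the Metropolis iteration, reusing the unnormalized target from the rejection-sampling example (proposal width, chain length, and burn-in are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4)**2 - 0.08 * x**4)  # unnormalized target

x = 0.0                                           # arbitrary initialization
chain = []
for t in range(50_000):
    x_cand = x + rng.normal(0, 1)                 # symmetric Gaussian proposal q(x'|x_t)
    alpha = min(1.0, p_star(x_cand) / p_star(x))  # acceptance ratio
    if rng.uniform() < alpha:
        x = x_cand                                # accept, otherwise stay
    chain.append(x)
samples = chain[1_000:]                           # discard burn-in
```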



Metropolis-Hastings method
Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Iterate:
1. given xt , sample candidate x ′ from proposal distribution q(x ′ ∣xt )
(needn’t be symmetric, i.e. possibly q(x ′ ∣x) ≠ q(x∣x ′ ))
2. calculate acceptance ratio:

α(x′, x_t) = min(1, (p∗(x′) q(x_t ∣ x′)) / (p∗(x_t) q(x′ ∣ x_t)))

3. accept x′ with probability α, i.e. set x_{t+1} = x′
4. otherwise stay, i.e. set x_{t+1} = x_t
Notes
▸ initialize x0 arbitrarily
▸ every iteration generates a sample
▸ discard initial samples (burn-in phase)



Of course Metropolis is a special case of Metropolis-Hastings. Why?



Metropolis-Hastings method: why does it work?
Intuition
▸ the sampler should find x′ where p∗(x′) is large
▸ it will accept the candidate for sure if p∗(x′) is at least as large as p∗(x_t)
▸ otherwise, it still accepts x′ with probability α(x′, x_t) even though p∗(x′) is smaller (to keep enough variability)
Formally
▸ given p∗(x) and q(x′∣x) the Metropolis-Hastings method implies the transition probabilities

  h(x_{t+1} ∣ x_t) = { q(x_{t+1} ∣ x_t) α(x_{t+1}, x_t)                       for x_{t+1} ≠ x_t
                    { q(x_{t+1} ∣ x_t) + ∫ q(x′ ∣ x_t) (1 − α(x′, x_t)) dx′   for x_{t+1} = x_t

▸ the first case is accepting a new point; the second case is either sampling the previous point again (the first summand) or rejecting a candidate x′ (the integral)



Detailed balance
Notice
▸ there are three probability distributions:
1. the wanted distribution p∗(x) (which can also be evaluated at x′)
2. the proposal distribution q(x ′ ∣x)
3. the transition distribution h(x ′ ∣x), which is also a conditional
distribution defined by the Metropolis-Hastings method for p∗ (x) and
q(x ′ ∣x)
▸ p(x) (the normalized PDF) and h(x ′ ∣x) fulfill the detailed balance
condition, i.e. for all x and x ′

p(x) h(x ′ ∣x) = p(x ′ ) h(x∣x ′ )

▸ the latter implies that p(x) is a stationary distribution of the Markov chain defined by the transition probability h(x′∣x), because

  ∫ p(x) h(x′∣x) dx = ∫ p(x′) h(x∣x′) dx = p(x′) ∫ h(x∣x′) dx = p(x′)

▸ To show convergence of the Markov chain, we need ergodicity (which is beyond the scope of this lecture).
Proof of detailed balance

▸ for x = x′ detailed balance holds trivially,

p(x) h(x ′ ∣x) = p(x ′ ) h(x∣x ′ )

▸ assume x ≠ x′, then

  p(x) h(x′∣x) = p(x) q(x′∣x) α(x′, x)
               = p(x) q(x′∣x) min(1, (p(x′) q(x∣x′)) / (p(x) q(x′∣x)))
               = min(p(x) q(x′∣x), p(x′) q(x∣x′))
               = p(x′) q(x∣x′) min((p(x) q(x′∣x)) / (p(x′) q(x∣x′)), 1)
               = p(x′) h(x∣x′)

▸ we can replace p∗ with p in α, since Zp cancels



Examples of MCMC
▸ Gibbs sampling
▸ Slice sampling



Gibbs sampling - two variables
Setup
▸ let’s consider two-dimensional samples x = (x1 , x2 )
▸ let p(x1 , x2 ) be a joint probability distribution, from which it is
difficult to sample
▸ however, assume that sampling from the conditional distributions
p(x1 ∣x2 ) and p(x2 ∣x1 ) is easy
▸ there are two proposal distributions:

  q₁(x′∣x) = [x₂ = x₂′] p(x₁′ ∣ x₂)   (update only the first entry)
  q₂(x′∣x) = [x₁ = x₁′] p(x₂′ ∣ x₁)   (update only the second entry)

▸ this implies acceptance ratios of one (use [x₂ = x₂′] = 1):

  (p(x₁, x₂) q₁(x′∣x)) / (p(x₁′, x₂′) q₁(x∣x′)) = (p(x₂) p(x₁∣x₂) p(x₁′∣x₂)) / (p(x₂′) p(x₁′∣x₂′) p(x₁∣x₂′)) = 1

  since x₂′ = x₂; similarly for q₂(x′∣x)


▸ thus Gibbs sampling is Metropolis-Hastings that always accepts
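
A minimal sketch of this two-variable Gibbs sampler for an illustrative bivariate Gaussian, whose full conditionals are one-dimensional Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the illustrative bivariate Gaussian
x1, x2 = 0.0, 0.0
chain = []
for t in range(20_000):
    # sample each coordinate from its conditional given the current other one
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # p(x2 | x1)
    chain.append((x1, x2))
samples = np.array(chain[1_000:])                    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])                  # ≈ rho
```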
Gibbs sampling - several variables

Setup
▸ let’s consider multi-dimensional samples x = (x1 , x2 , . . . , xd )
▸ let p(x1 , x2 , . . . , xd ) be a joint probability distribution, from which it
is hard to sample
▸ however, assume that sampling from the full conditionals p(x₁ ∣ x₂, …, x_d), etc. is easy
▸ there are d proposal distributions:

  q₁(x′∣x) = [x_d = x_d′] ⋯ [x₂ = x₂′] p(x₁′ ∣ x₂, …, x_d)   (update only the first entry)

▸ again all acceptance ratios are one


▸ thus Gibbs sampling is Metropolis-Hastings that always accepts



Slice sampling



Slice sampling - one variable (1D)
closely following David MacKay, Sec. 29.7

Goal
▸ generate samples from unnormalized PDF p∗ (x) = p(x)Zp
Steps: (code box copied from MacKay’s book page 375)



Slice sampling - one variable (1D)
closely following David MacKay, Sec. 29.7

“Stepping out” (code box copied from MacKay’s book page 375)

“Shrinking” (code box copied from MacKay’s book page 375)
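
A minimal Python sketch of the procedure in these code boxes (the step size w and the initial interval placement are simplifications of MacKay's version):

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_sample_step(x, p_star, w=1.0):
    """One slice-sampling update for a 1D unnormalized density p_star."""
    y = rng.uniform(0, p_star(x))      # uniform height under the curve at x
    # "stepping out": enlarge the interval until both ends leave the slice
    left = x - rng.uniform(0, w)
    right = left + w
    while p_star(left) > y:
        left -= w
    while p_star(right) > y:
        right += w
    # "shrinking": propose in the interval, shrink it towards x on rejection
    while True:
        x_new = rng.uniform(left, right)
        if p_star(x_new) > y:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new
```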



Finally: an interesting MCMC paper

▸ paper: Persi Diaconis, The MCMC revolution, 2008


▸ see code 18-MCMC-demo-Diaconis.ipynb
▸ Example of discrete MCMC



Persi Diaconis, from http://statweb.stanford.edu/~cgates/PERSI/ or
https://diaconis.ckirby.su.domains/



Diaconis “brief treatise on Markov Chains”
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ X a set of finitely many states
▸ a Markov chain is defined by a matrix K with entries K(x, y), s.t.

  K(x, y) ≥ 0 and ∑_y K(x, y) = 1 for all x

▸ each row of K defines a probability distribution; the entry K(x, y) is the probability of a transition from x to y
▸ this defines a chain X₀ = x, X₁ = y, X₂ = z, …
▸ written as probabilities:

  p(X₁ = y ∣ X₀ = x) = K(x, y)
  p(X₁ = y, X₂ = z ∣ X₀ = x) = K(x, y) K(y, z)
  p(X₂ = z ∣ X₀ = x) = ∑_y K(x, y) K(y, z)

▸ n-th power of the matrix:

  K^n(x, y) = p(X_n = y ∣ X₀ = x)
Stationary distribution
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ A Markov chain defined by K(x, y) has stationary distribution π(x) > 0, ∑_x π(x) = 1, if

  ∑_x π(x) K(x, y) = π(y)

▸ π is a left eigenvector of K with eigenvalue 1.

Theorem 1 (Fundamental Theorem of Markov Chains)
Let X be a finite set of states and K(x, y) the transition matrix of a Markov chain on X. If there is an n₀ such that K^n(x, y) > 0 for all n ≥ n₀, then K has a unique stationary distribution π and, for n → ∞,

  K^n(x, y) → π(y) for each x, y ∈ X

▸ The condition that K^n(x, y) > 0 for all n ≥ n₀ makes the chain irreducible and aperiodic; without it the limit need not exist or the stationary distribution need not be unique.
▸ This is an instance of the Perron–Frobenius theorem:
  https://de.wikipedia.org/wiki/Satz_von_Perron-Frobenius
Metropolis algorithm
copied from Sec 2. in Diaconis, “MCMC revolution”, 2008
▸ X finite state space
▸ π(x) a probability on X (possibly unnormalized)
▸ J(x, y ) a transition matrix defining some Markov chain on X with
J(x, y ) > 0 ⇐⇒ J(y , x) > 0.
▸ J and π can be unrelated.
▸ The Metropolis algorithm transforms J to new transition
probabilities K such that the Markov chain defined by K has π as
its stationary distribution (where A(x, y ) = π(y )J(y , x)/(π(x)J(x, y ))):

          ⎧ J(x, y)                                           if x ≠ y, A(x, y) ≥ 1
K(x, y) = ⎨ J(x, y) A(x, y)                                   if x ≠ y, A(x, y) < 1
          ⎩ J(x, y) + ∑_{z: A(x,z)<1} J(x, z) (1 − A(x, z))   if x = y

▸ now one can show detailed balance for π and K:

  π(x) K(x, y) = π(y) K(y, x)

▸ which implies that π is a stationary distribution of K:

  ∑_x π(x) K(x, y) = ∑_x π(y) K(y, x) = π(y) ∑_x K(y, x) = π(y)
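
A minimal numpy sketch of this construction (the 3-state example with a uniform proposal is illustrative):

```python
import numpy as np

def metropolize(J, pi):
    """Turn a proposal matrix J (rows sum to 1, all entries > 0) into the
    Metropolis chain K that has pi as its stationary distribution."""
    A = (pi[None, :] * J.T) / (pi[:, None] * J)   # A[x, y] = pi(y)J(y,x)/(pi(x)J(x,y))
    K = np.where(A >= 1, J, J * A)                # accept or thin the off-diagonal moves
    np.fill_diagonal(K, 0)
    np.fill_diagonal(K, 1 - K.sum(axis=1))        # rejected mass stays at x
    return K

J = np.full((3, 3), 1 / 3)              # uniform proposal on 3 states
pi = np.array([1.0, 2.0, 3.0]) / 6.0    # illustrative target distribution
K = metropolize(J, pi)
print(np.allclose(pi @ K, pi))          # True: pi is stationary for K
```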
Operator view

▸ The Metropolis algorithm transforms J into new transition probabilities K such that the Markov chain defined by K has π as its stationary distribution:

            ⎧ J(x, y)                                           if x ≠ y, A(x, y) ≥ 1
  K(x, y) = ⎨ J(x, y) A(x, y)                                   if x ≠ y, A(x, y) < 1
            ⎩ J(x, y) + ∑_{z: A(x,z)<1} J(x, z) (1 − A(x, z))   if x = y

▸ This defines an operator (function) on Markov chains.
▸ What happens if we apply it twice or three times?
▸ Does the Markov chain converge faster?



18-MCMC-demo-Diaconis.ipynb



END of Sampling/MCMC section.

However, this is also the

END of the Machine Learning lecture!

THANK YOU FOR YOUR ATTENTION!

