Chapter 14 MCMC For Continuous Distribution, Gaussian Process (Lecture On 02-18-2021) - STAT 243 - Stochastic Process
We previously studied why the Metropolis-Hastings chain converges to the full posterior when the
parameter space is discrete. We now study the case where the parameter space is continuous. When a
Markov chain is defined on a continuous state space, we can no longer use a transition probability
matrix. We are dealing with a Markov chain of the form {Xn : n ≥ 1} where each Xn ∈ S and S is not
discrete.
Definition 14.1 (Transition kernel) The transition kernel T(x, x′) for making a transition from
Xn = x to Xn+1 = x′ is a function of x and x′ such that T(x, ⋅) is a density function.
Let g(x′|x) be the proposal density for proposing x′ from x, and let a(x, x′) be the
probability of accepting a move from x to x′. Then the transition kernel for Metropolis-Hastings
can be described as
$$
T(x, x') =
\begin{cases}
g(x'|x)\,a(x, x') & x \neq x' \\
1 - \int g(x'|x)\,a(x, x')\,dx' & x = x'
\end{cases}
\tag{14.1}
$$
The first expression in (14.1) is the density of proposing x′, which is given by g(x′|x),
multiplied by the probability that the proposal is accepted, a(x, x′). For the second expression
in (14.1), ∫ g(x′|x)a(x, x′)dx′ is the probability that some x′ is proposed and accepted. Thus,
1 − ∫ g(x′|x)a(x, x′)dx′ is the probability that the proposed x′ is not accepted, in which case
the chain stays where it is, that is, x′ = x.
Definition 14.2 (Stationary distribution in continuous case) If a transition kernel satisfies
$$
\int T(x, x')\,\pi(x)\,dx = \pi(x'), \quad \forall x' \tag{14.2}
$$
then π(⋅) is known as the stationary distribution for the transition kernel.
Notice the similarity with the definition of the stationary distribution in the discrete case
(Definition 9.1), where the stationary distribution satisfies $\pi_j = \sum_i \pi_i p_{ij}$ for all j.
Here we just replace the sum with an integral.
Proposition 14.1 (Detailed balance condition for continuous case) Similar to the discrete case, if
∃π(⋅) such that
$$
T(x, x')\,\pi(x) = T(x', x)\,\pi(x'), \quad \forall x, x' \tag{14.3}
$$
then π(⋅) is the stationary distribution, and we say that the detailed balance condition is satisfied.
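The step from (14.3) to (14.2) can be written out in one line. The following is a short sketch, assuming only that T(x′, ⋅) integrates to one over its second argument, as required by Definition 14.1: integrate both sides of (14.3) over x.

```latex
\int T(x, x')\,\pi(x)\,dx
  = \int T(x', x)\,\pi(x')\,dx
  = \pi(x') \int T(x', x)\,dx
  = \pi(x'), \quad \forall x'
```

The first equality is the detailed balance condition (14.3), and the last uses the fact that the kernel integrates to one.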
Let p(x) be the full posterior distribution. For Metropolis-Hastings, the acceptance probability is
$$
a(x, x') = \min\left(1, \frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)}\right) \tag{14.4}
$$
If x ≠ x′, we then get
$$
\begin{aligned}
T(x, x')\,p(x) &= g(x'|x)\,a(x, x')\,p(x) \\
&= g(x'|x)\,p(x)\,\min\left(1, \frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)}\right)
\end{aligned}
\tag{14.5}
$$
We consider two cases. In the first case, assume p(x′)g(x|x′) < p(x)g(x′|x). Then (14.5)
becomes
$$
T(x, x')\,p(x) = g(x'|x)\,p(x)\,\frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)} = p(x')\,g(x|x') \tag{14.6}
$$
In addition,
$$
\begin{aligned}
T(x', x)\,p(x') &= g(x|x')\,a(x', x)\,p(x') \\
&= g(x|x')\,p(x')\,\min\left(1, \frac{p(x)\,g(x'|x)}{p(x')\,g(x|x')}\right) \\
&= g(x|x')\,p(x')
\end{aligned}
\tag{14.7}
$$
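By (14.6) and (14.7), the two sides agree in the first case. The second case,
p(x′)g(x|x′) ≥ p(x)g(x′|x), follows from the same argument with the roles of x and x′
exchanged. Therefore,
$$
T(x, x')\,p(x) = T(x', x)\,p(x'), \quad \forall x \neq x' \tag{14.8}
$$
When x = x′ both sides are trivially equal, so
$$
T(x, x')\,p(x) = T(x', x)\,p(x'), \quad \forall x, x' \tag{14.9}
$$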
Therefore, p(x) is the stationary distribution. Assuming the regularity conditions hold, p(x) is also
the limiting distribution. This is why, when doing MCMC with Metropolis-Hastings, the chain
ultimately samples from the full posterior distribution.
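The kernel (14.1) with acceptance probability (14.4) can be illustrated with a small sampler. The sketch below is only an illustration under stated assumptions, not the lecture's own code: the target is taken to be an unnormalized standard normal, the proposal g(x′|x) is a Gaussian random walk, and the names `log_target`, `metropolis_hastings` and `step` are hypothetical.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of the target p(x).

    Assumption for illustration: a standard normal target.
    """
    return -0.5 * x * x


def metropolis_hastings(n_samples, x0=0.0, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with proposal g(x'|x) = N(x, step^2)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accepted = 0
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)
        # The random-walk proposal is symmetric, g(x'|x) = g(x|x'), so the
        # ratio in (14.4) reduces to p(x') / p(x). Work on the log scale.
        log_ratio = log_target(x_new) - log_target(x)
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            x = x_new          # proposal accepted: move to x'
            n_accepted += 1
        # Otherwise the chain stays at x (second branch of (14.1)).
        samples.append(x)
    return samples, n_accepted / n_samples


samples, acc_rate = metropolis_hastings(20000, step=2.4)
burned = samples[2000:]  # discard an initial burn-in segment
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(f"acceptance rate {acc_rate:.2f}, mean {mean:.3f}, variance {var:.3f}")
```

If the chain behaves as the theory suggests, the sample mean and variance after burn-in should be close to 0 and 1, the moments of the assumed target.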
Now consider the Gibbs sampler. Let p(θ1, θ2) be the full posterior distribution of (θ1, θ2), and
let p(θ1|θ2) and p(θ2|θ1) be the conditional distributions of θ1|θ2 and θ2|θ1. The transition
kernel is given by
$$
T((\theta_1, \theta_2), (\theta_1', \theta_2')) = p(\theta_1'|\theta_2)\,p(\theta_2'|\theta_1') \tag{14.10}
$$
For the Gibbs sampler, the detailed balance condition does not hold in general, so we need to
verify stationarity directly from the definition (14.2).
We want to prove
$$
\int\!\!\int T((\theta_1, \theta_2), (\theta_1', \theta_2'))\,p(\theta_1, \theta_2)\,d\theta_1\,d\theta_2 = p(\theta_1', \theta_2'), \quad \forall \theta_1', \theta_2' \tag{14.11}
$$
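$$
\begin{aligned}
\text{L.H.S.} &= \int\!\!\int p(\theta_1'|\theta_2)\,p(\theta_2'|\theta_1')\,p(\theta_1, \theta_2)\,d\theta_1\,d\theta_2 \\
&= p(\theta_2'|\theta_1') \int_{\theta_2} p(\theta_1'|\theta_2) \left( \int_{\theta_1} p(\theta_1, \theta_2)\,d\theta_1 \right) d\theta_2 \\
&= p(\theta_2'|\theta_1') \int_{\theta_2} p(\theta_1'|\theta_2)\,p(\theta_2)\,d\theta_2 \\
&= p(\theta_2'|\theta_1')\,p(\theta_1') \\
&= p(\theta_1', \theta_2') = \text{R.H.S.}
\end{aligned}
\tag{14.12}
$$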
This implies that p(θ1 , θ2 ) is the stationary distribution for the Gibbs sampling kernel. Under
ergodicity, the stationary distribution is then the limiting distribution. Thus, the Gibbs sampler is
also valid.
This validation is for the Gibbs sampler with two parameters. A similar technique can be
generalized to a Gibbs sampler with more parameters.
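The two-step kernel (14.10) can be sketched for a case where both conditionals are known in closed form. The example below is an illustrative assumption, not part of the lecture: the target p(θ1, θ2) is taken to be a bivariate normal with zero means, unit variances and correlation ρ, for which θ1|θ2 ∼ N(ρθ2, 1 − ρ²) and θ2|θ1 ∼ N(ρθ1, 1 − ρ²). The function name `gibbs_bivariate_normal` is hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(n_samples, rho, seed=0):
    """Gibbs sampler for an assumed bivariate normal target.

    Target (illustrative assumption): zero means, unit variances,
    correlation rho.
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of each full conditional
    theta1, theta2 = 0.0, 0.0
    draws = []
    for _ in range(n_samples):
        # Kernel (14.10): draw theta1' from p(. | theta2), then
        # theta2' from p(. | theta1').
        theta1 = rng.gauss(rho * theta2, cond_sd)
        theta2 = rng.gauss(rho * theta1, cond_sd)
        draws.append((theta1, theta2))
    return draws


rho = 0.8
draws = gibbs_bivariate_normal(20000, rho)[1000:]  # drop a short burn-in
n = len(draws)
m1 = sum(a for a, _ in draws) / n
m2 = sum(b for _, b in draws) / n
v1 = sum((a - m1) ** 2 for a, _ in draws) / n
v2 = sum((b - m2) ** 2 for _, b in draws) / n
cov12 = sum((a - m1) * (b - m2) for a, b in draws) / n
corr = cov12 / math.sqrt(v1 * v2)
print(f"means ({m1:.3f}, {m2:.3f}), sample correlation {corr:.3f}")
```

If the stationary distribution is the assumed joint, the sample correlation should be close to the chosen ρ and both sample means close to zero.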
Gaussian Process
Why is the Gaussian process so important? Consider an example where the data (yi, xi) for
i = 1, ⋯, n look like Figure 14.1.
[Plot of y against x on the range 0 to 5, showing the data with a linear fit and a spline fit.]
FIGURE 14.1: An example when we may need to use Gaussian process to fit the data.
We may first consider fitting the data using linear regression, that is
yi = μ + xi β + ϵi (14.13)
which is shown as the blue line in Figure 14.1. It is not a good fit to the data. There is obviously an
inherent nonlinear relationship between x and y, which is not captured by the linear regression
model. Therefore, we consider fitting a nonlinear regression model
yi = f (xi ) + ϵi (14.14)
There are many different ways to make f nonlinear. For example, f can be a piecewise polynomial
or, more generally, f can be constructed using spline functions. The idea behind this kind of
method is to fit locally linear or polynomial functions to the data, then add constraints so
that the function is smooth at the boundaries between pieces. However, there are some issues with
this kind of technique. First of all, you need to know how many spline basis functions to use. Every
spline function is represented by some knots, and you have to decide where to put the knots. There is
no automatic approach for doing this. These drawbacks lead to the use of a Gaussian process to
estimate the nonlinear function f. The Gaussian process is much more automatic than spline
functions, and it is also flexible.
f(⋅) is an unknown function and our job is to put a prior distribution on f(⋅). Here f(⋅) is an
infinite dimensional quantity, and a stochastic process can act as a prior on an infinite dimensional
function. This motivates the idea of using a Gaussian process prior on f(⋅).
Formally, we write f(⋅) ∼ GP(μ, Cν(d, θ)). Here we implicitly assume that the Gaussian
process is stationary, which means Cov(f(x), f(x′)) is a function of x − x′ only.
f(⋅) ∼ GP(μ, Cν(d, θ)) implies that E(f(xi)) = μ for all xi and
Cov(f(xi), f(xj)) = Cν(d, θ), where d = |xi − xj| and θ, ν are parameters. Cν(⋅, ⋅) is called
the covariance kernel of the Gaussian process.
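Definition 14.4 (Matern covariance kernel) The Matern family of covariance kernels is defined as
$$
C_\nu(d, \phi, \sigma^2) = \sigma^2\,\frac{2^{1-\nu}}{\Gamma(\nu)}\,\left(\sqrt{2\nu}\,\phi d\right)^{\nu} K_\nu\left(\sqrt{2\nu}\,\phi d\right) \tag{14.15}
$$
where Γ(⋅) is the gamma function and Kν(⋅) is the modified Bessel function of the second kind.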
For more about the Matern family of covariance functions, one can refer to Stein (1999).
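For particular values of ν the kernel has a closed form:
ν = 1/2: $C_\nu(d, \phi, \sigma^2) = \sigma^2 \exp(-d\phi)$, which is also known as the exponential covariance kernel;
ν = 3/2: $C_\nu(d, \phi, \sigma^2) = \sigma^2 (1 + \sqrt{3}\,d\phi) \exp(-\sqrt{3}\,d\phi)$;
ν = 5/2: $C_\nu(d, \phi, \sigma^2) = \sigma^2 \left(1 + \sqrt{5}\,d\phi + \frac{5}{3} d^2\phi^2\right) \exp(-\sqrt{5}\,d\phi)$;
ν → ∞: $C_\nu(d, \phi, \sigma^2) = \sigma^2 \exp\left(-\frac{d^2\phi^2}{2}\right)$, which is also known as the Gaussian covariance kernel.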
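The closed-form members of the Matern family listed above are easy to evaluate directly. The following is a minimal sketch using only the standard library; the function name `matern` and the chosen evaluation points are hypothetical, and the general formula (14.15) with the Bessel function is deliberately not implemented here.

```python
import math


def matern(d, phi, sigma2, nu):
    """Closed-form Matern covariance for nu in {1/2, 3/2, 5/2, infinity}.

    d is the distance |x_i - x_j|, phi the decay parameter and sigma2 the
    variance. Other values of nu need the general formula (14.15), which
    this sketch does not cover.
    """
    if d < 0:
        raise ValueError("distance d must be non-negative")
    s = phi * d
    if nu == 0.5:  # exponential kernel
        return sigma2 * math.exp(-s)
    if nu == 1.5:
        r = math.sqrt(3.0) * s
        return sigma2 * (1.0 + r) * math.exp(-r)
    if nu == 2.5:
        r = math.sqrt(5.0) * s
        return sigma2 * (1.0 + r + (5.0 / 3.0) * s * s) * math.exp(-r)
    if math.isinf(nu):  # Gaussian kernel, the limit nu -> infinity
        return sigma2 * math.exp(-0.5 * s * s)
    raise ValueError("only nu in {1/2, 3/2, 5/2, inf} are implemented")


# Tabulate the four kernels at a few distances (phi = 1, sigma2 = 1).
for d in (0.0, 0.5, 1.0, 2.0):
    row = [matern(d, 1.0, 1.0, nu) for nu in (0.5, 1.5, 2.5, math.inf)]
    print(d, [round(v, 4) for v in row])
```

At d = 0 every kernel returns σ², the marginal variance, and each kernel decays as d grows; the table makes it easy to compare how quickly the different members decay.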
If you draw a function from a Gaussian process with a Matern covariance kernel, the function
will be ⌊ν⌋ times differentiable. By choosing ν appropriately, we can control the smoothness
of the functions drawn from the Gaussian process prior. For example,
If ν = 1/2, we draw functions which are nowhere differentiable;
If ν = 3/2, we draw functions which are once differentiable.
Definition 14.5 (Powered exponential covariance kernel) The powered exponential covariance
kernel is defined as
$$
C(d, \phi, \sigma^2, \alpha) = \sigma^2 \exp(-\phi d^{\alpha}) \tag{14.16}
$$
where α is called the power of the powered exponential kernel. If α = 1, the covariance
function is called the exponential kernel, while if α = 2, it is called the Gaussian kernel.
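To make the idea of a Gaussian process prior concrete, one can draw f at a finite grid of inputs: by definition of a Gaussian process, the vector (f(x1), ⋯, f(xn)) is multivariate normal with mean μ and covariance matrix given by the kernel. The sketch below is only an illustration under assumptions: it reads (14.16) as σ² exp(−ϕ d^α), uses α = 1, adds a small diagonal jitter for numerical stability, and the helper names `powered_exponential`, `cholesky` and `draw_gp_prior` are hypothetical.

```python
import math
import random


def powered_exponential(d, phi, sigma2, alpha):
    """Powered exponential covariance, read as sigma^2 * exp(-phi * d^alpha)."""
    return sigma2 * math.exp(-phi * d ** alpha)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def draw_gp_prior(xs, mu, phi, sigma2, alpha, seed=0, jitter=1e-8):
    """Draw (f(x_1), ..., f(x_n)) from the GP prior at the inputs xs."""
    rng = random.Random(seed)
    n = len(xs)
    cov = [[powered_exponential(abs(xs[i] - xs[j]), phi, sigma2, alpha)
            for j in range(n)] for i in range(n)]
    for i in range(n):
        cov[i][i] += jitter  # assumption: small jitter for numerical stability
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # If z ~ N(0, I), then mu + L z ~ N(mu, L L^T) = N(mu, cov).
    return [mu + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


xs = [0.1 * i for i in range(51)]  # grid on [0, 5]
f_draw = draw_gp_prior(xs, mu=0.0, phi=1.0, sigma2=1.0, alpha=1.0)
print([round(v, 2) for v in f_draw[:5]])
```

Changing `seed` gives a different function from the same prior. With α = 2 the covariance matrix on a fine grid becomes close to singular, so a larger jitter may be needed in that case.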
Suppose we want to fit y = f(x) + ϵ, and we have data (yi, xi) for i = 1, ⋯, n. How do we fit the
model? How do we find p(f|y1, x1, ⋯, yn, xn)? We will discuss these questions later.
References
Stein, Michael. 1999. Interpolation of Spatial Data: Some Theory of Kriging. New York, NY: Springer.