1 An Introduction to Bayesian Modeling


Not surprisingly, Bayes' Theorem is the key result that drives Bayesian modeling and statistics. Let S be a
sample space and let B1, . . . , BK be a partition of S so that (i) ∪k Bk = S and (ii) Bi ∩ Bj = ∅ for all i ≠ j.
We then have the following:

Theorem 1 (Bayes' Theorem) Let A be any event. Then for any 1 ≤ k ≤ K we have

P(Bk | A) = P(A | Bk)P(Bk) / P(A) = P(A | Bk)P(Bk) / ∑_{j=1}^{K} P(A | Bj)P(Bj).

Of course there is also a continuous version of Bayes’ Theorem with sums replaced by integrals. Bayes’
Theorem provides us with a simple rule for updating probabilities when new information appears. In Bayesian
modeling and statistics this new information is the observed data and it allows us to update our prior beliefs
about parameters of interest which are themselves assumed to be random variables.
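
As a quick numerical illustration (not from the original text), the following Python sketch applies the discrete form of the theorem to a two-event partition; all probabilities are made-up values chosen purely for illustration.

# Bayes' Theorem over a partition {B_1, B_2}: the posterior is proportional to
# likelihood times prior, normalized by the total probability P(A).
# All numbers below are illustrative assumptions, not taken from the text.
import numpy as np

prior = np.array([0.3, 0.7])        # P(B_1), P(B_2)
likelihood = np.array([0.9, 0.2])   # P(A | B_1), P(A | B_2)

evidence = np.sum(likelihood * prior)       # P(A) by the law of total probability
posterior = likelihood * prior / evidence   # P(B_k | A), k = 1, 2

print(posterior, posterior.sum())   # approx [0.659 0.341], sums to 1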

The Prior and Posterior Distributions


Let θ be some unknown parameter vector of interest. We assume θ is random with some distribution π(θ).
This is the prior distribution and it captures our prior uncertainty regarding θ. There is also a random vector
y with PDF (or PMF) p(y | θ) – this is the likelihood. The joint distribution of θ and y is then given by

p(θ, y) = π(θ)p(y | θ)

and we can integrate the joint distribution to get the marginal distribution of y, namely
p(y) = ∫_θ π(θ)p(y | θ) dθ.

We can compute the posterior distribution via Bayes’ Theorem and obtain
π(θ | y) = π(θ)p(y | θ) / p(y) = π(θ)p(y | θ) / ∫_θ π(θ)p(y | θ) dθ.    (1)
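
When θ is low-dimensional, the denominator of (1) can simply be approximated numerically. The short Python sketch below does this on a grid for a scalar θ; the Beta prior and binomial likelihood are placeholder choices (they anticipate Example 1 below, so the grid answer can be checked against the closed-form posterior), and any other prior/likelihood pair could be substituted.

# Grid approximation of the posterior in (1) for a scalar parameter theta.
# The particular prior, likelihood and data are illustrative assumptions.
import numpy as np
from scipy import stats

theta = np.linspace(1e-6, 1 - 1e-6, 2001)        # grid over the support of theta
prior = stats.beta.pdf(theta, a=2, b=2)          # pi(theta)
likelihood = stats.binom.pmf(5, n=5, p=theta)    # p(y | theta) with y = 5, n = 5

unnormalized = prior * likelihood
step = theta[1] - theta[0]
marginal = unnormalized.sum() * step             # p(y), the denominator in (1)
posterior = unnormalized / marginal              # pi(theta | y) on the grid

print(marginal)                                  # the marginal likelihood p(y)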

A Little History
Bayesian modeling has a long history going back to the discovery of Bayes' Theorem by the Reverend Thomas
Bayes in the 18th century and its later but independent discovery by Laplace towards the end of the 18th
century. Both Bayes and Laplace were objective Bayesians and were skeptical about conveying too much
information through the prior distribution. For over a century only the Bayesian paradigm existed but this
changed in the early 20th century with major contributions from Fisher (maximum likelihood estimation),
Neyman and Wald. Neyman developed the frequentist approach to statistics whereby statistical procedures
are evaluated w.r.t. a probability distribution over all possible data-sets. This approach gave rise to the
concepts of confidence intervals and hypothesis testing as well as the (somewhat awkward) interpretation
of their results. Despite the work of de Finetti, Savage and others who continued to push the Bayesian
paradigm, the frequentist and maximum likelihood approaches held sway over most of the 20th century.
However, Bayesian methods have made a strong comeback in recent decades with the advent of new MCMC
algorithms and the impressive growth in computing power.
It’s interesting to note that frequentists take the parameter vector θ to be fixed (albeit unknown) and
introduce uncertainty over the possible data-sets y. In contrast the Bayesian approach treats the data-set
y as given and instead introduces uncertainty over θ. It’s worth noting, however, that the frequentist and
Bayesian approaches do share some ideas and statisticians today seem to be much more at ease moving
between these differing approaches. Moreover, some well-known and popular statistical procedures combine
elements of both approaches. For example, empirical Bayesian methods (see Section 8) are very Bayesian
in spirit but they are not strictly Bayesian and in fact the analysis of these methods is often frequentist in
nature.
Robert [38] provides a thorough introduction to Bayesian statistics as well as its connections to and differences
with the frequentist approach.

Back to the Prior, Posterior etc.


The mode of the posterior distribution is called the maximum a posteriori (MAP) estimator while the mean
is of course E[θ | y] = ∫_θ θ π(θ | y) dθ. The posterior predictive distribution is the distribution of a new,
as yet unseen data-point, ynew:

p(ynew) := p(ynew | y) = ∫_θ p(ynew, θ | y) dθ
                       = ∫_θ p(ynew | θ, y) π(θ | y) dθ
                       = ∫_θ p(ynew | θ) π(θ | y) dθ    (2)

where the final equality follows because the data are assumed i.i.d. given θ. As its name suggests, the
posterior predictive distribution can be used to predict new values of y but it also plays an important role
in model checking and selection as we shall see in Section 7.
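
In practice a draw from the posterior predictive distribution in (2) is obtained by composition: first draw θ from π(θ | y), then draw ynew from p(ynew | θ). The Python sketch below does this by simple Monte Carlo; the Beta posterior and binomial likelihood are illustrative assumptions (they match the conjugate setup of Example 1 below).

# Monte Carlo draws from the posterior predictive in (2):
# draw theta ~ pi(theta | y), then y_new ~ p(y_new | theta).
# The Beta(7, 2) posterior and Bin(5, theta) likelihood are assumptions
# chosen for illustration only.
import numpy as np

rng = np.random.default_rng(0)
alpha_post, beta_post = 7.0, 2.0                  # assumed posterior Beta(7, 2)
n_new = 5                                         # size of a future binomial draw

theta_draws = rng.beta(alpha_post, beta_post, size=100_000)
y_new_draws = rng.binomial(n_new, theta_draws)    # one y_new per theta draw

# empirical posterior predictive PMF over y_new = 0, ..., n_new
pmf = np.bincount(y_new_draws, minlength=n_new + 1) / len(y_new_draws)
print(pmf)
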
Much of Bayesian analysis is concerned with “understanding” the posterior π(θ | y). Note that

π(θ | y) ∝ π(θ)p(y | θ)

which is what we often work with in practice. Sometimes we can recognize the form of the posterior by
simply inspecting π(θ)p(y | θ). But typically we cannot recognize the posterior and cannot compute the
denominator in (1) either. In such cases approximate inference techniques such as MCMC are required. We
begin with a simple example.

Example 1 (A Beta Prior and Binomial Likelihood)


Let θ ∈ (0, 1) represent some unknown probability. We assume a Beta(α, β) prior so that

π(θ) = θ^{α−1}(1 − θ)^{β−1} / B(α, β),    0 < θ < 1.

We also assume that y | θ ∼ Bin(n, θ) so that p(y | θ) = \binom{n}{y} θ^y (1 − θ)^{n−y}, y = 0, . . . , n. The posterior then
satisfies

p(θ | y) ∝ π(θ)p(y | θ)
         = [θ^{α−1}(1 − θ)^{β−1} / B(α, β)] · \binom{n}{y} θ^y (1 − θ)^{n−y}
         ∝ θ^{α+y−1}(1 − θ)^{n−y+β−1}

which we recognize as the Beta(α + y, β + n − y) distribution! See Figure 20.1 for a numerical example and
a visualization of how the data and prior interact to produce the posterior distribution.
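
The conjugate update is a one-liner in code. The sketch below (illustrative, using scipy) reproduces the numbers behind Figure 20.1, with α = β = 2 and five successes observed in n = 5 trials.

# Beta-binomial conjugate update of Example 1: the posterior is Beta(alpha + y, beta + n - y).
from scipy import stats

alpha, beta, n, y = 2, 2, 5, 5
posterior = stats.beta(alpha + y, beta + n - y)        # Beta(7, 2)

post_mode = (alpha + y - 1) / (alpha + beta + n - 2)   # closed-form posterior mode
print(post_mode)                                       # 6/7 ~= 0.857, as in Figure 20.1
print(posterior.mean(), posterior.interval(0.95))      # posterior mean and 95% interval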

1.1 Conjugate Priors


Consider the following probabilistic model. The parameter vector θ has prior π( · ; α0 ) while the data
y = (y1 , . . . , yn ) is distributed as p(y | θ). As we saw earlier, the posterior distribution satisfies

p(θ | y) ∝ p(θ, y) = p(y | θ)π(θ; α0 ).

We say the prior π(θ; α) is a conjugate prior for the likelihood p(y | θ) if the posterior satisfies

p(θ | y) = π(θ; α(y))

so that the observations influence the posterior only via a parameter change α0 → α(y). In particular, the
form or type of the distribution is unchanged. In Example 1, for instance, we saw that the beta distribution is
conjugate for the binomial likelihood. Here are two further examples.

[Figure 20.1 Prior and posterior densities for α = β = 2 and n = x = 5, respectively. The dashed vertical
line shows the location of the posterior mode at θ = 6/7 ≈ 0.857.]

Example 2 (Conjugate Prior for Mean of a Normal Distribution)


Suppose θ ∼ N(µ0, γ0²) and p(yi | θ) = N(θ, σ²) for i = 1, . . . , n, where σ² is assumed known. In this case we
have α0 = (µ0, γ0²). If y = (y1, . . . , yn) we then have

p(θ | y) ∝ p(y | θ)π(θ; α0)
         ∝ exp(−(θ − µ0)²/(2γ0²)) ∏_{i=1}^{n} exp(−(yi − θ)²/(2σ²))
         ∝ exp(−(θ − µ1)²/(2γ1²))

where

γ1^{−2} := γ0^{−2} + nσ^{−2}    and    µ1 := γ1² (µ0 γ0^{−2} + ∑_{i=1}^{n} yi σ^{−2}).

Of course we recognize p(θ | y) as the N(µ1, γ1²) distribution.
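
The update above amounts to two lines of arithmetic: precisions add, and the posterior mean is a precision-weighted average of the prior mean and the data. A minimal sketch follows; the hyperparameters and data are illustrative assumptions.

# Conjugate update for the mean of a normal with known variance (Example 2):
#   1/gamma1^2 = 1/gamma0^2 + n/sigma^2
#   mu1 = gamma1^2 * (mu0/gamma0^2 + sum(y)/sigma^2)
# All numerical values below are illustrative assumptions.
import numpy as np

mu0, gamma0_sq = 0.0, 4.0            # prior N(mu0, gamma0^2)
sigma_sq = 1.0                       # known observation variance
y = np.array([1.2, 0.8, 1.5, 0.9])   # observed data
n = len(y)

gamma1_sq = 1.0 / (1.0 / gamma0_sq + n / sigma_sq)
mu1 = gamma1_sq * (mu0 / gamma0_sq + y.sum() / sigma_sq)

print(mu1, gamma1_sq)                # posterior is N(mu1, gamma1^2)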

Example 3 (Conjugate Prior for Mean and Variance of a Normal Distribution)


Suppose that p(yi | θ) = N(µ, σ 2 ) for i = 1, . . . , n and let y := (y1 , . . . , yn ). We now assume µ and σ 2 are
unknown so that θ = (µ, σ 2 ). We assume a joint prior of the form
π(µ, σ²) = π(µ | σ²)π(σ²)
         = N(µ0, σ²/κ0) × Inv-χ²(ν0, σ0²)
         ∝ σ^{−1} (σ²)^{−(ν0/2+1)} exp(−(1/(2σ²))[ν0σ0² + κ0(µ0 − µ)²])

which we recognize as the N-Inv-χ²(µ0, σ0²/κ0, ν0, σ0²) PDF. Note that µ and σ² are not independent under
this joint prior. It is a straightforward task to show that multiplying this prior by the normal likelihood
yields a N-Inv-χ² distribution.
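
This prior is easy to simulate hierarchically: draw σ² from its scaled inverse-χ² marginal and then µ | σ² from the conditional normal. The sketch below does so and checks the dependence between µ and σ² empirically; all hyperparameter values are illustrative assumptions.

# Simulating from the N-Inv-chi^2(mu0, sigma0^2/kappa0, nu0, sigma0^2) prior of
# Example 3: sigma^2 ~ Inv-chi^2(nu0, sigma0^2), then mu | sigma^2 ~ N(mu0, sigma^2/kappa0).
# Hyperparameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
mu0, kappa0, nu0, sigma0_sq = 0.0, 1.0, 5.0, 2.0
size = 100_000

sigma_sq = nu0 * sigma0_sq / rng.chisquare(nu0, size)   # scaled inverse-chi^2 draws
mu = rng.normal(mu0, np.sqrt(sigma_sq / kappa0))        # mu | sigma^2

# mu is more spread out when sigma^2 is large, i.e. mu and sigma^2 are dependent
big = sigma_sq > np.median(sigma_sq)
print(mu[big].std(), mu[~big].std())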

1.1.1 The Exponential Family of Distributions


The canonical form of the exponential family distribution is

p(y | θ) = h(y) e^{θ^⊤ u(y) − ψ(θ)}    (3)
where θ ∈ R^m is a parameter vector and u(y) = (u1(y), . . . , um(y)) is the vector of sufficient statistics.
The exponential family includes the Normal, Gamma, Beta, Poisson, Dirichlet, Wishart and Multinomial distri-
butions as special cases. The exponential family is also essentially the only family of distributions with a
non-trivial conjugate prior. This conjugate prior takes the form
π(θ; α, γ) ∝ e^{θ^⊤ α − γψ(θ)}.    (4)

Combining (3) and (4) we see the posterior takes the form
p(θ | y, α, γ) ∝ e^{θ^⊤ u(y) − ψ(θ)} e^{θ^⊤ α − γψ(θ)} = e^{θ^⊤(α + u(y)) − (γ+1)ψ(θ)}
               ∝ π(θ; α + u(y), γ + 1)

which (as claimed) has the same form as the prior.
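
A familiar special case may help make the update α → α + u(y), γ → γ + 1 concrete. For a Poisson observation with rate λ, writing θ = log λ gives h(y) = 1/y!, u(y) = y and ψ(θ) = e^θ, and the conjugate prior (4) corresponds to a Gamma(α, rate γ) prior on λ. The Python sketch below (illustrative hyperparameters and data) applies the update once per observation, yielding the usual Gamma(α + Σ yi, rate γ + n) posterior.

# Poisson-Gamma conjugacy as an instance of the exponential-family update
# alpha -> alpha + u(y), gamma -> gamma + 1, applied once per observation.
# Prior hyperparameters and data are illustrative assumptions.
import numpy as np
from scipy import stats

alpha, gamma = 2.0, 1.0         # conjugate prior: Gamma(shape=alpha, rate=gamma) on the rate
y = np.array([3, 1, 4, 2, 5])   # observed Poisson counts
n = len(y)

alpha_post = alpha + y.sum()    # alpha + sum of sufficient statistics u(y_i) = y_i
gamma_post = gamma + n          # gamma incremented once per data point

posterior = stats.gamma(a=alpha_post, scale=1.0 / gamma_post)
print(posterior.mean())         # posterior mean of the Poisson rate: 17/6 ~= 2.83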

1.2 Selecting Priors More Generally


When possible, conjugate priors are often chosen for tractability reasons, but such priors are not available for
many models of interest, e.g. hierarchical models. Moreover, tractability of the posterior is no longer a real
concern given modern computing power and the availability of sophisticated inference techniques such as
Gibbs sampling, Hamiltonian Monte-Carlo and variational Bayes, among others.
Selecting an appropriate prior is a key component of Bayesian modeling. With only a finite amount of data,
the prior can have a very large influence on the posterior and so it’s important to be aware of this and to
understand the sensitivity of posterior inference to the choice of prior. Ideally, relevant information should
be captured in the prior but (especially when there is plenty of data) we want the data rather than the
prior to dominate the posterior. It's also important that the prior not place too little weight on regions that
appear to be supported by the data; if it does, this suggests an inappropriate choice of prior or model.
In practice then it’s common to use non-informative priors or only weakly informative priors to limit the
influence of the prior on the posterior distribution.
Finally we note a common misconception regarding Bayesian statistics. This is that the only advantage
of the Bayesian approach over the frequentist approach is that the choice of prior allows us to express our
prior beliefs on quantities of interest. In fact there are many other more important advantages including
modeling flexibility, exact inference rather than asymptotic inference, the ability to estimate functions of
any parameters without “plugging” in maximum likelihood estimates (MLEs), more accurate estimates of
parameter uncertainty, etc. Of course there are disadvantages to the Bayesian approach as well. These include
the subjectivity induced by the choice of prior as well as high computational costs. But these disadvantages don't
seem to be too significant given the widespread practice of using non-informative (or only weakly informative)
priors and the availability of modern computing power.
For further information on prior selection we refer to Robert [38] and Gelman et al. [19] who have a lot more
to say on this important issue.

1.3 The Bernstein-von Mises Theorem


Despite differences between the Bayesian and frequentist approaches we do have the following important and
satisfying result.
Theorem 2 (Bernstein-von Mises) Under suitable assumptions and for sufficiently large sample sizes, the
posterior distribution of θ is approximately normal with mean equal to the true value of θ and variance equal
to the inverse of the Fisher information matrix.
The Bernstein-von Mises Theorem implies that Bayesian and MLE estimators have the same large sample
properties. This is not really surprising since the influence of the prior should diminish with increasing sample
sizes. But this is a theoretical result and we often don’t have “large” sample sizes (at least relative to the
number of parameters) so it’s quite possible for the posterior to be (very) non-normal and even multi-modal.
Moreover, the “suitable assumptions” mentioned in the theorem (see Gelman et al. [19] for these assumptions
as well as a more detailed discussion of the Bernstein-von Mises Theorem) don't hold in many interesting models,
including for example models where the number of parameters grows with the number of data-points. Such
models are used today in many applications of interest.
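
For intuition, the approximation can be checked numerically in the Beta-binomial model of Example 1, where the posterior is available in closed form and the Fisher information of a Bin(n, θ) observation is n/(θ(1 − θ)). The sketch below (illustrative numbers) compares the exact posterior with a normal density centred at the MLE with variance 1/I(θ̂).

# Numerical check of the Bernstein-von Mises approximation in the Beta-binomial
# model: exact Beta posterior vs. N(theta_hat, theta_hat*(1 - theta_hat)/n).
# The sample size and counts below are illustrative assumptions.
import numpy as np
from scipy import stats

alpha, beta = 2, 2
n, y = 500, 310                                   # a fairly large sample
theta_hat = y / n                                 # MLE
exact = stats.beta(alpha + y, beta + n - y)       # exact posterior
approx = stats.norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))

grid = np.linspace(0.55, 0.70, 6)
print(exact.pdf(grid))                            # the two densities are close
print(approx.pdf(grid))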

1.4 The Sampling Problem


Much of Bayesian inference is concerned with understanding (or simulating from) the posterior distribution

π(θ | y) ∝ π(θ)p(y | θ)

without knowing the constant of proportionality given by the denominator in (1). This can be viewed as
a specific instance of a more general sampling problem. Specifically, suppose we are given a distribution
function

p(z) = (1/Zp) p̃(z)    (5)
where p̃(z) ≥ 0 is easy to compute but Zp is (too) hard to compute. This very important situation arises in
several contexts:
1. In Bayesian models where p̃(θ) := p(y | θ)π(θ) is easy to compute but Zp := p(y) = ∫_θ π(θ)p(y | θ) dθ,
i.e. the denominator in (1), can be very difficult or impossible to compute. In this case Zp is often
referred to as the marginal likelihood or evidence.
2. In models from statistical physics, e.g. the Ising model, we only know p̃(z) = e−E(z) , where E(z) is an
“energy” function. (The Ising model is an example of a Markov network or an undirected graphical
model.) In this case Zp is often known as the partition function.
3. Dealing with evidence in directed graphical models such as belief networks or directed acyclic graphs
(DAGs).
The sampling problem is the problem of simulating from p(z) in (5) without knowing the constant Zp. While
the well-known acceptance-rejection algorithm (a standard Monte-Carlo algorithm covered in just about every
introduction to Monte-Carlo simulation; it is described in Exercise 5 below) can be used, it is very inefficient
in high dimensions and an alternative approach is required. That alternative approach is Markov Chain
Monte-Carlo (MCMC).
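
The reason MCMC sidesteps Zp is visible in even the simplest Metropolis-type algorithm: the acceptance probability only involves the ratio p̃(z′)/p̃(z), so the unknown constant cancels. A minimal random-walk Metropolis sketch follows; the target density and step size are placeholder assumptions.

# A minimal random-walk Metropolis sketch: only the unnormalized density
# p_tilde is ever evaluated, so the constant Z_p in (5) is never needed.
# The target (an unnormalized 1-d bimodal density) and the proposal step
# size are illustrative assumptions.
import numpy as np

def p_tilde(z):
    # e.g. exp(-E(z)) for some energy function E
    return np.exp(-0.5 * z**2) + 0.5 * np.exp(-0.5 * (z - 4.0)**2)

rng = np.random.default_rng(0)
z, samples = 0.0, []
for _ in range(50_000):
    z_prop = z + rng.normal(scale=1.0)                     # symmetric proposal
    accept_prob = min(1.0, p_tilde(z_prop) / p_tilde(z))   # Z_p cancels in this ratio
    if rng.uniform() < accept_prob:
        z = z_prop
    samples.append(z)

print(np.mean(samples), np.std(samples))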

1.5 Exercises
1. (Interpreting the Prior )
How can we interpret the prior distribution in Example 1?

2. (Prior Marginal and the Beta-Bernoulli Distribution )


Before seeing any data what is the marginal distribution of x in Example 1? (This is the prior marginal
distribution and in this case the prior marginal is known as the beta-Bernoulli distribution.)

3. (Conjugate Priors)
(a) Consider the following form of the Normal distribution
p(y | µ, κ) = (κ^{1/2} / √(2π)) exp(−κ(y − µ)²/2).
where κ (the inverse of the variance) is called the precision parameter. Show that this distribution can
be written as an Exponential Family distribution of the form
p(y | θ1, θ2) = h(y) exp(−θ1 y²/2 + θ2 y − ψ(θ1, θ2))
Characterize h(y), (θ1 , θ2 ) and the function ψ(θ1 , θ2 ).

(b) Recall that the generic conjugate prior for an exponential family distribution is given by
π(θ1, θ2) ∝ exp(a1 θ1 + a2 θ2 − γψ(θ1, θ2))    (6)
Substitute your expression for (θ1 , θ2 ) from part (a) to show that the conjugate prior for the
Normal model is of the form
π(κ | a0, b0) · π(µ | µ0, γκ) ∝ κ^{a0−1} e^{−κ/b0} · κ^{1/2} e^{−(γκ/2)(µ−µ0)²},    (7)

where the first factor is the Gamma(κ | a0, b0) kernel and the second is the Normal(µ | µ0, γκ) kernel.

Your expressions for a0 , b0 and µ0 should be in terms of γ, a1 and a2 . (This prior is known as
the Normal-Gamma prior.)


(c) Suppose (µ, κ) ∼ Normal-Gamma(a0, b0, µ0, γ), and the likelihood of the data y is, as in part (a),
p(y | µ, κ) = (κ^{1/2} / √(2π)) exp(−κ(y − µ)²/2). Compute the posterior distribution after you see n IID
samples {y1, . . . , yn}. (The results of parts (a) and (b) can help simplify the calculations.)

4. (Dirichlet-Multinomial Conjugate Pair)


A K-dimensional Dirichlet random vector θ lies on the simplex in R^K and has density

p(θ) = [Γ(∑_{k=1}^{K} φk) / ∏_{k=1}^{K} Γ(φk)] θ1^{φ1−1} · · · θK^{φK−1}    (8)

where φ := (φ1, . . . , φK) is a strictly positive parameter vector.


(a) What is E[θj ]?
(b) Suppose that y1 , . . . , yn are IID Multinomial(n, θ) given θ. What is p(θ | y, φ), i.e. the distribu-
tion of θ | (y, φ) where y = (y1 , . . . , yn )?
(c) What is E[θj | y, φ]? Now write it as a convex combination of the prior mean from part (a) and
the corresponding maximum-likelihood estimator.

Remark 1 The Dirichlet prior and multinomial likelihood form a conjugate pair that includes the
beta-Binomial as a special case. Your answer to part (c) explains why E[θj | y, φ] is sometimes called
a shrinkage estimator of θj .

5. (The Acceptance-Rejection Algorithm)


The acceptance-rejection (AR) algorithm is a standard algorithm for simulating a random variable
X with density f(·) when it is hard to simulate X directly. It proceeds as follows. Let Y be a random
variable with density g(·) and suppose (i) it is easy to simulate a value of Y and (ii) there exists a
constant M such that f(x) ≤ M g(x) for all x in the support of f. Then the following pseudo-code
describes how the AR algorithm proceeds. (A runnable sketch of the algorithm appears after this exercise.)

The Acceptance-Rejection Algorithm

generate Y with PDF g(·)
generate U ∼ U(0, 1)
while U > f(Y) / (M g(Y)):
    generate Y with PDF g(·)
    generate U ∼ U(0, 1)
set X = Y

(a) Prove that this algorithm works, i.e. terminates with a random variable X having the correct
distribution.
(b) Why must we have M ≥ 1?
(c) On average how many samples of Y are required until one is accepted?
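
For readers who want to experiment, here is a runnable version of the pseudo-code above; the target f (a Beta(2, 5) density), the proposal g (uniform on (0, 1)) and the bound M are illustrative assumptions.

# A runnable sketch of the acceptance-rejection algorithm in Exercise 5.
# Target f, proposal g and the constant M are illustrative assumptions;
# M must satisfy f(x) <= M * g(x) on the support of f.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f = stats.beta(2, 5).pdf          # target density f
g = stats.uniform(0, 1).pdf       # proposal density g (easy to sample from)
M = 2.5                           # max of f is about 2.46, so f <= M * g on (0, 1)

def sample_one():
    while True:
        y = rng.uniform(0.0, 1.0)        # generate Y with PDF g
        u = rng.uniform(0.0, 1.0)        # generate U ~ U(0, 1)
        if u <= f(y) / (M * g(y)):       # accept Y with probability f(Y)/(M g(Y))
            return y

draws = np.array([sample_one() for _ in range(10_000)])
print(draws.mean())                      # should be close to E[Beta(2, 5)] = 2/7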
