
Bayesian inference

Petteri Piiroinen

University of Helsinki

Spring 2020



1.1.3 Fully Bayesian model for thumbtack

For the thumbtack example:

f_Θ(θ) = 1 for all θ ∈ (0, 1)
f_{Y|Θ}(y|θ) ∝ θ^y (1 − θ)^{n−y} for fixed y, as a function of θ (i.e. as the likelihood)

Hence

f_{Θ|Y}(θ|y) ∝ θ^y (1 − θ)^{n−y} for fixed y.



Recognizing the posterior as a beta distribution

From probability we know that a random variable X ∼ Beta(α, β) has a beta distribution with parameters α and β if

f_X(x) ∝ x^{α−1} (1 − x)^{β−1}

for x ∈ (0, 1) and f_X(x) = 0 otherwise.


The normalization constant is the reciprocal of the beta function

B(α, β) = ∫₀¹ x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β) / Γ(α + β)
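As a quick numerical sanity check (an addition, not from the original slides; a1 and b1 are just example values), base R's beta() and gamma() reproduce this identity, and integrate() confirms that dividing by B(α, β) normalizes the density:

a1 <- 17; b1 <- 15                         # example values (the posterior parameters used later)
beta(a1, b1)                               # built-in beta function B(a1, b1)
gamma(a1) * gamma(b1) / gamma(a1 + b1)     # same value via the gamma-function identity
integrate(function(x) x^(a1 - 1) * (1 - x)^(b1 - 1) / beta(a1, b1), 0, 1)
# the normalized density should integrate to (essentially) 1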



Recognizing the posterior as a beta distribution

Now f_{Θ|Y}(θ|y) ∝ θ^y (1 − θ)^{n−y} for fixed y, so:

Θ | (Y = y) ∼ Beta(y + 1, n + 1 − y)



1.1.3 Fully Bayesian model for thumbtack
[Figure: densities f(θ) of the uniform prior U(0,1) and the posterior Beta(17,15) plotted over θ ∈ [0, 1]; produced by the R code on the next slide.]


1.1.3 Fully Bayesian model for thumbtack
par(mar = c(4, 4, .1, .1))
y <- 16
n <- 30
theta <- seq(0, 1, by = .01)  # create tight grid for plotting
alpha <- y + 1
beta  <- n - y + 1
plot(theta, dbeta(theta, alpha, beta),
     lwd = 2, col = 'green', type = 'l',
     xlab = expression(theta),
     ylab = expression(paste('f(', theta, ')')))
lines(theta, dunif(theta),
      lwd = 2, col = 'blue', type = 'l')
legend('topright', inset = .02,
       legend = c('U(0,1)',
                  paste0('Beta(', alpha, ',', beta, ')')),
       col = c('blue', 'green'), lwd = 2)
Summarizing the posterior distribution

From the full posterior distribution, we can easily compute the probabilities we were interested in:

P(Θ > 0.5 | Y = y) ≈ 0.64, since
1 - pbeta(0.5, alpha, beta)

## [1] 0.6399499



Summarizing the posterior distribution

P(0.4 < Θ < 0.6 | Y = y) ≈ 0.71, since


pbeta(0.6, alpha, beta) - pbeta(0.4, alpha, beta)

## [1] 0.7128906
P(0.2 < Θ < 0.8 | Y = y) ≈ 0.9996, since
pbeta(0.8, alpha, beta) - pbeta(0.2, alpha, beta)

## [1] 0.9996158
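The same summaries can also be approximated by simulation; a minimal sketch (an addition, not from the original slides; theta_draws is an ad hoc name) that draws from the Beta(17, 15) posterior with rbeta() and estimates the probabilities as sample proportions:

set.seed(1)                                   # for reproducibility
alpha <- 17; beta <- 15                       # posterior parameters from above
theta_draws <- rbeta(1e5, alpha, beta)        # 100 000 posterior draws
mean(theta_draws > 0.5)                       # should be close to 0.64
mean(theta_draws > 0.4 & theta_draws < 0.6)   # should be close to 0.71
mean(theta_draws > 0.2 & theta_draws < 0.8)   # should be close to 0.9996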



Summarizing with point estimates

We can also summarize the posterior distribution with a point estimate.
In Bayesian statistics the posterior mean is a widely used point estimate because of its optimality in the sense of mean squared error:

E(Θ | Y = y) = α / (α + β) = (y + 1) / ((y + 1) + (n − y + 1)) = 17/32 ≈ 0.53125

The MLE was

θ̂(y) = y/n = 16/30 ≈ 0.5333333

These are very close: the uniform prior acts roughly like one extra 'pseudo-observation' of each outcome, so the two estimates differ only slightly.
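A quick check of these numbers in R (a small addition, not in the original slides):

y <- 16; n <- 30
(y + 1) / (n + 2)   # posterior mean E(Theta | Y = y)
## [1] 0.53125
y / n               # maximum likelihood estimate
## [1] 0.5333333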



1.2 Components of Bayesian inference

Let's generalize the main concepts of the Bayesian belief updating process
(this is a recap of the statistical model concepts from the Statistical inference courses, with the corresponding changes).
Consider a slightly more general situation:
we have observed a data set y = (y_1, . . . , y_n) with n observations
we model the observed data set as an observed value of a random vector: y = Y(ω_act), where Y = (Y_1, . . . , Y_n)



Parametric inference

The distribution of Y is known once a parameter θ = (θ_1, . . . , θ_d) ∈ Ω is known.
The set Ω is called the parameter space.
Unlike in frequentist inference, we express the distribution of Y as the conditional distribution given the parameter,

y ↦ f_{Y|Θ}(y|θ),   θ ∈ Ω

(since in Bayesian inference the parameter is considered random).

Just like in the frequentist case the inference is reduced to inference about the parameter, but now it means finding out the distribution of the unknown parameter Θ.
This simplifies the inference process significantly, because we can limit ourselves to vector spaces instead of function spaces.



Sampling distribution / likelihood function

Definition.
The conditional distribution of the data set given the parameter,

y ↦ f_{Y|Θ}(y|θ),

for every θ ∈ Ω, is called a sampling distribution. The mapping

θ ↦ f_{Y|Θ}(y|θ)

for every data set y ∈ R^n is a likelihood function.

These terms are used interchangeably in practice (and also on this course).
The set of sampling distributions is also usually called a statistical model.



Sampling distribution / likelihood function

Because our data set is a vector, in the general case the structure of the sampling distribution can be quite complicated.
We can simplify things by assuming that our observations are (conditionally) independent given the value of the parameter Θ:

Y_1, . . . , Y_n ⊥⊥ | Θ.

This implies that

f_{Y|Θ}(y|θ) = ∏_{i=1}^{n} f_{Y_i|Θ}(y_i|θ).



Sampling distribution / likelihood function (i.i.d.)

The situation is further simplified if our observations follow the same distribution. This situation is encountered quite often in this course, at least in the simplest examples.
We then say that the random variables are independent and identically distributed (i.i.d.). In this case we have

f_{Y|Θ}(y|θ) = ∏_{i=1}^{n} f(y_i|θ).
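As a small illustration of this product form (an added sketch, not from the original slides; the ordering of the tosses vector is made up, only the counts match the thumbtack data), the thumbtack data can be coded as n individual Bernoulli tosses, and the i.i.d. likelihood is the product of the individual densities:

theta  <- 0.5                               # an arbitrary parameter value
tosses <- c(rep(1, 16), rep(0, 14))         # 16 'point up' results out of n = 30 tosses
prod(dbinom(tosses, size = 1, prob = theta))             # i.i.d. likelihood as a product
sum(dbinom(tosses, size = 1, prob = theta, log = TRUE))  # numerically safer: log-likelihood as a sum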



Sampling distribution / likelihood function

In some cases, such as in our thumbtack tossing example, the form of the sampling distribution (a binomial distribution in this case) follows quite naturally from the structure of the experimental situation.
Other distributions that often arise this way (a few of the corresponding R density functions are sketched below):
multinomial distribution (more than two possible outcomes)
normal distribution (sums of independent random variables)
Poisson distribution (occurrences of independent events)
exponential distribution (waiting times or lifespans).
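For reference (an added note, not from the original slides; the numerical arguments are arbitrary), each of these families has a standard density or pmf function in base R:

dmultinom(c(3, 5, 2), prob = c(0.2, 0.5, 0.3))  # multinomial pmf
dnorm(1.2, mean = 0, sd = 2)                    # normal density
dpois(4, lambda = 2.5)                          # Poisson pmf
dexp(0.7, rate = 1.5)                           # exponential density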



Sampling distribution / likelihood function

More complex situations: physical models, imaging problems, . . .

In these more complex situations we cannot usually use any of the simple models directly:
we build hierarchical models out of the basic distributions.



Prior distribution

Definition.
The marginal distribution θ ↦ f_Θ(θ) of the parameter is called a prior distribution.

Prior comes from the Latin for 'before': the prior distribution describes our beliefs about the likely values of the parameter θ before observing any data.



Non-informative prior

we should choose as vague a prior distribution as possible
if we have no strong beliefs about the possible values of the parameter
or if we do not want to influence our results
these kinds of priors are called uninformative priors



Non-informative prior

what does 'vague' mean?

one option is a uniform prior (with problems on unbounded parameter spaces)



Informative prior

Sometimes we want to let our prior knowledge influence our posterior distribution.
This kind of prior distribution is called an informative prior.
For example, in imaging problems you typically want to enforce prior knowledge; we have a strong prior belief about the structure of some parameters.



Hyperpriors

The prior distribution for θ is usually a parametric distribution;
its parameters φ = (φ_1, . . . , φ_k) are called hyperparameters.
We can therefore also denote the prior distribution as

f_{Θ|Φ}(θ|φ),

but often the notation is simplified by leaving out the hyperparameters.



Bayesian model

To specify the fully Bayesian probability model, we have to specify

Y|Θ, the sampling distribution,

and

Θ, the prior distribution.

Together (via the chain rule) these specify

(Θ, Y), the joint distribution,

which we usually express as

f_{Θ,Y}(θ, y) = f_Θ(θ) f_{Y|Θ}(y|θ).



Posterior distribution

Definition.
The conditional distribution Θ|Y of the parameter given the data is called a
posterior distribution.

Posterior comes from the Latin for 'after': the posterior distribution describes our beliefs about the probable values of the parameter after we have observed the data.
The posterior distribution Θ|Y is the ultimate goal of Bayesian parametric inference:
it contains all the information the data carries (given that the model and the prior are 'correctly chosen')



Posterior distribution

In principle, the posterior distribution is computed using Bayes' theorem:

f_{Θ|Y}(θ|y) = f_{Θ,Y}(θ, y) / f_Y(y) = f_Θ(θ) f_{Y|Θ}(y|θ) / f_Y(y).

In practice, we almost always compute the unnormalized density of the posterior as a product of the sampling and prior distributions,

f_{Θ|Y}(θ|y) ∝ f_Θ(θ) f_{Y|Θ}(y|θ),

and afterwards deduce the missing normalizing constant.
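A minimal sketch of this 'unnormalized density first, normalization afterwards' recipe for the thumbtack example (an added illustration, not in the original slides; variable names are ad hoc): evaluate prior times likelihood on a grid of θ values, normalize numerically, and compare with the exact Beta(17, 15) posterior.

y <- 16; n <- 30
theta  <- seq(0, 1, by = 0.001)                    # grid over the parameter space
unnorm <- dunif(theta) * dbinom(y, n, theta)       # prior * likelihood, unnormalized
post   <- unnorm / sum(unnorm * 0.001)             # normalize numerically (Riemann sum)
max(abs(post - dbeta(theta, y + 1, n - y + 1)))    # should be very small (numerical error only)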



Marginal likelihood

The normalizing constant f_Y(y) is called the marginal likelihood or evidence.
It is obtained by marginalizing the joint distribution (from (Θ, Y) to Y).
For continuous distributions

f_Y(y) = ∫ f_Θ(θ) f_{Y|Θ}(y|θ) dθ

For discrete distributions

f_Y(y) = Σ_{θ∈Ω} f_Θ(θ) f_{Y|Θ}(y|θ)
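For the thumbtack model with the uniform prior this integral can be checked directly with integrate() (an added illustration, not in the original slides); with a uniform prior on θ the marginal pmf of Y is in fact uniform on 0, 1, . . . , n, so the value should be 1/(n + 1) = 1/31:

y <- 16; n <- 30
marg <- integrate(function(theta) dunif(theta) * dbinom(y, n, theta), 0, 1)
marg$value
## [1] 0.03225806
1 / (n + 1)
## [1] 0.03225806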



Predictive distribution

In Bayesian Data Analysis (Gelman et al. 2013) the marginal likelihood is called a prior predictive distribution:
it represents our beliefs about the distribution of the data before any observations are made
it is a weighted 'average' over all θ ∈ Ω, weighted with the prior distribution,

f_Y(y) = E[g(y, Θ)]    (1)

when we denote the transformation

g(y, θ) := f_{Y|Θ}(y|θ)
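Representation (1) also suggests a simple Monte Carlo approximation (an added sketch, not from the original slides): draw θ from the prior and average the likelihood values g(y, θ); for the thumbtack example this should agree with the integrate() result above.

set.seed(2)
y <- 16; n <- 30
theta_prior <- runif(1e5)         # draws from the uniform prior
mean(dbinom(y, n, theta_prior))   # Monte Carlo estimate of E g(y, Theta); close to 1/31 ~ 0.0323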



1.3 Prediction

We started with estimation of the parameter and inference about it.
Next we consider predicting new observations.



1.3.1 Motivating thumbtack example, part II

Assume we have tossed a thumbtack n = 30 times
and it landed point up y = 16 times
what if we are interested in predicting new observations?
based on these: what is our predictive distribution for the number of successes, if we throw the same thumbtack m = 10 more times?



1.3.1 Motivating thumbtack example, part II

Because the thumbtack stays the same, it makes sense to use the same binomial model,

Ỹ ∼ Bin(m, θ),

and to model the old and the new observations as independent given the parameter:

Ỹ, Y ⊥⊥ | Θ.



1.3.1 Motivating thumbtack example, part II

An intuitive (naive) way to obtain a pmf for Ỹ would be just to plug a point estimate, such as the ML estimate θ̂(y), in as the parameter value of the probability mass function of the new observations:

f_{Ỹ|Θ}(ỹ | θ̂(y)).

This has the same problem as with parameter estimation: what if we had again observed the data y = 0 with n = 30?
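For concreteness (an added sketch, not in the original slides): the plug-in predictive pmf for the observed data, and the degenerate pmf we would get from y = 0, which declares every future 'point up' toss impossible.

m <- 10
dbinom(0:m, m, 16/30)   # plug-in predictive pmf with the ML estimate 16/30
dbinom(0:m, m, 0/30)    # with y = 0 the plug-in pmf puts all its mass on 0 successes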



1.3.1 Motivating thumbtack example, part II

The proper Bayesian predictive distribution is the distribution of Ỹ|Y, the distribution of the new observations given the observed data.
This is denoted by

f_{Ỹ|Y}(ỹ|y).

Note that the parameter θ is absent, so how can we derive this conditional pmf now?
Before that, let's consider the question more generally.



1.3.2 Posterior predictive distribution
Let's consider a general case before computing the actual thumbtack example.
Assume we have observations

Y = (Y_1, . . . , Y_n)

with a sampling distribution f_{Y|Θ}(y|θ), where Θ ∈ Ω is a random parameter vector.
We want to predict m new observations

Ỹ = (Ỹ_1, . . . , Ỹ_m)

from the same random process.

Definition.
The conditional distribution Ỹ|Y of the new observations given the data is called a posterior predictive distribution.
1.3.2 Posterior predictive distribution

Lemma 1.
If in addition the new observations are independent of the observed data given the parameter,

(Ỹ ⊥⊥ Y) | Θ,

then (for densities)

f_{Ỹ|Y}(ỹ|y) = ∫ f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y) dθ    (2)

Note: we will often write Ỹ, Y ⊥⊥ | Θ instead, since that notation works for more than two random vectors as well.



Derivation of the formula (2)

Suppose we had the joint distribution of (Θ, Y, Ỹ).
We could then condition on the data to obtain (Θ, Ỹ)|Y,
and marginalize to get the distribution of Ỹ|Y.
But any similar strategy would work, so we opt for a more convenient route.



Derivation of the formula (2)

So let's do the following:

We will first specify Ỹ | Y, Θ
Then with the posterior distribution Θ|Y and the (conditional) chain rule we get Ỹ, Θ | Y
And we (conditionally) marginalize to get Ỹ | Y



Derivation of the formula (2)

The (conditional) chain rule (for densities) is

f_{X,Y|Z}(x, y|z) = f_{X|Y,Z}(x|y, z) f_{Y|Z}(y|z)

The (conditional) marginalization (for densities) is

f_{X|Z}(x|z) = ∫ f_{X,Y|Z}(x, y|z) dy



Derivation of the formula (2)

The idea (on the blackboard) is:

Ỹ, Y ⊥⊥ | Θ implies that Ỹ | Y, Θ = Ỹ | Θ (as conditional distributions)
this and the conditional chain rule give

f_{Ỹ,Θ|Y}(ỹ, θ|y) = f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y)

conditional marginalization gives the claim.



1.3.1 Motivating thumbtack example, part II

Now that we have a formula for the posterior predictive distribution, let's compute it for the thumbtack case:

f_{Ỹ|Y}(ỹ|y) = ∫ f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y) dθ = (m choose ỹ) B(ỹ + α₁, m + β₁ − ỹ) / B(α₁, β₁)

This means:

Ỹ | Y ∼ Beta-bin(m, α₁, β₁),

where α₁ = y + 1 and β₁ = n − y + 1 are the parameters of the posterior distribution.
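A small numerical illustration of this result (an added sketch, not in the original slides; dbetabin is an ad hoc helper name): the beta-binomial pmf written with base R's choose() and beta() sums to one and matches formula (2) evaluated by numerical integration.

y <- 16; n <- 30; m <- 10
a1 <- y + 1; b1 <- n - y + 1                  # posterior parameters
dbetabin <- function(yt) choose(m, yt) * beta(yt + a1, m + b1 - yt) / beta(a1, b1)
sum(dbetabin(0:m))                            # a proper pmf: the probabilities sum to 1
# formula (2) by numerical integration, e.g. for 5 new successes, agrees with the closed form:
integrate(function(th) dbinom(5, m, th) * dbeta(th, a1, b1), 0, 1)$value
dbetabin(5)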



Two ways of computing the previous integral

1. Explicitly recognize a familiar integral
2. Integrate as a statistician (i.e. by recognizing unnormalized density functions)
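A sketch of the second route (added here for completeness; the slides leave the computation to the lecture): write the integrand out and recognize an unnormalized beta density,

f_{Ỹ|Y}(ỹ|y) = ∫₀¹ (m choose ỹ) θ^ỹ (1 − θ)^{m−ỹ} · θ^{α₁−1} (1 − θ)^{β₁−1} / B(α₁, β₁) dθ
             = (m choose ỹ) / B(α₁, β₁) · ∫₀¹ θ^{ỹ+α₁−1} (1 − θ)^{m−ỹ+β₁−1} dθ
             = (m choose ỹ) B(ỹ + α₁, m − ỹ + β₁) / B(α₁, β₁),

since the last integrand is an unnormalized Beta(ỹ + α₁, m − ỹ + β₁) density whose integral over (0, 1) is, by definition, the beta function B(ỹ + α₁, m − ỹ + β₁).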



Prior predictive distribution and posterior predictive distribution
Recall that we had a representation of the prior predictive distribution as

f_Y(y) = E[g(y, Θ)]

when we denote the transformation

g(y, θ) := f_{Y|Θ}(y|θ).

We have a similar looking representation for the posterior predictive distribution,

f_{Ỹ|Y}(ỹ|y) = E[ g₂(ỹ, Θ) | Y = y ],

when we denote the transformation

g₂(ỹ, θ) := f_{Ỹ|Θ}(ỹ|θ).
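This conditional-expectation form again suggests a Monte Carlo approximation (an added sketch, not in the original slides): draw θ from the posterior and average g₂(ỹ, θ); the result should agree with the beta-binomial pmf computed earlier.

set.seed(3)
y <- 16; n <- 30; m <- 10
a1 <- y + 1; b1 <- n - y + 1
theta_post <- rbeta(1e5, a1, b1)   # draws from the posterior Theta | Y = y
mean(dbinom(5, m, theta_post))     # Monte Carlo estimate of P(Ytilde = 5 | Y = y)
choose(m, 5) * beta(5 + a1, m + b1 - 5) / beta(a1, b1)   # exact beta-binomial value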

