
Bayesian inference

Petteri Piiroinen

University of Helsinki

Spring 2020



1.1.3 Fully Bayesian model for thumbtack

For the thumbtack example:

f_Θ(θ) = 1 for all θ ∈ (0, 1)
f_{Y|Θ}(y|θ) ∝ θ^y (1 − θ)^{n−y} for fixed y, as a function of θ (i.e. as the likelihood)

Hence

f_{Θ|Y}(θ|y) ∝ θ^y (1 − θ)^{n−y} for fixed y.



Recognizing the posterior as a beta distribution

From probability we know that a random variable X ∼ Beta(α, β) has a beta distribution with parameters α and β if

f_X(x) ∝ x^{α−1} (1 − x)^{β−1}

for x ∈ (0, 1) and f_X(x) = 0 otherwise.


The normalization constant is the reciprocal of the beta function

B(α, β) = ∫₀¹ x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β) / Γ(α + β)
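As a quick numerical sanity check (an addition, not from the original slides; a1 and b1 are just example values), base R's beta() and gamma() reproduce this identity, and integrate() confirms that dividing by B(α, β) normalizes the density:

a1 <- 17; b1 <- 15                         # example values (the posterior parameters used later)
beta(a1, b1)                               # built-in beta function B(a1, b1)
gamma(a1) * gamma(b1) / gamma(a1 + b1)     # same value via the gamma-function identity
integrate(function(x) x^(a1 - 1) * (1 - x)^(b1 - 1) / beta(a1, b1), 0, 1)
# the normalized density should integrate to (essentially) 1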



Recognizing the posterior as a beta distribution

Now f_{Θ|Y}(θ|y) ∝ θ^y (1 − θ)^{n−y} for fixed y, so:

Θ | (Y = y) ∼ Beta(y + 1, n + 1 − y)



1.1.3 Fully Bayesian model for thumbtack
[Figure: densities f(θ) of the uniform prior U(0,1) and the posterior Beta(17,15) plotted over θ ∈ [0, 1]; produced by the R code on the next slide.]


1.1.3 Fully Bayesian model for thumbtack
par(mar = c(4, 4, .1, .1))
y <- 16
n <- 30
theta <- seq(0, 1, by = .01)  # create tight grid for plotting
alpha <- y + 1
beta  <- n - y + 1
plot(theta, dbeta(theta, alpha, beta),
     lwd = 2, col = 'green', type = 'l',
     xlab = expression(theta),
     ylab = expression(paste('f(', theta, ')')))
lines(theta, dunif(theta),
      lwd = 2, col = 'blue', type = 'l')
legend('topright', inset = .02,
       legend = c('U(0,1)',
                  paste0('Beta(', alpha, ',', beta, ')')),
       col = c('blue', 'green'), lwd = 2)
Summarizing the posterior distribution

From the full posterior distribution, we can easily compute the probabilities we were interested in:

P(Θ > 0.5 | Y = y) ≈ 0.64, since
1 - pbeta(0.5, alpha, beta)

## [1] 0.6399499



Summarizing the posterior distribution

P(0.4 < Θ < 0.6 | Y = y) ≈ 0.71, since


pbeta(0.6, alpha, beta) - pbeta(0.4, alpha, beta)

## [1] 0.7128906
P(0.2 < Θ < 0.8 | Y = y) ≈ 0.9996, since
pbeta(0.8, alpha, beta) - pbeta(0.2, alpha, beta)

## [1] 0.9996158
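The same summaries can also be approximated by simulation; a minimal sketch (an addition, not from the original slides; theta_draws is an ad hoc name) that draws from the Beta(17, 15) posterior with rbeta() and estimates the probabilities as sample proportions:

set.seed(1)                                   # for reproducibility
alpha <- 17; beta <- 15                       # posterior parameters from above
theta_draws <- rbeta(1e5, alpha, beta)        # 100 000 posterior draws
mean(theta_draws > 0.5)                       # should be close to 0.64
mean(theta_draws > 0.4 & theta_draws < 0.6)   # should be close to 0.71
mean(theta_draws > 0.2 & theta_draws < 0.8)   # should be close to 0.9996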



Summarizing with point estimates

We can also summarize the posterior distribution with a point estimate.
In Bayesian statistics the posterior mean is a widely used point estimate because of its optimality in the sense of mean squared error:

E(Θ | Y = y) = α / (α + β) = (y + 1) / ((y + 1) + (n − y + 1)) = 17/32 ≈ 0.53125

The MLE was

θ̂(y) = y/n = 16/30 ≈ 0.5333333

These are very close: the uniform prior acts roughly like one extra 'pseudo-observation' of each outcome, so the two estimates differ only slightly.
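A quick check of these numbers in R (a small addition, not in the original slides):

y <- 16; n <- 30
(y + 1) / (n + 2)   # posterior mean E(Theta | Y = y)
## [1] 0.53125
y / n               # maximum likelihood estimate
## [1] 0.5333333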



1.2 Components of Bayesian inference

Let's generalize the main concepts of the Bayesian belief updating process
(this is a recap of the statistical model concepts from the Statistical inference courses, with the corresponding changes).
Consider a slightly more general situation:
we have observed a data set y = (y_1, . . . , y_n) with n observations
we model the observed data set as an observed value of a random vector: y = Y(ω_act), where Y = (Y_1, . . . , Y_n)



Parametric inference

The distribution of Y is known once a parameter θ = (θ_1, . . . , θ_d) ∈ Ω is known.
The set Ω is called the parameter space.
Unlike in frequentist inference, we express the distribution of Y as the conditional distribution given the parameter,

y ↦ f_{Y|Θ}(y|θ),   θ ∈ Ω

(since in Bayesian inference the parameter is considered random).

Just like in the frequentist case the inference is reduced to inference about the parameter, but now it means finding out the distribution of the unknown parameter Θ.
This simplifies the inference process significantly, because we can limit ourselves to vector spaces instead of function spaces.



Sampling distribution / likelihood function

Definition.
The conditional distribution of the data set given the parameter,

y ↦ f_{Y|Θ}(y|θ),

for every θ ∈ Ω, is called a sampling distribution. The mapping

θ ↦ f_{Y|Θ}(y|θ)

for every data set y ∈ R^n is a likelihood function.

These terms are used interchangeably in practice (and also on this course).
The set of sampling distributions is also usually called a statistical model.



Sampling distribution / likelihood function

Because our data set is a vector, in the general case the structure of the sampling distribution can be quite complicated.
We can simplify things by assuming that our observations are (conditionally) independent given the value of the parameter Θ:

Y_1, . . . , Y_n ⊥⊥ | Θ.

This implies that

f_{Y|Θ}(y|θ) = ∏_{i=1}^{n} f_{Y_i|Θ}(y_i|θ).



Sampling distribution / likelihood function (i.i.d.)

The situation is further simplified if our observations follow the same distribution. This situation is encountered quite often in this course, at least in the simplest examples.
We then say that the random variables are independent and identically distributed (i.i.d.). In this case we have

f_{Y|Θ}(y|θ) = ∏_{i=1}^{n} f(y_i|θ).
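As a small illustration of this product form (an added sketch, not from the original slides; the ordering of the tosses vector is made up, only the counts match the thumbtack data), the thumbtack data can be coded as n individual Bernoulli tosses, and the i.i.d. likelihood is the product of the individual densities:

theta  <- 0.5                               # an arbitrary parameter value
tosses <- c(rep(1, 16), rep(0, 14))         # 16 'point up' results out of n = 30 tosses
prod(dbinom(tosses, size = 1, prob = theta))             # i.i.d. likelihood as a product
sum(dbinom(tosses, size = 1, prob = theta, log = TRUE))  # numerically safer: log-likelihood as a sum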



Sampling distribution / likelihood function

In some cases, such as in our thumbtack tossing example, the form of the sampling distribution (a binomial distribution in this case) follows quite naturally from the structure of the experimental situation.
Other distributions that often arise this way (a few of the corresponding R density functions are sketched below):
multinomial distribution (more than two possible outcomes)
normal distribution (sums of independent random variables)
Poisson distribution (occurrences of independent events)
exponential distribution (waiting times or lifespans).
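For reference (an added note, not from the original slides; the numerical arguments are arbitrary), each of these families has a standard density or pmf function in base R:

dmultinom(c(3, 5, 2), prob = c(0.2, 0.5, 0.3))  # multinomial pmf
dnorm(1.2, mean = 0, sd = 2)                    # normal density
dpois(4, lambda = 2.5)                          # Poisson pmf
dexp(0.7, rate = 1.5)                           # exponential density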



Sampling distribution / likelihood function

More complex situations: physical models, imaging problems, . . .

In these more complex situations we cannot usually use any of the simple models directly:
we build hierarchical models out of the basic distributions.



Prior distribution

Definition.
The marginal distribution θ ↦ f_Θ(θ) of the parameter is called a prior distribution.

Prior comes from the Latin for 'before': the prior distribution describes our beliefs about the likely values of the parameter θ before observing any data.



Non-informative prior

we should choose as vague a prior distribution as possible
if we have no strong beliefs about the possible values of the parameter
or if we do not want to influence our results
these kinds of priors are called uninformative priors



Non-informative prior

what does 'vague' mean?

one option is a uniform prior (with problems on unbounded parameter spaces)



Informative prior

Sometimes we want to let our prior knowledge influence our posterior distribution.
This kind of prior distribution is called an informative prior.
For example, in imaging problems you typically want to enforce prior knowledge; we have a strong prior belief about the structure of some parameters.



Hyperpriors

The prior distribution for θ is usually a parametric distribution;
its parameters φ = (φ_1, . . . , φ_k) are called hyperparameters.
We can therefore also denote the prior distribution as

f_{Θ|Φ}(θ|φ),

but often the notation is simplified by leaving out the hyperparameters.



Bayesian model

To specify the fully Bayesian probability model, we have to specify

Y|Θ, the sampling distribution,

and

Θ, the prior distribution.

Together (via the chain rule) these specify

(Θ, Y), the joint distribution,

which we usually express as

f_{Θ,Y}(θ, y) = f_Θ(θ) f_{Y|Θ}(y|θ).



Posterior distribution

Definition.
The conditional distribution Θ|Y of the parameter given the data is called a
posterior distribution.

Posterior comes from the Latin for 'after': the posterior distribution describes our beliefs about the probable values of the parameter after we have observed the data.
The posterior distribution Θ|Y is the ultimate goal of Bayesian parametric inference:
it contains all the information the data carries (given that the model and the prior are 'correctly chosen')



Posterior distribution

In principle, the posterior distribution is computed using Bayes' theorem:

f_{Θ|Y}(θ|y) = f_{Θ,Y}(θ, y) / f_Y(y) = f_Θ(θ) f_{Y|Θ}(y|θ) / f_Y(y).

In practice, we almost always compute the unnormalized density of the posterior as a product of the sampling and prior distributions,

f_{Θ|Y}(θ|y) ∝ f_Θ(θ) f_{Y|Θ}(y|θ),

and afterwards deduce the missing normalizing constant.
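A minimal sketch of this 'unnormalized density first, normalization afterwards' recipe for the thumbtack example (an added illustration, not in the original slides; variable names are ad hoc): evaluate prior times likelihood on a grid of θ values, normalize numerically, and compare with the exact Beta(17, 15) posterior.

y <- 16; n <- 30
theta  <- seq(0, 1, by = 0.001)                    # grid over the parameter space
unnorm <- dunif(theta) * dbinom(y, n, theta)       # prior * likelihood, unnormalized
post   <- unnorm / sum(unnorm * 0.001)             # normalize numerically (Riemann sum)
max(abs(post - dbeta(theta, y + 1, n - y + 1)))    # should be very small (numerical error only)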



Marginal likelihood

The normalizing constant f_Y(y) is called the marginal likelihood or evidence.
It is obtained by marginalizing the joint distribution (from (Θ, Y) to Y).
For continuous distributions

f_Y(y) = ∫ f_Θ(θ) f_{Y|Θ}(y|θ) dθ

For discrete distributions

f_Y(y) = Σ_{θ∈Ω} f_Θ(θ) f_{Y|Θ}(y|θ)
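For the thumbtack model with the uniform prior this integral can be checked directly with integrate() (an added illustration, not in the original slides); with a uniform prior on θ the marginal pmf of Y is in fact uniform on 0, 1, . . . , n, so the value should be 1/(n + 1) = 1/31:

y <- 16; n <- 30
marg <- integrate(function(theta) dunif(theta) * dbinom(y, n, theta), 0, 1)
marg$value
## [1] 0.03225806
1 / (n + 1)
## [1] 0.03225806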



Predictive distribution

In Bayesian Data Analysis (Gelman et al. 2013) the marginal likelihood is called a prior predictive distribution:
it represents our beliefs about the distribution of the data before any observations are made
it is a weighted 'average' over all θ ∈ Ω, weighted with the prior distribution,

f_Y(y) = E[g(y, Θ)]    (1)

when we denote the transformation

g(y, θ) := f_{Y|Θ}(y|θ)
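Representation (1) also suggests a simple Monte Carlo approximation (an added sketch, not from the original slides): draw θ from the prior and average the likelihood values g(y, θ); for the thumbtack example this should agree with the integrate() result above.

set.seed(2)
y <- 16; n <- 30
theta_prior <- runif(1e5)         # draws from the uniform prior
mean(dbinom(y, n, theta_prior))   # Monte Carlo estimate of E g(y, Theta); close to 1/31 ~ 0.0323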



1.3 Prediction

We started with estimation of the parameter and inference about it.
Next we consider predicting new observations.



1.3.1 Motivating thumbtack example, part II

Assume we have tossed a thumbtack n = 30 times
and it landed point up y = 16 times
what if we are interested in predicting new observations?
based on these: what is our predictive distribution for the number of successes, if we throw the same thumbtack m = 10 more times?



1.3.1 Motivating thumbtack example, part II

Because the thumbtack stays the same, it makes sense to use the same binomial model,

Ỹ ∼ Bin(m, θ),

and to model the old and the new observations as independent given the parameter:

Ỹ, Y ⊥⊥ | Θ.



1.3.1 Motivating thumbtack example, part II

An intuitive (naive) way to obtain a pmf for Ỹ would be just to plug a point estimate, such as the ML estimate θ̂(y), in as the parameter value of the probability mass function of the new observations:

f_{Ỹ|Θ}(ỹ | θ̂(y)).

This has the same problem as with parameter estimation: what if we had again observed the data y = 0 with n = 30?
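For concreteness (an added sketch, not in the original slides): the plug-in predictive pmf for the observed data, and the degenerate pmf we would get from y = 0, which declares every future 'point up' toss impossible.

m <- 10
dbinom(0:m, m, 16/30)   # plug-in predictive pmf with the ML estimate 16/30
dbinom(0:m, m, 0/30)    # with y = 0 the plug-in pmf puts all its mass on 0 successes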



1.3.1 Motivating thumbtack example, part II

The proper Bayesian predictive distribution is the distribution of Ỹ|Y, the distribution of the new observations given the observed data.
This is denoted by

f_{Ỹ|Y}(ỹ|y).

Note that the parameter θ is absent, so how can we derive this conditional pmf now?
Before that, let's consider the question more generally.



1.3.2 Posterior predictive distribution
Let's consider a general case before computing the actual thumbtack example.
Assume we have observations

Y = (Y_1, . . . , Y_n)

with a sampling distribution f_{Y|Θ}(y|θ), where Θ ∈ Ω is a random parameter vector.
We want to predict m new observations

Ỹ = (Ỹ_1, . . . , Ỹ_m)

from the same random process.

Definition.
The conditional distribution Ỹ|Y of the new observations given the data is called a posterior predictive distribution.
1.3.2 Posterior predictive distribution

Lemma 1.
If in addition the new observations are independent of the observed data given the parameter,

(Ỹ ⊥⊥ Y) | Θ,

then (for densities)

f_{Ỹ|Y}(ỹ|y) = ∫ f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y) dθ    (2)

Note: we will often write Ỹ, Y ⊥⊥ | Θ instead, since that notation works for more than two random vectors as well.



Derivation of the formula (2)

Suppose we had the joint distribution of (Θ, Y, Ỹ).
We could then condition on the data to obtain (Θ, Ỹ)|Y,
and marginalize to get the distribution of Ỹ|Y.
But any similar strategy would work, so we opt for a more convenient route.



Derivation of the formula (2)

So let's do the following:

We will first specify Ỹ | Y, Θ
Then with the posterior distribution Θ|Y and the (conditional) chain rule we get Ỹ, Θ | Y
And we (conditionally) marginalize to get Ỹ | Y



Derivation of the formula (2)

The (conditional) chain rule (for densities) is

f_{X,Y|Z}(x, y|z) = f_{X|Y,Z}(x|y, z) f_{Y|Z}(y|z)

The (conditional) marginalization (for densities) is

f_{X|Z}(x|z) = ∫ f_{X,Y|Z}(x, y|z) dy



Derivation of the formula (2)

The idea (on the blackboard) is:

Ỹ, Y ⊥⊥ | Θ implies that Ỹ | Y, Θ = Ỹ | Θ (as conditional distributions)
this and the conditional chain rule give

f_{Ỹ,Θ|Y}(ỹ, θ|y) = f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y)

conditional marginalization gives the claim.



1.3.1 Motivating thumbtack example, part II

Now that we have a formula for the posterior predictive distribution, let's compute it for the thumbtack case:

f_{Ỹ|Y}(ỹ|y) = ∫ f_{Ỹ|Θ}(ỹ|θ) f_{Θ|Y}(θ|y) dθ = (m choose ỹ) B(ỹ + α₁, m + β₁ − ỹ) / B(α₁, β₁)

This means:

Ỹ | Y ∼ Beta-bin(m, α₁, β₁),

where α₁ = y + 1 and β₁ = n − y + 1 are the parameters of the posterior distribution.
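A small numerical illustration of this result (an added sketch, not in the original slides; dbetabin is an ad hoc helper name): the beta-binomial pmf written with base R's choose() and beta() sums to one and matches formula (2) evaluated by numerical integration.

y <- 16; n <- 30; m <- 10
a1 <- y + 1; b1 <- n - y + 1                  # posterior parameters
dbetabin <- function(yt) choose(m, yt) * beta(yt + a1, m + b1 - yt) / beta(a1, b1)
sum(dbetabin(0:m))                            # a proper pmf: the probabilities sum to 1
# formula (2) by numerical integration, e.g. for 5 new successes, agrees with the closed form:
integrate(function(th) dbinom(5, m, th) * dbeta(th, a1, b1), 0, 1)$value
dbetabin(5)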



Two ways of computing the previous integral

1. Explicitly recognize a familiar integral
2. Integrate as a statistician (i.e. by recognizing unnormalized density functions)
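A sketch of the second route (added here for completeness; the slides leave the computation to the lecture): write the integrand out and recognize an unnormalized beta density,

f_{Ỹ|Y}(ỹ|y) = ∫₀¹ (m choose ỹ) θ^ỹ (1 − θ)^{m−ỹ} · θ^{α₁−1} (1 − θ)^{β₁−1} / B(α₁, β₁) dθ
             = (m choose ỹ) / B(α₁, β₁) · ∫₀¹ θ^{ỹ+α₁−1} (1 − θ)^{m−ỹ+β₁−1} dθ
             = (m choose ỹ) B(ỹ + α₁, m − ỹ + β₁) / B(α₁, β₁),

since the last integrand is an unnormalized Beta(ỹ + α₁, m − ỹ + β₁) density whose integral over (0, 1) is, by definition, the beta function B(ỹ + α₁, m − ỹ + β₁).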



Prior predictive distribution and posterior predictive distribution
Recall that we had a representation of the prior predictive distribution as

f_Y(y) = E[g(y, Θ)]

when we denote the transformation

g(y, θ) := f_{Y|Θ}(y|θ).

We have a similar looking representation for the posterior predictive distribution,

f_{Ỹ|Y}(ỹ|y) = E[ g₂(ỹ, Θ) | Y = y ],

when we denote the transformation

g₂(ỹ, θ) := f_{Ỹ|Θ}(ỹ|θ).
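This conditional-expectation form again suggests a Monte Carlo approximation (an added sketch, not in the original slides): draw θ from the posterior and average g₂(ỹ, θ); the result should agree with the beta-binomial pmf computed earlier.

set.seed(3)
y <- 16; n <- 30; m <- 10
a1 <- y + 1; b1 <- n - y + 1
theta_post <- rbeta(1e5, a1, b1)   # draws from the posterior Theta | Y = y
mean(dbinom(5, m, theta_post))     # Monte Carlo estimate of P(Ytilde = 5 | Y = y)
choose(m, 5) * beta(5 + a1, m + b1 - 5) / beta(a1, b1)   # exact beta-binomial value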

