Introduction to Bayesian Methods
Jessi Cisewski
Department of Statistics
Yale University
Likelihood Principle
All of the information in a sample is contained in the likelihood function, a density or distribution function.
Likelihood functions
Consider a random sample of size n = 1 from a Normal(µ = 3, σ = 2): X1 ∼ N(3, 2)

As a density in x (parameters fixed):

$$ f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi \cdot 2^2}}\, e^{-\frac{(x-3)^2}{2 \cdot 2^2}} $$

[Figure: the N(3, 2) density as a function of x, with the observed draw x1 marked.]

As a likelihood in µ (data fixed at the observed value x1 = 1.747):

$$ L(\mu) = f(x_1 \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(1.747-\mu)^2}{2\sigma^2}} $$

[Figure: the likelihood as a function of µ, peaked at µ = x1 = 1.747.]
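To make this concrete, here is a minimal Python sketch (mine, not from the slides) that evaluates this likelihood over a grid of µ values for the observed x1 = 1.747:

```python
import numpy as np
from scipy.stats import norm

x1, sigma = 1.747, 2.0                  # observed draw and known sigma (from the slide)
mu_grid = np.linspace(-5, 10, 1501)

# The likelihood of each candidate mu is the density of x1 under N(mu, sigma)
likelihood = norm.pdf(x1, loc=mu_grid, scale=sigma)

print(mu_grid[np.argmax(likelihood)])   # peaks near mu = x1 = 1.747
```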
Consider a random sample of size n = 50 (assuming independence, and a known σ): X1, …, X50 ∼ N(3, 2)

$$ f(x_1, \ldots, x_{50} \mid \mu, \sigma) = \prod_{i=1}^{50} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} $$

[Figure: the likelihood as a function of µ, peaked near the sample mean.]

[Figure: normalized likelihoods for n = 50, 100, 250, 500; the likelihood concentrates as n grows.]
Maximum likelihood estimation
The parameter value θ that maximizes the likelihood:

$$ \hat{\theta} = \arg\max_{\theta} f(x_1, \ldots, x_n \mid \theta) $$

For the Gaussian example with known σ:

$$ \max_{\mu} f(x_1, \ldots, x_n \mid \mu, \sigma) = \max_{\mu} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} $$

Hence,

$$ \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x} $$

[Figure: the likelihood over µ with the MLE marked at the peak.]
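A quick numerical check of this result (my own sketch): maximize the log-likelihood directly and compare against the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=50)    # simulated N(3, 2) sample, sigma known

# Negative log-likelihood in mu (the log turns the product into a sum)
def nll(mu):
    return -norm.logpdf(x, loc=mu, scale=2.0).sum()

res = minimize_scalar(nll, bounds=(-5.0, 10.0), method="bounded")
print(res.x, x.mean())                         # numerical MLE agrees with the sample mean
```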
Bayesian framework
Bayes’ Rule
Let A and B be two events in the sample space, with P(B) > 0. Then

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} $$
Posterior distribution
$$ \pi(\theta \mid \overbrace{x}^{\text{Data}}) = \frac{\overbrace{f(x \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{\pi(\theta)}^{\text{Prior}}}{f(x)} = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(x \mid \theta)\,\pi(\theta)\, d\theta} \;\propto\; f(x \mid \theta)\,\pi(\theta) $$
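The unnormalized form is what makes simple numerics possible: evaluate likelihood × prior on a grid, then normalize. A minimal sketch (my own illustration, with an assumed N(0, 5) prior and toy data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=10)        # toy data, sigma = 2 known

theta = np.linspace(-10.0, 10.0, 2001)
prior = norm.pdf(theta, loc=0.0, scale=5.0)        # assumed N(0, 5) prior
like = np.array([norm.pdf(y, loc=t, scale=2.0).prod() for t in theta])

post = like * prior                                # unnormalized posterior
post /= post.sum() * (theta[1] - theta[0])         # normalize on the grid
print(theta[np.argmax(post)])                      # posterior mode
```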
Gaussian example
Consider y1:n = y1, …, yn drawn from a Gaussian(µ, σ) with µ unknown (σ known).

Likelihood:

$$ f(y_{1:n} \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} $$

Prior: µ ∼ N(µ0, σ0)

Posterior: µ | y1:n ∼ N(µ1, σ1), where

$$ \mu_1 = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{\sum_i y_i}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \qquad \sigma_1 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1/2} $$
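A minimal sketch of this conjugate update (the function name is mine, not from the slides):

```python
def gaussian_posterior(ybar, n, sigma, mu0, sigma0):
    """Posterior (mu1, sigma1) for the mean of a N(mu, sigma) sample of size n
    (sigma known, sample mean ybar) under a N(mu0, sigma0) prior on mu."""
    precision = 1.0 / sigma0**2 + n / sigma**2      # posterior precision
    mu1 = (mu0 / sigma0**2 + n * ybar / sigma**2) / precision
    return mu1, precision ** -0.5
```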
Data: y1, …, y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 5)
Posterior: N(µ1 = 1.114, σ1 = 0.981)

[Figure: likelihood, prior, and posterior densities over µ.]
Data: y1, …, y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = −0.861, σ1 = 0.707)

[Figure: likelihood, prior, and posterior densities over µ.]
Data: y1, …, y200 ∼ N(µ = 3, σ = 2), ȳ = 2.999
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = 2.881, σ1 = 0.140)

[Figure: likelihood, prior, and posterior densities over µ.]
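The three posteriors above can be reproduced with the gaussian_posterior sketch from earlier:

```python
print(gaussian_posterior(ybar=1.278, n=4,   sigma=2.0, mu0=-3.0, sigma0=5.0))  # ~(1.114, 0.981)
print(gaussian_posterior(ybar=1.278, n=4,   sigma=2.0, mu0=-3.0, sigma0=1.0))  # ~(-0.861, 0.707)
print(gaussian_posterior(ybar=2.999, n=200, sigma=2.0, mu0=-3.0, sigma0=1.0))  # ~(2.881, 0.140)
```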
Example 2 - on your own
Consider the following model:
Y | θ ∼ U(0, θ)
θ ∼ Pareto(α, β)
with prior density

$$ \pi(\theta) = \frac{\alpha\beta^{\alpha}}{\theta^{\alpha+1}}\, I_{(\beta,\infty)}(\theta) $$

$$ \pi(\theta \mid y) \;\propto\; \frac{1}{\theta}\, I_{(0,\theta)}(y) \cdot \frac{\alpha\beta^{\alpha}}{\theta^{\alpha+1}}\, I_{(\beta,\infty)}(\theta) \;\propto\; \frac{1}{\theta}\, I_{(y,\infty)}(\theta) \cdot \frac{1}{\theta^{\alpha+1}}\, I_{(\beta,\infty)}(\theta) \;\propto\; \frac{1}{\theta^{\alpha+2}}\, I_{(\max\{y,\beta\},\infty)}(\theta) $$

$$ \Longrightarrow\; \theta \mid y \sim \text{Pareto}(\alpha+1,\, \max\{y,\beta\}) $$
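A one-line version of this update (a sketch with names of my own); applying it once per observation also handles a sample y1, …, yn:

```python
def pareto_posterior(y, alpha, beta):
    """Uniform(0, theta) likelihood + Pareto(alpha, beta) prior on theta:
    one observation y yields a Pareto(alpha + 1, max(y, beta)) posterior."""
    return alpha + 1.0, max(y, beta)

# Sequential updates over a sample give Pareto(alpha + n, max(y_1..y_n, beta))
a, b = 2.0, 1.0
for y in [0.7, 2.4, 1.1]:
    a, b = pareto_posterior(y, a, b)
print(a, b)    # Pareto(5, 2.4)
```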
Prior distribution
If one has a specific prior in mind, it slots directly into the definition of the posterior.
Choosing a prior
Conjugate priors
A prior is conjugate for a likelihood when the resulting posterior belongs to the same distributional family as the prior, so the update has a closed form.
Some conjugate priors
Examples in this deck: a Gaussian prior with a Gaussian likelihood (above), a Beta prior with a Binomial likelihood, and a Gamma prior with a Poisson likelihood (both below).
Beta-Binomial
X = 1 with probability θ; X = 0 with probability 1 − θ.

Let $y = \sum_{i=1}^{n} x_i$ ⟹ y is a draw from a Binomial(n, θ):

$$ p(Y = k) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} $$
Beta-Binomial
With a Beta(α, β) prior on θ,

$$ \pi(\theta \mid y) \sim \text{Beta}(y + \alpha,\; n - y + \beta) $$
Beta-Binomial posterior derivation

$$ f(y, \theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1} = \binom{n}{y} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{y+\alpha-1} (1-\theta)^{n-y+\beta-1} $$

$$ f(y) = \int_{0}^{1} f(y, \theta)\, d\theta = \binom{n}{y} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(y+\alpha)\,\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)} $$

$$ \pi(\theta \mid y) = \frac{f(y, \theta)}{f(y)} = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(y+\alpha)\,\Gamma(n-y+\beta)}\, \theta^{y+\alpha-1} (1-\theta)^{n-y+\beta-1} \;\sim\; \text{Beta}(y+\alpha,\; n-y+\beta) $$
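In code the update is a one-liner via scipy's Beta distribution; a sketch with toy numbers of my own:

```python
from scipy.stats import beta

alpha0, beta0 = 1.0, 1.0                 # flat Beta(1, 1) prior
n, y = 20, 14                            # y successes in n Bernoulli trials

posterior = beta(y + alpha0, n - y + beta0)
print(posterior.mean())                  # posterior mean (y + alpha)/(n + alpha + beta)
print(posterior.interval(0.95))          # central 95% posterior interval
```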
Beta priors and posteriors
$$ \pi(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1} $$

[Figure: Beta priors and the corresponding posteriors for (α, β) = (1, 1), (3, 2), and (1/2, 1/2).]
Poisson distribution

$$ Y \sim \text{Poisson}(\lambda) \;\Longrightarrow\; P(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!} $$

Mean = Variance = λ

[Figure: the Poisson probability mass function for λ = 4.]

Bayesian inference on λ (λ > 0): Y | λ ∼ Poisson(λ)

Astronomical example: photons from distant quasars, cosmic rays. For more details see Feigelson and Babu (2012), Section 4.2.
Gamma density
$$ f(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\beta y}, \quad y > 0 $$

[Figure: Gamma densities for α = 1, 2, 3, 4.]
Poisson-Gamma Posterior

$$ f(y_{1:n} \mid \lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\, \lambda^{\sum_{i=1}^{n} y_i}}{\prod_{i=1}^{n} y_i!} \quad \text{(Likelihood)} $$

$$ \pi(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda} \quad \text{(Prior)} $$

$$ \pi(\lambda \mid y_{1:n}) \;\propto\; \frac{e^{-n\lambda}\, \lambda^{\sum_i y_i}}{\prod_i y_i!} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda} \;\propto\; e^{-n\lambda}\, \lambda^{\sum_i y_i}\, \lambda^{\alpha-1} e^{-\beta\lambda} = e^{-\lambda(n+\beta)}\, \lambda^{\sum_i y_i + \alpha - 1} $$

$$ \Longrightarrow\; \lambda \mid y_{1:n} \sim \text{Gamma}\Big(\textstyle\sum_i y_i + \alpha,\; n + \beta\Big) $$
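The same update numerically (toy counts of my own; note scipy parametrizes the Gamma by shape and scale = 1/rate):

```python
import numpy as np
from scipy.stats import gamma

alpha0, beta0 = 2.0, 1.0                 # Gamma(alpha, beta) prior, beta a rate
y = np.array([3, 5, 4, 6, 2])            # toy Poisson counts

# Conjugate update: Gamma(sum(y) + alpha, n + beta); rate -> scale = 1/rate
posterior = gamma(a=y.sum() + alpha0, scale=1.0 / (len(y) + beta0))
print(posterior.mean())                  # (sum(y) + alpha)/(n + beta)
```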
Poisson-Gamma Posterior illustrations

[Figure: two panels showing the likelihood, prior, posterior, and input value of λ; both panels use the same dataset.]
Hierarchical priors
$$ Y \mid \theta, \gamma \sim f(y \mid \theta, \gamma) \quad \text{(Likelihood)} $$
$$ \Theta \mid \gamma \sim \pi(\theta \mid \gamma) \quad \text{(Prior)} $$
$$ \Gamma \sim \phi(\gamma) \quad \text{(Hyperprior)} $$
Simple illustration
Posterior:

$$ \pi(\mu, \phi \mid y) \;\propto\; f(y \mid \mu, \phi)\, \pi_1(\mu \mid \phi)\, \pi_2(\phi) $$
Non-informative priors
The aim is a prior that does not favor any particular value in the parameter space.

Side note: some have philosophical objections to this (e.g., R.A. Fisher, whose objections led to fiducial inference).
Example of an improper prior with a proper posterior:

Data: x1, …, xn ∼ N(θ, 1)
(Improper) prior: π(θ) ∝ 1
(Proper) posterior: π(θ | x1:n) ∼ N(x̄, n^{−1/2})
Uniform prior
This is what many astronomers use for non-informative priors, and it is often what comes to mind when we think of “flat” priors.

θ ∼ U(0, 1)

[Figure: the flat U(0, 1) prior density for θ.]

But flatness is not preserved under transformation. If θ ∼ U(0, 1) and φ = θ², a change of variables gives the prior for θ²:

$$ \pi(\phi) = \frac{1}{2\sqrt{\phi}}, \quad 0 < \phi < 1, $$

which piles up near 0.

[Figure: the induced prior density for θ², spiking near 0.]
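A quick simulation makes the point (my own sketch): draw θ uniformly, square it, and see where the mass goes.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 1.0, size=100_000)   # flat prior draws for theta
phi = theta**2                                # implied prior draws for theta^2

# Half the prior mass of theta^2 lands below 0.25 (vs. 0.25 if it were flat)
print(np.mean(phi < 0.25))
```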
Objective priors
Jeffreys’ prior: uses the Fisher information.

Reference priors: select priors that maximize some measure of divergence between the posterior and prior (hence minimizing the impact a prior has on the posterior). See “The Formal Definition of Reference Priors” by Berger et al. (2009).
Jeffreys’ Prior

$$ \pi_J(\theta) \propto \sqrt{|I(\theta)|} $$

where I(θ) is the Fisher information:

$$ I(\theta) = E\!\left[\left(\frac{d}{d\theta} \log L(\theta \mid Y)\right)^{2}\right] = -E\!\left[\frac{d^{2}}{d\theta^{2}} \log L(\theta \mid Y)\right] \quad \text{(for exponential families)} $$

Some intuition: I(θ) is a proxy for the information content in the model about θ, so high values of I(θ) correspond with likely values of θ; weighting by I(θ) therefore reduces the effect of the prior on the posterior. For more details see Robert (2007), “The Bayesian Choice”.

Most useful in single-parameter settings; not recommended with multiple parameters.
Exponential example
$$ f(y \mid \theta) = \theta e^{-\theta y} $$

Calculate the Fisher information:

$$ \log f(y \mid \theta) = \log \theta - \theta y $$
$$ \frac{d}{d\theta} \log f(y \mid \theta) = \frac{1}{\theta} - y $$
$$ \frac{d^{2}}{d\theta^{2}} \log f(y \mid \theta) = -\frac{1}{\theta^{2}} $$
$$ -E\!\left[\frac{d^{2}}{d\theta^{2}} \log f(y \mid \theta)\right] = \frac{1}{\theta^{2}} $$

Hence,

$$ \pi_J(\theta) \propto \sqrt{\frac{1}{\theta^{2}}} = \frac{1}{\theta} $$
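The Fisher information here can be sanity-checked by Monte Carlo (my own sketch), since $E[(1/\theta - Y)^2] = \mathrm{Var}(Y) = 1/\theta^2$ for an exponential:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0
y = rng.exponential(scale=1.0 / theta, size=1_000_000)   # Y ~ Exp(rate = theta)

score = 1.0 / theta - y                    # d/dtheta log f(y | theta)
print(np.mean(score**2), 1.0 / theta**2)   # both ~0.25: Monte Carlo vs exact I(theta)
```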
Exponential example, continued
$$ \pi_J(\theta) \propto \frac{1}{\theta} $$

Suppose we reparametrize as φ = f(θ) = θ² ⟹ θ = √φ, so dθ/dφ = 1/(2√φ).

Hence,

$$ \pi_J'(\phi) = \pi_J(\sqrt{\phi}) \left|\frac{d\theta}{d\phi}\right| = \frac{1}{\sqrt{\phi}} \cdot \frac{1}{2\sqrt{\phi}} \;\propto\; \frac{1}{\phi} $$

$$ \pi_J'(\phi) \propto \frac{1}{\phi} $$

This is exactly what Jeffreys’ rule gives when applied directly in the φ parametrization: the prior is invariant under reparametrization.
Binomial example
$$ f(Y \mid \theta) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y} $$

Calculate the Fisher information:

$$ \log f(Y \mid \theta) = \log \binom{n}{y} + y \log \theta + (n-y) \log(1-\theta) $$
$$ \frac{d}{d\theta} \log f(Y \mid \theta) = \frac{y}{\theta} - \frac{n-y}{1-\theta} $$
$$ \frac{d^{2}}{d\theta^{2}} \log f(Y \mid \theta) = -\frac{y}{\theta^{2}} - \frac{n-y}{(1-\theta)^{2}} $$

Note that E(y) = nθ, so

$$ -E\!\left[\frac{d^{2}}{d\theta^{2}} \log f(Y \mid \theta)\right] = \frac{n\theta}{\theta^{2}} + \frac{n-n\theta}{(1-\theta)^{2}} = \frac{n}{\theta} + \frac{n}{1-\theta} = \frac{n}{\theta(1-\theta)} $$

Hence,

$$ \pi_J(\theta) \propto \sqrt{\frac{n}{\theta(1-\theta)}} \;\propto\; \theta^{-1/2} (1-\theta)^{-1/2}, $$

i.e., a Beta(1/2, 1/2) prior.
$$ \pi_J(\theta) \propto \theta^{-1/2} (1-\theta)^{-1/2} $$

[Figure: the Jeffreys Beta(1/2, 1/2) prior compared with the flat Beta(1, 1) prior over θ.]
If we use Jeffreys’ prior πJ(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}, what is the posterior for θ | Y?

$$ \pi(\theta \mid Y) \;\propto\; \binom{n}{y} \theta^{y} (1-\theta)^{n-y}\, \theta^{-1/2} (1-\theta)^{-1/2} \;\propto\; \theta^{y-1/2} (1-\theta)^{n-y-1/2} $$

$$ \Longrightarrow\; \pi(\theta \mid Y) \sim \text{Beta}(y + 1/2,\; n - y + 1/2) $$
Gaussian with unknown µ - on your own

$$ f(y \mid \mu, \sigma^{2}) \propto e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}} $$

$$ \frac{d}{d\mu} \log f(y \mid \mu, \sigma^{2}) = \frac{y-\mu}{\sigma^{2}} $$
$$ \frac{d^{2}}{d\mu^{2}} \log f(y \mid \mu, \sigma^{2}) = -\frac{1}{\sigma^{2}} $$
$$ -E\!\left[\frac{d^{2}}{d\mu^{2}} \log f(y \mid \mu, \sigma^{2})\right] = \frac{1}{\sigma^{2}} $$

Hence, πJ(µ) ∝ 1 (a flat, improper prior).
Inference with a posterior
Point estimation:
posterior mean: $\langle \theta \rangle = \int_{\Theta} \theta\, p(\theta \mid Y, \text{model})\, d\theta$

posterior mode (MAP = maximum a posteriori)
Posterior predictive distributions: predict new ỹ given data y.

$$ f(\tilde{y} \mid y) = \frac{f(\tilde{y}, y)}{f(y)} = \frac{\int f(\tilde{y}, y, \theta)\, d\theta}{f(y)} = \frac{\int f(\tilde{y} \mid y, \theta)\, f(y, \theta)\, d\theta}{f(y)} = \int f(\tilde{y} \mid y, \theta)\, \pi(\theta \mid y)\, d\theta $$

$$ = \int f(\tilde{y} \mid \theta)\, \pi(\theta \mid y)\, d\theta \quad (\text{if } y \text{ and } \tilde{y} \text{ are conditionally independent given } \theta) $$
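A minimal Monte Carlo sketch of this integral (my own illustration), using the Beta-Binomial setup from earlier: draw θ from the posterior, then draw ỹ | θ.

```python
import numpy as np

# Posterior predictive for the Beta-Binomial example:
# theta | y ~ Beta(y + 1/2, n - y + 1/2) (Jeffreys prior), then
# y_tilde | theta ~ Binomial(m, theta) for a future batch of m trials.
rng = np.random.default_rng(4)
n, y, m = 20, 14, 10

theta_draws = rng.beta(y + 0.5, n - y + 0.5, size=100_000)
y_tilde = rng.binomial(m, theta_draws)

# The normalized histogram approximates f(y_tilde | y)
print(np.bincount(y_tilde, minlength=m + 1) / y_tilde.size)
```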
Confidence intervals ≠ credible intervals

A 95% confidence interval is based on repeated sampling of datasets: about 95% of the confidence intervals will capture the true parameter value.

A 95% credible interval is defined using the posterior distribution.

[Figure: a posterior density over θ with the central 95% probability region shaded.]
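An equal-tailed credible interval falls straight out of posterior quantiles; a sketch using the Beta-Binomial posterior again:

```python
from scipy.stats import beta

# Central 95% credible interval: the 2.5% and 97.5% posterior quantiles
n, y = 20, 14
posterior = beta(y + 0.5, n - y + 0.5)    # Jeffreys-prior posterior
print(posterior.ppf([0.025, 0.975]))      # endpoints of the 95% credible interval
```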
Summary

Thank you!
Bibliography

Berger, J. O., Bernardo, J. M., and Sun, D. (2009), “The formal definition of reference priors,” The Annals of Statistics, 905–938.

Feigelson, E. D. and Babu, G. J. (2012), Modern Statistical Methods for Astronomy: With R Applications, Cambridge University Press.

Kass, R. E. and Wasserman, L. (1996), “The selection of prior distributions by formal rules,” Journal of the American Statistical Association, 91, 1343–1370.

Robert, C. (2007), The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Springer Science & Business Media.