Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University


Introduction to Bayesian Methods

Jessi Cisewski
Department of Statistics
Yale University

Sagan Summer Workshop 2016


Our goal: introduction to Bayesian methods
Likelihoods
Priors: conjugate priors, “non-informative” priors
Posteriors

Related topics covered this week


Markov chain Monte Carlo (MCMC)
Selecting priors
Bayesian modeling comparison
Hierarchical Bayesian modeling

Some material is from Tom Loredo, Sayan Mukherjee, Beka Steorts

1
Likelihood Principle
All of the information in a sample is contained in the likelihood
function, a density or distribution function.

The data are modeled by a likelihood function.

Not all statistical paradigms agree with this principle.

2
Likelihood functions
Consider a random sample of size n = 1 from a Normal(µ = 3, σ = 2): X1 ∼ N(3, 2)

Probability density function (pdf)
−→ the function f(x, θ), where θ is fixed and x is variable

f(x, µ, σ) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} = (1/√(2π·2²)) e^{−(x−3)²/(2·2²)}

[Figure: pdf of N(3, 2) plotted against x; the data are drawn from this distribution.]

Likelihood
−→ the function f(x, θ), where θ is variable and x is fixed

f(x, µ, σ) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} = (1/√(2πσ²)) e^{−(1.747−µ)²/(2σ²)}

[Figure: likelihood plotted against µ, with the observed value x1 = 1.747 and the true µ marked.]

3
Consider a random sample of size n = 50 (assuming independence,
and a known σ): X1, . . . , X50 ∼ N(3, 2)

f(x, µ, σ) = f(x1, . . . , x50, µ, σ) = ∏_{i=1}^{50} (1/√(2πσ²)) e^{−(xi−µ)²/(2σ²)}

[Figure: likelihood plotted against µ, peaking near the sample mean.]

[Figure: normalized likelihoods for n = 50, 100, 250, 500, concentrating more tightly around µ = 3 as n grows.]

4
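As an illustrative aside (not from the original slides), here is a minimal numerical sketch of this concentration effect in Python; the sample sizes and parameters mirror the example above, but the data are freshly simulated, so the exact curves will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 3.0, 2.0                 # same N(3, 2) setup as the slide
mu_grid = np.linspace(1.0, 5.0, 400)      # candidate values of mu
dmu = mu_grid[1] - mu_grid[0]

def normalized_likelihood(x, mu_grid, sigma):
    """Gaussian likelihood of the sample x on a grid of mu values, normalized to integrate to 1."""
    loglik = np.array([np.sum(-0.5 * ((x - m) / sigma) ** 2) for m in mu_grid])
    lik = np.exp(loglik - loglik.max())   # rescale before exponentiating to avoid underflow
    return lik / (lik.sum() * dmu)

for n in (50, 100, 250, 500):
    x = rng.normal(mu_true, sigma, size=n)
    lik = normalized_likelihood(x, mu_grid, sigma)
    print(n, round(mu_grid[np.argmax(lik)], 3))   # peak tightens around mu = 3 as n grows
```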
Likelihood Principle
All of the information in a sample is contained in the likelihood
function, a density or distribution function.

The data are modeled by a likelihood function.


How do we infer θ?

5
Maximum likelihood estimation
The parameter value, θ, that maximizes the likelihood:

θ̂ = argmax_θ f(x1, . . . , xn, θ)

“Minimizing the χ² statistic” (under the Gaussian assumption)

max_µ f(x1, . . . , xn, µ, σ) = max_µ ∏_{i=1}^n (1/√(2πσ²)) e^{−(xi−µ)²/(2σ²)}

Hence, µ̂ = (∑_{i=1}^n xi)/n = x̄

[Figure: likelihood as a function of µ with the MLE marked at the peak.]

6
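To connect the formula to practice, the sketch below (added here, not from the slides) finds the MLE numerically and confirms it matches the sample mean; scipy's generic scalar minimizer stands in for the closed-form solution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(3.0, sigma, size=50)          # simulated data, as in the earlier example

# negative log-likelihood in mu (constants dropped); minimizing it maximizes the likelihood
def neg_log_lik(mu):
    return np.sum((x - mu) ** 2) / (2 * sigma ** 2)

res = minimize_scalar(neg_log_lik)
print(round(res.x, 4), round(x.mean(), 4))   # numerical MLE agrees with x-bar
```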
Bayesian framework

Classical or Frequentist methods for inference consider θ to be


fixed and unknown
−→ performance of methods evaluated by repeated sampling
−→ consider all possible data sets

Bayesian methods consider θ to be random


−→ consider only the observed data set and prior information

7
Bayes’ Rule
Let A and B be two events in the sample space. Then

P(A | B) = P(AB)/P(B) = P(B | A)P(A)/P(B)

Note P(B | A) = P(AB)/P(A) =⇒ P(AB) = P(B | A)P(A)

−→ It is really just about conditional probabilities.

[Figure: Venn diagram of events A and B within the sample space.]

8
Posterior distribution

π(θ | x) = f(x | θ) · π(θ) / f(x) = f(x | θ)π(θ) / ∫_Θ f(x | θ)π(θ) dθ ∝ f(x | θ)π(θ)

(x: data; f(x | θ): likelihood; π(θ): prior)

The prior distribution allows you to “easily” incorporate your


beliefs about the parameter(s) of interest
Posterior is a distribution on the parameter space given the
observed data

9
Gaussian example
Consider y1:n = y1, . . . , yn drawn from a Gaussian(µ, σ), µ unknown

Likelihood: f(y1:n | µ) = ∏_{i=1}^n (1/√(2πσ²)) e^{−(yi−µ)²/(2σ²)}

Prior: π(µ) ∼ N(µ0, σ0)

Posterior:
π(µ | Y1:n) ∝ f(Y1:n | µ)π(µ)
= [∏_{i=1}^n (1/√(2πσ²)) e^{−(yi−µ)²/(2σ²)}] × (1/√(2πσ0²)) e^{−(µ−µ0)²/(2σ0²)}
∼ N(µ1, σ1)

where
µ1 = (µ0/σ0² + ∑ yi/σ²) / (1/σ0² + n/σ²),   σ1 = (1/σ0² + n/σ²)^{−1/2}
10
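A small sketch of this conjugate update in Python follows (an addition, not from the slides); the function simply evaluates the µ1 and σ1 formulas above, and the hyperparameters echo the n = 4 example on the next slide, though the re-simulated data mean the numbers will not match exactly.

```python
import numpy as np

def gaussian_mean_posterior(y, sigma, mu0, sigma0):
    """Return (mu1, sigma1) of the Gaussian posterior for mu with known sigma,
    given a N(mu0, sigma0) prior: the conjugate-update formulas from the slide."""
    n = len(y)
    precision = 1.0 / sigma0**2 + n / sigma**2
    mu1 = (mu0 / sigma0**2 + np.sum(y) / sigma**2) / precision
    return mu1, precision ** -0.5

rng = np.random.default_rng(2)
y = rng.normal(3.0, 2.0, size=4)                       # y1, ..., y4 ~ N(3, 2)
print(gaussian_mean_posterior(y, sigma=2.0, mu0=-3.0, sigma0=5.0))
```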
Data: y1 , . . . , y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 5)
Posterior: N(µ1 = 1.114, σ1 = 0.981)

[Figure: likelihood, prior, and posterior densities for µ.]
11
Data: y1 , . . . , y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = −0.861, σ1 = 0.707)

[Figure: likelihood, prior, and posterior densities for µ.]
12
Data: y1 , . . . , y200 ∼ N(µ = 3, σ = 2), ȳ = 2.999
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = 2.881, σ1 = 0.140)

[Figure: likelihood, prior, and posterior densities for µ.]
13
Example 2 - on your own
Consider the following model:
Y | θ ∼ U(0, θ)
θ ∼ Pareto(α, β)

π(θ) = αβ^α/θ^{α+1} I_{(β,∞)}(θ)

where I_{(a,b)}(x) = 1 if a < x < b and 0 otherwise

Find the posterior distribution of θ | y

π(θ | y) ∝ (1/θ) I_{(0,θ)}(y) × (αβ^α/θ^{α+1}) I_{(β,∞)}(θ)
∝ (1/θ) I_{(y,∞)}(θ) × (1/θ^{α+1}) I_{(β,∞)}(θ)
∝ (1/θ^{α+2}) I_{(max{y,β},∞)}(θ)
=⇒ Pareto(α + 1, max{y, β})
14
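As an added numerical check (with made-up values of α, β, and y), the grid computation below confirms that the Uniform-times-Pareto posterior matches the Pareto(α + 1, max{y, β}) density derived above.

```python
import numpy as np

alpha, beta, y = 2.0, 1.0, 3.5                 # hypothetical prior parameters and observation
theta = np.linspace(0.01, 50.0, 20000)
dtheta = theta[1] - theta[0]

# unnormalized posterior: Uniform(0, theta) likelihood times Pareto(alpha, beta) prior
like = np.where(theta > y, 1.0 / theta, 0.0)
prior = np.where(theta > beta, alpha * beta**alpha / theta**(alpha + 1), 0.0)
post = like * prior
post /= post.sum() * dtheta                    # normalize on the grid

# claimed closed form: Pareto(alpha + 1, b) with b = max(y, beta)
b = max(y, beta)
pareto = np.where(theta > b, (alpha + 1) * b**(alpha + 1) / theta**(alpha + 2), 0.0)
print(np.max(np.abs(post - pareto)))           # small (grid discretization/truncation error only)
```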
Prior distribution

The prior distribution allows you to “easily” incorporate your


beliefs about the parameter(s) of interest

If one has a specific prior in mind, then it fits nicely into the
definition of the posterior

But how do you go from prior information to a prior distribution?

And what if you don’t actually have prior information?

15
Choosing a prior

Informative/Subjective prior: choose a prior that reflects our


belief/uncertainty about the unknown parameter
- Based on experience of the researcher from previous studies,
scientific or physical considerations, other sources of information
⋆ Example: For a prior on the mass of a star in a Milky Way-type
galaxy, you likely would not use an infinite interval

Objective, non-informative, vague, default priors

Hierarchical models: put a prior on the prior

Conjugate priors: priors selected for convenience

16
Conjugate priors

The posterior distribution is from the same family of distributions


as the prior

We saw this when a Gaussian prior on µ resulted in a Gaussian
posterior for µ | Y1:n
(Gaussian priors are conjugate with Gaussian likelihoods, resulting
in a Gaussian posterior)

17
Some conjugate priors

Normal - normal: normal priors are conjugate with normal


likelihoods
Beta - binomial: beta priors are conjugate with binomial
likelihoods
Gamma - Poisson: gamma priors are conjugate with Poisson
likelihoods
Dirichlet - multinomial: Dirichlet priors are conjugate with
multinomial likelihoods

18
Beta-Binomial

Suppose we have an iid sample, x1 , . . . , xn , from a Bernoulli(θ)

X = 1, with probability θ
X = 0, with probability 1 − θ

Let y = ∑_{i=1}^n xi =⇒ y is a draw from a Binomial(n, θ)

P(Y = k) = C(n, k) θ^k (1 − θ)^{n−k}

We want the posterior distribution for θ

19
Beta-Binomial

We have a binomial likelihood, and need to specify a prior on θ


Note that θ ∈ [0, 1]

If prior π(θ) ∼ Beta(α, β), then posterior

π(θ | y ) ∼ Beta(y + α, n − y + β)

The beta distribution is the conjugate prior for binomial


likelihoods

20
Beta - Binomial posterior derivation

   
f(y, θ) = C(n, y) θ^y (1 − θ)^{n−y} × [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}
= C(n, y) [Γ(α + β)/(Γ(α)Γ(β))] θ^{y+α−1} (1 − θ)^{n−y+β−1}

f(y) = ∫_0^1 f(y, θ) dθ = C(n, y) [Γ(α + β)/(Γ(α)Γ(β))] [Γ(y + α)Γ(n − y + β)/Γ(n + α + β)]

π(θ | y) = f(y, θ)/f(y) = [Γ(n + α + β)/(Γ(y + α)Γ(n − y + β))] θ^{y+α−1} (1 − θ)^{n−y+β−1}

∼ Beta(y + α, n − y + β)

21
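In code, the whole update is one line; the sketch below (added here, with made-up counts and an illustrative Beta(2, 2) prior) uses scipy's Beta distribution to summarize the resulting posterior.

```python
from scipy import stats

n, y = 20, 7                     # hypothetical data: y successes in n Bernoulli trials
alpha, beta = 2.0, 2.0           # illustrative Beta prior hyperparameters

posterior = stats.beta(y + alpha, n - y + beta)    # Beta(y + alpha, n - y + beta), as derived
print(posterior.mean())                            # posterior mean (y + alpha)/(n + alpha + beta)
print(posterior.interval(0.95))                    # central 95% posterior interval
```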
Beta priors and posteriors

π(θ | α, β) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}

[Figure: Beta priors and the corresponding posteriors for (α, β) = (1, 1), (3, 2), and (1/2, 1/2), plotted against θ.]

22
Poisson distribution

Y ∼ Poisson(λ) =⇒ P(Y = y) = e^{−λ} λ^y / y!

Mean = Variance = λ

[Figure: Poisson(λ = 4) probability mass function.]

Bayesian inference on λ:
Y | λ ∼ Poisson(λ)
What prior to use for λ? (λ > 0)

Astronomical example
⋆ Photons from distant quasars, cosmic rays
For more details see Feigelson and Babu (2012), Section 4.2
23
Gamma density

f(y) = [β^α/Γ(α)] y^{α−1} e^{−βy},  y > 0

[Figure: gamma densities for α = 1, 2, 3, 4.]

Often written as Y ∼ Γ(α, β)

α > 0 (shape parameter), β > 0 (rate parameter)
Note: sometimes θ = 1/β is used instead
Mean = α/β, Variance = α/β²
Γ(1, β) ∼ Exponential(β), Γ(d/2, 1/2) ∼ χ²_d

24
Poisson - Gamma Posterior

f(y1:n | λ) = ∏_{i=1}^n e^{−λ} λ^{yi} / yi! = e^{−nλ} λ^{∑ yi} / ∏_{i=1}^n (yi!)   (Likelihood)

π(λ) = [β^α/Γ(α)] λ^{α−1} e^{−βλ}   (Prior)

π(λ | y1:n) ∝ [e^{−nλ} λ^{∑ yi} / ∏_{i=1}^n (yi!)] × [β^α/Γ(α)] λ^{α−1} e^{−βλ}
∝ e^{−nλ} λ^{∑ yi} λ^{α−1} e^{−βλ}
= e^{−λ(n+β)} λ^{∑ yi + α − 1}
∼ Γ(∑ yi + α, n + β)

The gamma distribution is the conjugate prior for Poisson likelihoods

25
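A short sketch of this update follows (added here; the counts are simulated, and Γ(α, β) is taken in the rate parameterization, so scipy's scale is 1/rate).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.poisson(4.0, size=30)        # hypothetical count data, e.g. photon counts

alpha, beta = 1.0, 1.0               # Gamma(alpha, beta) prior, beta = rate
posterior = stats.gamma(a=alpha + y.sum(), scale=1.0 / (beta + len(y)))  # conjugate update
print(posterior.mean(), posterior.interval(0.95))
```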
Poisson - Gamma Posterior illustrations

[Figure: likelihood, prior, posterior, and input value for two priors on the same dataset, α = 1, β = 1 (left) and α = 3, β = 0.5 (right), plotted against λ.]

26
Hierarchical priors

A prior is put on the parameters of the prior distribution =⇒ the


prior on the parameter of interest, θ, has additional parameters

Y | θ, γ ∼ f (y | θ, γ) (Likelihood)
Θ | γ ∼ π(θ | γ) (Prior)
Γ ∼ φ(γ) (Hyper prior)

It is assumed that φ(γ) is fully known, and γ is called a hyperparameter
More layers can be added, but of course that makes the model
more complex −→ posterior may require computational techniques
(e.g. MCMC)

27
Simple illustration

Y | (µ, φ) ∼ N(µ, 1)   (Likelihood)
µ | φ ∼ N(φ, 2)   (Prior)
φ ∼ N(0, 1)   (Hyperprior)

Maybe we want to put a hyperhyperprior on φ?

Posterior
π(µ, φ | y) ∝ f(y | µ, φ) π1(µ | φ) π2(φ)

28
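To make the generative structure concrete, here is a small forward simulation of this hierarchy (an added sketch; it assumes the N(mean, sd) convention used elsewhere in these slides).

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_hierarchy(size):
    """Forward-simulate phi ~ N(0, 1), mu | phi ~ N(phi, 2), Y | mu ~ N(mu, 1)."""
    phi = rng.normal(0.0, 1.0, size)   # hyperprior draws
    mu = rng.normal(phi, 2.0)          # prior draws given phi
    y = rng.normal(mu, 1.0)            # data draws given mu
    return phi, mu, y

phi, mu, y = simulate_hierarchy(100_000)
print(round(y.mean(), 3), round(y.std(), 3))   # marginal sd of Y is sqrt(1 + 4 + 1) ~ 2.45
```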
Non-informative priors

What to do if we don’t have relevant prior information? What if


our model is too complex to know what reasonable priors are?

Desire is for a prior that does not favor any particular value in the parameter space
⋆ Side note: some may have philosophical issues with this (e.g.
R.A. Fisher, which led to fiducial inference)

We will discuss some methods for finding “non-informative


priors.” It turns out these priors can be improper (i.e. they
integrate to ∞ rather than 1), so you need to verify that the
resulting posterior distribution is proper

29
Example of improper prior with proper posterior:
Data: x1, . . . , xn ∼ N(θ, 1)
(Improper) prior: π(θ) ∝ 1
(Proper) posterior: π(θ | x1:n) ∼ N(x̄, n^{−1/2})

Example of improper prior with improper posterior:
Data: x1, . . . , xn ∼ Bernoulli(θ), y = ∑_{i=1}^n xi ∼ Binomial(n, θ)
(Improper) prior: π(θ) ∼ Beta(0, 0), i.e. π(θ) ∝ θ^{−1}(1 − θ)^{−1}
(Improper) posterior: π(θ | x1:n) ∝ θ^{y−1}(1 − θ)^{n−y−1}
This is improper for y = 0 or y = n

If you use improper priors, you have to check that the


posterior is proper

30
Uniform prior
This is what many astronomers use for non-informative priors, and
is often what comes to mind when we think of “flat” priors

θ ∼ U(0, 1)
[Figure: density of the U(0, 1) prior for θ.]

What if we consider a transformation of θ, such as θ²?


31
Uniform prior

θ ∼ U(0, 1)
Prior for θ²:

[Figure: implied density of θ² when θ ∼ U(0, 1); the mass piles up near 0.]

−→ Notice that the above is not Uniform - the prior on θ² is informative. This is an undesirable property of Uniform priors. We would like the “un-informativeness” to be invariant under transformations.
32
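The non-invariance is easy to see by simulation; this added sketch draws θ from U(0, 1) and histograms θ², which is clearly not flat.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = rng.uniform(0.0, 1.0, size=200_000)    # draws from the flat prior on theta
phi = theta ** 2                                # implied prior draws for theta^2

# a flat prior would give density ~1 in every bin; instead the mass piles up near 0
density, _ = np.histogram(phi, bins=10, range=(0.0, 1.0), density=True)
print(np.round(density, 2))                     # roughly follows 1/(2*sqrt(phi))
```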
There are a number of reasons why you may not have prior
information:
1. your work may be the first of its kind
2. you are skeptical about previous results that would have informed your priors
3. the parameter space is too high dimensional to understand how your informative priors work together
4. · · ·
If this is the case, then you may like the priors to have little effect
on the resulting posterior

33
Objective priors

Jeffreys’ prior
Uses Fisher information

Reference priors
Select priors that maximize some measure of divergence
between the posterior and prior (hence minimizing the impact
a prior has on the posterior)
“The Formal Definition of Reference Priors” by Berger et al. (2009)

More about selecting priors can be found here: Kass and


Wasserman (1996)

34
Jeffrey’s Prior
p
πJ (θ) ∝ |I (θ)|
where I(θ) is the Fisher information

 2
d
I(θ) = E log L(θ | Y )

 2 
d
= −E log L(θ | Y ) (for exponential family)
dθ2
Some intuition1
I(θ) is understood to be a proxy for the information content in the
model about θ → high values of I(θ) correspond with likely values
of θ. This reduces the effect of the prior on the posterior
Most useful in single-parameter setting; not recommended with
multiple parameters
1
For more details see Robert (2007): “The Bayesian Choice”
35
Exponential example

f(y | θ) = θ e^{−θy}

Calculate the Fisher Information:
log f(y | θ) = log(θ) − θy
d/dθ log f(y | θ) = 1/θ − y
d²/dθ² log f(y | θ) = −1/θ²
−E[d²/dθ² log f(y | θ)] = 1/θ²

Hence,
πJ(θ) ∝ √(1/θ²) = 1/θ

36
Exponential example, continued

πJ(θ) ∝ 1/θ

Suppose we consider φ = f(θ) = θ² =⇒ θ = √φ,  dθ/dφ = 1/(2√φ)

Hence, πJ'(φ) = πJ(√φ) |dθ/dφ| = (1/√φ)(1/(2√φ)) ∝ 1/φ

πJ'(φ) ∝ 1/φ

We see here that the Jeffreys prior is invariant to the transformation f(θ) = θ²

37
Binomial example

f(Y | θ) = C(n, y) θ^y (1 − θ)^{n−y}

Calculate the Fisher Information:
log f(Y | θ) = log C(n, y) + y log(θ) + (n − y) log(1 − θ)
d/dθ log f(Y | θ) = y/θ − (n − y)/(1 − θ)
d²/dθ² log f(Y | θ) = −y/θ² − (n − y)/(1 − θ)²
→ Note that E(y) = nθ
−E[d²/dθ² log f(Y | θ)] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ))

Hence,
πJ(θ) ∝ √(n/(θ(1 − θ))) ∝ θ^{−1/2} (1 − θ)^{−1/2}

Beta(1/2, 1/2)
38
πJ(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}

[Figure: Beta(1/2, 1/2) and Beta(1, 1) densities plotted against θ.]
39
If we use Jeffreys' prior πJ(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2}, what is the posterior for θ | Y?

π(θ | Y) ∝ C(n, y) θ^y (1 − θ)^{n−y} θ^{−1/2} (1 − θ)^{−1/2}
∝ θ^{y−1/2} (1 − θ)^{n−y−1/2}
π(θ | Y) ∼ Beta(y + 1/2, n − y + 1/2)

The posterior distribution is proper (which we knew would be the case since the prior is proper)

It is just a coincidence that the Jeffreys prior is conjugate.

40
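For completeness, the sketch below (added, with made-up counts) compares the Jeffreys posterior Beta(y + 1/2, n − y + 1/2) with the posterior under a flat Beta(1, 1) prior.

```python
from scipy import stats

n, y = 20, 7                                        # hypothetical binomial data
jeffreys_post = stats.beta(y + 0.5, n - y + 0.5)    # posterior under the Jeffreys prior
flat_post = stats.beta(y + 1.0, n - y + 1.0)        # posterior under a Uniform (Beta(1,1)) prior
print(jeffreys_post.mean(), flat_post.mean())       # close for this much data
```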
Gaussian with unknown µ - on your own

f(Y | µ, σ²) ∝ e^{−(y−µ)²/(2σ²)}

Calculate the Fisher Information:
log f(Y | µ) = −(y − µ)²/(2σ²) + constant
d/dµ log f(Y | µ) = (y − µ)/σ²
d²/dµ² log f(Y | µ) = −1/σ²
−E[d²/dµ² log f(Y | µ)] = 1/σ²

Hence,
πJ(µ) ∝ 1

41
Inference with a posterior

Now that we have a posterior, what do we want to do with it?

Point estimation:
posterior mean: ⟨θ⟩ = ∫_Θ θ p(θ | Y, model) dθ
posterior mode (MAP = maximum a posteriori)

Credible regions: posterior probability p that θ falls in region R:
p = P(θ ∈ R | Y, model) = ∫_R p(θ | Y, model) dθ
e.g. the highest posterior density (HPD) region

Posterior predictive distributions: predict new ỹ given data y
42
Posterior predictive distributions: predict new ỹ given data y

f(ỹ | y) = f(ỹ, y)/f(y) = ∫ f(ỹ, y, θ) dθ / f(y) = ∫ f(ỹ | y, θ) f(y, θ) dθ / f(y)
= ∫ f(ỹ | y, θ) π(θ | y) dθ
= ∫ f(ỹ | θ) π(θ | y) dθ   (if ỹ and y are conditionally independent given θ)
43
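A Monte Carlo version of this integral is often the easiest route in practice; the added sketch below does it for the beta-binomial setup (hypothetical counts, Jeffreys prior): draw θ from the posterior, then draw ỹ | θ.

```python
import numpy as np

rng = np.random.default_rng(6)
n, y = 20, 7                     # hypothetical observed data
alpha, beta = 0.5, 0.5           # Jeffreys prior, for illustration

# posterior predictive P(y_tilde = 1 | y) approximated by simulation
theta = rng.beta(y + alpha, n - y + beta, size=100_000)   # theta ~ posterior
y_tilde = rng.binomial(1, theta)                          # y_tilde | theta ~ Bernoulli(theta)
print(y_tilde.mean())            # ~ (y + alpha)/(n + alpha + beta) = 0.357
```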
Confidence intervals ≠ credible intervals

A 95% confidence interval is based on repeated sampling of datasets - about 95% of the confidence intervals will capture the true parameter value. Parameters are not random.

A 95% credible interval is defined using the posterior distribution. Parameters are random.

[Figure: posterior density with the central 95% probability region shaded.]

http://www.nature.com
44
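As a concrete illustration (added here), a central 95% credible interval can be read directly off posterior quantiles; the numbers below reuse the Gaussian posterior N(2.881, 0.140) from the n = 200 example earlier.

```python
from scipy import stats

posterior = stats.norm(loc=2.881, scale=0.140)   # posterior for mu from the earlier slide

# central 95% credible interval: posterior probability 0.95 that mu lies inside,
# given the observed data and the model
lo, hi = posterior.ppf([0.025, 0.975])
print(round(lo, 3), round(hi, 3))
```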
Summary

We discussed some basics of Bayesian methods


Bayesian inference relies on the posterior distribution
π(θ | x) = f(x | θ) · π(θ) / f(x)

(x: data; f(x | θ): likelihood; π(θ): prior)

There are different ways to select priors: subjective, conjugate,


“non-informative”
Credible intervals and confidence intervals have different
interpretations
We’ll be hearing a lot more about Bayesian methods throughout
the week.

Thank you!
45
Bibliography I

Berger, J. O., Bernardo, J. M., and Sun, D. (2009), “The formal definition of
reference priors,” The Annals of Statistics, 905–938.
Feigelson, E. D. and Babu, G. J. (2012), Modern Statistical Methods for
Astronomy: With R Applications, Cambridge University Press.
Kass, R. E. and Wasserman, L. (1996), “The selection of prior distributions by
formal rules,” Journal of the American Statistical Association, 91,
1343–1370.
Robert, C. (2007), The Bayesian choice: from decision-theoretic foundations to
computational implementation, Springer Science & Business Media.

46
