
Chapter 4

Week 4

L4: Jeffrey’s Prior, Eliciting and Analysing Priors

4.1 Jeffrey’s prior

4.1.1 Example of the problem of “uninformative” prior

Priors considered “uninformative” for a parameter, θ, may not be considered uninformative for transformations of the parameter, ψ = g(θ). For example, consider a Uniform(0,20) prior for a parameter θ. The problem with such uniform priors is that the induced prior for simple transformations of the parameter, e.g., φ = √θ, will not be uniform. Given θ ∼ Uniform(0,20), the distribution for φ is found by the change of variable theorem¹:

\[
\pi(\varphi) = \frac{1}{20}\left|\frac{d\varphi^{2}}{d\varphi}\right| I(0<\varphi<\sqrt{20})
= \frac{1}{20}\,2\varphi\, I(0<\varphi<\sqrt{20})
= \frac{\varphi}{10}\, I(0<\varphi<\sqrt{20})
\]

where I(0 < φ < √20) is an indicator function that equals 1 when 0 < φ < √20 and 0 otherwise. This induced prior clearly is informative, as it increases linearly from 0 to √20.

4.1.2 Definition and calculation of Jeffrey’s prior

Jeffrey’s prior is an example of an objective prior which can be seen as a remedy to the just discussed problem of the priors induced by transformations. It is a prior that is invariant to strictly monotonic (1:1 or bijective) transformations of the parameter, say φ = g(θ), where g is strictly monotonic.
Jeffrey’s prior is proportional to the square root of Fisher’s Information, I(θ|y):

\[
\pi_{JP}(\theta) \propto \sqrt{I(\theta|y)} \tag{4.1}
\]

¹ The change of variable theorem is a procedure for determining the pdf of a (continuous) random variable Y that is a strictly monotonic (1:1) transformation of another (continuous) random variable X, i.e., Y = g(X). The pdf for Y is
\[
p_Y(y) = p_X\!\left(g^{-1}(y)\right)\left|\frac{dg^{-1}(y)}{dy}\right|
\]
See Section 4.5 for more details.

where

\[
I(\theta|y) = E\!\left[\left(\frac{d \log f(y|\theta)}{d\theta}\right)^{2}\right] \tag{4.2}
\]

Note that under certain regularity conditions (e.g., that the differentiation operation can be moved inside the integral), Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta|y) = -E\!\left[\frac{d^{2} \log f(y|\theta)}{d\theta^{2}}\right] \tag{4.3}
\]

which is often much easier to calculate than Eq’n 4.2. For more discussion of Fisher’s Information see Section 4.6.

4.1.3 Jeffrey’s prior is invariant to 1:1 transformations

Remark. Let f(y|θ) denote a probability density or mass function for a random variable y where θ is a scalar. Let φ = g(θ) where g is a strictly monotonic (1:1 or bijective) transformation. If we specify a Jeffrey’s prior for θ, namely, πJP(θ) ∝ √I(θ|y), then the induced prior on φ, π(φ), is proportional to √I(φ|y). In other words, a 1:1 transformation of a parameter that has a Jeffrey’s prior yields a Jeffrey’s prior for the transformed parameter.

Proof. This proof uses the chain rule, dy/dx = (dy/dz)(dz/dx), and the change of variable theorem.
Write the Fisher information for θ as follows:

\[
I(\theta|y) = E\!\left[\left(\frac{d \log f(y|\theta)}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y|\varphi)}{d\varphi}\,\frac{d\varphi}{d\theta}\right)^{2}\right]
= E\!\left[\left(\frac{d \log f(y|\varphi)}{d\varphi}\right)^{2}\right]\left(\frac{d\varphi}{d\theta}\right)^{2}
= I(\varphi|y)\left(\frac{d\varphi}{d\theta}\right)^{2}
\]

Thus the Jeffrey’s prior for θ can be written:

\[
\pi_{JP}(\theta) \propto \sqrt{I(\theta|y)} = \sqrt{I(\varphi|y)}\,\left|\frac{d\varphi}{d\theta}\right|
\]

Then the induced distribution for φ given this prior:

\[
\pi(\varphi) = \pi_{JP}(\theta)\left|\frac{d\theta}{d\varphi}\right| \propto \sqrt{I(\varphi|y)}\,\left|\frac{d\varphi}{d\theta}\right|\left|\frac{d\theta}{d\varphi}\right| = \sqrt{I(\varphi|y)}
\]

4.1.4 Example A. Binomial distribution

Suppose that the prevalence of Potato Virus Y in a population of aphids is an unknown parameter θ. A
random sample of n aphids is taken (using a trap) and the number of aphids with the virus is x. Assuming
independence between the aphids and that they all have the same probability of having the virus, x ∼ Binomial(n, θ). The Fisher information:
" #
d2 log nx + x log(θ) + (n − x) log(1 − θ)
  
−x n−x
I(θ) = −E = −E −
dθ2 θ2 (1 − θ)2
E(x) n − E(x) nθ n − nθ n
= 2
+ 2
= 2 + 2
=
θ (1 − θ) θ (1 − θ) θ(1 − θ)
Thus the Jeffrey’s prior is

\[
\pi_{JP}(\theta) \propto \sqrt{\frac{1}{\theta(1-\theta)}} = \theta^{-1/2}(1-\theta)^{-1/2}
\]

which is the kernel for a Beta(1/2,1/2) distribution.

Aside. Note that the mle for θ is θ̂ = x/n and the variance of θ̂ is θ(1 − θ)/n, which equals I(θ)⁻¹.
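
As a quick numerical illustration (a minimal sketch with hypothetical counts, not part of the original example), the Beta(1/2,1/2) Jeffrey's prior combines with binomial data to give a Beta(x + 1/2, n − x + 1/2) posterior:

# Jeffrey's prior for a binomial proportion: Beta(1/2, 1/2)
# Hypothetical data: n aphids trapped, x carrying the virus
n <- 50
x <- 12

# Conjugate update: posterior is Beta(x + 1/2, n - x + 1/2)
post.alpha <- x + 0.5
post.beta  <- n - x + 0.5

post.alpha/(post.alpha + post.beta)             # posterior mean
qbeta(c(0.025, 0.975), post.alpha, post.beta)   # 95% credible interval

# Prior and posterior densities
theta <- seq(0.001, 0.999, length.out = 400)
plot(theta, dbeta(theta, post.alpha, post.beta), type = "l", col = "blue",
     xlab = "theta", ylab = "density")
lines(theta, dbeta(theta, 0.5, 0.5), col = "red", lty = 2)
legend("topright", legend = c("posterior", "Jeffrey's prior"),
       col = c("blue", "red"), lty = 1:2)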

4.1.5 Example B. Exponential distribution


Let x1, . . . , xn be an iid sample from an exponential distribution with rate parameter λ, namely, xi ∼ Exponential(λ), i = 1, . . . , n. For example, suppose that xi is the amount of time individual i waits in a queue at a bank during lunch hour until seeing a teller.
The Fisher information for a single random variable:

\[
I(\lambda) = E\!\left[\left(\frac{d \log f(x|\lambda)}{d\lambda}\right)^{2}\right]
= E\!\left[\left(\frac{d\,(\log\lambda - \lambda x)}{d\lambda}\right)^{2}\right]
= E\!\left[\left(\frac{1}{\lambda} - x\right)^{2}\right]
= E\!\left[\frac{1}{\lambda^{2}} - \frac{2x}{\lambda} + x^{2}\right]
= \frac{1}{\lambda^{2}} - \frac{2}{\lambda^{2}} + \frac{2}{\lambda^{2}} = \frac{1}{\lambda^{2}}
\]

where we used the facts that E(x) = 1/λ and E(x²) = 2/λ² (use the moment generating function λ/(λ − t)).
Alternatively, we could use the other expression for I(λ):

\[
I(\lambda) = -E\!\left[\frac{d^{2}\log f(x|\lambda)}{d\lambda^{2}}\right] = -E\!\left[-\frac{1}{\lambda^{2}}\right] = \frac{1}{\lambda^{2}}
\]

Thus the Fisher information for the entire sample is I_n(λ) = n/λ².

Then the Jeffrey’s prior:

\[
\pi_{JP}(\lambda) \propto \sqrt{I(\lambda)} = \frac{1}{\lambda}
\]

Note that this is an improper prior as it does not integrate to a finite value over the domain of λ, namely (0, ∞). However this is a situation where the posterior for λ is proper:

\[
p(\lambda|x_1,\ldots,x_n) \propto \pi(\lambda) f(x_1,\ldots,x_n|\lambda)
= \frac{1}{\lambda}\,\lambda^{n}\exp\!\left(-\lambda\sum_{i=1}^{n}x_i\right)
= \lambda^{n-1}\exp\!\left(-\lambda\sum_{i=1}^{n}x_i\right)
\]

which is the kernel for a Gamma(n, Σᵢ xᵢ) density function.
To demonstrate the invariance under a 1:1 transformation, reparameterize the exponential with θ = g(λ) = √λ, then

\[
f(x|\theta) = \theta^{2}\exp(-\theta^{2}x)
\]

and g⁻¹(θ) = θ² = λ. Given π(λ) ∝ 1/λ, the induced prior for θ:

\[
\pi_{\theta}(\theta) = \pi_{\lambda}\!\left(g^{-1}(\theta)\right)\left|\frac{dg^{-1}(\theta)}{d\theta}\right|
= \frac{1}{\theta^{2}}\left|\frac{d\theta^{2}}{d\theta}\right| = \frac{2}{\theta}
\]

Checking that this is indeed the Jeffrey’s prior for θ:

\[
I(\theta) = -E\!\left[\frac{d^{2}\left(2\log\theta - \theta^{2}x\right)}{d\theta^{2}}\right] = \frac{4}{\theta^{2}}
\]

Thus the Jeffrey’s prior is πJP(θ) ∝ √I(θ) = 2/θ.
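
A small numerical sketch (the waiting times below are made up, not from the notes) summarizing the Gamma(n, Σᵢ xᵢ) posterior that results from the Jeffrey's prior π(λ) ∝ 1/λ:

# Hypothetical waiting times (minutes) for n = 8 customers
x <- c(2.1, 0.7, 3.4, 1.2, 5.0, 0.4, 2.8, 1.9)
n <- length(x)

# Under the Jeffrey's prior pi(lambda) proportional to 1/lambda,
# the posterior is Gamma(shape = n, rate = sum(x))
post.shape <- n
post.rate  <- sum(x)

post.shape / post.rate                                        # posterior mean of lambda
qgamma(c(0.025, 0.975), shape = post.shape, rate = post.rate) # 95% credible interval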

4.1.6 Example C. Normal dist’n with known variance

The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where σ² is known. It can be shown that the Jeffrey’s prior for µ when σ² is known is

\[
\pi_{JP}(\mu) \propto 1, \quad -\infty < \mu < \infty
\]

This is an improper prior because the integral of 1 over (−∞, ∞) is not finite. However the posterior distribution is proper:

\[
\pi(\mu|x) \propto \exp\!\left(-\frac{(\mu-\bar{x})^{2}}{2\sigma^{2}/n}\right) \times 1
\]

namely the kernel for a Normal(x̄, σ²/n).
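
A minimal sketch of the resulting Normal(x̄, σ²/n) posterior (the data and the known σ² below are made up for illustration):

# Hypothetical sample with known variance sigma2
x <- c(10.2, 9.7, 11.1, 10.5, 9.9)
sigma2 <- 0.25
n <- length(x)

# Flat (Jeffrey's) prior on mu => posterior is Normal(mean(x), sigma2/n)
post.mean <- mean(x)
post.sd   <- sqrt(sigma2 / n)

qnorm(c(0.025, 0.975), mean = post.mean, sd = post.sd)   # 95% credible interval for mu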

4.1.7 Example D. Normal dist’n with known mean

The sampling distribution for x1, . . . , xn is Normal(µ, σ²) where µ is known. The Jeffrey’s prior for σ² (given known µ) can be shown to be the following:

\[
\pi_{JP}(\sigma^{2}) \propto \frac{1}{\sigma^{2}}, \quad 0 < \sigma^{2} < \infty \tag{4.4}
\]

which is an improper prior because the integral of 1/σ² over (0, ∞) is not finite. The posterior distribution for σ²,

\[
p(\sigma^{2}|x) \propto \left(\sigma^{2}\right)^{-\frac{n}{2}}\exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right)\frac{1}{\sigma^{2}}
= \left(\sigma^{2}\right)^{-\left(\frac{n}{2}+1\right)}\exp\!\left(-\frac{z^{2}}{2\sigma^{2}}\right) \tag{4.5}
\]

where z² = Σᵢ₌₁ⁿ (xᵢ − µ)². This is the kernel for Γ⁻¹(n/2, z²/2) so long as n/2 > 0 (obviously so) and z² > 0, which simply means that not all values of xᵢ are identical.
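
A short sketch (again with made-up data and µ treated as known) of a 95% credible interval for σ² from this inverse-Gamma posterior, computed on the Gamma scale for 1/σ² and then inverted, as is done again in Example F below:

# Hypothetical data with known mean mu
x  <- c(4.8, 5.3, 5.1, 4.6, 5.4, 5.0)
mu <- 5.0
n  <- length(x)
z2 <- sum((x - mu)^2)

# Posterior: sigma^2 ~ Inverse-Gamma(n/2, z2/2),
# equivalently 1/sigma^2 ~ Gamma(shape = n/2, rate = z2/2)
gam.bds <- qgamma(c(0.025, 0.975), shape = n/2, rate = z2/2)
sort(1 / gam.bds)   # 95% credible interval for sigma^2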

4.1.8 Jeffrey’s prior for a multivariate parameter vector.

To calculate Jeffrey’s prior for a multivariate parameter vector, θ, with p parameters, one calculates the
score function, which is now the gradient of the log likelihood:
\[
S(\theta) =
\begin{pmatrix}
\frac{d}{d\theta_{1}} \ln f(x) \\
\vdots \\
\frac{d}{d\theta_{p}} \ln f(x)
\end{pmatrix} \tag{4.6}
\]

and then calculates the Hessian of the log likelihood, namely, the matrix of the partial derivatives of the score function:

\[
H(\theta) =
\begin{pmatrix}
\frac{d}{d\theta_{1}}S(\theta)[1] & \frac{d}{d\theta_{1}}S(\theta)[2] & \cdots & \frac{d}{d\theta_{1}}S(\theta)[p] \\
\frac{d}{d\theta_{2}}S(\theta)[1] & \frac{d}{d\theta_{2}}S(\theta)[2] & \cdots & \frac{d}{d\theta_{2}}S(\theta)[p] \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d}{d\theta_{p}}S(\theta)[1] & \frac{d}{d\theta_{p}}S(\theta)[2] & \cdots & \frac{d}{d\theta_{p}}S(\theta)[p]
\end{pmatrix}
=
\begin{pmatrix}
\frac{d^{2}}{d\theta_{1}^{2}}\ln f(x) & \frac{d^{2}}{d\theta_{1}d\theta_{2}}\ln f(x) & \cdots & \frac{d^{2}}{d\theta_{1}d\theta_{p}}\ln f(x) \\
\frac{d^{2}}{d\theta_{2}d\theta_{1}}\ln f(x) & \frac{d^{2}}{d\theta_{2}^{2}}\ln f(x) & \cdots & \frac{d^{2}}{d\theta_{2}d\theta_{p}}\ln f(x) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2}}{d\theta_{p}d\theta_{1}}\ln f(x) & \frac{d^{2}}{d\theta_{p}d\theta_{2}}\ln f(x) & \cdots & \frac{d^{2}}{d\theta_{p}^{2}}\ln f(x)
\end{pmatrix} \tag{4.7}
\]

Fisher information is

\[
I(\theta|x) = -E[H(\theta)] \tag{4.8}
\]

Finally, the Jeffrey’s prior for the vector of parameters is proportional to the square root of the determinant of the Fisher information matrix:

\[
\pi_{JP}(\theta) \propto \sqrt{\det\!\left(I(\theta|x)\right)} \tag{4.9}
\]

Example E. Normal(µ,σ 2 ) Jeffrey’s prior

To make the differentiation a little less awkward, the normal distribution is parameterized with θ = σ². Given y1, . . . , yn iid Normal(µ, θ), the pdf can be written:

\[
f(y;\mu,\theta) = (2\pi\theta)^{-n/2} e^{-\frac{1}{2\theta}\sum_{i=1}^{n}(y_i-\mu)^{2}} \tag{4.10}
\]

Then the log likelihood (dropping the additive constant and the factor 1/2, i.e., working with twice the log likelihood, which leaves the Jeffrey’s prior unchanged since the prior is defined only up to proportionality):

\[
\log L(\mu,\theta) = l \propto -n\log(\theta) - \theta^{-1}\sum_{i=1}^{n}(y_i-\mu)^{2} \tag{4.11}
\]

The score vector:

\[
\frac{dl}{d\mu} = 2\theta^{-1}\sum_{i=1}^{n}(y_i-\mu) = 2\theta^{-1}n(\bar{y}-\mu) \tag{4.12}
\]
\[
\frac{dl}{d\theta} = -n\theta^{-1} + \theta^{-2}\sum_{i=1}^{n}(y_i-\mu)^{2} \tag{4.13}
\]

The Hessian matrix (H) components:

\[
\frac{d^{2}l}{d\mu^{2}} = -2\theta^{-1}n \tag{4.14}
\]
\[
\frac{d^{2}l}{d\mu\,d\theta} = \frac{d^{2}l}{d\theta\,d\mu} = -2\theta^{-2}n(\bar{y}-\mu) \tag{4.15}
\]
\[
\frac{d^{2}l}{d\theta^{2}} = n\theta^{-2} - 2\theta^{-3}\sum_{i=1}^{n}(y_i-\mu)^{2} \tag{4.16}
\]

Then the Fisher Information matrix

\[
I(\mu,\theta) = -E[H] =
\begin{pmatrix}
\frac{2n}{\theta} & 0 \\
0 & \frac{n}{\theta^{2}}
\end{pmatrix}
=
\begin{pmatrix}
\frac{2n}{\sigma^{2}} & 0 \\
0 & \frac{n}{\sigma^{4}}
\end{pmatrix} \tag{4.17}
\]

Then the Jeffrey’s prior for (µ, σ²):

\[
\pi_{JP}(\mu,\sigma^{2}) \propto \sqrt{|I(\mu,\sigma^{2})|}
= \sqrt{\frac{2n}{\sigma^{2}}\,\frac{n}{\sigma^{4}}}
= \sqrt{\frac{2n^{2}}{\sigma^{6}}} \propto \frac{1}{(\sigma^{2})^{3/2}} \tag{4.18}
\]

The posterior can be shown to be the product of an inverse Chi-square (Gamma) and a normal (conditional on σ²).
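
The expectation in Eq'n 4.17 can be checked numerically. The sketch below (my own check, with arbitrarily chosen µ, σ² and n) simulates many normal samples, evaluates the Hessian components (4.14)-(4.16) for each, and averages −H, which should be close to the matrix in Eq'n 4.17:

# Monte Carlo check of I(mu, theta) = -E[H] for the Normal(mu, theta) example
set.seed(1)
mu <- 2; theta <- 4; n <- 20      # arbitrary "true" values (theta = sigma^2)
nsim <- 20000

H11 <- H12 <- H22 <- numeric(nsim)
for (s in 1:nsim) {
  y <- rnorm(n, mean = mu, sd = sqrt(theta))
  H11[s] <- -2 * n / theta                               # Eq'n 4.14
  H12[s] <- -2 * n * (mean(y) - mu) / theta^2            # Eq'n 4.15
  H22[s] <- n / theta^2 - 2 * sum((y - mu)^2) / theta^3  # Eq'n 4.16
}

# Average of -H versus the analytic Fisher information entries
rbind(MonteCarlo = c(-mean(H11), -mean(H12), -mean(H22)),
      Analytic   = c(2 * n / theta, 0, n / theta^2))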

Example F. Normal with two independent Jeffrey’s priors.

Note: this example does not use the determinant approach from above.


If the Jeffrey’s prior for µ (given σ² is known) and the Jeffrey’s prior for σ² (given µ is known) are treated as independent priors, the joint prior is

\[
\pi(\mu,\sigma^{2}) \propto 1 \times \frac{1}{\sigma^{2}}, \quad -\infty < \mu < \infty,\; 0 < \sigma^{2} < \infty
\]

then it turns out that the posterior marginal distribution for σ² is

\[
\sigma^{2}|y \sim \Gamma^{-1}\!\left(\frac{n-1}{2},\, \frac{(n-1)s^{2}}{2}\right)
\]

where s² = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)², the usual sample variance. And the posterior marginal distribution for µ is a Student’s t distribution with mean ȳ, scale parameter s²/n, and n − 1 degrees of freedom.
Referring to the 2x4 timber example from the Lecture 3 notes, ȳ = 3.5283 and s² = 0.0122209, with 9 degrees of freedom. Thus the posterior expected value of µ is 3.5283. A 95% credible interval for µ can be found by calculating the 2.5 and 97.5 percentiles of the standard t distribution with 9 degrees of freedom, multiplying them by the square root of the scale parameter, and then adding the mean:

qt(c(0.025,0.975),df=9)*sqrt(0.0122209/10) + 3.5283
3.449219 3.607381

Thus the 95% credible interval for µ is (3.45, 3.61).


For σ², the 95% credible interval is calculated by finding the corresponding interval for 1/σ² from the Gamma(9/2, 9*0.0122209/2) distribution and then inverting the results:

gam.bds <- qgamma(c(0.025,0.975),shape=9/2,rate=9*0.0122209/2)


inv.gam.bds <- sort(1/gam.bds)
inv.gam.bds
0.005781919 0.040730458

Thus the 95% credible interval for σ² is (0.0058, 0.0407).
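
The joint posterior can also be sampled directly, which is convenient for summarizing functions of (µ, σ²). The composition-sampling sketch below is my own illustration, reusing the summary statistics ȳ, s², and n = 10 from the timber example: draw σ² from its inverse-Gamma marginal, then µ | σ² from Normal(ȳ, σ²/n).

# Composition sampling from the joint posterior under pi(mu, sigma^2) ~ 1/sigma^2:
# sigma^2 | y ~ Inv-Gamma((n-1)/2, (n-1)s^2/2), then mu | sigma^2, y ~ N(ybar, sigma^2/n)
ybar <- 3.5283; s2 <- 0.0122209; n <- 10
nsim <- 10000
set.seed(42)

sigma2 <- 1 / rgamma(nsim, shape = (n - 1)/2, rate = (n - 1)*s2/2)
mu     <- rnorm(nsim, mean = ybar, sd = sqrt(sigma2 / n))

quantile(mu,     c(0.025, 0.975))   # approx (3.45, 3.61), matching the t interval above
quantile(sigma2, c(0.025, 0.975))   # approx (0.0058, 0.041), matching the inverted Gamma bounds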

4.2 Reference Priors

As mentioned in LN 2, the Jeffrey’s prior is one of several objective priors, namely procedures for selecting priors which will yield the same prior for anyone who uses the procedure. Reich and Ghosh (2019) discuss four other objective priors that are at least worth knowing about, and I recommend reading their approximately two pages of discussion. Admittedly, understanding the general concepts behind the procedures and being able to implement the procedures can have quite different degrees of difficulty, with the latter generally more difficult than the former. Here we will just examine one other type of objective prior, the Reference Prior, denoted πRP(θ).

4.2.1 KL divergence

Before introducing Reference Priors, the notion of Kullback-Leibler (KL) divergence is introduced. KL divergence is a measure of the difference between two pmfs or two pdfs. A KL divergence value of 0 means that the two distributions are identical.
For the continuous case let f and g denote two pdfs. The KL divergence is defined “conditional” on one of the two distributions, here denoted KL(f, g) or KL(g, f), where KL(f, g) ≠ KL(g, f) except when f and g are identical (almost everywhere). More exactly, KL(f, g) is the expected value of log(f(x)/g(x)) assuming that f is “true” (more accurately stated, the expectation is with respect to f(x)), and KL(g, f) is the reverse:

\[
KL(f,g) = \int \log\!\left(\frac{f(x)}{g(x)}\right) f(x)\,dx
= \int \log(f(x)) f(x)\,dx - \int \log(g(x)) f(x)\,dx
= E_{f}[\log f(X)] - E_{f}[\log g(X)] \tag{4.19}
\]

and

\[
KL(g,f) = \int \log\!\left(\frac{g(x)}{f(x)}\right) g(x)\,dx
= E_{g}[\log g(X)] - E_{g}[\log f(X)] \tag{4.20}
\]

Notes: (1) if g(x) = f(x), then log(f(x)/g(x)) = log(1) = 0; and (2) KL(f, g) ≥ 0.

KL divergence examples. As a simple example consider a discrete valued random variable with values 0, 1, or 2. Let f(x) be the Binomial(2, p=0.2) pmf with probabilities 0.64, 0.32, and 0.04 for X = 0, 1, and 2, respectively. Let g(x) be the discrete uniform where g(0) = g(1) = g(2) = 1/3. Then

\[
KL(f,g) = \sum_{x=0}^{2} f(x)\log\!\left(\frac{f(x)}{g(x)}\right)
= 0.64\log\!\left(\frac{0.64}{1/3}\right) + 0.32\log\!\left(\frac{0.32}{1/3}\right) + 0.04\log\!\left(\frac{0.04}{1/3}\right) = 0.3196145
\]
\[
KL(g,f) = \sum_{x=0}^{2} g(x)\log\!\left(\frac{g(x)}{f(x)}\right)
= \tfrac{1}{3}\log\!\left(\frac{1/3}{0.64}\right) + \tfrac{1}{3}\log\!\left(\frac{1/3}{0.32}\right) + \tfrac{1}{3}\log\!\left(\frac{1/3}{0.04}\right) = 0.5029201
\]

For another example let g(x) be a Binomial(2, p=0.25). Then KL(f, g) = 0.01400421 and KL(g, f) = 0.01476399; thus the KL divergence measures are both close to 0 (and close to each other).
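
These values are easy to reproduce; a small sketch:

# KL divergence between two pmfs on {0, 1, 2}
kl <- function(p, q) sum(p * log(p / q))

f <- dbinom(0:2, size = 2, prob = 0.2)   # 0.64, 0.32, 0.04
g <- rep(1/3, 3)                         # discrete uniform

kl(f, g)   # 0.3196145
kl(g, f)   # 0.5029201

g2 <- dbinom(0:2, size = 2, prob = 0.25)
kl(f, g2)  # 0.01400421
kl(g2, f)  # 0.01476399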

4.2.2 Reference prior

Given data y, the KL divergence between the prior and posterior (with respect to the posterior) is

\[
KL\!\left(p(\theta|y), \pi(\theta)\right) = \int p(\theta|y)\log\!\left(\frac{p(\theta|y)}{\pi(\theta)}\right) d\theta \tag{4.21}
\]

The key idea of a RP is that the KL divergence between the posterior and the prior should be as large as possible, thus implying that the data are dominating the prior.
However, this measure is conditional on the data, y, which does not help for determining a prior. Thus to remove the conditioning on the data, the data are integrated out, and πRP(θ) is the probability distribution (pmf or pdf) that maximizes the following:

\[
E_{y}\!\left[KL\!\left(p(\theta|y), \pi(\theta)\right)\right] = \int KL\!\left(p(\theta|y), \pi(\theta)\right) m(y)\,dy \tag{4.22}
\]

where m(y) is the marginal distribution for the data.


While this approach is conceptually attractive, it can be technically challenging as one is trying to find
an entire probability distribution π(θ) that maximizes eq’n (4.22), noting that determining m(y) involves
integration as well.

4.3 Eliciting Informative Priors

Often the scientist or subject-matter specialist will have a definite opinion as to what the range of parameter
values should be. For example, an experienced heart surgeon who has done 1000s of coronary by-pass
surgeries on a variety of patients will have a definite opinion on post-surgery survival probability.
How does one translate that prior knowledge into a prior probability distribution?

• First, think about whether it’s even feasible: are there too many parameters, or is the underlying sampling model fairly complex, e.g., a hierarchical model? For example, suppose the sampling model is a multiple regression with 4 covariates, thus five regression parameters, β0, β1, β2, β3, and β4, plus the variance parameter. How might the expert’s knowledge be translated into prior distributions for these parameters? The expert might have an opinion about the signs, positive or negative, of each coefficient, and perhaps the relative importance of each covariate; e.g., with standardized covariates, the effects of x1 and x2 might both be thought positive, but the effect of x1 may be twice as large as that of x2.
• Single parameter case. This is generally the most feasible situation. If the expert is not familiar with probability distributions, the statistician may need to work with the expert to arrive at a prior and can help by asking questions about the parameter without using statistics jargon.
For example, instead of asking for the median, ask “For what value of the parameter do you think that it’s equally likely that values are either below or above it?”
“What do you think the range of values might be?”
“What do you think the relative variation around an average value might be? For example, if your best guess for θ is 15, is your uncertainty within ±10% of that value (±1.5), or 20% (±3.0)?” Thus one potentially gets a measure of the coefficient of variation, CV = σ/µ.
Given a mean value, and a range, standard deviation, or CV, and assuming a particular standard probability distribution might suffice, rough estimates of hyperparameters for the prior might be calculated. This is the “moment matching” idea that we’ve examined previously.
For example, with the above example of the surgeon, the surgeon was thinking that θ would be 0.7 on average. Further questioning about the surgeon’s uncertainty led to a determination that a CV of 0.1 would be appropriate. Using a Beta distribution for the prior, the mean is α/(α + β) and the variance is αβ/[(α + β)²(α + β + 1)]; some algebra yields Beta(29.3, 12.6). (A moment-matching sketch in R is given after this list.)
• Discrete histogram priors. Another simple way to elicit priors is to partition parameter values into non-overlapping bins, and have the expert present relative weights for each bin. For example, θ is grouped into three bins, [0,10), [10,25), [25,30], and the expert gives relative weights of 0.2, 0.5, and 0.3. A proper histogram pdf is constructed using the result that bin area = height × width, where bin area corresponds to probability or weight. For the bin [0,10), area = 0.2 and width = 10, thus height equals area/width = 0.2/10 = 0.02. For [10,25) the height is 0.5/15 ≈ 0.033, and for [25,30] the height is 0.3/5 = 0.06.
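
A sketch of the moment-matching calculation for the surgeon's Beta prior (the function name is my own; the algebra simply inverts the Beta mean and variance formulas given above):

# Moment matching for a Beta(alpha, beta) prior given a mean and a CV
beta.from.mean.cv <- function(m, cv) {
  v <- (cv * m)^2                      # variance implied by the CV
  ab <- m * (1 - m) / v - 1            # alpha + beta
  c(alpha = m * ab, beta = (1 - m) * ab)
}

beta.from.mean.cv(m = 0.7, cv = 0.1)
# alpha ~ 29.3, beta ~ 12.6, i.e., approximately Beta(29.3, 12.6)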

References
• “The elicitation of prior distributions”, Chaloner, 1996, in Bayesian Biostatistics, eds., Berry and
Stangl.
• Uncertain Judgements: Eliciting Experts’ Probabilities, O’Hagan, et al. 2006. This book is available
online as a pdf via the University Library.

4.4 Sensitivity analysis of priors

Using the term sensitivity analysis loosely here, we mean an examination of the effects of different priors on the posterior. For example, the comparison of the posterior distribution for the probability of survival after by-pass surgery under the surgeon’s prior and under the medical student’s prior is a sensitivity analysis.
Loosely put again, if the posterior distributions under different priors look much “the same”, e.g., have similar means and variances, then one might say that the results are robust to the priors.
With large enough samples, unless the prior is particularly concentrated over a narrow range of possible values, sometimes called a pig-headed prior, the posterior will look much the same for a wide range of priors as the data (the likelihood) are dominating the prior.

Problematic issues
• Practical issue: if there are 100s of parameters, it is tedious, at the least, to carry out a sensitivity analysis for all the priors.
• Generalized linear models, where the data come from an exponential family distribution F with parameter θ and covariate(s) x, and the “link” function g(θ, x) is set equal to a linear model:

y|θ, x ∼ F(θ, x)
g(θ, x) = β0 + β1 x

Apparently uninformative priors in the link function may induce quite informative priors at a lower level. For example, consider a logistic regression for the number of patients surviving heart bypass surgery where the probabilities differ with age:

\[
y_i \,|\, \text{Age}_i \sim \text{Bernoulli}\!\left(\theta(\text{Age}_i)\right),
\quad \text{where } \theta(\text{Age}) = \frac{\exp(\beta_0 + \beta_1 \text{Age})}{1 + \exp(\beta_0 + \beta_1 \text{Age})},
\quad \text{equivalently } \ln\!\left(\frac{\theta(\text{Age})}{1-\theta(\text{Age})}\right) = \beta_0 + \beta_1 \text{Age}
\]

A seemingly innocuous prior for both β0 and β1 is Normal(µ=0, σ²=5²). Suppose that the ages range from 40 to 70. Figure 4.1 shows the results of simulating from the priors for β0 and β1 and computing the induced priors for survival at four different ages. Note how the probabilities are massed near 0 and 1. Such simulation exercises can be quite valuable for detecting such effects; a sketch of such a simulation is given below.
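
A sketch of such a simulation (my own reconstruction of the idea; the notes do not show the code actually used for Figure 4.1):

# Induced prior on survival probability theta(Age) when beta0, beta1 ~ Normal(0, 5^2)
set.seed(101)
nsim  <- 10000
beta0 <- rnorm(nsim, mean = 0, sd = 5)
beta1 <- rnorm(nsim, mean = 0, sd = 5)

ages <- c(40, 50, 60, 70)
par(mfrow = c(2, 2))
for (age in ages) {
  theta <- plogis(beta0 + beta1 * age)   # inverse logit
  hist(theta, breaks = 50, freq = FALSE,
       main = paste("Age =", age), xlab = "Survival")
}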

Figure 4.1: Induced prior for survival probability (θ) as a function of age given Normal(0,5) priors for logit
transformation, ln (θ/(1 − θ)) = β0 + β1 Age.

[Figure 4.1 consists of four density panels, for Age = 40, 50, 60, and 70, with Survival (θ) on the horizontal axis; in each panel the induced prior mass piles up near 0 and 1.]
4.5 Supplement A: Change of Variable Theorem

The Problem

A continuous random variable X has a pdf fX (X).


A new random variable Y is “constructed” from X by a strictly monotonic function g (a 1:1 function)
Y = g(X).
Thus X can be “recovered” from Y by an inverse function, g −1 , where X = g −1 (Y ).
The problem: what is the pdf for Y , namely, fY (Y )?

The Solution

The pdf for Y is found as follows:

\[
f_Y(Y) = f_X\!\left(g^{-1}(Y)\right)\left|\frac{dg^{-1}(Y)}{dY}\right| \tag{4.23}
\]

Example 1. X is Uniform(0,20) and Y = g(X) = 3X. Thus g⁻¹(Y) = Y/3.

Note that the support of Y will be (0,60) as Y = 3X. Then

\[
f_Y(Y) = \frac{1}{20}\left|\frac{d(Y/3)}{dY}\right| = \frac{1}{20}\cdot\frac{1}{3} = \frac{1}{60}\, I_Y(0 < Y < 60)
\]

• IY(condition) is an indicator function which takes on one of two values: 1 when the condition is met (True), and 0 when it is not (False).
• This is an “intuitive” result: Y ∼ Uniform(0,60).

Example 2. X ∼ Uniform(16,64), thus the pdf of X is (1/(64−16)) I_X(16 < X < 64) = (1/48) I_X(16 < X < 64). Define Y = g(X) = √X, noting that the support for Y is (4,8). Then g⁻¹(Y) = Y², and

\[
f_Y(Y) = \frac{1}{48}\left|\frac{dY^{2}}{dY}\right| = \frac{1}{48}\,2Y = \frac{Y}{24}\, I_Y(4 < Y < 8)
\]
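
A quick simulation check of Example 2 (a sketch; it compares the empirical density of Y = √X with the derived pdf Y/24):

# Simulate X ~ Uniform(16, 64), transform Y = sqrt(X), and compare the
# empirical density of Y to the derived pdf f_Y(y) = y/24 on (4, 8)
set.seed(7)
y <- sqrt(runif(100000, min = 16, max = 64))

plot(density(y), main = "pdf of Y = sqrt(X)", xlab = "y")
curve(x / 24, from = 4, to = 8, add = TRUE, col = "blue", lty = 2)
legend("topleft", legend = c("Empirical", "Theory: y/24"),
       col = c("black", "blue"), lty = 1:2)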

Skeleton of Proof

The essence of the change of variables theorem is that, for x = g⁻¹(y), Pr(y ≤ Y ≤ y + dy) should equal Pr(g⁻¹(y) ≤ X ≤ g⁻¹(y) + dg⁻¹(y)) ≡ Pr(x ≤ X ≤ x + dx). This will happen if the following relationship between the areas under the two pdfs holds:

\[
|f_Y(y)\,dy| = \left|f_X\!\left(g^{-1}(y)\right) dg^{-1}(y)\right|
\]

Note that f_Y(y) dy is approximately Pr(y < Y < y + dy). Think of the area of a rectangle, Area = Height × Width, where Area is probability, Height is the pdf evaluated at y, and Width is dy. Then

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{dg^{-1}(y)}{dy}\right|
\]

Biological Allometry Example

This example is based on http://www.biology.arizona.edu/biomath/tutorials/applications/allometry.html:

Male fiddler crabs (Uca pugnax) possess an enlarged major claw for fighting or threatening other males. In addition, males with larger claws attract more female mates.
The sex appeal (claw size) of a particular species of fiddler crab (Figure 4.2) is determined by the following allometric equation:

\[
M_c = 0.036\, M_b^{1.356}
\]

where Mc is the mass of the major claw and Mb is the body mass of the crab minus the mass of the claw.

Suppose that Mb is on average 2000 mg with a CV of 0.20. Assuming a Gamma distribution for Mb , then
Mb ∼ Gamma(25, 0.0125).

Figure 4.2: Fiddler Crab. Image from Southeastern Regional Taxonomic Center (SERTC), South Carolina
Department of Natural Resources.

What is the pdf for Mc? To reduce notation momentarily, let X = Mb, Y = Mc, a = 0.036, b = 1.856, α = 25, and β = 0.0125. Then

\[
Y = g(X) = aX^{b} \quad\text{and}\quad X = g^{-1}(Y) = \left(\frac{Y}{a}\right)^{1/b}
\]

and

\[
\frac{dg^{-1}(y)}{dy} = a^{-1/b}\,\frac{1}{b}\,y^{\frac{1-b}{b}}
\]

The pdf for Y (Mc):

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{dg^{-1}(y)}{dy}\right|
= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left[\left(\frac{y}{a}\right)^{1/b}\right]^{\alpha-1}
e^{-\beta\left(\frac{y}{a}\right)^{1/b}} \times a^{-1/b}\,\frac{1}{b}\,y^{\frac{1-b}{b}}
\]

Substituting the original values:

\[
f_{M_c}(M_c) = \frac{0.0125^{25}}{\Gamma(25)}\left[\left(\frac{M_c}{0.036}\right)^{1/1.856}\right]^{24}
e^{-0.0125\left(\frac{M_c}{0.036}\right)^{1/1.856}} \times 0.036^{-1/1.856}\,\frac{1}{1.856}\,M_c^{\frac{1-1.856}{1.856}} \tag{4.24}
\]

The accuracy of the derivation was examined by simulating body mass from a Gamma(25,0.0125) and then
transforming using the allometric equation (the R code is shown below). The empirical and theoretical pdfs
are plotted in Figure 4.3, and the two are quite similar.


Figure 4.3: Theoretical and empirical pdf for crab claw mass.

#---- Change of variable with Fiddler Crabs ----

# Simulate body mass Mb ~ Gamma(25, 0.0125) (mean 2000 mg, CV 0.2)
body.alpha <- 25
body.beta  <- 0.0125
n <- 500
set.seed(931)
sim.mass <- sort(rgamma(n=n, shape=body.alpha, rate=body.beta))

# Transform to claw mass via the allometric equation Mc = a * Mb^b
claw.a <- 0.036
claw.b <- 1.856
sim.claw <- claw.a * sim.mass^claw.b
plot(density(sim.claw))

# Theoretical pdf of claw mass from the change of variable theorem (Eq'n 4.24)
Mc.density <- function(y, alpha, beta, a, b) {
  x  <- (y/a)^(1/b)                      # g^{-1}(y): back-transform to body mass
  p1 <- dgamma(x, alpha, beta)           # f_X(g^{-1}(y))
  p2 <- a^(-1/b) * (1/b) * y^((1-b)/b)   # |d g^{-1}(y) / dy|
  p1 * p2
}

theory.density <- Mc.density(y=sim.claw, alpha=body.alpha, beta=body.beta,
                             a=claw.a, b=claw.b)

plot(sim.claw, theory.density, xlab="Claw mass", ylab="", main="Claw mass pdf",
     type="l", col="blue")
lines(density(sim.claw), col="red", lty=2)
legend("topright", legend=c("Theory","Empirical"), col=c("blue","red"), lty=1:2)

4.6 Supplement B: Fisher Information

In the following we begin with the case of a single (scalar) parameter θ.

Definition of Fisher Information:

\[
I(\theta|x) = E\!\left[\left(\frac{d \log f(x|\theta)}{d\theta}\right)^{2}\right] \tag{4.25}
\]

Note that under certain regularity conditions², Fisher Information can be calculated from the second derivative of the log likelihood:

\[
I(\theta|x) = -E\!\left[\frac{d^{2} \log f(x|\theta)}{d\theta^{2}}\right] \tag{4.26}
\]

which is often much easier to calculate than Eq’n 4.25.

Remarks.

1. Given n iid random variables x1, . . . , xn from the same distribution with parameter θ, the Fisher information for θ is nI1(θ|x), where I1(θ|x) denotes the information for a single observation:

\[
I(\theta|x_1,\ldots,x_n) = -E\!\left[\frac{d^{2}\log f(x_1,\ldots,x_n|\theta)}{d\theta^{2}}\right]
= -E\!\left[\frac{d^{2}\sum_{i=1}^{n}\log f(x_i|\theta)}{d\theta^{2}}\right]
= \sum_{i=1}^{n}\left(-E\!\left[\frac{d^{2}\log f(x_i|\theta)}{d\theta^{2}}\right]\right)
= nI_1(\theta|x) \tag{4.27}
\]

2. Inverse of I(θ) as lower bound on variance of θ̂. Under the previously mentioned regularity conditions, the inverse of Fisher information is the lower bound on the variance of an unbiased estimator of a parameter. In other words, given a probability distribution with parameter θ which satisfies certain regularity conditions, if θ̂ is unbiased for θ, then

\[
V(\hat{\theta}) \geq I(\theta)^{-1}
\]

The right hand term is called the Cramér-Rao bound.

Thus the variance of an unbiased estimate can never be less than the Cramér-Rao bound.

² There are three conditions in this case. (1) For all x such that f(x|θ) > 0, d log f(x|θ)/dθ exists and is finite. (2) The order of the operations of integration with respect to x and differentiation with respect to θ can be interchanged for the expectation of a function T(x), i.e.,

\[
\frac{d}{d\theta}\int T(x)\, f(x|\theta)\,dx = \int T(x)\,\frac{df(x|\theta)}{d\theta}\,dx
\]

(3) The order of the operations of integration and differentiation can also be reversed for the second derivative of f(x|θ) with respect to θ, i.e.,

\[
\frac{d^{2}}{d\theta^{2}}\int T(x)\, f(x|\theta)\,dx = \int T(x)\,\frac{d^{2}f(x|\theta)}{d\theta^{2}}\,dx
\]
3. Maximum likelihood estimators. In the particular case of maximum likelihood estimates (mles), the inverse of I(θ|x) evaluated at the mle, θ̂, is often used as an estimate of the variance of θ̂:

\[
\widehat{Var}(\hat{\theta}) = I(\hat{\theta})^{-1}
\]

Example. If y ∼ Binomial(n, p), then the mle for p is p̂ = y/n. The variance of p̂ is

\[
Var[\hat{p}] = Var\!\left[\frac{y}{n}\right] = \frac{1}{n^{2}}Var[y] = \frac{1}{n^{2}}\,np(1-p) = \frac{p(1-p)}{n}
\]

It can be shown that the Fisher information for p is

\[
I(p) = \frac{n}{p(1-p)}
\]

Observe that I⁻¹(p) = p(1 − p)/n, which is the variance of p̂. (A simulation check of this is given after these remarks.)
4. Observed Fisher Information is the Fisher information without the integration, i.e., without taking the expectation of the (negative) second derivative of the log likelihood:

\[
J(\theta) = -\frac{d^{2}\log f(x|\theta)}{d\theta^{2}} \tag{4.28}
\]

and an estimate of θ, namely θ̂, is substituted for θ.

5. Multivariate Θ. Extension of Fisher Information to the case of multiple parameters, Θ = (θ1, . . . , θq), is similar to much of the above. The differences are that instead of having a single first derivative of log f(x|Θ), there is a vector of first derivatives, namely the gradient:

\[
\nabla \log f(x|\Theta) =
\begin{pmatrix}
\frac{d \log f(x|\Theta)}{d\theta_1} \\
\frac{d \log f(x|\Theta)}{d\theta_2} \\
\vdots \\
\frac{d \log f(x|\Theta)}{d\theta_q}
\end{pmatrix} \tag{4.29}
\]

And instead of a single second derivative of log f(x|Θ), there is a matrix of second derivatives, namely the Hessian:

\[
\nabla\nabla^{T} \log f(x|\Theta) =
\begin{pmatrix}
\frac{d^{2}\log f(x|\Theta)}{d\theta_1^{2}} & \frac{d^{2}\log f(x|\Theta)}{d\theta_1 d\theta_2} & \cdots & \frac{d^{2}\log f(x|\Theta)}{d\theta_1 d\theta_q} \\
\frac{d^{2}\log f(x|\Theta)}{d\theta_2 d\theta_1} & \frac{d^{2}\log f(x|\Theta)}{d\theta_2^{2}} & \cdots & \frac{d^{2}\log f(x|\Theta)}{d\theta_2 d\theta_q} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{d^{2}\log f(x|\Theta)}{d\theta_q d\theta_1} & \frac{d^{2}\log f(x|\Theta)}{d\theta_q d\theta_2} & \cdots & \frac{d^{2}\log f(x|\Theta)}{d\theta_q^{2}}
\end{pmatrix} \tag{4.30}
\]
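
As promised in Remark 3, a quick simulation check (with arbitrarily chosen n and p) that the variance of the binomial mle matches I(p)⁻¹:

# Compare the empirical variance of the binomial mle p.hat = y/n
# with the inverse Fisher information p(1-p)/n
set.seed(3)
n <- 40; p <- 0.3
p.hat <- rbinom(100000, size = n, prob = p) / n

var(p.hat)        # empirical variance of the mle
p * (1 - p) / n   # I(p)^{-1} = 0.00525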
