
Solutions for Exercises in:

Applied Statistical Inference: Likelihood and Bayes

Leonhard Held and Daniel Sabanés Bové

Solutions provided by:


Leonhard Held, Daniel Sabanés Bové, Andrea Kraus and Manuela Ott

March 31, 2017

Corrections and other feedback are appreciated. Please send them to


manuela.ott@uzh.ch
Contents

2 Likelihood
3 Elements of frequentist inference
4 Frequentist properties of the likelihood
5 Likelihood inference in multiparameter models
6 Bayesian inference
7 Model selection
8 Numerical methods for Bayesian inference
9 Prediction
Bibliography
Index
2 Likelihood

1. Examine the likelihood function in the following examples.


a) In a study of a fungus which infects wheat, 250 wheat seeds are disseminated after contaminating them with the fungus. The research question is how large the probability θ is that an infected seed can germinate. Due to technical problems, the exact number of germinated seeds cannot be evaluated, but we know only that less than 25 seeds have germinated. Write down the likelihood function for θ based on the information available from the experiment.
◮ Since the seeds germinate independently of each other with fixed probability θ, the total number X of germinated seeds has a binomial distribution, X ∼ Bin(250, θ). The event that we know happened is X ≤ 24. Thus, the likelihood function for θ is here the cdf of the binomial distribution with parameter θ evaluated at 24:

L(θ) = Pr(X ≤ 24; θ) = Σ_{x=0}^{24} f(x; n = 250, θ) = Σ_{x=0}^{24} (250 choose x) θ^x (1 − θ)^(250−x).

b) Let X1:n be a random sample from a N(θ, 1) distribution. However, only the
largest value of the sample, Y = max(X1 , . . . , Xn ), is known. Show that the
density of Y is

f(y) = n {Φ(y − θ)}^(n−1) ϕ(y − θ),  y ∈ R,

where Φ(·) is the distribution function and ϕ(·) is the density function of the
standard normal distribution N(0, 1). Derive the distribution function of Y and
the likelihood function L(θ).
◮ We have X1, ..., Xn iid ∼ N(θ, 1) and their maximum is Y. Hence, the cdf of Y is

F(y) = Pr(Y ≤ y) = Pr{max(X1, ..., Xn) ≤ y} = ∏_{i=1}^n Pr(Xi ≤ y) = {Φ(y − θ)}^n,

because Xi ≤ y is equivalent to Xi − θ ≤ y − θ, and Xi − θ follows a standard normal distribution for all i = 1, ..., n. The probability density function of Y follows:

fY(y) = d/dy FY(y) = n {Φ(y − θ)}^(n−1) ϕ(y − θ),

and the likelihood function L(θ) is exactly this density function, but seen as a function of θ for fixed y.
c) Let X1:3 denote a random sample of size n = 3 from a Cauchy C(θ, 1) distribution, cf. Appendix A.5.2. Here θ ∈ R denotes the location parameter of the Cauchy distribution with density

f(x) = (1/π) · 1/{1 + (x − θ)²},  x ∈ R.

Derive the likelihood function for θ.
◮ Since we have an iid sample from the C(θ, 1) distribution, the likelihood function is given by the product of the densities f(xi; θ):

L(θ) = ∏_{i=1}^3 f(xi; θ) = ∏_{i=1}^3 (1/π) · 1/{1 + (xi − θ)²} = (1/π³) · 1/[{1 + (x1 − θ)²}{1 + (x2 − θ)²}{1 + (x3 − θ)²}].

d) Using R, produce a plot of the likelihood functions:
i. L(θ) in 1a).
◮ We can use the implemented cdf (pbinom) of the binomial distribution:
> ## likelihood function in partially observed
> ## binomial experiment:
> likelihood1 <- function(theta, # probability
                          n,     # sample size
                          x)     # number in interval [0, x] was observed
  {
      pbinom(x, size = n, p = theta) # use existing function
  }
> theta <- seq(from=0, to=0.2,   # range for theta
               length=1000)      # determines resolution
> plot(theta, likelihood1(theta, n=250, x=24),
       type="l",
       xlab=expression(theta), ylab=expression(L(theta)),
       main="1 (a)")
[Figure "1 (a)": the likelihood function L(θ) plotted against θ ∈ (0, 0.2).]
ii. L(θ) in 1b) if the observed sample is x = (1.5, 0.25, 3.75, 3.0, 2.5).
◮ Here too we can use the implemented functions from R:
> ## likelihood for mean of normal distribution if only the
> ## maximum is known from the sample
> likelihood2 <- function(theta, # mean of iid normal distributions
                          n,     # sample size
                          y)     # observed maximum
  {
      n * pnorm(y - theta)^(n-1) * dnorm(y - theta)
  }
> x <- c(1.5, 0.25, 3.75, 3.0, 2.5)
> theta <- seq(-1, 6, length = 1000)
> plot(theta, likelihood2(theta, y=max(x), n=length(x)),
       type="l",
       xlab=expression(theta), ylab=expression(L(theta)),
       main="1 (b)")
[Figure "1 (b)": the likelihood function L(θ) plotted against θ ∈ (−1, 6).]
iii. L(θ) in 1c) if the observed sample is x = (0, 5, 9).
◮ Here the vector computations are very useful:
> ## likelihood for location parameter of Cauchy in
> ## random sample
> likelihood3 <- function(theta, # location parameter
                          x)     # observed data vector
  {
      1/pi^3 / prod((1+(x-theta)^2))
  }
> ## In order to plot the likelihood, the function must be able
> ## to take not only one theta value but a theta vector. We can use the
> ## following trick to get a vectorised likelihood function:
> likelihood3vec <- Vectorize(likelihood3,
                              vectorize.args="theta")
> x <- c(0, 5, 9)
> theta <- seq(from=-5, to=15,
               length = 1000)
> plot(theta, likelihood3vec(theta, x=x),
       type="l",
       xlab=expression(theta), ylab=expression(L(theta)),
       main="1 (c)")
[Figure "1 (c)": the likelihood function L(θ) plotted against θ ∈ (−5, 15).]
2. A first-order autoregressive process X0, X1, ..., Xn is specified by the conditional distribution

Xi | Xi−1 = xi−1, ..., X0 = x0 ∼ N(α · xi−1, 1),  i = 1, 2, ..., n,

and some initial distribution for X0. This is a popular model for time series data.
a) Consider the observation X0 = x0 as fixed. Show that the log-likelihood kernel for a realization x1, ..., xn can be written as

l(α) = −(1/2) Σ_{i=1}^n (xi − α xi−1)².

◮ The likelihood is given by

L(α) = f(x1, ..., xn | x0; α)
     = f(xn | xn−1, ..., x0; α) f(xn−1 | xn−2, ..., x0; α) ··· f(x1 | x0; α)
     = ∏_{i=1}^n f(xi | xi−1, ..., x0; α)
     = ∏_{i=1}^n (2π)^(−1/2) exp{−(1/2)(xi − α xi−1)²}
     ∝ ∏_{i=1}^n exp{−(1/2)(xi − α xi−1)²}
     = exp{−(1/2) Σ_{i=1}^n (xi − α xi−1)²}.

The log-likelihood kernel is thus

l(α) = log L(α) = −(1/2) Σ_{i=1}^n (xi − α xi−1)².

b) Derive the score equation for α, compute α̂ML and verify that it is really the maximum of l(α).
◮ The score function for α is

S(α) = dl(α)/dα = −(1/2) Σ_{i=1}^n 2(xi − α xi−1)(−xi−1) = Σ_{i=1}^n (xi xi−1 − α x²_{i−1}) = Σ_{i=1}^n xi xi−1 − α Σ_{i=1}^n x²_{i−1},

so the score equation S(α) = 0 is solved by

α̂ML = Σ_{i=1}^n xi xi−1 / Σ_{i=1}^n x²_{i−1}.

This is really a local maximum of the log-likelihood function, because the latter is (strictly) concave, which can easily be verified from

dS(α)/dα = −Σ_{i=1}^n x²_{i−1} < 0.

The Fisher information

I(α) = −dS(α)/dα = Σ_{i=1}^n x²_{i−1} > 0

is here independent of the parameter and thus equals the observed Fisher information I(α̂ML). Since there are no other local maxima or restrictions of the parameter space (α ∈ R), α̂ML is really the global maximum of l(α).
c) Create a plot of l(α) and compute α̂ML for the following sample:

(x0, ..., x6) = (−0.560, −0.510, 1.304, 0.722, 0.490, 1.960, 1.441).

◮ Note that in the R-code x[1] corresponds to x0 in the text because indexing starts at 1 in R.
> ## implement log-likelihood kernel
> loglik <- function(alpha, # parameter for first-order term
                     x)     # observed data vector
  {
      i <- 2:length(x)
      - 1/2 * sum((x[i] - alpha * x[i-1])^2)
  }
> ## and plot for the given data
> x <- c(-0.560, -0.510, 1.304, 0.722, 0.490, 1.960, 1.441)
> alpha <- seq(-1, 2, length = 100)
> plot(x=alpha,
       y=sapply(alpha, function(alpha) loglik(alpha, x)),
       type = "l",
       xlab = expression(alpha),
       ylab = expression(l(alpha)))
> ## then calculate the MLE and plot it
> i <- seq(along = x)[-1] # again an indexing vector is necessary here
> alphaMl <- sum(x[i] * x[i-1]) / sum(x[i-1]^2)
> alphaMl
[1] 0.6835131
> abline(v = alphaMl, lty = 2)
[Figure: the log-likelihood kernel l(α) plotted against α ∈ (−1, 2), with a vertical line at α̂ML.]
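As a quick additional check (not part of the original solution), note that α̂ML coincides with the least-squares slope of a no-intercept regression of xi on xi−1, so lm() can be used to confirm the value above. The following sketch assumes the same data vector x as in the code above:
> ## additional sketch: the closed-form MLE equals the no-intercept
> ## least-squares slope of x[i] on x[i-1]
> x <- c(-0.560, -0.510, 1.304, 0.722, 0.490, 1.960, 1.441)
> i <- 2:length(x)
> sum(x[i] * x[i-1]) / sum(x[i-1]^2)   # closed-form MLE
> coef(lm(x[i] ~ x[i-1] - 1))          # numerically identical slope from lm()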
3. Show that in Example 2.2 the likelihood function L(N) is maximised at N̂ = ⌊M · n/x⌋, where ⌊x⌋ denotes the largest integer not greater than x. To this end, analyse the monotonic behaviour of the ratio L(N)/L(N − 1). In which cases is the MLE not unique? Give a numeric example.
◮ For N ∈ Θ = {max(n, M + n − x), max(n, M + n − x) + 1, ...}, the likelihood function is

L(N) ∝ (N − M choose n − x) / (N choose n).

The ratio R(N) = L(N)/L(N − 1) is thus

R(N) = {(N − M choose n − x) / (N choose n)} · {(N − 1 choose n) / (N − 1 − M choose n − x)}
     = {(N − M)! n! (N − n)!} / {(n − x)! (N − M − n + x)! N!} · {(N − 1)! (n − x)! (N − 1 − M − n + x)!} / {n! (N − 1 − n)! (N − 1 − M)!}
     = (N − M)(N − n) / {(N − M − n + x) N}.

As the local maximum of L(N), the MLE N̂ML has to satisfy both L(N̂ML) ≥ L(N̂ML − 1) (i.e. R(N̂ML) ≥ 1) and L(N̂ML) ≥ L(N̂ML + 1) (i.e. R(N̂ML + 1) ≤ 1). From the equation above we can see that R(N) ≥ 1 if and only if N ≤ Mn/x. Hence, R(N + 1) ≤ 1 if and only if N + 1 ≥ Mn/x, and, equivalently, if N ≥ Mn/x − 1. It follows that each integer in the interval [Mn/x − 1, Mn/x] is an MLE. If the right endpoint Mn/x is not an integer, the MLE N̂ML = ⌊Mn/x⌋ is unique. However, if Mn/x is an integer, we have two solutions and the MLE is not unique.
For example, if we change the sample size in the numerical example in Figure 2.2 from n = 63 to n = 65, we obtain the highest likelihood for both 26 · 65/5 = 338 and 337.
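A small numerical sketch (not part of the original solution) can confirm the tie in the modified example. It assumes the hypergeometric likelihood of Example 2.2 with M = 26 marked fish, x = 5 marked among n = 65 recaptured, evaluated with dhyper():
> ## sketch: evaluate L(N) over a grid of N and locate the maximum
> M <- 26; x <- 5; n <- 65
> N <- max(n, M + n - x):2000
> lik <- dhyper(x, m = M, n = N - M, k = n)  # L(N) up to a constant
> N[which.max(lik)]                          # 337 or 338 (the two maximisers tie)
> lik[N == 338] / lik[N == 337]              # ratio R(338), equal to 1 up to rounding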
4. Derive the MLE of π for an observation x from a geometric Geom(π) distribution. What is the MLE of π based on a realization x1:n of a random sample from this distribution?
◮ The log-likelihood function is

l(π) = log f(x; π) = log(π) + (x − 1) log(1 − π),

so the score function is

S(π) = d/dπ l(π) = 1/π − (x − 1)/(1 − π).

Solving the score equation S(π) = 0 yields the MLE π̂ML = 1/x. The Fisher information is

I(π) = −d/dπ S(π) = 1/π² + (x − 1)/(1 − π)²,

which is positive for every 0 < π < 1, since x ≥ 1 by definition. Thus, 1/x indeed maximises the likelihood.
For a realisation x1:n of a random sample from this distribution, the quantities calculated above become

l(π) = Σ_{i=1}^n log f(xi; π) = Σ_{i=1}^n {log(π) + (xi − 1) log(1 − π)} = n log(π) + n(x̄ − 1) log(1 − π),
S(π) = d/dπ l(π) = n/π − n(x̄ − 1)/(1 − π),
and I(π) = −d/dπ S(π) = n/π² + n(x̄ − 1)/(1 − π)².

The Fisher information is again positive, thus the solution 1/x̄ of the score equation is the MLE.
5. A sample of 197 animals has been analysed regarding a specific phenotype. The number of animals with phenotypes AB, Ab, aB and ab, respectively, turned out to be

x = (x1, x2, x3, x4)⊤ = (125, 18, 20, 34)⊤.

A genetic model now assumes that the counts are realizations of a multinomially distributed multivariate random variable X ∼ M4(n, π) with n = 197 and probabilities π1 = (2 + φ)/4, π2 = π3 = (1 − φ)/4 and π4 = φ/4 (Rao, 1973, p. 368).
a) What is the parameter space of φ? See Table A.3 in the Appendix for details on the multinomial distribution and the parameter space of π.
◮ The first requirement for the probabilities is satisfied for all φ ∈ R:

Σ_{j=1}^4 πj = (1/4){2 + φ + 2(1 − φ) + φ} = (1/4)(4 + 2φ − 2φ) = 1.

Moreover, each probability πj (j = 1, ..., 4) must lie in the interval (0, 1). We thus have

0 < (2 + φ)/4 < 1 ⟺ 0 < 2 + φ < 4 ⟺ −2 < φ < 2,
0 < (1 − φ)/4 < 1 ⟺ 0 < 1 − φ < 4 ⟺ −3 < φ < 1,   (2.1)
0 < φ/4 < 1 ⟺ 0 < φ < 4.   (2.2)

Hence, (2.2) and (2.1) imply the lower and upper bounds, respectively, for the range 0 < φ < 1. This is the intersection of the sets suitable for the probabilities.
b) Show that the likelihood kernel function for φ, based on the observation x, has the form

L(φ) = (2 + φ)^m1 (1 − φ)^m2 φ^m3

and derive expressions for m1, m2 and m3 depending on x.
◮ We derive the likelihood kernel function based on the probability mass function:

L(φ) = n!/(x1! x2! x3! x4!) ∏_{j=1}^4 πj^xj
     ∝ {(2 + φ)/4}^x1 {(1 − φ)/4}^x2 {(1 − φ)/4}^x3 (φ/4)^x4
     = (1/4)^(x1+x2+x3+x4) (2 + φ)^x1 (1 − φ)^(x2+x3) φ^x4
     ∝ (2 + φ)^m1 (1 − φ)^m2 φ^m3

with m1 = x1, m2 = x2 + x3 and m3 = x4.
c) Derive an explicit formula for the MLE φ̂ML, depending on m1, m2 and m3. Compute the MLE given the data given above.
◮ The log-likelihood kernel is

l(φ) = m1 log(2 + φ) + m2 log(1 − φ) + m3 log(φ),
so the score function is

S(φ) = dl(φ)/dφ = m1/(2 + φ) − m2/(1 − φ) + m3/φ
     = {m1 (1 − φ) φ − m2 (2 + φ) φ + m3 (2 + φ)(1 − φ)} / {(2 + φ)(1 − φ) φ}.

The score equation S(φ) = 0 is satisfied if and only if the numerator in the expression above equals zero, i.e. if

0 = m1 (1 − φ) φ − m2 (2 + φ) φ + m3 (2 + φ)(1 − φ)
  = m1 φ − m1 φ² − 2 m2 φ − m2 φ² + m3 (2 − 2φ + φ − φ²)
  = φ² (−m1 − m2 − m3) + φ (m1 − 2 m2 − m3) + 2 m3.

This is a quadratic equation of the form aφ² + bφ + c = 0, with a = −(m1 + m2 + m3), b = (m1 − 2 m2 − m3) and c = 2 m3, which has two solutions φ0/1 ∈ R given by

φ0/1 = {−b ± √(b² − 4ac)} / (2a).

There is no hope for simplifying this expression much further, so we just implement it in R, and check which of φ0/1 is in the parameter range (0, 1):
> mle.phi <- function(x)
  {
      m <- c(x[1], x[2] + x[3], x[4])
      a <- - sum(m)
      b <- m[1] - 2 * m[2] - m[3]
      c <- 2 * m[3]
      phis <- (- b + c(-1, +1) * sqrt(b^2 - 4 * a * c)) / (2 * a)
      correct.range <- (phis > 0) & (phis < 1)
      return(phis[correct.range])
  }
> x <- c(125, 18, 20, 34)
> (phiHat <- mle.phi(x))
[1] 0.6268215
Note that this example is also used in the famous EM algorithm paper (Dempster et al., 1977, p. 2), producing the same result as we obtained by using the EM algorithm (cf. Table 2.1 in Subsection 2.3.2).
d) What is the MLE of θ = √φ?
◮ From the invariance property of the MLE we have

θ̂ML = √φ̂ML,

which in the example above gives θ̂ML ≈ 0.792:
> (thetaHat <- sqrt(phiHat))
[1] 0.7917206
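As an additional sketch (not part of the original solution), the MLE of φ can also be confirmed by maximising the log-likelihood kernel numerically with optimize(); the interval below is an arbitrary choice slightly inside (0, 1):
> ## sketch: numerical maximisation of the log-likelihood kernel
> x <- c(125, 18, 20, 34)
> m <- c(x[1], x[2] + x[3], x[4])
> loglik.kernel <- function(phi)
      m[1] * log(2 + phi) + m[2] * log(1 - phi) + m[3] * log(phi)
> optimize(loglik.kernel, interval = c(0.001, 0.999), maximum = TRUE)$maximum
> ## close to the closed-form value 0.6268215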
6. Show that h(X) = maxi(Xi) is sufficient for θ in Example 2.18.
◮ From Example 2.18, we know that the likelihood function of θ is

L(θ) = ∏_{i=1}^n f(xi; θ) = 1/θⁿ for θ ≥ maxi(xi), and L(θ) = 0 otherwise.

We also know that L(θ) = f(x1:n; θ), the density of the random sample. The density can thus be rewritten as

f(x1:n; θ) = (1/θⁿ) I[0,θ](maxi(xi)).

Hence, we can apply the Factorisation theorem (Result 2.2) with g1(t; θ) = (1/θⁿ) I[0,θ](t) and g2(x1:n) = 1 to conclude that T = maxi(Xi) is sufficient for θ.
7. a) Let X1:n be a random sample from a distribution with density

f(xi; θ) = exp(iθ − xi) for xi ≥ iθ, and f(xi; θ) = 0 for xi < iθ,

for Xi, i = 1, ..., n. Show that T = mini(Xi/i) is a sufficient statistic for θ.
◮ Since xi ≥ iθ is equivalent to xi/i ≥ θ, we can rewrite the density of the i-th observation as

f(xi; θ) = exp(iθ − xi) I[θ,∞)(xi/i).

The joint density of the random sample then is

f(x1:n; θ) = ∏_{i=1}^n f(xi)
           = exp(θ Σ_{i=1}^n i − n x̄) ∏_{i=1}^n I[θ,∞)(xi/i)
           = exp{θ n(n + 1)/2} I[θ,∞)(mini(xi/i)) · exp(−n x̄),

where the first two factors form g1{h(x1:n) = mini(xi/i); θ} and the last factor is g2(x1:n). The result now follows from the Factorisation Theorem (Result 2.2). The crucial step is that ∏_{i=1}^n I[θ,∞)(xi/i) = I[θ,∞)(mini(xi/i)) = I[θ,∞)(h(x1:n)).
Now we will show minimal sufficiency (required for the next item). Consider the likelihood ratio

Λx1:n(θ1, θ2) = exp{(θ1 − θ2) n(n + 1)/2} I[θ1,∞)(h(x1:n)) / I[θ2,∞)(h(x1:n)).

If Λx1:n(θ1, θ2) = Λx̃1:n(θ1, θ2) for all θ1, θ2 ∈ R for two realisations x1:n and x̃1:n, then necessarily

I[θ1,∞)(h(x1:n)) / I[θ2,∞)(h(x1:n)) = I[θ1,∞)(h(x̃1:n)) / I[θ2,∞)(h(x̃1:n)).   (2.3)

Now assume that h(x1:n) ≠ h(x̃1:n), and, without loss of generality, that h(x1:n) > h(x̃1:n). Then for θ1 = {h(x1:n) + h(x̃1:n)}/2 and θ2 = h(x1:n), we obtain 1 on the left-hand side of (2.3) and 0 on the right-hand side. Hence, h(x1:n) = h(x̃1:n) must be satisfied for the equality to hold for all θ1, θ2 ∈ R, and so the statistic T = h(X1:n) is minimal sufficient.
b) Let X1:n denote a random sample from a distribution with density

f(x; θ) = exp{−(x − θ)},  θ < x < ∞,  −∞ < θ < ∞.

Derive a minimal sufficient statistic for θ.
◮ We have a random sample from the distribution of X1 in (7a), hence we proceed in a similar way. First we rewrite the above density as

f(x; θ) = exp(θ − x) I[θ,∞)(x)

and second we write the joint density as

f(x1:n; θ) = exp(nθ − n x̄) I[θ,∞)(mini(xi)).

By the Factorisation Theorem (Result 2.2), the statistic T = mini(Xi) is sufficient for θ. Its minimal sufficiency can be proved in the same way as in (7a).
8. Let T = h(X1:n) be a sufficient statistic for θ, g(·) a one-to-one function and T̃ = h̃(X1:n) = g{h(X1:n)}. Show that T̃ is sufficient for θ.
◮ By the Factorisation Theorem (Result 2.2), the sufficiency of T = h(X1:n) for θ implies the existence of functions g1 and g2 such that

f(x1:n; θ) = g1{h(x1:n); θ} · g2(x1:n).

If we set g̃1 := g1 ∘ g⁻¹, we can write

f(x1:n; θ) = g1(g⁻¹[g{h(x1:n)}]; θ) · g2(x1:n) = g̃1{h̃(x1:n); θ} · g2(x1:n),

which shows the sufficiency of T̃ = h̃(X1:n) for θ.
9. Let X1 and X2 denote two independent exponentially Exp(λ) distributed random variables with parameter λ > 0. Show that h(X1, X2) = X1 + X2 is sufficient for λ.
◮ The likelihood L(λ) = f(x1:2; λ) is

L(λ) = ∏_{i=1}^2 λ exp(−λ xi) = λ² exp{−λ(x1 + x2)} · 1,

where the first factor is g1{h(x1:2) = x1 + x2; λ} and the second factor, equal to 1, is g2(x1:2); the result follows from the Factorisation Theorem (Result 2.2).

3 Elements of frequentist inference

1. Sketch why the MLE

N̂ML = ⌊M · n / x⌋

in the capture-recapture experiment (cf. Example 2.2) cannot be unbiased. Show that the alternative estimator

N̂ = (M + 1)(n + 1)/(x + 1) − 1

is unbiased if N ≤ M + n.
◮ If N ≥ n + M, then X can equal zero with positive probability. Hence, the MLE

N̂ML = ⌊M · n / X⌋

can be infinite with positive probability. It follows that the expectation of the MLE is infinite if N ≥ M + n and so cannot be equal to the true parameter value N. We have thus shown that for some parameter values, the expectation of the estimator is not equal to the true parameter value. Hence, the MLE is not unbiased.
To show that the alternative estimator is unbiased if N ≤ M + n, we need to compute its expectation. If N ≤ M + n, the smallest value in the range T of the possible values for X is max{0, n − (N − M)} = n − (N − M). The expectation of the statistic g(X) = (M + 1)(n + 1)/(X + 1) can thus be computed as

E{g(X)} = Σ_{x∈T} g(x) Pr(X = x)
        = Σ_{x=n−(N−M)}^{min{n,M}} {(M + 1)(n + 1)/(x + 1)} (M choose x)((N − M) choose (n − x)) / (N choose n)
        = (N + 1) Σ_{x=n−(N−M)}^{min{n,M}} ((M + 1) choose (x + 1))(((N + 1) − (M + 1)) choose (n − x)) / ((N + 1) choose (n + 1)).

We may now shift the index in the sum, so that the summands containing x in the expression above contain x − 1. Of course, we need to change the range of summation accordingly. By doing so, we obtain that

Σ_{x=n−(N−M)}^{min{n,M}} ((M + 1) choose (x + 1))(((N + 1) − (M + 1)) choose (n − x)) / ((N + 1) choose (n + 1))
  = Σ_{x=(n+1)−((N+1)−(M+1))}^{min{n+1,M+1}} ((M + 1) choose x)(((N + 1) − (M + 1)) choose ((n + 1) − x)) / ((N + 1) choose (n + 1)).

Note that the sum above is a sum of probabilities corresponding to a hypergeometric distribution with different parameters, namely HypGeom(n + 1, N + 1, M + 1), i.e. it equals

Σ_{x∈T*} Pr(X* = x) = 1,

where X* is a random variable, X* ∼ HypGeom(n + 1, N + 1, M + 1). It follows that

E(N̂) = E{g(X)} − 1 = N + 1 − 1 = N.

Note however that the alternative estimator is also not unbiased for N > M + n. Moreover, its values are not necessarily integer and thus not necessarily in the parameter space. The latter property can be remedied by rounding. This, however, would lead to the loss of unbiasedness even for N ≤ M + n.
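A quick Monte Carlo sketch (not part of the original solution) illustrates the unbiasedness of N̂ when N ≤ M + n; the parameter values below are arbitrary choices satisfying this condition, and X is simulated with rhyper():
> ## sketch: check E(N.hat) = N by simulation for N <= M + n
> N <- 100; M <- 60; n <- 50                    # here N <= M + n = 110
> x <- rhyper(nn = 100000, m = M, n = N - M, k = n)
> mean((M + 1) * (n + 1) / (x + 1) - 1)         # should be close to N = 100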
2. Let X1:n be a random sample from a distribution with mean µ and variance σ² > 0. Show that

E(X̄) = µ and Var(X̄) = σ²/n.

◮ By linearity of the expectation, we have that

E(X̄) = n⁻¹ Σ_{i=1}^n E(Xi) = n⁻¹ n · µ = µ.

The sample mean X̄ is thus unbiased for the expectation µ.
The variance of a sum of uncorrelated random variables is the sum of the respective variances; hence,

Var(X̄) = n⁻² Σ_{i=1}^n Var(Xi) = n⁻² n · σ² = σ²/n.

3. Let X1:n be a random sample from a normal distribution with mean µ and variance σ² > 0. Show that the estimator

σ̂ = √{(n − 1)/2} · Γ((n − 1)/2)/Γ(n/2) · S

is unbiased for σ, where S is the square root of the sample variance S² in (3.1).
◮ It is well known that for X1, ..., Xn iid ∼ N(µ, σ²),

Y := (n − 1) S² / σ² ∼ χ²(n − 1),

see e.g. Davison (2003, page 75). For the expectation of the statistic g(Y) = √Y we thus obtain that

E{g(Y)} = ∫₀^∞ g(y) fY(y) dy
        = ∫₀^∞ y^(1/2) {(1/2)^((n−1)/2) / Γ((n − 1)/2)} y^((n−1)/2 − 1) exp(−y/2) dy
        = (1/2)^(−1/2) {Γ(n/2) / Γ((n − 1)/2)} ∫₀^∞ {(1/2)^(n/2) / Γ(n/2)} y^(n/2 − 1) exp(−y/2) dy.

The integral on the most-right-hand side is the integral of the density of the χ²(n) distribution over its support and therefore equals one. It follows that

E(√Y) = √2 · Γ(n/2) / Γ((n − 1)/2),

and, since S = σ √{Y/(n − 1)},

E(σ̂) = E{σ (√Y/√2) · Γ((n − 1)/2)/Γ(n/2)} = σ.
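A short Monte Carlo sketch (not part of the original solution) can illustrate the unbiasedness of σ̂; n and σ below are arbitrary choices:
> ## sketch: simulate sigma.hat many times and compare its mean with sigma
> n <- 5; sigma <- 2
> sigma.hat <- replicate(100000, {
      s <- sd(rnorm(n, mean = 0, sd = sigma))
      sqrt((n - 1) / 2) * gamma((n - 1) / 2) / gamma(n / 2) * s
  })
> mean(sigma.hat)   # should be close to sigma = 2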
4. Show that the sample variance S² can be written as

S² = 1/{2n(n − 1)} Σ_{i,j=1}^n (Xi − Xj)².

Use this representation to show that

Var(S²) = (1/n) {c4 − (n − 3)/(n − 1) σ⁴},

where c4 = E[{X − E(X)}⁴] is the fourth central moment of X.
◮ We start with showing that the estimator S² can be rewritten as T := 1/{2n(n − 1)} Σ_{i,j=1}^n (Xi − Xj)²:

(n − 1) T = 1/(2n) · Σ_{i,j=1}^n (Xi² − 2 Xi Xj + Xj²)
          = 1/(2n) · (n Σ_{i=1}^n Xi² − 2 Σ_{i=1}^n Σ_{j=1}^n Xi Xj + n Σ_{j=1}^n Xj²)
          = Σ_{i=1}^n Xi² − n X̄² = (n − 1) S².

It follows that we can compute the variance of S² from the pairwise correlations between the terms (Xi − Xj)², i, j = 1, ..., n, as

Var(S²) = {2n(n − 1)}⁻² Var{Σ_{i,j} (Xi − Xj)²} = {2n(n − 1)}⁻² Σ_{i,j,k,l} Cov{(Xi − Xj)², (Xk − Xl)²}.   (3.1)

Depending on the combination of indices, the covariances in the sum above take one of the three following values:
– Cov{(Xi − Xj)², (Xk − Xl)²} = 0 if i = j and/or k = l (in this case either the first or the second term is identically zero) or if i, j, k, l are all different (in this case the result follows from the independence between the different Xi).
– For i ≠ j, Cov{(Xi − Xj)², (Xi − Xj)²} = 2µ4 + 2σ⁴. To show this, we proceed in two steps. We denote µ := E(X1), and, using the independence of Xi and Xj, we obtain that

E{(Xi − Xj)²} = E{(Xi − µ)²} + E{(Xj − µ)²} − 2 E{(Xi − µ)(Xj − µ)} = 2σ² − 2{E(Xi) − µ}{E(Xj) − µ} = 2σ².

In an analogous way, we can show that

E{(Xi − Xj)⁴} = E{(Xi − µ + µ − Xj)⁴}
             = E{(Xi − µ)⁴} − 4 E{(Xi − µ)³(Xj − µ)} + 6 E{(Xi − µ)²(Xj − µ)²} − 4 E{(Xi − µ)(Xj − µ)³} + E{(Xj − µ)⁴}
             = µ4 − 4 · 0 + 6 · (σ²)² − 4 · 0 + µ4 = 2µ4 + 6σ⁴.

It follows that

Cov{(Xi − Xj)², (Xi − Xj)²} = Var{(Xi − Xj)²} = E{(Xi − Xj)⁴} − [E{(Xi − Xj)²}]² = 2µ4 + 6σ⁴ − (2σ²)² = 2µ4 + 2σ⁴.

Note that since (Xi − Xj)² = (Xj − Xi)², there are 2 · n(n − 1) such terms in the sum (3.1).
– In an analogous way, we may show that if i, j, k are all different, Cov{(Xi − Xj)², (Xk − Xj)²} = µ4 − σ⁴. We can form n(n − 1)(n − 2) different triplets (i, j, k) of elements of {1, ..., n} that are all different. For each of these triplets, there are four different terms in the sum (3.1): Cov{(Xi − Xj)², (Xk − Xj)²}, Cov{(Xi − Xj)², (Xj − Xk)²}, Cov{(Xj − Xi)², (Xj − Xk)²}, and Cov{(Xj − Xi)², (Xk − Xj)²}, each with the same value. In total, we thus have 4 · n(n − 1)(n − 2) terms in (3.1) with the value of µ4 − σ⁴.
By combining these intermediate computations, we finally obtain that

Var(S²) = 1/{2n(n − 1)}² · {2n(n − 1)(2µ4 + 2σ⁴) + 4n(n − 1)(n − 2)(µ4 − σ⁴)}
        = 1/{n(n − 1)} · {µ4 + σ⁴ + (n − 2)(µ4 − σ⁴)}
        = (1/n) {µ4 − (n − 3)/(n − 1) σ⁴}.

5. Show that the confidence interval defined in Example 3.6 indeed has coverage probability 50% for all values θ ∈ Θ.
◮ To prove the statement, we need to show that Pr{min(X1, X2) ≤ θ ≤ max(X1, X2)} = 0.5 for all θ ∈ Θ. This follows by the simple calculation:

Pr{min(X1, X2) ≤ θ ≤ max(X1, X2)} = Pr(X1 ≤ θ ≤ X2) + Pr(X2 ≤ θ ≤ X1)
  = Pr(X1 ≤ θ) Pr(X2 ≥ θ) + Pr(X2 ≤ θ) Pr(X1 ≥ θ)
  = 0.5 · 0.5 + 0.5 · 0.5 = 0.5.
6. Consider a random sample X1:n from the uniform model U(0, θ), cf. Example 2.18. Let Y = max(X1, ..., Xn) denote the maximum of the random sample X1:n. Show that the confidence interval for θ with limits

Y and (1 − γ)^(−1/n) Y

has coverage γ.
◮ Recall that the density function of the uniform distribution U(0, θ) is f(x) = (1/θ) I[0,θ)(x). The corresponding distribution function is

F(x) = ∫_{−∞}^x f(u) du = 0 for x ≤ 0, = ∫₀^x (1/θ) du = x/θ for 0 ≤ x ≤ θ, and = 1 for x ≥ θ.

To prove the coverage of the confidence interval with limits Y and (1 − γ)^(−1/n) Y, we need to show that Pr{Y ≤ θ ≤ (1 − γ)^(−1/n) Y} = γ for all θ ∈ Θ. We first derive the distribution of the random variable Y. For its distribution function FY, we obtain that

FY(y) = Pr(Y ≤ y) = Pr{max(X1, ..., Xn) ≤ y} = Pr(X1 ≤ y, ..., Xn ≤ y) = {F(y)}ⁿ.

For its density fY(y), it follows that

fY(y) = dFY(y)/dy = n F(y)^(n−1) f(y) = (n/θⁿ) y^(n−1) for 0 ≤ y ≤ θ, and 0 otherwise.

The coverage of the confidence interval can now be calculated as

Pr{Y ≤ θ ≤ (1 − γ)^(−1/n) Y} = Pr{θ (1 − γ)^(1/n) ≤ Y ≤ θ}
  = ∫_{θ(1−γ)^(1/n)}^θ fY(y) dy
  = (n/θⁿ) ∫_{θ(1−γ)^(1/n)}^θ y^(n−1) dy
  = (n/θⁿ) [θⁿ/n − {θ(1 − γ)^(1/n)}ⁿ/n]
  = 1 − (1 − γ) = γ.
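A small simulation sketch (not part of the original solution) can illustrate this coverage result; n, θ and γ below are arbitrary choices:
> ## sketch: empirical coverage of the interval [Y, (1-gamma)^(-1/n) Y]
> n <- 10; theta <- 3; gamma <- 0.95
> covered <- replicate(100000, {
      y <- max(runif(n, min = 0, max = theta))
      (y <= theta) & (theta <= (1 - gamma)^(-1/n) * y)
  })
> mean(covered)   # should be close to gamma = 0.95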
7. Consider a population with mean µ and variance σ². Let X1, ..., X5 be independent draws from this population. Consider the following estimators for µ:

T1 = (1/5)(X1 + X2 + X3 + X4 + X5),
T2 = (1/3)(X1 + X2 + X3),
T3 = (1/8)(X1 + X2 + X3 + X4) + (1/2) X5,
T4 = X1 + X2,
and T5 = X1.

a) Which estimators are unbiased for µ?
◮ The estimators T1, T2, and T5 are sample means of sizes 5, 3, and 1, respectively, and as such are unbiased for µ, cf. Exercise 2. Further, T3 is also unbiased, as

E(T3) = (1/8) · 4µ + (1/2) µ = µ.

On the contrary, T4 is not unbiased, as

E(T4) = E(X1) + E(X2) = 2µ.

Note, however, that T4 is unbiased for 2µ.
b) Compute the MSE of each estimator.
◮ Since the bias of an unbiased estimator is zero, the MSE of an unbiased estimator is equal to its variance, cf. (3.5). For the sample means T1, T2, and T5, we therefore directly have that

MSE(T1) = σ²/5, MSE(T2) = σ²/3 and MSE(T5) = σ²,

cf. Exercise 2. For the MSE of T3, we have that

MSE(T3) = Var(T3) = (1/8²) · 4σ² + (1/2²) σ² = (5/16) σ².

Finally, for the MSE of T4, we have that

MSE(T4) = {E(T4) − µ}² + Var(T4) = µ² + 2σ².

8. The distribution of a multivariate random variable X belongs to an exponential family of order p, if the logarithm of its probability mass or density function can be written as

log{f(x; τ)} = Σ_{i=1}^p ηi(τ) Ti(x) − B(τ) + c(x).   (3.2)

Here τ is the p-dimensional parameter vector and Ti, ηi, B and c are real-valued functions. It is assumed that the set {1, η1(τ), ..., ηp(τ)} is linearly independent. Then we define the canonical parameters θ1 = η1(τ1), ..., θp = ηp(τp). With θ = (θ1, ..., θp)⊤ and T(x) = (T1(x), ..., Tp(x))⊤ we can write the log density in canonical form:

log{f(x; θ)} = θ⊤ T(x) − A(θ) + c(x).   (3.3)

Exponential families are interesting because most of the commonly used distributions, such as the Poisson, geometric, binomial, normal and gamma distribution, are exponential families. Therefore it is worthwhile to derive general results for exponential families, which can then be applied to many distributions at once. For example, two very useful results for the exponential family of order one in canonical form are E{T(X)} = dA/dθ(θ) and Var{T(X)} = d²A/dθ²(θ).
a) Show that T(X) is minimal sufficient for θ.
◮ Consider two realisations x and y with corresponding likelihood ratios Λx(θ1, θ2) and Λy(θ1, θ2) being equal, which on the log scale gives the equation

log f(x; θ1) − log f(x; θ2) = log f(y; θ1) − log f(y; θ2).

Plugging in (3.3) we can simplify it to

θ1⊤ T(x) − A(θ1) − θ2⊤ T(x) + A(θ2) = θ1⊤ T(y) − A(θ1) − θ2⊤ T(y) + A(θ2)
(θ1 − θ2)⊤ T(x) = (θ1 − θ2)⊤ T(y).

If T(x) = T(y), then this equation holds for all possible values of θ1, θ2. Therefore T(x) is sufficient for θ. On the other hand, if this equation holds for all θ1, θ2, then T(x) must equal T(y). Therefore T(x) is also minimal sufficient for θ.
b) Show that the density of the Poisson distribution Po(λ) can be written in the forms (3.2) and (3.3), respectively. Thus derive the expectation and variance of X ∼ Po(λ).
◮ For the density of a random variable X ∼ Po(λ), we have that

log f(x; λ) = log{(λ^x / x!) exp(−λ)} = log(λ) x − λ − log(x!),

so p = 1, θ = η(λ) = log(λ), T(x) = x, B(λ) = λ and c(x) = −log(x!). For the canonical representation, we have A(θ) = B{η⁻¹(θ)} = B{exp(θ)} = exp(θ). Hence, both the expectation E{T(X)} = dA/dθ(θ) and the variance Var{T(X)} = d²A/dθ²(θ) of X are exp(θ) = λ.
c) Show that the density of the normal distribution N(µ, σ²) can be written in the forms (3.2) and (3.3), respectively, where τ = (µ, σ²)⊤. Hence derive a minimal sufficient statistic for τ.
◮ For X ∼ N(µ, σ²), we have τ = (µ, σ²)⊤. We can rewrite the log density as

log f(x; µ, σ²) = −(1/2) log(2πσ²) − (1/2)(x − µ)²/σ²
               = −(1/2) log(2π) − (1/2) log(σ²) − (1/2)(x² − 2xµ + µ²)/σ²
               = −{1/(2σ²)} x² + (µ/σ²) x − µ²/(2σ²) − (1/2) log(σ²) − (1/2) log(2π)
               = η1(τ) T1(x) + η2(τ) T2(x) − B(τ) + c(x),

where

θ1 = η1(τ) = −1/(2σ²),  T1(x) = x²,
θ2 = η2(τ) = µ/σ²,      T2(x) = x,
B(τ) = µ²/(2σ²) + (1/2) log(σ²),
and c(x) = −(1/2) log(2π).

We can invert the canonical parametrisation θ = η(τ) = (η1(τ), η2(τ))⊤ by

σ² = −1/(2θ1),  µ = θ2 σ² = −θ2/(2θ1),

so for the canonical form we have the function

A(θ) = B{η⁻¹(θ)} = µ²/(2σ²) + (1/2) log(σ²) = −θ2²/(4θ1) − (1/2) log(−2θ1),

which is well defined since θ1 < 0. Finally, from above, we know that T(x) = (x², x)⊤ is minimal sufficient for τ.
d) Show that for an exponential family of order one, I(τ̂ML) = J(τ̂ML). Verify this result for the Poisson distribution.
◮ Let X be a random variable with density from the exponential family of order one. By taking the derivative of the log-likelihood, we obtain the score function

S(τ) = {dη(τ)/dτ} T(x) − dB(τ)/dτ,
so that the MLE τ̂ML satisfies the equation

T(x) = {dB(τ̂ML)/dτ} / {dη(τ̂ML)/dτ}.

We obtain the observed Fisher information from the Fisher information

I(τ) = d²B(τ)/dτ² − {d²η(τ)/dτ²} T(x)

by plugging in the MLE:

I(τ̂ML) = d²B(τ̂ML)/dτ² − {d²η(τ̂ML)/dτ²} · {dB(τ̂ML)/dτ} / {dη(τ̂ML)/dτ}.

Further, we have that

E{T(X)} = d/dθ (B ∘ η⁻¹)(θ) = {dB(τ)/dτ} · {dη⁻¹(θ)/dθ} = {dB(τ)/dτ} / {dη(τ)/dτ},

where θ = η(τ) is the canonical parameter. Hence

J(τ) = d²B(τ)/dτ² − {d²η(τ)/dτ²} · {dB(τ)/dτ} / {dη(τ)/dτ}

follows. If we now plug in τ̂ML, we obtain the same formula as for I(τ̂ML).
For the Poisson example, we have I(λ) = x/λ² and J(λ) = 1/λ. Plugging in the MLE λ̂ML = x leads to I(λ̂ML) = J(λ̂ML) = 1/x.
e) Show that for an exponential family of order one in canonical form, I(θ) = J(θ). Verify this result for the Poisson distribution.
◮ In the canonical parametrisation (3.3),

S(θ) = T(x) − dA(θ)/dθ and I(θ) = d²A(θ)/dθ²,

where the latter is independent of the observation x, and therefore obviously I(θ) = J(θ).
For the Poisson example, the canonical parameter is θ = log(λ). Since A(θ) = exp(θ), also the second derivative equals exp(θ) = λ, so I(θ) = J(θ).
f) Suppose X1:n is a random sample from a one-parameter exponential family with canonical parameter θ. Derive an expression for the log-likelihood l(θ).
◮ Using the canonical parametrisation of the density, we can write the log-likelihood of a single observation as follows:

log{f(x; θ)} = θ T(x) − A(θ) + c(x).

The log-likelihood l(θ) of the random sample X1:n is thus

l(θ) = Σ_{i=1}^n log{f(xi; θ)} = Σ_{i=1}^n {θ T(xi) − A(θ) + c(xi)} ∝ θ Σ_{i=1}^n T(xi) − n A(θ).

9. Assume that survival times X1:n form a random sample from a gamma distribution G(α, α/µ) with mean E(Xi) = µ and shape parameter α.
a) Show that X̄ = n⁻¹ Σ_{i=1}^n Xi is a consistent estimator of the mean survival time µ.
◮ The sample mean X̄ is unbiased for µ and has variance Var(X̄) = Var(Xi)/n = µ²/(nα), cf. Exercise 2 and Appendix A.5.2. It follows that its mean squared error MSE = µ²/(nα) goes to zero as n → ∞. Thus, the estimator is consistent in mean square and hence also consistent.
Note that this holds for all random samples where the individual random variables have finite expectation and variance.
b) Show that Xi/µ ∼ G(α, α).
◮ From Appendix A.5.2, we know that by multiplying a random variable with G(α, α/µ) distribution by µ⁻¹, we obtain a random variable with G(α, α) distribution.
c) Define the approximate pivot from Result 3.1,

Z = (X̄ − µ)/(S/√n),

where S² = (n − 1)⁻¹ Σ_{i=1}^n (Xi − X̄)². Using the result from above, show that the distribution of Z does not depend on µ.
◮ We can rewrite Z as follows:

Z = (X̄ − µ) / √[{1/(n(n − 1))} Σ_{i=1}^n (Xi − X̄)²]
  = (X̄/µ − 1) / √[{1/(n(n − 1))} Σ_{i=1}^n (Xi/µ − X̄/µ)²]
  = (Ȳ − 1) / √[{1/(n(n − 1))} Σ_{i=1}^n (Yi − Ȳ)²],

where Yi = Xi/µ and Ȳ = n⁻¹ Σ_{i=1}^n Yi = X̄/µ. From above, we know that Yi ∼ G(α, α), so its distribution depends only on α and not on µ. Therefore, Z is a function of random variables whose distributions do not depend on µ. It follows that the distribution of Z does not depend on µ either.
d) For n = 10 and α ∈ {1, 2, 5, 10}, simulate 100 000 samples from Z, and compare the resulting 2.5% and 97.5% quantiles with those from the asymptotic standard normal distribution. Is Z a good approximate pivot?
◮
> ## simulate one realisation
> z.sim <- function(n, alpha)
  {
      y <- rgamma(n=n, alpha, alpha)
      yq <- mean(y)
      sy <- sd(y)
      z <- (yq - 1) / (sy / sqrt(n))
      return(z)
  }
> ## fix cases:
> n <- 10
> alphas <- c(1, 2, 5, 10)
> ## space for quantile results
> quants <- matrix(nrow=length(alphas),
                   ncol=2)
> ## set up graphics space
> par(mfrow=c(2, 2))
> ## treat every case
> for(i in seq_along(alphas))
  {
      ## draw 100000 samples
      Z <- replicate(n=100000, expr=z.sim(n=n, alpha=alphas[i]))
      ## plot histogram
      hist(Z,
           prob=TRUE,
           col="gray",
           main=paste("n=", n, " and alpha=", alphas[i], sep=""),
           nclass=50,
           xlim=c(-4, 4),
           ylim=c(0, 0.45))
      ## compare with N(0, 1) density
      curve(dnorm(x),
            from=min(Z),
            to=max(Z),
            n=201,
            add=TRUE,
            col="red")
      ## save empirical quantiles
      quants[i, ] <- quantile(Z, prob=c(0.025, 0.975))
  }
> ## so the quantiles were:
> quants
          [,1]     [,2]
[1,] -4.095855 1.623285
[2,] -3.326579 1.741014
[3,] -2.841559 1.896299
[4,] -2.657800 2.000258
> ## compare with standard normal ones:
> qnorm(p=c(0.025, 0.975))
[1] -1.959964 1.959964
[Figure: four histograms of Z with the N(0, 1) density overlaid, panels "n=10 and alpha=1", "n=10 and alpha=2", "n=10 and alpha=5" and "n=10 and alpha=10".]
We see that the distribution of Z is skewed to the left compared to the standard normal distribution: the 2.5% quantiles are clearly lower than −1.96, and also the 97.5% quantiles are slightly lower than 1.96. For increasing α (and also for increasing n of course), the normal approximation becomes better. Altogether, the normal approximation does not appear too bad, given the fact that n = 10 is a rather small sample size.
e) Show that X̄/µ ∼ G(nα, nα). If α was known, how could you use this quantity to derive a confidence interval for µ?
◮ We know from above that the summands Xi/µ in X̄/µ are independent and have G(α, α) distribution. From Appendix A.5.2, we obtain that Σ_{i=1}^n Xi/µ ∼ G(nα, α). From the same appendix, we also have that by multiplying the sum by n⁻¹ we obtain the G(nα, nα) distribution.
If α was known, then X̄/µ would be a pivot for µ and we could derive a 95% confidence interval as follows:

0.95 = Pr{q0.025(nα) ≤ X̄/µ ≤ q0.975(nα)}
     = Pr{1/q0.975(nα) ≤ µ/X̄ ≤ 1/q0.025(nα)}
     = Pr{X̄/q0.975(nα) ≤ µ ≤ X̄/q0.025(nα)},

where qγ(β) denotes the γ quantile of G(β, β). So the confidence interval would be

[X̄/q0.975(nα), X̄/q0.025(nα)].   (3.4)
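As an additional sketch (not part of the original solution), the interval (3.4) can be computed in R with qgamma(); the values of n, α and µ below are arbitrary choices used only to generate illustrative data:
> ## sketch: the exact interval (3.4) for known alpha
> n <- 10; alpha <- 2; mu <- 5
> x <- rgamma(n, shape = alpha, rate = alpha / mu)
> xbar <- mean(x)
> c(xbar / qgamma(0.975, shape = n * alpha, rate = n * alpha),
    xbar / qgamma(0.025, shape = n * alpha, rate = n * alpha))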
f) Suppose α is unknown, how could you derive a confidence interval for µ?
◮ If α is unknown, we could estimate it and then use the confidence interval from (3.4). Of course we could also use the approximate standard normal distribution of Z and derive the standard Wald interval

[X̄ ± 1.96 · S/√n]   (3.5)

from that. A third possibility would be to simulate from the exact distribution of Z as we have done above, using the estimated α value, and use the empirical quantiles from the simulation instead of the ±1.96 values from the standard normal distribution to construct a confidence interval analogous to (3.5).
10. All beds in a hospital are numbered consecutively from 1 to N > 1. In one room a doctor sees n ≤ N beds, which are a random subset of all beds, with (ordered) numbers X1 < ··· < Xn. The doctor now wants to estimate the total number of beds N in the hospital.
a) Show that the joint probability mass function of X = (X1, ..., Xn) is

f(x; N) = (N choose n)⁻¹ I{n,...,N}(xn).

◮ There are (N choose n) possibilities to draw n values without replacement out of N values. Hence, the probability of one outcome x = (x1, ..., xn) is the inverse, (N choose n)⁻¹. Due to the nature of the problem, the highest number xn cannot be larger than N, nor can it be smaller than n. Altogether, we thus have

f(x; N) = Pr(X1 = x1, ..., Xn = xn; N) = (N choose n)⁻¹ I{n,...,N}(xn).

b) Show that Xn is minimal sufficient for N.
◮ We can factorise the probability mass function as follows:

f(x; N) = {N!/((N − n)! n!)}⁻¹ I{n,...,N}(xn) = n! · {(N − n)!/N!} I{n,...,N}(xn),

where g2(x) = n! and g1{h(x) = xn; N} = {(N − n)!/N!} I{n,...,N}(xn), so from the Factorization Theorem (Result 2.2), we have that Xn is sufficient for N. In order to show the minimal sufficiency, consider two data sets x and y such that for every two parameter values N1 and N2, the likelihood ratios are identical, i.e.

Λx(N1, N2) = Λy(N1, N2).

This can be rewritten as

I{n,...,N1}(xn) / I{n,...,N2}(xn) = I{n,...,N1}(yn) / I{n,...,N2}(yn).   (3.6)

Now assume that xn ≠ yn. Without loss of generality, let xn < yn. Then we can choose N1 = xn and N2 = yn, and equation (3.6) gives 1 on the left-hand side and 0 on the right-hand side, which is not true. Hence, xn = yn must be fulfilled. It follows that Xn is minimal sufficient for N.
c) Confirm that the probability mass function of Xn is

fXn(xn; N) = (xn − 1 choose n − 1) / (N choose n) · I{n,...,N}(xn).

◮ For a fixed value Xn = xn of the maximum, there are (xn − 1 choose n − 1) possibilities how to choose the first n − 1 values. Hence, out of the (N choose n) equally likely draws, (xn − 1 choose n − 1) give a maximum equal to xn. Considering also the possible range for xn, this leads to the probability mass function

fXn(xn; N) = (xn − 1 choose n − 1) / (N choose n) · I{n,...,N}(xn).

d) Show that

N̂ = {(n + 1)/n} Xn − 1

is an unbiased estimator of N.
◮ For the expectation of Xn, we have

E(Xn) = Σ_{x=n}^N x · (x − 1 choose n − 1) / (N choose n)
      = (N choose n)⁻¹ Σ_{x=n}^N n · (x choose n)
      = (N choose n)⁻¹ n Σ_{x=n}^N (x + 1 − 1 choose n + 1 − 1)
      = (N choose n)⁻¹ n Σ_{x=n+1}^{N+1} (x − 1 choose (n + 1) − 1).

Since Σ_{x=n}^N (x − 1 choose n − 1) = (N choose n), we have

E(Xn) = (N choose n)⁻¹ n (N + 1 choose n + 1)
      = n · {n!(N − n)!/N!} · {(N + 1)!/((n + 1)!(N − n)!)}
      = {n/(n + 1)} (N + 1).
Altogether thus

E(N̂) = {(n + 1)/n} E(Xn) − 1 = {(n + 1)/n} · {n/(n + 1)} (N + 1) − 1 = N.

So N̂ is unbiased for N.
e) Study the ratio L(N + 1)/L(N) and derive the ML estimator of N. Compare it with N̂.
◮ The likelihood ratio of N ≥ xn relative to N + 1 with respect to x is

f(x; N + 1)/f(x; N) = (N choose n) / (N + 1 choose n) = (N + 1 − n)/(N + 1) < 1,

so N must be as small as possible to maximise the likelihood, i.e. N̂ML = Xn. From above, we have E(Xn) = {n/(n + 1)}(N + 1), so the bias of the MLE is

E(Xn) − N = {n/(n + 1)}(N + 1) − N = {n(N + 1) − (n + 1)N}/(n + 1) = (nN + n − nN − N)/(n + 1) = (n − N)/(n + 1) < 0.

This means that N̂ML systematically underestimates N, in contrast to N̂.

4 Frequentist properties of the likelihood

1. Compute an approximate 95% confidence interval for the true correlation ρ based on the MLE r = 0.7, a sample of size of n = 20 and Fisher's z-transformation.
◮ Using Example 4.16, we obtain the transformed correlation as

z = tanh⁻¹(0.7) = 0.5 log{(1 + 0.7)/(1 − 0.7)} = 0.867.

Using the more accurate approximation 1/(n − 3) for the variance of ζ = tanh⁻¹(ρ), we obtain the standard error 1/√(n − 3) = 0.243. The 95% Wald confidence interval for ζ is thus

[z ± 1.96 · se(z)] = [0.392, 1.343].

By back-transforming using the inverse Fisher's z-transformation, we obtain the following confidence interval for ρ:

[tanh(0.392), tanh(1.343)] = [0.373, 0.872].
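The interval above can be reproduced directly in R with atanh() and tanh(); this short sketch is not part of the original solution:
> ## sketch: Fisher's z confidence interval for rho
> r <- 0.7; n <- 20
> z <- atanh(r)                                    # Fisher's z-transformation
> ci.z <- z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
> tanh(ci.z)                                       # approximately (0.373, 0.872)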
2. Derive a general formula for the score confidence interval in the Poisson model based on the Fisher information, cf. Example 4.9.
◮ We consider a random sample X1:n from the Poisson distribution Po(ei λ) with known offsets ei > 0 and unknown rate parameter λ. As in Example 4.8, we can see that if we base the score statistic for testing the null hypothesis that the true rate parameter equals λ on the Fisher information I(λ; x1:n), we obtain

T2(λ; x1:n) = S(λ; x1:n)/√I(λ; x1:n) = √n · (x̄ − ēλ)/√x̄.

We now determine the values of λ for which the score test based on the asymptotic distribution of T2 would not reject the null hypothesis at level α. These are the values for which we have |T2(λ; x1:n)| ≤ q := z1−α/2:

√n · |x̄ − ēλ|/√x̄ ≤ q
|x̄/ē − λ| ≤ (q/ē) √(x̄/n)
λ ∈ [x̄/ē ± (q/ē) √(x̄/n)].

Note that this score confidence interval is symmetric around the MLE x̄/ē, unlike the one based on the expected Fisher information derived in Example 4.9.
3. A study is conducted to quantify the evidence against the null hypothesis that less than 80 percent of the Swiss population have antibodies against the human herpesvirus. Among a total of 117 persons investigated, 105 had antibodies.
a) Formulate an appropriate statistical model and the null and alternative hypotheses. Which sort of P-value should be used to quantify the evidence against the null hypothesis?
◮ The researchers are interested in the frequency of herpesvirus antibodies occurrence in the Swiss population, which is very large compared to the n = 117 probands. Therefore, the binomial model, actually assuming an infinite population, is appropriate. Among the total of n = 117 draws, x = 105 "successes" were obtained, and the proportion π of these successes in the theoretically infinite population is of interest. We can therefore suppose that the observed value x = 105 is a realisation of a random variable X ∼ Bin(n, π).
The null hypothesis is H0: π < 0.8, while the alternative hypothesis is H1: π ≥ 0.8. Since this is a one-sided testing situation, we will need the corresponding one-sided P-value to quantify the evidence against the null hypothesis.
b) Use the Wald statistic (4.12) and its approximate normal distribution to obtain a P-value.
◮ The Wald statistic is z(π) = √I(π̂ML) (π̂ML − π). As in Example 4.10, we have π̂ML = x/n. Further, I(π) = x/π² + (n − x)/(1 − π)², and so

I(π̂ML) = x/(x/n)² + (n − x)/(1 − x/n)² = n²/x + n²(n − x)/(n − x)² = n²/x + n²/(n − x) = n³/{x(n − x)}.

We thus have

z(π) = √[n³/{x(n − x)}] (x/n − π) = √n (x − nπ)/√{x(n − x)}.   (4.1)

To obtain an approximate one-sided P-value, we calculate its realisation z(0.8) and compare it with the approximate standard normal distribution of the statistic under the null hypothesis that the true proportion is 0.8. Since a more extreme result in the direction of the alternative H1 corresponds to a larger realisation x and hence a larger observed value of the statistic, the approximate one-sided P-value is the probability that a standard normal random variable is greater than z(0.8):
> ## general settings
> x <- 105
> n <- 117
> pi0 <- 0.8
> ## the first approximate pivot
> z.pi <- function(x, n, pi)
  {
      sqrt(n) * (x - n * pi) / sqrt( (x * (n - x)) )
  }
> z1 <- z.pi(x, n, pi0)
> (p1 <- pnorm(z1, lower.tail=FALSE))
[1] 0.0002565128

Pr{Z(0.8) > z(0.8)} ≈ 1 − Φ{z(0.8)} = 1 − Φ(3.47) ≈ 0.00026.

c) Use the logit-transformation (compare Example 4.22) and the corresponding Wald statistic to obtain a P-value.
◮ We can equivalently formulate the testing problem as H0: φ < φ0 = logit(0.8) versus H1: φ > φ0 after parametrising the binomial model with φ = logit(π) instead of π. Like in Example 4.22, we obtain the test statistic

Zφ(φ) = [log{X/(n − X)} − φ] / √{1/X + 1/(n − X)},   (4.2)

which, by the delta method, is asymptotically normally distributed. To compute the corresponding P-value, we may proceed as follows:
> ## the second approximate pivot
> z.phi <- function(x, n, phi)
  {
      (log(x / (n - x)) - phi) / sqrt(1/x + 1/(n-x))
  }
> (phi0 <- qlogis(pi0))
[1] 1.386294
> z2 <- z.phi(x, n, phi0)
> (p2 <- pnorm(z2, lower.tail=FALSE))
[1] 0.005103411

Pr[Zφ{logit(0.8)} > zφ{logit(0.8)}] ≈ 1 − Φ{zφ(1.3863)} = 1 − Φ(2.57) ≈ 0.0051.
d) Use the score statistic (4.2) to obtain a P-value. Why do we not need to consider parameter transformations when using this statistic?
◮ By Result 4.5, the score statistic V(π) = S(π; X1:n)/√J1:n(π) asymptotically follows the standard normal distribution under the Fisher regularity assumptions. In our case, we may use that a binomial random variable X ∼ Bin(n, π) can be viewed as the sum of n independent random variables with Bernoulli distribution B(π), so the asymptotic results apply to the score statistic corresponding to X as n → ∞. The score function corresponding to the binomial variable is S(π; X) = X/π − (n − X)/(1 − π) and the expected Fisher information is J(π) = n/{π(1 − π)}, cf. Example 4.10. To calculate a third approximate P-value, we may therefore proceed as follows:
> ## and the third
> v.pi <- function(x, n, pi)
  {
      (x/pi - (n - x)/(1 - pi)) / sqrt(n/pi/(1-pi))
  }
> v <- v.pi(x, n, pi0)
> (p3 <- pnorm(v, lower.tail=FALSE))
[1] 0.004209022

Pr{V(0.8) > v(0.8)} ≈ 1 − Φ(2.63) ≈ 0.00421.

We do not need to consider parameter transformations when using the score statistic, because it is invariant to one-to-one transformations. That is, the test statistic does not change if we choose a different parametrisation. This is easily seen from Result 4.3 and is written out in Section 4.1 in the context of the corresponding confidence intervals.
e) Use the exact null distribution from your model to obtain a P-value. What are the advantages and disadvantages of this procedure in general?
◮ Of course we can also look at the binomial random variable X itself and consider it as a test statistic. Then the one-sided P-value is
> ## now the "exact" p-value
> (p4 <- pbinom(x-1, size=n, prob=pi0, lower.tail=FALSE))
[1] 0.003645007
> ## note the x-1 to get
> ## P(X >= x) = P(X > x-1)

Pr(X ≥ x; π0) = Σ_{w=x}^n (n choose w) π0^w (1 − π0)^(n−w) ≈ 0.00365.

The advantage of this procedure is that it does not require a large sample size n for a good fit of the approximate distribution (normal distribution in the above cases) and a correspondingly good P-value. However, the computation of the P-value is difficult without a computer (as opposed to the easy use of standard normal tables for the other statistics). Also, only a finite number of P-values can be obtained, which corresponds to the discreteness of X.
Note that the z-statistic on the φ-scale and the score statistic produce P-values which are closer to the exact P-value than that from the z-statistic on the π-scale. This is due to the bad quadratic approximation of the likelihood on the π-scale.
4. Suppose X1:n is a random sample from an Exp(λ) distribution.
a) Derive the score function of λ and solve the score equation to get λ̂ML.
◮ From the log-likelihood

l(λ) = Σ_{i=1}^n {log(λ) − λ xi} = n log(λ) − n λ x̄

we get the score function

S(λ; x) = n/λ − n x̄,

which has the root

λ̂ML = 1/x̄.

Since the Fisher information

I(λ) = −d/dλ S(λ; x) = −{(−1) n λ⁻²} = n/λ²

is positive, we indeed have the MLE.
b) Calculate the observed Fisher information, the standard error of λ̂ML and a 95% Wald confidence interval for λ.
◮ By plugging the MLE λ̂ML into the Fisher information, we get the observed Fisher information

I(λ̂ML) = n x̄²

and hence the standard error of the MLE,

se(λ̂ML) = I(λ̂ML)^(−1/2) = 1/(x̄ √n).

The 95% Wald confidence interval for λ is thus given by

[λ̂ML ± z0.975 se(λ̂ML)] = [1/x̄ ± z0.975/(x̄ √n)].
c) Derive the expected Fisher information J(λ) and the variance stabilizing transformation φ = h(λ) of λ.
◮ Because the Fisher information does not depend on x in this case, we have simply

J(λ) = E{I(λ; X)} = n/λ².

Now we can derive the variance stabilising transformation:

φ = h(λ) ∝ ∫^λ Jλ(u)^(1/2) du ∝ ∫^λ u⁻¹ du = log(u)|_{u=λ} = log(λ).

d) Compute the MLE of φ and derive a 95% confidence interval for λ by back-transforming the limits of the 95% Wald confidence interval for φ. Compare with the result from 4b).
◮ Due to the invariance of ML estimation with respect to one-to-one transformations we have

φ̂ML = log λ̂ML = −log x̄

as the MLE of φ = log(λ). Using the delta method we can get the corresponding standard error as

se(φ̂ML) = se(λ̂ML) |dh(λ̂ML)/dλ| = {1/(x̄ √n)} · (1/λ̂ML) = {1/(x̄ √n)} · x̄ = n^(−1/2).

So the 95% Wald confidence interval for φ is

[−log x̄ ± z0.975 · n^(−1/2)],

and transformed back to the λ-space we have the 95% confidence interval

[exp(−log x̄ − z0.975 n^(−1/2)), exp(−log x̄ + z0.975 n^(−1/2))] = [x̄⁻¹/exp(z0.975/√n), x̄⁻¹ · exp(z0.975/√n)],

which is not centred around the MLE λ̂ML = x̄⁻¹, unlike the original Wald confidence interval for λ.
e) Derive the Cramér-Rao lower bound for the variance of unbiased estimators of λ.
◮ If T = h(X) is an unbiased estimator for λ, then Result 4.8 states that

Var(T) ≥ J(λ)⁻¹ = λ²/n,

which is the Cramér-Rao lower bound.
f) Compute the expectation of λ̂ML and use this result to construct an unbiased estimator of λ. Compute its variance and compare it to the Cramér-Rao lower bound.
◮ By the properties of the exponential distribution we know that Σ_{i=1}^n Xi ∼ G(n, λ), cf. Appendix A.5.2. Next, by the properties of the gamma distribution we get that X̄ = (1/n) Σ_{i=1}^n Xi ∼ G(n, nλ), and λ̂ML = 1/X̄ ∼ IG(n, nλ), cf. Appendix A.5.2. It follows that

E(λ̂ML) = nλ/(n − 1) > λ,

cf. again Appendix A.5.2. Thus, λ̂ML is a biased estimator of λ. However, we can easily correct it by multiplying with the constant (n − 1)/n. This new estimator λ̂ = (n − 1)/(n X̄) is obviously unbiased, and has variance

Var(λ̂) = {(n − 1)²/n²} Var(1/X̄) = {(n − 1)²/n²} · n²λ²/{(n − 1)²(n − 2)} = λ²/(n − 2),

cf. again Appendix A.5.2. This variance only asymptotically reaches the Cramér-Rao lower bound λ²/n. Theoretically there might be other unbiased estimators which have a smaller variance than λ̂.
5. An alternative parametrization of the exponential distribution is

fX(x) = (1/θ) exp(−x/θ) IR+(x),  θ > 0.

Let X1:n denote a random sample from this density. We want to test the null hypothesis H0: θ = θ0 against the alternative hypothesis H1: θ ≠ θ0.
a) Calculate both variants T1 and T2 of the score test statistic.
◮ Recall from Section 4.1 that

T1(x1:n) = S(θ0; x1:n)/√J1:n(θ0) and T2(x1:n) = S(θ0; x1:n)/√I(θ0; x1:n).
Like in the previous exercise, we can compute the log-likelihood

l(θ) = Σ_{i=1}^n {−log(θ) − xi/θ} = −n log(θ) − n x̄/θ

and derive the score function

S(θ; x1:n) = (nθ − n x̄) · (−1/θ²) = n(x̄ − θ)/θ²,

the Fisher information

I(θ; x1:n) = −d/dθ S(θ; x1:n) = n (2x̄ − θ)/θ³,

and the expected Fisher information

J1:n(θ) = n {2 E(X̄) − θ}/θ³ = n/θ².

The test statistics can now be written as

T1(x1:n) = {n(x̄ − θ0)/θ0²} · θ0/√n = √n (x̄ − θ0)/θ0
and T2(x1:n) = {n(x̄ − θ0)/θ0²} · θ0^(3/2)/√{n(2x̄ − θ0)} = T1(x1:n) √{θ0/(2x̄ − θ0)}.

b) A sample of size n = 100 gave x̄ = 0.26142. Quantify the evidence against H0: θ0 = 0.25 using a suitable significance test.
◮ By plugging these numbers into the formulas for T1(x1:n) and T2(x1:n), we obtain

T1(x1:n) = 0.457 and T2(x1:n) = 0.437.

Under the null hypothesis, both statistics follow asymptotically the standard normal distribution. Hence, to test at level α, we need to compare the observed values with the (1 − α/2) · 100% quantile of the standard normal distribution. For α = 0.05, we compare with z0.975 ≈ 1.96. As neither of the observed values is larger than the critical value, the null hypothesis cannot be rejected.
6. In a study assessing the sensitivity π of a low-budget diagnostic test for asthma, each of n asthma patients is tested repeatedly until the first positive test result is obtained. Let Xi be the number of the first positive test for patient i. All patients and individual tests are independent, and the sensitivity π is equal for all patients and tests.
a) Derive the probability mass function f(x; π) of Xi.
◮ Xi can only take one of the values 1, 2, ..., so it is a discrete random variable supported on the natural numbers N. For a given x ∈ N, the probability that Xi equals x is

f(x; π) = Pr(first test negative, ..., (x − 1)-st test negative, x-th test positive) = (1 − π) ··· (1 − π) · π = (1 − π)^(x−1) π,

where the factor (1 − π) appears x − 1 times, since the results of the different tests are independent. This is the probability mass function of the geometric distribution Geom(π) (cf. Appendix A.5.1), i.e. we have Xi iid ∼ Geom(π) for i = 1, ..., n.
b) Write down the log-likelihood function for the random sample X1:n and compute the MLE π̂ML.
◮ For a realisation x1:n = (x1, ..., xn), the likelihood is

L(π) = ∏_{i=1}^n f(xi; π) = ∏_{i=1}^n π(1 − π)^(xi−1) = πⁿ (1 − π)^(Σ xi − n) = πⁿ (1 − π)^(n(x̄−1)),

yielding the log-likelihood

l(π) = n log(π) + n(x̄ − 1) log(1 − π).

The score function is thus

S(π; x1:n) = d/dπ l(π) = n/π − n(x̄ − 1)/(1 − π)

and the solution of the score equation S(π; x1:n) = 0 is

π̂ML = 1/x̄.

The Fisher information is

I(π) = −d/dπ S(π) = n/π² + n(x̄ − 1)/(1 − π)²,
yielding the observed Fisher information

I(π̂ML) = n [x̄² + (x̄ − 1)/{(x̄ − 1)/x̄}²] = n x̄³/(x̄ − 1),

which is positive, as xi ≥ 1 for every i by definition. It follows that π̂ML = 1/x̄ indeed is the MLE.
c) Derive the standard error se(π̂ML) of the MLE.
◮ The standard error is

se(π̂ML) = I(π̂ML)^(−1/2) = √{(x̄ − 1)/(n x̄³)}.

d) Give a general formula for an approximate 95% confidence interval for π. What could be the problem of this interval?
◮ A general formula for an approximate 95% confidence interval for π is

[π̂ML ± z0.975 · se(π̂ML)] = [1/x̄ ± z0.975 √{(x̄ − 1)/(n x̄³)}],

where z0.975 ≈ 1.96 is the 97.5% quantile of the standard normal distribution. The problem of this interval could be that it might contain values outside the range (0, 1) of the parameter π. That is, 1/x̄ − z0.975 √{(x̄ − 1)/(n x̄³)} could be smaller than 0 or 1/x̄ + z0.975 √{(x̄ − 1)/(n x̄³)} could be larger than 1.
e) Now we consider the parametrization with φ = logit(π) = log{π/(1 − π)}. Derive the corresponding MLE φ̂ML, its standard error and associated approximate 95% confidence interval. What is the advantage of this interval?
◮ By the invariance of the MLE with respect to one-to-one transformations, we have

φ̂ML = logit(π̂ML) = log[(1/x̄)/{1 − 1/x̄}] = log{1/(x̄ − 1)} = −log(x̄ − 1).

By the delta method, we further have

se(φ̂ML) = se(π̂ML) |d/dπ logit(π̂ML)|.

We therefore compute

d/dπ logit(π) = d/dπ log{π/(1 − π)} = {(1 − π)/π} · {1 · (1 − π) − (−1) π}/(1 − π)² = {(1 − π)/π} · (1 − π + π)/(1 − π)² = 1/{π(1 − π)},

and

d/dπ logit(π̂ML) = 1/[(1/x̄){1 − 1/x̄}] = x̄²/(x̄ − 1).

The standard error of φ̂ML is hence

se(φ̂ML) = se(π̂ML) |d/dπ logit(π̂ML)| = √{(x̄ − 1)/(n x̄³)} · x̄²/(x̄ − 1) = n^(−1/2) (x̄ − 1)^(1/2−1) x̄^(−3/2+2) = n^(−1/2) (x̄ − 1)^(−1/2) x̄^(1/2) = √[x̄/{n(x̄ − 1)}].

The associated approximate 95% confidence interval for φ is given by

[−log(x̄ − 1) ± z0.975 · √[x̄/{n(x̄ − 1)}]].

The advantage of this interval is that it is for a real-valued parameter φ ∈ R, so its bounds are always contained in the parameter range.
f) n = 9 patients did undergo the trial and the observed numbers were x = (3, 5, 2, 6, 9, 1, 2, 2, 3). Calculate the MLEs π̂ML and φ̂ML, the confidence intervals from 6d) and 6e) and compare them by transforming the latter back to the π-scale.
◮ To compute the MLEs π̂ML and φ̂ML, we may proceed as follows.
> ## the data:
> x <- c(3, 5, 2, 6, 9, 1, 2, 2, 3)
> xq <- mean(x)
> ## the MLE for pi:
> (mle.pi <- 1/xq)
[1] 0.2727273
> ## The logit function is the quantile function of the
> ## standard logistic distribution, hence:
> (mle.phi <- qlogis(mle.pi))
[1] -0.9808293
> ## and this is really the same as
> - log(xq - 1)
[1] -0.9808293
We obtain that π̂ML = 0.273 and that φ̂ML = −0.981. To compute the confidence intervals, we proceed as follows.
> n <- length(x)
> ## the standard error for pi:
> (se.pi <- sqrt((xq-1) / (n * xq^3)))
[1] 0.07752753
> ## the standard error for phi:
> (se.phi <- sqrt(xq / (n * (xq - 1))))
[1] 0.390868
> ## the CIs:
> (ci.pi <- mle.pi + c(-1, +1) * qnorm(0.975) * se.pi)
[1] 0.1207761 0.4246784
> (ci.phi <- mle.phi + c(-1, +1) * qnorm(0.975) * se.phi)
[1] -1.7469164 -0.2147421
Now we can transform the bounds of the confidence interval for φ back to the π-scale. Note that the logit transformation, and hence also its inverse, are strictly monotonically increasing, which is easily seen from

d/dπ logit(π) = 1/{π(1 − π)} > 0.

We can work out the inverse transformation by solving φ = logit(π) for π:

φ = log{π/(1 − π)}
exp(φ) = π/(1 − π)
exp(φ) − π exp(φ) = π
exp(φ) = π{1 + exp(φ)}
π = exp(φ)/{1 + exp(φ)}.

So we have logit⁻¹(φ) = exp(φ)/{1 + exp(φ)}. This is also the cdf of the standard logistic distribution. To transform the confidence interval in R, we may therefore proceed as follows.
> (ci.pi.2 <- plogis(ci.phi))
[1] 0.1484366 0.4465198
> ## This is identical to:
> invLogit <- function(phi)
  {
      exp(phi) / (1 + exp(phi))
  }
> invLogit(ci.phi)
[1] 0.1484366 0.4465198
We thus obtain two confidence intervals for π: (0.121, 0.425) and (0.148, 0.447). We now compare their lengths.
> ## compare the lengths of the two pi confidence intervals:
> ci.pi.2[2] - ci.pi.2[1]
[1] 0.2980833
> ci.pi[2] - ci.pi[1]
[1] 0.3039023
Compared to the first confidence interval for π, the new interval is slightly shifted to the right, and smaller. An advantage is that it can never lie outside the (0, 1) range.
g) Produce a plot of the relative log-likelihood function l̃(π) and two approximations in the range π ∈ (0.01, 0.5): The first approximation is based on the direct quadratic approximation l̃π(π) ≈ qπ(π), the second approximation is based on the quadratic approximation l̃φ(φ) ≈ qφ(φ), i.e. qφ{logit(π)} values are plotted. Comment on the result.
◮ We produce a plot of the relative log-likelihood function l̃(π) and two (quadratic) approximations:
> ## functions for pi:
> loglik.pi <- function(pi)
  {
      n * log(pi) + n * (xq - 1) * log(1 - pi)
  }
> rel.loglik.pi <- function(pi)
  {
      loglik.pi(pi) - loglik.pi(mle.pi)
  }
> approx.rel.loglik.pi <- function(pi)
  {
      - 0.5 * se.pi^(-2) * (pi - mle.pi)^2
  }
> ## then for phi:
> loglik.phi <- function(phi)
  {
      loglik.pi(plogis(phi))
  }
> rel.loglik.phi <- function(phi)
  {
      loglik.phi(phi) - loglik.phi(mle.phi)
  }
> approx.rel.loglik.phi <- function(phi)
  {
42 4 Frequentist properties of the likelihood 43

- 0.5 * se.phi^(-2) * (phi - mle.phi)^2 7. A simple model for the drug concentration in plasma over time after a single
} intravenous injection is c(t) = θ2 exp(−θ1 t), with θ1 , θ2 > 0. For simplicity we
> ## and the plot
> piGrid <- seq(0.01, 0.5, length=201) assume here that θ2 = 1.
> plot(piGrid, rel.loglik.pi(piGrid),
type="l",
a) Assume that n probands had their concentrations ci , i = 1, . . . , n, measured
at the same single time-point t and assume that the model ci ∼ N(c(t), σ 2 ) is
iid
xlab=expression(pi),
ylab = expression(tilde(l)(pi)),
appropriate for the data. Calculate the MLE of θ1 .
lwd=2)
> abline(v=0, col="gray") ◮ The likelihood is
> lines(piGrid, approx.rel.loglik.pi(piGrid), n  
Y 1 1  2
L(θ1 ) = exp − 2 ci − exp(−θ1 t)
lty=2,
√ ,
col="blue") 2πσ 2 2σ
> lines(piGrid, approx.rel.loglik.phi(qlogis(piGrid)), i=1
lty=2,
yielding the log-likelihood
col="red")
> abline(v=mle.pi) n  
X 1  1  2
> legend("bottomright", l(θ1 ) = − log 2πσ 2 − 2 ci − exp(−θ1 t) .
legend= 2 2σ
i=1
c("relative log-lik.",
"quadratic approx.", For the score function we thus have
"transformed quad. approx."),
n
col= exp(−θ1 t) t X 
c("black", S(θ1 ; c1:n ) = − 2
ci − exp(−θ1 t)
"blue", σ
i=1
"red"),
exp(−θ1 t) nt
lty= = {exp(−θ1 t) − c̄},
c(1, σ2
2, and for the Fisher information
2),
lwd= exp(−θ1 t) nt2
c(2, I(θ1 ) = {2 exp(−θ1 t) − c̄}.
1, σ2
1)) The score equation is solved as
0
0 = S(θ1 ; c1:n )
−5 exp(−θ1 t) = c̄
1
−10
θ̂1 = − log(c̄).
~l (π)

t
The observed Fisher information
−15
c̄2 nt2
relative log−lik. I(θ̂1 ) =
−20 quadratic approx. σ2
transformed quad. approx.
is positive; thus, θ̂1 is indeed the MLE.
0.0 0.1 0.2 0.3 0.4 0.5
b) Calculate the asymptotic variance of the MLE.
π
◮ By Result 4.10, the asymptotic variance of θ̂1 is the inverse of the expected
The transformed quadratic approximation qφ (logit(π)) is closer to the true rel- Fisher information
ative log-likelihood l̃(π) than the direct quadratic approximation qπ (π). This
corresponds to a better performance of the second approximate confidence in- J1:n (θ1 ) = E{I(θ1 ; C1:n )}
terval. exp(−2θ1 t) nt2
= .
σ2
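As a quick numerical cross-check of the MLE θ̂1 = −log(c̄)/t from Exercise 7a) and the expected Fisher information J1:n(θ1) = exp(−2θ1 t) n t²/σ² from 7b), the following sketch simulates repeated experiments and compares the empirical variance of θ̂1 with the asymptotic variance 1/J1:n(θ1). All numerical values (θ1 = 0.5, σ = 0.2, t = 2, n = 50) are hypothetical and only serve as an illustration.

    ## hedged simulation check of the asymptotic variance (hypothetical values)
    set.seed(42)
    theta1 <- 0.5; sigma <- 0.2; t <- 2; n <- 50; nSim <- 10000
    thetaHat <- replicate(nSim, {
      ci <- rnorm(n, mean = exp(-theta1 * t), sd = sigma)  # c_i ~ N(exp(-theta1*t), sigma^2)
      -log(mean(ci)) / t                                   # MLE from 7a)
    })
    var(thetaHat)                               # empirical variance of the MLE
    sigma^2 / (exp(-2 * theta1 * t) * n * t^2)  # asymptotic variance 1/J(theta1)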
44 4 Frequentist properties of the likelihood 45

c) In pharmacokinetic studies one is often interested in the area under the concen-
tration curve, α = ∫₀^∞ exp(−θ1 t) dt. Calculate the MLE for α and its variance
estimate using the delta theorem.
◮ By the invariance of the MLE with respect to one-to-one transformations,
we obtain that

    α̂ML = ∫₀^∞ exp(−θ̂1 t) dt = 1/θ̂1 = −t/log(c̄).

Further, by the delta method, we obtain that

    se(α̂ML) = se(θ̂1) · |d/dθ1 (1/θ1)| at θ1 = θ̂1 = σ/{exp(−θ̂1 t) √n t θ̂1²}.

Thus, the asymptotic variance of α̂ML is

    σ²/{exp(−2θ̂1 t) n t² θ̂1⁴}.

d) We now would like to determine the optimal time point for measuring the
concentrations ci. Minimise the asymptotic variance of the MLE with respect
to t, when θ1 is assumed to be known, to obtain an optimal time point topt.
◮ We take the derivative of the asymptotic variance of θ̂1 with respect to t:

    d/dt {σ² exp(2θ1 t)/(n t²)} = σ²/n {2θ1 exp(2θ1 t)/t² − 2 exp(2θ1 t)/t³},

and find that it is equal zero for topt satisfying that

    2θ1 exp(2θ1 t)/t²opt = 2 exp(2θ1 t)/t³opt,

i. e. for

    topt = 1/θ1.

In order to verify that topt minimises the asymptotic variance, we compute the
second derivative of the asymptotic variance with respect to t:

    2σ²/n {2θ1² exp(2θ1 t)/t² − 4θ1 exp(2θ1 t)/t³ + 3 exp(2θ1 t)/t⁴}.

If we plug in topt = 1/θ1 for t, we obtain

    2σ²/n exp(2)(2θ1⁴ − 4θ1⁴ + 3θ1⁴) = 2θ1⁴ σ² exp(2)/n,

which is positive. Thus, topt indeed minimises the variance.
8. Assume the gamma model G(α, α/µ) for the random sample X1:n with mean
E(Xi) = µ > 0 and shape parameter α > 0.
a) First assume that α is known. Derive the MLE µ̂ML and the observed Fisher
information I(µ̂ML).
◮ The log-likelihood kernel for µ is

    l(µ) = −αn log(µ) − (α/µ) Σ_{i=1}^n xi,

and the score function is

    S(µ; x) = d/dµ l(µ) = −αn/µ + α µ^(−2) Σ_{i=1}^n xi.

The score equation S(µ; x) = 0 can be written as

    n = (1/µ) Σ_{i=1}^n xi

and is hence solved by µ̂ML = x̄. The ordinary Fisher information is

    I(µ) = −d/dµ S(µ; x) = −(αn µ^(−2) − 2α µ^(−3) Σ_{i=1}^n xi) = (2α/µ³) Σ_{i=1}^n xi − αn/µ²,

so the observed Fisher information equals

    I(µ̂ML) = I(x̄) = 2αn x̄/x̄³ − αn/x̄² = αn/x̄² = αn/µ̂²ML.

As it is positive, we have indeed found the MLE.
46 4 Frequentist properties of the likelihood 47

b) Use the p∗ formula to derive an asymptotic density of µ̂ML depending on the and the score function reads
true parameter µ. Show that the kernel of this approximate density is exact in
d
this case, i. e. it equals the kernel of the exact density known from Exercise 9 S(α; x) = l(α)

from Chapter 3.   n n
α µ1 X 1X
◮ The p∗ formula gives us the following approximate density of the MLE: = n log + αn − nψ(α) + log(xi ) − xi
µ αµ µ
r i=1 i=1
I(µ̂ML ) L(µ)   n n
f (µ̂ML ) =
∗ α X 1X
2π L(µ̂ML ) = n log + n − nψ(α) + log(xi ) − xi .
r µ µ
i=1 i=1
I(µ̂ML )
= exp{l(µ) − l(µ̂ML )}
2π Hence, the Fisher information is
r
αn
= exp {−αn log(µ) − α/µ · nµ̂ML + αn log(µ̂ML ) + α/µ̂ML · nµ̂ML } I(α) = −
d
µ̂2ML 2π dα
S(α; x)
r   nn o
=
αn −αn
µ exp(αn) · µ̂αn−1 exp −
αn
µ̂ . (4.3) =− − nψ ′ (α)

ML

ML
µ 
1
Pn = n ψ ′ (α) − .
From Appendix A.5.2 we know that i=1 Xi ∼ G(nα, α/µ), and X̄ = µ̂ML ∼ α
G(nα, nα/µ); cf. Exercise 9 in Chapter 3. The corresponding density function
has the kernel e) Show, by rewriting the score equation, that the MLE α̂ML fulfils
 
αn
f (µ̂ML ) ∝ µ̂nα−1
ML
exp − µ̂ML , n
X 1X
n
µ −nψ(α̂ML ) + n log(α̂ML ) + n = − log(xi ) + xi + n log(µ). (4.5)
µ
which is the same as the kernel in (4.3). i=1 i=1

c) Stirling’s approximation of the gamma function is Hence show that the log-likelihood kernel can be written as
r
2π xx  
Γ(x) ≈ . (4.4) l(α) = n α log(α) − α − log{Γ(α)} + αψ(α̂ML ) − α log(α̂ML ) .
x exp(x)
Show that approximating the normalising constant of the exact density
with (4.4) gives the normalising constant of the approximate p∗ formula den- ◮ The score equation S(α̂ML ; x) = 0 can be written as
sity.
1X
n
X n
◮ The normalising constant of the exact distribution G(nα, nα/µ) is: n log(α̂ML ) − n log(µ) + n − nψ(α̂ML ) + log(xi ) − xi = 0
 αn µ
i=1 i=1
αn  αn r
µ αn αn exp(αn) n
X 1X
n

Γ(αn)

µ 2π (αn)αn − nψ(α̂ML ) + n log(α̂ML ) + n = − log(xi ) + xi + n log(µ).
µ
r i=1 i=1
αn
= µ−αn exp(αn) ,
2π Hence, we can rewrite the log-likelihood kernel as follows:
  Xn n
which equals the normalising constant of the approximate density in (4.3). α αX
l(α) = αn log − n log{Γ(α)} + α log(xi ) − xi
d) Now assume that µ is known. Derive the log-likelihood, score function and µ µ
i=1 i=1
Fisher information of α. Use the digamma function ψ(x) = dx d
log{Γ(x)} and ( )
1X
X n n
the trigamma function ψ (x) = dx ψ(x).
′ d
= αn log(α) − n log{Γ(α)} − α n log(µ) − log(xi ) + xi
µ
◮ The log-likelihood kernel of α is i=1 i=1
  Xn n = αn log(α) − n log{Γ(α)} − α {−nψ(α̂ML ) + n log(α̂ML ) + n}
α αX
l(α) = αn log − n log{Γ(α)} + α log(xi ) − xi , = n[α log(α) − α − log{Γ(α)} + αψ(α̂ML ) − α log(α̂ML )].
µ µ
i=1 i=1
48 4 Frequentist properties of the likelihood 49

f ) Implement an R-function of the p∗ formula, taking as arguments the MLE mu) # the mean parameter
value(s) α̂ML at which to evaluate the density, and the true parameter α. For {
## solve the score equation
numerical reasons, first compute the approximate log-density uniroot(f=scoreFun.alpha,
interval=c(1e-10, 1e+10),
1 1
log f ∗ (α̂ML ) = − log(2π) + log{I(α̂ML )} + l(α) − l(α̂ML ), x=x, # pass additional parameters
2 2 mu=mu)$root # to target function
}
and then exponentiate it. The R-functions digamma, trigamma and lgamma can > ## now simulate the datasets and compute the MLE for each
be used to calculate ψ(x), ψ ′ (x) and log{Γ(x)}, respectively. > nSim <- 10000
> alpha <- 2
◮ We first rewrite the relative log-likelihood l(α) − l(α̂ML ) as > mu <- 3
> n <- 10
n[α log(α) − α̂ML log(α̂ML ) − (α − α̂ML ) − log{Γ(α)} + log{Γ(α̂ML )} > alpha.sim.mles <- numeric(nSim)
> set.seed(93)
+ (α − α̂ML )ψ(α̂ML ) − (α − α̂ML ) log(α̂ML )] > for(i in seq_len(nSim))
= n[α{log(α) − log(α̂ML )} − (α − α̂ML ) − log{Γ(α)} + log{Γ(α̂ML )} + (α − α̂ML )ψ(α̂ML )]. {
alpha.sim.mles[i] <- getMle.alpha(x=
rgamma(n=n,
Now we are ready to implement the approximate density of the MLE, as de- alpha,
scribed by the p∗ formula: alpha / mu),
mu=mu)
> approx.mldens <- function(alpha.mle, alpha.true) }
{ > ## compare the histogram with the p* density
relLogLik <- n * (alpha.true * (log(alpha.true) - log(alpha.mle)) - > hist(alpha.sim.mles,
(alpha.true - alpha.mle) - lgamma(alpha.true) + prob=TRUE,
lgamma(alpha.mle) + (alpha.true - alpha.mle) * nclass=50,
digamma(alpha.mle)) ylim=c(0, 0.6),
logObsFisher <- log(n) + log(trigamma(alpha.mle) - 1 / alpha.mle) xlim=c(0, 12))
logret <- - 0.5 * log(2 * pi) + 0.5 * logObsFisher + relLogLik > curve(approx.mldens(x, alpha.true=alpha),
return(exp(logret)) add=TRUE,
} lwd=2,
n=201,
g) In order to illustrate the quality of this approximation, we consider the case
col="red")
with α = 2 and µ = 3. Simulate 10 000 data sets of size n = 10, and compute Histogram of alpha.sim.mles
0.6
the MLE α̂ML for each of them by numerically solving (4.5) using the R-function
uniroot (cf . Appendix C.1.1). Plot a histogram of the resulting 10 000 MLE 0.5

samples (using hist with option prob=TRUE). Add the approximate density 0.4

Density
derived above to compare.
0.3
◮ To illustrate the quality of this approximation by simulation, we may run
the following code. 0.2

> ## the score function 0.1


> scoreFun.alpha <- function(alpha,
x, # the data 0.0
mu) # the mean parameter
{ 0 2 4 6 8 10 12
n <- length(x) alpha.sim.mles
ret <- n * log(alpha / mu) + n - n * digamma(alpha) +
sum(log(x)) - sum(x) / mu We may see that we have a nice agreement between the sampling distribution
## be careful that this is vectorised in alpha! and the approximate density.
return(ret)
}
> ## this function computes the MLE for alpha
> getMle.alpha <- function(x, # the data
5 Likelihood inference in
multiparameter models

1. In a cohort study on the incidence of ischaemic heart disease (IHD) 337 male
probands were enrolled. Each man was categorised as non-exposed (group 1, daily
energy consumption ≥ 2750 kcal) or exposed (group 2, daily energy consumption
< 2750 kcal) to summarise his average level of physical activity. For each group,
the number of person years (Y1 = 2768.9 and Y2 = 1857.5) and the number of IHD
cases (D1 = 17 and D2 = 28) was registered thereafter.
We assume that Di | Yi ∼ Po(λi Yi ), i = 1, 2, where λi > 0 is the group-specific
ind

incidence rate.
a) For each group, derive the MLE λ̂i and a corresponding 95% Wald confidence
interval for log(λi ) with subsequent back-transformation to the λi -scale.
◮ The log-likelihood kernel corresponding to a random variable X with Poisson
distribution Po(θ) is
l(θ) = −θ + x log(θ),
implying the score function
d x
S(θ; x) = l(θ) = −1 +
dθ θ
and the Fisher information
d x
I(θ) = − S(θ; x) = 2 .
dθ θ
The score equation is solved by θ̂ML = x and since the observed Fisher informa-
tion I(θ̂ML ) = 1/x is positive, θ̂ML indeed is the MLE of θ.
The rates of the Poisson distributions for Di can therefore be estimated by the
maximum likelihood estimators θ̂i = Di . Now,
θi
λi = ,
Yi
so, by the invariance of the MLE, we obtain that
θi Di
λ̂i = = .
Yi Yi
52 5 Likelihood inference in multiparameter models 53

With the given data we have the results λ̂1 = D1/Y1 = 6.14 · 10⁻³ and λ̂2 =
D2/Y2 = 1.51 · 10⁻².
We can now use the fact that the Poisson distribution Po(θ) with θ a natural number
may be seen as the distribution of a sum of θ independent random variables, each
with distribution Po(1). As such, Po(θ) for reasonably large θ is approximately
N(θ, θ). In our case, D1 = θ̂1 = 17 and D2 = θ̂2 = 28 might be considered
approximately normally distributed, θ̂i ∼ N(θi, θi). It follows that λ̂i are, too,
approximately normal, λ̂i ∼ N(λi, λi/Yi). The standard errors of λ̂i can be
estimated as se(λ̂i) = √Di/Yi.
Again by the invariance of the MLE, the MLEs of ψi = log(λi) = f(λi) are

    ψ̂i = log(Di/Yi).

By the delta method, the standard errors of ψ̂i are

    se(ψ̂i) = se(λ̂i) · |f′(λ̂i)| = (√Di/Yi) · (1/λ̂i) = 1/√Di.

Therefore, the back-transformed limits of the 95% Wald confidence intervals with
log-transformation for λi equal

    [exp(ψ̂i − z0.975/√Di), exp(ψ̂i + z0.975/√Di)],

and with the data we get:
> (ci1 <- exp(log(d[1]/y[1]) + c(-1, 1) * qnorm(0.975) / sqrt(d[1])))
[1] 0.003816761 0.009876165
> (ci2 <- exp(log(d[2]/y[2]) + c(-1, 1) * qnorm(0.975) / sqrt(d[2])))
[1] 0.01040800 0.02183188
i. e. the confidence interval for λ1 is (0.00382, 0.00988) and for λ2 it is
(0.01041, 0.02183).
b) In order to analyse whether λ1 = λ2, we reparametrise the model with λ = λ1
and θ = λ2/λ1. Show that the joint log-likelihood kernel of λ and θ has the
following form:

    l(λ, θ) = D log(λ) + D2 log(θ) − λY1 − θλY2,

where D = D1 + D2.
◮ First, note that θ now has a different meaning than in the solution of 1a).
By the independence of D1 and D2, the joint log-likelihood kernel in the original
parametrisation is

    l(λ1, λ2) = D1 log(λ1) − λ1 Y1 + D2 log(λ2) − λ2 Y2.

In terms of the new parametrisation,

    λ1 = λ and λ2 = λθ,

so if, moreover, we denote D = D1 + D2, we have

    l(λ, θ) = D1 log(λ) − λY1 + D2 log(λ) + D2 log(θ) − λθY2
            = D log(λ) + D2 log(θ) − λY1 − λθY2.                  (5.1)

c) Compute the MLE (λ̂, θ̂), the observed Fisher information matrix I(λ̂, θ̂) and
derive expressions for both profile log-likelihood functions lp(λ) = l{λ, θ̂(λ)} and
lp(θ) = l{λ̂(θ), θ}.
◮ The score function is

    S(λ, θ) = ( d/dλ l(λ, θ), d/dθ l(λ, θ) )ᵀ = ( D/λ − Y1 − θY2, D2/θ − λY2 )ᵀ,

and the Fisher information is

    I(λ, θ) = −( d²l(λ,θ)/dλ²    d²l(λ,θ)/dλdθ
                 d²l(λ,θ)/dλdθ   d²l(λ,θ)/dθ²  ) = ( D/λ²   Y2
                                                     Y2     D2/θ² ).

The score equation S(λ, θ) = 0 is solved by

    (λ̂, θ̂) = ( D1/Y1, D2 Y1/(D1 Y2) ),

and, as the observed Fisher information

    I(λ̂, θ̂) = ( D Y1²/D1²   Y2
                 Y2          D1² Y2²/(D2 Y1²) )

is positive definite, (λ̂, θ̂) indeed is the MLE.
In order to derive the profile log-likelihood functions, one first has to compute
the maxima of the log-likelihood with fixed λ or θ, which we call here θ̂(λ) and
λ̂(θ). This amounts to solving the score equations d/dθ l(λ, θ) = 0 and d/dλ l(λ, θ) = 0
separately for θ and λ, respectively. The solutions are

    θ̂(λ) = D2/(λY2)   and   λ̂(θ) = D/(Y1 + θY2).

The strictly positive diagonal entries of the Fisher information show that the log-
likelihoods are strictly concave, so θ̂(λ) and λ̂(θ) indeed are the maxima. Now
we can obtain the profile log-likelihood functions by plugging in θ̂(λ) and λ̂(θ)
into the log-likelihood (5.1). The results are (after omitting additive constants
not depending on the arguments λ and θ, respectively)

    lp(λ) = D1 log(λ) − λY1
and lp(θ) = −D log(Y1 + θY2) + D2 log(θ).
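The closed-form expressions for (λ̂, θ̂) can be double-checked numerically. The following sketch is only a minimal check using the data of the exercise (D1 = 17, D2 = 28, Y1 = 2768.9, Y2 = 1857.5); it maximises the log-likelihood kernel (5.1) with optim (on the log scale for numerical stability, an implementation choice not taken from the book) and compares the result with D1/Y1 and D2Y1/(D1Y2).

    ## hedged numerical check of the closed-form MLE (lambda, theta)
    d <- c(17, 28); y <- c(2768.9, 1857.5)
    dtot <- sum(d)
    loglik <- function(logpar) {                 # kernel (5.1), parameters on the log scale
      lambda <- exp(logpar[1]); theta <- exp(logpar[2])
      dtot * log(lambda) + d[2] * log(theta) - lambda * y[1] - lambda * theta * y[2]
    }
    opt <- optim(c(log(0.01), log(1)), loglik, control = list(fnscale = -1))
    exp(opt$par)                                  # numerical MLE
    c(d[1] / y[1], d[2] * y[1] / (d[1] * y[2]))   # closed-form MLE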


54 5 Likelihood inference in multiparameter models 55

5 −100
d) Plot both functions lp (λ) and lp (θ), and also create a contour plot of the relative
−400
log-likelihood l̃(λ, θ) using the R-function contour. Add the points {λ, θ̂(λ)} and 4 −120

{λ̂(θ), θ} to the contour plot, analogously to Figure 5.3a). −450

◮ The following R code produces the desired plots. 3 −140 −500

lp(λ)

lp(θ)
> ## log-likelihood of lambda, theta:

θ
> loglik <- function(param) −0 −550
2 .5 −160
{

−2
−1

0
lambda <- param[1] −600

theta <- param[2] 1 −5 −180

return(dtot * log(lambda) + d[2] * log(theta) - −10 −650


−20
lambda * y[1] - theta * lambda * y[2])
−50
−100
}
0.005 0.010 0.015 0.000 0.005 0.010 0.015 0 1 2 3 4 5
> ## the MLE
λ λ θ
> (mle <- c(d[1]/y[1], (y[1] * d[2]) / (y[2] * d[1])))
[1] 0.006139622 2.455203864 e) Compute a 95% Wald confidence interval for log(θ) based on the profile log-
> ## relative log-likelihood likelihood. What can you say about the P -value for the null hypothesis λ1 = λ2 ?
> rel.loglik <- function(param)
{ ◮ First we derive the standard error of θ̂, by computing the negative curvature
return(loglik(param) - loglik(mle)) of the profile log-likelihood lp (θ):
}
   
> ## set up parameter grids d d d DY2 D2 DY22 D2
> lambda.grid <- seq(1e-5, 0.018, length = 200) Ip (θ) = − lp (θ) = − − + =− + 2.
> theta.grid <- seq(1e-5, 5, length = 200) dθ dθ dθ Y1 + θY2 θ (Y1 + θY2 )2 θ
> grid <- expand.grid(lambda = lambda.grid, theta = theta.grid)
> values <- matrix(data = apply(grid, 1, rel.loglik), The profile likelihood is maximised at the MLE θ̂ = D2 Y1 /(D1 Y2 ), and the neg-
nrow = length(lambda.grid)) ative curvature there is
> ## set up plot frame D13 Y22
> par(mfrow = c(1,3)) Ip (θ̂) = .
> ## contour plot of the relative loglikelihood: DY12 D2
> contour(lambda.grid, theta.grid, values, Note that, by Result 5.1, we could have obtained the same expression by inverting
xlab = expression(lambda), ylab = expression(theta),
levels = -c(0.1, 0.5, 1, 5, 10, 20, 50, 100, 500, 1000, 1500, 2000), the observed Fisher information matrix I(λ̂, θ̂) and taking the reciprocal of the
xaxs = "i", yaxs = "i") second diagonal value. The standard error of θ̂ is thus
> points(mle[1], mle[2], pch = 19)
> ## add the profile log-likelihood points:

1 Y1 D2 D
> lines(lambda.grid, d[2] / (lambda.grid * y[2]), col = 2) se(θ̂) = {Ip (θ̂)}− 2 = √
> lines(dtot/(y[1] + theta.grid * y[2]), theta.grid, col = 3) Y2 D1 D1
> ## the profile log-likelihood functions:
> prof.lambda <- function(lambda){ and by the delta method we get the standard error for φ̂ = log(θ̂):
return(d[1] * log(lambda) - lambda * y[1])
√ r
} Y1 D2 D D1 Y2 D
> prof.theta <- function(theta){ se(φ̂) = se(θ̂) · (θ̂) =
−1
√ · = .
return(-dtot * log(y[1] + theta*y[2]) + d[2] * log(theta)) Y2 D1 D1 D2 Y1 D1 D2
}
> ## plot them separately: The 95% Wald confidence interval for φ is hence given by
> plot(lambda.grid, prof.lambda(lambda.grid), xlab = expression(lambda), "   r   r #
ylab = expression(l[p](lambda)), col = 2, type = "l") D2 Y1 D D2 Y1 D
> abline(v=mle[1], lty=2) log − z0.975 · , log + z0.975 ·
> plot(theta.grid, prof.theta(theta.grid), xlab = expression(theta), D1 Y2 D1 D2 D1 Y2 D1 D2
ylab = expression(l[p](theta)), col = 3, type = "l")
> abline(v=mle[2], lty=2) and for our data equals:
> (phiCi <- log(d[2] * y[1] / d[1] / y[2]) +
c(-1, +1) * qnorm(0.975) * sqrt(dtot / d[1] / d[2]))
[1] 0.2955796 1.5008400
56 5 Likelihood inference in multiparameter models 57

Since φ is the log relative incidence rate, and zero is not contained in the 95% The equations are solved by
confidence interval (0.296, 1.501), the corresponding P -value for testing the null Pn
i=1 xi yi
hypothesis φ = 0, which is equivalent to θ = 1 and λ1 = λ2 , must be smaller than ρ̂ML = 1
P 2 2
i=1 (xi + yi )
n
α = 5%. 2
n
2 1 X 2
Note that the use of (asymptotic) results from the likelihood theory can be justified and σ̂ML = (xi + yi2 ).
here by considering the Poisson distribution Po(n) as the distribution of a sum 2n
i=1
of n independent Poisson random variables with unit rates, as in the solution
As the observed Fisher information matrix shown below is positive definite, the
to 1a).
above estimators are indeed the MLEs.
2. Let Z 1:n be a random sample from a bivariate normal distribution N2 (µ, Σ) with b) Show that the Fisher information matrix is
mean vector µ = 0 and covariance matrix  
! n
4 − σ̂2 n(1−
ρ̂ML
2 )
1 ρ 2
I(σ̂ML , ρ̂ML ) = 
σ̂ML ML
ρ̂ML 
.
Σ = σ2 . − σ̂2 n(1−ρ̂ML n(1+ρ̂2ML )
ρ 1 ρ̂2 )
ML ML
(1−ρ̂2 )2 ML

2 2
a) Interpret σ and ρ. Derive the MLE (σ̂ML , ρ̂ML ). ◮ The components of the Fisher information matrix I(σ 2 , ρ) are computed as
◮ σ 2 is the variance of each of the components Xi and Yi of the bivariate vector
Z i . The components have correlation ρ. d2 Q(ρ) − nσ 2 (1 − ρ2 )
− l(σ 2 , ρ) = ,
To derive the MLE, we first compute the log-likelihood kernel d(σ 2 )2 σ 6 (1 − ρ2 )
P
d2 xi yi (1 − ρ2 ) − ρQ(ρ)
n
( !)
Xn
1 − 2 l(σ 2 , ρ) = i=1 ,
l(Σ) = − log |Σ| + (xi , yi )Σ −1 xi
. dσ dρ σ 4 (1 − ρ2 )2
2 Pn
yi d2 (1 − ρ2 )Q(ρ) − nσ 2 (1 − ρ4 ) − 4ρ i=1 (ρyi − xi )(ρxi − yi )
and − 2 l(σ 2 , ρ) =
i=1
.
dρ σ 2 (1 − ρ2 )3
In our case, since |Σ| = σ 4 (1 − ρ2 ) and
2
! and those of the observed Fisher information matrix I(σ̂ML , ρ̂ML ) are obtained
1 1 −ρ
Σ −1
= 2 , by plugging in the MLEs. The wished-for expressions can be obtained by simple
σ (1 − ρ2 ) −ρ 1 algebra, using that
we obtain n
X
n 1 2
l(σ 2 , ρ) = − log{σ 4 (1 − ρ2 )} − 2 Q(ρ), {(ρ̂ML yi − xi )(ρ̂ML xi − yi )} = nρ̂ML σ̂ML (ρ̂2ML − 1).
2 2σ (1 − ρ2 ) i=1
Pn 2 2
where Q(ρ) = i=1 (xi −2ρxi yi +yi ). The score function thus has the components
The computations can also be performed in a suitable software.
d n 1 c) Show that
l(σ 2 , ρ) = − 2 + 4 Q(ρ)
dσ 2 σ 2σ (1 − ρ2 ) 1 − ρ̂2
( n ) se(ρ̂ML ) = √ ML .
n
d 2 nρ 1 X ρ
and l(σ , ρ) = + 2 xi yi − Q(ρ) .
dρ 1 − ρ2 σ (1 − ρ2 ) 1 − ρ2
i=1 ◮ Using the expression for the inversion of a 2 × 2 matrix, cf. Appendix B.1.1,
The first component of the score equation can be rewritten as we obtain that the element I 22 of the inversed observed Fisher information matrix
2
1 I(σ̂ML , ρ̂ML )−1 is
σ2 = Q(ρ),
2n(1 − ρ2 )  −1
n2 (1 + ρ̂2ML ) n2 ρ̂2ML n (1 − ρ̂2ML )2
4 (1 − ρ̂2 )2
− 4 · = .
which, plugged into the second component of the score equation, yields σ̂ML ML
σ̂ ML
(1 − ρ̂2ML )2 4
σ̂ML n
Pn
2 i=1 xi yi ρ
= . The standard error of ρ̂ML is the square root of this expression.
Q(ρ) 1 − ρ2
58 5 Likelihood inference in multiparameter models 59

3. Calculate again the Fisher information of the profile log-likelihood in Result 5.2,
but this time without using Result 5.1. Use instead the fact that α̂ML(δ) is a point
where the partial derivative of l(α, δ) with respect to α equals zero.
◮ Suppose again that the data are split in two independent parts (denoted by 0
and 1), and the corresponding likelihoods are parametrised by α and β, respectively.
Then the log-likelihood decomposes as

    l(α, β) = l0(α) + l1(β).

We are interested in the difference δ = β − α. Obviously β = α + δ, so the joint
log-likelihood of α and δ is

    l(α, δ) = l0(α) + l1(α + δ).

Furthermore,

    d/dα l(α, δ) = d/dα l0(α) + d/dα l1(α + δ) = S0(α) + S1(α + δ),

where S0 and S1 are the score functions corresponding to l0 and l1, respectively.
For the profile log-likelihood lp(δ) = l{α̂ML(δ), δ}, we need the value α̂ML(δ) for which
d/dα l{α̂ML(δ), δ} = 0, hence it follows that S1{α̂ML(δ) + δ} = −S0{α̂ML(δ)}. This is
also the derivative of the profile log-likelihood, because

    d/dδ lp(δ) = d/dδ l{α̂ML(δ), δ}
               = d/dδ l0{α̂ML(δ)} + d/dδ l1{α̂ML(δ) + δ}
               = S0{α̂ML(δ)} d/dδ α̂ML(δ) + S1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1}
               = [S0{α̂ML(δ)} + S1{α̂ML(δ) + δ}] d/dδ α̂ML(δ) + S1{α̂ML(δ) + δ}
               = S1{α̂ML(δ) + δ}.

The Fisher information (negative curvature) of the profile log-likelihood is given by

    Ip(δ) = −d²/dδ² lp(δ)
          = −d/dδ S1{α̂ML(δ) + δ}
          = I1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1},                     (5.2)

which is equal to

    Ip(δ) = d/dδ S0{α̂ML(δ)} = −I0{α̂ML(δ)} d/dδ α̂ML(δ),

as S1{α̂ML(δ) + δ} = −S0{α̂ML(δ)}. Here I0 and I1 denote the Fisher information
corresponding to l0 and l1, respectively. Hence, we can solve

    I1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1} = −I0{α̂ML(δ)} d/dδ α̂ML(δ)

for d/dδ α̂ML(δ), and plug the result into (5.2) to finally obtain

    1/Ip(δ) = 1/I0{α̂ML(δ)} + 1/I1{α̂ML(δ) + δ}.                    (5.3)

4. Let X ∼ Bin(m, πx) and Y ∼ Bin(n, πy) be independent binomial random variables.
In order to analyse the null hypothesis H0: πx = πy one often considers the relative
risk θ = πx/πy or the log relative risk ψ = log(θ).
a) Compute the MLE ψ̂ML and its standard error for the log relative risk estimation.
Proceed as in Example 5.8.
◮ As in Example 5.8, we may use the invariance of the MLEs to conclude that

    θ̂ML = π̂x/π̂y = (x1/m)/(x2/n) = n x1/(m x2)   and   ψ̂ML = log{n x1/(m x2)}.

Further, ψ = log(θ) = log(πx) − log(πy), so we can use Result 5.2 to derive
the standard error of ψ̂ML. In Example 2.10, we derived the observed Fisher
information corresponding to the MLE π̂ML = x/n as

    I(π̂ML) = n/{π̂ML(1 − π̂ML)}.

Using Result 2.1, we obtain that

    I{log(π̂ML)} = I(π̂ML) · (1/π̂ML)^(−2) = n/{π̂ML(1 − π̂ML)} · π̂ML².

By Result 5.2, we thus have that

    se(ψ̂ML) = √[ I{log(π̂x)}^(−1) + I{log(π̂y)}^(−1) ] = √{(1 − π̂x)/(m π̂x) + (1 − π̂y)/(n π̂y)}.

60 5 Likelihood inference in multiparameter models 61

b) Compute a 95% confidence interval for the relative risk θ given the data in > ## profile log-likelihood of theta:
Table 3.1. > profilLoglik <- function(theta)
{
◮ The estimated risk for preeclampsia in the Diuretics group is π̂x = 6/108 = res <- theta
0.056, and in the Placebo group it is π̂y = 2/103 = 0.019. The log-relative risk is eps <- sqrt(.Machine$double.eps)
thus estimated by for(i in seq_along(theta)){ # the function can handle vectors
optimResult <-
ψ̂ML = log(6/108) − log(2/103) = 1.051, optim(par = 0.5,
fn = function(pi2) loglik(theta[i], pi2),
with the 95% confidence interval gr = function(pi2) grad(theta[i], pi2),
method = "L-BFGS-B", lower = eps, upper = 1/theta[i] - eps,
control = list(fnscale = -1))
[1.051 − 1.96 · 0.648, 1.051 + 1.96 · 0.648] = [−0.218, 2.321]. if(optimResult$convergence == 0){ # has the algorithm converged?
res[i] <- optimResult$value # only then save the value (not the parameter!)
Back-transformation to the relative risk scale by exponentiating gives the follow- } else {
ing 95% confidence interval for the relative risk θ = π1 /π2 : [0.804, 10.183]. res[i] <- NA # otherwise return NA
}
c) Also compute the profile likelihood and the corresponding 95% profile likelihood }
confidence interval for θ.
◮ The joint log-likelihood for θ = πx /πy and πy is return(res)
}
> ## plot the normed profile log-likelihood:
l(θ, πy ) = x log(θ) + (m − x) log(1 − θπy ) + (x + y) log(πy ) + (n − y) log(1 − πy ). > thetaGrid <- seq(from = 0.1, to = 30, length = 309)
> normProfVals <- profilLoglik(thetaGrid) - profilLoglik(thetaMl)
In order to compute the profile likelihood, we need to maximise l(θ, πy ) with > plot(thetaGrid, normProfVals,
type = "l", ylim = c(-5, 0),
respect to πy for fixed θ. To do this, we look for the points where xlab = expression(theta), ylab = expression(tilde(l)[p](theta))
)
d θ(m − x) x + y n−y
l(θ, πy ) = + + > ## show the cutpoint:
dπy θπy − 1 πy πy − 1 > abline(h = - 1/2 * qchisq(0.95, 1), lty = 2)
0
equals zero. To this aim, we need to solve the quadratic equation

πy2 θ(m + n) − πy θ(m + y) + n + x + x + y = 0 −1

−2
for πy . In this case, it is easier to perform the numerical maximisation.

~l p(θ)
> ## data −3
> x <- c(6, 2)
> n <- c(108, 103)
−4
> ## MLE
> piMl <- x/n
> thetaMl <- piMl[1] / piMl[2] −5

> ## log-likelihood of theta and pi2 0 5 10 15 20 25 30


> loglik <- function(theta, pi2)
{ θ
pi <- c(theta * pi2, pi2)
The downward outliers are artefacts of numerical optimisation. We therefore
sum(dbinom(x, n, pi, log = TRUE))
} compute the profile likelihood confidence intervals using the function programmed
> ## implementing the gradient in pi2 in Example 4.19. Note that they can be approximated from the positions of the
> grad <- function(theta, pi2)
{ cut-off values in the graph above.
(theta * (n[1] - x[1])) / (theta * pi2 - 1) + > ## general function that takes a given likelihood
sum(x) / pi2 + > likelihoodCi <- function(
(n[2] - x[2]) / (pi2 - 1) alpha = 0.05, # 1-alpha ist the level of the interval
} loglik, # log-likelihood (not normed)
62 5 Likelihood inference in multiparameter models 63

thetaMl, # MLE d+ and LR


so we can use Result 5.2 to derive the standard errors of LR d− . By
lower, # lower boundary of the parameter space
Example 2.10, the observed Fisher information corresponding to the MLE π̂ML =
upper, # upper boundary of the parameter space
... # additional arguments for the loglik (e.g. x/n is
n
# data) I(π̂ML ) = .
) π̂ML (1 − π̂ML )
{
## target function
Using Result 2.1, we obtain that
f <- function(theta, ...)  −2
1 n
loglik(theta, ...) - loglik(thetaMl, ...) + 1/2*qchisq(1-alpha, df=1) I{log(π̂ML )} = I(π̂ML ) · = · π̂ 2 ,
π̂ML π̂ML (1 − π̂ML ) ML
## determine the borders of the likelihood interval
eps <- sqrt(.Machine$double.eps) # stay a little from the boundaries and
lowerBound <- uniroot(f, interval = c(lower + eps, thetaMl), ...)$root  −2
1 n
upperBound <- uniroot(f, interval = c(thetaMl, upper - eps), ...)$root
I{log(1 − π̂ML )} = I(π̂ML ) · − = · (1 − π̂ML )2 .
1 − π̂ML π̂ML (1 − π̂ML )
return(c(lower = lowerBound, upper = upperBound))
} By Result 5.2, we thus have that
> thetaProfilCi <- s
likelihoodCi(alpha = 0.05, loglik = profilLoglik, thetaMl = thetaMl,
d+ )} = pI{log(π̂ )}−1 + I{log(1 − π̂ )}−1 = 1 − π̂x π̂y
lower = 0.01, upper = 20) se{log(LR +
n(1 − π̂y )
x y
> thetaProfilCi mπ̂x
lower upper
0.6766635 19.2217991 and
s
In comparison to the Wald confidence intervals with logarithmic transformation, d− )} = pI{log(1 − π̂ )}−1 + I{log(π̂ }−1 = π̂x 1 − π̂y
se{log(LR + .
m(1 − π̂x )
x y
the interval is almost twice as wide and implies therefore bigger uncertainty in nπ̂y
the estimation of θ. Also this time it does not appear (at the 5% level) that the
d+ = (95/100)/(1−90/100) = 9.5, and
The estimated positive likelihood ratio is LR
use of diuretics is associated with a higher risk for preeclampsia.
d− = (1 − 95/100)/(90/100) = 0.056.
the estimated negative likelihood ratio is LR
5. Suppose that ML estimates of sensitivity πx and specificity πy of a diagnostic test for
d+ ) = log(9.5) = 2.251, log(LR
Their logarithms are thus estimated by log(LR d− ) =
a specific disease are obtained from independent binomial samples X ∼ Bin(m, πx )
and Y ∼ Bin(n, πy ), respectively. log(0.056) = −2.89, with the 95% confidence intervals

a) Use Result 5.2 to compute the standard error of the logarithm of the positive and [2.251 − 1.96 · 0.301, 2.251 + 1.96 · 0.301] = [1.662, 2.841].
negative likelihood ratio, defined as LR+ = πx /(1 − πy ) and LR− = (1 − πx )/πy .
and
Suppose m = n = 100, x = 95 and y = 90. Compute a point estimate and the
[−2.89 − 1.96 · 0.437, −2.89 + 1.96 · 0.437] = [−3.747, −2.034].
limits of a 95% confidence interval for both LR+ and LR− , using the standard
error from above. Back-transformation to the positive and negative likelihood ratio scale by ex-
◮ As in Example 5.8, we may use the invariance of the MLEs to conclude that ponentiating gives the following 95% confidence intervals [5.268, 17.133] and
[0.024, 0.131].
d+ = π̂x x1 /m nx1
LR = = b) The positive predictive value PPV is the probability of disease, given a positive
1 − π̂y 1 − x2 /n nm − mx2
test result. The equation
d− = 1 − π̂x = 1 − x1 /m = nm − nx1 . PPV
and LR = LR+ · ω
π̂y x2 /n mx2 1 − PPV
Now, relates PPV to LR+ and to the pre-test odds of disease ω. Likewise, the follow-
ing equation holds for the negative predictive value NPV, the probability to be
log(LR+ ) = log(πx ) − log(1 − πy ) and log(LR− ) = log(1 − πx ) − log(πy ), disease-free, given a negative test result:
1 − NPV
= LR− · ω.
NPV
64 5 Likelihood inference in multiparameter models 65

Suppose ω = 1/1000. Use the 95% confidence interval for LR+ and LR−, ob-
tained in 5a), to compute the limits of a 95% confidence interval for both PPV
and NPV.
◮ By multiplying the endpoints of the confidence intervals for LR+ and LR−
obtained in 5a) by ω, we obtain confidence intervals for PPV/(1 − PPV) and
(1 − NPV)/NPV, respectively. It remains to transform these intervals to the
scales of PPV and NPV. Now, if the confidence interval for PPV/(1 − PPV) is
[l, u], then the confidence interval for PPV is [l/(1 + l), u/(1 + u)]; and if the confi-
dence interval for (1 − NPV)/NPV is [l, u], then the confidence interval for NPV
is [1/(1 + u), 1/(1 + l)]. With our data, we obtain the following confidence in-
tervals for PPV/(1 − PPV) and (1 − NPV)/NPV, respectively: [0.00527, 0.01713]
and [2.4e − 05, 0.000131]; and the following confidence intervals for PPV and
NPV, respectively: [0.00524, 0.01684] and [0.999869, 0.999976].
6. In the placebo-controlled clinical trial of diuretics during pregnancy to prevent
preeclampsia by Fallis et al. (cf. Table 1.1), 6 out of 38 treated women and 18 out
of 40 untreated women got preeclampsia.
a) Formulate a statistical model assuming independent binomial distributions
in the two groups. Translate the null hypothesis “there is no difference in
preeclampsia risk between the two groups” into a statement on the model pa-
rameters.
◮ We consider two independent random variables X1 ∼ Bin(n1, π1) and
X2 ∼ Bin(n2, π2) modelling the number of preeclampsia cases in the treat-
ment and control group, respectively. Here the sample sizes are n1 = 38 and
n2 = 40; and we have observed the realisations x1 = 6 and x2 = 18. No differ-
ence in preeclampsia risk between the two groups is expressed in the hypothesis
H0: π1 = π2.
b) Let θ denote the risk difference between treated and untreated women. Derive
the MLE θ̂ML and a 95% Wald confidence interval for θ. Also give the MLE for
the number needed to treat (NNT) which is defined as 1/θ.
◮ In terms of the model parameters, we have θ = π1 − π2. By the invariance
of the MLEs, we obtain the MLE of θ as

    θ̂ML = π̂1 − π̂2 = x1/n1 − x2/n2.

To derive the standard error, we may use Result 5.2 and Example 2.10 to con-
clude that

    se(θ̂ML) = √{π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2}.

It follows that a 95% Wald confidence interval for θ is given by

    [π̂1 − π̂2 − z0.975 √{π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2},
     π̂1 − π̂2 + z0.975 √{π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2}].

We can also get the MLE for the number needed to treat as

    N̂NT = 1/θ̂ML = 1/|x1/n1 − x2/n2|.

For the concrete data example:
> ## the data
> x <- c(6, 18)
> n <- c(38, 40)
> ## MLEs
> pi1Hat <- x[1] / n[1]
> pi2Hat <- x[2] / n[2]
> (thetaHat <- pi1Hat - pi2Hat)
[1] -0.2921053
> (nntHat <- 1 / abs(thetaHat))
[1] 3.423423
> ## Wald CI
> seTheta <- sqrt(pi1Hat * (1 - pi1Hat) / n[1] +
                  pi2Hat * (1 - pi2Hat) / n[2])
> (thetaWald <- thetaHat + c(-1, 1) * qnorm(0.975) * seTheta)
[1] -0.48500548 -0.09920505
c) Write down the joint log-likelihood kernel l(π1, θ) of the risk parameter π1 in
the treatment group and θ. In order to derive the profile log-likelihood of θ,
consider θ as fixed and write down the score function for π1,

    Sπ1(π1, θ) = ∂/∂π1 l(π1, θ).

Which values are allowed for π1 when θ is fixed?
◮ The joint likelihood kernel of π1 and π2 is

    L(π1, π2) = π1^x1 (1 − π1)^(n1−x1) · π2^x2 (1 − π2)^(n2−x2).

We can rewrite the risk in the control group as π2 = π1 − θ. Plugging this into
the log-likelihood kernel of π1 and π2 gives the joint log-likelihood kernel of risk
in the treatment group π1 and the risk difference θ as

    l(π1, θ) = x1 log(π1) + (n1 − x1) log(1 − π1) + x2 log(π1 − θ) + (n2 − x2) log(1 − π1 + θ).

The score function for π1 is thus given by

    Sπ1(π1, θ) = d/dπ1 l(π1, θ)
               = x1/π1 + (x1 − n1)/(1 − π1) + x2/(π1 − θ) + (x2 − n2)/(1 − π1 + θ).

If we want to solve the score equation Sπ1(π1, θ) = 0, we must think about the
allowed range for π1: Of course, we have 0 < π1 < 1. And we also have this for
the second proportion π2, giving 0 < π1 − θ < 1 or θ < π1 < 1 + θ. Altogether we
have the range max{0, θ} < π1 < min{1, 1 + θ}.
66 5 Likelihood inference in multiparameter models 67

d) Write an R-function which solves Sπ1 (π1 , θ) = 0 (use uniroot) and thus gives an Tab. 5.1: Probability of offspring’s blood group given allele frequencies
in parental generation, and sample realisations.
estimate π̂1 (θ). Hence write an R-function for the profile log-likelihood lp (θ) =
l{π̂1 (θ), θ}. Blood group Probability Observation

A={AA,A0} π1 = p2 + 2pr x1 = 182
> ## the score function for pi1:
> pi1score <- function(pi1, theta) B={BB,B0} π2 = q 2 + 2qr x2 = 60
{ AB={AB} π3 = 2pq x3 = 17
x[1] / pi1 + (x[1] - n[1]) / (1 - pi1) +
x[2] / (pi1 - theta) + (x[2] - n[2]) / (1 - pi1 + theta) 0={00} π4 = r2 x4 = 176
}
> ## get the MLE for pi1 given fixed theta:
> getPi1 <- function(theta)
{ The two 95% confidence intervals are quite close, and neither contains the ref-
eps <- 1e-9 erence value zero. Therefore, the P -value for testing the null hypothesis of no
uniroot(pi1score,
interval=
risk difference between the two groups against the two-sided alternative must be
c(max(0, theta) + eps, smaller than 5%.
min(1, 1 + theta) - eps),
theta=theta)$root 7. The AB0 blood group system was described by Karl Landsteiner in 1901, who
} was awarded the Nobel Prize for this discovery in 1930. It is the most important
> ## the joint log-likelihood kernel:
> loglik <- function(pi1, theta)
blood type system in human blood transfusion, and comprises four different groups:
{ A, B, AB and 0.
x[1] * log(pi1) + (n[1] - x[1]) * log(1 - pi1) + Blood groups are inherited from both parents. Blood groups A and B are dominant
x[2] * log(pi1 - theta) + (n[2] - x[2]) * log(1 - pi1 + theta)
} over 0 and codominant to each other. Therefore a phenotype blood group A may
> ## so we have the profile log-likelihood for theta: have the genotype AA or A0, for phenotype B the genotype is BB or B0, for
> profLoglik <- function(theta)
{
phenotype AB there is only the genotype AB, and for phenotype 0 there is only
pi1Hat <- getPi1(theta) the genotype 00.
loglik(pi1Hat, theta) Let p, q and r be the proportions of alleles A, B, and 0 in a population, so p+q+r = 1
}
and p, q, r > 0. Then the probabilities of the four blood groups for the offspring
e) Compute a 95% profile likelihood confidence interval for θ using numerical tools. generation are given in Table 5.1. Moreover, the realisations in a sample of size
Compare it with the Wald interval. What can you say about the P -value for n = 435 are reported.
the null hypothesis from 6a)?
◮ a) Explain how the probabilities in Table 5.1 arise. What assumption is tacitly
> ## now we need the relative profile log-likelihood, made?
> ## and for that the value at the MLE: ◮ The core assumption is random mating, i. e. there are no mating restric-
> profLoglikMle <- profLoglik(thetaHat)
tions, neither genetic or behavioural, upon the population, and that therefore all
> relProfLoglik <- function(theta)
{ recombination is possible. We assume that the alleles are independent, so the
profLoglik(theta) - profLoglikMle probability of the haplotype a1 /a2 (i. e. the alleles in the order mother/father)
}
> ## now compute the profile CI bounds: is given by Pr(a1 ) Pr(a2 ) when Pr(ai ) is the frequency of allele ai in the popu-
> lower <- uniroot(function(theta){relProfLoglik(theta) + 1.92}, lation. Then we look at the haplotypes which produce the requested phenotype,
c(-0.99, thetaHat))
and sum their probabilities to get the probability for the requested phenotype. For
> upper <- uniroot(function(theta){relProfLoglik(theta) + 1.92},
c(thetaHat, 0.99)) example, phenotype A is produced by the haplotypes A/A, A/0 and 0/A, having
> (profLogLikCi <- c(lower$root, upper$root)) probabilities p · p, p · r and r · p, and summing up gives π1 .
[1] -0.47766117 -0.09330746
> ## compare with Wald interval b) Write down the log-likelihood kernel of θ = (p, q)⊤ . To this end, assume that x =
> thetaWald (x1 , x2 , x3 , x4 )⊤ is a realisation from a multinomial distribution with parameters
[1] -0.48500548 -0.09920505
68 5 Likelihood inference in multiparameter models 69

n = 435 and π = (π1 , π2 , π3 , π4 )⊤ . > (thetaCov <- solve(- optimResult$hessian))


◮ We assume that [,1] [,2]
[1,] 0.0002639946 -0.0000280230
 [2,] -0.0000280230 0.0001023817
X = (X1 , X2 , X3 , X4 )⊤ ∼ M4 n = 435, π = (π1 , π2 , π3 , π4 )⊤ . > (thetaSe <- sqrt(diag(thetaCov)))
[1] 0.01624791 0.01011839
The log-likelihood kernel of π is
So we have p̂ML = 0.264 and q̂ML = 0.093 with corresponding standard errors
4
X se(p̂ML ) = 0.016 and se(q̂ML ) = 0.01.
l(π) = xi log(πi ),
i=1
d) Finally compute r̂ML and se(r̂ML ). Make use of Section 5.4.3.
◮ By the invariance of the MLE we have r̂ML = 1− p̂ML − q̂ML = 0.642. Moreover,
and by inserting the parametrisation from Table 5.1, we obtain the log-likelihood
!
kernel of p and q as p
r = 1 − p − q = 1 + (−1, −1) = g(p, q)
l(p, q) = x1 log{p2 + 2p(1 − p − q)} + x2 log{q 2 + 2q(1 − p − q)} q

+ x3 log(2pq) + x4 log{(1 − p − q)2 }. we can apply the multivariate delta method in the special case of a linear trans-
formation g(θ) = a⊤ · θ + b, and we have D(θ̂ ML ) = a⊤ . Thus,
Note that we have used here that r = 1 − p − q, so there are only two parameters
in this problem.  q
se g(θ̂ML ) = a⊤ I(θ̂ ML )−1 a.
c) Compute the MLEs of p and q numerically, using the R function optim. Use
the option hessian = TRUE in optim and process the corresponding output to In our case, θ = (p, q)⊤ , a⊤ = (−1, −1), and b = 1, so
receive the standard errors of p̂ML and q̂ML . v ! v
u u 2
◮  u −1 uX
t
se(r̂ML ) = se g(p, q) = (−1, −1)I(p̂ML , q̂ML )−1 =t {I(p̂ML , q̂ML )−1 }ij .
> ## observed data: −1 i,j=1
> data <- c(182, 60, 17, 176)
> n <- sum(data)
> ## the loglikelihood function of theta = (p, q) > (rMl <- 1 - sum(thetaMl))
> loglik <- function(theta,data) {
[1] 0.6423991
p <- theta[1]
> (rSe <- sqrt(sum(thetaCov)))
q <- theta[2]
r <- 1-p-q [1] 0.01761619

## check for valid parameters:


We thus have r̂ML = 0.642 and se(r̂ML ) = 0.018.
if ((p>0) && (p<1) && (r>0) && (r<1) && (q>0) && (q<1)) { e) Create a contour plot of the relative log-likelihood and mark the 95% likelihood
probs <- c(p^2+2*p*r,q^2+2*q*r, 2*p*q, r^2) confidence region for θ. Use the R-functions contourLines and polygon for
return(dmultinom(data,prob=probs,size=sum(data),log=TRUE))
} else { # if not valid, return NA sketching the confidence region.
return(NA) ◮
}
> ## fill grid with relative log-likelihood values
}
> gridSize <- 50
> ## numerically optimise the log-likelihood
> eps <- 1e-3
> optimResult <- optim(c(0.1,0.3), loglik,
> loglikgrid <- matrix(NA, gridSize, gridSize)
control = list(fnscale=-1),
hessian = TRUE, # also return the hessian! > p <- seq(eps,1,length=gridSize)
> q <- seq(eps,1,length=gridSize)
data = data)
> for (i in 1:length(p)) {
> ## check convergence:
for (j in 1:length(q)) {
> optimResult[["convergence"]] == 0
loglikgrid[i,j] <-
[1] TRUE
loglik(c(p[i],q[j]),data=data) - loglik(thetaMl, data = data)
> ## and extract MLEs and standard errors }
> (thetaMl <- optimResult$par) }
[1] 0.26442773 0.09317313 > ## plot
70 5 Likelihood inference in multiparameter models 71

> contour(p,q, [1] 1.438987


loglikgrid, > (Chi2 <- sum((x - e)^2 / x))
nlevels = 50, [1] 1.593398
xlab=expression (p),ylab= expression (q), xaxs = "i", yaxs = "i")
> ## add confidence region and MLE: We have k = 4 categories, i. e. 3 free probabilities, and r = 2 free parameters (p
> region <- contourLines(x = p, y = q, z = loglikgrid, levels = log (0.05))[[1]]
> polygon(region$x, region$y, density = NA, col = "gray")
and q). Under the null hypothesis that the model is correct, both test statistics
> points(thetaMl[1], thetaMl[2], pch = 19, col = 2) have asymptotically χ2 (1) distribution. The corresponding 95%-quantile is 3.84.
1.0 Since neither test statistic exceeds this threshold, the null hypothesis cannot be
900
−1 −16
00
−15
rejected at the 5% significance level.
−1450

−1050 00
0.8 −10
00 −14

8. Let T ∼ t(n − 1) be a standard t random variable with n − 1 degrees of freedom.


−800 00
−750 −90 −13
0 50
−85
−600 0 −13
−500 00
0.6 −70 −1
−400
0 250
−1
20
a) Derive the density function of the random variable
q

0
−1
150  
0.4 −1
100
T2
−9
50 W = n log 1 + ,
−50
n−1
0.2

−1
20 0
see Example 5.15, and compare it graphically with the density function of the
χ2 (1) distribution for different values of n.
−150 −250 −200 −100 −300 −350 −450 −550 −650

0.2 0.4 0.6 0.8 1.0


◮ W = g(T 2 ) is a one-to-one differentiable transformation of T 2 , so we can
p
apply the change-of-variables formula (Appendix A.2.3) to derive the density of
f ) Use the χ2 and the G2 test statistic to analyse the plausibility of the modeling W . We first need to derive the density of T 2 from that of T . This is not a one-
assumptions. to-one transformation and we will derive the density directly as follows. For
◮ The fitted (“expected”) frequencies ei in the restricted model are x ≥ 0, FT 2 the distribution function of T 2 , and FT the distribution function of
T , we have
e1 = n(p̂2ML + 2p̂ML r̂ML ) = 178.201,
2
e2 = n(q̂ML + 2q̂ML r̂ML ) = 55.85, FT 2 (x) = Pr(T 2 ≤ x)

e3 = 2np̂ML q̂ML = 21.435 = Pr(|T | ≤ x)
√ √
und 2
e4 = nr̂ML = 179.514. = Pr(− x ≤ T ≤ x)
√ √
= FT ( x) − FT (− x)
With the observed frequencies xi , we can compute the two statistics as √
= 2FT ( x) − 1.
4   4
X xi X (xi − ei )2
G2 = 2 xi log and χ2 = . The last equality follows from the symmetry of the t distribution around 0. The
ei ei
i=1 i=1
density fT 2 of T 2 is thus
In R, we can compute the values as follows. √
d √ 1 1 fT ( x)
fT 2 (x) = FT 2 (x) = 2fT ( x) · x− 2 = √ ,
2
> ## values
> (x <- data)
dx x
[1] 182 60 17 176
where fT is the density of T .
> (e <- n * c(thetaMl[1]^2 + 2 * thetaMl[1] * rMl,
thetaMl[2]^2 + 2 * thetaMl[2] * rMl, Now, the inverse transformation corresponding to g is
2 * prod(thetaMl), 
rMl^2 g −1 (w) = (n − 1) exp(w/n) − 1 .
))
[1] 178.20137 55.84961 21.43468 179.51434
> ## Statistics
> (G2 <- 2 * sum(x * log(x / e)))
72 5 Likelihood inference in multiparameter models 73

Using the change-of-variables formula, we obtain the following form for the den- Indeed we can see that the differences between the densities are small already for
sity fW of W : n = 20.

d −1
b) Show that for n → ∞, W follows indeed a χ2 (1) distribution.
fW (x) = fT 2 {g (x)} g (x)
−1 ◮ Since T −→ N(0, 1) as n → ∞, we have that T 2 −→ χ2 (1) as n → ∞. Further,
D D
dx
hq  i the transformation g is for large n close to identity, as
fT (n − 1) exp(x/n) − 1 n−1 n
= exp(x/n). x n o
q  g(x) = log 1+
(n − 1) exp(x/n) − 1 n n−1

and {1 + x/(n − 1)}n → exp(x) as n → ∞. Altogether therefore W = g(T 2 ) −→


D
We can now draw the resulting density.
> ## implementation of the density
χ2 (1) as n → ∞.
> densityW <- function(x, n) 9. Consider the χ2 statistic given k categories with n observations. Let
{
y <- sqrt((n - 1) * (exp(x / n) - 1))  
(ni − npi0 )2
k
X k
X ni
dt(y, df = n - 1) / y * (n - 1) / n * exp(x / n) Dn = and Wn = 2 ni log .
} npi0 npi0
i=1 i=1
> ## testing the density through a comparison with the histogram
> ## set.seed(1234)
→ 0 for n → ∞.
P
> Show that Wn − Dn −
> ## n <- 10 ◮ In the notation of Section 5.5, we now have ni = xi the observed frequencies
and npi0 = ei the expected frequencies in a multinomial model with true probabilities
> ## m <- 1e+5
> ## T2 <- rt(m, df = n - 1)^2
> ## W <- n * log(1 + T2 / (n - 1)) πi , and pi0 are maximum likelihood estimators under a certain model. If that model
> ## hist(W[W < 5], breaks = 100, prob = TRUE) is true, then both (ni /n − πi ) and (pi0 − πi ) converge to zero in probability and both
> grid <- seq(from = 0.01, to = 5, length = 101) √ √
> ## lines (grid, densityW(grid, n = n), col = 2) n(ni /n − πi ) and n(pi0 − πi ) are bounded in probability as n → ∞, the first one
> by the central limit theorem and the second one by the standard likelihood theory.
> ## plot for different values of n
We can write
> par(mfrow = c(2, 2))
Xk  
ni ni /n − pi0
Wn = 2n log 1 +
> for(n in c(2, 5, 20, 100)){
## density of W
.
n pi0
plot(grid, densityW(grid, n = n), type = "l", i=1
xlab = expression(x), ylab = expression(f(x)), The Taylor expansion (cf. Appendix B.2.3) of log(1 + x) around x = 0 yields
ylim = c(0, 4), main = paste("n =", n)
) 1
log(1 + x) = x − x2 + O(x3 ),
## density of chi^2(1)
2
lines(grid, dchisq(grid, df = 1), lty = 2, col = 2)
} cf. the Landau notation (Appendix B.2.6). By plugging this result into the equation
n=2 n=5
4 4 above, we obtain that
3 3
k n
"  2 #
X n o n /n − p 1 ni /n − pi0
f(x)

f(x)

2 2
3
Wn = 2n pi0 + + O{(ni /n − pi0 ) }
i i i0
− pi0 · −
1 1
n pi0 2 pi0
0 0 i=1
k  
0 1 2 3 4 5 0 1 2 3 4 5 X 1 (ni /n − pi0 )2
x x = 2n (ni /n − pi0 ) + · + O{(ni /n − pi0 )3 }
4 n = 20 4 n = 100 2 pi0
i=1
3 3
k
X
f(x)

f(x)

2 2
= Dn + n O{(ni /n − pi0 )3 },
1 1
i=1
0 0

0 1 2 3 4 5 0 1 2 3 4 5 where the last equality follows from the fact that both ni /n and pi0 must sum up to
x x one.
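The convergence Wn − Dn → 0 can also be seen numerically. The following sketch is only an illustration under a simplifying assumption: it uses a simple null hypothesis with fixed, hypothetical category probabilities pi0 (so no parameters are estimated) and computes both statistics for increasing n.

    ## hedged illustration of W_n - D_n -> 0 with fixed hypothetical pi0
    set.seed(2025)
    pi0 <- c(0.1, 0.2, 0.3, 0.4)
    for (n in c(100, 1000, 10000)) {
      ni <- as.vector(rmultinom(1, size = n, prob = pi0))
      Dn <- sum((ni - n * pi0)^2 / (n * pi0))
      Wn <- 2 * sum(ifelse(ni == 0, 0, ni * log(ni / (n * pi0))))  # 0*log(0) treated as 0
      cat("n =", n, " Dn =", round(Dn, 4), " Wn - Dn =", round(Wn - Dn, 5), "\n")
    }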
74 5 Likelihood inference in multiparameter models 75

Since both (ni /n − πi ) and (pi0 − πi ) converge to zero in probability and both > n <- 100
√ √
n(ni /n − πi ) and n(pi0 − πi ) are bounded in probability as n → ∞, the last > gridSize <- 100
> theta1grid <- seq(0.8,1,length=gridSize)
→ 0 as n → ∞.
P
sum converges to zero in probability as n → ∞, i. e. Wn − Dn − > theta2grid <- seq(0.3,0.6,length=gridSize)
10.In a psychological experiment the forgetfulness of probands is tested with the recog- > loglik <- matrix(NA, nrow=length(theta1grid), ncol=length(theta2grid))
> for(i in 1:length(theta1grid)){
nition of syllable triples. The proband has ten seconds to memorise the triple, af- for(j in 1:length(theta2grid)){
terwards it is covered. After a waiting time of t seconds it is checked whether the loglik[i, j] <-
loglik.fn(theta1=theta1grid[i], theta2=theta2grid[j], n=n, y=y, t=t)
proband remembers the triple. For each waiting time t the experiment is repeated }
n times. }
Let y = (y1 , . . . , ym ) be the relative frequencies of correctly remembered syllable > contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]),
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1)
triples for the waiting times of t = 1, . . . , m seconds. The power model now assumes,
0.60
37 34 31 28 −317
that −3 −3 −3 −3

−327
333 30 27
π(t; θ) = θ1 t−θ2 , 0 ≤ θ1 ≤ 1, θ2 > 0, −3 −315
0.55 − −3
29
−3
is the probability to correctly remember a syllable triple after the waiting time 0.50

t ≥ 1. 0.45
−314

θ2
−316
a) Derive an expression for the log-likelihood l(θ) where θ = (θ1 , θ2 ). 0.40 −318
−319
◮ If the relative frequencies of correctly remembered triples are independent for −321
−322
−320
−323
−324
0.35 −325 41
different waiting times, then the likelihood kernel is −328
−326 −327
−3 2
−329 −330 −338
−332 −331 35
m 0.30 −334 −333 −335 −337 −339 −342 −346 −
Y
L(θ1 , θ2 ) = (θ1 t−θ
i ) (1 − θ1 t−θ
2 nyi
i )
2 n−nyi 0.80 0.85 0.90 0.95 1.00

i=1 θ1
Pm m
Y
n yi
= θ1 i=1
ti−θ2 nyi (1 − θ1 t−θ
i )
2 n−nyi
. 11.Often the exponential model is used instead of the power model (described in Ex-
i=1 ercise 10), assuming:
The log-likelihood kernel is thus
π(t; θ) = min{1, θ1 exp(−θ2 t)}, t > 0, θ1 > 0 and θ2 > 0.
m
X m
X m
X
l(θ1 , θ2 ) = n log(θ1 ) yi − nθ2 yi log(ti ) + n (1 − yi ) log(1 − θ1 t−θ
i
2
). a) Create a contour plot of the log-likelihood in the parameter range [0.8, 1.4] ×
i=1 i=1 i=1 [0, 0.4] for the same data as in Exercise 10.
b) Create a contour plot of the log-likelihood in the parameter range [0.8, 1] × ◮
[0.3, 0.6] with n = 100 and > pi.exp <- function(theta1, theta2, t){
return(
pmin( theta1*exp(-theta2*t), 1 )
y = (0.94, 0.77, 0.40, 0.26, 0.24, 0.16), t = (1, 3, 6, 9, 12, 18). )
}
> loglik.fn <- function(theta, n, y, t){
return(
◮ sum( dbinom(x=n*y, size=n,
> loglik.fn <- function(theta1, theta2, n, y, t){ prob=pi.exp(theta1=theta[1], theta2=theta[2], t=t), log=TRUE) )
return( )
n*log(theta1)*sum(y) }
- n*theta2*sum(y*log(t)) > y <- c(0.94,0.77,0.40,0.26,0.24,0.16)
+ n*sum((1-y)*log(1-theta1*t^(-theta2))) > t <- c(1, 3, 6, 9, 12, 18)
) > n <- 100
} > gridSize <- 100
> y <- c(0.94,0.77,0.40,0.26,0.24,0.16) > theta1grid <- seq(0.8,1.4,length=gridSize)
> t <- c(1, 3, 6, 9, 12, 18) > theta2grid <- seq(0,0.4,length=gridSize)
76 5 Likelihood inference in multiparameter models 77

0.4
> loglik <- matrix(NA, nrow=length(theta1grid), ncol=length(theta2grid)) −320 −300 −280 −260 −240 −220 −200
−180
> for(i in 1:length(theta1grid)){ −160
−140
for(j in 1:length(theta2grid)){ 0.3 −120
loglik[i, j] <- −100
−80
loglik.fn(theta=c(theta1grid[i], theta2grid[j]), n=n, y=y, t=t)
−60
} 0.2

θ2
−40
}
> contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]),
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1) −20
0.1
0.4 −40 −60
−320 −300 −280 −260 −240 −220 −200 40
−180 −100 −120 −160 −1
−160 0.0 −340 −420 −540 −7
−76400
−140
0.3 −120
0.8 0.9 1.0 1.1 1.2 1.3 1.4
−100
−80 θ1
−60
0.2
c) For 0 ≤ t ≤ 20 create a plot of π(t; θ̂ML ) and add the observations y.
θ2

−40


−20
0.1 > tt <- seq(1, 20, length=101)
−40 −60
> plot(tt, pi.exp(theta1=thetaMl[1], theta2=thetaMl[2], t=tt), type="l",
40
−100 −120 −160 −1 xlab= math (t), ylab= math (pi(t, hat(theta)[ML])) )
0.0 −340 −420 −540 −7400
−76 > points(t, y, pch = 19, col = 2)
0.8 0.9 1.0 1.1 1.2 1.3 1.4

θ1
0.8
Note that we did not evaluate the likelihood values in the areas where π(t; θ) = 1
for some t. These are somewhat particular situations, because when π = 1,

θML)
0.6
the corresponding likelihood contribution is 1, no matter what we observe as the

π(t, ^
corresponding y and no matter what the values of θ1 and θ2 are. 0.4
b) Use the R-function optim to numerically compute the MLE θ̂ ML . Add the MLE
to the contour plot from 11a). 0.2

> ## numerically optimise the log-likelihood
5 10 15 20
> optimResult <- optim(c(1.0, 0.1), loglik.fn,
control = list(fnscale=-1), t
n=n, y=y, t=t)
> ## check convergence: 12.Let X1:n be a random sample from a log-normal LN(µ, σ 2 ) distribution, cf . Ta-
> optimResult[["convergence"]] == 0
ble A.2.
[1] TRUE
> ## extract the MLE a) Derive the MLE of µ and σ 2 . Use the connection between the densities of the
> thetaMl <- optimResult$par
normal distribution and the log-normal distribution. Also compute the corre-
> ## add the MLE to the plot
> contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]), sponding standard errors.
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1) ◮ We know that if X is normal, i. e. X ∼ N(µ, σ 2 ), then exp(X) ∼ LN(µ, σ 2 )
> points(thetaMl[1], thetaMl[2], pch = 19, col = 2)
(cf. Table A.2). Thus, if we have a random sample X1:n from log-normal distri-
bution LN(µ, σ 2 ), then Y1:n = {log(X1 ), . . . , log(Xn )} is a random sample from
normal distribution N(µ, σ 2 ). In Example 5.3 we computed the MLEs in the
normal model:
1X
n
2
µ̂ML = ȳ and σ̂ML = (yi − ȳ)2 ,
n
i=1
78 5 Likelihood inference in multiparameter models 79

and in Section 5.2 we derived the corresponding standard errors se(µ̂ML) = σ̂ML/√n
and se(σ̂²ML) = σ̂²ML √(2/n).
b) Derive the profile log-likelihood functions of µ and σ² and plot them for the
   following data:

    x = (225, 171, 198, 189, 189, 135, 162, 136, 117, 162).

Compare the profile log-likelihood functions with their quadratic approximations.
◮ In Example 5.7, we derived the profile likelihood functions of µ and σ² for
the normal model. Together with the relationship between the density of a log-normal
and a normal random variable we obtain that, up to additive constants,
the profile log-likelihood function of µ is

    lp(µ) = −(n/2) log{ (ȳ − µ)² + (1/n) Σ_{i=1}^{n} (yi − ȳ)² }
          = −(n/2) log{ (µ̂ML − µ)² + σ̂²ML },

whereas the profile log-likelihood function of σ² is

    lp(σ²) = −(n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − ȳ)²
           = −(n/2) { log(σ²) + σ̂²ML/σ² }.

For our data, they can be evaluated as follows.
> x <- c(225, 171, 198, 189, 189, 135, 162, 136, 117, 162)
> y <- log(x)
> loglik.mu <- function(y, mu){
      n <- length(y)
      return( -n/2*log( (mean(y)-mu)^2 + (n-1)/n*var(y) ) )
  }
> loglik.sigma2 <- function(y, sigma2){
      n <- length(y)
      return( -n/2*log(sigma2) - (n-1)*var(y)/2/sigma2 )
  }
> par(mfrow=c(1, 2))
> mu.x <- seq(4, 6, length=101)
> plot(mu.x, loglik.mu(y=y, mu=mu.x),
       type="l", xlab=math(mu), ylab=math(L[p](mu)))
> sigma2.x <- seq(0.02, 0.06, length=101)
> plot(sigma2.x, loglik.sigma2(y=y, sigma2=sigma2.x),
       type="l", xlab=math(sigma^2), ylab=math(L[p](sigma^2)))

[Figure: the profile log-likelihoods Lp(µ) (for µ between 4 and 6) and Lp(σ²) (for σ² between 0.02 and 0.06).]

Denoting θ = (µ, σ²)⊤, and recalling that the inverse observed Fisher information
that we derived in Section 5.2 for the normal model is

    I(θ̂ML)⁻¹ = diag( σ̂²ML/n , 2σ̂⁴ML/n ),

we obtain the quadratic approximations

    l̃p(µ) ≈ −(n/2) · (µ − µ̂ML)²/σ̂²ML    and    l̃p(σ²) ≈ −(n/4) · (σ² − σ̂²ML)²/σ̂⁴ML

to the corresponding relative profile log-likelihoods. We can compare the relative
profile log-likelihoods to their quadratic approximations around the MLEs as follows.
> relloglik.mu <- function(y, mu){
      n <- length(y)
      return(
          -n/2*log( (mean(y)-mu)^2 + (n-1)/n*var(y) )
          +n/2*log( (n-1)/n*var(y) )
      )
  }
> relloglik.sigma2 <- function(y, sigma2){
      n <- length(y)
      hatsigma2 <- (n-1)/n*var(y)
      return(
          -n/2*log(sigma2) - (n-1)*var(y)/2/sigma2
          +n/2*log(hatsigma2) + n/2
      )
  }
> approxrelloglik.mu <- function(y, mu){
      n <- length(y)
      hatsigma2 <- (n-1)/n*var(y)
      return(
          -n/2/hatsigma2*(mean(y)-mu)^2
      )
  }
> approxrelloglik.sigma2 <- function(y, sigma2){
      n <- length(y)
      hatsigma2 <- (n-1)/n*var(y)
      return(
          -n/4/(hatsigma2)^2*(hatsigma2-sigma2)^2
      )
  }
> par(mfrow=c(1, 2))
> mu.x <- seq(4.8, 5.4, length=101)
> plot(mu.x, relloglik.mu(y=y, mu=mu.x),
       type="l", xlab=math(mu), ylab=math(L[p](mu)))
> lines(mu.x, approxrelloglik.mu(y=y, mu=mu.x),
        lty=2, col=2)
> abline(v=mean(y), lty=2)
> sigma2.x <- seq(0.02, 0.05, length=101)
> plot(sigma2.x, relloglik.sigma2(y=y, sigma2=sigma2.x),
       type="l", xlab=math(sigma^2), ylab=math(L[p](sigma^2)))
> lines(sigma2.x, approxrelloglik.sigma2(y=y, sigma2=sigma2.x),
        lty=2, col=2)
> abline(v=(length(y)-1)/length(y)*var(y), lty=2)

[Figure: the relative profile log-likelihoods of µ and σ² (solid) together with their quadratic approximations (dashed); vertical dashed lines mark the MLEs.]
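As an optional cross-check (not part of the original solution), the closed-form MLE and standard errors from part a) can also be obtained by maximising the joint log-likelihood numerically, using the log-data y defined above; the function name negLogLik and the starting values are arbitrary illustrative choices.
> ## numerical cross-check of the MLE and standard errors for (mu, sigma2)
> negLogLik <- function(par){   # par = c(mu, sigma2)
      -sum(dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
  }
> fit <- optim(c(5, 0.05), negLogLik, method = "L-BFGS-B",
               lower = c(-Inf, 1e-6), hessian = TRUE)
> fit$par                         # numerical MLE of (mu, sigma2)
> sqrt(diag(solve(fit$hessian)))  # numerical standard errors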
13. We assume an exponential model for the survival times in the randomised placebo-controlled
trial of Azathioprine for primary biliary cirrhosis (PBC) from Section 1.1.8.
The survival times (in days) of the n = 90 patients in the placebo group are denoted
by xi with censoring indicators γi (i = 1, . . . , n), while the survival times of the
m = 94 patients in the treatment group are denoted by yi and have censoring
indicators δi (i = 1, . . . , m). The (partly unobserved) uncensored survival times follow
exponential models with rates η and θη in the placebo and treatment group,
respectively (η, θ > 0).
a) Interpret η and θ. Show that their joint log-likelihood is

    l(η, θ) = (nγ̄ + mδ̄) log(η) + mδ̄ log(θ) − η(nx̄ + θmȳ),

where γ̄, δ̄, x̄, ȳ are the averages of the γi, δi, xi and yi, respectively.
◮ η is the rate of the exponential distribution in the placebo group, so the
expected survival time is 1/η. θ is the multiplicative change of the rate for the
treatment group. The expected survival time changes to 1/(ηθ).
The likelihood function is, by independence of all patients and the distributional
assumptions (cf. Example 2.8):

    L(η, θ) = Π_{i=1}^{n} f(xi; η)^γi {1 − F(xi; η)}^(1−γi) · Π_{i=1}^{m} f(yi; η, θ)^δi {1 − F(yi; η, θ)}^(1−δi)
            = Π_{i=1}^{n} {η exp(−ηxi)}^γi {exp(−ηxi)}^(1−γi) · Π_{i=1}^{m} {ηθ exp(−ηθyi)}^δi {exp(−ηθyi)}^(1−δi)
            = η^(nγ̄+mδ̄) θ^(mδ̄) exp{−η(nx̄ + θmȳ)}.

Hence the log-likelihood is

    l(η, θ) = log{L(η, θ)} = (nγ̄ + mδ̄) log(η) + mδ̄ log(θ) − η(nx̄ + θmȳ).

b) Calculate the MLE (η̂ML, θ̂ML) and the observed Fisher information matrix I(η̂ML, θ̂ML).
◮ For calculating the MLE, we need to solve the score equations. The score
function components are

    d/dη l(η, θ) = (nγ̄ + mδ̄)/η − (nx̄ + θmȳ)    and    d/dθ l(η, θ) = mδ̄/θ − ηmȳ.

From the second score equation d/dθ l(η, θ) = 0 we get θ̂ML = δ̄/(η̂ML ȳ). Plugging
this into the first score equation d/dη l(η, θ) = 0 we get η̂ML = γ̄/x̄, and hence
θ̂ML = (x̄δ̄)/(ȳγ̄).
The ordinary Fisher information matrix is

    I(η, θ) = ( (nγ̄ + mδ̄)/η²    mȳ
                mȳ               mδ̄/θ² ),

so plugging in the MLEs gives the observed Fisher information matrix

    I(η̂ML, θ̂ML) = ( (nγ̄ + mδ̄) · (x̄/γ̄)²    mȳ
                     mȳ                     mδ̄ · {(ȳγ̄)/(x̄δ̄)}² ).

c) Show that

    se(θ̂ML) = θ̂ML · √{ (nγ̄ + mδ̄) / (nγ̄ · mδ̄) },

and derive a general formula for a γ · 100% Wald confidence interval for θ. Use giving the standard error for ψ̂ML as
Appendix B.1.1 to compute the required entry of the inverse observed Fisher
d
information. se(ψ̂ML ) = se(θ̂ML ) log(θ̂ML )

◮ From Section 5.2 we know that the standard error of θ̂ML is given by the s
square root of the corresponding diagonal element of the inverse observed Fisher nγ̄ + mδ̄ 1
= θ̂ML
information matrix. From Appendix B.1.1 we know how to compute the required nγ̄ · mδ̄ θ̂ML
s
entry:
nγ̄ + mδ̄
h i = .
(nγ̄ + mδ̄) · (x̄/γ̄)2 nγ̄ · mδ̄
I(η̂ML , θ̂ML )−1 =
22 (nγ̄ + mδ̄) · (x̄/γ̄)2 · mδ̄ · (ȳγ̄)2 /(x̄δ̄)2 − (mȳ)2
Thus, a γ · 100% Wald confidence interval for ψ is given by
(nγ̄ + mδ̄) · (x̄/γ̄)2
=  s s 
(nγ̄ + mδ̄) · m · ȳ 2 /δ̄ − (mȳ)2
log{(x̄δ̄)/(ȳγ̄)} − z(1+γ)/2 nγ̄ + mδ̄ , log{(x̄δ̄)/(ȳγ̄)} + z(1+γ)/2 nγ̄ + mδ̄  .
(nγ̄ + mδ̄) · (x̄/γ̄)2
= nγ̄ · mδ̄ nγ̄ · mδ̄
nγ̄ · m · ȳ 2 /δ̄
 2
x̄δ̄ nγ̄ + mδ̄ e) Derive the profile log-likelihood for θ. Implement an R-function which calculates
=
ȳγ̄ nγ̄ · mδ̄ a γ · 100% profile likelihood confidence interval for θ.
nγ̄ + mδ̄ ◮ Solving dη d
l(η, θ) = 0 for η gives
2
= θ̂ML .
nγ̄ · mδ̄ d
l(η, θ) = 0
Hence the standard error is dη
s nγ̄ + mδ̄
nγ̄ + mδ̄ = nx̄ + θmȳ
se(θ̂ML ) = θ̂ML · , η
nγ̄ · mδ̄ nγ̄ + mδ̄
η= ,
nx̄ + θmȳ
and the formula for a γ · 100% Wald confidence interval for θ is
 s s  hence the profile log-likelihood of θ is
x̄ δ̄ x̄δ̄ nγ̄ + m δ̄ x̄ δ̄ x̄ δ̄ nγ̄ + m δ̄
 − z(1+γ)/2 , + z(1+γ)/2 ,. lp (θ) = l(η̂ML (θ), θ)
ȳγ̄ ȳγ̄ nγ̄ · mδ̄ ȳγ̄ ȳγ̄ nγ̄ · mδ̄
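For concreteness, a small helper (not in the original solution) that evaluates this Wald interval from the group summary statistics could look as follows; the function name waldCiTheta is an arbitrary choice.
> waldCiTheta <- function(n, xbar, gammabar, m, ybar, deltabar, level = 0.95){
      thetaHat <- (xbar * deltabar) / (ybar * gammabar)
      se <- thetaHat * sqrt((n * gammabar + m * deltabar) /
                            (n * gammabar * m * deltabar))
      thetaHat + c(-1, 1) * qnorm((1 + level) / 2) * se
  }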
 
nγ̄ + mδ̄
= (nγ̄ + mδ̄) log + mδ̄ log(θ) − (nγ̄ + mδ̄).
d) Consider the transformation ψ = log(θ). Derive a γ · 100% Wald confidence nx̄ + θmȳ
interval for ψ using the delta method.
Dropping the additive terms, we obtain the profile log-likelihood as
◮ Due to the invariance of the MLE we have that
lp (θ) = mδ̄ log(θ) − (nγ̄ + mδ̄) log(nx̄ + θmȳ).
ψ̂ML = log(θ̂ML ) = log{(x̄δ̄)/(ȳγ̄)}.
To obtain a γ · 100% profile likelihood confidence interval, we need to find the
For the application of the delta method, we need the derivative of the transfor-
solutions of
mation, which is
d 1 2{lp (θ̂ML ) − lp (θ)} = χ2γ (1).
log(θ) = ,
dθ θ In R:

> ## the profile log-likelihood of theta > n <- sum(pbcFull$treat == 1)


> profLogLik <- function(theta, > xbar <- mean(pbcFull$time[pbcFull$treat == 1])
n, > gammabar <- mean(pbcFull$d[pbcFull$treat == 1])
xbar, > m <- sum(pbcFull$treat == 2)
gammabar, > ybar <- mean(pbcFull$time[pbcFull$treat == 2])
m, > deltabar <- mean(pbcFull$d[pbcFull$treat == 2])
ybar,
deltabar) Now we compute 95% confidence intervals using the three different methods:
{ > ## the MLE
m * deltabar * log(theta) - > (thetaMle <- (xbar * deltabar) / (ybar * gammabar))
(n * gammabar + m * deltabar) * log(n * xbar + theta * m * ybar) [1] 0.8685652
}
> ## the standard error:
> ## the generalised likelihood ratio statistic
> (thetaMleSe <- thetaMle * sqrt((n * gammabar + m * deltabar) /
> likRatioStat <- function(theta,
(n * gammabar * m * deltabar)))
n,
[1] 0.1773336
xbar,
gammabar, > ## the Wald interval on the original scale:
m, > (waldCi <- thetaMle + c(-1, 1) * qnorm(0.975) * thetaMleSe)
ybar, [1] 0.5209977 1.2161327
deltabar) > ## then the Wald interval derived from the log-scale:
{ > (transWaldCi <- exp(log(thetaMle) + c(-1, 1) * qnorm(0.975) *
thetaMle <- (xbar * deltabar) / (ybar * gammabar) sqrt((n * gammabar + m * deltabar) /
return(2 * (profLogLik(thetaMle, n, xbar, gammabar, m, ybar, deltabar) - (n * gammabar * m * deltabar))))
profLogLik(theta, n, xbar, gammabar, m, ybar, deltabar))) [1] 0.5821219 1.2959581
} > ## and finally the profile likelihood CI:
> ## compute a gamma profile likelihood confidence interval > (profLikCiRes <- profLikCi(0.95, n, xbar, gammabar, m, ybar, deltabar))
> profLikCi <- function(gamma, [1] 0.5810492 1.2969668
n,
xbar, The Wald interval derived from the log-scale is much closer to the profile like-
gammabar,
lihood interval, which points to a better quadratic approximation of the relative
m,
ybar, profile log-likelihood for the transformed parameter ψ.
deltabar) Now we compute two-sided P -values for testing H0 : θ = 1.
{
targetFun <- function(theta) > ## the value to be tested:
{ > thetaNull <- 1
likRatioStat(theta, n, xbar, gammabar, m, ybar, deltabar) - > ## first with the Wald statistic:
qchisq(p=gamma, df=1) > waldStat <- (thetaMle - thetaNull) / thetaMleSe
} > (pWald <- 2 * pnorm(abs(waldStat), lower.tail=FALSE))
thetaMle <- (xbar * deltabar) / (ybar * gammabar) [1] 0.458589
lower <- uniroot(targetFun, > ## second with the Wald statistic on the log-scale:
interval=c(1e-10, thetaMle))$root > transWaldStat <- (log(thetaMle) - log(thetaNull)) /
upper <- uniroot(targetFun, sqrt((n * gammabar + m * deltabar) /
interval=c(thetaMle, 1e10))$root (n * gammabar * m * deltabar))
return(c(lower, upper)) > (pTransWald <- 2 * pnorm(abs(transWaldStat), lower.tail=FALSE))
} [1] 0.4900822
> ## finally with the generalised likelihood ratio statistic:
f ) Calculate 95% confidence intervals for θ based on 13c), 13d) and 13e). Also > likStat <- likRatioStat(thetaNull, n, xbar, gammabar, m, ybar, deltabar)
compute for each of the three confidence intervals 13c), 13d) and 13e) the cor- > (pLikStat <- pchisq(likStat, df=1, lower.tail=FALSE))
responding P -value for the null hypothesis that the exponential distribution for [1] 0.4900494

the PBC survival times in the treatment group is not different from the placebo All three test statistics say that the evidence against H0 is not sufficient for the
group. rejection.
◮ First we read the data:
14.Let X1:n be a random sample from the N(µ, σ 2 ) distribution.

a) First assume that σ 2 is known. Derive the likelihood ratio statistic for testing implying the standard error of the MLE in the form
specific values of µ. √
◮ If σ 2 is known, the log-likelihood kernel is se(µ̂ML ) = I(µ)−1/2 = σ/ n.
n
1 X Therefore the γ · 100% Wald confidence interval for µ is
l(µ; x) = − (xi − µ)2
2σ 2
 √ √ 
x̄ − σ/ nz(1+γ)/2 , x̄ + σ/ nz(1+γ)/2
i=1

and the score function is


d and already found in Example 3.5. Now using the fact that the square root of
S(µ; x) = l(µ; x)
dµ the γ chi-squared quantile equals the (1 + γ)/2 standard normal quantile, as
n
1 X mentioned in Section 4.4, we have
=− 2 2(xi − µ) · (−1) ( )
2σ 2
i=1  X̄ − µ
n Pr W (µ; X) ≤ χ2γ (1) = Pr √ ≤ χ2γ (1)
1 X σ/ n
= 2 (xi − µ).
σ  q q 
i=1 X̄ − µ
= Pr − χ2γ (1) ≤ √ ≤ χ2γ (1)
The root of the score equation is µ̂ML = x̄. Hence the likelihood ratio statistic is σ/ n
 
W (µ; x) = 2{l(µ̂ML ; x) − l(µ; x)} X̄ − µ
= Pr −z(1+γ)/2 ≤ √ ≤ z(1+γ)/2
( n n
) σ/ n
1 X 2 1 X 2
 √ √
=2 − 2 (xi − x̄) + 2 (xi − µ) = Pr −X̄ − σ/ nz(1+γ)/2 ≤ −µ ≤ −X̄ + σ/ nz(1+γ)/2
2σ 2σ  √ √
( n
i=1 i=1
) = Pr X̄ + σ/ nz(1+γ)/2 ≥ µ ≥ X̄ − σ/ nz(1+γ)/2
n
1 X X   √ √ 
= 2 2
(xi − x̄ + x̄ − µ) − (xi − x̄)2 = Pr µ ∈ X̄ − σ/ nz(1+γ)/2 , X̄ + σ/ nz(1+γ)/2 .
σ
i=1 i=1
( n n n n
) As the γ · 100% likelihood ratio confidence interval is given by all values of µ with
1 X X X X
= 2 (xi − x̄)2 + 2 (xi − x̄)(x̄ − µ) + (x̄ − µ)2 − (xi − x̄)2 W (µ; X) ≤ χ2γ (1), we have shown that here it equals the γ ·100% Wald confidence
σ
i=1 i=1 i=1 i=1 interval.
n
= 2 (x̄ − µ)2 d) Now assume that µ is known. Derive the likelihood ratio statistic for testing
σ
 2 specific values of σ 2 .
x̄ − µ
= √ . ◮ If µ is known, the log-likelihood function for σ 2 is given by
σ/ n
n
b) Show that, in this special case, the likelihood ratio statistic is an exact pivot n 1 X
l(σ 2 ; x) = − log(σ 2 ) − 2 (xi − µ)2
and exactly has a χ2 (1) distribution. 2 2σ
√ i=1
◮ From Example 3.5 we know that X̄ ∼ N(µ, σ 2 /n) and Z = n/σ(X̄ − µ) ∼
N(0, 1). Moreover, from Table A.2 in the Appendix we know that Z 2 ∼ χ2 (1). and the score function is
It follows that W (µ; X) ∼ χ2 (1), and so the distribution of the likelihood ratio d
S(σ 2 ; x) = l(σ 2 ; x)
statistic does not depend on the unknown parameter µ. Therefore, the likelihood dσ 2
n
ratio statistic is an exact pivot. Moreover, the chi-squared distribution holds n 1 X
=− 2 + (xi − µ)2 .
exactly for each sample size n, not only asymptotically for n → ∞. 2σ 2(σ 2 )2
i=1
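A quick simulation (not part of the original solution) illustrates that W(µ; X) follows the χ²(1) distribution exactly even for small n; the sample size and parameter values below are arbitrary.
> set.seed(1)
> n <- 10; mu <- 2; sigma <- 1
> W <- replicate(10000, {
      x <- rnorm(n, mean = mu, sd = sigma)
      ((mean(x) - mu) / (sigma / sqrt(n)))^2
  })
> qqplot(qchisq(ppoints(length(W)), df = 1), W,
         xlab = "chi^2(1) quantiles", ylab = "simulated W")
> abline(0, 1, lty = 2)  # points should lie close to the diagonal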
c) Show that, in this special case, the corresponding likelihood ratio confidence
interval equals the Wald confidence interval. By solving the score equation S(σ 2 ; x) = 0 we obtain the MLE
◮ The Fisher information is n
2 1X
d n σ̂ML = (xi − µ)2 .
I(µ; x) = − S(µ; x) = 2 , n
i=1
dµ σ

The likelihood ratio statistic thus is > ## define the data


> n <- 185
W (σ 2 ; x) = 2{l(σ̂ML
2
; x) − l(σ 2 ; x)} > mu <- 2449.2
  > mlsigma2 <- (237.8)^2
n 2 nσ̂ 2 n 2
nσ̂ML > ## the log-likelihood
= 2 − log(σ̂ML ) − 2ML + log(σ 2 ) +
2 2σ̂ML 2 2σ 2 > loglik <- function(sigma2)
 2  2
{
σ̂ML σ̂ - n / 2 * log(sigma2) - n * mlsigma2 / (2 * sigma2)
= −n log − n + n ML
σ2 σ2 }
 2  2   > ## the relative log-likelihood
σ̂ML σ̂ML > rel.loglik <- function(sigma2)
=n − log − 1 .
σ2 σ2 {
loglik(sigma2) - loglik(mlsigma2)
e) Compare the likelihood ratio statistic and its distribution with the exact pivot }
> ## the likelihood ratio statistic
mentioned in Example 4.21. Derive a general formula for a confidence interval > likstat <- function(sigma2)
based on the exact pivot, analogously to Example 3.8. {
n * (mlsigma2 / sigma2 - log(mlsigma2 / sigma2) - 1)
◮ In Example 4.21 we encountered the exact pivot }
2
> ## find the bounds of the likelihood ratio CI
σ̂ML
V (σ 2 ; X) = n ∼ χ2 (n) = G(n/2, 1/2). > lik.lower <- uniroot(function(sigma2){likstat(sigma2) - qchisq(0.95, df=1)},
σ2 interval=c(0.5*mlsigma2, mlsigma2))$root
> lik.upper <- uniroot(function(sigma2){likstat(sigma2) - qchisq(0.95, df=1)},
It follows that interval=c(mlsigma2, 2*mlsigma2))$root
V (σ 2 ; X)/n ∼ G(n/2, n/2). > (lik.ci <- c(lik.lower, lik.upper))
[1] 46432.99 69829.40
The likelihood ratio statistic is a transformation of the latter: > ## illustrate the likelihood ratio CI
> sigma2.grid <- seq(from=0.8 * lik.lower,

W (σ 2 ; X) = n V (σ 2 ; X)/n − log(V (σ 2 ; X)/n) − 1 . to=1.2 * lik.upper,
length=1000)
> plot(sigma2.grid,
Analogously to Example 3.8 we can derive a γ · 100% confidence interval for σ 2 rel.loglik(sigma2.grid),
based on the exact pivot V (σ 2 ; X): type="l",
xlab=math(sigma^2),
n o ylab= math (tilde(l)(sigma^2)),
γ = Pr χ2(1−γ)/2 (n) ≤ V (σ 2 ; X) ≤ χ2(1+γ)/2 (n) main="95% confidence intervals",
  las=1)
σ2
= Pr 1/χ2(1−γ)/2 (n) ≥ 2
≥ 1/χ 2
(1+γ)/2 (n) > abline(h = - 1/2 * qchisq(0.95, df=1),
nσ̂ML v = c(lik.lower, lik.upper),
n o lty=2)
= Pr nσ̂ML /χ(1+γ)/2 (n) ≤ σ 2 ≤ nσ̂ML
2 2 2
/χ2(1−γ)/2 (n) . > ## now compute the other 95% CI:
> (alt.ci <- c(n * mlsigma2 / qchisq(p=0.975, df=n),
n * mlsigma2 / qchisq(p=0.025, df=n)))
f ) Consider the transformation factors from Table 1.3, and assume that the “mean”
[1] 46587.27 70104.44
is the known µ and the “standard deviation” is σ̂ML . Compute both a 95% > abline(v = c(alt.ci[1], alt.ci[2]),
likelihood ratio confidence interval and a 95% confidence interval based on the lty=3)
exact pivot for σ 2 . Illustrate the likelihood ratio confidence interval by plotting
the relative log-likelihood and the cut-value, similar to Figure 4.8. In order to
compute the likelihood ratio confidence interval, use the R-function uniroot (cf .
Appendix C.1.1).
◮ The data are n = 185, µ = 2449.2, σ̂ML = 237.8.

95% confidence intervals Pni


0 which, after plugging in µ̂i = x̄i for µi , is solved by σ̂i2 = n−1
i j=1 (xij − x̄i )2 .
This can be done for any i = 1, . . . , K, giving the MLE θ̂ML .
−2
b) Compute the MLE θ̂0 under the restriction σi2 = σ 2 of the null hypothesis.
−4 ◮ Under H0 we have σi2 = σ 2 . Obviously the MLEs for the means µi do not
l (σ2)

change, so we have µ̂i,0 = x̄i . However, the score function for σ 2 now comprises
~

−6 all groups:
K K ni
−8 1 X 1 XX
Sσ2 (θ) = − n + (xij − µi )2 .
2σ 2 2(σ 2 )2
i
i=1 i=1 j=1
40000 50000 60000 70000 80000
Plugging in the estimates µ̂i,0 = x̄i for µi and solving for σ 2 gives
σ2
ni
K X
1 X
The two confidence intervals are very similar here. σ̂02 = PK (xij − x̄i )2 ,
i=1 ni i=1 j=1
15.Consider K independent groups of normally distributed observations with group-
specific means and variances, i. e. let Xi,1:ni be a random sample from N(µi , σi2 ) which is a pooled variance estimate. So the MLE under the restriction σi2 = σ 2
for group i = 1, . . . , K. We want to test the null hypothesis that the variances are of the null hypothesis is θ̂ 0 = (x̄1 , . . . , x̄K , σ̂02 , . . . , σ̂02 )⊤ .
identical, i. e. H0 : σi2 = σ 2 . c) Show that the generalised likelihood ratio statistic for testing H0 : σi2 = σ 2 is
a) Write down the log-likelihood kernel for the parameter vector K
X
θ = (µ1 , . . . , µK , σ12 , . . . , σK
2 ⊤
) . Derive the MLE θ̂ ML by solving the score equa- W = ni log(σ̂02 /σ̂i2 )
tions Sµi (θ) = 0 and then Sσi2 (θ) = 0, for i = 1, . . . , K. i=1

◮ Since all random variables are independent, the log-likelihood is the sum of where σ̂02 and σ̂i2 are the ML variance estimates for the i-th group with and
all individual log density contributions log f (xij ; θ i ), where θi = (µi , σi2 )⊤ : without the H0 assumption, respectively. What is the approximate distribution
ni
K X
of W under H0 ?
X
l(θ) = log f (xij ; θ). ◮ The generalised likelihood ratio statistic is

W = 2{l(θ̂ML ) − l(θ̂0 )}
i=1 j=1

Hence the log-likelihood kernel is K


( ni ni
)
X ni 2 1 X 2 ni 2 1 X 2
XK X ni 
1 1
 =2 − log(σ̂i ) − 2 (xij − µ̂i ) + log(σ̂0 ) + 2 (xij − µ̂i,0 )
− log(σi2 ) − (xij − µi )2 2 2σ̂i 2 2σ̂0
i=1 j=1 j=1
2 2σi2 Pn i PK Pni
2 2
j=1 (xij − x̄i ) j=1 (xij − x̄i )
i=1 j=1 K K
( ) X X
2 2
= ni log(σ̂0 /σ̂i ) − +
i=1
K
X 1 X
ni
1
P P P
j=1 (xij − x̄i )
n 2
=
ni
− log(σi2 ) − (xij − µi )2 . i=1 i=1 ni
i
PK1 K
i=1 j=1 (xij − x̄i )
ni 2
2 2
2σi i=1
ni
i=1 j=1
K
X K
X K
X
Now the score function for the mean µi of the i-th group is given by = ni log(σ̂02 /σ̂i2 ) − ni + ni
ni
1 X i=1 i=1 i=1
d
Sµi (θ) = l(θ) = 2 (xij − µi ), K
X
dµi σi = ni log(σ̂02 /σ̂i2 ).
j=1

and solving the corresponding score equation Sµi (θ) = 0 gives µ̂i = x̄i , the
i=1

average of the observations in the i-th group. The score function for the variance Since there are p = 2K free parameters in the unconstrained model, and r = K+1
σi2 of the i-th group is given by free parameters under the H0 restriction, we have

W ∼ χ2 (p − r) = χ2 (K − 1)
a
ni
d ni 1 X
Sσi2 (θ) = l(θ) = − 2 + (xij − µi )2 ,
dσi2 2σi 2(σi )
2 2
under H0 .
j=1

d) Consider the special case with K = 2 groups having equal size n1 = n2 = n. is used because T /C converges more rapidly to the asymptotic χ2 (K − 1) distri-
Show that W is large when the ratio bution than T alone.
Pn 2 Write two R-functions which take the vector of the group sizes (n1 , . . . , nK ) and
j=1 (x1j − x̄1 )
R = Pn the sample variances (s21 , . . . , s2K ) and return the values of the statistics W and
j=1 (x2j − x̄2 )
2
B, respectively.
is large or small. Show that W is minimal if R = 1. Which value has W for ◮
R = 1? > ## general function to compute the likelihood ratio statistic.
> W <- function(ni, # the n_i's
◮ In this special case we have s2i) # the s^2_i's
nP Pn o {
1 2 2
j=1 (x1j − x̄1 ) + j=1 (x2j − x̄2 )
n
2n 1 mleVars <- (ni - 1) * s2i / ni # sigma^2_i estimate
σ̂02 /σ̂12 = Pn = (1 + 1/R) pooledVar <- sum((ni - 1) * s2i) / sum(ni) # sigma^2_0 estimate
1
n
(x
j=1 1j − x̄ 1 )2 2
return(sum(ni * log(pooledVar / mleVars)))
and analogously }
1
σ̂02 /σ̂22 = (1 + R). > ## general function to compute the Bartlett statistic.
2 > B <- function(ni, # the n_i's
Hence the likelihood ratio statistic is s2i) # the s^2_i's
{
W = n log(σ̂02 /σ̂12 ) + n log(σ̂02 /σ̂22 )
pooledSampleVar <- sum((ni - 1) * s2i) / sum(ni - 1)

= 2n log(1/2) + n {log(1 + 1/R) + log(1 + R)} , T <- sum((ni - 1) * log(pooledSampleVar / s2i))


C <- 1 + (sum(1 / (ni - 1)) - 1 / sum(ni - 1)) / (3 * (length(ni) - 1))
which is large if 1/R or R is large, i. e. if R is small or large. We now consider return(T / C)
the derivative of W with respect to R: }
 
d 1 1 f ) In the H0 setting with K = 3, ni = 5, µi = i, σ 2 = 1/4, simulate 10 000
W (R) = n (−1)R−2 +
dR 1 + 1/R 1+R data sets and compute the statistics W and B in each case. Compare the
  empirical distributions with the approximate χ2 (K − 1) distribution. Is B closer
n 1
= − +1 , to χ2 (K − 1) than W in this case?
1+R R

which is equal to zero if and only if R = 1. Since we know that the function is
> ## simulation setting under H0:
increasing for small and large values of R, this is the minimum. The value of > K <- 3
W for R = 0 is > n <- rep(5, K)
> mu <- 1:K
> sigma <- 1/2
W = 2n log(1/2) + n{log(2) + log(2)} = −2n log(2) + 2n log(2) = 0. > ## now do the simulations
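As a small numerical illustration (not in the original solution), W can be plotted as a function of R for two groups of equal size, say n = 10; the function name W.of.R and the grid are arbitrary.
> W.of.R <- function(R, n) 2*n*log(1/2) + n*(log(1 + 1/R) + log(1 + R))
> Rgrid <- seq(0.1, 10, length = 500)
> plot(Rgrid, W.of.R(Rgrid, n = 10), type = "l", log = "x",
       xlab = "R", ylab = "W(R)")
> abline(v = 1, h = 0, lty = 2)  # minimum value W = 0 is attained at R = 1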
> nSim <- 1e4L
e) Bartlett’s modified likelihood ratio test statistic (Bartlett, 1937) is B = T /C > Wsims <- Bsims <- numeric(nSim)
where in > s2i <- numeric(K)
XK > set.seed(93)
T = (ni − 1) log(s20 /s2i ) > for(i in seq_len(nSim))
i=1
{
## simulate the sample variance for the i-th group
compared to W the numbers of observations ni have been replaced by the degrees for(j in seq_len(K))
Pn i
of freedom ni −1 and the sample variances s2i = (ni −1)−1 j=1 (xij − x̄i )2 define {
2
P P 2
s2i[j] <- var(rnorm(n=n[j],
the pooled sample variance s0 = { i=1 (ni −1)} i=1 (ni −1)si . The correction
K −1 K
mean=mu[j],
factor sd=sigma))
( K ! )
1 X 1 1
}
C =1+ − PK
3(K − 1) ni − 1 (ni − 1) ## compute test statistic results
i=1 i=1

Wsims[i] <- W(ni=n, lower.tail=FALSE)


s2i=s2i) > p.W
Bsims[i] <- B(ni=n, [1] 0.9716231
s2i=s2i) > p.B
} [1] 0.9847605
> ## now compare
> par(mfrow=c(1, 2)) According to both test statistics, there is very little evidence against equal vari-
> hist(Wsims,
nclass=50, ances.
prob=TRUE,
16.In a 1:1 matched case-control study, one control (i. e. a disease-free individual)
ylim=c(0, 0.5))
> curve(dchisq(x, df=K-1), is matched to each case (i. e. a diseased individual) based on certain individual
col="red", characteristics, e. g. age or gender. Exposure history to a potential risk factor is
add=TRUE,
n=200, then determined for each individual in the study. If exposure E is binary (e. g.
lwd=2) smoking history? yes/no) then it is common to display the data as frequencies of
> hist(Bsims,
case-control pairs, depending on exposure history:
nclass=50,
prob=TRUE,
ylim=c(0, 0.5)) History of control
> curve(dchisq(x, df=K-1),
col="red", Exposed Unexposed
add=TRUE,
n=200, History Exposed a b
lwd=2)
of case Unexposed c d
Histogram of Wsims Histogram of Bsims
0.5 0.5

For example, a is the number of case-control pairs with positive exposure history
0.4 0.4
of both the case and the control.
0.3 0.3 Let ω1 and ω0 denote the odds for a case and a control, respectively, to be exposed,
Density

Density

such that
ω1 ω0
0.2 0.2
Pr(E | case) = and Pr(E | control) = .
1 + ω1 1 + ω0
0.1 0.1
To derive conditional likelihood estimates of the odds ratio ψ = ω1 /ω0 , we argue
0.0 0.0 conditional on the number NE of exposed individuals in a case-control pair. If
0 5 10 15 20 25 0 5 10 15 20
NE = 2 then both the case and the control are exposed so the corresponding a
case-control pairs do not contribute to the conditional likelihood. This is also the
Wsims Bsims
case for the d case-control pairs where both the case and the control are unexposed
We see that the empirical distribution of B is slightly closer to the χ2 (2) distri-
(NE = 0). In the following we therefore only consider the case NE = 1, in which
bution than that of W , but the discrepancy is not very large.
case either the case or the control is exposed, but not both.
g) Consider the alcohol concentration data from Section 1.1.7. Quantify the evi-
dence against equal variances of the transformation factor between the genders a) Conditional on NE = 1, show that the probability that the case rather than
using P -values based on the test statistics W and B. the control is exposed is ω1 /(ω0 + ω1 ). Show that the corresponding conditional

> ni <- c(33, 152)
> s2i <- c(220.1, 232.5)^2
> p.W <- pchisq(W(ni, s2i),
df=3,
lower.tail=FALSE)
> p.B <- pchisq(B(ni, s2i),
df=3,

odds are equal to the odds ratio ψ. and the observed Fisher information is
◮ For the conditional probability we have  
1 b
I(ψ̂ML ) = · · (1 + 2b/c) − c
Pr(case E, control not E) {1 + (b/c)}2 (b/c)2
Pr(case E | NE = 1) =
Pr(case E, control not E) + Pr(case not E, control E) c2 c · (c + b)
ω1
· 1 = ·
1+ω1 1+ω0 (c + b)2 b
= 1 1
1+ω0 + 1+ω1 c3
ω1 ω0
1+ω1 · · 1+ω0
= .
ω1 b(c + b)
= ,
ω1 + ω0
Note that, since the latter is positive, ψ̂ML = b/c indeed maximises the likelihood.
and for the conditional odds we have
By Result 2.1, the observed Fisher information corresponding to log(ψ̂ML ) is
Pr(case E | NE = 1)
ω1
ω1 +ω0  −2
=
1 − Pr(case E | NE = 1) ω0
ω1 +ω0 I{log(ψ̂ML )} =
d
log(ψ̂ML ) · I(ψ̂ML )
ω1 dψ
= .
ω0 b2 c3
= ·
c b(c + b)
2
b) Write down the binomial log-likelihood in terms of ψ and show that the MLE of
p b·c
the odds ratio ψ is ψ̂ML = b/c with standard error se{log(ψ̂ML )} = 1/b + 1/c. = .
c+b
Derive the Wald test statistic for H0 : log(ψ) = 0.
◮ Note that Pr(case E | NE = 1) = ω1 /(ω1 + ω0 ) = 1/(1 + 1/ψ) and so It follows that
Pr(control E | NE = 1) = ω0 /(ω1 +ω0 ) = 1/(1 +ψ). The conditional log-likelihood
se{log(ψ̂ML )} = [I{log(ψ̂ML )}]−1/2
is p
    = 1/b + 1/c.
1 1
l(ψ) = b log + c log
1 + 1/ψ 1+ψ
  The Wald test statistic for H0 : log(ψ) = 0 is
1
= −b log 1 + − c log(1 + ψ).
ψ log(ψ̂ML ) − 0 log(b/c)
=p .
Hence the score function is se{log(ψ̂ML )} 1/b + 1/c

d c) Derive the standard error se(ψ̂ML ) of ψ̂ML and derive the Wald test statistic for
S(ψ) = l(ψ)
dψ H0 : ψ = 1. Compare your result with the Wald test statistic for H0 : log(ψ) = 0.
b 1 c ◮ Using the observed Fisher information computed above, we obtain that
= · −
1 + 1/ψ ψ 2 1+ψ
b c se(ψ̂ML ) = {I(ψ̂ML )}−1/2
= − , r
ψ(1 + ψ) 1 + ψ b(c + b)
=
and the score equation S(ψ) = 0 is solved by ψ̂ML = b/c. The Fisher information c3
r
is b 1 1
= · + .
d c b c
I(ψ) = − S(ψ)
dψ The Wald test statistic for H0 : ψ = 1 is
b c
= 2 · {(1 + ψ) + ψ} − ψ̂ML − 1 b/c − 1
ψ (1 + ψ)2 (1 + ψ)2 = p = p
b−c
.
 
1 b se(ψ̂ML ) b/c · 1/b + 1/c b 1/b + 1/c
= · · (1 + 2ψ) − c ,
(1 + ψ)2 ψ2

d) Finally compute the score test statistic for H0 : ψ = 1 based on the expected which shows that F = logit−1 . For the derivative:
Fisher information of the conditional likelihood.  
d d exp(x)
◮ We first compute the expected Fisher information. F (x) =
dx dx 1 + exp(x)
J(ψ) = E{I(ψ)} exp(x){1 + exp(x)} − exp(x) exp(x)
=
  {1 + exp(x)}2
1 b+c 1 + 2ψ b+c
= · · − exp(x) 1
(1 + ψ)2 1 + 1/ψ ψ2 1+ψ =
1 + exp(x) 1 + exp(x)
1 b+c 1+ψ
= · · = F (x){1 − F (x)}.
(1 + ψ)2 1 + ψ ψ
b+c
= . b) Use the results on multivariate derivatives outlined in Appendix B.2.2 to show
ψ(1 + ψ)2
that the log-likelihood, score vector and Fisher information matrix of β, given
p
The score test statistic is S(ψ0 )/ J(ψ0 ), where ψ0 = 1. Using the results derived the realisation y = (y1 , . . . , yn )⊤ , are
above, we obtain the statistic in the form n
X
b c l(β) = yi log(πi ) + (1 − yi ) log(1 − πi ),
1·(1+1)
− 1+1 b−c
q =√ . i=1
b+c b+c Xn
1·(1+1)2
S(β) = (yi − πi )xi = X ⊤ (y − π)
i=1
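To make parts b)–d) concrete, here is a small numerical sketch (not part of the original solution); the discordant-pair counts b and c are purely hypothetical.
> b <- 30; c <- 15                     # hypothetical discordant pairs
> psiHat <- b / c                      # MLE of the odds ratio
> seLogPsi <- sqrt(1/b + 1/c)          # se of log(psiHat)
> log(psiHat) / seLogPsi               # Wald statistic for log(psi) = 0
> (psiHat - 1) / (psiHat * seLogPsi)   # Wald statistic for psi = 1
> (b - c) / sqrt(b + c)                # score statistic for psi = 1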
17.Let Yi ∼ Bin(1, πi ), i = 1, . . . , n, be the binary response variables in a logistic
ind
Xn

regression model , where the probabilities πi = F (x⊤ i β) are parametrised via the
and I(β) = i = X W X,
πi (1 − πi )xi x⊤ ⊤

inverse logit function i=1


exp(x)
F (x) = respectively, where X = (x1 , . . . , xn )⊤ is the design matrix, π = (π1 , . . . , πn )⊤
1 + exp(x)
and W = diag{πi (1 − πi )}ni=1 .
by the regression coefficient vector β = (β1 , . . . , βp )⊤ . The vector xi = ◮ The probability mass function for Yi is
(xi1 , . . . , xip )⊤ contains the values of the p covariates for the i-th observation.
f (yi ; β) = πiyi (1 − πi )(1−yi ) ,
a) Show that F is indeed the inverse of the logit function logit(x) = log{x/(1 − x)},
d
and that dx F (x) = F (x){1 − F (x)}. where πi = F (x⊤ i β) depends on β. Because of independence of the n random
◮ We have variables Yi , i = 1, . . . , n, the log-likelihood of β is
 x 
y = logit(x) = log n
X
1−x l(β) = log{f (yi ; β)}
x
⇐⇒ exp(y) = i=1
1−x Xn
⇐⇒ exp(y)(1 − x) = x = yi log(πi ) + (1 − yi ) log(1 − πi ).
⇐⇒ exp(y) = x + x exp(y) = x{1 + exp(y)}
i=1

exp(y) To derive the score vector, we have to take the derivative with respect to βj ,
⇐⇒ x = = F (y),
1 + exp(y) j = 1, . . . , p. Using the chain rule for the partial derivative of a function g of πi ,
we obtain that
d d d d d
g(πi ) = g[F {ηi (β)}] = g(πi ) · F (ηi ) · ηi (β),
dβj dβj dπi dηi dβj

Pp Pn
where ηi (β) = x⊤
i β = j=1 xij βj is the linear predictor for the i-th observation. c) Show that the statistic T (y) = i=1 yi xi is minimal sufficient for β.
In our case, ◮ We can rewrite the log-likelihood as follows:
n   n
d X yi yi − 1 X
l(β) = + · πi (1 − πi ) · xij l(β) = yi log(πi ) + (1 − yi ) log(1 − πi )
dβj πi 1 − πi
i=1 i=1
Xn Xn
= (yi − πi )xij . = yi {log(πi ) − log(1 − πi )} + log(1 − πi )
i=1 i=1
Xn n
X
Together we have obtained = i β+
yi x⊤ log{1 − F (x⊤
i β)}
  Pn  i=1 i=1
i=1 (yi − πi )xi1
d
l(β)

dβ1
   X n = T (y)⊤ β − A(β), (5.4)
.. ..
S(β) = 
 .
=
  .
=
 (yi − πi )xi . Pn
Pn i=1 where A(β) = − i=1 log{1 − F (x⊤ i β)}. Now, (5.4) is in the form of an expo-
i=1 (yi − πi )xip
d
dβp l(β) nential family of order p in canonical form; cf. Exercise 8 in Chapter 3. In that
We would have obtained the same result using vector differentiation: exercise, we showed that T (y) is minimal-sufficient for β.
d) Implement an R-function which maximises the log-likelihood using the Newton-

S(β) = l(β) Raphson algorithm (see Appendix C.1.2) by iterating
∂β
n  
X yi yi − 1 ∂ β(t+1) = β (t) + I(β(t) )−1 S(β (t) ), t = 1, 2, . . .
= + · πi (1 − πi ) · ηi (β)
πi 1 − πi ∂β
i=1 until the new estimate β (t+1) and the old one β (t) are almost identical and
n
X β̂ML = β(t+1) . Start with β (1) = 0.
= (yi − πi )xi ,
i=1
◮ Note that
β(t+1) = β(t) + I(β (t) )−1 S(β (t) )
∂β ηi (β) = = xi . It is easily
∂ ∂ ⊤
because the chain rule also works here, and ∂β xi β
is equivalent to
seen that we can also write this as S(β) = X (y − π). ⊤
I(β (t) )(β (t+1) − β(t) ) = S(β (t) ).
We now use vector differentiation to derive the Fisher information matrix:

∂ Since it is numerically more convenient to solve an equation directly instead of


I(β) = − ⊤
S(β) computing a matrix inverse and then multiplying it with the right-hand side, we
∂β
Xn will be solving
=− (−xi ) · πi (1 − πi ) · x⊤
i I(β(t) )v = S(β (t) )
i=1
n
X for v and then deriving the next iterate as β (t+1) = β (t) + v.
= πi (1 − πi )xi x⊤
i . > ## first implement score vector and Fisher information:
i=1 > scoreVec <- function(beta,
data) ## assume this is a list of vector y and matrix X
We can rewrite this as I(β) = X ⊤ W X if we define W = diag{πi (1 − πi )}ni=1 . {
yMinusPiVec <- data$y - plogis(data$X %*% beta) # (y - pi)
return(crossprod(data$X, yMinusPiVec)) # X^T * (y - pi)
}
> fisherInfo <- function(beta,
data)
{
piVec <- as.vector(plogis(data$X %*% beta))
W <- diag(piVec * (1 - piVec))
return(t(data$X) %*% W %*% data$X)

} having had an X-ray gives odds ratio of exp(β3 ), so the odds are exp(β3 ) times
> ## here comes the Newton-Raphson algorithm: higher; and the father’s ever having had an X-ray changes the odds by factor
> computeMle <- function(data)
{ exp(β4 ).
## start with the null vector We now consider the data set amlxray and first compute the ML estimates of
p <- ncol(data$X)
beta <- rep(0, p) the regression coefficients:
names(beta) <- colnames(data$X) > library(faraway)
> # formatting the data
## loop only to be left by returning the result > data.subset <- amlxray[, c("disease", "age", "Sex", "Mray", "Fray")]
while(TRUE) > data.subset$Sex <- as.numeric(data.subset$Sex)-1
{ > data.subset$Mray <- as.numeric(data.subset$Mray)-1
## compute increment vector v > data.subset$Fray <- as.numeric(data.subset$Fray)-1
v <- solve(fisherInfo(beta, data), > data <- list(y = data.subset$disease,
scoreVec(beta, data)) X =
cbind(intercept=1,
## update the vector as.matrix(data.subset[, c("age", "Sex", "Mray", "Fray")])))
beta <- beta + v > (myMle <- computeMle(data))
[,1]
## check if we have converged intercept -0.282982807
if(sum(v^2) < 1e-8) age -0.002951869
{ Sex 0.124662103
return(beta) Mray -0.063985047
} Fray 0.394158386
} > (myOr <- exp(myMle))
} [,1]
intercept 0.7535327
e) Consider the data set amlxray on the connection between X-ray usage and acute
age 0.9970525
myeloid leukaemia in childhood, which is available in the R-package faraway. Sex 1.1327656
Here yi = 1 if the disease was diagnosed for the i-th child and yi = 0 otherwise Mray 0.9380190
Fray 1.4831354
(disease). We include an intercept in the regression model, i. e. we set x1 = 1.
We want to analyse the association of the diabetes status with the covariates We see a negative association of the leukaemia risk with age and the mother’s
x2 (age in years), x3 (1 if the child is male and 0 otherwise, Sex), x4 (1 if the ever having had an X-ray, and a positive association of the leukaemia risk with
mother ever have an X-ray and 0 otherwise, Mray) and x5 (1 if the father ever male gender and the father’s ever having had an X-ray, and could now say the
have an X-ray and 0 otherwise, Fray). same as described above, e. g. being a male increases the odds for leukaemia by
Interpret β2 , . . . , β5 by means of odds ratios. Compute the MLE β̂ML = 13.3 %. However, these are only estimates and we need to look at the associated
(β̂1 , . . . , β̂5 )⊤ and standard errors se(β̂i ) for all coefficient estimates β̂i , and con- standard errors before we make conclusions.
struct 95% Wald confidence intervals for βi (i = 1, . . . , 5). Interpret the results, We easily obtain the standard errors by inverting the observed Fisher infor-
and compare them with those from the R-function glm (using the binomial fam- mation and taking the square root of the resulting diagonal. This leads to the
ily). (transformed) Wald confidence intervals:
◮ In order to interpret the coefficients, consider two covariate vectors xi and > (obsFisher <- fisherInfo(myMle, data))
xj . The modelled odds ratio for having leukaemia (y = 1) is then intercept age Sex Mray
intercept 58.698145 414.95349 29.721114 4.907156
πi /(1 − πi ) exp(x⊤
i β)
age 414.953492 4572.35780 226.232641 38.464410
= = exp{(xi − xj )⊤ β}. Sex 29.721114 226.23264 29.721114 1.973848
πj /(1 − πj ) exp(x⊤
j β) Mray 4.907156 38.46441 1.973848 4.907156
Fray 16.638090 128.19588 8.902213 1.497194
If now xik − xjk = 1 for one covariate k and xil = xjl for all other covariates Fray
l 6= k, then this odds ratio is equal to exp(βk ). Thus, we can interpret β2 as the intercept 16.638090
age 128.195880
log odds ratio for person i versus person j who is one year younger than person Sex 8.902213
i; β3 as the log odds ratio for a male versus a female; likewise, the mother’s ever

Mray 1.497194 Mray -0.063985047


Fray 16.638090 Fray 0.394158386
> (mySe <- sqrt(diag(solve(obsFisher)))) > mySe
intercept age Sex Mray Fray intercept age Sex Mray Fray
0.25790020 0.02493209 0.26321083 0.47315533 0.29059384 0.25790020 0.02493209 0.26321083 0.47315533 0.29059384
> (myWaldCis <- as.vector(myMle) + qnorm(0.975) * t(sapply(mySe, "*", c(-1, +1))))
[,1] [,2] We now have a confirmation of our results.
intercept -0.78845792 0.22249230 f ) Implement an R-function which returns the profile log-likelihood of one of the p
age -0.05181788 0.04591414 parameters. Use it to construct 95% profile likelihood confidence intervals for
Sex -0.39122164 0.64054584
Mray -0.99135245 0.86338235 them. Compare with the Wald confidence intervals from above, and with the
Fray -0.17539508 0.96371185 results from the R-function confint applied to the glm model object.
> (myWaldOrCis <- exp(myWaldCis))
◮ In order to compute the profile log-likelihood of a coefficient βj , we have to
[,1] [,2]
intercept 0.4545452 1.249186 think about how to obtain β̂−j (βj ), the ML estimate of all other coefficients in
age 0.9495018 1.046985 the vector β except the j-th one, given a fixed value for βj . We know that we
Sex 0.6762303 1.897516
Mray 0.3710745 2.371167 have to solve all score equations except the j-th one. And we can do that again
Fray 0.8391254 2.621409 with the Newton-Raphson algorithm, by now iterating
(t+1) (t)
Note that all the confidence intervals for interesting coefficients cover zero; thus, β−j = β−j + I(β(t) )−1 (t)
−j S(β )−j , t = 1, 2, . . . ,
neither of the associations interpreted above is significant.
(t)
We can compare the results with those from the standard glm function: where I(β )−j is the Fisher information matrix computed from the current
> amlGlm <- glm(disease ~ age + Sex + Mray + Fray, estimate of β −j and the fixed βj , and then leaving out the j-th row and column.
family=binomial(link="logit"), Likewise, S(β(t) )−j is the score vector without the j-th element.
The rest is then easy, we just plug in β̂−j (βj ) and βj into the full log-likelihood:
data=amlxray)
> summary(amlGlm)
Call:
glm(formula = disease ~ age + Sex + Mray + Fray, family = binomial(link = "logit"), > ## the full log-likelihood:
data = amlxray) > fullLogLik <- function(beta,
data)
Deviance Residuals: {
Min 1Q Median 3Q Max piVec <- as.vector(plogis(data$X %*% beta))
-1.279 -1.099 -1.043 1.253 1.340 ret <- sum(data$y * log(piVec) + (1 - data$y) * log(1 - piVec))
return(ret)
Coefficients: }
Estimate Std. Error z value Pr(>|z|) > ## compute the MLE holding the j-th coefficient fixed:
(Intercept) -0.282983 0.257896 -1.097 0.273 > computeConditionalMle <- function(data,
age -0.002952 0.024932 -0.118 0.906 value, # the fixed value
SexM 0.124662 0.263208 0.474 0.636 index) # j
Mrayyes -0.063985 0.473146 -0.135 0.892 {
Frayyes 0.394158 0.290592 1.356 0.175 ## start with the MLE except for the value
p <- ncol(data$X)
(Dispersion parameter for binomial family taken to be 1) beta <- computeMle(data)
beta[index, ] <- value
Null deviance: 328.86 on 237 degrees of freedom
Residual deviance: 326.72 on 233 degrees of freedom ## loop only to be left by returning the result
AIC: 336.72 while(TRUE)
{
Number of Fisher Scoring iterations: 3 ## compute increment vector v for non-fixed part
> myMle v <- solve(fisherInfo(beta, data)[-index, -index, drop=FALSE],
[,1] scoreVec(beta, data)[-index, , drop=FALSE])
intercept -0.282982807
age -0.002951869 ## update the non-fixed part of the beta vector
Sex 0.124662103 beta[-index, ] <- beta[-index, ] + v

> ## comparison with the R results:


## check if we have converged > confint(amlGlm)
if(sum(v^2) < 1e-8) 2.5 % 97.5 %
{ (Intercept) -0.79331983 0.22054713
return(beta) age -0.05202761 0.04594725
} SexM -0.39140156 0.64192116
} Mrayyes -1.01582557 0.86404508
} Frayyes -0.17432400 0.96782542
> ## so the profile log-likelihood is: > ## so we are quite close to these!
> profileLogLik <- function(beta, # this is scalar now! >
data, > ## the Wald results:
index) > myWaldCis
{ [,1] [,2]
fullLogLik(computeConditionalMle(data=data, intercept -0.78845792 0.22249230
value=beta, age -0.05181788 0.04591414
index=index), Sex -0.39122164 0.64054584
data=data) Mray -0.99135245 0.86338235
} Fray -0.17539508 0.96371185
> ## now for our data, compute 95% profile likelihood CIs:
> profLogLikCis <- matrix(nrow=ncol(data$X), Unfortunately, our algorithm is not very stable, which means that we must start
ncol=2,
dimnames= with the full MLE configuration to search the constrained solution, and we must
list(colnames(data$X), not choose parameter values too far away from the MLE. Here we pragmatically
c("lower", "upper")))
start with the Wald interval bounds and add an amount depending on the width
> for(j in seq_len(ncol(data$X)))
{ of the Wald interval, which gives us a sensible parameter range. In practice one
## this function must equal 0 should always use the confint routine to compute profile likelihood intervals in
targetFun <- function(beta)
{ GLMs.
likRatioStat <- In comparison to the Wald intervals, the profile likelihood intervals are a little
- 2 * (profileLogLik(beta, data=data, index=j) -
shifted to the left for the coefficients β2 , β3 and β4 , slightly wider for the intercept
profileLogLik(myMle[j], data=data, index=j))
likRatioStat - qchisq(0.95, df=1) β1 and slightly shorter for the coefficient β5 . However, all these differences are
} very small in size.
## compute bounds g) We want to test if the inclusion of the covariates x2 , x3 , x4 and x5 improves
## note that we cannot use too large intervals because the fit of the model to the data. To this end, we consider the null hypothesis
## the numerics are not very stable... so we start with the Wald
H0 : β2 = · · · = β5 = 0.
## interval approximation first
addSpace <- abs(diff(myWaldCis[j,])) / 10 How can this be expressed in the form H0 : Cβ = δ, where C is a q × p
lower <- uniroot(targetFun, contrast matrix (of rank q ≤ p) and δ is a vector of length q? Use a result from
interval=c(myWaldCis[j,1] - addSpace, myMle[j]))$root
upper <- uniroot(targetFun, Appendix A.2.4 to show that under H0 ,
interval=c(myMle[j], myWaldCis[j,2] + addSpace))$root n o−1
(C β̂ML − δ)⊤ CI(β̂ML )−1 C ⊤ (C β̂ML − δ) ∼ χ2 (q).
a
(5.5)
## save them correctly
profLogLikCis[j, ] <- c(lower, upper)
}
> ## our result: ◮ If we set
> profLogLikCis  
0 1 0 0 0
lower upper
 
intercept -0.79330733 0.22054433 0 0 1 0 0
age -0.05203807 0.04595938 C=

,

Sex -0.39139772 0.64191554 0 0 0 1 0
0 0 0 0 1
Mray -1.01577053 0.86400486
Fray -0.17431963 0.96781339
108 5 Likelihood inference in multiparameter models 109

then Cβ = (β2 , β3 , β4 , β5 ), and with the right hand side δ = 0 we have expressed > ## Second the generalized likelihood ratio statistic:
the null hypothesis H0 : β2 = β3 = β4 = β5 = 0 in the form of a so-called linear >
> ## We have to fit the model under the H0 restriction, so
hypothesis: H0 : Cβ = δ. > ## only with intercept.
From Section 5.4.2 we know that > h0data <- list(y=data$y,
X=data$X[, "intercept", drop=FALSE])
> myH0mle <- computeMle(data=h0data)
β̂ML ∼ Np (β, I(β̂ ML )−1 ),
a
> ## then the statistic is
> (likRatioStat <- 2 * (fullLogLik(beta=myMle,
hence data=data) -
C β̂ML ∼ Nq (Cβ, CI(β̂ ML )−1 C ⊤ ).
a fullLogLik(beta=myH0mle,
data=h0data)))
Now under H0 , Cβ = δ, therefore [1] 2.141096
> (p.likRatio <- pchisq(q=likRatioStat, df=3, lower.tail=FALSE))
[1] 0.5436436
C β̂ML ∼ Nq (δ, CI(β̂ML )−1 C ⊤ )
a

We see that neither of the two statistics finds much evidence against H0 , so the
and association of the oral cancer risk with the set of these four covariates is not
Σ−1/2 (C β̂ML − δ) ∼ Nq (0, I q )
a
significant.
where Σ = CI(β̂ML )−1 C ⊤ . Analogous to Section 5.4, we finally get Note that we can compute the generalised likelihood ratio statistic for the compar-
ison of two models in R using the anova function. For example, we can reproduce
n o⊤ n o
Σ−1/2 (C β̂ML − δ) Σ−1/2 (C β̂ ML − δ) the null deviance from above using the following code:
n o⊤ > anova(update(amlGlm, . ~ 1), amlGlm, test="LRT")
= (C β̂ML − δ)⊤ Σ−1/2 Σ−1/2 (C β̂ ML − δ) Analysis of Deviance Table

= (C β̂ML − δ)⊤ Σ−1 (C β̂ ML − δ) χ2 (q).


a Model 1: disease ~ 1

Model 2: disease ~ age + Sex + Mray + Fray
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
h) Compute two P -values quantifying the evidence against H0 , one based on the 1 237 328.86
squared Wald statistic (5.5), the other based on the generalised likelihood ratio 2 233 326.72 4 2.1411 0.7098
statistic. i) Since the data is actually from a matched case-control study, where pairs of one
◮ case and one control have been matched (by age, race and county of residence;
> ## First the Wald statistic: the variable ID denotes the matched pairs), it is more appropriate to apply
> (Cmatrix <- cbind(0, diag(4)))
[,1] [,2] [,3] [,4] [,5] conditional logistic regression. Compute the corresponding MLEs and 95% con-
[1,] 0 1 0 0 0 fidence intervals with the R-function clogit from the package survival, and
[2,] 0 0 1 0 0
compare the results.
[3,] 0 0 0 1 0
[4,] 0 0 0 0 1 ◮ Since the controls and cases are matched by age, we cannot evaluate the
> (waldStat <- t(Cmatrix %*% myMle) %*% effect of age. We therefore fit a conditional logistic regression model with Sex,
solve(Cmatrix %*% solve(obsFisher) %*% t(Cmatrix)) %*%
(Cmatrix %*% myMle))
Mray and Fray as covariates. We do not look at the estimate of the intercept
[,1] either, as each pair has its own intercept in the conditional logistic regression.
[1,] 2.1276 > library(survival)
> (p.Wald <- pchisq(q=waldStat, df=nrow(Cmatrix), lower.tail=FALSE)) > # compute the results
[,1] > amlClogit <- clogit( disease ~ Sex + Mray + Fray + strata(ID), data=amlxray )
[1,] 0.7123036 > amlClogitWaldCis <- amlClogit$coefficients +
qnorm(0.975) * t(sapply(sqrt(diag(amlClogit$var)), "*", c(-1, +1)))
> colnames(amlClogitWaldCis, do.NULL=TRUE)
NULL
> rownames(amlClogitWaldCis, do.NULL=TRUE)
NULL

> colnames(amlClogitWaldCis) <-c("2.5 %", "97.5 %") Mrayyes -1.0158256 0.8640451


> rownames(amlClogitWaldCis) <-c("Sex", "Mray", "Fray") Frayyes -0.1743240 0.9678254
> # compare the results
> summary(amlClogit) The results from the conditional logistic regression are similar to those obtained
Call: above in the direction of the effects (negative or positive) and, more importantly,
coxph(formula = Surv(rep(1, 238L), disease) ~ Sex + Mray + Fray +
strata(ID), data = amlxray, method = "exact") in that neither of the associations is significant. The actual numerical values
differ, of course.
n= 238, number of events= 111
18.In clinical dose-finding studies, the relationship between the dose d ≥ 0 of the med-
coef exp(coef) se(coef) z Pr(>|z|) ication and the average response µ(d) in a population is to be inferred. Considering
SexM 0.10107 1.10635 0.34529 0.293 0.770
Mrayyes -0.01332 0.98677 0.46232 -0.029 0.977 a continuously measured response y, then a simple model for the individual mea-
surements assumes yij ∼ N(µ(dij ; θ), σ 2 ), i = 1, . . . , K, j = 1, . . . , ni . Here ni is the
ind
Frayyes 0.44567 1.56154 0.30956 1.440 0.150
number of patients in the i-th dose group with dose di (placebo group has d = 0).
exp(coef) exp(-coef) lower .95 upper .95
SexM 1.1064 0.9039 0.5623 2.177 The Emax model has the functional form
Mrayyes 0.9868 1.0134 0.3987 2.442
d
Frayyes 1.5615 0.6404 0.8512 2.865 µ(d; θ) = θ1 + θ2 .
d + θ3
Rsquare= 0.01 (max possible= 0.501 )
Likelihood ratio test= 2.36 on 3 df, p=0.5003 a) Plot the function µ(d; θ) for different choices of the parameters θ1 , θ2 , θ3 > 0.
Wald test = 2.31 on 3 df, p=0.5108 Give reasons for the interpretation of θ1 as the mean placebo response, θ2 as
Score (logrank) test = 2.35 on 3 df, p=0.5029
> summary(amlGlm) the maximum treatment effect, and θ3 as the dose giving 50% of the maximum
Call: treatment effect.
glm(formula = disease ~ age + Sex + Mray + Fray, family = binomial(link = "logit"), ◮ If d = 0 (the dose in the placebo group), then µ(0; θ) = θ1 +θ2 ·0/(0+θ3 ) = θ1 .
data = amlxray)
Thus, θ1 is the mean response in the placebo group. Further, µ(d; θ) = θ1 + θ2 ·
Deviance Residuals: d/(d + θ3 ) = θ1 + θ2 · 1/(1 + θ3 /d). Hence, the mean response is smallest for
the placebo group (µ(0; θ) = θ1 ) and increases for increasing d. As d tends to
Min 1Q Median 3Q Max
-1.279 -1.099 -1.043 1.253 1.340
infinity, θ2 · 1/(1 + θ3 /d) approaches θ2 from below. Thus, θ2 is the maximum
Coefficients: treatment effect. Finally, µ(θ3 ; θ) = θ1 + θ2 · θ3 /(2θ3 ) = θ1 + θ2 /2. This dose
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.282983 0.257896 -1.097 0.273 therefore gives 50% of the maximum treatment effect θ2 . We now draw µ(d; θ)
age -0.002952 0.024932 -0.118 0.906 for various values of θ.
SexM 0.124662 0.263208 0.474 0.636
> # define the Emax function
Mrayyes -0.063985 0.473146 -0.135 0.892
> Emax <- function(dose, theta1, theta2, theta3){
Frayyes 0.394158 0.290592 1.356 0.175
return( theta1 + theta2*dose/(theta3+dose) )
}
(Dispersion parameter for binomial family taken to be 1)
> # we look at doses between 0 and 4
> dose <- seq(from=0, to=4, by=0.001)
Null deviance: 328.86 on 237 degrees of freedom
> # choice of the parameter values
Residual deviance: 326.72 on 233 degrees of freedom
> theta1 <- c(0.2, 0.2, 0.5, 0.5)
AIC: 336.72
> theta2 <- c(1, 1, 2, 2)
> theta3 <- c(0.2, 0.5, 1, 2)
Number of Fisher Scoring iterations: 3
> ##
> amlClogitWaldCis
> # store the responses for the different parameter combinations in a list
2.5 % 97.5 % > resp <- list()
Sex -0.5756816 0.7778221 > for(i in 1:4){
Mray -0.9194578 0.8928177 resp[[i]] <- Emax(dose, theta1[i], theta2[i], theta3[i])
Fray -0.1610565 1.0523969 }
> confint(amlGlm)[3:5, ] > # plot the different dose-mean-response relationships
2.5 % 97.5 % > m.resp <- expression(mu(d)) # abbreviation for legend
SexM -0.3914016 0.6419212 > par(mfrow=c(2,2))

Pn i
> plot(x=dose, y=resp[[1]], type="l", ylim=c(0,1.2), xlab="dose", ylab=m.resp) where ỹi = j=1 {yij − µ(di ; θ)} and δi = di + θ3 . The expected Fisher infor-
> legend("bottomright", legend= "theta_1=0.2, theta_2=1, theta_3=0.2") mation matrix is thus
> plot(x=dose, y=resp[[2]], type="l", ylim=c(0,1.2), xlab="dose", ylab=m.resp)
 PK PK ni di PK 
> legend("bottomright", legend= "theta_1=0.2, theta_2=1, theta_3=0.5")
i=1 ni i=1 δi −θ2 i=1 nδi2di
> plot(x=dose, y=resp[[3]], type="l", ylim=c(0,2.6), xlab="dose", ylab=m.resp)
1  PK ni di PK ni d2i PK n d2 i

> legend("bottomright", legend= "theta_1=0.5, theta_2=2, theta_3=1") J(θ) = 2  i=1 δi2 −θ2 i=1 δi 3 i 
.
> plot(x=dose,resp[[4]], type="l", ylim=c(0,2.6), xlab="dose", ylab=m.resp) σ  i=1 δi
PK PK n d2 PK n di2
> legend("bottomright", legend= "theta_1=0.5, theta_2=2, theta_3=2") −θ2 i=1 nδi2di −θ2 i=1 δi 3 i θ22 i=1 δi 4 i
i i i
1.2 1.2
1.0 1.0 The following R function uses J(θ) to obtain the approximate covariance matrix
0.8 0.8
of the MLE θ̂ML for a given set of doses d1 , . . . , dK , a total sample size N =
µ(d)

µ(d)
0.6 0.6
0.4 0.4 PK 2
0.2
theta_1=0.2, theta_2=1, theta_3=0.2
0.2
theta_1=0.2, theta_2=1, theta_3=0.5 i=1 ni , allocation weights wi = ni /N , and given error variance σ .
0.0 0.0

0 1 2 3 4 0 1 2 3 4
> ApprVar <- function(doses, N, w, sigma, theta1, theta2, theta3){
delta <- doses + theta3
dose dose
n <- w*N
2.5 2.5
2.0 2.0
V <- matrix(NA, nrow=3, ncol=3)
diag(V) <- c( N, sum(n*doses^2/delta^2), theta2^2*sum(n*doses^2/delta^4) )
µ(d)

1.5 µ(d) 1.5


1.0 1.0 V[1, 2] <- V[2, 1] <- sum(n*doses/delta)
0.5 0.5 V[1, 3] <- V[3, 1] <- -theta2*sum(n*doses/delta^2)
theta_1=0.5, theta_2=2, theta_3=1 theta_1=0.5, theta_2=2, theta_3=2
0.0 0.0 V[2, 3] <- V[3, 2] <- -theta2*sum(n*doses^2/delta^3)
0 1 2 3 4 0 1 2 3 4 return(sigma^2*solve(V))
dose dose }

b) Compute the expected Fisher information for the parameter vector θ. Using this c) Assume θ1 = 0, θ2 = 1, θ3 = 0.5 and σ 2 = 1. Calculate the approximate
result, implement an R function which calculates the approximate covariance covariance matrix, first for K = 5 doses 0, 1, 2, 3, 4, and second for doses 0,
matrix of the MLE θ̂ ML for a given set of doses d1 , . . . , dK , a total sample size 0.5, 1, 2, 4, both times with balanced allocations wi = 1/5 and total sample size
PK
N = i=1 ni , allocation weights wi = ni /N and given error variance σ 2 . N = 100. Compare the approximate standard deviations of the MLEs of the
◮ The log-likelihood kernel is parameters between the two designs, and also compare the determinants of the
two calculated matrices.
K ni
1 XX ◮
l(θ) = − {yij − µ(dij ; θ)}2 ,
2σ 2 > # with the first set of doses:
i=1 j=1
> V1 <- ApprVar(doses=c(0:4), N=100, w=c(rep(1/5, 5)),
the score function is sigma=1, theta1=0, theta2=1, theta3=0.5)
> # with the second set of doses:
   PK  > V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=100, w=c(rep(1/5, 5)),
d
l(θ) i=1 ỹi sigma=1, theta1=0, theta2=1, theta3=0.5)
 dθ1  1  PK
 
S(θ) = 
 dθ2 l(θ) = σ 2 
d  di
i=1 ỹi · di +θ3
,

> # standard errors
PK > sqrt(diag(V1))
d di
dθ3 l(θ) −θ 2 i=1 ỹi · (di +θ3 ) 2 [1] 0.2235767 0.4081669 0.9185196
> sqrt(diag(V2))
and the Fisher information matrix is [1] 0.2230535 0.3689985 0.6452322
 PK  > # determinant
PK ni di PK ni di > det(V1)
i=1 ni i=1 δi −θ2 i=1 δi2
1  PK ni di PK ni d2i PK  [1] 0.0007740515
I(θ) = 2  i=1 δi2 i=1 3 (δi ỹi − ni di θ2 )
di , > det(V2)
σ  i=1 δ
PK i PK di PK dδii 
i=1 δ 3 (δi ỹi − ni di θ2 ) θ2 i=1 δ4 (−2δi ỹi + ni di θ2 ) [1] 0.000421613
−θ2 i=1 nδi2di
i i i
The second design achieves a lower variability of estimation by placing
more doses on the increasing part of the dose-response curve.

d) Using the second design, determine the required total sample size N so that the
standard deviation for estimation of θ2 is 0.35 (so that the half-length of a 95%
6 Bayesian inference
confidence interval is about 0.7).

> sample.size <- function(N){
V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=N, w=c(rep(1/5, 5)),
sigma=1, theta1=0, theta2=1, theta3=0.5)
return(sqrt(V2[2,2])-0.35)
}
> # find the root of the function above
> uniroot(sample.size, c(50, 200))
£root 1. In 1995, O. J. Simpson, a retired American football player and actor, was accused
[1] 111.1509 of the murder of his ex-wife Nicole Simpson and her friend Ronald Goldman. His
lawyer, Alan M. Dershowitz stated on T.V. that only one-tenth of 1% of men who
£f.root
[1] -8.309147e-10 abuse their wives go on to murder them. He wanted his audience to interpret this
to mean that the evidence of abuse by Simpson would only suggest a 1 in 1000
£iter
[1] 7 chance of being guilty of murdering her.
However, Merz and Caulkins (1995) and Good (1995) argue that a different proba-
£estim.prec bility needs to be considered: the probability that the husband is guilty of murdering
[1] 6.103516e-05
his wife given both that he abused his wife and his wife was murdered. Both com-
We need a little more than 111 patients in total, that is a little more than 22 pute this probability using Bayes theorem, but in two different ways. Define the
per dose. By taking 23 patients per dose, we obtain the desired length of the following events:
confidence interval:
> V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=5*23, w=c(rep(1/5, 5)), A: “The woman was abused by her husband.”
sigma=1, theta1=0, theta2=1, theta3=0.5)
> sqrt(V2[2,2]) M : “The woman was murdered by somebody.”
[1] 0.3440929 G: “The husband is guilty of murdering his wife.”
a) Merz and Caulkins (1995) write the desired probability in terms of the corre-
sponding odds as:

Pr(G | A, M ) Pr(A | G, M ) Pr(G | M )


= · . (6.1)
Pr(Gc | A, M ) Pr(A | Gc , M ) Pr(Gc | M )

They use the fact that, of the 4936 women who were murdered in 1992, about
1430 were killed by their husband. In a newspaper article, Dershowitz stated
that “It is, of course, true that, among the small number of men who do kill
their present or former mates, a considerable number did first assault them.”
Merz and Caulkins (1995) interpret “a considerable number” to be 1/2. Finally,
they assume that the probability of a wife being abused by her husband, given
that she was murdered by somebody else, is the same as the probability of a
randomly chosen woman being abused, namely 0.05.
Calculate the odds (6.1) based on this information. What is the corresponding
probability of O. J. Simpson being guilty, given that he has abused his wife and

she has been murdered? Now we obtain


◮ Using the method of Merz and Caulkins (1995) we obtain Pr(G | A, M ) Pr(M | G, A) Pr(G | A)
= ·
1430 3506 Pr(Gc | A, M ) Pr(M | Gc , A) Pr(Gc | A)
Pr(G | M ) = ≈ 0.29 ⇒ Pr(G | M ) =
c
≈ 0.71 1
4936 4936 1 10000
≈ ·
Pr(A | G, M ) = 0.5 1
10000
9999
10000
Pr(A | Gc , M ) = 0.05. ≈ 1,

so that Pr(G | A, M ) = 0.5. That means the probability of O.J. Simpson being
The odds (6.1) are therefore guilty, given that he has abused his wife and she has been murdered, is about
50%.
Pr(G | A, M ) Pr(A | G, M ) Pr(G | M )
= · c) Good (1996) revised this calculation, noting that approximately only a quarter
Pr(Gc | A, M ) Pr(A | Gc , M ) Pr(Gc | M )
1430
of murdered victims are female, so Pr(M | Gc , A) reduces to 1/20 000. He also
0.5
= · 4936 corrects Pr(G | A) to 1/2000, when he realised that Dershowitz’s estimate was an
0.05 3506
4936 annual and not a lifetime risk. Calculate the probability of O. J. Simpson being
≈ 4.08, guilty based on this updated information.
◮ The revised calculation is now
so that
4.08 Pr(G | A, M ) Pr(M | G, A) Pr(G | A)
Pr(G | A, M ) = ≈ 0.8. =
1 + 4.08 Pr(Gc | A, M )
·
Pr(M | Gc , A) Pr(Gc | A)
That means the probability of O.J. Simpson being guilty, given that he has abused 1
1 2000
his wife and she has been murdered, is about 80%. ≈ 1
· 1999
20000 2000
b) Good (1995) uses the alternative representation
≈ 10,
Pr(G | A, M ) Pr(M | G, A) Pr(G | A)
= · . (6.2) so that Pr(G | A, M ) ≈ 0.91. Based on this updated information, the probability
Pr(Gc | A, M ) Pr(M | Gc , A) Pr(Gc | A)
of O.J. Simpson being guilty, given that he has abused his wife and she has been
He first needs to estimate Pr(G | A) and starts with Dershowitz’s estimate of
murdered, is about 90%.
1/1000 that the abuser will murder his wife. He assumes the probability is at
least 1/10 that this will happen in the year in question. Thus Pr(G | A) is at 2. Consider Example 6.4. Here we will derive the implied distribution of θ =
least 1/10 000. Obviously Pr(M | Gc , A) = Pr(M | A) ≈ Pr(M ). Since there are Pr(D+ | T +) if the prevalence is π ∼ Be(α̃, β̃).
about 25 000 murders a year in the U.S. population of 250 000 000, Good (1995) a) Deduce with the help of Appendix A.5.2 that
estimates Pr(M | Gc , A) to be 1/10 000.
α̃ 1 − π
Calculate the odds (6.2) based on this information. What is the corresponding γ= ·
β̃ π
probability of O. J. Simpson being guilty, given that he has abused his wife and
she has been murdered? follows an F distribution with parameters 2β̃ and 2α̃, denoted by F(2β̃, 2α̃).
◮ Using the method of Good (1995) it follows: ◮ Following the remark on page 336, we first show 1 − π ∼ Be(β̃, α̃) and then
1 9999 we deduce γ = α̃ · 1−π ∼ F(2β̃, 2α̃).
Pr(G | A) = ⇒ Pr(Gc | A) =
β̃ π
10000 10000 Step 1: To obtain the density of 1 − π, we apply the change-of-variables formula
1 to the density of π.
Pr(M | Gc , A) ≈
10000 The transformation function is
Pr(M | G, A) = 1.
g(π) = 1 − π
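As an addition to the printed solution of Exercise 1, the three posterior probabilities can be reproduced with a short R sketch; the likelihood ratios and prior odds are exactly the numbers used in parts a) to c) above, and the helper function postprob is introduced here only for illustration.

> ## posterior probability Pr(G | A, M) from a likelihood ratio and prior odds
> postprob <- function(lr, prior.odds){
      post.odds <- lr * prior.odds
      return(post.odds / (1 + post.odds))
  }
> ## a) Merz and Caulkins (1995): approximately 0.80
> postprob(lr = 0.5 / 0.05, prior.odds = 1430 / 3506)
> ## b) Good (1995): approximately 0.50
> postprob(lr = 10000, prior.odds = (1/10000) / (9999/10000))
> ## c) Good (1996): approximately 0.91
> postprob(lr = 20000, prior.odds = (1/2000) / (1999/2000))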

and we have b) Show that as a function of γ, the transformation (6.12) reduces to

g −1 (y) = 1 − y, θ = g(γ) = (1 + γ/c)


−1

dg −1
(y)
= −1. where
dy α̃ Pr(T + | D+)
c= .
This gives β̃{1 − Pr(T − | D−)}
−1
dg (y) 
f1−π (y) = fπ g −1 (y) ◮ We first plug in the expression for ω given in Example 6.4 and then we
dy
express the term depending on π as a function of γ:
= fπ (1 − y)
1 θ = (1 + ω −1 )−1
= (1 − y)α̃−1 (1 − (1 − y))β̃−1  −1
B(α̃, β̃) 1 − Pr(T − | D−) 1 − π
1 = 1+ ·
= y β̃−1 (1 − y)α̃−1 , Pr(T + | D+) π
B(β̃, α̃)  −1
1 − Pr(T − | D−) β̃γ
= 1+ ·
where we have used that B(α̃, β̃) = B(β̃, α̃), which follows easily from the defini- Pr(T + | D+) α̃
tion of the beta function (see Appendix B.2.1). Thus, 1 − π ∼ Be(β̃, α̃).  −1
α̃ Pr(T + | D+)
Step 2: We apply the change-of-variables formula again to obtain the density of = 1 + γ/
β̃{1 − Pr(T − | D−)}
γ from the density of 1 − π.
= (1 + γ/c)
−1
.
We have γ = g(1 − π), where
α̃ x c) Show that
g(x) = · , d 1
β̃ 1 − x g(γ) = −
dγ c(1 + γ/c)2
y β̃y
g −1 (y) = = and
α̃/β̃ + y α̃ + β̃y and that g(γ) is a strictly monotonically decreasing function of γ.
2 ◮ Applying the chain rule to the function g gives
dg −1
(y) β̃(α̃ + β̃y) − β̃ y α̃β̃
= = .
dy (α̃ + β̃y)2 (α̃ + β̃y)2 d d 1
g(γ) = − (1 + γ/c) · (1 + γ/c) = −
−2
.
dγ dγ c(1 + γ/c)2
Hence,
As c > 0, we have
 dg −1 (y)
fγ (y) = f1−π g −1 (y) d
dy g(γ) < 0 for all γ ∈ [0, ∞),

 β̃−1  α̃−1
1 β̃y β̃y α̃β̃ which implies that g(γ) is a strictly monotonically decreasing function of γ.
= 1−
B(β̃, α̃) α̃ + β̃y α̃ + β̃y (α̃ + β̃y)2 d) Use the change of variables formula (A.11) to derive the density of θ in (6.13).
 1−β̃  1−α̃ ◮ We derive the density of θ = g(γ) from the density of γ obtained in 2a).
1 α̃ + β̃y α̃ + β̃y α̃β̃
= Since g is strictly monotone by 2c) and hence one-to-one, we can apply the
B(β̃, α̃) β̃y α̃ (α̃ + β̃y)2
 −β̃  −α̃ change-of-variables formula to this transformation. We have
1 α̃ + β̃y α̃ + β̃y
=
B(β̃, α̃)y β̃y α̃ θ = (1 + γ/c)−1 by 2b) and
 −β̃  −α̃
1 α̃ β̃y γ = g −1 (θ) =
c(1 − θ)
= c(1/θ − 1).
= 1+ 1+ , θ
B(β̃, α̃)y β̃y α̃

which establishes γ ∼ F(2β̃, 2α̃).



Thus, part (a), we thus obtain γ̄ ∼ F(2α̃, 2β̃).


Step b: Next, we express τ = Pr(D− | T −) as a function of γ̄.
dg(γ) −1
fθ (θ) = · fγ (γ)
dγ τ = (1 + ω̄ −1 )−1
2
= c · (1 + γ/c) · fγ (c(1/θ − 1))  −1
(6.3) 1 − Pr(T + | D+) π
 = 1+ ·
=c·θ −2
· fF c(1/θ − 1) ; 2β̃, 2α̃ , (6.4) Pr(T − | D−) 1−π
 −1
1 − Pr(T + | D+) α̃γ̄
where we have used 2c) in (6.3) and 2a) in (6.4). = 1+ ·
Pr(T − | D−) β̃
e) Analogously proceed with the negative predictive value τ = Pr(D− | T −) to show  −1
that the density of τ is β̃ Pr(T + | D+)
= 1 + γ̄ /
α̃{1 − Pr(T − | D−)}

f (τ ) = d · τ −2 · fF d(1/τ − 1); 2α̃, 2β̃ , = (1 + γ̄/d)
−1
.
where Step c: We show that the transformation h(γ̄) = (1 + γ̄/d)−1 is one-to-one by
β̃ Pr(T − | D−)
d= establishing strict monotonicity.
α̃{1 − Pr(T + | D+)}
Applying the chain rule to the function h gives
and fF (x ; 2α̃, 2β̃) is the density of the F distribution with parameters 2α̃ and
2β̃. d 1
h(γ̄) = − .
◮ In this case, the posterior odds can be expressed as dγ̄ d(1 + γ̄/d)2

Pr(D− | T −) Pr(T − | D−) 1−π As d > 0, we have


ω̄ := = · , d
Pr(D+ | T −) 1 − Pr(T + | D+) π h(γ̄) < 0 for all γ̄ ∈ [0, ∞),
dγ̄
so that which implies that h(γ̄) is a strictly monotonically decreasing function of γ̄.
ω̄
τ = Pr(D− | T −) = = (1 + ω̄ −1 )−1 . Step d: We derive the density of τ = h(γ̄) from the density of γ̄.
1 + ω̄
We have
Step a: We show that
β̃ π
γ̄ = · ∼ F(2α̃, 2β̃). τ = (1 + γ̄/d)−1 by Step b and
α̃ 1 − π
γ̄ = h −1
(τ ) = d(1/τ − 1).
We know that π ∼ Be(α̃, β̃) and we have γ̄ = g(π) for
Thus,
β̃ y
g(y) = · .
α̃ 1 − y dg(γ̄) −1
fτ (τ ) = · fγ̄ (γ̄)
We are dealing with the same transformation function g as in step 2 of part (a), dγ̄
except that α̃ and β̃ are interchanged. By analoguous arguments as in step 2 of = d · (1 + γ̄/d)2 · fγ̄ (d(1/τ − 1)) (6.5)

=d·τ −2
· fF d(1/τ − 1) ; 2α̃, 2β̃ , (6.6)

where we have used Step c in (6.5) and Step a in (6.6).
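A small simulation sketch (not part of the printed solution) can confirm the distributional result of part a); the values of α̃ and β̃ below are arbitrary illustration values.

> ## check: for pi ~ Be(alphat, betat), gamma = alphat/betat * (1 - pi)/pi
> ## should follow an F(2*betat, 2*alphat) distribution
> set.seed(1)
> alphat <- 3
> betat <- 7
> piSim <- rbeta(1e5, alphat, betat)
> gammaSim <- alphat / betat * (1 - piSim) / piSim
> ## empirical versus theoretical quantiles (should be close)
> quantile(gammaSim, probs = c(0.05, 0.5, 0.95))
> qf(c(0.05, 0.5, 0.95), df1 = 2 * betat, df2 = 2 * alphat)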


3. Suppose the heights of male students are normally distributed with mean 180 and
unknown variance σ 2 . We believe that σ 2 is in the range [22, 41] with approximately
95% probability. Thus we assign an inverse-gamma distribution IG(38, 1110) as
prior distribution for σ 2 .

a) Verify with R that the parameters of the inverse-gamma distribution lead to a > (beta.post <- beta + 0.5*sum((heights - mu)^2))
prior probability of approximately 95% that σ 2 ∈ [22, 41]. [1] 1380

◮ We use the fact that if σ 2 ∼ IG(38, 1110), then 1/σ 2 ∼ G(38, 1110) (see The posterior distribution of σ 2 is IG(46, 1380).
Table A.2). We can thus work with the cumulative distribution function of the > library(MCMCpack)
corresponding gamma distribution in R. We are interested in the probability > # plot the posterior distribution
       > curve(dinvgamma(x, shape=alpha.post, scale=beta.post), from=15, to=50,
1 1 1 1 1 1 1 xlab=expression(sigma^2), ylab="Density", col=1, lty=1)
Pr ∈ , = Pr ≤ − Pr < . > # plot the prior distribution
σ2 41 22 σ2 22 σ2 41 > curve(dinvgamma(x, shape=alpha, scale=beta), from=15, to=50,
n=200, add=T, col=1, lty=2)
> (prior.prob <- pgamma(1/22, shape=38, rate=1110) > legend("topright",
- pgamma(1/41, shape=38, rate=1110)) c("Prior density: IG(38, 1110)",
[1] 0.9431584 "Posterior density: IG(46, 1380)"), bg="white", lty=c(2,1), col=1)

[Figure: prior density IG(38, 1110) and posterior density IG(46, 1380) of σ², produced by the plotting code above.]

b) Derive and plot the posterior density of σ² corresponding to the following data:

   183, 173, 181, 170, 176, 180, 187, 176, 171, 190, 184, 173, 176, 179, 181, 186.

   ◮ We assume that the observed heights x1 = 183, x2 = 173, . . . , x16 = 186 are
   realisations of a random sample X1:n (in particular: X1, . . . , Xn are independent)
   for n = 16. We know that

   Xi | σ² ∼ N(µ = 180, σ²),   i = 1, . . . , n.

   A priori we have σ² ∼ IG(38, 1110) and we are interested in the posterior
   distribution of σ² | x1:n. It can be easily verified, see also the last line in Table 6.2,
that c) Compute the posterior density of the standard deviation σ.

1 1
 ◮ For notational convenience, let Y = σ 2 . We know that Y ∼ IG(α = 46, β =

σ 2 | xi ∼ IG 38 + , 1110 + (xi − µ)2 . √
1380) and are interested in Z = g(Y ) = Y = σ, where g(y) = y. Thus,
2 2

As in Example 6.8, this result can be easily extended to a random sample X1:n : g −1 (z) = z 2 and
! dg −1 (z)
n = 2z.
n 1X dz
σ 2 | x1:n ∼ IG 38 + , 1110 + (xi − µ)2 .
2 2 Using the change-of-variables formula, we obtain
i=1
 
βα β
fZ (z) = |2z| (z 2 )−(α+1) exp − 2
> # parameters of the inverse gamma prior Γ(α) z
> alpha <- 38  
2β α −(2α+1) β
> beta <- 1110
= z exp − 2 .
> # prior mean of the normal distribution Γ(α) z
> mu <- 180
> # data vector This is the required posterior density of the standard deviation Z = σ for α = 46
> heights <- c(183, 173, 181, 170, 176, 180, 187, 176, and β = 1380.
171, 190, 184, 173, 176, 179, 181, 186)
> # number of observations 4. Assume that n throat swabs have been tested for influenza. We denote by X
> n <- length(heights)
> # compute the parameters of the inverse gamma posterior distribution the number of throat swabs which yield a positive result and assume that X
> (alpha.post <- alpha + n/2) is binomially distributed with parameters n and unknown probability π, so that
[1] 46 X | π ∼ Bin(n, π).
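The density of σ derived in part c) can be checked by simulation; this short sketch is an addition to the printed solution and draws σ² from the IG(46, 1380) posterior (via its gamma representation) before overlaying the derived density.

> ## derived posterior density of sigma (alpha = 46, beta = 1380)
> sigmaDens <- function(z, alpha = 46, beta = 1380)
      exp(log(2) + alpha * log(beta) - lgamma(alpha) -
          (2 * alpha + 1) * log(z) - beta / z^2)
> set.seed(1)
> sigmaSamples <- sqrt(1 / rgamma(1e5, shape = 46, rate = 1380))
> hist(sigmaSamples, prob = TRUE, breaks = 50,
       xlab = expression(sigma), main = "")
> curve(sigmaDens(x), from = min(sigmaSamples), to = max(sigmaSamples),
        add = TRUE)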

a) Determine the expected Fisher information and obtain Jeffreys’ prior. so that for the log odds η = log{π/(1 − π)}, we have
◮ As X | π ∼ Bin(n, π), the log-likelihood is given by  
n
l(π) = x log π + (n − x) log(1 − π), f (x | η) = exp(ηx){1 + exp(η)}−n .
x
so the score function is The log-likelihood is therefore
dl(π) x n−x
S(π) = = + · (−1).
dπ π 1−π l(η) = ηx − n log{1 + exp(η)},
The Fisher information turns out to be
the score function is
dS(π) x n−x
I(π) = − = 2+ .
dπ π (1 − π)2 dl(η) n
S(η) = =x− · exp(η),
Thus, the expected Fisher information is dη 1 + exp(η)

J(π) = E{I(π; X)} and the Fisher information is


   
X n−X {1 + exp(η)} · n exp(η) − n{exp(η)}2 n exp(η)
=E + E I(η) = −
dS(η)
= =
π2 (1 − π)2 dη {1 + exp(η)}2 {1 + exp(η)}2
,
E(X) n − E(X)
= +
π2 (1 − π)2 independent of x. The expected Fisher information is therefore equal to the
nπ n − nπ (observed) Fisher information,
= 2 +
π (1 − π)2
n n J(η) = E{I(η)} = I(η),
= +
π 1−π
n so Jeffreys’ prior is
= ,
π(1 − π) s
p exp(η) exp(η)1/2
as derived in Example 4.1. Jeffreys’ prior therefore is f (η) ∝ J(η) ∝ = .
p {1 + exp(η)} 2 1 + exp(η)
p(π) ∝ J(π) ∝ π −1/2 (1 − π)−1/2 ,
c) Take the prior distribution of 4a) and apply the change of variables formula to
which corresponds to the kernel of a Be(1/2, 1/2) distribution.
obtain the induced prior for η. Because of the invariance under reparameteriza-
b) Reparametrise the binomial model using the log odds η = log{π/(1−π)}, leading
tion this prior density should be the same as in part 4b).
to  
n ◮ The transformation is
f (x | η) = exp(ηx){1 + exp(η)}−n .
x  π 
g(π) = log =η
Obtain Jeffreys’ prior distribution directly for this likelihood and not with the 1−π
change of variables formula.
and we have
◮ We first deduce the reparametrisation given above:
  dg(π) 1−π 1
f (x | π) =
n x
π (1 − π)n−x = · = π −1 (1 − π)−1 , and
x dπ π (1 − π)2
   −n exp(η)
n π x 1 g −1 (η) = = π.
= 1 + exp(η)
x 1−π 1−π
   
n π x π −n
= 1+
x 1−π 1−π
    π      π −n
n
= exp log x 1 + exp log ,
x 1−π 1−π

Applying the change-of-variables formula gives because x(r+1) , . . . , x(n) are assumed to be censored at time x(r) .
The posterior distribution under Jeffreys’ prior f (λ) ∝ λ−1 is thus
dg(π) −1
f (η) = fπ (π)
dπ f (λ | x1:n ) ∝ f (λ) · L(λ)
( !)
∝ π(1 − π)π −1/2 (1 − π)−1/2 r
X
= λr−1 exp −λ x(i) + (n − r)x(r) .
= π 1/2 (1 − π)1/2
i=1
 1/2  1/2
exp(η) 1
= c) Show that the posterior is improper if all observations are censored.
1 + exp(η) 1 + exp(η)
◮ If no death has occured prior to some time c, the likelihood of λ is
exp(η)1/2
= .
1 + exp(η) L(λ) = exp(−nλc).
This density is the same as the one we received in part 4b).
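To visualise the invariance found in parts b) and c), one can plot Jeffreys' prior on both scales; this sketch is an addition to the printed solution, and the plotting range for η is chosen arbitrarily.

> ## Jeffreys' prior for pi (Be(1/2, 1/2)) and the induced kernel for eta
> par(mfrow = c(1, 2))
> curve(dbeta(x, shape1 = 1/2, shape2 = 1/2), from = 0.001, to = 0.999,
        xlab = expression(pi), ylab = "Prior density")
> etaKernel <- function(eta) exp(eta / 2) / (1 + exp(eta))  # unnormalised
> curve(etaKernel(x), from = -6, to = 6,
        xlab = expression(eta), ylab = "Prior kernel (unnormalised)")
> par(mfrow = c(1, 1))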
Using Jeffreys’ prior f (λ) ∝ λ−1 , we obtain the posterior
5. Suppose that the survival times X1:n form a random sample from an exponential
1
distribution with parameter λ. f (λ | x1:n ) ∝ exp(−ncλ).
λ
a) Derive Jeffreys’ prior for λ and show that it is improper.
This can be identified as the kernel of an improper G(0, nc) distribution.
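For part b), the gamma posterior can be visualised with a few lines of R; the survival times and the number r of observed deaths below are purely hypothetical and serve only as a sketch.

> ## posterior G(r, sum of the r smallest times + (n - r) * r-th smallest time)
> surv <- sort(c(0.5, 1.2, 2.3, 3.1, 4.0, 5.6, 7.2, 9.5))  # hypothetical data
> n <- length(surv)
> r <- 5
> ratePost <- sum(surv[1:r]) + (n - r) * surv[r]
> curve(dgamma(x, shape = r, rate = ratePost), from = 0, to = 1,
        xlab = expression(lambda), ylab = "Posterior density")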
◮ From Exercise 4a) in Chapter 4 we know the score function
6. After observing a patient, his/her LDL cholesterol level θ is estimated by a. Due
n
n X to the increased health risk of high cholesterol levels, the consequences of under-
S(λ) = − xi .
λ estimating a patient’s cholesterol level are considered more serious than those of
i=1
overestimation. That is to say that |a − θ| should be penalised more when a ≤ θ
Viewed as a random variable in X1 , . . . , Xn , its variance is
than when a > θ. Consider the following loss function parameterised in terms of
n c, d > 0:
J(λ) = Var{S(λ)} = n · Var(X1 ) = . (
λ2
−c(a − θ) if a − θ ≤ 0
Jeffreys’ prior therefore is l(a, θ) = .
d(a − θ) if a − θ > 0
p
f (λ) ∝ J(λ) ∝ λ−1 , a) Plot l(a, θ) as a function of a − θ for c = 1 and d = 3.

which cannot be normalised, since
> # loss function with argument a-theta
Z∞ > loss <- function(aMinusTheta, c, d)
0 = log(∞) − log(0) = ∞ − (−∞) = ∞,
λ−1 dλ = [log(λ)]∞ {
ifelse(aMinusTheta <= 0, - c * aMinusTheta, d * aMinusTheta)
0 }
> aMinusTheta <- seq(-3, 3, length = 101)
so f (λ) is improper. > plot(aMinusTheta, loss(aMinusTheta, c = 1, d = 3),
b) Suppose that the survival times are only partially observed until the r-th death type = "l", xlab = expression(a - theta), ylab = "loss")
such that n−r observations are actually censored. Write down the corresponding
likelihood function and derive the posterior distribution under Jeffreys’ prior.
◮ Let x(1) , . . . , x(n) denote the ordered survival times. Only x(1) , . . . , x(r) are ob-
served, the remaining survival times are censored. The corresponding likelihood
Pn
function can be derived from Example 2.8 with δ(i) = I{1,...,r} (i) and i=1 δi = r:
( r
!)
X
L(λ) = λ exp −λ
r
x(i) + (n − r)x(r) ,
i=1

[Figure: the loss function l(a, θ) plotted against a − θ for c = 1 and d = 3, produced by the code above.]

7. Our goal is to estimate the allele frequency at one bi-allelic marker, which has
   either allele A or B. DNA sequences for this location are provided for n individuals.
   We denote the observed number of allele A by X and the underlying (unknown)
   allele frequency with π. A formal model specification is then a binomial distribution
   X | π ∼ Bin(n, π) and we assume a beta prior distribution π ∼ Be(α, β) where
   α, β > 0.
   a) Derive the posterior distribution of π and determine the posterior mean and
      mode.
      ◮ We know

      X | π ∼ Bin(n, π)   and   π ∼ Be(α, β).
As in Example 6.3, we are interested in the posterior distribution π | X:
b) Compute the Bayes estimate with respect to the loss function l(a, θ).
◮ The expected posterior loss is f (π | x) ∝ f (x | π)f (π)
Z
 ∝ π x (1 − π)n−x π α−1 (1 − π)β−1
E l(a, θ) | x = l(a, θ)f (θ | x) dθ
= π α+x−1 (1 − π)β+n−x−1 ,
Za Z∞
= d(a − θ)f (θ | x) dθ + c(θ − a)f (θ | x) dθ. i.e. π | x ∼ Be(α + x, β + n − x). Hence, the posterior mean is given by (α +
−∞ a x)/(α + β + n) and the posterior mode by (α + x − 1)/(α + β + n − 2).
b) For some genetic markers the assumption of a beta prior may be restrictive
To compute the Bayes estimate
and a bimodal prior density, e. g., might be more appropriate. For example,

â = arg min E l(a, θ) | x , we can easily generate a bimodal shape by considering a mixture of two beta
a
distributions:
we take the derivative with respect to a, using Leibniz integral rule, see Ap-
pendix B.2.4. Using the convention ∞ · 0 = 0 we obtain f (π) = wfBe (π; α1 , β1 ) + (1 − w)fBe (π; α2 , β2 )

d 
Za Z∞ with mixing weight w ∈ (0, 1).
E l(a, θ) | x = d f (θ | x) dθ − c f (θ | x) dθ i. Derive the posterior distribution of π.
da
−∞ a
 ◮
= dF (a | x) − c 1 − F (a | x)
f (π | x) ∝ f (x | π)f (π)
= (c + d)F (a | x) − c. 
w
∝ π x (1 − π)n−x π α1 −1 (1 − π)β1 −1
The root of this function in a is therefore B(α1 , β1 )

 1−w
â = F −1 c/(c + d) | x , + π α2 −1 (1 − π)β2 −1
B(α2 , β2 )
w
i. e. the Bayes estimate â is the c/(c + d) · 100% quantile of the posterior distribu- = π α1 +x−1 (1 − π)β1 +n−x−1
B(α1 , β1 )
tion of θ. For c = d we obtain as a special case the posterior median. For c = 1
1−w
and d = 3 the Bayes estimate is the 25%-quantile of the posterior distribution. + π α2 +x−1 (1 − π)β2 +n−x−1 .
B(α2 , β2 )
Remark: If we choose c = 3 and d = 1 as mentioned in the Errata, then the
Bayes estimate is the 75%-quantile of the posterior distribution.
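The quantile result can be illustrated numerically; this sketch reuses the loss() function defined in part a) and assumes, purely for illustration, a N(130, 15²) posterior for θ.

> ## Bayes estimate = c/(c+d) quantile; here c = 1, d = 3, so the 25%-quantile
> qnorm(1 / (1 + 3), mean = 130, sd = 15)
> ## numerical check: minimise the expected posterior loss directly
> expLoss <- function(a, c = 1, d = 3)
      integrate(function(theta)
          loss(a - theta, c = c, d = d) * dnorm(theta, mean = 130, sd = 15),
          lower = -Inf, upper = Inf)$value
> optimize(expLoss, interval = c(80, 180))$minimum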

ii. The posterior distribution is a mixture of two familiar distributions. Identify > # Distribution function of a mixture of beta distributions with
these distributions and the corresponding posterior weights. > # weight gamma1 of the first component and parameter vectors
> # alpha und beta
◮ We have > pbetamix <- function(pi, gamma1, alpha, beta){
gamma1 * pbeta(pi, alpha[1], beta[1]) +
w B(α⋆1 , β1⋆ ) α⋆1 −1
f (π | x) ∝ (1 − π)β1 −1

π (1 - gamma1) * pbeta(pi, alpha[2], beta[2])
B(α1 , β1 ) B(α⋆1 , β1⋆ ) }
1 − w B(α⋆2 , β2⋆ ) α⋆2 −1 > # corresponding quantile function
+ (1 − π)β2 −1

π > qbetamix <- function(q, gamma1, alpha, beta){
B(α2 , β2 ) B(α⋆2 , β2⋆ ) f <- function(pi){
pbetamix(pi, gamma1, alpha, beta) - q
for }
unirootResult <- uniroot(f, lower=0, upper=1)
α⋆1 = α1 + x β1⋆ = β1 + n − x, if(unirootResult$iter < 0)
return(NA)
α⋆2 = α2 + x β2⋆ = β2 + n − x. else
return(unirootResult$root)
}
> # credibility interval with level level
Hence, the posterior distribution is a mixture of two beta distributions > credBetamix <- function(level, gamma1, alpha, beta){
halfa <- (1 - level)/2
Be(α⋆1 , β1⋆ ) and Be(α⋆2 , β2⋆ ). The mixture weights are proportional to ret <- c(qbetamix(halfa, gamma1, alpha, beta),
qbetamix(1 - halfa, gamma1, alpha, beta))
w · B(α⋆1 , β1⋆ ) (1 − w) · B(α⋆2 , β2⋆ )
γ1 = and γ2 = . return(ret)
B(α1 , β1 ) B(α2 , β2 ) }
v. Let n = 10 and x = 3. Assume an even mixture (w = 0.5) of two beta distri-
The normalized weights are γ1⋆ = γ1 /(γ1 + γ2 ) and γ2⋆ = γ2 /(γ1 + γ2 ).
butions, Be(10, 20) and Be(20, 10). Plot the prior and posterior distributions
iii. Determine the posterior mean of π.
in one figure.
◮ The posterior distribution is a linear combination of two beta distribu-
> # data
tions:
> n <- 10
f (π | x) = γ1⋆ Be(π | α⋆1 , β1⋆ ) + (1 − γ1⋆ ) Be(π | α⋆2 , β2⋆ ), > x <- 3
> #
so the posterior mean is given by > # parameters for the beta components
> a1 <- 10
E(π | x) = γ1⋆ E(π | α⋆1 , β1⋆ ) + (1 − γ1⋆ ) E(π | α⋆2 , β2⋆ ) > b1 <- 20
> a2 <- 20
α⋆1 α⋆2 > b2 <- 10
= γ1⋆ ⋆ + (1 − γ1 ) ⋆

.
α⋆1 + β1 α2 + β2⋆ > #
> # weight for the first mixture component
iv. Write an R-function that numerically computes the limits of an equi-tailed > w <- 0.5
> #
credible interval. > # define a function that returns the density of a beta mixture
◮ The posterior distribution function is > # with two components
> mixbeta <- function(x, shape1a, shape2a, shape1b, shape2b, weight){
F (π | x) = γ1∗ F (π | α∗1 , β1∗ ) + (1 − γ1∗ )F (π | α∗2 , β2∗ ).
y <- weight * dbeta(x, shape1=shape1a, shape2=shape2a) +
(1-weight)* dbeta(x, shape1=shape1b, shape2=shape2b)
return(y)
The equi-tailed (1 − α)-credible interval is therefore }
> #
[F −1 (α/2 | x), F −1 (1 − α/2 | x)], > # plot the prior density
> curve(mixbeta(x, shape1a=a1, shape2a=b1, shape1b=a2, shape2b=b2,weight=w),
from=0, to=1, col=2, ylim=c(0,5), xlab=expression(pi), ylab="Density")
i. e. we are looking for arguments of π where the distribution function has > #
the values α/2 and (1 − α/2), respectively. > # parameters of the posterior distribution
> a1star <- a1 + x

> b1star <- b1 + n - x a) Derive the posterior density f (π | x). Which distribution is this and what are its
> a2star <- a2 + x parameters?
> b2star <- b2 + n - x
> # ◮ We have
> # the posterior weights are proportional to
> gamma1 <- w*beta(a1star,b1star)/beta(a1,b1) f (π | x) ∝ f (x | π)f (π)
> gamma2 <- (1-w)*beta(a2star,b2star)/beta(a2,b2)
∝ π r (1 − π)x−r π α−1 (1 − π)β−1
> #
> # calculate the posterior weight = π α+r−1 (1 − π)β+x−r−1 .
> wstar <- gamma1/(gamma1 + gamma2)
> # This is the kernel of a beta distribution with parameters α̇ = α + r and
> # plot the posterior distribution
> curve(mixbeta(x, shape1a=a1star, shape2a=b1star, β̇ = β + x − r, that is π | x ∼ Be(α + r, β + x − r).
shape1b=a2star, shape2b=b2star,weight=wstar), b) Define conjugacy and explain why, or why not, the beta prior is conjugate with
from=0, to=1, col=1, add=T) respect to the negative binomial likelihood.
> #
> legend("topright", c("Prior density", "Posterior density"), ◮ Definition of conjugacy (Def. 6.5): Let L(θ) = f (x | θ) denote a likelihood
col=c(2,1), lty=1, bty="n") function based on the observation X = x. A class G of distributions is called
5
Prior density
conjugate with respect to L(θ) if the posterior distribution f (θ | x) is in G for all
Posterior density x whenever the prior distribution f (θ) is in G.
4
The beta prior is conjugate with respect to the negative binomial likelihood since
3 the resulting posterior distribution is also a beta distribution.
Density

c) Show that the expected Fisher information is proportional to π −2 (1 − π)−1 and


2 derive therefrom Jeffreys’ prior and the resulting posterior distribution.
◮ The log-likelihood is
1

l(π) = r log(π) + (x − r) log(1 − π).


0

0.0 0.2 0.4 0.6 0.8 1.0


Hence,
dl(π) r x−r
π
S(π) = = − and
dπ π 1−π
8. The negative binomial distribution is used to represent the number of trials, x, 2
d l(π) r x−r
I(π) = − = 2+ ,
needed to get r successes, with probability π of success in any one trial. Let X be dπ 2 π (1 − π)2
negative binomial, X | π ∼ NBin(r, π), so that
which implies
 
x−1 r J(π) = E(I(π; X))
f (x | π) = π (1 − π)x−r ,
r−1
r E(X) − r
= +
with 0 < π < 1, r ∈ N and support T = {r, r + 1, . . . }. As a prior distribution π2 (1 − π)2
r
assume π ∼ Be(α, β), r −r
= 2+ π
π (1 − π)2
f (π) = B(α, β)−1 π α−1 (1 − π)β−1 , r(1 − π)2 + rπ(1 − π)
=
π 2 (1 − π)2
with α, β > 0. r
= 2 ∝ π −2 (1 − π)−1 .
π (1 − π)
p
Hence Jeffreys’ prior is given by J(π) ∝ π −1 (1 − π)−1/2 , which corresponds to
a Be(0, 0.5) distribution and is improper.
By part (a), the posterior distribution is therefore π | x ∼ Be(r, x − r + 0.5).
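As a small addition, the Be(r, x − r + 0.5) posterior under Jeffreys' prior can be plotted for hypothetical data, say r = 5 successes after x = 20 trials.

> r <- 5
> xobs <- 20                       # hypothetical data
> aPost <- r
> bPost <- xobs - r + 0.5
> curve(dbeta(x, aPost, bPost), from = 0, to = 1,
        xlab = expression(pi), ylab = "Posterior density")
> ## posterior mean compared with the MLE r / x
> c(postMean = aPost / (aPost + bPost), mle = r / xobs)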

9. Let X1:n denote a random sample from a uniform distribution on the interval [0, θ] a) Derive the posterior distribution of λ1 and λ2 . Plot these in R for comparison.
with unknown upper limit θ. Suppose we select a Pareto distribution Par(α, β) ◮ We first derive Jeffreys’ prior for λi , i = 1, 2:
with parameters α > 0 and β > 0 as prior distribution for θ, cf . Table A.2 in
f (di | λi Yi ) ∝ (λi Yi )di exp(−λi Yi ) ∝ (λi )di exp(−λi Yi ),
Section A.5.2.
l(λi ) ∝ di log(λi ) − λi ,
a) Show that T (X1:n ) = max{X1 , . . . , Xn } is sufficient for θ.
l′ (λi ) ∝ di /λi − 1,
◮ This was already shown in Exercise 6 of Chapter 2 (see the solution there).
b) Derive the posterior distribution of θ and identify the distribution type. I(λi ) = −l′′ (λi ) ∝ di λ−2
i ,

◮ The posterior distribution is also a Pareto distribution since for t = J(λi ) = E(Di )λ−2
i ∝ λi λ−2
i = λ−1
i .
max{x1 , . . . , xn }, we have p
Thus, f (λi ) ∝ J(λi ) ∝ λi , which corresponds to the improper G(1/2, 0)
−1/2

f (θ | x1:n ) ∝ f (x1:n | θ)f (θ) distribution (compare to Table 6.3 in the book). This implies
1 1 f (λi | di ) ∝ f (di | λi Yi )f (λi )
∝ n I[0,θ] (t) · α+1 I[β,∞) (θ)
θ θ
∝ (λi )di exp(−λi Yi )λi
−1/2
1
= (α+n)+1 I[max{β,t},∞) (θ),
θ = (λi )di +1/2−1 exp(−λi Yi ),
that is θ | x1:n ∼ Par(α + n, max{β, t}).
Thus, the Pareto distribution is conjugate with respect to the uniform likelihood which is the density of the G(di + 1/2, Yi ) distribution (compare to Table 6.2).
function. Consequently,
c) Determine posterior mode Mod(θ | x1:n ), posterior mean E(θ | x1:n ), and the gen-
λ1 | D1 = 17 ∼ G(17 + 1/2, Y1 ) = G(17.5, 2768.9),
eral form of the 95% HPD interval for θ.
◮ The formulas for the mode and mean of the Pareto distribution are listed in λ2 | D2 = 28 ∼ G(28 + 1/2, Y2 ) = G(28.5, 1857.5).
Table A.2 in the Appendix. Here, we have > # the data is:
> # number of person years in the non-exposed and the exposed group
Mod(θ | x1:n ) = max{β, t} = max{β, x1 , . . . , xn } > y <- c(2768.9, 1857.5)
> # number of cases in the non-exposed and the exposed group
and > d <- c(17, 28)
(α + n) max{β, t} > #
E(θ | x1:n ) = , > # plot the gamma densities for the two groups
α+n−1 > curve(dgamma(x, shape=d[1]+0.5, rate=y[1]),from=0,to=0.05,
where the condition α + n > 1 is satisfied for any n ≥ 1 as α > 0. ylim=c(0,300), ylab="Posterior density", xlab=expression(lambda))
> curve(dgamma(x, shape=d[2]+0.5, rate=y[2]), col=2, from=0,to=0.05, add=T)
Since the density f (θ | x1:n ) equals 0 for θ < max{β, t} and is strictly monotoni- > legend("topright", c("Non-exposed", "Exposed"), col=c(1,2), lty=1, bty="n")
cally decreasing for θ ≥ max{β, t}, the 95% HPD interval for θ has the form
300
Non−exposed
Exposed
[max{β, t}, q], 250

Posterior density
where q is the 95%-quantile of the Par(α + n, max{β, t}) distribution. 200
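These posterior summaries are easy to compute in R; the sample and the prior parameters below are hypothetical, and qpareto is a small helper (the Pareto quantile function) written only for this sketch.

> ## posterior Par(alpha + n, max(beta, max(x))) and its summaries
> xSample <- c(2.1, 4.5, 3.3, 4.9, 1.7)   # hypothetical sample from U(0, theta)
> alpha <- 1
> beta <- 1
> n <- length(xSample)
> aPost <- alpha + n
> bPost <- max(beta, max(xSample))
> qpareto <- function(p, a, b) b * (1 - p)^(-1 / a)
> c(mode = bPost,
    mean = aPost * bPost / (aPost - 1),
    hpdLower = bPost,
    hpdUpper = qpareto(0.95, aPost, bPost))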

150
10.We continue Exercise 1 in Chapter 5, so we assume that the number of IHD cases
is Di | λi ∼ Po(λi Yi ), i = 1, 2, where λi > 0 is the group-specific incidence rate. We
ind
100

use independent Jeffreys’ priors for the rates λ1 and λ2 . 50

0.00 0.01 0.02 0.03 0.04 0.05

λ

b) Derive the posterior distribution of the relative risk θ = λ2 /λ1 as follows: which is a beta prime distribution with parameters α2 and α1 .
i. Derive the posterior distributions of τ1 = λ1 Y1 and τ2 = λ2 Y2 . ◮
◮ From Appendix A.5.2 we know that if X ∼ G(α, β), then c · X ∼ Z∞
G(α, β/c). Therefore, we have τ1 ∼ G(1/2 + d1 , 1) and τ2 ∼ G(1/2 + d2 , 1). f (η1 ) = f (η1 , η2 )dη2
ii. An appropriate multivariate transformation of τ = (τ1 , τ2 )⊤ to work with is 0

g(τ ) = η = (η1 , η2 )⊤ with η1 = τ2 /τ1 and η2 = τ2 + τ1 to obtain the joint 1


Z∞

density fη (η) = fτ g −1 (η) (g −1 )′ (η) , cf . Appendix A.2.3. = η α2 −1 (1 + η1 )−α1 −α2 exp(−η2 )η2α1 +α2 −1 dη2
Γ(α1 )Γ(α2 ) 1
◮ We first determine the inverse transformation g −1 (η) by solving the two 0
1
equations = η α2 −1 (1 + η1 )−α1 −α2 Γ(α1 + α2 )
Γ(α1 )Γ(α2 ) 1
τ2
η1 = η 2 = τ2 + τ1 ; η1α2 −1 (1 + η1 )−α1 −α2
τ1 = ,
B(α1 , α2 )
for τ1 and τ2 , which gives
which is a beta prime distribution β ′ (α2 , α1 ) with parameters α2 and α1 .
 ⊤ The beta prime distribution is a scaled F distribution. More preciseliy, if
 η2 η1 η2
g −1 (η1 , η2 )⊤ = , . X ∼ β ′ (α2 , α1 ), then α1 /α2 · X ∼ F(2α2 , 2α1 ). We write this as X ∼
1 + η1 1 + η1
α2 /α1 × F (2α2 , 2α1 ).
The Jacobian matrix of g −1 is iv. From this distribution of τ2 /τ1 , the posterior distribution of λ2 /λ1 is then
! easily found.
  −η2 1
−1 ′ (1+η1 )2 1+η1 ◮ We know that
J := g (η1 , η2 )

= η2 η1
,
(1+η1 )2 1+η1 τ2 λ2 Y2
= ∼ β ′ (α2 , α1 ),
τ1 λ1 Y1
and its determinant is
η2
det(J) = − . so
(η1 + 1)2
λ2 Y1 2768.9
For ease of notation, let α1 = 1/2 + d1 and α2 = 1/2 + d2 and note that fτ ∼ × β ′ (α2 , α1 ) = × β ′ (28.5, 17.5).
λ1 Y2 1857.5
is a product of two gamma densities. Thus,
The relative risk θ = λ2 /λ1 follows a 2768.9/1857.5 × β ′ (28.5, 17.5) distribu-
  η2
f (η) = fτ g −1 (η1 , η2 )⊤ tion, which corresponds to a
(1 + η1 )2
  α1 −1 Y1 (1/2 + d2 ) 78913.65
1 η2 η2 × F(1 + 2d2 , 1 + 2d1 ) = × F(57, 35) (6.7)
= exp − Y2 (1/2 + d1 ) 32506.25
Γ(α1 ) 1 + η1 1 + η1
  α2 −1 distribution.
1 η1 η2 η1 η2 η2
· exp − c) For the given data, compute a 95% credible interval for θ and compare the
Γ(α2 ) 1 + η1 1 + η1 (1 + η1 )2
results with those from Exercise 1 in Chapter 5.
1
= exp(−η2 )η1α2 −1 η2α1 +α2 −1 (1 + η1 )−α1 −α2 . ◮ We determine an equi-tailed credible interval. Since quantiles are invariant
Γ(α1 )Γ(α2 )
with respect to monotone one-to-one transformations, we can first determine the
iii. Since η1 = τ2 /τ1 is the parameter of interest, integrate η2 out of fη (η) and quantiles of the F distribution in (6.7) and then transform them accordingly:
show that the marginal density is > # compute the 2.5 %- and the 97.5 %-quantile of the beta prime distribution
> # obtained in part (b)
η α2 −1 (1 + η1 )−α1 −α2
f (η1 ) = 1 , > c <- y[1]*(0.5+ d[2])/(y[2]* (0.5+ d[1]))
B(α1 , α2 ) > lower.limit <- c*qf(0.025, df1=2*(0.5+d[2]), df2=2*(0.5+d[1]) )
> upper.limit <- c*qf(0.975, df1=2*(0.5+d[2]), df2=2*(0.5+d[1]) )
> cred.int <- c(lower.limit, upper.limit)
> cred.int

[1] 1.357199 4.537326 To obtain the expression for f (N | xn ) given in the exercise, we thus have to show
that    −1
An equi-tailed credible interval for θ is thus [1.357, 4.537]. X∞
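As an additional check (not in the printed solution), the interval can be confirmed by Monte Carlo, sampling the two rates from the gamma posteriors derived in part a).

> set.seed(1)
> lambda1Sim <- rgamma(1e5, shape = 17.5, rate = 2768.9)
> lambda2Sim <- rgamma(1e5, shape = 28.5, rate = 1857.5)
> ## should be close to [1.357, 4.537]
> quantile(lambda2Sim / lambda1Sim, probs = c(0.025, 0.975))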
(N − n)! n − 1 xn (xn − n)!
In Exercise 1 in Chapter 5, we have obtained the confidence interval [0.296, 1.501] = n! = . (6.8)
N! xn n (n − 1)(xn − 1)!
N =xn
for log(θ). Transforming the limits of this interval with the exponential function
gives the confidence interval [1.344, 4.485] for θ, which is quite similar to the To this end, note that
credible interval obtained above. The credible interval is slightly wider and shifted ∞ k
X (N − n)! X (N − n)!
towards slightly larger values than the confidence interval. = lim and
N! k→∞ N!
N =xn N =xn
11.Consider Exercise 10 in Chapter 3. Our goal is now to perform Bayesian inference
k
with an improper discrete uniform prior for the unknown number N of beds:
X (N − n)! (xn − n)! (k − (n − 1))!
= − , (6.9)
N! (n − 1)(xn − 1)! k!(n − 1)
N =xn
f (N ) ∝ 1 for N = 2, 3, . . .
where (6.9) can be shown easily by induction on k ≥ xn (and can be deduced
a) Why is the posterior mode equal to the MLE?
by using the software Maxima, for example). Now, the second term in on the
◮ This is due to Result 6.1: The posterior mode Mod(N | xn ) maximizes the
right-hand side of (6.9) converges to 0 as k → ∞ since
posterior distribution, which is proportional to the likelihood function under a
uniform prior: (k − (n − 1))! 1 1
= · ,
f (N | xn ) ∝ f (xn | N )f (N ) ∝ f (xn | N ). k!(n − 1) k(k − 1) · · · (k − (n − 1) + 1) n − 1

Hence, the posterior mode must equal the value that maximizes the likelihood which impies (6.8) and completes the proof.
function, which is the MLE. In Exercise 10 in Chapter 3, we have obtained c) Show that the posterior expectation is
N̂ML = Xn . n−1
b) Show that for n > 1 the posterior probability mass function is E(N | xn ) = · (xn − 1) for n > 2.
n−2
  −1
n − 1 xn N
f (N | xn ) = , for N ≥ xn . ◮ We have
xn n n

X
◮ We have E(N | xn ) = N f (N | xn )
f (xn | N )f (N ) f (xn | N ) N =0
f (N | xn ) = ∝ .   X
f (xn ) f (xn )

n − 1 xn (N − n)!
= n! (6.10)
From Exercise 10 in Chapter 3, we know that xn n (N − 1)!
N =xn
  −1
xn − 1 N and to determine the limit of the involved series, we can use (6.8) again:
f (xn | N ) = for N ≥ xn .
n−1 n
∞ ∞
X (N − n)! X (N − (n − 1))! (xn − n)!
Next, we derive the marginal likelihood f (xn ): = = .
(N − 1)! N! (n − 2)(xn − 2)!
N =xn N =xn −1

X
f (xn ) = f (xn | N ) Plugging this result into expression (6.10) yields the claim.
N =1 d) Compare the frequentist estimates from Exercise 10 in Chapter 3 with the pos-
∞  −1
X xn − 1 N terior mode and mean for n = 48 and xn = 1812. Numerically compute the
=
n−1 n associated 95% HPD interval for N .
N =xn
  X ∞ ◮ The unbiased estimator from Exercise 10 is
xn − 1 (N − n)!
= n! .
n−1 N! n+1
N =xn N̂ = xn − 1 = 1848.75,
n

which is considerably larger than the MLE and posterior mode xn = 1812. The a) Show that the marginal distribution of Xi is beta-binomial, see Appendix A.5.1
posterior mean for details. The first two moments of this distribution are
n−1
E(N | xn ) = · (xn − 1) = 1850.37 α
n−2 µ1 = E(Xi ) = m , (6.11)
α+β
is even larger than N̂ . We compute the 95% HPD interval for N in R: α{m(1 + α) + β}
µ2 = E(Xi2 ) = m . (6.12)
> # the data is (α + β)(1 + α + β)
> n <- 48 Pn
> x_n <- 1812 Solve for α and β using the sample moments µ b1 = n−1 i=1 xi , µ b2 =
P
n−1 i=1 x2i to obtain estimates of α and β.
> # n
> # compute the posterior distribution for a large enough interval of N values
> N <- seq(from = x_n, length = 2000) ◮ The marginal likelihood of Xi can be found by integrating πi out in the joint
> posterior <- exp(log(n - 1) - log(x_n) + lchoose(x_n, n) - lchoose(N, n)) distribution of Xi and πi :
> plot(N, posterior, type = "l", col=4,
ylab = "f(N | x_n)") Z1  
m xi 1
> # we see that this interval is large enough f (xi ) = π (1 − πi )m−xi · π α−1 (1 − πi )β−1 dπi
> # xi i B(α, β) i
> # the posterior density is monotonically decreasing for values of N >= x_n 0
 
> # hence, the mode x_n is the lower limit of of the HPD interval m B(α + xi , β + m − xi )
> level <- 0.95 =
> hpdLower <- x_n xi B(α, β)
> # Z1
1
> # we next determine the upper limit
· π α+xi −1 (1 − πi )β+m−xi −1 dπi
> # since the posterior density is monotonically decreasing B(α + xi , β + m − xi ) i
> # the upper limit is the smallest value of N for 0
 
> # which the cumulative posterior distribution function is larger or equal to 0.95 m B(α + xi , β + m − xi )
> cumulatedPosterior <- cumsum(posterior) = ,
> hpdUpper <- min(N[cumulatedPosterior >= level]) xi B(α, β)
> #
> # add the HPD interval to the figure which is known as a beta-binomial distribution.
> abline(v = c(hpdLower, hpdUpper), col=2) To derive estimates for α and β, we first solve the two given equations for α and
β and then we replace µ1 and µ2 by the corresponding sample moments µ c1 and
0.025
c
µ2 .
0.020 First, we combine the two equations by replacing part of the expression on the
right-hand side of Equation (6.12) by µ1 , which gives
f(N | x_n)

0.015
m(1 + α) + β
µ2 = µ1 · .
0.010
1+α+β
0.005 We solve the above equation for β to obtain
0.000 (1 + α)(mµ1 − µ2 )
β= . (6.13)
2000 2500 3000 3500 µ2 − µ1
N Solving Equation (6.11) for α yields
Thus, the HPD interval is [1812, 1929]. α=β·
µ1
.
m − µ1
12.Assume that X1 , . . . , Xn are independent samples from the binomial models
Bin(m, πi ) and assume that πi ∼ Be(α, β). Compute empirical Bayes estimates
iid Next, we plug this expression for α into Equation (6.13) and solve the resulting
π̂i of πi as follows: equation    
1 mµ21 − µ1 µ2
β= β + mµ1 − µ2
µ2 − µ1 m − µ1

for β to obtain
\[
\beta = \frac{(m\mu_1 - \mu_2)(m - \mu_1)}{m\{\mu_2 - \mu_1(\mu_1 + 1)\} + \mu_1^2}
\]
and consequently
\[
\alpha = \beta \cdot \frac{\mu_1}{m - \mu_1} = \frac{(m\mu_1 - \mu_2)\,\mu_1}{m\{\mu_2 - \mu_1(\mu_1 + 1)\} + \mu_1^2}.
\]
The estimators for α and β are therefore
\[
\hat{\alpha} = \frac{(m\hat{\mu}_1 - \hat{\mu}_2)\,\hat{\mu}_1}{m\{\hat{\mu}_2 - \hat{\mu}_1(\hat{\mu}_1 + 1)\} + \hat{\mu}_1^2}
\quad \text{and} \quad
\hat{\beta} = \frac{(m\hat{\mu}_1 - \hat{\mu}_2)(m - \hat{\mu}_1)}{m\{\hat{\mu}_2 - \hat{\mu}_1(\hat{\mu}_1 + 1)\} + \hat{\mu}_1^2}.
\]

7 Model selection

1. Derive Equation (7.18).
   ◮ Since the normal prior is conjugate to the normal likelihood with known variance
   (compare to Table 7.2), we can avoid integration and use Equation (7.16) instead
   to compute the marginal distribution:

b) Now derive the empirical Bayes estimates π̂i . Compare them with the corre- f (x | µ)f (µ)
f (x | M1 ) = .
sponding MLEs. f (µ | x)
b +xi , βb+m−xi ) distribution
◮ By Example 6.3, the posterior πi |xi has a Beta(α We know that
and the empirical Bayes estimate is thus the posterior mean
x | µ ∼ N(µ, κ−1 ), µ ∼ N(ν, δ −1 )
b + xi
α
πbi = E[πi |xi ] = .
b + βb + m
α and in Example 6.8, we have derived the posterior distribution of µ:
 
For comparison, the maximum likelihood estimate is π̂i ML = xi /m . Hence, the nκx̄ + δν
b = βb = 0, which corresponds µ|x ∼ N , (nκ + δ)−1 .
Bayes estimate is equal to the MLE if and only if α nκ + δ
to an improper prior distribution. In general, the Bayes estimate
Consequently, we obtain
b + xi
α b + βb
α b
α m xi Pn  
= · + · (2πκ−1 )−n/2 exp −κ/2 i=1 (xi − µ)2 · (2πδ −1 )−1/2 exp −δ/2(µ − ν)2
b + βb + m
α b + βb + m α
α b + βb αb + βb + m m f (x | M1 ) = 
(2π(nκ + δ) )
−1 −1/2 exp −(nκ + δ)/2(µ − nκx̄+δν
nκ+δ
)2
b/(α
is a weighted average of the prior mean α b + β)
b and the MLE xi /m. The  κ  n2  δ  12
weights are proportional to the prior sample size m0 = α
b + βb and the data =
2π nκ + δ
sample size m, respectively. " n
!
1 X
· exp − κ xi − 2nx̄µ + mµ + δ(µ2 − 2µν + ν 2 )
2 2
2
i=1
 
2 (nκx̄ + δν)2
− (nκ + δ)µ − 2µ(nκx̄ + δν) +
nκ + δ
" !#
 κ  n2  δ  12 κ X 2 δν 2
n
(nκx̄ + δν)2
= · exp − xi + −
2π nκ + δ 2 κ κ(nκ + δ)
i=1
  1
" ( )#
 κ  n2 δ 2
κ X
n

2 2
= exp − (xi − x̄) + (x̄ − ν) .
2π nκ + δ 2 nκ + δ
i=1
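A quick numerical sanity check of the formula just derived (an addition to the printed solution): the closed form is compared with direct integration over µ, using arbitrary illustration values for the data, κ, ν and δ.

> margLikClosed <- function(x, kappa, nu, delta){
      n <- length(x)
      xbar <- mean(x)
      (kappa / (2 * pi))^(n / 2) * sqrt(delta / (n * kappa + delta)) *
          exp(- kappa / 2 * (sum((x - xbar)^2) +
                             n * delta / (n * kappa + delta) * (xbar - nu)^2))
  }
> margLikNumeric <- function(x, kappa, nu, delta){
      integrand <- function(mu)
          sapply(mu, function(m)
              prod(dnorm(x, mean = m, sd = 1 / sqrt(kappa))) *
                  dnorm(m, mean = nu, sd = 1 / sqrt(delta)))
      integrate(integrand, lower = -Inf, upper = Inf)$value
  }
> xTest <- c(1.2, 0.7, 2.1, 1.5)
> margLikClosed(xTest, kappa = 2, nu = 0, delta = 0.5)
> margLikNumeric(xTest, kappa = 2, nu = 0, delta = 0.5)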

2. Let Yi ∼ N(µi , σ 2 ), i = 1, . . . , n, be the response variables in a normal regression


ind
which implies that rankings of different normal regression models with respect to
model , where the variance σ 2 is assumed known and the conditional means are AIC and Cp , respectively, will be the same.
µi = x⊤i β. The design vectors xi and the coefficient vector β have dimension p and c) Now assume that σ 2 is unknown as well. Show that AIC is given by
are defined as for the logistic regression model (Exercise 17 in Chapter 5). 2
AIC = n log(σ̂ML ) + 2p + n + 2.
a) Derive AIC for this normal regression model.
◮ Due to independence of Yi , i = 1, . . . , n, the likelihood function is
n
Y ◮ The log-likelihood function is the same as in part (a). Solving the score
L(y1:n ; β) = L(yi ; β) equation 
i=1 ∂ l y1:n ; (β, σ 2 )T
n   =0
Y 1 1 2 ∂σ 2
= exp − (y − µ )
(2πσ 2 )1/2 2σ 2
i i
gives
i=1 n
" # 2 1X 2
1 1 X
n
2 σ̂ML = (yi − x⊤ i β̂ ML ) .
= exp − y − x ⊤
β n
(2πσ 2 )n/2 2σ 2
i i i=1
i=1
Plugging this estimate into the log-likelihood function yields
and thus the log-likelihood function is   n
2 T 2
l y1:n ; (β̂ ML , σ̂ML ) = −n log(σ̂ML )−
2
n
n 1 X 2
l(y1:n ; β) = − log(2πσ 2 ) − 2 yi − x⊤
i β .
2 2σ up to some additive constant. In this model with unknown σ 2 , there are p + 1
i=1
parameters. Consequently,
The coefficient vector β has dimension p, i. e. we have p parameters. Let β̂ML
 
denote the ML estimate so that µ̂i = x⊤
i β̂ ML . Consequently, up to some constant,
2 T
AIC = −2 l y1:n ; (β̂ML , σ̂ML ) + 2(p + 1)
we obtain
2
  = n log(σ̂ML ) + 2p + n + 2.
AIC = −2 l y1:n ; β̂ ML + 2 p
3. Repeat the analysis of Example 7.7 with unknown variance κ−1 using the conjugate
1 X 2
n
=n+ 2 yi − x⊤
i β̂ ML + 2p normal-gamma distribution (see Example 6.21) as prior distribution for κ and µ.
σ
i=1 a) First calculate the marginal likelihood of the model by using the rearrangement
n
1 X of Bayes’ theorem in (7.16).
= 2 (yi − µ̂i )2 + 2 p + n.
σ
i=1 ◮ We know that for θ = (µ, κ)T , we have
Remark: As the ML estimate is the least squares estimate in the normal regres- X | θ ∼ N(µ, κ−1 ) and θ ∼ NG(ν, λ, α, β)
sion model, we have
β̂ML = (X ⊤ X)−1 X ⊤ y and in Example 6.21, we have seen that the posterior distribution is also a
normal-gamma distribution:
for the n × p matrix X = (x1 , . . . , xn ) and the n-dimensional vector y =
(y1 , . . . , yn )⊤ . θ | x1:n ∼ NG(ν ∗ , λ∗ , α∗ , β ∗ ),
b) Mallow’s Cp statistic
SS where
Cp = 2 + 2p − n
s
Pn
is often used to assess the fit of a regression model. Here SS = i=1 (yi − µ̂i )2 is ν ∗ = (λ + n)−1 (λν + nx̄),
the residual sum of squares and σ̂ML 2
is the MLE of the variance σ 2 . How does λ∗ = λ + n,
AIC relate to Cp ? α∗ = α + n/2,
◮ Up to some multiplicative constant, we have
nσˆ2 ML + (λ + n)−1 nλ(ν − x̄)2
β∗ = β + .
AIC = Cp + 2n, 2

Thus, Next, we store the alcohol concentration data given in Table 1.3 and use the given
n
! priori parameters to compute the marginal likelihoods of the the two models M1
κ X
f (x | θ) = (2π/κ)−n/2 exp − (xi − µ)2 , and M2 :
2
i=1 > ## store the data
 
βα λκ > (alcoholdata <-
f (θ) = (2π(λκ)−1 )−1/2 κα−1 exp(−βκ) · exp − (µ − ν)2 , data.frame(gender = c("Female", "Male", "Total"),
Γ(α) 2 n = c(33, 152, 185),
∗α
∗  ∗  mean = c(2318.5, 2477.5, 2449.2),
β λ κ ∗ 2
f (θ | x) = (2π(λ κ) )
∗ −1 −1/2 α∗ −1
κ exp(−β ∗
κ) · exp − (µ − ν ) . sd = c(220.1, 232.5, 237.8))
Γ(α∗ ) 2 )
gender n mean sd
Note that it is important to include the normalizing constants of the above den- 1 Female 33 2318.5 220.1
sities in the following calculation to get 2 Male 152 2477.5 232.5
3 Total 185 2449.2 237.8
f (x | θ)f (θ)
f (x | M1 ) = > attach(alcoholdata)
f (θ | x) > ##
− 12 1/2
> ## priori parameter
(2π)− 2 Γ(α) (2π)
βα
n
λ > nu <- 2000
= (β ∗ )α
∗ 1 > lambda <- 5
(2π)− 2 (λ∗ )1/2
Γ(α∗ ) > alpha <- 1
   n1/2 > beta <- 50000
1 λ 2
Γ(α + n/2)β α
= · > ##
2π λ+n Γ(α) > ## vector to store the marginal likelihood values
 2
 2)
−(α+ n > logMargLik <- numeric(2)
nσ̂ML + (λ + n)−1 nλ(ν − x̄)2
β+
> ## compute the marginal log-likelihood of model M_1
. (7.1)
2 > ## use the accumulated data for both genders
> logMargLik[1] <-
b) Next, calculate explicitly the posterior probabilities of the four (a priori equally marginalLikelihood(n = n[3], mean = mean[3], var = sd[3]^2,
nu, lambda, alpha, beta,
probable) models M1 to M3 using a NG(2000, 5, 1, 50 000) distribution as prior log = TRUE)
for κ and µ. > ##
◮ We work with model M1 and M2 only, as there is no model M3 specified in > ## compute the marginal log-likelihood of model M_2
> ## first compute the marginal log-likelihoods for the
Example 7.7. > ## two groups (female and male)
We first implement the marginal (log-)likelihood derived in part (a): > ## the marginal log-likelihood of model M_2 is the sum
> ## of these two marginal log-likelihoods
> ## marginal (log-)likelihood in the normal-normal-gamma model,
> logMargLikFem <-
> ## for n realisations with mean "mean" and MLE estimate "var"
marginalLikelihood(n = n[1], mean = mean[1], var = sd[1]^2,
> ## for the variance
nu, lambda, alpha, beta,
> marginalLikelihood <- function(n, mean, var, # data
log = TRUE)
nu, lambda, alpha, beta, # priori parameter
> logMargLikMale <-
log = FALSE # should log(f(x))
marginalLikelihood(n = n[2], mean = mean[2], var = sd[2]^2,
# or f(x) be returned?
nu, lambda, alpha, beta,
)
log = TRUE)
{
> logMargLik[2] <- logMargLikFem + logMargLikMale
betaStar <- beta + > logMargLik
(n * var + n * lambda * (nu - mean)^2 / (lambda + n)) / 2
[1] -1287.209 -1288.870
logRet <- - n/2 * log(2 * pi) + 1/2 * log(lambda / (lambda + n)) +
lgamma(alpha + n/2) - lgamma(alpha) + alpha * log(beta) -
(alpha + n/2) * log(betaStar)
if(log)
return(logRet)
else
return(exp(logRet))
}

Hence, the marginal likelihood of model M1 is larger than the marginal likelihood [1] 0.952 0.048
of model M2 . [1] 0.892 0.108
[1] 0.864 0.136
For equally probable models M1 and M2 , we have [1] 0.936 0.064
> ## vary beta
f (x | Mi )
Pr(Mi | x) = P2 > for(betaNew in c(10000, 30000, 70000, 90000))
j=1 f (x | Mj )
print(modelcomp(nu, lambda, alpha, betaNew))
[1] 0.931 0.069
f (x | Mi )/c [1] 0.863 0.137
= P2 ,
j=1 f (x | Mj )/c
[1] 0.839 0.161
[1] 0.848 0.152

where the expansion with the constant c−1 ensures that applying the implemented The larger ν is chosen the smaller the posterior probability of the simpler model
exponential function to the marginal likelihood values in the range of −1290 does M1 becomes. In contrast, larger values of λ favour the simpler model M1 . The
not return the value 0. Here we use log(c) = min{log(f (x | M1 ), log(f (x | M2 ))}: choice of α and β only has a small influence on the posterior model probabilities.
> const <- min(logMargLik) 4. Let X1:n be a random sample from a normal distribution with expected value µ and
> posterioriProb <- exp(logMargLik - const)
> (posterioriProb <- posterioriProb / sum(posterioriProb)) known variance κ−1 , for which we want to compare two models. In the first model
[1] 0.8403929 0.1596071 (M1 ) the parameter µ is fixed to µ = µ0 . In the second model (M2 ) we suppose
Thus, given the data above, model M1 is more likely than model M2 , i. e. the that the parameter µ is unknown with prior distribution µ ∼ N(ν, δ −1 ), where ν
model using the same transformation factors for both genders is much more likely and δ are fixed.
than the model using the different transformation factors for women and men, a) Determine analytically the Bayes factor BF12 of model M1 compared to model
respectively. M2 .
c) Evaluate the behaviour of the posterior probabilities depending on varying pa- ◮ Since there is no unknown parameter in model M1 , the marginal likelihood
rameters of the prior normal-gamma distribution. of this model equals the usual likelihood:
◮ The R-code from part (b) to compute the posterior probabilities can be used !
 κ  n2 κX
n
2
to define a function modelcomp, which takes the four priori parameters as argu- f (x | M1 ) = exp − (xi − µ0 ) .
2π 2
ments and returns the posterior probabilities of model M1 and M2 (rounded to i=1

three decimals). We will vary one parameter at a time in the following: The marginal likelihood of model M2 is given in Equation (7.18) in the book.
> ## given parameters The Bayes factor BF12 is thus
> modelcomp(nu, lambda, alpha, beta)
[1] 0.84 0.16 f (x | M1 )
B12 =
> ## vary nu f (x | M2 )
> for(nuNew in c(1900, 1950, 2050, 2100))  1 ( n n
!)
print(modelcomp(nuNew, lambda, alpha, beta)) nκ + δ 2 κ X
2
X
2 nδ
= exp − (xi − µ0 ) − (xi − x̄) − (x̄ − ν)2 .
[1] 0.987 0.013 δ 2 nκ + δ
[1] 0.95 0.05 i=1 i=1
[1] 0.614 0.386 (7.2)
[1] 0.351 0.649
> ## vary lambda b) As an example, calculate the Bayes factor for the centered alcohol concentration
> for(lambdaNew in c(1, 3, 7, 9)) data using µ0 = 0, ν = 0 and δ = 1/100.
print(modelcomp(nu, lambdaNew, alpha, beta))
[1] 0.166 0.834 ◮ For µ0 = ν = 0, the Bayes factor can be simplified to
[1] 0.519 0.481  1 (  −1 )
nκ + δ 2 nκx̄2 δ
B12 = exp − 1+
[1] 0.954 0.046
.
[1] 0.985 0.015 δ 2 nκ
> ## vary alpha
> for(alphaNew in c(0.2, 0.5, 2, 3))
We choose the parameter κ as the precision estimated from the data: κ = 1/σ̂ 2 .
print(modelcomp(nu, lambda, alphaNew, beta))

> # define the Bayes factor as a function of the data and delta for arbitrary prior distribution on µ, where z = x/σ is standard normal under
> bayesFactor <- function(n, mean, var, delta) model M0 . The expression exp(−1/2z 2 ) is called the minimum Bayes factor
{
kappa <- 1 / var (Goodman, 1999).
logbayesFactor <- 1/2 * (log(n * kappa + delta) - log(delta)) - ◮ We denote the unknown prior distribution of µ by f (µ). Then, the Bayes
n * kappa * mean^2 / 2 *
(1 + delta / (n * kappa))^{-1} factor BF01 can be expressed as
exp(logbayesFactor)
f (x | M0 ) f (x | M0 )
} BF01 = = R . (7.3)
> # centered alcohol concentration data f (x | M1 ) f (x | µ)f (µ) dµ
> n <- 185
> mean <- 0 The model M0 has no free parameters, so its marginal likelihood is the usual
> sd <- 237.8
likelihood  
1
> # compute the Bayes factor for the alcohol data
1
> bayesFactor(n, mean, sd^2, delta = 1/100) f (x | M0 ) = (2πσ 2 )− 2 exp − z 2 ,
[1] 1.15202 2

According to the Bayes factor, model M1 is more likely than model M2 , i. e. the for z = x/σ. To find a lower bound for BF01 , we have to maximise the integral
mean transformation factor does not differ from 0, as expected. in the denomiator in (7.3). Note that the density f (µ) averages over the values
c) Show that the Bayes factor tends to ∞ for δ → 0 irrespective of the data and of the likelihood function f (x | µ). Hence, it is intuitively clear that the integral
the sample size n. is maximized if we keep the density constant at its maximum value, which is
◮ The claim easily follows from Equation (7.2) since for δ → 0, the expression reached at the MLE µ̂ML = x. We thus obtain
in the exponential converges to
f (x | M1 ) ≤ f (x | µ̂ML )
n n
! (  2 )
κ X 2
X
2 1 x − µ̂ML
− (xi − µ0 ) − (xi − x̄) 1
= (2πσ 2 )− 2 exp −
2 2 σ
i=1 i=1
1
p = (2πσ 2 )− 2 ,
and the factor (nκ + δ)/δ diverges to ∞.
For the alcohol concentration data, we can obtain a large Bayes factor by using
which implies  
an extremely small δ: f (x | M0 ) 1
B01 = ≥ exp − z 2 .
> bayesFactor(n, mean, sd^2, f (x | M1 ) 2
delta = 10^{-30})
[1] 5.71971e+13 b) Calculate for selected values of z the two-sided P -value 2{1 − Φ(|z|)}, the mini-
mum Bayes factor and the corresponding posterior probability of M0 , assuming
This is an example of Lindley’s paradox.
equal prior probabilities Pr(M0 ) = Pr(M1 ) = 1/2. Compare the results.
(One can deduce in a similar way that for µ0 = ν = 0, the Bayes factor converges

to 1 as δ → ∞, i. e. the two models become equally likely.)
> ## minimum Bayes factor:
5. In order to compare the models > mbf <- function(z)
exp(-1/2 * z^2)
M0 : X ∼ N(0, σ 2 ) > ## use these values for z:
> zgrid <- seq(0, 5, length = 101)
and M1 : X ∼ N(µ, σ 2 ) > ##
> ## compute the P-values, the values of the minimum Bayes factor and
with known σ 2 we calculate the Bayes factor BF01 . > ## the corresponding posterrior probability of M_0
> ## note that under equal proir probabilities for the models,
a) Show that   > ## the posterior odds equals the Bayes factor
1 > pvalues <- 2 * (1 - pnorm(zgrid))
BF01 ≥ exp − z 2 ,
2 > mbfvalues <- mbf(zgrid)
> postprob.M_0 <- mbfvalues/(1 + mbfvalues)
> ##

> ## plot the obtained values for some prior density f (θ) for θ.
> matplot(zgrid, cbind(pvalues, mbfvalues, postprob.M_0), type = "l", ◮ We have
xlab = expression(z), ylab = "values")
> legend("topright", legend = c("P-values from Wald test", "Minimum Bayes factor", "Posterior f (p | M0 )
col = 1:3, lty = 1:3, bty = "n") BF(p) =
> ## comparisons: f (p | M1 )
> all(pvalues <= mbfvalues) 1
= R1
[1] TRUE
0
B(θ, 1)−1 pθ−1 f (θ) dθ
> zgrid[pvalues == mbfvalues]  1 −1
[1] 0 Z 
1.0
= θp θ−1
f (θ)dθ ,
P−values from Wald test  
Minimum Bayes factor 0
Posterior prob. of M_0
0.8
since
Γ(θ) Γ(θ)
0.6 B(θ, 1) = = = 1/θ.
Γ(θ + 1) θ Γ(θ)
values

0.4 To see that Γ(θ + 1) = θ Γ(θ), we can use integration by parts:

0.2 Z∞
Γ(θ + 1) = tθ exp(−t) dt
0.0
0
0 1 2 3 4 5 Z∞
 
z = −tθ exp(−t) | ∞
0 + θ tθ−1 exp(−t) dt
0
Thus, the P-values from the Wald test are smaller than or equal to the minimum
Z∞
Bayes factors for all considered values of z. Equality holds for z = 0 only and
=θ tθ−1 exp(−t) dt = θ Γ(θ).
for z > 3, the P-values and the minimum Bayes factors are very similar.
0
6. Consider the models
b) Show that the minimum Bayes factor mBF over all prior densities f (θ) has the
M0 : p ∼ U(0, 1) form (
−e p log p for p < e−1 ,
and M1 : p ∼ Be(θ, 1) mBF(p) =
1 otherwise,
where 0 < θ < 1. This scenario aims to reflect the distribution of a two-sided
where e = exp(1) is Euler’s number.
P -value p under the null hypothesis (M0 ) and some alternative hypothesis (M1 ),
◮ We have
where smaller P -values are more likely (Sellke et al., 2001). This is captured by
 1 −1
the decreasing density of the Be(θ, 1) for 0 < θ < 1. Note that the data are now Z  −1
represented by the P -value. mBF(p) = min  θpθ−1 f (θ) dθ  = max θpθ−1 ,
f density θ∈[0,1]
a) Show that the Bayes factor for M0 versus M1 is 0

 1 −1 where the last equality is due to the fact that the above integral is maximum if
Z  the density f (θ) is chosen as a point mass at the value of θ which maximises
BF(p) = θ−1
θp f (θ)dθ
  θpθ−1 .
0
We now consider the function g(θ) = θpθ−1 to determine its maximum. For p <
1/e, the function g has a unique maximum in (0, 1). For p ≥ 1/e, the function
g is strictly monotoically increasing on [0,1] and thus attains its maximum at
θ = 1 (compare to the figure below).

1.0 c) Compute and interpret the minimum Bayes factor for selected values of p
1.5
(e.g. p = 0.05, p = 0.01, p = 0.001).
0.8

1.0 0.6
g(θ) > ## minimum Bayes factor:

g(θ)
> mbf <- function(p)
0.4 { if(p < 1/exp(1))
0.5 { - exp(1)*p*log(p) }
0.2 else
{ 1 }
p=0.1 p=0.5 }
0.0 0.0
> ## use these values for p:
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 > p <- c(0.05, 0.01, 0.001)
θ θ > ## compute the corresponding minimum Bayes factors
> minbf <- numeric(length =3)
We now derive the maxima described above analytically: > for (i in 1:3)
{minbf[i] <- mbf(p[i])
i. Case p < 1/e: }
We compute the maximum of the function h(θ) = log(g(θ)): > minbf
[1] 0.40716223 0.12518150 0.01877723
h(θ) = log(θ) + (θ − 1) log(p), > ratio <- p/minbf
> ratio
d 1
h(θ) = + log(p) and hence [1] 0.12280117 0.07988401 0.05325600
dθ θ > ## note that the minimum Bayes factors are considerably larger
d 1 > ## than the corresponding p-values
h(θ) = 0 ⇒ θ = − .
dθ log(p)
For p = 0.05, we obtain a minimum Bayes factor of approximately 0.4. This
d
It is easy to see that dθ h(θ) > 0 for θ < −(log p)−1 and dθd
h(θ) < 0 for θ > means that given the data p = 0.05, model M0 as at least 40% as likely as model
M1 . If the prior odds of M0 versus M1 is 1, then the posterior odds of M0
−1
−(log p) , so that h and hence also g are strictly monotonically increasing
for θ < −(log p)−1 and strictly monotonically decreasing for θ > −(log p)−1 . versus M1 is at least 0.4. Hence, the data p = 0.05 does not correspond to
Consequently, the maximum of g determined above is unique and we have strong evidence against model M0 . The other minimum Bayes factors have an
 
1 analoguous interpretation.
sup θpθ−1 = g −
θ∈[0,1] log p 7. Box (1980) suggested a method to investigate the compatibility of a prior with the
  
1 1 observed data. The approach is based on computation of a P -value obtained from
=− exp log p − −1 the prior predictive distribution f (x) and the actually observed datum xo . Small
log p log p
1 p-values indicate a prior-data conflict and can be used for prior criticism.
=− ,
log(p)e p Box’s p-value is defined as the probability of obtaining a result with prior predictive
ordinate f (X) equal to or lower than at the actual observation xo :
which implies mBF(p) = −e p log p.
ii. Case p ≥ 1/e: Pr{f (X) ≤ f (xo )},
We have
d
g(θ) = pθ−1 (1 + θ log p) ≥ 0 here X is distributed according to the prior predictive distribution f (x), so f (X)

is a random variable. Suppose both likelihood and prior are normal, i. e. X | µ ∼
for all θ ∈ [0, 1] since log p ≥ −1 in this case. Thus, g is monotonically
N(µ, σ 2 ) and µ ∼ N(ν, τ 2 ). Show that Box’s p-value is the upper tail probability of
increasing on [0, 1] and
a χ2 (1) distribution evaluated at
!−1
mBF(p) = sup θpθ−1 = (g(1))
−1
= 1. (xo − ν)2
.
θ∈[0,1] σ2 + τ 2

◮ We have already derived the prior predictive density for a normal likelihood
8 Numerical methods for Bayesian
with known variance and a normal prior for the mean µ in Exercise 1. By setting
κ = 1/σ 2 , δ = 1/τ 2 and n = 1 in Equation (7.18), we obtain
inference
   
1 1 2
f (x) = exp − (x − ν) ,
2π(τ 2 + σ 2 ) 2(τ 2 + σ 2 )

that is X ∼ N(ν, σ 2 + τ 2 ). Consequently,

X −ν (X − ν)2
√ ∼ N(0, 1) and ∼ χ2 (1) (7.4)
σ2 + τ 2 σ2 + τ 2
1. Let X ∼ Po(eλ) with known e, and assume the prior λ ∼ G(α, β).
(see Table A.2 for the latter fact).
a) Compute the posterior expectation of λ.
Thus, Box’s p-value is
◮ The posterior density is
Pr{f (X) ≤ f (xo )}
     f (λ | x) ∝ f (x | λ)f (λ)
1 (X − ν)2 1 (xo − ν)2
= Pr exp − ≤ exp − ∝ λx exp(−eλ) · λα−1 exp(−βλ)
(2π(σ 2 + τ 2 ))1/2 2(σ 2 + τ 2 ) (2π(σ 2 + τ 2 ))1/2 2(σ 2 + τ 2 ) 
 2 2
 = λ(α+x)−1 exp −(β + e)λ ,
(X − ν) (xo − ν)
= Pr − 2 ≤− 2
σ + τ2 σ + τ2 which is the kernel of the G(α + x, β + e) distribution (compare to Table 6.2).
 2

(X − ν) (xo − ν)2 The posterior expectation is thus
= Pr ≥ 2 .
σ +τ
2 2 σ +τ 2
α+x
E(λ | x) = .
2
Due to (7.4), the latter probability equals the upper tail probability of a χ (1) dis- β+e
tribution evaluated at (xo − ν)2 /(σ 2 + τ 2 ). b) Compute the Laplace approximation of this posterior expectation.
◮ We use approximation (8.6) with g(λ) = λ und n = 1. We have

−k(λ) = log(f (x | λ)) + log(f (λ))


= (α + x − 1) log(λ) − (β + e)λ
and − kg (λ) = log(λ) − k(λ).

The derivatives of these functions are


d(−k(λ)) α+x−1
= − (β + e)
dλ λ
d(−kg (λ)) α+x
bzw. = − (β + e)
dλ λ
with roots
α+x−1 α+x
λ̂ = and λ̂g = .
β+e β+e
The negative curvatures of −k(λ) and −kg (λ) at the above maxima turn out to
be
d2 k(λ̂) (β + e)2 d2 kg (λ̂g ) (β + e)2
κ̂ = = and κ̂g = = ,
dλ2 α+x−1 dλ2 α+x

      which yields the following Laplace approximation of the posterior expectation:

          Ê(λ | x) = √{(α + x)/(α + x − 1)} exp{log(λ̂g) + (α + x − 1) log(λ̂g) − (β + e)λ̂g − (α + x − 1) log(λ̂) + (β + e)λ̂}
                   = √{(α + x)/(α + x − 1)} exp{log(λ̂g) + (α + x − 1) log(λ̂g/λ̂) + (β + e)(λ̂ − λ̂g)}
                   = √{(α + x)/(α + x − 1)} exp{log((α + x)/(β + e)) − (α + x − 1) log((α + x − 1)/(α + x)) − 1}
                   = exp{(α + x + 0.5) log(α + x) − (α + x − 0.5) log(α + x − 1) − log(β + e) − 1}.

   c) For α = 0.5 and β = 0, compare the Laplace approximation with the exact
      value, given the observations x = 11 and e = 3.04, or x = 110 and e = 30.4.
      Also compute the relative error of the Laplace approximation.
      ◮ We first implement the Laplace approximation and the exact formula:
      > ## Laplace approximation of the posterior expectation
      > ## for data x, offset e and prior parameters alpha, beta:
      > laplaceApprox1 <- function(x, e, alpha, beta)
      {
          logRet <- (alpha + x + 0.5) * log(alpha + x) -
              (alpha + x - 0.5) * log(alpha + x - 1) -
              log(beta + e) - 1
          exp(logRet)
      }
      > ## exact calculation of the posterior expectation
      > exact <- function(x, e, alpha, beta)
          (alpha + x) / (beta + e)
      Using the values given above, we obtain
      > (small <- c(exact = exact(11, 3.04, 0.5, 0),
                    approx = laplaceApprox1(11, 3.04, 0.5, 0)))
         exact   approx
      3.782895 3.785504
      > (large <- c(exact = exact(110, 30.4, 0.5, 0),
                    approx = laplaceApprox1(110, 30.4, 0.5, 0)))
         exact   approx
      3.634868 3.634893
      > ## relative errors:
      > diff(small) / small["exact"]
            approx
      0.0006897981
      > diff(large) / large["exact"]
           approx
      6.887162e-06
      For a fixed ratio of observed value x and offset e, the Laplace approximation
      thus improves for larger values of x and e. If we consider the ratio of the
      Laplace approximation and the exact value

          Ê(λ | x)/E(λ | x) = exp{(α + x − 0.5)(log(α + x) − log(α + x − 1)) − 1}
                            = {1 + 1/(α + x − 1)}^{α+x−0.5} / exp(1),

      it is not hard to see that this ratio converges to 1 as x → ∞.
   d) Now consider θ = log(λ). First derive the posterior density function using the
      change of variables formula (A.11). Second, compute the Laplace approximation
      of the posterior expectation of θ and compare again with the exact value which
      you have obtained by numerical integration using the R-function integrate.
      ◮ The posterior density is

          fθ(θ | x) = fλ(g⁻¹(θ) | x) · d g⁻¹(θ)/dθ
                    = {(β + e)^{α+x}/Γ(α + x)} exp(θ)^{α+x−1} exp{−(β + e) exp(θ)} · exp(θ)
                    = {(β + e)^{α+x}/Γ(α + x)} exp{(α + x)θ − (β + e) exp(θ)},

      which does not correspond to any well-known distribution.
      > ## posterior density of theta = log(lambda):
      > thetaDens <- function(theta, x, e, alpha, beta, log = FALSE)
      {
          logRet <- (alpha + x) * (theta + log(beta + e)) -
              (beta + e) * exp(theta) - lgamma(alpha + x)
          if(log)
              return(logRet)
          else
              return(exp(logRet))
      }
      > # check by simulation if the density is correct:
      > x <- 110
      > e <- 30.4
      > alpha <- 0.5
      > beta <- 0
      > ## draw histogram of a sample from the distribution of log(theta)
      > set.seed(59)
      > thetaSamples <- log(rgamma(1e+5, alpha + x, beta + e))
      > histResult <- hist(thetaSamples, prob= TRUE, breaks = 50,
                           xlab = expression(theta), main = "")
      > ## plot the computed density
      > thetaGrid <- seq(from = min(histResult$breaks),
                         to = max(histResult$breaks), length = 101)
      > lines(thetaGrid, thetaDens(thetaGrid, x, e, alpha, beta))
      > ## looks correct!
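      A small numerical aside (not part of the original solution): the convergence of the
      ratio from part (c) can be checked directly in R; ratioLaplace is a name introduced here.
      > ## ratio of Laplace approximation and exact posterior expectation from part (c)
      > ratioLaplace <- function(x, alpha = 0.5)
            (1 + 1 / (alpha + x - 1))^(alpha + x - 0.5) / exp(1)
      > ## approaches 1 as x grows: about 1.00069 for x = 11 and 1.0000069 for x = 110,
      > ## matching the relative errors computed above
      > ratioLaplace(c(11, 110, 1100))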

4
and the second-order derivative is
d2 k(θ)
= (β + e) exp(θ),
3 dθ 2

Density
which yields the following curvature of k(θ) at its minimum:
2
d2 k(θ̂)
κ̂ = = α + x.
1 dθ 2

0 > # numerical computation of Laplace approximation


> # for the posterior expectation of theta=log(lambda)
0.8 1.0 1.2 1.4 1.6
> numLaplace <- function(x, e, alpha, beta)
θ {
# first implement the formulas calculated above
We first reparametrise the likelihood and the prior density. The transformation minus.k <- function(theta)
{
is h(λ) = log(λ). Since the likelihood is invariant with respect to one-to-one + (alpha+x)*theta - (beta +e)*exp(theta)
parameter transformations, we can just plug in h−1 (θ) = exp(θ) in place of λ: }
# location of maximum of -k(theta)
1
f (x | h−1 (θ)) = exp(θ)α−1 exp(−e exp(θ)). theta.hat <- log(alpha+x) - log(beta+e)
x! # curvature of -k(theta) at maximum
kappa <- alpha+x
To transform the prior density, we apply the change-of-variables formula to get
# function to be maximised
αβ
f (θ) = exp(θ)α exp(−β exp(θ)). minus.kg <- function(theta)
Γ(α) {
log(theta) + (alpha+x)*theta - (beta +e)*exp(theta)
Thus, }
  # numerical optimisation to find the maximum of -kg(theta)
−k(θ) = log f x | h−1 (θ) + log {f (θ)} optimObj <- optim(par=1, fn=minus.kg, method="BFGS",
control = list(fnscale=-1), hessian=TRUE)
= (α + x)θ − (β + e) exp(θ) + const and # location of maximum of -kg(theta)

−kg (θ) = log h−1 (θ) − k(θ) thetag.hat <- optimObj$par
# curvature at maximum
= log(θ) + (α + x)θ − (β + e) exp(θ) + const. kappa.g <- - optimObj$hessian
# Laplace approximation
with g = id. The derivatives are approx <- sqrt(kappa/kappa.g) * exp(minus.kg(thetag.hat)
- minus.k(theta.hat))
dk(θ) return(approx)
− = α + x − (β + e) exp(θ) and }
dθ > # numerical integration
dkg (θ) 1
− = + (α + x) − (β + e) exp(θ). > numInt <- function(x, e, alpha, beta)
dθ θ {
integrand <- function(theta, x, e, alpha, beta)
The root of k(θ) is thus {
  theta* thetaDens(theta, x, e, alpha, beta)
α+x
θ̂ = log = log(α + x) − log(β + e), }
β+e numInt <- integrate(integrand, x, e, alpha, beta, lower = -Inf,
upper = Inf, rel.tol = sqrt(.Machine$double.eps))
but the root of kg (θ) cannot be computed analytically. We therefore proceed return(numInt$value)
with numerical maximisation for kg (θ) below. First, we complete the analytical }
> # compute the exact and approximated values
calculations for k(θ): > # use the data from part (c)
> (small <- c(exact = numInt(11, 3.04, 0.5, 0),
−k(θ̂) = (α + x)(log(α + x) − log(β + e) − 1) approx = numLaplace(11, 3.04, 0.5, 0)))

exact approx with roots


1.286382 1.293707  
α+x
θ̂ = log = log(α + x) − log(β + e),
> (large <- c(exact = numInt(110, 30.4, 0.5, 0),
and
approx = numLaplace(110, 30.4, 0.5, 0))) β+e
exact approx  
α+x+1
1.286041 1.286139 θˆg = log .
> ## relative errors: β+e
> diff(small) / small["exact"]
approx The negative curvatures of −k(θ) and −kg (θ) at the above maxima turn out to
0.005694469 be
> diff(large) / large["exact"] d2 k(θ̂) d2 kg (θˆg )
approx κ̂ = =α+x and κˆg = = α + x + 1.
7.602533e-05 dθ 2 dθ 2
Combining these results yields the Laplace approximation
The posterior expectations obtained by the Laplace approximation are close to
 
the exact values. As observed before for λ, the Laplace approximation is more Ê exp(θ) | x = exp (α+x+0.5) log(α+x+1)−(α+x−0.5) log(α+x)−log(β+e)−1 ,
accurate for larger values of e and x if the ratio of e and x is kept fixed.
e) Additional exercise: Compute the Laplace approximation of the posterior ex- a very similar formula as in Exercise 1b). Note that these are two different
pectation of exp(θ) and compare it with the Laplace approximation in 1c) and approximations for the same posterior expectation since exp(θ) = λ. Here, the
with the exact value obtained by numerical integration. ratio of the approximated value and the true value is
◮ The support of the parameter θ = g(λ) = log(λ) is now the whole real line,  α+x+0.5 .
Ê(exp(θ) | x) 1
which may lead to an improved Laplace approximation. To calculate the Laplace = 1+ exp(1),
E(exp(θ) | x) α+x
approximation of the posterior expectation of exp(θ), we first reparametrise the
likelihood and the prior density. The transformation is h(λ) = log(λ). Since the that is one step in x closer to the limit 1 than the approximation in 1b). The
likelihood is invariant with respect to one-to-one parameter transformations, we approximation derived here is hence a bit more accurate than the one in 1b).
can just plug in h−1 (θ) = exp(θ) in place of λ: We now compare the two Laplace approximations with the values obtained by
numerical integration using the data from 1c):
1
f (x | h−1 (θ)) = exp(θ)α−1 exp(−e exp(θ)). > ## Laplace approximation of E(exp(theta))
x! > laplaceApprox2 <- function(x, e, alpha, beta)
To transform the prior density, we apply the change-of-variables formula to get {
logRet <- (alpha + x + 0.5) * log(alpha + x + 1) -
αβ (alpha + x - 0.5) * log(alpha + x) -
f (θ) = exp(θ)α exp(−β exp(θ)). log(beta + e) - 1
Γ(α) exp(logRet)
}
Hence, we have > ## numerical approximation
  > numApprox <- function(x, e, alpha, beta)
−k(θ) = log f x | h−1 (θ) + log {f (θ)} {
integrand <- function(theta)
= (α + x)θ − (β + e) exp(θ) + const and ## important: add on log scale first and exponentiate afterwards

−kg (θ) = log h−1 (θ) − k(θ) ## to avoid numerical problems
exp(theta + thetaDens(theta, x, e, alpha, beta, log = TRUE))
= (α + x + 1)θ − (β + e) exp(θ) + const.
intRes <- integrate(integrand, lower = -Inf, upper = Inf,
with g(θ) = exp(θ). The derivatives are rel.tol = sqrt(.Machine$double.eps))
if(intRes$message == "OK")
return(intRes$value)
dk(θ)
− = α + x − (β + e) exp(θ) and else
dθ return(NA)
dkg (θ) }
− = α + x + 1 − (β + e) exp(θ). > ## comparison of the three methods:


> (small <- c(exact = exact(11, 3.04, 0.5, 0), with g = id. The derivatives are
approx1 = laplaceApprox1(11, 3.04, 0.5, 0),
approx2 = laplaceApprox2(11, 3.04, 0.5, 0), dk(φ) 2x cos(φ) 2(n − x) sin(φ)
− = −
approx3 = numApprox(11, 3.04, 0.5, 0))) dφ sin(φ) cos(φ)
exact approx1 approx2 approx3  
x 2(x − n sin2 (φ))
3.782895 3.785504 3.785087 3.782895
=2 − (n − x) tan(φ) = and
> (large <- c(exact = exact(110, 30.4, 0.5, 0), tan(φ) sin(φ) cos(φ)
dkg (φ)
approx = laplaceApprox1(110, 30.4, 0.5, 0),
dk(π) 2 cos(φ)
approx2 = laplaceApprox2(110, 30.4, 0.5, 0), − =− +
approx3 = numApprox(110, 30.4, 0.5, 0))) dφ dφ sin(φ)
 
exact approx approx2 approx3 x+1
3.634868 3.634893 3.634893 3.634868 =2 − (n − x) tan(φ)
> ## relative errors:
tan(φ)
> (small[2:4] - small["exact"]) / small["exact"] 2{x + 1 − (n + 1) sin2 (φ)}
= .
approx1 approx2 approx3 sin(φ) cos(φ)
6.897981e-04 5.794751e-04 -1.643516e-15
> (large[2:4] - large["exact"]) / large["exact"] The different expressions for the derivatives will be useful for calculating the roots
approx approx2 approx3
6.887162e-06 6.763625e-06 -4.105072e-14 and the second-order derivatives, respectively. From the last expressions, we easily
obtain the roots
Numerical integration using integrate thus gives even more accurate results r !
r 
x x+1
φˆg = arcsin
than the two Laplace approximations in this setting.
φ̂ = arcsin and .
n n+1
2. In Example 8.3, derive the Laplace approximation (8.9) for the posterior expec-
tation of π using the variance stabilising transformation.
Exploiting the relation cos(arcsin(x)) = (1 − x2 )1/2 gives
◮ As mentioned in Example 8.3, the variance stabilising transformation is
√ ( r  r !)
φ = h(π) = arcsin( π) and its inverse is h−1 (φ) = sin2 (φ). The relation x n−x
sin2 (φ) + cos2 (φ) = 1 will be used several times in the following. −k(φ̂) = 2 x log + (n − x) log and
n n
We first reparametrise the likelihood and the prior density:   ( r ! r )
  x+1 x+1 n−x
−kg (φ̂) = log + 2 x log + (n − x) log .
f (x | h−1 (φ)) =
n
sin(φ)2x (1 − sin2 (φ))n−x n+1 n+1 n+1
x
 
n By using for example
= sin(φ)2x cos(φ)2(n−x) d tan(φ) 1
x = ,
dφ cos2 (φ)
and applying the change-of-variables formula gives
we obtain the second-order derivatives
− 12  
f (φ) = B(0.5, 0.5)−1 {sin2 (φ)(1 − sin2 (φ))} 2 sin(φ) cos(φ) d2 k(φ) x n−x
= 2 − and
= 2 B(0.5, 0.5)−1
, dφ2 sin2 (φ) cos2 (φ)
2
 
d kg (φ) x+1 n−x
= 2 − ,
i. e. the transformed density is constant. Thus, dφ2 sin2 (φ) cos2 (φ)
 
−k(φ) = log f x | h−1 (φ) + log {f (φ)} which yields the following curvatures of k(φ) and kg (φ) at their minima:
= 2x log {sin(φ)} + 2(n − x) log {cos(φ)} + const and
 d2 k(φ̂)
−kg (φ) = log h−1 (φ) − k(φ) κ̂ = = 4n and
dφ2

= log sin2 (φ) + 2x log {sin(φ)} + 2(n − x) log {cos(φ)} + const. d2 k(φ̂)
κˆg = = 4(n + 1).
dφ2

Combining the above results, we obtain prob = 0.95 # based on samples


s ) # for the probability prob
  {
κ̂
Ê2 (π | x) = exp −{kg (θ̂g ) − k(θ̂)} ## sort the sample "samples"
κ̂g M <- length(samples)
r level <- 1 - prob
n  
= exp (x + 1) log(x + 1) − (n + 1) log(n + 1) − x log(x) + n log(n) samplesorder <- samples[order(samples)]
n+1
(x + 1)x+1 nn+0.5
## determine and return the smallest interval
= . ## containing at least 95% of the values in "samples"
xx (n + 1)n+3/2 max.size <- round(M * level)
size <- rep(NA, max.size)
for(i in 1:max.size){
3. For estimating the odds ratio θ from Example 5.8 we will now use Bayesian infer- lower <- samplesorder[i]
ence. We assume independent Be(0.5, 0.5) distributions as priors for the probabil- upper <- samplesorder[M-max.size+i]
size[i] <- upper - lower
ities π1 and π2 . }
size.min <- which.min(size)
a) Compute the posterior distributions of π1 and π2 for the data given in Table 3.1. HPD.lower <- samplesorder[size.min]
Simulate samples from these posterior distributions and transform them into HPD.upper <- samplesorder[M-max.size+size.min]
samples from the posterior distributions of θ and ψ = log(θ). Use the samples to return(c(lower = HPD.lower, upper = HPD.upper))
}
compute Monte Carlo estimates of the posterior expectations, medians, equi- > ## compute the Monte Carlo estimates:
tailed credible intervals and HPD intervals for θ and ψ. Compare with the > ## estimates for theta
> quantile(theta, prob = c(0.025, 0.5, 0.975)) # equi-tailed CI with median
results from likelihood inference in Example 5.8.
2.5% 50% 97.5%
◮ By Example 6.3, the posterior distributions are 0.6597731 2.8296129 16.8312942
> mean(theta) # mean
π1 | x1 ∼ Be(0.5 + x1 , 0.5 + n1 − x1 ) = Be(6.5, 102.5) and [1] 4.338513
π2 | x2 ∼ Be(0.5 + x2 , 0.5 + n2 − x2 ) = Be(2.5, 101.5).
> (thetaHpd <- hpd(theta)) # HPD interval
lower upper
0.2461746 12.2797998
We now generate random numbers from these two distributions and transform > ##
them to > ## estimates for psi
π1 /(1 − π1 ) > quantile(psi, prob = c(0.025, 0.5, 0.975)) # equi-tailed CI with median
θ=
π2 /(1 − π2 ) 2.5% 50% 97.5%
-0.4158593 1.0401399 2.8232399
and ψ = log(θ), respectively, to obtain the desired Monte Carlo estimates: > mean(psi) # mean
> ## data from Table 3.1: [1] 1.082216
> x <- c(6, 2) > (psiHpd <- hpd(psi)) # HPD interval
> n <- c(108, 103) lower upper
> ## simulate from the posterior distributions of pi1 and pi2 -0.5111745 2.7162842
> size <- 1e+5
> set.seed(89) The Bayesian point estimate E(ψ | x) ≈ 1.082 is slightly smaller than the
> pi1 <- rbeta(size, 0.5 + x[1], 0.5 + n[1] - x[1])
MLE ψ̂ML ≈ 1.089. The HPD interval [−0.511, 2.716] is similar to the
> pi2 <- rbeta(size, 0.5 + x[2], 0.5 + n[2] - x[2])
> ## transform the samples Wald confidence interval [−0.54, 2.71], whereas the equi-tailed credible inter-
> theta <- (pi1 / (1 - pi1)) / (pi2 / (1 - pi2)) val [−0.416, 2.823] is more similar to the profile likelihood confidence interval
> psi <- log(theta)
> ## [−0.41, 3.02]. (Also compare to Exercise 4 in Chapter 5.)
> ## function for HPD interval calculation (see code in Example 8.9) b) Try to compute the posterior densities of θ and ψ analytically. Use the density
> ## approximate unimodality of the density is assumed!
functions to numerically compute the posterior expectations and HPD inter-
> ## (HPD intervals consisting of several disjoint intervals
> ## cannot be found)
> hpd <- function( # returns a MC estimate
samples, # of the HPD interval

vals. Compare with the Monte Carlo estimates from 3a). > ## density of theta= pi_1/(1-pi_1)* (1-pi_2)/pi_2
◮ To find the density of the odds ratio > ## for pij ~ Be(a[j], b[j])
> thetaDens <- function(theta, a, b, log = FALSE)
π1 1 − π2 {
θ= · , logRet <- theta
1 − π1 π2 ## compute the value of the density function
we first derive the density of γ = π/(1 − π) for π ∼ Be(a, b): Similarly to
integrand <- function(gamma, theta)
{
Exercise 2 in Chapter 6, we apply the change-of-variables formula with trans- ## use built-in distributions if possible
formation g(π) = π/(1 − π), inverse function g −1 (γ) = γ/(1 + γ) and derivative logRet <-
lbeta(a[1],b[1]+2) + lbeta(b[2],a[2]+2) -
d −1 1 lbeta(a[1],b[1]) - lbeta(a[2],b[2]) +
g (γ) = dbeta(gamma/(gamma+1), a[1], b[1]+2, log=TRUE) +
dγ (1 + γ)2 dbeta(theta/(gamma+theta), b[2], a[2]+2, log=TRUE) -
log(abs(gamma))
to get exp(logRet)
 a−1 
b−1 }
1 γ γ 1
fγ (γ) = 1− for(i in seq_along(theta)){
B(a, b) γ+1 γ+1 (γ + 1)2 ## if the integration worked, save the result
 a−1  b+1 intRes <- integrate(integrand, lower = 0, upper = 1,
1 γ 1 theta = theta[i],
= . (8.1)
B(a, b) γ + 1 γ+1 stop.on.error = FALSE, rel.tol = 1e-6,
subdivisions = 200)
Since π ∼ Be(a, b) implies 1 − π ∼ Be(b, a), (1 − π)/π also has a density of the if(intRes$message == "OK")
logRet[i] <- log(intRes$value)
form (8.1) with the roles of a and b interchanged. To obtain the density of θ, else
we use the following result: logRet[i] <- NA
}
If X and Y are two independent random variables with density fX and fY , ## return the vector of results
respectively, then the density of Z = X · Y is given by if(log)
return(logRet)
Z∞ z 1 else
fZ (z) = fX (x)fY dx. return(exp(logRet))
x |x| }
−∞ > ## test the function using the simulated data:
> histRes <- hist(theta, prob = TRUE, xlim=c(0,25), breaks=1000,
Setting γ1 = π1 /(1 − π1 ) and γ2∗ = (1 − π2 )/π2 , we get main = "", xlab = expression(theta))
> thetaGrid <- seq(from = 0, to = 25,
Z∞
1 length = 501)
fθ (θ) = fγ1 (γ)fγ2∗ (γ) dγ > lines(thetaGrid, thetaDens(thetaGrid, 0.5 + x, 0.5 + n - x))
|γ|
−∞
0.25
Z∞  a1 −1  b1 +1
1 γ 1
= 0.20
B(a1 , b1 )B(a2 , b2 ) γ+1 γ+1
−∞
 b2 −1  a2 +1 0.15

Density
θ γ 1
· dγ.
γ+θ γ+θ |γ| 0.10

where 0.05

a1 = x1 + 0.5, b1 = n1 − x1 + 0.5, 0.00

a2 = x2 + 0.5, b2 = n2 − x2 + 0.5. 0 5 10 15 20 25

θ
We use numerical integration to compute the above integral:

The log odds ratio ψ is the difference of the two independent log odds φi , i = 1, 2: Now let φi = logit(πi ) for πi ∼ Be(ai , bi ). As φ1 and φ2 are independent, the
  density of ψ = φ1 − φ2 can be calculated by applying the convolution theorem:
π1 /(1 − π1 )
ψ = log = logit(π1 ) − logit(π2 ) = φ1 − φ2 . Z
π2 /(1 − π2 ) fψ (ψ) = f1 (ψ + φ2 )f2 (φ2 ) dφ2
To compute the density of ψ, we therefore first compute the density of φ = Z∞
g(π) = logit(π) assuming π ∼ Be(a, b). Since 1 exp(ψ + φ2 )a1 exp(φ2 )a2
= a1 +b1 a2 +b2 dφ2 .
B(a1 , b1 )B(a2 , b2 ) 1 + exp(ψ + φ2 ) 1 + exp(φ2 )
exp(φ) d −1  −∞
g −1 (φ) = and g (φ) = g −1 (φ) 1 − g −1 (φ) ,
1 + exp(φ) dφ By substituting π = g −1
(φ2 ), the above integral can be expressed as
applying the change-of-variables formula gives Z1
a1   b1
1 exp(φ)a g −1 g(π) + ψ 1 − g −1 g(π) + ψ π a2 −1 (1 − π)b2 −1 dπ,
f (φ) =  .
B(a, b) 1 + exp(φ) a+b 0

which cannot be simplified further. Thus, the density of ψ has to be computed


We can verify this result for φ1 by using the simulated random numbers from by numerical integration:
part (a): > ## density of psi = logit(pi1) - logit(pi2) for pij ~ Be(a[j], b[j])
> psiDens <- function(psi, a, b, log = FALSE)
> ## density of phi = logit(pi) for pi ~ Be(a, b) {
> phiDens <- function(phi, a, b) logRet <- psi
{ ## compute the value of the density function
pi <- plogis(phi) integrand <- function(pi, psi)
logRet <- a * log(pi) + b * log(1 - pi) - lbeta(a, b) {
return(exp(logRet)) ## use built-in distributions if possible
} logRet <-
> ## simulated histogram a[1] * plogis(qlogis(pi) + psi, log.p = TRUE) +
> histRes <- hist(qlogis(pi1), prob = TRUE, breaks = 50, b[1] * plogis(qlogis(pi) + psi, lower = FALSE, log.p = TRUE) -
xlab = expression(phi[1]), main = "") lbeta(a[1], b[1]) +
> ## analytic density function dbeta(pi, a[2], b[2], log = TRUE)
> phiGrid <- seq(from = min(histRes$breaks), to = max(histRes$breaks), exp(logRet)
length = 101) }
> lines(phiGrid, phiDens(phiGrid, 0.5 + x[1], 0.5 + n[1] - x[1])) for(i in seq_along(psi)){
## if the integration worked, save the result
intRes <- integrate(integrand, lower = 0, upper = 1, psi = psi[i],
0.8 stop.on.error = FALSE, rel.tol = 1e-6,
subdivisions = 200)
if(intRes$message == "OK")
0.6
logRet[i] <- log(intRes$value)
Density

else
0.4 logRet[i] <- NA
}
0.2 ## return the vector of results
if(log)
return(logRet)
0.0
else
−5 −4 −3 −2 return(exp(logRet))
}
φ1
> ## test the function using the simulated data:
> histRes <- hist(psi, prob = TRUE, breaks = 50,
main = "", xlab = expression(psi))
> psiGrid <- seq(from = min(histRes$breaks), to = max(histRes$breaks),
length = 201)
> lines(psiGrid, psiDens(psiGrid, 0.5 + x, 0.5 + n - x))

0.5 intersec.right <- uniroot(function(x) thetaDens(x, a, b) - h,


interval = c(mode, 100))$root
0.4 ## probability mass outside of the points of intersection
p.left <- integrate(thetaDens, lower = -Inf, upper = intersec.left,
0.3 a = a, b = b)$value
Density
p.right <- integrate(thetaDens, lower = intersec.right, upper = Inf,
a = a, b = b)$value
0.2
## return this probability and the points of intersection
return(c(prob = p.left + p.right,
0.1 lower = intersec.left, upper = intersec.right))
}
0.0 > ## determine the optimal h: want to have 5% of probability mass outside
> result <- uniroot(function(h) outerdensTheta(h)[1] - 0.05,
−2 0 2 4 6
interval = c(0.001, 0.2))
ψ > height <- result[["root"]]
> ## this numerical optimisation gives
The problem here was at first that the integration routine did not converge in > (thetaHpdNum <- outerdensTheta(height)[c(2,3)])
the range of ψ ≈ 4.7 for the default settings. This problem could be solved lower upper
by changing the relative tolerance (rel.tol) and the number of subdivisions 0.2448303 12.2413591
> ## the Monte Carlo estimate was:
(subdivisions) to different values. To compute the posterior expectation of ψ, > thetaHpd
we use numerical integration again. Note that we need to be able to calculate lower upper
the density f (ψ | x) for each value of ψ to do so! 0.2461746 12.2797998

> ## numerical integration > ## HPD interval computation for psi:


> integrate(function(theta) theta * thetaDens(theta, 0.5 + x, 0.5 + n - x), > ## for given h, the function outerdensPsi returns
lower = -Inf, upper = Inf) > ## the probability mass of all psi values having density smaller than h
4.333333 with absolute error < 1.1e-06 > ## (a, b are as for the function psiDens)
> integrate(function(psi) psi * psiDens(psi, 0.5 + x, 0.5 + n - x), > outerdensPsi <- function(h)
lower = -Inf, upper = Inf) {
1.079639 with absolute error < 0.00012 ## only valid for this data!
> ## Monte-Carlo estimate was: mode <- 1 # estimated from graph
> mean(theta) a <- 0.5 + x
[1] 4.338513 b <- 0.5 + n - x
> mean(psi) ## find the points of intersection of h and the density
intersec.left <- uniroot(function(x) psiDens(x, a, b) - h,
[1] 1.082216
interval = c(-2, mode))$root
Thus, the Monte Carlo estimate is close to the value obtained by numerical intersec.right <- uniroot(function(x) psiDens(x, a, b) - h,
interval = c(mode, 100))$root
integration for θ and relatively close for ψ. We now turn to the numerical ## probability mass outside of the points of intersection
calculation of HPD intervals. We write separate functions for θ and φ, which p.left <- integrate(psiDens, lower = -Inf, upper = intersec.left,
a = a, b = b)$value
are tailored to the corresponding histograms: p.right <- integrate(psiDens, lower = intersec.right, upper = Inf,
> ## HPD interval computation for theta: a = a, b = b)$value
> ## for given h, the function outerdensTheta returns ## return this probability and the points of intersection
> ## the probability mass of all theta values having density smaller than h return(c(prob = p.left + p.right,
> ## (a, b are as for the function thetaDens) lower = intersec.left, upper = intersec.right))
> outerdensTheta <- function(h) }
{ > ## determine the optimal h: want to have 5% of probability mass outside
## only valid for this data! > result <- uniroot(function(h) outerdensPsi(h)[1] - 0.05,
mode <- 2 # estimated from graph interval = c(0.001, 0.4))
a <- 0.5 + x > height <- result[["root"]]
b <- 0.5 + n - x > ## this numerical optimisation gives
## find the points of intersection of h and the density > (psiHpdNum <- outerdensPsi(height)[c(2,3)])
intersec.left <- uniroot(function(x) thetaDens(x, a, b) - h, lower upper
interval = c(-2, mode))$root -0.4898317 2.7333078

> ## the Monte Carlo estimate was: which corresponds to the normal model for a random sample. Using the last
> psiHpd equation in Example 6.8 on page 182 with xi replaced by ψi and σ 2 by τ 2 yields
lower upper
-0.5111745 2.7162842 the full conditional distribution
 −1    −1 !
The Monte Carlo estimates of the HPD intervals are also close to the HPD n 1 nψ̄ 0 n 1
ν | ψ, τ 2 , D ∼ N + + , + .
intervals obtained by numerical methods. The above calculations illustrate that τ2 10 τ2 10 τ2 10
Monte Carlo estimation is considerably easier (e. g. we do not have to calculate
We further have
any densities!) than the corresponding numerical methods and does not require
any tuning of integration routines as the numerical methods do. n
Y
f (τ 2 | ψ, ν, D) ∝ f (ψi | ν, τ 2 ) · f (τ 2 )
4. In this exercise we will estimate a Bayesian hierarchical model with MCMC meth- i=1
ods. Consider Example 6.31, where we had the following model:   
(ψi − ν)2
Yn
1
∝ (τ 2 )− 2 exp − τ 2 · (τ 2 )−(1+1) exp(−1/τ 2 )
 2
ψ̂i | ψi ∼ N ψi , σi2 , i=1
  Pn  
ψi | ν, τ ∼ N ν, τ 2 , (ψi − ν)2 + 2
= (τ 2 )−( 2 +1) exp − i=1 τ2 ,
n+2

2
where we assume that the empirical log odds ratios ψ̂i and corresponding variances
that is  Pn 
σi2 := 1/ai + 1/bi + 1/ci + 1/di are known for all studies i = 1, . . . , n. Instead of − ν)2 + 2
n+2 i=1 (ψi
empirical Bayes estimation of the hyper-parameters ν and τ 2 , we here proceed in τ 2 | ψ, ν, D ∼ IG , .
2 2
a fully Bayesian way by assuming hyper-priors for them. We choose ν ∼ N(0, 10)
and τ 2 ∼ IG(1, 1). b) Implement a Gibbs sampler to simulate from the corresponding posterior dis-
a) Derive the full conditional distributions of the unknown parameters ψ1 , . . . , ψn , tributions.
ν and τ 2 . ◮ In the following R code, we iteratively simulate from the full conditional
◮ Let i ∈ {1, . . . , n = 9}. The conditional density of ψi given all other distributions of ν, τ 2 , ψ1 , . . . , ψn :
parameters and the data D (which can be reduced to the empirical log odds > ## the data is
ratios {ψ̂i } and the corresponding variances {σi2 }) is > preeclampsia <- read.table ("../Daten/preeclampsia.txt", header = TRUE)
> preeclampsia
f (ψi | {ψj }j6=i , ν, τ 2 , D) ∝ f (ψ, ν, τ 2 , D) Trial Diuretic Control Preeclampsia
1 Weseley 14 14 yes
∝ f (ψ̂i | ψi )f (ψi | ν, τ 2 ). 2 Weseley 117 122 no
3 Flowers 21 17 yes
4 Flowers 364 117 no
This corresponds to the normal model in Example 6.8: µ is replaced by ψi 5 Menzies 14 24 yes
here, σ 2 by σi2 and x by ψ̂i . We can thus use Equation (6.16) to obtain the full 6 Menzies 43 24 no
conditional distribution 7 Fallis 6 18 yes
8 Fallis 32 22 no
 −1    −1 ! 9 Cuadros 12 35 yes
1 1 ψ̂i ν 1 1
ψi | {ψj }j6=i , ν, τ 2 , D ∼ N + + , + . 10 Cuadros 999 725 no
σi2 τ2 σi2 τ2 σi2 τ2 11 Landesman 138 175 yes
12 Landesman 1232 1161 no
For the population mean ν, we have 13 Krans 15 20 yes
14 Krans 491 504 no
n
Y 15 Tervila 6 2 yes
f (ν | ψ, τ 2 , D) ∝ f (ψi | ν, τ 2 ) · f (ν), 16 Tervila 102 101 no
17 Campbell 65 40 yes
i=1
18 Campbell 88 62 no

> ## functions for calculation of odds ratio and variance }


> ## for a 2x2 table square }
> oddsRatio <- function (square)
(square[1,1] * square[2,2]) / (square[1,2] * square[2,1]) Having generated the Markov chain, one should check for convergence of the
> variance <- function (square)
sum (1 / square) random numbers, e. g. in trace plots. For illustration, we generate such
> ## extract the data we need plots for the random variables ν, τ 2 and ψ1 :
> groups <- split (subset (preeclampsia, select = c (Diuretic, Control)),
preeclampsia[["Trial"]]) # list of 2x2 tables > par(mfrow = c(1, 3))
> (logOddsRatios <- log (sapply (groups, oddsRatio))) # psihat vector > for(j in 1:3){
Campbell Cuadros Fallis Flowers Krans plot(s[j, ], pch = ".",
0.13530539 -1.39102454 -1.47330574 -0.92367084 -0.26154993 xlab = "Iteration", ylab = expression(nu, tau^2, psi[1])[j])
Landesman Menzies Tervila Weseley }
-0.29688945 -1.12214279 1.08875999 0.04184711
> (variances <- sapply (groups, variance)) # sigma^2 vector 12
1 1.0
Campbell Cuadros Fallis Flowers Krans
0.06787728 0.11428507 0.29892677 0.11773684 0.12068745 10

Landesman Menzies Tervila Weseley 0 0.5


0.01463368 0.17801772 0.68637158 0.15960087 8

> n <- length(groups) # number of studies

> ## Gibbs sampler for inference in the fully Bayesian model:
4
> niter <- 1e+5
−0.5
> s <- matrix(nrow = 2 + n, ncol = niter) −2
2
> rownames(s) <- c("nu", "tau2", paste("psi", 1:n, sep = ""))
> psiIndices <- 3:(n + 2)
0 −1.0
> ## set initial values (other values in the domains are also possible)
0e+00 4e+04 8e+04 0e+00 4e+04 8e+04 0e+00 4e+04 8e+04
> s[, 1] <- c(nu = mean(logOddsRatios),
Iteration Iteration Iteration
tau2 = var(logOddsRatios), logOddsRatios)
> set.seed(59)
> ## iteratively update the values The generated Markov chain seems to converge quickly so that a burn-in of
> for(j in 2:niter){ 1000 iterations seems sufficient.
## nu first c) For the data given in Table 1.1, compute 95% credible intervals for ψ1 , . . . , ψn
nuPrecision <- n / s["tau2",j-1] + 1 / 10
and ν. Produce a plot similar to Figure 6.15 and compare with the results
psiSum <- sum(s[psiIndices,j-1]) from the empirical Bayes estimation.
nuMean <- (psiSum / s["tau2",j-1]) / nuPrecision
◮ We use the function hpd written in Exercise 3a) to calculate Monte Carlo
s["nu",j] <- rnorm(1, mean = nuMean, sd = 1 / sqrt(nuPrecision)) estimates of 95% HPD intervals based on samples from the posterior distribu-
tions and produce a plot similar to Figure 6.15 as follows:
## then tau^2
sumPsiNuSquared <- sum( (s[psiIndices,j-1] - s["nu",j])^2 ) > ## remove burn-in
> s <- s[, -(1:1000)]
tau2a <- (n + 2) / 2 > ## estimate the 95 % HPD credible intervals and the posterior expectations
tau2b <- (sumPsiNuSquared + 2) / 2 > (mcmcHpds <- apply(s, 1, hpd))
nu tau2 psi1 psi2 psi3
s["tau2",j] <- 1 / rgamma(1, shape = tau2a, rate = tau2b) lower -1.1110064 0.1420143 -0.4336897 -1.8715310 -2.0651776
upper 0.1010381 1.5214695 0.5484652 -0.6248588 -0.2488816
## finally psi1, ..., psin psi4 psi5 psi6 psi7
for(i in 1:n){ lower -1.4753737 -0.9334114 -0.5345147 -1.7058202
psiiPrecision <- 1 / variances[i] + 1 / s["tau2",j] upper -0.2362761 0.3118408 -0.0660448 -0.2232895
psiiMean <- (logOddsRatios[i] / variances[i] + s["nu",j] / psi8 psi9
s["tau2",j])/psiiPrecision lower -0.9592916 -0.8061968
upper 1.4732138 0.6063260
s[psiIndices[i],j] <- rnorm(1, mean = psiiMean,
> (mcmcExpectations <- rowMeans(s))
sd = 1 / sqrt(psiiPrecision))

nu tau2 psi1 psi2 psi3 scales = list (cex = 1)


-0.50788898 0.68177858 0.05943317 -1.23486599 -1.13470903 )
psi4 psi5 psi6 psi7 psi8 )
-0.85002638 -0.30665961 -0.30249620 -0.96901680 0.22490430 > print (randomEffectsCiPlot)
psi9
-0.08703254
> ## produce the plot
Campbell
> library (lattice)
> ## define a panel function Cuadros

> panel.ci <- function( Fallis


x, # point estimate Flowers
y, # height of the "bar" Krans
lx, # lower, lower limit of the interval Landesman
ux, # upper, upper limit of the interval Menzies
subscripts, # vector of indices Tervila
... # further graphics arguments Weseley
) mean effect size
{
−1.0 −0.5 0.0 0.5 1.0
x <- as.numeric(x)
Log odds ratio
y <- as.numeric(y)

lx <- as.numeric(lx[subscripts]) Compared to the results of the empirical Bayes analysis in Example 6.31, the
ux <- as.numeric(ux[subscripts])
credible intervals are wider in this fully Bayesian analysis. The point estimate
# normal dotplot for point estimates of the mean effect ν is E(ν | D) = −0.508 here, which is similar to the result
panel.dotplot(x, y, lty = 2, ...) ν̂ML = −0.52 obtained with empirical Bayes. However, the credible interval
# draw intervals
panel.arrows(lx, y, ux, y, for ν includes zero here, which is not the case for the empirical Bayes result.
length = 0.1, unit = "native", In addition, shrinkage of the Bayesian point estimates for the single studies
angle = 90, # deviation from line
code = 3, # left and right whisker
towards the mean effect ν is less pronounced here, see for example the Tervila
...) study.
# reference line
panel.abline (v = 0, lty = 2) 5. Let Xi , i = 1, . . . , n denote a random sample from a Po(λ) distribution with
} gamma prior λ ∼ G(α, β) for the mean λ.
> ## labels:
> studyNames <- c (names(groups), "mean effect size") a) Derive closed forms of E(λ | x1:n ) and Var(λ | x1:n ) by computing the posterior
> studyNames <- ordered (studyNames, levels = rev(studyNames)) distribution of λ | x1:n .
> # levels important for order!
> indices <- c(psiIndices, nu = 1) ◮ We have
> ## collect the data in a dataframe
> ciData <- data.frame (low = mcmcHpds["lower", indices], f (λ | x1:n ) = f (x1:n | λ)f (λ)
up = mcmcHpds["upper", indices], n
mid = mcmcExpectations[indices], Y
names = studyNames ∝ (λxi exp(−λ)) λα−1 exp(−βλ)
) i=1
Pn
> ciData[["signif"]] <- with (ciData,
=λ exp(−(β + n)λ),
α+ xi −1
i=1
up < 0 | low > 0)
> ciData[["color"]] <- with (ciData,
ifelse (signif, "black", "gray")) that is λ | x1:n ∼ G(α + nx̄, β + n). Consequently,
> randomEffectsCiPlot <- with (ciData,
dotplot (names ~ mid, α + nx̄
E(λ | x1:n ) = and
panel = panel.ci, β+n
lx = low, ux = up,
α + nx̄
pch = 19, col = color, Var(λ | x1:n ) = .
xlim = c (-1.5, 1.5), (β + n)2
xlab = "Log odds ratio",

b) Approximate E(λ | x1:n ) and Var(λ | x1:n ) by exploiting the asymptotic normal- iii. and Monte Carlo integration.
ity of the posterior (cf . Section 6.6.2). .
◮ We use the following result from Section 6.6.2: ◮
 i. Analogously to part (b), we have
λ | x1:n ∼ N λ̂n , I(λ̂n )−1 ,
a


θ | x1:n ∼ N θ̂n , I(θ̂n )−1 ,
a
where λ̂n denotes the MLE and I(λ̂n )−1 the inverse observed Fisher informa-
tion. We now determine these two quantities for the Poisson likelihood: The where θ̂n denotes the MLE and I(θ̂n )−1 the inverse observed Fisher infor-
log-likelihood is mation. We exploit the invariance of the MLE with respect to one-to-one
n
X
l(x1:n | λ) = xi log(λ) − nλ, transformations to obtain
i=1
Pn E(θ | x1:n ) ≈ θ̂n = log(λ̂n ) = log(x̄) = 2.2925.
which yields the MLE λ̂n = ( i=1 xi )/n = x̄. We further have
Pn To transform the observed Fisher information obtained in part (b), we apply
d2 l(x1:n | λ) i=1 xi nx̄
I(λ) = − = = 2 Result 2.1:
dλ2 λ2 λ
 −2
and thus d exp(θ̂n ) 1
x̄2 x̄ Var(θ | x1:n ) ≈ I(θ̂n )−1 = I(λ̂n )−1 = = 0.0101.
I(λ̂n )−1 = = . dθ nx̄
nx̄ n
Consequently, ii. For the numerical integration, we work with the density of λ = exp(θ) in-
stead of θ to avoid numerical problems. We thus compute the posterior
E(λ | x1:n ) ≈ λ̂n = x̄ and expectation and variance of log(λ). We use the R function integrate to
x̄ compute the integrals numerically:
Var(λ | x1:n ) ≈ I(λ̂n )−1 = .
n > ## given data
> alpha <- beta <- 1 ## parameters of the prior distribution
c) Consider now the log mean θ = log(λ). Use the change of variables for- > n <- 10 ## number of observed values
mula (A.11) to compute the posterior density f (θ | x1:n ). > xbar <- 9.9 ## mean of observed values
> ##
◮ From part (a), we know that the posterior density of λ is > ## function for numerical computation of
> ## posterior expectation and variance of theta=log(lambda)
(β + n)α+nx̄ α+nx̄−1
f (λ | x1:n ) = λ exp(−(β + n)λ). > numInt <- function(alpha, beta, n, xbar)
Γ(α + nx̄) {
## posterior density of lambda
Applying the change of variables formula with transformation function g(y) = lambdaDens <- function(lambda, alpha, beta, n, xbar, log=FALSE)
log(y) gives
{
## parameters of the posterior gamma density
alphapost <- alpha + n*xbar
(β + n)α+nx̄
f (θ | x1:n ) = exp(θ)α+nx̄−1 exp(−(β + n) exp(θ)) exp(θ) betapost <- beta + n
Γ(α + nx̄) logRet <- dgamma(lambda, alphapost, betapost, log=TRUE)
(β + n)α+nx̄
if(log)
= exp(θ)α+nx̄ exp(−(β + n) exp(θ)). return(logRet)
Γ(α + nx̄) else
return(exp(logRet))
d) Let α = 1, β = 1 and assume that x̄ = 9.9 has been obtained for n = 10 }
observations from the model. Compute approximate values of E(θ | x1:n ) and # integrand for computation of posterior expectation
integrand.mean <- function(lambda)
Var(θ | x1:n ) via: {
i. the asymptotic normality of the posterior, log(lambda) *
lambdaDens(lambda, alpha, beta, n, xbar)
ii. numerical integration (cf . Appendix C.2.1), }

# numerical integration to get posterior expectation as the sample size n = 10 is quite small so that an asymptotic approximation
res.mean <- integrate(integrand.mean, lower = 0, upper = Inf, may be inaccurate.
stop.on.error = FALSE,
rel.tol = sqrt(.Machine$double.eps)) 6. Consider the genetic linkage model from Exercise 5 in Chapter 2. Here we assume
if(res.mean$message == "OK")
mean <- res.mean$value a uniform prior on the proportion φ, i. e. φ ∼ U(0, 1). We would like to compute
else the posterior mean E(φ | x).
mean <- NA
a) Construct a rejection sampling algorithm to simulate from f (φ | x) using the
# numerical computation of variance prior density as the proposal density.
integrand.square <- function(lambda)
{ (log(lambda))^2 * ◮
lambdaDens(lambda, alpha, beta, n, xbar)
> ## define the log-likelihood function (up to multiplicative constants)
}
> ## Comment: on log-scale the numerical calculations are more robust
> loglik <- function(phi, x)
res.square <- integrate(integrand.square, lower = 0, upper = Inf,
{
stop.on.error = FALSE,
loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi)
rel.tol = sqrt(.Machine$double.eps))
return(loglik)
if(res.square$message == "OK")
}
var <- res.square$value - mean^2
> ## rejection sampler (M: number of samples, x: data vector)
else
> rej <- function(M, x)
var <- NA
{
return(c(mean=mean,var=var))
post.mode <- optimize(loglik, x=x, lower = 0, upper = 1,
}
maximum = TRUE)
> # numerical approximation for posterior mean and variance of theta
> numInt(alpha, beta, n, xbar)
## determine constant a to be ordinate at the mode
mean var ## a represents the number of trials up to the first success
2.20226658 0.01005017 (a <- post.mode$objective)
iii. To obtain a random sample from the distribution of θ = log(λ), we first
## empty vector of length M
generate a random sample from the distribution of λ - which is a Gamma phi <- double(M)
distribution - and then transform this sample:
## counter to get M samples
> ## Monte-Carlo integration
N <- 1
> M <- 10000
while(N <=M)
> ## parameters of posterior distribution
{
> alphapost <- alpha + n*xbar
while (TRUE)
> betapost <- beta + n
{
> ## sample for lambda
## value from uniform distribution
> lambdaSam <- rgamma(M,alphapost,betapost)
u <- runif(1)
> ## sample for theta
## proposal for phi
> thetaSam <- log(lambdaSam)
z <- runif(1)
> # conditional expectation of theta
## check for acceptance
> (Etheta <- mean(thetaSam))
## exit the loop after acceptance
[1] 2.201973
if (u <= exp(loglik(phi=z, x=x)-a))
> ## Monte Carlo standard error break
> (se.Etheta <- sqrt(var(thetaSam)/M)) }
[1] 0.001001739 ## save the proposed value
> EthetaSq <- mean(thetaSam^2) phi[N] <- z
> # conditional variance of theta ## go for the next one
> (VarTheta <- EthetaSq - Etheta^2) N <- N+1
[1] 0.0100338 }
return(phi)
The estimates obtained by numerical integration and Monte Carlo integration }
are similar. The estimate of the posterior mean obtained from asymptotic nor-
mality is larger than the other two estimates. This difference is not surprising


b) Estimate the posterior mean of φ by Monte Carlo integration using M = 10 000 c) In 6b) we obtained samples of the posterior distribution assuming a uniform
samples from f (φ | x). Calculate also the Monte Carlo standard error. prior on φ. Suppose we now assume a Be(0.5, 0.5) prior instead of the previous
◮ U(0, 1) = Be(1, 1). Use the importance sampling weights to estimate the pos-
> ## load the data
terior mean and Monte Carlo standard error under the new prior based on the
> x <- c(125, 18, 20, 34) old samples from 6b).
> ## draw values from the posterior ◮
> set.seed(2012)
> M <- 10000 > ## Importance sampling -- try a new prior, the beta(0.5,0.5)
> phipost <- rej(M=M, x=x) > ## posterior density ratio = prior density ratio!
> ## posterior mean by Monte Carlo integration > weights <- dbeta(phipost, .5, .5)/dbeta(phipost, 1, 1)
> (Epost <- mean(phipost)) > (Epost2 <- sum(phipost*weights)/sum(weights))
[1] 0.6227486 [1] 0.6240971
> ## compute the Monte Carlo standard error > ## or simpler
> (se.Epost <- sqrt(var(phipost)/M)) > (Epost2 <- weighted.mean(phipost, w=weights))
[1] 0.0005057188 [1] 0.6240971
> ## check the posterior mean using numerical integration: > ##
> numerator <- integrate(function(phi) phi * exp(loglik(phi, x)), > (se.Epost2 <- 1/sum(weights)*sum((phipost - Epost2)^2*weights^2))
lower=0,upper=1)$val [1] 0.001721748
> denominator <- integrate(function(phi) exp(loglik(phi, x)), > hist(phipost, prob=TRUE, nclass=100, main=NULL)
lower=0,upper=1)$val > abline(v=Epost2, col="blue")
> numerator/denominator
[1] 0.6228061
> ## draw histogram of sampled values 8
> hist(phipost, prob=TRUE, nclass=100, main=NULL)
> ## compare with density
6
> phi.grid <- seq(0,1,length=1000)

Density
> dpost <- function(phi) exp(loglik(phi, x)) / denominator
> lines(phi.grid,dpost(phi.grid),col=2) 4
> abline(v=Epost, col="red")
2

0.4 0.5 0.6 0.7 0.8

phipost

7. As in Exercise 6, we consider the genetic linkage model from Exercise 5 in Chap- # ml <- optim(0.5, log.lik, x=data, control=mycontrol, hessian=T,
ter 2. Now, we would like to sample from the posterior distribution of φ using # method="L-BFGS-B", lower=0+eps, upper=1-eps)
# mymean <- ml$par
MCMC. Using the Metropolis-Hastings algorithm, an arbitrary proposal distribu- # mystd <- sqrt(-1/ml$hessian)
tion can be used and the algorithm will always converge to the target distribution. ##################################################################
However, the time until convergence and the degree of dependence between the # count number of accepted and rejected values
samples depends on the chosen proposal distribution. yes <- 0
no <- 0
a) To sample from the posterior distribution, construct an MCMC sampler based
on the following normal independence proposal (cf . approximation 6.6.2 in # use as initial starting value mymean
xsamples[1] <- mymean
Section 6.6.2):

φ∗ ∼ N Mod(φ | x), F 2 · C −1 , # Metropolis-Hastings iteration
for(k in 2:M){
where Mod(φ | x) denotes the posterior mode, C the negative curvature of the # value of the past iteration
old <- xsamples[k-1]
log-posterior at the mode and F a factor to blow up the variance.
◮ # propose new value
# factor fac blows up standard deviation
> # define the log-likelihood function proposal <- rnorm(1, mean=mymean, sd=mystd*factor)
> log.lik <- function(phi, x)
{ # compute acceptance ratio
if((phi<1)&(phi>0)) { # under uniform proir: posterior ratio = likelihood ratio
loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi) posterior.ratio <- exp(log.lik(proposal, data)-log.lik(old, data))
} if(is.na(posterior.ratio)){
else { # happens when the proposal is not between 0 and 1
# if phi is not in the defined range return NA # => acceptance probability will be 0
loglik <- NA posterior.ratio <- 0
} }
return(loglik) proposal.ratio <- exp(dnorm(old, mymean, mystd*factor, log=T) -
} dnorm(proposal, mymean, mystd*factor, log=T))
> # MCMC function with independence proposal
> # M: number of samples, x: data vector, factor: factor to blow up # get the acceptance probability
> # the variance alpha <- posterior.ratio*proposal.ratio
> mcmc_indep <- function(M, x, factor)
{ # accept-reject step
# store samples here if(runif(1) <= alpha){
xsamples <- rep(NA, M) # accept the proposed value
xsamples[k] <- proposal
# Idea: Normal Independence proposal with mean equal to the posterior # increase counter of accepted values
# mode and standard deviation equal to the standard error or to yes <- yes + 1
# a multiple of the standard error. }
mymean <- optimize(log.lik, x=x, lower = 0, upper = 1, else{
maximum = TRUE)$maximum # stay with the old value
xsamples[k] <- old
# negative curvature of the log-posterior at the mode no <- no + 1
a <- -1*(-x[1]/(2+mymean)^2 - (x[2]+x[3])/(1-mymean)^2 - x[4]/mymean^2) }
mystd <- sqrt(1/a) }

################################################################## # acceptance rate


## Alternative using optim instead of optimize. cat("The acceptance rate is: ", round(yes/(yes+no)*100,2), "%\n", sep="")
## Optim returns directly Hessian. return(xsamples)
# eps <- 1E-9 }
# # if fnscale < 0 the maximum is comupted
# mycontrol <- list(fnscale=-1, maxit=100)
#

proposal.ratio <- 1

# get the acceptance probability


alpha <- posterior.ratio*proposal.ratio

# accept-reject step
if(runif(1) <= alpha){
# accept the proposed value
xsamples[k] <- proposal
# increase counter of accepted values
yes <- yes + 1
}
else{
# stay with the old value
xsamples[k] <- old
no <- no + 1
b) Construct an MCMC sampler based on the following random walk proposal: }
}
φ∗ ∼ U(φ(m) − d, φ(m) + d),
# acceptance rate
cat("The acceptance rate is: ", round(yes/(yes+no)*100,2),
(m)
where φ denotes the current state of the Markov chain and d is a constant. "%\n", sep="")
◮ return(xsamples)
}
> # MCMC function with random walk proposal
> # M: number of samples, x: data vector, fac: factor to blow up
> # the variance
> mcmc_rw <- function(M, x, d){

# store samples here


xsamples <- rep(NA, M)

# count number of accepted and rejected values


yes <- 0
no <- 0

# specify a starting value


xsamples[1] <- 0.5

# Metropolis-Hastings iteration
for(k in 2:M){
# value of the past iteration c) Generate M = 10 000 samples from algorithm 7a), setting F = 1 and F = 10,
and from algorithm 7b) with d = 0.1 and d = 0.2. To check the convergence of
old <- xsamples[k-1]

# propose new value the Markov chain:


# use a random walk proposal: U(phi^(k) - d, phi^(k) + d)
i. plot the generated samples to visually check the traces,
proposal <- runif(1, old-d, old+d)
ii. plot the autocorrelation function using the R-function acf,
# compute acceptance ratio iii. generate a histogram of the samples,
posterior.ratio <- exp(log.lik(proposal, data)-log.lik(old, data))
if(is.na(posterior.ratio)){ iv. compare the acceptance rates.
# happens when the proposal is not between 0 and 1 What do you observe?
# => acceptance probability will be 0
posterior.ratio <- 0 ◮
}
> # data
# the proposal ratio is equal to 1
> data <- c(125, 18, 20, 34)
# as we have a symmetric proposal distribution

> # number of iterations All four Markov chains converge quickly after a few hundred iterations. The
> M <- 10000 independence proposal with the original variance (F=1) performs best: It pro-
> # get samples using Metropolis-Hastings with independent proposal
> indepF1 <- mcmc_indep(M=M, x=data, factor=1) duces uncorrelated samples and has a high acceptance rate. In contrast, the
The acceptance rate is: 96.42% independence proposal with blown-up variance (F = 10) performs worst. It has
> indepF10 <- mcmc_indep(M=M, x=data, factor=10) a low acceptance rate and thus the Markov chain often gets stuck in the same
The acceptance rate is: 12.22%
> # get samples using Metropolis-Hastings with random walk proposal value for several iterations, which leads to correlated samples. Regarding the
> RW0.1 <- mcmc_rw(M=M, x=data, d=0.1) random walk proposals, the one with the wider proposal distribution (d = 0.2)
The acceptance rate is: 63.27% performs better since it yields less correlated samples and has a preferable ac-
> RW0.2 <- mcmc_rw(M=M, x=data, d=0.2)
ceptance rate. (For random walk proposals, acceptance rates between 30% and
The acceptance rate is: 40.33%
> ## some plots 50% are recommended.)
> # independence proposal with F=1
> par(mfrow=c(4,3))
8. Cole et al. (2012) describe a rejection sampling approach to sample from a poste-
> # traceplot rior distribution as a simple and efficient alternative to MCMC. They summarise
> plot(indepF1, type="l", xlim=c(2000,3000), xlab="Iteration") their approach as:
> # autocorrelation plot
> acf(indepF1) I. Define model with likelihood function L(θ; y) and prior f (θ).
> # histogram
II. Obtain the maximum likelihood estimate θ̂ML .
> hist(indepF1, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
ylab=expression(hat(f)*(phi ~"|"~ x)), main="") III.To obtain a sample from the posterior:
> # ylab=expression(phi^{(k)})
i. Draw θ ∗ from the prior distribution (note: this must cover the range of the
> # independence proposal with F=10
> plot(indepF10, type="l", xlim=c(2000,3000), xlab="Iteration") posterior).
> acf(indepF10) ii. Compute the ratio p = L(θ ∗ ; y)/L(θ̂ML ; y).
> hist(indepF10, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
ylab=expression(hat(f)*(phi ~"|"~ x)), main="") iii. Draw u from U(0, 1).
> # random walk proposal with d=0.1 iv. If u ≤ p, then accept θ ∗ . Otherwise reject θ ∗ and repeat.
> plot(RW0.1, type="l", xlim=c(2000,3000), xlab="Iteration")
> acf(RW0.1) a) Using Bayes’ rule, write out the posterior density of f (θ | y). In the notation of
> hist(RW0.1, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi), Section 8.3.3, what are the functions fX (θ), fZ (θ) and L(θ; x) in the Bayesian
ylab=expression(hat(f)*(phi ~"|"~ x)), main="")
> # random walk proposal with d=0.2 formulation?
> plot(RW0.2, type="l", xlim=c(2000,3000), xlab="Iteration") ◮ By Bayes’ rule, we have
> acf(RW0.2)
> hist(RW0.2, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi), f (y | θ)f (θ)
f (θ | y) = ,
ylab=expression(hat(f)*(phi ~"|"~ x)), main="") f (y)
R
^f (φ | x)

0.80 1.0
indepF1

0.75 8
where f (y) = f (y | θ)f (θ) dθ is the marginal likelihood. The posterior density
0.70 0.8
ACF

0.65 0.6 6
0.60 0.4 4
0.55 0.2 2
0.50 0.0
0.45 0
2000 2200 2400 2600 2800 3000 0 10 20 30 40 0.4 0.5 0.6 0.7 0.8 is the target, so f (θ | y) = fZ (θ), and the prior density is the proposal, so that
f (θ) = fX (θ). As usual, the likelihood is f (y | θ) = L(θ; y). Thus, we can
Iteration Lag φ
indepF10

0.80
^f (φ | x)

0.75 1.0 10
0.70 0.8
ACF

0.65 0.6 8
0.60 0.4 6
0.55
0.50
0.45
0.2
0.0
4
2
0 rewrite the above equation as
2000 2200 2400 2600 2800 3000 0 10 20 30 40 0.4 0.5 0.6 0.7 0.8
L(θ; y)
fX (θ) = fZ (θ)
Iteration Lag φ
(8.2)
^f (φ | x)

1.0
RW0.1

0.7 0.8 8
c′
ACF

0.6 0.6 6
0.5 0.4 4
0.2 2
0.4
with constant c′ = f (y).
0.0 0
2000 2200 2400 2600 2800 3000 0 10 20 30 40 0.4 0.5 0.6 0.7 0.8

Iteration Lag φ
b) Show that the acceptance probability fX (θ ∗ )/{afZ (θ ∗ )} is equal to
^f (φ | x)

1.0
RW0.2

0.75 0.8 8
0.70
L(θ ∗ ; y)/L(θ̂ML ; y). What is a?
ACF

0.65 0.6 6
0.60 0.4 4
0.55 0.2 2
0.50
0.45 0.0 0
2000 2200 2400 2600 2800 3000 0 10 20 30 40 0.4 0.5 0.6 0.7 0.8
◮ Let U denote a random variable with U ∼ U(0, 1). Then, the acceptance
Iteration Lag φ
probability is
Pr(U ≤ p) = p = L(θ ∗ ; y)/L(θ̂ML ; y)

as p ∈ [0, 1]. Solving the equation ## proposal from prior distribution


z <- rbeta(1,0.5,0.5)
fX (θ ∗ ) f (θ ∗ | y) L(θ ∗ ; y) ## compute relative likelihood of the proposal
= =
afZ (θ ∗ ) af (θ ∗ ) L(θ̂ML ; y)
p <- exp(loglik(z,x) - loglik(mle,x))
## value from uniform distribution
u <- runif(1)
for a and applying Bayes’ rule yields ## check for acceptance
## exit the loop after acceptance
f (θ ∗ | y) L(θ̂ML ; y) if (u <= p)
a = L(θ̂ML ; y) = . (8.3)
L(θ ∗ ; y)f (θ ∗ ) c′ break
}
Note that the constant c′ = f (y) is not explicitly known. ## save the proposed value
phi[N] <- z
c) Explain why the inequality fX (θ) ≤ afZ (θ) is guaranteed by the approach of ## go for the next one
Cole et al. (2012). N <- N+1
}
◮ Since L(θ; y) ≤ L(θ̂ML ; y) for all θ by construction, the inequality follows return(phi)
easily by combining (8.2) and (8.3). Note that the expression for a given in }
(8.3) is the smallest constant a such that the sampling criterion fX (θ) ≤ afZ (θ) > ## draw histogram of sampled values
> phipost <- rejCole(10000,x)
is satisfied. This choice of a will result in more samples being accepted than > hist(phipost, prob=TRUE, nclass=100, main=NULL)
for a larger a. > ## compare with density from Ex. 6
> phi.grid <- seq(0,1,length=1000)
d) In the model of Exercise 6c), use the proposed rejection sampling scheme to > dpost <- function(phi) exp(loglik(phi, x)) / denominator
generate samples from the posterior of φ. > lines(phi.grid,dpost(phi.grid),col=2)
◮ In Exercise 6c), we have produced a histogram of the posterior distribution > abline(v=Epost, col="red")

of φ. We therefore know that the Be(0.5, 0.5) distribution, which has range 8
[0, 1], covers the range of the posterior distribution so that the condition in
Cole et al.’s rejection sampling algorithm is satisfied. 6
> # data

Density
> x <- c(125,18,20,34)
4
> n <- sum(x)
> ## define the log-likelihood function (up to multiplicative constants)
> loglik <- function(phi, x) 2
{
loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi)
return(loglik) 0
}
0.45 0.50 0.55 0.60 0.65 0.70 0.75
> ## rejection sampler (M: number of samples, x: data vector);
> ## approach by Cole et al. phipost
> rejCole <- function(M, x)
{
# determine the MLE for phi
mle <- optimize(loglik, x=x, lower = 0, upper = 1,
maximum = TRUE)$maximum

## empty vector of length M


phi <- double(M)

## counter to get M samples


N <- 1
while(N <=M)
{
while (TRUE)
{
9 Prediction

1. Five physicians participate in a study to evaluate the effect of a medication for


migraine. Physician i = 1, . . . , 5 treats ni patients with the new medication and
it shows positive effects for yi of the patients. Let π be the probability that an
arbitrary migraine patient reacts positively to the medication. Given that

n = (3, 2, 4, 4, 3) and y = (2, 1, 4, 3, 3)

a) Provide an expression for the likelihood L(π) for this study.


◮ We make two assumptions:
i. The outcomes for different patients treated by the same physician are inde-
pendent.
ii. The study results of different physicians are independent.
By assumption (i), the yi can be modelled as realisations of a binomial distribution:

    Yi ∼iid Bin(ni, π),   i = 1, . . . , 5.

By assumption (ii), the likelihood of π is then

    L(π) = ∏_{i=1}^{5} (ni choose yi) π^{yi} (1 − π)^{ni − yi} ∝ π^{5ȳ} (1 − π)^{5n̄ − 5ȳ},

where ȳ = (1/5) ∑_{i=1}^{5} yi is the mean number of successful treatments per physician
and n̄ = (1/5) ∑_{i=1}^{5} ni is the mean number of patients treated per study.
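A short R sketch (not part of the original solution) evaluating this likelihood for the
given data; logLikPi is a name introduced here.
> ## data
> n <- c(3, 2, 4, 4, 3)
> y <- c(2, 1, 4, 3, 3)
> ## log-likelihood of pi (up to an additive constant)
> logLikPi <- function(pi) sum(y) * log(pi) + sum(n - y) * log(1 - pi)
> ## the maximum is attained at the ML estimate sum(y)/sum(n) = 13/16 = 0.8125
> optimize(logLikPi, interval = c(0, 1), maximum = TRUE)$maximum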
b) Specify a conjugate prior distribution f (π) for π and choose appropriate values
for its parameters. Using these parameters derive the posterior distribution
f (π | n, y).
◮ It is easy to see that the beta distribution Be(α, β) with kernel

f (π) ∝ π α−1 (1 − π)β−1

is conjugate with respect to the above likelihood (or see Example 6.7). We choose
the non-informative Jeffreys’ prior as prior for π, i. e. we choose α = β = 1/2
(see Table 6.3). This gives the following posterior distribution for π:

π | n, y ∼ Be(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2).
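For the given data, a brief R check of this posterior (not part of the original solution;
the variable names are ours):
> ## posterior parameters Be(5*ybar + 1/2, 5*nbar - 5*ybar + 1/2)
> n <- c(3, 2, 4, 4, 3)
> y <- c(2, 1, 4, 3, 3)
> alphaPost <- sum(y) + 0.5       # 13.5
> betaPost <- sum(n - y) + 0.5    # 3.5
> ## posterior mean of pi
> alphaPost / (alphaPost + betaPost)   # approximately 0.79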



c) A sixth physician wants to participate in the study with n6 = 5 patients. De- [,1] [,2] [,3] [,4] [,5]
termine the posterior predictive distribution for y6 (the number of patients out [1,] 0.000000000 1.00000000 2.00000000 3.0000000 4.0000000
[2,] 0.001729392 0.01729392 0.08673568 0.2824352 0.6412176
of the five for which the medication will have a positive effect). [,6]
◮ The density of the posterior predictive distribution is [1,] 5
[2,] 1
Z1
Thus, the 2.5% quantile is 2 and the 97.5% quantile is 5 so that the 95% pre-
f (y6 | n6 , y, n) = f (y6 | π, n6 )f (π | y, n) dπ
diction interval is [2, 5]. Clearly, this interval does not contain exactly 95% of
0
the probability mass of the predictive distribution since the distribution of Y6 is
Z1  
n6 y6 discrete. In fact, the predictive probability for Y6 to fall into [2, 5] is larger:
= π (1 − π)n6 −y6
y6
0 1 − Pr(Y6 ≤ 1) = 0.9827.
1
· π 5ȳ−1/2 (1 − π)5n̄−5ȳ−1/2 dπ
B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2) d) Calculate the likelihood prediction as well.
 
n6 ◮ The extended likelihood function is
= B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2)−1  
y6
n6 5ȳ+y6
Z1 L(π, y6 ) = π (1 − π)5n̄+n6 −5ȳ−y6 .
y6
· π 5ȳ+y6 −1/2 (1 − π)5n̄+n6 −5ȳ−y6 −1/2 dπ (9.1)
0
If y6 had been observed, then the ML estimate of π would be
 
n6 B(5ȳ + y6 + 1/2, 5n̄ + n6 − 5ȳ − y6 + 1/2) 5ȳ + y6
= , π̂(y6 ) = ,
y6 B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2) 5n̄ + n6

where in (9.1), we have used that the integrand is the kernel of a Be(5ȳ + y6 + which yields the predictive likelihood
1/2, 5n̄ + n6 − 5ȳ − y6 + 1/2) density. The obtained density is the density of a
Lp (y6 ) = L(π̂(y6 ), y6 )
beta-binomial distribution (see Table A.1), more precisely
  5ȳ+y6  5n̄+n6 −5ȳ−y6
n6 5ȳ + y6 5n̄ + n6 − 5ȳ − y6
y6 | n6 , y, n ∼ BeB(n6 , 5ȳ + 1/2, 5n̄ − 5ȳ + 1/2). = .
y6 5n̄ + n6 5n̄ + n6

Addition: Based on this posterior predictive distribution, we now compute a The likelihood prediction
point prediction and a prognostic interval for the given data:
Lp (y6 )
> ## given observations fp (y6 ) = Pn6
> n <- c(3, 2, 4, 4, 3) y=0 Lp (y)
> y <- c(2, 1, 4, 3, 3)
> ## parameters of the beta-binomial posterior predictive distribution can now be calculated numerically:
> ## (under Jeffreys' prior) > ## predictive likelihood
> alphaStar <- sum(y) + 0.5 > predLik <- function(yNew, nNew)
> betaStar <- sum(n - y) + 0.5 {
> nNew <- 5 ## number of patients treated by the additional physician sumY <- sum(y) + yNew
> ## point prediction: expectation of the post. pred. distr. sumN <- sum(n) + nNew
> (expectation <- nNew * alphaStar / (alphaStar + betaStar)) pi <- sumY / sumN
[1] 3.970588
> ## compute cumulative distribution function to get a prediction interval logRet <- lchoose(nNew, yNew) + sumY * log(pi)
> library(VGAM, warn.conflicts = FALSE) + (sumN - sumY) * log(1 - pi)
> rbind(0:5, return(exp(logRet))
pbetabinom.ab(0:5, size = nNew, alphaStar, betaStar)) }
> ## calculate values of the discrete likelihood prediction:
> predictiveProb <- predLik(0:5, 5)
> (predictiveProb <- predictiveProb / sum(predictiveProb))
   [1] 0.004754798 0.041534762 0.155883020 0.312701779
   [5] 0.333881997 0.151243644
   > ## distribution function
   > cumsum(predictiveProb)
   [1] 0.004754798 0.046289560 0.202172580 0.514874359
   [5] 0.848756356 1.000000000
   The values of the discrete distribution function are similar to ones obtained from
   the Bayes prediction. The 95% prediction interval is also [2, 5] here. The point
   estimate from the likelihood prediction turns out to be:
   > sum((0:5) * predictiveProb)
   [1] 3.383152
   This estimate is close to 3.9706 from the Bayes prediction.
2. Let X1:n be a random sample from a N(µ, σ²) distribution from which a further
   observation Y = Xn+1 is to be predicted. Both the expectation µ and the variance
   σ² are unknown.
   a) Start by determining the plug-in predictive distribution.
      ◮ Note that in contrast to Example 9.2, the variance σ² is unknown here. By
      Example 5.3, the ML estimates are

      µ̂_ML = x̄   and   σ̂²_ML = (1/n) Σ_{i=1}^{n} (xi − x̄)².                      (9.2)

      The plug-in predictive distribution is thus

      Y ∼ N(x̄, (1/n) Σ_{i=1}^{n} (xi − x̄)²).

   b) Calculate the likelihood and the bootstrap predictive distributions.
      ◮ The extended likelihood is

      L(µ, σ², y) = f(y | µ, σ²) · L(µ, σ²)
                  ∝ σ^{−1} exp{−(y − µ)²/(2σ²)} · (σ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²}
                  = (σ²)^{−(n+1)/2} exp[−(1/(2σ²)) {(y − µ)² + n(x̄ − µ)² + Σ_{i=1}^{n} (xi − x̄)²}]

      and the ML estimates of the parameters based on the extended data set are

      µ̂(y) = (nx̄ + y)/(n + 1)   and   σ̂²(y) = (1/(n + 1)) {Σ_{i=1}^{n} (xi − µ̂(y))² + (y − µ̂(y))²}.

      This yields the predictive likelihood

      Lp(y) = L(µ̂(y), σ̂²(y), y),

      which can only be normalised numerically for a given data set to obtain the
      likelihood prediction f(y) = Lp(y) / ∫ Lp(u) du.
      To determine the bootstrap predictive distribution, we need the distribution of
      the ML estimators in (9.2). In Example 3.5 and Example 3.8, respectively, we
      have seen that

      µ̂_ML | µ, σ² ∼ N(µ, σ²/n)   and   Σ_{i=1}^{n} (Xi − X̄)² / σ² | σ² ∼ χ²(n − 1).

      In addition, the two above random variables are independent. Since
      χ²(d) = G(d/2, 1/2), we can deduce

      σ̂²_ML | σ² = (1/n) Σ_{i=1}^{n} (Xi − X̄)² | σ² ∼ G((n − 1)/2, n/(2σ²))

      by using the fact that the second parameter of the Gamma distribution is an
      inverse scale parameter (see Appendix A.5.2).
      The bootstrap predictive distribution of y given θ = (µ, σ²)^T has density

      g(y; θ) = ∫_0^∞ ∫_{−∞}^{∞} f(y | µ̂_ML, σ̂²_ML) f(µ̂_ML | µ, σ²) f(σ̂²_ML | µ, σ²) dµ̂_ML dσ̂²_ML
              = ∫_0^∞ {∫_{−∞}^{∞} f(y | µ̂_ML, σ̂²_ML) f(µ̂_ML | µ, σ²) dµ̂_ML} f(σ̂²_ML | µ, σ²) dσ̂²_ML.      (9.3)

      The inner integral in (9.3) corresponds to the marginal likelihood in the
      normal-normal model. From (7.18) we thus obtain

      ∫_{−∞}^{∞} f(y | µ̂_ML, σ̂²_ML) f(µ̂_ML | µ, σ²) dµ̂_ML
          = (2πσ̂²_ML)^{−1/2} {(n/σ²)/(1/σ̂²_ML + n/σ²)}^{1/2}
            · exp{−(y − µ)²/(2σ̂²_ML) · (n/σ²)/(1/σ̂²_ML + n/σ²)},                   (9.4)

      which is the N(µ, σ̂²_ML + σ²/n) density evaluated at y.
      Analytical computation of the outer integral in (9.3) is however difficult. To
      calculate g(y; θ), we can use Monte Carlo integration instead: We draw a large
      number of random numbers (σ̂²_ML)^{(i)} from the G((n − 1)/2, n/(2σ̂²_ML))
      distribution, plug them into (9.4) and compute the mean for the desired values
      of y. Of course, this only works for a given data set x1, . . . , xn.
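      A minimal R sketch of this Monte Carlo approximation, evaluating g(y; θ) at the observed ML estimates as described above (the function name and the number of draws B are our choices):
      > ## Monte Carlo approximation of the bootstrap predictive density
      > bootPredDens <- function(yGrid, x, B = 10000)
        {
            n <- length(x)
            xbar <- mean(x)
            sigma2hat <- mean((x - xbar)^2)   # ML estimate of sigma^2
            ## draws of (sigma2hat_ML)^(i) as described above
            s2 <- rgamma(B, shape = (n - 1) / 2, rate = n / (2 * sigma2hat))
            ## average (9.4), i.e. the N(xbar, s2 + sigma2hat/n) density, over the draws
            sapply(yGrid, function(y)
                mean(dnorm(y, mean = xbar, sd = sqrt(s2 + sigma2hat / n))))
        }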
   c) Derive the Bayesian predictive distribution under the assumption of the
      reference prior f(µ, σ²) ∝ σ^{−2}.
      ◮ As in Example 6.24, it is convenient to work with the precision κ = (σ²)^{−1}
      and the corresponding reference prior f(µ, κ) ∝ κ^{−1}, which formally corresponds
      to the normal-gamma distribution NG(0, 0, −1/2, 0). By (6.26), the posterior
      distribution is again a normal-gamma distribution:

      (µ, κ) | x1:n ∼ NG(µ*, λ*, α*, β*),

      where

      µ* = x̄,   λ* = n,   α* = (n − 1)/2   and   β* = (1/2) Σ_{i=1}^{n} (xi − x̄)².

      In Exercise 3 in Chapter 7, we have calculated the prior predictive distribution
      for this model. Since we have a conjugate prior distribution, we can infer the
      posterior predictive distribution from the prior predictive distribution by
      replacing the prior parameters by the posterior parameters and adjusting the
      likelihood appropriately. For the latter task, we set n = 1, σ̂²_ML = 0 and
      x̄ = y in formula (7.1) in the solutions since we want to predict one observation
      only. In addition, we replace the prior parameters by the posterior parameters
      in (7.1) to obtain

      f(y | x) = {1/(2π)}^{1/2} {λ*/(λ* + 1)}^{1/2} · Γ(α* + 1/2) (β*)^{α*} / Γ(α*)
                 · {β* + (λ* + 1)^{−1} λ* (µ* − y)²/2}^{−(α* + 1/2)}
               ∝ {(1/2) Σ_{i=1}^{n} (xi − x̄)² + (n + 1)^{−1} n (x̄ − y)²/2}^{−((n−1)/2 + 1/2)}
               = {(1/2) Σ_{i=1}^{n} (xi − x̄)²}^{−n/2} {1 + 2n(y − x̄)²/(2(n + 1) Σ_{i=1}^{n} (xi − x̄)²)}^{−n/2}
               ∝ {1 + (y − x̄)²/((n − 1)(1 + 1/n) · (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)²)}^{−((n−1)+1)/2}.

      This is the kernel of the t distribution mentioned at the end of Example 9.7,
      i.e. the posterior predictive distribution is

      Y | x1:n ∼ t(x̄, (1 + 1/n) · Σ_{i=1}^{n} (xi − x̄)²/(n − 1), n − 1).

3. Derive Equation (9.11).
   ◮ We proceed analogously as in Example 9.7. By Example 6.8, the posterior
   distribution of µ is

   µ | x1:n ∼ N(µ̄, σ²/(n + δσ²)),

   where

   µ̄ = E(µ | x1:n) = σ²/(δσ² + n) · (nx̄/σ² + δν).

   The posterior predictive distribution of Y | x1:n has thus density

   f(y | x1:n) = ∫ f(y | µ) f(µ | x1:n) dµ
               ∝ ∫ exp[−(1/2) {(µ − y)²/σ² + (δσ² + n)(µ − µ̄)²/σ²}] dµ.

   We now use Appendix B.1.5 to combine these two quadratic forms:

   (µ − y)²/σ² + (δσ² + n)(µ − µ̄)²/σ²
       = (δσ² + n + 1)/σ² · (µ − c)² + (δσ² + n)/{σ²(δσ² + n + 1)} · (y − µ̄)²,

   for

   c = {y + (δσ² + n)µ̄}/(δσ² + n + 1).

   Since the second term does not depend on µ, this implies

   f(y | x1:n) ∝ exp[−(δσ² + n)/{2σ²(δσ² + n + 1)} (y − µ̄)²]
                 · ∫ exp[−(δσ² + n + 1)/(2σ²) (µ − c)²] dµ
               ∝ exp[−(δσ² + n)/{2σ²(δσ² + n + 1)} (y − µ̄)²],                      (9.5)

   where we have used that the above integrand is the kernel of a normal density
   (the integral equals √(2π) σ/√(δσ² + n + 1)). Now, (9.5) is the kernel of a normal
   density with mean µ̄ and variance σ²(δσ² + n + 1)/(δσ² + n), so the posterior
   predictive distribution is

   Y | x1:n ∼ N(µ̄, σ² (δσ² + n + 1)/(δσ² + n)).
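   Returning to part c) of Exercise 2, a small R sketch of the resulting posterior predictive density (the function name is ours); it evaluates the location-scale t density with the scale and degrees of freedom derived above:
   > ## posterior predictive density from Exercise 2c) for a given data vector x
   > postPredDens <- function(y, x)
     {
         n <- length(x)
         xbar <- mean(x)
         scale2 <- (1 + 1/n) * sum((x - xbar)^2) / (n - 1)  # squared scale
         dt((y - xbar) / sqrt(scale2), df = n - 1) / sqrt(scale2)
     }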
4. Prove Murphy's decomposition (9.16) of the Brier score.
   ◮ We first establish a useful identity: since the observations are binary, the mean
   of the squared observations, (1/n) Σ_{i=1}^{n} yi², equals the mean ȳ of the original
   observations, so that

   (1/n) Σ_{i=1}^{n} (yi − ȳ)² = (1/n) Σ_{i=1}^{n} yi² − ȳ² = ȳ − ȳ² = ȳ(1 − ȳ).      (9.6)

   Let the observations be assigned to J groups and denoted by yji, i = 1, . . . , nj,
   j = 1, . . . , J with group means

   ȳj = (1/nj) Σ_{i=1}^{nj} yji

   and representative prediction probabilities πj. In total, there are N = Σ_{j=1}^{J} nj
   observed values with overall mean (or relative frequency) ȳ.
   We now calculate the right-hand side of Murphy's decomposition (9.16) and use (9.6):

   ȳ(1 − ȳ) + SC − MR
       = (1/N) Σ_{j=1}^{J} Σ_{i=1}^{nj} (yji − ȳ)² + (1/N) Σ_{j=1}^{J} nj (ȳj − πj)² − (1/N) Σ_{j=1}^{J} nj (ȳj − ȳ)²
       = (1/N) Σ_{j=1}^{J} {Σ_{i=1}^{nj} (yji − ȳ)² + nj (ȳj − πj)² − nj (ȳj − ȳ)²}.

   The aim is to obtain the mean Brier score

   BS = (1/N) Σ_{j=1}^{J} Σ_{i=1}^{nj} (yji − πj)²,

   which we can isolate from the first term above since

   Σ_{i=1}^{nj} (yji − ȳ)² = Σ_{i=1}^{nj} (yji − πj)² + 2(πj − ȳ) Σ_{i=1}^{nj} (yji − πj) + nj (πj − ȳ)².

   Consequently,

   ȳ(1 − ȳ) + SC − MR
       = BS + (1/N) Σ_{j=1}^{J} {2(πj − ȳ) nj (ȳj − πj) + nj (πj − ȳ)² + nj (ȳj − πj)² − nj (ȳj − ȳ)²}

   and by expanding the quadratic terms on the right-hand side of the equation, we
   see that they all cancel. This completes the proof of Murphy's decomposition.
5. Investigate if the scoring rule

   S(f(y), yo) = −f(yo)

   is proper for a binary observation Y.
   ◮ Let B(π0) denote the true distribution of the observation Yo and f the probability
   mass of the predictive distribution Y ∼ B(π) as introduced in Definition 9.9. The
   expected score under the true distribution is then

   E[S(f(y), Yo)] = −E[f(Yo)] = −f(0) · (1 − π0) − f(1) · π0
                  = −(1 − π)(1 − π0) − π · π0
                  = (1 − 2π0)π + π0 − 1.

   As a function of π, the expected score is thus a line with slope 1 − 2π0. If this
   slope is positive or negative, respectively, then the score is minimised by π = 0 or
   π = 1, respectively (compare to the proof of Result 9.2 for the absolute score).
   Hence, the score is in general not minimised by π = π0, i.e. this scoring rule is
   not proper.
6. For a normally distributed prediction show that it is possible to write the CRPS as
   in (9.17) using the formula for the expectation of the folded normal distribution in
   Appendix A.5.2.
   ◮ The predictive distribution here is the normal distribution N(µ, σ²). Let Y1
   and Y2 be independent random variables with N(µ, σ²) distribution. From this,
   we deduce

   Y1 − yo ∼ N(µ − yo, σ²)   and   Y1 − Y2 ∼ N(0, 2σ²),

   where for the latter result, we have used Var(Y1 + Y2) = Var(Y1) + Var(Y2) due to
   independence (see Appendix A.3.5). This implies (see Appendix A.5.2)

   |Y1 − yo| ∼ FN(µ − yo, σ²)   and   |Y1 − Y2| ∼ FN(0, 2σ²).

   The CRPS is therefore

   CRPS(f(y), yo) = E{|Y1 − yo|} − (1/2) E{|Y1 − Y2|}
       = 2σ ϕ{(µ − yo)/σ} + (µ − yo)[2Φ{(µ − yo)/σ} − 1] − (1/2){2 √2 σ ϕ(0) + 0}
       = 2σ ϕ{(yo − µ)/σ} + (µ − yo)(2[1 − Φ{(yo − µ)/σ}] − 1) − √2 σ/√(2π)
       = 2σ ϕ(ỹo) + (µ − yo){1 − 2Φ(ỹo)} − σ/√π
       = σ[ỹo {2Φ(ỹo) − 1} + 2ϕ(ỹo) − 1/√π],

   where ỹo = (yo − µ)/σ.
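   As a quick numerical check of this closed form (a sketch; the function name and the chosen values of µ, σ and yo are ours), one can compare it with a Monte Carlo estimate of E|Y1 − yo| − E|Y1 − Y2|/2:
   > ## closed-form CRPS for a N(mu, sigma^2) prediction, as derived above
   > crpsNormal <- function(yo, mu, sigma)
     {
         ytilde <- (yo - mu) / sigma
         sigma * (ytilde * (2 * pnorm(ytilde) - 1) + 2 * dnorm(ytilde) - 1 / sqrt(pi))
     }
   > set.seed(1)
   > y1 <- rnorm(1e6, mean = 1, sd = 2)
   > y2 <- rnorm(1e6, mean = 1, sd = 2)
   > ## Monte Carlo estimate for yo = 0.5, mu = 1, sigma = 2
   > mean(abs(y1 - 0.5)) - 0.5 * mean(abs(y1 - y2))
   > ## compare with the closed form
   > crpsNormal(yo = 0.5, mu = 1, sigma = 2)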
Bibliography
Bartlett M. S. (1937) Properties of sufficiency and statistical tests. Proceedings of the Royal
Society of London. Series A, Mathematical and Physical Sciences, 160(901):268–282.

Box G. E. P. (1980) Sampling and Bayes’ inference in scientific modelling and robustness (with
discussion). Journal of the Royal Statistical Society, Series A, 143:383–430.

Cole S. R., Chu H., Greenland S., Hamra G. and Richardson D. B. (2012) Bayesian posterior
distributions without Markov chains. American Journal of Epidemiology, 175(5):368–375.

Davison A. C. (2003) Statistical Models. Cambridge University Press, Cambridge.

Dempster A. P., Laird N. M. and Rubin D. B. (1977) Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodolog-
ical), 39(1):1–38.

Good I. J. (1995) When batterer turns murderer. Nature, 375(6532):541.

Good I. J. (1996) When batterer becomes murderer. Nature, 381(6532):481.

Goodman S. N. (1999) Towards evidence-based medical statistics. 2.: The Bayes factor. Annals
of Internal Medicine, 130:1005–1013.

Merz J. F. and Caulkins J. P. (1995) Propensity to abuse - Propensity to murder? Chance,
8(2):14.

Rao C. R. (1973) Linear Statistical Inference and Its Applications. Wiley series in probability
and mathematical statistics. John Wiley & Sons, New York.

Sellke T., Bayarri M. J. and Berger J. O. (2001) Calibration of p values for testing precise null
hypotheses. The American Statistician, 55:62–71.
Index
A
arithmetic mean 14

B
beta distribution 195
beta-binomial distribution 196
binomial distribution 195
bootstrap predictive distribution 199
burn-in 177

C
case-control study
   matched 95
change-of-variables formula 170
convolution theorem 171

E
Emax model 111
examples
   analysis of survival times 80
   blood alcohol concentration 94, 145
   capture-recapture method 7
   prevention of preeclampsia 64, 174
exponential model 75

F
Fisher information 57

H
HPD interval 172

I
inverse gamma distribution 175

J
Jeffreys' prior 195

L
likelihood function
   extended 197, 198
Lindley's paradox 150

M
Mallow's Cp statistic 144
marginal likelihood 199
minimum Bayes factor 151
Monte Carlo estimate 172, 199

N
normal distribution
   folded 203
normal-gamma distribution 200
normal-normal model 199
numerical integration 163, 168, 171, 172

P
P-value 152
Pareto distribution 134
point prediction 196
power model 74
prediction interval 198
predictive likelihood 199
prior
   -data conflict 155
   criticism 155
prior distribution
   non-informative 195
prior predictive distribution see marginal likelihood
profile likelihood confidence interval 61
prognostic interval 196

R
regression model
   logistic 98
   normal 144
risk
   log relative 59
   relative 59

S
score equations 56
significance test 71

T
Taylor approximation 73