Bayesian Course Main
Christel Faes
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Hasselt University
christel.faes@uhasselt.be
Emmanuel Lesaffre
Department of Biostatistics, Erasmus Medical Center
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Catholic University Leuven & Hasselt University
Three approaches:
. Frequentist approach
. (Likelihood approach)
. Bayesian approach
Classical approach:
Example: RCT
. Treatment A: mean μ1 & treatment B: mean μ2
. H0: Δ = μ1 − μ2 = 0
Completion of study:
. Δ̂ = 1.38 with tobs = 2.19 in the 0.05 rejection region ⟹ H0 rejected
The P-value depends on the sample space (Examples I.3 and I.4)
The P-value does not take all evidence into account (Example I.5)
P-value ≠ p(H0 | y)
Example I.2
[Figure: density of the t-statistic under H0, showing the observed t value and the two rejection areas (0.015 each) among the unobserved y values]
The possible samples are similar in some characteristics to the observed sample
(e.g. same sample size)
A small P-value does not necessarily indicate a large difference between treatments, a strong association, etc.
Interpretation: Δ most likely (with 0.95 probability) lies between 0.14 and 2.62
= a Bayesian interpretation
1. Fisher's approach
Inductive approach
Introduction of:
. Null-hypothesis (H0)
. Significance test
. P -value = evidence against H0
. Significance level
. NO alternative hypothesis
. NO power
2. Neyman & Pearson's approach
Deductive approach
Introduction of:
. Alternative hypothesis (HA)
. Type I error
. Type II error & power
. Hypothesis test
Other measures (e.g. the Bayes factor) have been proposed as a measure for or against a hypothesis
Notation:
Result on ith operation: success yi = 1, failure yi = 0
Total experiment: n operations with s successes
Sample {y1, . . . , yn} y
Probability of success = p(yi = 1) = θ
Binomial distribution:
Expresses probability of s successes out of n experiments.
f(s) = C(n, s) θ^s (1 − θ)^(n−s), with s = Σ_{i=1}^n yi and C(n, s) the binomial coefficient
[Figure: (a) likelihood and (b) log-likelihood of θ for the observed data]
ℓ(θ | s) = c + s ln θ + (n − s) ln(1 − θ)
dℓ(θ | s)/dθ = s/θ − (n − s)/(1 − θ) = 0 ⟹ θ̂ = s/n
LP 2: Two likelihood functions for θ contain the same information about θ if they are proportional to each other.
[Figure: binomial likelihood of θ with the MLE and a 95% likelihood-based CI indicated]
. Surgeon 1: s = Σ_{i=1}^n yi has a binomial distribution
⟹ binomial likelihood L1(θ | s) = C(n, s) θ^s (1 − θ)^(n−s)
. Surgeon 2: s has a negative binomial (Pascal) distribution
⟹ negative binomial likelihood L2(θ | s) = C(s + k − 1, s) θ^s (1 − θ)^k
[Figure: binomial and negative binomial likelihoods of θ with the common MLE]
Examples I.7 and I.8: combining information from a similar historical surgical technique could be used in the evaluation of the current technique = a Bayesian exercise
New mouthwash:
. Daily use of the new mouthwash before tooth brushing reduces plaque?
. Results: new mouthwash reduced 25% of plaque with a 95% CI = [10%, 40%]
. Previous trials: overall reduction in plaque in-between 5% and 15%
. Experts: plaque reduction will probably not exceed 30%
. What to conclude then?
Medical example: patients treated for CVA with a thrombolytic agent suffer from SBAs. Historical studies (20% = prior) + pilot study (10% = data) ⟹ posterior
p(B | A) = p(A | B) p(B) / p(A)
p(B | A) = p(A | B) p(B) / [p(A | B) p(B) + p(A | Bᶜ) p(Bᶜ)]
Folin-Wu blood test: screening test for diabetes (Boston City Hospital)
Se = 56/70 = 0.80
Sp = 461/510 = 0.90
prev = 70/580 = 0.12
Bayes theorem:
p(D⁺ | T⁺) = p(T⁺ | D⁺) p(D⁺) / [p(T⁺ | D⁺) p(D⁺) + p(T⁺ | D⁻) p(D⁻)]
pred⁺ = Se · prev / [Se · prev + (1 − Sp)(1 − prev)]
Folin-Wu blood test: prior (prevalence) = 0.10 & positive test ⟹ posterior = 0.47
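As a quick numerical check, the pred⁺ formula above can be evaluated in R (a minimal sketch; the rounded Se, Sp and prev values of the example are plugged in):

# Positive predictive value via Bayes theorem (Folin-Wu example)
se   <- 0.80   # sensitivity, 56/70
sp   <- 0.90   # specificity, 461/510 (rounded)
prev <- 0.10   # prior probability of diabetes
pred_pos <- (se * prev) / (se * prev + (1 - sp) * (1 - prev))
round(pred_pos, 2)   # 0.47 = posterior probability after a positive test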
Ioannidis (2005) explains why many medical research findings appear to be false
pred⁺ = (1 − β)R / [(1 − β)R + α]
If (1 − β)R > α:
⟹ posterior probability of finding a true relationship > 0.5
Power (1 − β) must be > 0.05/R to find a truly positive result with high likelihood, which is impossible for G large
Other (interesting!) conclusions, see Ioannidis (2005)
Bayes theorem will be further developed in the next chapter such that it
becomes useful in statistical practice
Figure 1.4 (a): AUC on the positive x-axis = our posterior belief that Δ is positive (= 0.98)
Figure 1.4 (b): AUC for the interval [1, ∞[ = our belief that the ratio > 1
Figure 1.4 (c): incorporation of a skeptical prior (Δ a priori around −0.5, with some uncertainty) ⟹ posterior belief that Δ is positive = 0.54
[Figure 1.4: posterior densities of (a) delta, (b) rat, (c) delta under the skeptical prior and (d) ratsig; 5,000 samples each]
Philosophical differences aside, there are practical reasons that drive the recent popularity of Bayesian analysis:
. Simplicity in thinking about problems and answering questions
. Flexibility in making inference on a wide range of models (data augmentation, hierarchical models)
. Incorporation of prior information
. Development of efficient inference and sampling tools
. Fast computers
Statistical world:
. Bayesian statistics has not been (widely) accepted for a long time.
. Frequentist world versus Bayesian (Likelihood) world
. From 1990: change in attitude of statistical community.
Medical world:
. The medical world is more conservative. Bayesian methods are better accepted in exploratory/epidemiological studies than in clinical trials. In clinical trials there seems to be a role for Bayesian methods, except in phase III studies.
A Bayesian and a frequentist were sentenced to death. When the judge asked what
their final wishes were, the Bayesian replied that he wished to teach the frequentist the
ultimate lesson. The judge granted his request and then repeated the question to the
frequentist. He replied that he wished to get the lesson again and again and again . . .
In this chapter:
A variety of examples
D⁺ ⟹ θ = 1 and D⁻ ⟹ θ = 0
T⁺ ⟹ y = 1 and T⁻ ⟹ y = 0
p(θ = 1 | y = 1) = p(y = 1 | θ = 1) p(θ = 1) / [p(y = 1 | θ = 1) p(θ = 1) + p(y = 1 | θ = 0) p(θ = 0)]
Shorthand notation:
p(θ | y) = p(y | θ) p(θ) / p(y)
Probability can have two meanings: limiting proportion (objective) or personal belief
(subjective)
Tour de France
Global warming
...
Ak: p(Ak) ≥ 0 (k = 1, . . . , K)
p(S) = 1
p(Aᶜ) = 1 − p(A)
p(θk | y) = p(y | θk) p(θk) / Σ_{k=1}^K p(y | θk) p(θk)
p(θ | y) = L(θ | y) p(θ) / p(y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
Frequentist
Likelihood
SICH incidence:
SICH: yi = 1, otherwise yi = 0
y = Σ_{i=1}^n yi has Bin(n, θ): p(y | θ) = C(n, y) θ^y (1 − θ)^(n−y)
MLE: θ̂ = 0.20
ECASS 2 likelihood: L(θ | y0) = C(n0, y0) θ^{y0} (1 − θ)^{n0−y0} (y0 = 8 & n0 = 100)
. As a function of θ: L(θ | y0) ≠ density (AUC ≠ 1)
. How to standardize? Numerically or analytically?
[Figure: ECASS 2 likelihood as a function of the proportion SICH θ]
p(θ) = [1/B(α0, β0)] θ^{α0−1} (1 − θ)^{β0−1}
with B(α0, β0) = Γ(α0)Γ(β0)/Γ(α0 + β0) and Γ(·) the gamma function
α0 = y0 + 1 = 9 & β0 = n0 − y0 + 1 = 93
[Figure: prior density proportional to the ECASS 2 likelihood (proportion SICH)]
Averaged likelihood:
p(y) = C(n, y) B(α0 + y, β0 + n − y) / B(α0, β0)
p(θ | y) = [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1}
with
ᾱ = α0 + y
β̄ = β0 + n − y
[Figure: prior (proportional to the likelihood) and the more peaked posterior]
Posterior mode: θ̄M = [n0/(n0 + n)] θ̂0 + [n/(n0 + n)] θ̂ (analogous result for the mean)
Here: posterior more peaked than prior & likelihood (not in general)
Posterior estimate = MLE of combined ECASS 2 data & interim data ECASS 3
Beta(α0, β0) prior ≡ binomial experiment with (α0 − 1) successes in (α0 + β0 − 2) experiments
Prior ⟹ adds extra data to the observed data set: (α0 − 1) successes and (β0 − 1) failures
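A minimal sketch in R of this conjugate updating, using the ECASS numbers from this chapter (Beta(9, 93) prior from ECASS 2; 10 SICH events out of 50 interim ECASS 3 patients):

# Conjugate beta-binomial updating (ECASS example)
alpha0 <- 9; beta0 <- 93      # prior = Beta(y0 + 1, n0 - y0 + 1) from ECASS 2
y <- 10; n <- 50              # interim ECASS 3 data
alpha_post <- alpha0 + y      # 19
beta_post  <- beta0 + n - y   # 133
(alpha_post - 1) / (alpha_post + beta_post - 2)   # posterior mode = 0.12
y / n                                             # MLE of the interim data = 0.20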
Suppose DSMB neurologists believe that SICH incidence is probably more than
5% but most likely not more than 20%
The neurologists could also combine their qualitative prior belief with the ECASS 2 data to construct a prior distribution ⟹ adjusted ECASS 2 prior
[Figure: ECASS 2 prior and posterior versus subjective prior and posterior]
For stroke study, NI prior:
p(θ) = I[0,1](θ) = flat prior on [0,1]
Uniform prior on [0,1] = Beta(1,1)
⟹ posterior proportional to the likelihood
[Figure: flat prior with posterior proportional to the likelihood]
y ~ N(μ, σ²) when
f(y) = [1/(√(2π) σ)] exp[−(y − μ)²/(2σ²)]
[Figure: (a) histogram of dietary cholesterol (mg/day) and (b) likelihood of μ with MLE = 328]
with prior mean μ0 = ȳ0 and prior variance σ0² = σ²/n0
IBBENS-2 study:
. sample y with n = 50
. ȳ = 318 mg/day & s = 119.5 mg/day
. 95% confidence interval = [284.3, 351.9] mg/day ⟹ wide
[Figure: IBBENS prior, IBBENS-2 likelihood and IBBENS-2 posterior]
and
σ̄² = σ² / (n0 + n)
[Figure: (a) and (b): prior, likelihood and posterior densities of μ]
Non-informative prior: σ0² → ∞
[Figure: vague prior with posterior coinciding with the likelihood]
y ~ Poisson(θ):
p(y | θ) = θ^y e^{−θ} / y!
Poisson likelihood:
L(θ | y) = ∏_{i=1}^n p(yi | θ) = [∏_{i=1}^n (θ^{yi}/yi!)] e^{−nθ}
. Annual examinations from 1996 to 2001
. 4468 children (7% of children born in 1989)
. Caries experience measured by dmft-index (min = 0, max = 20)
[Figure: observed distribution (proportions) of the dmft-index]
MLE of θ: θ̂ = ȳ = 2.42
p(θ) = [β0^{α0} / Γ(α0)] θ^{α0−1} e^{−β0 θ}
[Figure: Gamma(3,1) prior density]
Posterior:
p(θ | y) ∝ [∏_{i=1}^n (θ^{yi}/yi!) e^{−θ}] θ^{α0−1} e^{−β0 θ}
∝ θ^{(Σ yi + α0) − 1} e^{−(n + β0) θ}
⟹ recognize kernel of a Gamma(Σ yi + α0, n + β0) distribution
⟹ p(θ | y) = [β̄^{ᾱ}/Γ(ᾱ)] θ^{ᾱ−1} e^{−β̄ θ}
with ᾱ = Σ yi + α0 = 9758 + 3 = 9761 and β̄ = n + β0 = 4351 + 1 = 4352
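A short R sketch of this gamma-Poisson updating with the STM numbers above:

# Gamma-Poisson conjugate updating (caries example)
alpha0 <- 3; beta0 <- 1      # Gamma(3, 1) prior
sum_y <- 9758; n <- 4351     # total dmft count and number of children
alpha_post <- sum_y + alpha0                     # 9761
beta_post  <- n + beta0                          # 4352
alpha_post / beta_post                           # posterior mean
qgamma(c(0.025, 0.975), alpha_post, beta_post)   # 95% equal-tail CI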
For STM study: posterior more peaked than prior & likelihood, but not in general
Bayesian approach satisfies 1st likelihood principle in that inference does not
depend on never observed results
Frequentist approach:
. θ fixed and data are stochastic
. Many tests are based on asymptotic arguments
. Maximization is the key tool
. Does depend on stopping rules
Bayesian approach:
. Condition on observed data (data fixed), uncertainty about θ (θ stochastic)
. No asymptotic arguments are needed, all inference depends on the posterior
. Integration is the key tool
. Does not depend on stopping rules
Subjectivity versus objectivity
. Thomas Bayes was probably born in 1701 and died on 7-4-1761
. He was a Presbyterian minister, studied logic and theology at Edinburgh University, and had strong mathematical interests
. Nothing on mathematics was published during his life
. Bayes' theorem was submitted posthumously by his friend Richard Price in 1763 and was entitled "An Essay towards solving a Problem in the Doctrine of Chances"
de Finetti: exchangeability
Spiegelhalter: (Win)BUGS
"The theory that would not die. How Bayes' rule cracked the enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy" (McGrayne, 2011)
In this chapter:
Direct exploration of the posterior: P (a < < b|y) for different a and b
Posterior mode
Posterior mean
Posterior median
Posterior mode θ̂M: maximizes the posterior: θ̂M = arg maxθ p(θ | y)
Properties:
Posterior mean θ̄:
θ̄ = ∫ θ p(θ | y) dθ
Properties:
. θ̄ minimizes ∫ (θ̂ − θ)² p(θ | y) dθ over all estimators θ̂
. Posterior mean involves integrating twice
. ψ = h(θ) with h a monotone transformation: ψ̄ ≠ h(θ̄)
Posterior median θM:
0.5 = ∫_{θM}^∞ p(θ | y) dθ
Properties:
. θM minimizes ∫ a |θ̂ − θ| p(θ | y) dθ with a > 0 over all estimators θ̂
Posterior variance σ̄²:
σ̄² = ∫ (θ − θ̄)² p(θ | y) dθ
Posterior SD: σ̄
Posterior mean: integrate θ · [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1} over [0, 1]
⟹ θ̄ = B(ᾱ + 1, β̄)/B(ᾱ, β̄) = ᾱ/(ᾱ + β̄) = 19/152 = 0.125
Posterior median: solve 0.5 = [1/B(ᾱ, β̄)] ∫_0^{θM} θ^{ᾱ−1} (1 − θ)^{β̄−1} dθ for θM
⟹ θM = 0.122 (R-function qbeta)
Posterior variance: calculate also ∫_0^1 θ² [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1} dθ
⟹ σ̄² = ᾱ β̄ / [(ᾱ + β̄)² (ᾱ + β̄ + 1)] = 0.0267²
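The same summary measures in R (a short sketch):

# Posterior summary measures of the Beta(19, 133) posterior
a <- 19; b <- 133
a / (a + b)                          # posterior mean  = 0.125
qbeta(0.5, a, b)                     # posterior median = 0.122
(a - 1) / (a + b - 2)                # posterior mode   = 0.12
a * b / ((a + b)^2 * (a + b + 1))    # posterior variance = 0.0267^2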
Posterior mode = mean = median: μ̂M = μ̄ = μM = 327.2 mg/dl
P(a ≤ θ ≤ b | y) = 1 − α
With F the posterior cdf: 1 − α = F(b) − F(a)
Given data set y: 95% credible interval contains the 95% most plausible
parameter values a posteriori
Given data set y: 95% confidence interval either contains or does not contain the
true value. Adjective 95% gets its meaning only in the long run
Posterior = N(μ̄, σ̄²)
[Figure: Beta(19,133) posterior with (a) 95% equal-tail interval (0.025 in each tail) and 95% HPD interval, and (b) the 95% HPD interval after log transformation]
Naive approach:
. Estimate θ̂
. Predictive distribution of ỹ: p(ỹ | θ̂)
Three cases:
. All mass (AUC ≈ 1) of p(θ | y) at θ̂M ⟹ distribution of ỹ: p(ỹ | θ̂M)
. All mass at θ¹, . . . , θᴷ ⟹ distribution of ỹ: Σ_{k=1}^K p(ỹ | θᵏ) p(θᵏ | y)
. General case ⟹ posterior predictive distribution (PPD) of ỹ:
p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ
yi = 100/√alpi (i = 1, . . . , 250) ⟹ approximately normal distribution
Naive approach:
Replace μ, σ by ȳ = 7.11, s = 1.4 ⟹ 95% reference interval (alp) = [104.45, 508.95]
Naive approach:
. MLE of θ (incidence of SICH) = 8/100 = 0.08 for the (fictive) ECASS 2 study
. Predictive distribution: Bin(50, 0.08)
. 95% predictive set: {0, 1, . . . , 7} covers 94% of the future counts
⟹ observed result of 10 SICH patients out of 50 is extreme
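The PPD shown below can also be approximated by simulation rather than analytically; a sketch contrasting the naive Bin(50, 0.08) approach with draws from the beta-binomial PPD BB(50, 9, 93):

# Sampling the PPD of the number of future SICH patients out of 50
set.seed(1)
K <- 1e5
theta  <- rbeta(K, 9, 93)        # uncertainty about theta (ECASS 2 posterior)
ytilde <- rbinom(K, 50, theta)   # PPD draws, approximately BB(50, 9, 93)
mean(ytilde >= 10)               # P(10 or more future SICH patients)
sum(dbinom(0:7, 50, 0.08))       # naive approach: P(0-7 events) = 0.94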
[Figure: Bin(50, 0.08) distribution of the # future rt-PA patients with SICH]
[Figure: PPD BB(50, 9, 93) versus Bin(50, 0.08) for the # future rt-PA patients with SICH]
[Figure: observed distribution of the dmft-index versus the PPD]
Independence:
p(y1, y2, . . . , yn | θ) = ∏_{i=1}^n p(yi | θ)
⟹ independence is defined conditional on θ
Exchangeable:
θ is never known, but given a prior distribution p(θ):
p(y1, y2, . . . , yn) = ∫ p(y1, y2, . . . , yn | θ) p(θ) dθ = ∫ ∏_{i=1}^n p(yi | θ) p(θ) dθ
Exchangeable ⇏ independent
Partial/conditional exchangeability
Up to now: the prior was chosen such that the posterior & posterior summary measures are obtained analytically
Logarithm of OR:
ψ = log[ (θ1/(1 − θ1)) / (θ0/(1 − θ0)) ]
OR = e^ψ
MLE of ψ:
ψ̂ = log[ r1 (n0 − r0) / (r0 (n1 − r1)) ]
ψ̂ is approximately N(ψ, σ̂ψ²), with
var(ψ̂) ≈ σ̂ψ² = 1/r0 + 1/(n0 − r0) + 1/r1 + 1/(n1 − r1)
This large-sample result for ψ̂ makes the Bayesian analysis easier
Chapter 2 result:
. Prior N(ψ0, σ0²) for ψ + normal likelihood of ψ̂
⟹ normal posterior for ψ: N(ψ̄, σ̄²) with
ψ̄ = σ̄² (ψ0/σ0² + ψ̂/σ̂ψ²)
σ̄² = (1/σ̂ψ² + 1/σ0²)⁻¹
[Figure: prior and posterior density of the odds ratio, with 95% CI]
Expert opinion:
. Best guess (median value) for OR: e^{ψ0} = 5
. 95% prior credible interval for OR = [1, 25]
⟹ experts put 95% belief in N(1.6, 0.82²) for ψ
[Figure: expert prior and resulting posterior density of the odds ratio]
Normal posterior for a large sample size is justified even when the likelihood is
combined with a non-normal prior
. θ = mean dmft-index
. Likelihood: dmft-index of ten children, Σ_{i=1}^{10} yi = 26
. Prior: Gamma(3, 1)
. Posterior: Gamma(29, 11) (solid)
. θ̂ = ȳ = 2.6 and σ̂² = ȳ/n = 0.26
[Figure: Gamma(29, 11) posterior density]
Numerical integration
Gaussian quadrature
Non-adaptive
Adaptive (M = 1 = Laplace approximation)
Posterior distribution:
p(θ | y) ∝ θ^{Σ_{i=1}^n yi − 1} e^{−nθ} exp[−(log(θ) − μ0)²/(2σ0²)], (θ > 0)
Mid-point approach:
[Figure: mid-point approximation of the posterior density (AUC = 0.13)]
Posterior for = probability of SICH with rt-PA = Beta(19, 133) (Example II.1)
[Figure: (a) and (b): Beta(19, 133) posterior density and histogram of sampled values]
ICDF method:
. Sample u from U(0, 1) ⟹ x = F⁻¹(u) has distribution function F(x)
[Figure: cdf F with u mapped to x = F⁻¹(u)]
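In R the ICDF method is a one-liner for any distribution with a known quantile function, e.g. the Beta(19, 133) posterior (sketch):

# ICDF method: sample from Beta(19, 133) via the inverse cdf
set.seed(1)
u <- runif(5000)          # u ~ U(0, 1)
x <- qbeta(u, 19, 133)    # x = F^{-1}(u) is a draw from Beta(19, 133)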
. q = envelope distribution
. A = envelope constant
[Figure: envelope function A q (A = 1.8) covering the N(0, 1) density]
Stage 2:
Properties of the AR algorithm:
. Produces a sample from the posterior
. Only needs the unnormalized posterior p(y | θ) p(θ)
. Probability of acceptance = 1/A
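A minimal accept-reject sketch for the Beta(19, 133) posterior with a U(0, 1) envelope; taking the envelope constant A as the posterior density at its mode is an assumption made here for illustration:

# Accept-reject sampling from Beta(19, 133) with q = U(0, 1)
set.seed(1)
A <- dbeta(18/150, 19, 133)   # envelope constant: density at the mode 18/150
prop <- runif(20000)                              # proposals from q
keep <- runif(20000) <= dbeta(prop, 19, 133) / A  # accept with prob p/(A q)
draws <- prop[keep]
mean(keep)                    # empirical acceptance rate, approximately 1/A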
The logs of the envelope and squeezing densities are piecewise linear functions with knots at the sampled grid points
[Figure: tangent and derivative-free constructions of the envelope and squeezing functions of the log posterior]
Interest in E[t(θ) | y]:
E[t(θ) | y] = ∫ t(θ) p(θ | y) dθ = ∫ [t(θ) p(θ | y)/q(θ)] q(θ) dθ = Eq[t(θ) p(θ | y)/q(θ)]
Estimate E[t(θ) | y] by
(1/K) Σ_{k=1}^K t(θᵏ) p(θᵏ | y)/q(θᵏ) = (1/K) Σ_{k=1}^K t(θᵏ) w(θᵏ)
Two-stage sampling:
. Stage 1: draw θ̃¹, . . . , θ̃ᴶ from q(θ) and compute weights
wj = [p(θ̃ʲ | y)/q(θ̃ʲ)] / Σ_{i=1}^J [p(θ̃ⁱ | y)/q(θ̃ⁱ)] (j = 1, . . . , J)
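A sketch of both stages in R (stage 2, the weighted resampling, gives the SIR algorithm); the Beta(19, 133) posterior is again used as target, with a normal proposal q chosen here for illustration:

# Two-stage (SIR) sampling: Beta(19, 133) target, N(0.125, 0.027^2) proposal
set.seed(1)
J <- 10000
prop <- rnorm(J, 0.125, 0.027)    # stage 1: draws from q
w <- dbeta(prop, 19, 133) / dnorm(prop, 0.125, 0.027)
w <- w / sum(w)                   # normalized weights w_j
draws <- sample(prop, 1000, replace = TRUE, prob = w)  # stage 2: resample
# in practice q should dominate the tails of the target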
[Figure: accept-reject algorithm applied to the posterior]
p1(θ | y) = L1(θ | y) p1(θ) / p1(y) & p2(θ | y) = L2(θ | y) p2(θ) / p2(y)
⟹ p2(θ | y) ∝ [L2(θ | y) p2(θ) / (L1(θ | y) p1(θ))] p1(θ | y) = v(θ) p1(θ | y)
Here:
p2(θ) ∝ exp[−(log(θ) − μ0)²/(2σ0²)]
p2(θ | y) = ?
Stage 2:
. Determine weights v(θ̃ⁱ) based on p2(θ) (likelihood stays the same)
. Take a weighted random sample from this sample of size K = 1,000
⟹ same histogram
Testing:
. Frequentist: 2-sided binomial test (P = 0.043)
. Bayesian: U(0,1) prior + binomial data (21 successes out of 30) ⟹ Beta(22, 10) posterior
⟹ pB = P(θ ≤ 0.5 | y) = 0.023, contrasting the data with θ = 0.5
Bayes factor =
factor that transforms prior odds for H0 into posterior odds after observed the data
H0: θ = 0.5 versus Ha: θ = 0.8 (only 0.5 and 0.8 are possible for θ)
θ is continuous ⟹ needed:
p(y | H0) = weighted average of p(y | θ), weights from p(θ | H0) ≡ U(0, 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0.5, 1)
Needed:
p(y | H0) = p(y | θ = 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0, 1)
Classical likelihood ratio test: Λ = (0.5²¹ · 0.5⁹) / [(21/30)²¹ (9/30)⁹] = 0.0847 (P = 0.026)
Lindley's paradox:
Reason/explanation: averaging over a large number of unrealistic θ values under Ha
. Estimation: priors often do not have a great impact on the posterior conclusion
. Testing: priors MAY have a great impact on the posterior conclusion
In this chapter:
Examples
. (multivariate) Gaussian distribution
. Multinomial distribution data
Let
. y = sample of n independent observations
. θ = (θ1, θ2, . . . , θd)ᵀ
. Likelihood: L(θ | y)
. Multivariate prior: p(θ)
. Multivariate posterior: p(θ | y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
Posterior mode: θ̂M = arg maxθ p(θ | y)
Posterior mean: θ̄ = ∫ θ p(θ | y) dθ
HPD region of content (1 − α)
Let
. θ = (θ1, θ2)
. Marginal posterior: p(θ1 | y) = ∫ p(θ1, θ2 | y) dθ2
Often θ1 = one-dimensional:
. Easy to graphically display the marginal posterior
. Posterior summary measures based on p(θ1 | y) convenient in practice
. Marginal posterior mean of θ1 = joint posterior mean
Alternatively: p(θ1 | y) = ∫ p(θ1 | θ2, y) p(θ2 | y) dθ2
Three priors:
. No prior knowledge is available
. Previous study is available
. Expert knowledge is available
Posterior distribution:
p(μ, σ² | y) ∝ (1/σ^{n+2}) exp{−[(n − 1)s² + n(ȳ − μ)²]/(2σ²)}
[Figure: contour plot of the joint posterior p(μ, σ² | y)]
. p(μ | y): tn−1(ȳ, s²/n)-distribution
. p(σ² | y): scaled inverse chi-squared distribution: (n − 1)s²/σ² ~ χ²(n − 1)
[Figure: contour plot of the joint posterior with the marginal posteriors]
For μ:
. Posterior variance = (n − 1) s² / (n(n − 3))
For σ²:
. Posterior mean = (n − 1) s² / (n − 3)
. Posterior mode = (n − 1) s² / (n + 1)
. Posterior median = (n − 1) s² / χ²(0.5, n − 1)
. Posterior variance = 2(n − 1)² s⁴ / [(n − 3)²(n − 5)]
μ and σ² unknown:
p(ỹ | y) = ∫∫ p(ỹ | μ, σ²) p(μ, σ² | y) dμ dσ²
= tn−1[ȳ, s²(1 + 1/n)]-distribution
[Figure: marginal posterior densities of μ and σ²]
For μ:
. μ̄ = μ̂M = μM = 7.11
. σ̄μ² = 0.0075
. 95% (equal tail and HPD) CI = [6.94, 7.28]
For σ²:
. mean = 1.88, mode = 1.85, median = 1.87
. variance = 0.029
. 95% equal tail CI = [1.58, 2.24], 95% HPD interval = [1.56, 2.22]
Prior = N-Inv-χ²(μ0, κ0, ν0, τ0²)-distribution
. μ0 = ȳ0 & κ0 = n0
. ν0 = n0 − 1 & τ0² = s0²
Posterior:
p(μ | σ², y) = N(μ | μ̄, σ²/κ̄)
p(μ | y) = tν̄(μ | μ̄, τ̄²/κ̄)
p(σ² | y) = Inv-χ²(σ² | ν̄, τ̄²)
PPD:
p(ỹ | y) = tν̄[μ̄, τ̄²(1 + 1/(n0 + n))]
[Figure: marginal posterior densities of μ and σ² under the conjugate prior]
. For σ²:
p(σ² | y) ∝ ∫ N(μ | μ0, σ0²) Inv-χ²(σ² | ν0, τ0²) ∏_{i=1}^n N(yi | μ, σ²) dμ
Two distributions:
. Multivariate t-distribution:
p(y | μ, Σ, ν) = {Γ[(ν + p)/2] / [Γ(ν/2) (νπ)^{p/2}]} |Σ|^{−1/2} [1 + (1/ν)(y − μ)ᵀ Σ⁻¹ (y − μ)]^{−(ν+p)/2}
. Multinomial distribution
Properties:
. Dirichlet prior density: [1/B(α)] ∏_{i,j} θij^{αij − 1}
Note:
Dirichlet distribution = extension of beta distribution to higher dimensions
Marginal distributions of a Dirichlet distribution = beta distribution
ψ = θ11 θ22 / (θ12 θ21)
[Figure: marginal posterior densities of θ11, θ12, θ21 and the cross ratio ψ]
Classical frequentist tests (Fisher's exact test, chi-squared test, etc.) can be reproduced by Bayesian tests
A dependent prior p(θ1, θ2) is more natural than the product of p(θ1) and p(θ2)
Stagewise approach
Sampling approach:
. Sample θ̃d from p(θd | y)
. Sample θ̃(d−1) from p(θ(d−1) | θ̃d, y)
. . . .
. Sample θ̃1 from p(θ1 | θ̃2, . . . , θ̃d, y)
Three cases:
. No prior knowledge
. Historical data available
. Expert knowledge available
Sample μ̃ᵏ from a N(ȳ, σ̃²ᵏ/n)-distribution
⟹ μ̃¹, . . . , μ̃ᴷ = random sample from p(μ | y) (the tn−1(ȳ, s²/n)-distribution)
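A sketch of this Method of Composition in R for (μ, σ²) under the NI prior; the summary statistics of the alp example are assumed here (ȳ = 7.11, s² taken as 1.87):

# Method of Composition for the normal posterior (NI prior)
set.seed(1)
K <- 5000; n <- 250
ybar <- 7.11; s2 <- 1.87                    # assumed summary statistics
sigma2 <- (n - 1) * s2 / rchisq(K, n - 1)   # stage 1: scaled-inv-chi2(n-1, s2)
mu <- rnorm(K, ybar, sqrt(sigma2 / n))      # stage 2: mu | sigma2 ~ N(ybar, sigma2/n)
# mu is then a sample from the t_{n-1}(ybar, s2/n) marginal posterior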
To sample from the posterior predictive distribution p(ỹ | y), 2 approaches:
1. Sample directly from the tn−1[ȳ, s²(1 + 1/n)]-distribution
2. Method of Composition: for each sampled (μ̃ᵏ, σ̃²ᵏ), sample ỹᵏ from N(μ̃ᵏ, σ̃²ᵏ)
[Figure: Method of Composition samples of (μ, σ²): marginal histograms and scatter plots]
For a given (μ̃ᵏ, σ̃²ᵏ), sampling ỹᵏ is straightforward
Priors for y = 100/√alp:
. μ ~ N(5.25, 2.75/65) & σ² ~ Inv-χ²(64, 2.75)
Method of Composition:
. Stage I: sample σ² from
p(σ² | y) ∝ ∫ N(μ | μ0, σ0²) Inv-χ²(σ² | ν0, τ0²) ∏_{i=1}^n N(yi | μ, σ²) dμ
p(σ² | y) evaluated on a grid ⟹ mean and variance
Approximating distribution q(σ²) ≡ Inv-χ²(σ² | 294.2, 2.12)
⟹ weighted resampling
. Stage II: sample μ from a normal distribution
[Figure: (a) and (b): posteriors of σ² and μ under the expert prior versus the conjugate prior]
Based on sample
. 95% normal range for y: [4.05, 9.67]
. 95% normal range for alp: [106.84, 609.70]
Likelihood:
L(β, σ² | y, X) = [1/(2πσ²)^{n/2}] exp[−(y − Xβ)ᵀ(y − Xβ)/(2σ²)]
. MLE = LSE of β: β̂ = (XᵀX)⁻¹Xᵀy
. Residual sum of squares: S = (y − Xβ̂)ᵀ(y − Xβ̂)
. Mean residual sum of squares: s² = S/(n − d − 1)
[Figure: scatter plot of TBBMC (kg) versus BMI (kg/m²)]
Posterior distributions:
p(β, σ² | y) = N(d+1)[β | β̂, σ²(XᵀX)⁻¹] Inv-χ²(σ² | n − d − 1, s²)
p(β | σ², y) = N(d+1)[β | β̂, σ²(XᵀX)⁻¹]
p(σ² | y) = Inv-χ²(σ² | n − d − 1, s²)
p(β | y) = tn−d−1[β | β̂, s²(XᵀX)⁻¹]
100(1 − α)% HPD region:
C(α) = {β : (β − β̂)ᵀ(XᵀX)(β − β̂) ≤ (d + 1) s² Fα(d + 1, n − d − 1)}
PPD of ỹ with covariate vector x̃:
t-distribution with (n − d − 1) degrees of freedom, with
. location parameter: β̂ᵀx̃
. scale parameter: s²[1 + x̃ᵀ(XᵀX)⁻¹x̃]
Given y:
(ỹ − β̂ᵀx̃) / (s √(1 + x̃ᵀ(XᵀX)⁻¹x̃)) ~ tn−d−1
How to sample?
. Directly from t-distribution
. Method of Composition
Method of Composition:
Sample future observation ỹ from N(μ̃30, σ̃30²):
. μ̃30 = β̃ᵀ(1, 30)ᵀ
. σ̃30² = σ̃²[1 + (1, 30)(XᵀX)⁻¹(1, 30)ᵀ]
[Figure: sampled marginal posteriors of β0 and β1, with scatter plots of the samples]
Generalized Linear Model (GLIM): extension of the linear regression model to a wide
class of regression models
. Distributional part
. Link function
. Variance function
Distributional part:
p(y | θ; φ) = exp{[yθ − b(θ)]/a(φ) + c(y; φ)}, with a(·), b(·), c(·) known functions
Often a(φ) = φ/w, with w a prior weight. For φ known and w = 1:
. E(y) = μ = db(θ)/dθ
. Var(y) = a(φ) V(μ) with V(μ) = d²b(θ)/dθ²
Examples of a GLIM (made concrete in the R sketch below):
. Normal linear regression model: normal distribution yi ~ N(μi, σ²), identity link (g(μi) = μi), V(μi) = 1 and φ = σ² assumed known
. Poisson regression model: Poisson distribution yi ~ Poisson(μi), log link (g(μi) = log(μi)), φ = 1 and V(μi) = μi
. Logistic regression model: Bernoulli (or binomial) distribution yi ~ Bern(πi), logit link (g(πi) = logit(πi)), φ = 1 and V(πi) = πi(1 − πi)
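To make the link and variance choices concrete, the three examples correspond to the following (frequentist) glm() calls in R; the data vectors y and x are hypothetical:

# The three GLIMs as R model fits (illustration only; y and x assumed to exist)
fit_normal   <- glm(y ~ x, family = gaussian(link = "identity"))
fit_poisson  <- glm(y ~ x, family = poisson(link = "log"))
fit_logistic <- glm(y ~ x, family = binomial(link = "logit"))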
. Solving the posterior distribution analytically is often not feasible due to the
difficulty in determining the integration constant
. Computing the integral using numerical integration methods is a practical
alternative if only a few parameters are involved
New computational approach is needed
Method of Composition:
Gibbs sampling:
Under mild conditions: sample from the posterior distribution = target distribution
From k0 on: summary measures calculated from the chain consistently estimate
the true posterior measures
Example IV.5: sampling from posterior distribution of the normal likelihood based
on 250 alp measurements of healthy patients with NI prior for both parameters
[Figure: (a) and (b): Gibbs sampling paths in the (μ, σ²)-plane]
[Figure: (a) and (b): marginal posterior densities of μ and σ² from the Gibbs output]
Posterior:
p(μ, σ² | y) ∝ exp[−(μ − μ0)²/(2σ0²)] (σ²)^{−(ν0/2+1)} exp[−ν0τ0²/(2σ²)] (1/σⁿ) ∏_{i=1}^n exp[−(yi − μ)²/(2σ²)]
∝ exp[−(μ − μ0)²/(2σ0²)] (σ²)^{−((n+ν0)/2+1)} exp{−[Σ_{i=1}^n (yi − μ)² + ν0τ0²]/(2σ²)}
[Figure: (a) and (b): trace plots of μ and σ²]
1. Sample θ1^(k+1) from p(θ1 | θ2^(k), . . . , θ(d−1)^(k), θd^(k), y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), θ3^(k), . . . , θd^(k), y)
. . .
d. Sample θd^(k+1) from p(θd | θ1^(k+1), . . . , θ(d−1)^(k+1), y)
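A concrete sketch of these steps: a two-parameter Gibbs sampler for the normal likelihood, assuming the NI prior p(μ, σ²) ∝ σ⁻² so that both full conditionals are standard distributions:

# Gibbs sampler for (mu, sigma2) of a normal likelihood, NI prior (sketch)
# Full conditionals: mu | sigma2, y ~ N(ybar, sigma2/n)
#                    sigma2 | mu, y ~ scaled-inv-chi2(n, mean((y - mu)^2))
gibbs_normal <- function(y, K = 5000) {
  n <- length(y); ybar <- mean(y)
  mu <- numeric(K); sigma2 <- numeric(K)
  s2 <- var(y)                                    # starting value
  for (k in 1:K) {
    mu[k] <- rnorm(1, ybar, sqrt(s2 / n))         # step 1: sample mu
    s2 <- n * mean((y - mu[k])^2) / rchisq(1, n)  # step 2: sample sigma2
    sigma2[k] <- s2
  }
  data.frame(mu = mu, sigma2 = sigma2)
}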
. British coal-mining disasters data set: # severe accidents in British coal mines
from 1851 to 1962
. Decrease in frequency of disasters from year 40 (+ 1850) onwards?
[Figure: # disasters per year (1850 + year)]
Statistical model:
yi ~ Poisson(θ) (i = 1, . . . , k) & yi ~ Poisson(λ) (i = k + 1, . . . , n)
Priors:
. θ: Gamma(a1, b1) (a1, b1 parameters)
. λ: Gamma(a2, b2) (a2, b2 parameters)
. k: p(k) = 1/n
Full conditionals:
p(θ | y, λ, b1, b2, k) = Gamma(a1 + Σ_{i=1}^k yi, k + b1)
p(λ | y, θ, b1, b2, k) = Gamma(a2 + Σ_{i=k+1}^n yi, n − k + b2)
p(b1 | y, θ, λ, b2, k) = Gamma(a1 + c1, θ + d1)
p(b2 | y, θ, λ, b1, k) = Gamma(a2 + c2, λ + d2)
p(k | y, θ, λ, b1, b2) = π(y | k, θ, λ) / Σ_{j=1}^n π(y | j, θ, λ)
with π(y | k, θ, λ) = (θ/λ)^{Σ_{i=1}^k yi} exp[k(λ − θ)]
a1 = a2 = 0.5, c1 = c2 = 0, d1 = d2 = 1
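A sketch of the corresponding Gibbs sampler in R; y is assumed to be the vector of yearly disaster counts, and cp_gibbs is a hypothetical helper name:

# Gibbs sampler for the Poisson change-point model (sketch)
cp_gibbs <- function(y, K = 5000, a1 = 0.5, a2 = 0.5,
                     c1 = 0, c2 = 0, d1 = 1, d2 = 1) {
  n <- length(y); S <- cumsum(y)      # S[k] = y_1 + ... + y_k
  theta <- mean(y); lambda <- mean(y); b1 <- 1; b2 <- 1; k <- n %/% 2
  out <- matrix(NA, K, 3, dimnames = list(NULL, c("theta", "lambda", "k")))
  for (m in 1:K) {
    theta  <- rgamma(1, a1 + S[k], k + b1)
    lambda <- rgamma(1, a2 + S[n] - S[k], n - k + b2)
    b1 <- rgamma(1, a1 + c1, theta + d1)
    b2 <- rgamma(1, a2 + c2, lambda + d2)
    # p(k | ...) proportional to (theta/lambda)^S[k] * exp(k (lambda - theta))
    logw <- S * log(theta / lambda) + (1:n) * (lambda - theta)
    k <- sample(1:n, 1, prob = exp(logw - max(logw)))
    out[m, ] <- c(theta, lambda, k)
  }
  out
}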
[Figure: marginal posteriors of θ, λ and k]
Posterior mode of k: 1891
Posterior mean for θ/λ: 3.42 with 95% CI = [2.48, 4.59]
with
r1 = (1/n) Σi (yi − β1 xi)
r0 = Σi (yi − β0) xi / xᵀx
[Figure: trace plots of β0 and β1]
Autocorrelation:
. Autocorrelation of lag 1: correlation of θ1^(k) with θ1^(k−1) (k = 1, . . .)
. Autocorrelation of lag 2: correlation of θ1^(k) with θ1^(k−2) (k = 1, . . .)
. . . .
. Autocorrelation of lag m: correlation of θ1^(k) with θ1^(k−m) (k = 1, . . .)
High autocorrelation:
Transition kernel
Full conditionals determine the joint distribution: see Besag (1974) and
Hammersley and Clifford (1971)
Proofs in Robert and Casella (2004) for bivariate case (Theorem 9.3) and for
general case (Theorem 10.5)
p(θ1, θ2) = p(θ2 | θ1) / ∫ [p(θ2 | θ1) / p(θ1 | θ2)] dθ2
That the joint distribution exists is, however, not guaranteed by the full conditionals alone
Bivariate case (Casella and George, 1992): compute p(θ1, θ2) from p(θ1 | θ2) & p(θ2 | θ1)
1. p1(θ1) = ∫ p(θ1 | θ2) p2(θ2) dθ2 and similar for p2(θ2)
2. p1(θ1) = ∫ p(θ1 | θ2) [∫ p(θ2 | θ1′) p1(θ1′) dθ1′] dθ2
= ∫ [∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2] p1(θ1′) dθ1′
= ∫ K1(θ1′, θ1) p1(θ1′) dθ1′
with K1(θ1′, θ1) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2
Gibbs sampler:
K(θ, θ′) = p(θ1′ | θ2, . . . , θd) p(θ2′ | θ1′, θ3, . . . , θd) · · · p(θd′ | θ1′, . . . , θ(d−1)′)
K1(θ1′, θ1) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2:
transition kernel expressing that a move from θ1′ to θ1 can be made via all possible values of θ2
Sampling the full conditionals is done via different algorithms, depending on:
. Shape of the full conditional (classical versus general-purpose algorithm)
. Preference of the software developer:
SAS® procedures GENMOD, LIFEREG and PHREG: ARMS algorithm
WinBUGS: variety of samplers
y = auxiliary variable
Three cases:
. [a, b] finite interval: sample from [a, b] × [0, m = max f(x)] and reject if outside region A
. General unimodal case
. General multimodal case
y | x ~ U(0, f(x))
x | y ~ U(miny, maxy)
with miny (maxy) the minimal (maximal) x-value of the solutions of y = f(x)
[Figure: slice sampler: density f(x) with a horizontal slice S(y)]
Became popular only after the introduction of Gelfand & Smith's paper (1990)
Sketch of algorithm:
Metropolis algorithm:
Chain is at θᵏ ⟹ the Metropolis algorithm samples the value θ^(k+1) as follows (see the sketch below):
Function α(θᵏ, θ̃) = probability of a move
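A random-walk Metropolis sketch in R for a one-dimensional posterior known up to a constant; the Beta(19, 133) posterior of Example II.1 is used as target, and the proposal sd of 0.02 is an arbitrary tuning choice:

# Random-walk Metropolis for a univariate target (log scale)
log_post <- function(theta) dbeta(theta, 19, 133, log = TRUE)
set.seed(1)
K <- 5000; chain <- numeric(K); chain[1] <- 0.1
for (k in 1:(K - 1)) {
  prop <- chain[k] + rnorm(1, 0, 0.02)          # symmetric proposal around theta^k
  logr <- log_post(prop) - log_post(chain[k])   # log posterior ratio
  # alpha(theta^k, theta~) = min(1, r): probability of a move
  chain[k + 1] <- if (log(runif(1)) < logr) prop else chain[k]
}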
[Figure: (a) and (b): Metropolis sampling paths in the (μ, σ²)-plane]
Marginal distributions:
[Figure: (a) and (b): marginal posterior densities of μ and σ²]
Trace plots:
[Figure: (a) and (b): trace plots of μ and σ²]
[Figure: (a) and (b): sampled posteriors of μ and σ²]
Metropolis-Hastings algorithm:
Chain is at θᵏ ⟹ the Metropolis-Hastings algorithm samples the value θ^(k+1) as follows:
Example of an asymmetric proposal density: q(θ̃ | θᵏ) = q(θ̃) (Independent MH algorithm)
[Figure: (a) and (b): asymmetric proposal densities on the t-scale]
Only possible jumps are to parameter vectors θ̃ that match θᵏ on all components other than the jth. Then the ratio r in the jth substep:
r = [p(θ̃ | y) qjG(θᵏ | θ̃)] / [p(θᵏ | y) qjG(θ̃ | θᵏ)]
= [p(θ̃ | y) p(θjᵏ | θ̃₋j, y)] / [p(θᵏ | y) p(θ̃j | θᵏ₋j, y)]
= [p(θ̃ | y)/p(θ̃j | θᵏ₋j, y)] / [p(θᵏ | y)/p(θjᵏ | θ̃₋j, y)] = 1
⟹ each jump is accepted
Jump in 2 components:
. First component (move): K(θ, θ̃) = α(θ, θ̃) q(θ̃ | θ)
⟹ move to θ′ = θ̃ suggested by the proposal density q(θ̃ | θ) and accepted with probability α(θ, θ̃)
. Second component (stay): r(θ) = 1 − ∫_{ℝᵈ} α(θ, θ̃) q(θ̃ | θ) dθ̃
⟹ probability that no move is made, i.e. θ′ = θ
Reversibility condition:
Probability to move from set A to set B = probability to move from set B to set A (any sets A and B in the parameter space):
∫_A p(θ, B) dθ = ∫_B p(θ, A) dθ
AR algorithm:
MH algorithm:
Proposal density: q(θ̃ | θ) = q(θ̃ − θ), e.g. q(θ̃ − θ) ∝ q(|θ̃ − θ|)
⟹ proposal density is symmetric and gives the Metropolis algorithm
. Multivariate normal density: WinBUGS & SAS® procedures
. Multivariate t-distribution: SAS® PROC MCMC for long-tailed posteriors
Proposal density does not depend on the position in the chain, e.g.
q(θ̃ | θ) = Nd(θ̃ | μ, Σ)
For the Independent MH algorithm a high acceptance rate is desirable; it is achieved when the proposal density q(θ) is close to the posterior density
If p(θ | y) ≤ A q(θ) for all θ, then the Markov chain generated by the Independent MH algorithm enjoys excellent convergence properties (Theorem 7.8) and the expected acceptance probability exceeds that of the AR algorithm (Lemma 7.9)
WinBUGS: regression coefficients in one block (blocking option switched on) and variance parameters in another block
Theorem (Markov chain Law of Large Numbers): for an ergodic Markov chain with a finite expected value for t(θ), the sample mean t̄K converges to the true posterior mean
The MH algorithm creates a reversible Markov chain, i.e. a Markov chain that
satisfies the detailed balance condition.
Proof (discrete case):
πj = p(θ = xj)
Q = (qij): matrix that describes the move from xi to xj with probability qij
Probability that a move from xi is made to xj (≠ xi): αij = min(1, πj qji / (πi qij))
⟹ π is the stationary distribution
⟹ the MH algorithm creates a Markov chain where the target distribution is also the stationary distribution
+ extra verifications show that the LLN and CLT for ergodic chains can be applied
. Research questions:
Do girls have a different risk for developing caries experience (CE) than boys (gender) in the first year of primary school?
Is there an east-west gradient (x-coordinate) in CE?
. Bayesian model: logistic regression + N(0, 1002) priors for regression coefficients
. No standard full conditionals
. Three algorithms:
Self-written R program: evaluate full conditionals on a grid + ICDF method
WinBUGS program: multivariate MH algorithm (blocking mode on)
SAS® procedure MCMC: random-walk MH algorithm
Posterior means/medians of the three samplers are close (to the MLE)
Precision with which the posterior mean was determined (high precision = low
MCSE) differs considerably
Applications:
Mixtures with an unknown # of components and hidden Markov models
Change-point problems with an unknown # of change-points/locations
Model and variable selection problems
Analysis of quantitative trait locus (QTL) data
Theory is complex:
Idea: create 1-to-1 function between the spaces of different dimensions
Construct an MH algorithm that satisfies the detailed-balance condition
Poisson likelihood: L(θ | y) = ∏_{i=1}^n (θ^{yi}/yi!) exp(−nθ)
Negative binomial model: L(θ, α | y) = ∏_{i=1}^n [Γ(1/α + yi)/(Γ(1/α) yi!)] [1/(1 + αθ)]^{1/α} [αθ/(1 + αθ)]^{yi}
Two models:
Four settings
. Preference was measured by % of times the Markov chain was in model 2
(negative binomial): between 56% and 74%
. A suggested move from model 1 to model 2 was always accepted
. Percentage of trans-dimensional moves: between 32% and 55%
[Figure: trace plots of the parameters across the reversible-jump iterations]
[Figure: posterior densities of the parameters under the two models]