Concepts in Bayesian Inference

Christel Faes
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Hasselt University
christel.faes@uhasselt.be

Emmanuel Lesaffre
Department of Biostatistics, Erasmus Medical Center
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Catholic University Leuven & University Hasselt
Contents

I Basic concepts in Bayesian methods 1

1 Modes of statistical inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


1.1 The frequentist approach: a critical reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 The classical statistical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.2 The P -value as a measure of evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.1.3 The confidence interval as a measure of evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.1.4 An historical note on the frequentist paradigms* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.2 Statistical inference based on the likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.2.1 The likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.2.2 The likelihood principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Concepts in Bayesian Inference i


1.3 The Bayesian approach: some basic ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

1.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

1.3.2 Bayes theorem Discrete version for simple events - 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

1.4 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

2 Bayes theorem: computing the posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

2.2 Bayes theorem The binary version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.3 Probability in a Bayesian context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.4 Bayes theorem The categorical version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.5 Bayes theorem The continuous version - 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

2.6 The binomial case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

2.7 The Gaussian case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

2.8 The Poisson case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

2.9 The prior and posterior of h(θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

2.10 Bayesian versus likelihood approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

Concepts in Bayesian Inference ii


2.11 Bayesian versus frequentist approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

2.12 The different modes of the Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

2.13 An historical note on the Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

3 Introduction to Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

3.2 Summarizing the posterior with probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

3.3 Posterior summary measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

3.3.1 Measures of location & variability - Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

3.3.2 Posterior interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

3.4 Predictive distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

3.4.1 Frequentist approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

3.4.2 Bayesian approach: posterior predictive distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

3.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

3.5 Exchangeability - 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

3.6 A normal approximation to the posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Concepts in Bayesian Inference iii


3.6.1 A normal approximation to the likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

3.6.2 Asymptotic properties of the posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

3.7 Numerical techniques to determine the posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

3.7.1 Numerical integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

3.7.2 Sampling from the posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

3.7.3 Choice of posterior summary measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

3.8 Bayesian hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

3.8.1 Inference based on credible intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

3.8.2 The Bayes factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

3.8.3 Bayesian versus frequentist hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

4 More than one parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

4.2 Joint versus marginal posterior inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

4.3 The normal distribution with μ and σ² unknown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

4.3.1 No prior knowledge on μ and σ² is available . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

Concepts in Bayesian Inference iv


4.3.2 An historical study is available . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

4.3.3 Expert knowledge is available . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

4.4 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

4.4.1 The multivariate normal and related distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

4.4.2 The multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

4.5 Frequentist properties of Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

4.6 The Method of Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

4.7 Bayesian linear regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

4.7.1 The frequentist approach to linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

4.7.2 A noninformative Bayesian linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

4.7.3 Posterior summary measures for the linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

4.7.4 Sampling from the posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

4.8 Bayesian generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

4.8.1 More complex regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

5 Markov chain Monte Carlo sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

Concepts in Bayesian Inference v


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

5.2 The Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

5.2.1 The bivariate Gibbs sampler 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

5.2.2 The general Gibbs sampler 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

5.2.3 The general Gibbs sampler 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

5.2.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

5.2.5 Review of Gibbs sampling approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

5.2.6 The Slice sampler* 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

5.3 The Metropolis(-Hastings) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

5.3.1 The Metropolis algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

5.3.2 The Metropolis-Hastings algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

5.3.3 Remarks* 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

5.3.4 Review of Metropolis(-Hastings) approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

5.4 Justification of the MCMC approaches* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

5.4.1 Properties of the MH algorithm* 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

Concepts in Bayesian Inference vi


5.5 Choice of the sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

5.6 The Reversible Jump MCMC algorithm* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338

Concepts in Bayesian Inference vii


Part I

Basic concepts in Bayesian methods

Concepts in Bayesian Inference 1


Chapter 1
Modes of statistical inference

Central to statistics = statistical inference

Three approaches:
. Frequentist approach
. (Likelihood approach)
. Bayesian approach

Concepts in Bayesian Inference 2


1.1 The frequentist approach: a critical reflection

FIRST: Review of classical approach on statistical inference

Concepts in Bayesian Inference 3


1.1.1 The classical statistical approach

Classical approach:

Mix of two approaches (Fisher & Neyman and Pearson)

Here: based on P -value, significance level, power and confidence interval

Example: RCT

Concepts in Bayesian Inference 4


Example I.1: Toenail RCT

Randomized, double blind, parallel group, multi-center study (Debacker et al.,


1996)
Two treatments (A: Lamisil and B: Itraconazol) on 2 × 189 patients
12 weeks of treatment and 48 weeks of follow up (FU)
Significance level α = 0.05
Sample size chosen to ensure that β ≤ 0.20
Primary endpoint = negative mycology (negative microscopy & negative culture)
Here: unaffected nail length at week 48 on the big toenail
163 patients treated with A and 171 treated with B

Concepts in Bayesian Inference 5


Example I.1 (continued)

Population means: A: μ1 & B: μ2

H0: Δ = μ1 − μ2 = 0

Completion of study:
Δ̂ = 1.38 with t_obs = 2.19 in the 0.05 rejection region

Neyman-Pearson: reject that A and B are equally effective

Fisher: 2-sided P = 0.030 ⇒ strong evidence against H0

Wrong statement: "The result is significant at a 2-sided α of 0.030"

Concepts in Bayesian Inference 6


1.1.2 The P -value as a measure of evidence

Use and misuse of P -value:

The P -value is not the probability that H0 is (not) true

The P -value depends on fictive data (Example I.2)

The P -value depends on the sample space (Examples I.3 and I.4)

The P -value is not an absolute measure

The P -value does not take all evidence into account (Example I.5)

Concepts in Bayesian Inference 7


The P-value is not the probability that H0 is (not) true

Often the P-value is interpreted in a wrong manner

P-value = probability that the observed or a more extreme result occurs

P-value = surprise index

P-value ≠ p(H0 | y)

p(H0 | y) = Bayesian probability

Concepts in Bayesian Inference 8


The P -value depends on fictive data

P -value = probability that observed or a more extreme result occurs

P -value is based not only on the observed result


but also on fictive (never observed) data

Probability has a long-run frequency definition

Example I.2

Concepts in Bayesian Inference 9


Example I.2: Graphical representation of P -value

P-value of the RCT (Example I.1)

[Figure: density of the test statistic under H0 with the observed t-value marked; the two shaded tail areas of 0.015 each (unobserved y-values more extreme than the observed t) make up the 2-sided P-value of 0.03.]

Concepts in Bayesian Inference 10


The P -value depends on the sample space

P -value = probability that observed or a more extreme result occurs

Calculation of P -value depends on all possible samples (under H0)

The possible samples are similar in some characteristics to the observed sample
(e.g. same sample size)

Examples I.3 and I.4

Concepts in Bayesian Inference 11


Example I.3: Accounting for interim analyses in a RCT

2 identical RCTs except for the number of analyses:

RCT 1: 4 interim analyses + final analysis


. Correction for multiple testing
. Group sequential trial: Pocock's rule
. Global α = 0.05, nominal significance level = 0.016

RCT 2: 1 final analysis

. Global α = 0.05, nominal significance level = 0.05

If P = 0.02 then for RCT 1: NO significance, for RCT 2: Significance

Concepts in Bayesian Inference 12


Example I.4: Kaldor et al.'s case-control study

Case-control study (Kaldor et al., 1990) to examine the impact of chemotherapy
on leukaemia in Hodgkin's survivors

149 cases (leukaemia) and 411 controls

Question: Does chemotherapy induce excess risk of developing solid tumors,


leukaemia and/or lymphomas?

Treatment Controls Cases


No Chemo 160 11
Chemo 251 138
Total 411 149

Concepts in Bayesian Inference 13


Example I.4 (continued)

Pearson χ²(1)-test: P = 7.8959 × 10⁻¹³

Fisher's exact test: P = 1.487 × 10⁻¹⁴

Odds ratio = 7.9971 with a 95% confidence interval = [4.19, 15.25]

Reason for the difference: the 2 sample spaces are different

Concepts in Bayesian Inference 14


The P -value is not an absolute measure

Small P -value does not necessarily indicate large difference between treatments,
strong association, etc.

Interpretation of a small P -value in a small/large study

When does H0 occur in practice?

Concepts in Bayesian Inference 15


The P -value does not take all evidence into account

Studies are analyzed in isolation, with no reference to historical data

Why not incorporate past information into the current study?

Concepts in Bayesian Inference 16


Example I.5: Merseyside registry results

Subsequent registry study in UK

Preliminary results of the Merseyside registry: P = 0.67

Conclusion: no excess effect of chemotherapy (?)

Treatment Controls Cases


No Chemo 3 0
Chemo 3 2
Total 6 2

Concepts in Bayesian Inference 17


1.1.3 The confidence interval as a measure of evidence

95% confidence interval: expression of uncertainty on parameter of interest

Technical definition: in 95 out of 100 studies true parameter is enclosed

In each study confidence interval includes/does not include true value

Practical interpretation has a Bayesian nature

Concepts in Bayesian Inference 18


Example I.6: 95% confidence interval toenail RCT

95% confidence interval for Δ = [0.14, 2.62]

Interpretation: most likely (with 0.95 probability) Δ lies between 0.14 and 2.62

= a Bayesian interpretation

Concepts in Bayesian Inference 19


1.1.4 An historical note on the frequentist paradigms*

1. Fisher's approach

Views of Fisher presented in:

Statistical Methods for Research Workers (1925)

The Design of Experiments (1935)

Concepts in Bayesian Inference 20


1. Fisher's approach

Inductive approach

Introduction of:
. Null-hypothesis (H0)
. Significance test
. P -value = evidence against H0
. Significance level
. NO alternative hypothesis
. NO power

Concepts in Bayesian Inference 21


2. Neyman & Pearson's approach

Views of Neyman & Pearson presented in:

Papers in 1928 & 1933

Deductive approach

Introduction of:
. Alternative hypothesis (HA)
. Type I error
. Type II error & power
. Hypothesis test

Concepts in Bayesian Inference 22


3. Fisher's versus Neyman & Pearson's approach

Fisher and Neyman & Pearson in a never-ending debate

In practice two approaches are mixed

Other measures (Bayes factor) have been proposed as a measure for or against a
hypothesis

Bayesian inference and the P-value are close for a one-sided hypothesis test

Don't shoot at the play, but at the pianist

Concepts in Bayesian Inference 23


1.2 Statistical inference based on the likelihood function

Inference based purely on the likelihood function has not been developed into a full-blown
statistical approach

Considered here as a precursor to the Bayesian approach

Concepts in Bayesian Inference 24


1.2.1 The likelihood function

Likelihood was introduced by Fisher in 1922

Likelihood function = plausibility of the observed data as a function of the


parameters of the stochastic model

Inference based on likelihood function is QUITE different from inference based on


P -value

Concepts in Bayesian Inference 25


Example I.7: A surgery experiment

New but rather complicated surgical technique

Surgeon operates n = 12 patients with s = 9 successes

Notation:
Result of the ith operation: success yi = 1, failure yi = 0
Total experiment: n operations with s successes
Sample {y1, . . . , yn} ≡ y
Probability of success p(yi = 1) = θ

Binomial distribution:
expresses the probability of s successes out of n operations

Concepts in Bayesian Inference 26


Binomial distribution


f(s) = C(n, s) θ^s (1 − θ)^(n−s), with s = Σ_{i=1}^{n} yi

θ fixed & f(s) viewed as a function of s:
f(s) is a discrete distribution with Σ_{s=0}^{n} f(s) = 1

s fixed & viewed as a function of θ:
binomial likelihood function L(θ | s)

Concepts in Bayesian Inference 27


Binomial distribution

[Figure: (a) binomial likelihood L(θ | s) and (b) log-likelihood ℓ(θ | s) as functions of θ on [0, 1].]

Maximum likelihood estimate (MLE) θ̂ maximizes L(θ | s)

Maximizing L(θ | s) is equivalent to maximizing log[L(θ | s)] ≡ ℓ(θ | s)

Concepts in Bayesian Inference 28


Example I.7 Determining MLE

To determine the MLE, the first derivative of the log-likelihood function is needed:

ℓ(θ | s) = c + s ln θ + (n − s) ln(1 − θ)

dℓ(θ | s)/dθ = s/θ − (n − s)/(1 − θ) = 0 ⇒ θ̂ = s/n

For s = 9 and n = 12: θ̂ = 0.75
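A minimal R sketch of this calculation with the surgery data (s = 9, n = 12); illustrative only, not part of the original course code:

# Binomial log-likelihood for the surgery experiment (Example I.7)
s <- 9; n <- 12
loglik <- function(theta) s * log(theta) + (n - s) * log(1 - theta)

# MLE obtained analytically and numerically (both give 0.75)
s / n
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum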

Concepts in Bayesian Inference 29


1.2.2 The likelihood principles

Two likelihood principles (LP):

LP 1: All evidence, which is obtained from an experiment, about an unknown
quantity θ, is contained in the likelihood function of θ for the given data
. Standardized likelihood
. Interval of evidence

LP 2: Two likelihood functions for θ contain the same information about θ if they
are proportional to each other.

Concepts in Bayesian Inference 30


Likelihood principle 1

LP 1: All evidence, which is obtained from an experiment, about an unknown quantity
θ, is contained in the likelihood function of θ for the given data

Concepts in Bayesian Inference 31


Example I.7 (continued)

Maximal evidence for θ = 0.75

Likelihood ratio L(0.5 | s)/L(0.75 | s) = relative evidence for the 2 hypotheses
θ = 0.5 & θ = 0.75 (= 0.21 ??)

Standardized likelihood: LS(θ | s) ≡ L(θ | s)/L(θ̂ | s)

LS(0.5 | s) = 0.21 ⇒ test for hypothesis H0 without involving fictive data

Interval of (≥ 1/2 maximal) evidence

Concepts in Bayesian Inference 32


Inference on the likelihood function

[Figure: binomial likelihood for θ with the MLE, the 95% confidence interval and the interval of ≥ 1/2 maximal evidence indicated.]


Concepts in Bayesian Inference 33


Likelihood principle 2

LP 2: Two likelihood functions for θ contain the same information about θ
if they are proportional to each other

LP 2 = relative likelihood principle

When the likelihoods are proportional under two experimental conditions, the
information about θ must be the same!

Concepts in Bayesian Inference 34


Example I.8: Another surgery experiment

. Surgeon 1: (Example I.7) operates 12 patients, observes 9 successes (and 3 failures)

. Surgeon 2: operates patients until k = 3 failures are observed

. Surgeon 1: s = Σ_{i=1}^{n} yi has a binomial distribution
  binomial likelihood L1(θ | s) = C(n, s) θ^s (1 − θ)^(n−s)

. Surgeon 2: s = Σ_{i=1}^{n} yi has a negative binomial (Pascal) distribution
  negative binomial likelihood L2(θ | s) = C(s + k − 1, s) θ^s (1 − θ)^k

Concepts in Bayesian Inference 35


Example I.8 Likelihood inference

[Figure: binomial and negative binomial likelihoods for θ with the MLE and the interval of ≥ 1/2 maximal evidence; the two likelihoods are proportional.]

LP 2: the 2 experiments give us the same information about θ

Concepts in Bayesian Inference 36


Example I.8 Frequentist inference

H0: θ = 0.5 & HA: θ > 0.5

Surgeon 1: calculation of P-value = 0.0730

  p[s ≥ 9 | θ = 0.5] = Σ_{s=9}^{12} C(12, s) 0.5^s (1 − 0.5)^(12−s)

Surgeon 2: calculation of P-value = 0.0327

  p[s ≥ 9 | θ = 0.5] = Σ_{s=9}^{∞} C(2 + s, s) 0.5^s (1 − 0.5)^3

Frequentist conclusion ≠ likelihood conclusion
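Both P-values can be reproduced in R with pbinom() and pnbinom(); a small sketch (our own, for illustration):

# Surgeon 1: P(s >= 9) under Bin(12, 0.5)
pbinom(8, size = 12, prob = 0.5, lower.tail = FALSE)   # 0.0730

# Surgeon 2: s = number of surgical successes seen before the 3rd failure.
# In R's negative binomial, 'size' counts the stopping events (here the failures,
# each with probability 1 - 0.5 = 0.5) and the distribution is over the other
# outcomes (here the successes).
pnbinom(8, size = 3, prob = 0.5, lower.tail = FALSE)   # 0.0327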

Concepts in Bayesian Inference 37


Example I.8 - Conclusion

Design aspects (stopping rule) are important in frequentist context

Concepts in Bayesian Inference 38


1.3 The Bayesian approach: some basic ideas

Bayesian methodology = topic of the course

Statistical inference through different type of glasses

Concepts in Bayesian Inference 39


1.3.1 Introduction

Examples I.7 and I.8: combination of information from a similar historical surgical
technique could be used in the evaluation of current technique = Bayesian exercise

Planning phase III study:


. Comparison of a new versus the old treatment for breast cancer
. Background information is incorporated when writing the protocol
. Background information is not incorporated in the statistical analysis
. Suppose small-scaled study with unexpectedly positive result (P < 0.01)
Reaction???

Concepts in Bayesian Inference 40


Introduction-2

New mouthwash:
. Daily use of the new mouthwash before tooth brushing reduces plaque?
. Results: new mouthwash reduced 25% of plaque with a 95% CI = [10%, 40%]
. Previous trials: overall reduction in plaque in-between 5% and 15%
. Experts: plaque reduction will probably not exceed 30%
. What to conclude then?

Concepts in Bayesian Inference 41


Introduction-3

Central idea of Bayesian approach:

Combine likelihood (data) with Your prior


knowledge (prior probability) to update information
on the parameter to result in a revised probability
associated with the parameter (posterior probability)

Concepts in Bayesian Inference 42


Example I.9: Examples of Bayesian reasoning in daily life

Tourist example: prior view on Belgians + visit to Belgium (data) ⇒ posterior
view on Belgians

Marketing example: launch of a new energy drink on the market

Medical example: patients treated for CVA with a thrombolytic agent suffer from
SBAs. Historical studies (20% - prior), pilot study (10% - data) ⇒ posterior

Concepts in Bayesian Inference 43


1.3.2 Bayes theorem Discrete version for simple events - 1

A (positive diagnostic test) & B (diseased)

A^C (negative diagnostic test) & B^C (not diseased)

p(A, B) = p(A) p(B | A) = p(B) p(A | B)

Bayes theorem = Theorem on Inverse Probability

p(B | A) = p(A | B) p(B) / p(A)

Concepts in Bayesian Inference 44


Bayes theorem Discrete version for simple events - 2

Bayes Theorem - version II:

p(B | A) = p(A | B) p(B) / [p(A | B) p(B) + p(A | B^C) p(B^C)]

Concepts in Bayesian Inference 45


Example I.10: Sensitivity, specificity, prevalence and predictive values

B = diseased, A = positive diagnostic test

Characteristics of the diagnostic test:

. Sensitivity (Se) = p(A | B)
. Specificity (Sp) = p(A^C | B^C)
. Positive predictive value (pred+) = p(B | A)
. Negative predictive value (pred−) = p(B^C | A^C)
. Prevalence (prev) = p(B)

pred+ calculated from Se, Sp and prev using Bayes Theorem

Concepts in Bayesian Inference 46


Example I.10 (continued)

Folin-Wu blood test: screening test for diabetes (Boston City Hospital)

Test Diabetic Non-diabetic Total


+ 56 49 105
- 14 461 475
Total 70 510 580

Se = 56/70 = 0.80
Sp = 461/510 = 0.90
prev = 70/580 = 0.12

Concepts in Bayesian Inference 47


Example I.10 (continued)

Bayes Theorem:

p(D+ | T+) = p(T+ | D+) p(D+) / [p(T+ | D+) p(D+) + p(T+ | D−) p(D−)]

In terms of Se, Sp and prev:

pred+ = Se × prev / [Se × prev + (1 − Sp) × (1 − prev)]
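A small R sketch of this formula (the helper name pred.plus is ours; Se = 0.80 and Sp = 0.90 are the Folin-Wu values on the next slides):

# Positive predictive value from sensitivity, specificity and prevalence
pred.plus <- function(se, sp, prev) {
  se * prev / (se * prev + (1 - sp) * (1 - prev))
}

pred.plus(se = 0.80, sp = 0.90, prev = 0.03)   # about 0.20
pred.plus(se = 0.80, sp = 0.90, prev = 0.10)   # about 0.47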

Concepts in Bayesian Inference 48


Example I.10 (continued)

For p(B) = 0.03 ⇒ pred+ = 0.20 & pred− = 0.99

For p(B) = 0.30 ⇒ pred+ = ?? & pred− = ??

Individual prediction: combine prior knowledge (prevalence of diabetes in the
population) with the result of the Folin-Wu blood test on the patient to arrive at a
revised opinion for the patient

Folin-Wu blood test: prior (prevalence) = 0.10 & positive test ⇒ posterior = 0.47

Concepts in Bayesian Inference 49


Example I.11: The Bayesian interpretation of published research findings

Ioannidis (2005) explains why many medical research findings appear to be false

. Let α = P(Type I error) & β = P(Type II error)

. Purpose = find true relationships among G possible risk indicators and disease,
  & there is only 1 true relationship
. Prior probability that a research finding is positive = 1/G
  R = 1/(G − 1) = prior odds
. Average number of truly positive associations in c relationships examined in an
  independent manner:
  c P(+ | truly +) prior(truly +) = c (1 − β) R/(R + 1)   (sensitivity × prevalence)
. Average number of false positive findings:
  c P(+ | truly −) prior(truly −) = c α/(R + 1)   ((1 − specificity) × (1 − prevalence))

Concepts in Bayesian Inference 50


Example I.11 (continued)

pred+ = (1 − β) R / [(1 − β) R + α]

If (1 − β) R > α:
posterior probability of finding a true relationship > 0.5
The power (1 − β) must be > 0.05/R (for α = 0.05) to find with high likelihood a
truly positive result, which is impossible for large G
Other (interesting!) conclusions, see Ioannidis (2005)
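The formula is easy to explore numerically; a small R sketch (the values of G are our own illustrative choices, not from Ioannidis):

# Posterior probability that a claimed positive finding is true (Ioannidis, 2005)
ppv <- function(alpha, beta, G) {
  R <- 1 / (G - 1)                      # prior odds of a true relationship
  (1 - beta) * R / ((1 - beta) * R + alpha)
}

ppv(alpha = 0.05, beta = 0.20, G = 11)     # 10 candidate predictors: PPV ~ 0.62
ppv(alpha = 0.05, beta = 0.20, G = 1001)   # 1000 candidates: PPV drops to ~ 0.016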

Concepts in Bayesian Inference 51


1.4 Outlook

Bayes theorem will be further developed in the next chapter such that it
becomes useful in statistical practice

Reanalyze examples such as those seen in this chapter

Valid question: What can a Bayesian analysis do more than a classical


frequentist analysis?

Six additional chapters are needed to develop useful Bayesian tools

But it is worth the effort!

Concepts in Bayesian Inference 52


Example I.12: Toenail RCT - A Bayesian analysis

Re-analysis of the toenail data using WinBUGS (the most popular Bayesian software)

Figure 1.4 (a): AUC on the positive x-axis represents our posterior belief that Δ is
positive (= 0.98)

Figure 1.4 (b): AUC for the interval [1, ∞[ = our belief that μ1/μ2 > 1

Figure 1.4 (c): incorporation of a skeptical prior on Δ (a priori centered around
−0.5, with some uncertainty) ⇒ posterior belief that Δ is positive = 0.54

Figure 1.4 (d): incorporation of information on the variance parameters (σ2²/σ1²
varies around 2)

Concepts in Bayesian Inference 53


Example I.12 (continued)

[Figure 1.4: posterior densities from 5000 WinBUGS samples: (a) delta, (b) rat = ratio of means, (c) delta under the skeptical prior, (d) ratsig = ratio of variances.]

Concepts in Bayesian Inference 54


Why Bayesian?

Philosophical differences aside, there are practical reasons that drive the recent
popularity of Bayesian analysis:
Simplicity in thinking about problems and answering questions
Flexibility in making inference on a wide range of models (data augmentation,
hierarchical models)
Incorporation of prior information
Development of efficient inference and sampling tools
Fast computers

Concepts in Bayesian Inference 55


Historical notes on Thomas Bayes

Thomas Bayes was probably born
in 1701 and died on 7/4/1761.
He was a minister and a reverend with
mathematical interests.
None of his work on mathematics was
published during his life.
One of 2 posthumous works, published in 1764:
"An Essay towards solving a Problem in the Doctrine of Chances"
is the basis for Bayesian inference.
Laplace (1749-1827) reinvented Bayes' theorem, because he did not know Bayes'
work.

Concepts in Bayesian Inference 56


Historical notes on the Bayesian computational procedures

Before 80s: Conjugate analysis


Early 80s: Numerical integration &
Monte Carlo integration
Mid 80s: Laplace approximations
Late 80s: Gelfand & Smith use Gibbs sampler for simple Normal models
1989: The development of BUGS started, followed by the development of
WINBUGS

Concepts in Bayesian Inference 57


Has the Bayesian methodology been accepted?

Statistical world:
. Bayesian statistics has not been (widely) accepted for a long time.
. Frequentist world versus Bayesian (Likelihood) world
. From 1990: change in attitude of statistical community.

Each viewpoint has its merits.


Use Bayesian methods when appropriate.

Medical world:
. The medical world is more conservative. Bayesian methods are better accepted
in exploratory/epidemiological studies than in clinical trials. In clinical trials,
there seems to be a role for Bayesian methods except for in phase III studies.

Concepts in Bayesian Inference 58


Bayesian software

Statistical packages like S+, R, or GAUSS can be used.


For elementary (educational) calculations, one could use First Bayes, developed by
Tony O'Hagan (Sheffield). It can be downloaded and installed from the Website
http://tonyohagan.co.uk/1b/
More advanced Bayesian analyses can be done using WINBUGS, freely downloadable
from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
To find other Bayesian software for simple or complex problems a good starting
point is to consult the Webpage of the ISBA Society: http://www.bayesian.org/
SAS using the PROC GENMOD, LIFEREG, PHREG and MCMC procedures

Concepts in Bayesian Inference 59


True story ...

A Bayesian and a frequentist were sentenced to death. When the judge asked what
their final wishes were, the Bayesian replied that he wished to teach the frequentist the
ultimate lesson. The judge granted his request and then repeated the question to the
frequentist. He replied that he wished to get the lesson again and again and again . . .

Concepts in Bayesian Inference 60


Chapter 2
Bayes theorem: computing the posterior distribution

General expression of Bayes theorem is derived

Concepts in Bayesian Inference 61


2.1 Introduction

In this chapter:

Bayes theorem for binary, categorical and continuous case

Derivation of posterior distribution: for binomial, normal and Poisson

A variety of examples

Concepts in Bayesian Inference 62


2.2 Bayes theorem The binary version

D+ ⇒ θ = 1 and D− ⇒ θ = 0
T+ ⇒ y = 1 and T− ⇒ y = 0

p(θ = 1 | y = 1) = p(y = 1 | θ = 1) p(θ = 1) / [p(y = 1 | θ = 1) p(θ = 1) + p(y = 1 | θ = 0) p(θ = 0)]

p(θ = 1), p(θ = 0): prior probabilities
p(y = 1 | θ = 1): likelihood
p(θ = 1 | y = 1): posterior probability

Now the parameter θ also has a probability

Concepts in Bayesian Inference 63


Bayes theorem

Shorthand notation

p(θ | y) = p(y | θ) p(θ) / p(y)

where θ can stand for θ = 0 or θ = 1.

Concepts in Bayesian Inference 64


2.3 Probability in a Bayesian context

Bayesian probability = expression of Our/Your uncertainty of the parameter value

Coin tossing: truth is there, but unknown to us

Diabetes: from population to individual patient

Probability can have two meanings: limiting proportion (objective) or personal belief
(subjective)

Concepts in Bayesian Inference 65


Other examples of Bayesian probabilities

Subjective probability varies with individual, in time, etc.

Tour de France

Global warming

...

Concepts in Bayesian Inference 66


Subjective probability rules

Let A1, A2, . . . , AK be mutually exclusive events with total event S

Subjective probability p should be coherent:

∀ Ak: p(Ak) ≥ 0 (k = 1, . . ., K)

p(S) = 1

p(A^C) = 1 − p(A)

With B1, B2, . . . , BL another set of mutually exclusive events:

p(Ai | Bj) = p(Ai, Bj) / p(Bj)

Concepts in Bayesian Inference 67


2.4 Bayes theorem The categorical version

Subject can belong to K > 2 classes: θ1, θ2, . . . , θK

y takes L different values: y1, . . . , yL, or is continuous

Bayes theorem for a categorical parameter:

p(θk | y) = p(y | θk) p(θk) / Σ_{k=1}^{K} p(y | θk) p(θk)

Concepts in Bayesian Inference 68


2.5 Bayes theorem The continuous version - 1

1-dimensional continuous parameter θ

i.i.d. sample y = {y1, . . . , yn}
Joint distribution of the sample: p(y | θ) = Π_{i=1}^{n} p(yi | θ) = likelihood L(θ | y)
Prior density function p(θ)
Split up: p(y, θ) = p(y | θ) p(θ) = p(θ | y) p(y)
Bayes theorem for continuous parameters:

p(θ | y) = L(θ | y) p(θ) / p(y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ

Concepts in Bayesian Inference 69


Shorter: p(θ | y) ∝ L(θ | y) p(θ)

∫ L(θ | y) p(θ) dθ = averaged likelihood

Averaged likelihood ⇒ determining the posterior distribution involves integration

Important: also in the Bayesian context there is a true parameter θ0
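When the integral has no closed form it can be approximated on a grid; a minimal R sketch (stroke data y = 10 out of n = 50 with a Beta(9, 93) prior, chosen purely as an illustration):

# Grid approximation of p(theta | y) = L(theta | y) p(theta) / integral
theta <- seq(0.001, 0.999, length.out = 1000)
h     <- theta[2] - theta[1]                 # grid spacing
lik   <- dbinom(10, size = 50, prob = theta) # L(theta | y)
prior <- dbeta(theta, 9, 93)                 # p(theta)

post <- lik * prior / sum(lik * prior * h)   # numerically normalized posterior
sum(post * h)                                # integrates to ~ 1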

Concepts in Bayesian Inference 70


2.6 The binomial case

Example II.1: Stroke study Monitoring safety

. Rt-PA: thrombolytic for ischemic stroke


. Historical studies ECASS 1 and ECASS 2: complication SICH
. ECASS 3 study: patients with ischemic stroke (Tx between 3 & 4.5 hours)
. DSMB: monitor SICH in ECASS 3
. Fictive situation:
First interim analysis ECASS 3: 50 rt-PA patients with 10 SICHs
Historical data ECASS 2: 100 rt-PA patients with 8 SICHs
. Estimate risk for SICH in ECASS 3 ⇒ construct stopping rule

Concepts in Bayesian Inference 71


Comparison of 3 approaches

Frequentist

Likelihood

Bayesian - different prior distributions

Exemplify mechanics of calculating the posterior distribution using Bayes theorem

Concepts in Bayesian Inference 72


Notation

SICH incidence θ:

i.i.d. Bernoulli random variables y1, . . . , yn

SICH: yi = 1, otherwise yi = 0

y = Σ_{i=1}^{n} yi has a Bin(n, θ) distribution: p(y | θ) = C(n, y) θ^y (1 − θ)^(n−y)

Concepts in Bayesian Inference 73


Frequentist approach

MLE θ̂ = y/n = 10/50 = 0.20

Test hypothesis θ = 0.09 (value of the ECASS 2 study)

Classical 95% confidence interval = [0.089, 0.31]

Concepts in Bayesian Inference 74


Likelihood inference

MLE θ̂ = 0.20

No hypothesis test is performed

Confidence interval ⇒ 0.95 interval of evidence = [0.09, 0.36]

Concepts in Bayesian Inference 75


Bayesian approach: prior obtained from ECASS 2 study

1. Specifying the (ECASS 2) prior distribution

2. Constructing the posterior distribution

3. Characteristics of the posterior distribution

4. Equivalence of prior information and extra data

Concepts in Bayesian Inference 76


1. Specifying the (ECASS 2) prior distribution - 1

ECASS 2 likelihood: L(θ | y0) = C(n0, y0) θ^y0 (1 − θ)^(n0−y0)   (y0 = 8 & n0 = 100)

ECASS 2 likelihood expresses the prior belief on θ, but is not (yet) a prior distribution

As a function of θ: L(θ | y0) ≠ density (AUC ≠ 1)

How to standardize? Numerically or analytically?

[Figure (a): ECASS 2 likelihood as a function of the proportion SICH.]

Concepts in Bayesian Inference 77


1. Specifying the (ECASS 2) prior distribution - 2

Kernel of the binomial likelihood θ^y0 (1 − θ)^(n0−y0) ⇒ beta density Beta(α0, β0):

p(θ) = [1/B(α0, β0)] θ^(α0−1) (1 − θ)^(β0−1)

with B(α, β) = Γ(α)Γ(β)/Γ(α + β), Γ(·) the gamma function

α0 = 9 (= y0 + 1)
β0 = 93 (= n0 − y0 + 1)

[Figure (b): Beta(9, 93) prior density, proportional to the ECASS 2 likelihood, as a function of the proportion SICH.]

Concepts in Bayesian Inference 78


2. Constructing the posterior distribution - 1

Bayes theorem needs:

. Prior p(θ) (ECASS 2 study)
. Likelihood L(θ | y) (ECASS 3 interim analysis), y = 10 & n = 50
. Averaged likelihood ∫ L(θ | y) p(θ) dθ

Numerator of Bayes theorem:

L(θ | y) p(θ) = C(n, y) [1/B(α0, β0)] θ^(α0+y−1) (1 − θ)^(β0+n−y−1)

Averaged likelihood:

p(y) = C(n, y) B(α0 + y, β0 + n − y) / B(α0, β0)

Concepts in Bayesian Inference 79


2. Constructing the posterior distribution - 2

Posterior distribution Beta(ᾱ, β̄):

p(θ | y) = [1/B(ᾱ, β̄)] θ^(ᾱ−1) (1 − θ)^(β̄−1)

with
ᾱ = α0 + y
β̄ = β0 + n − y
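A short R sketch of this conjugate update with the ECASS numbers (illustrative, not course code):

# ECASS 2 prior: Beta(alpha0, beta0) with alpha0 = 9, beta0 = 93
alpha0 <- 9; beta0 <- 93
# ECASS 3 interim data: y = 10 SICHs out of n = 50 patients
y <- 10; n <- 50

alpha.bar <- alpha0 + y          # 19
beta.bar  <- beta0 + n - y       # 133

# Posterior density can be evaluated and plotted with dbeta
curve(dbeta(x, alpha.bar, beta.bar), from = 0, to = 0.4,
      xlab = "proportion SICH", ylab = "posterior density")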

Concepts in Bayesian Inference 80


2. Prior, likelihood & posterior

[Figure: ECASS 2 prior, (scaled) ECASS 3 interim likelihood and posterior density of θ.]


Concepts in Bayesian Inference 81


3. Characteristics of the posterior distribution

Posterior = compromise between prior & likelihood

Posterior mode: θ̄M = [n0/(n0 + n)] θ̂0 + [n/(n0 + n)] θ̂, with n0 = α0 + β0 − 2,
θ̂0 the prior mode and θ̂ the MLE (analogous result for the mean)

Shrinkage: θ̂0 ≤ θ̄M ≤ θ̂ when y0/n0 ≤ y/n

Here: posterior more peaked than prior & likelihood (not in general)

Likelihood dominates the prior for large sample sizes

Posterior = beta distribution = prior (conjugacy)

Posterior estimate = MLE of combined ECASS 2 data & interim data ECASS 3

Concepts in Bayesian Inference 82


4. Equivalence of prior information and extra data

Beta(α0, β0) prior
⇔ binomial experiment with (α0 − 1) successes in (α0 + β0 − 2) experiments

Prior
⇔ adding extra data to the observed data set: (α0 − 1) successes and (β0 − 1) failures

Concepts in Bayesian Inference 83


Bayesian approach: using a subjective prior

Suppose DSMB neurologists believe that SICH incidence is probably more than
5% but most likely not more than 20%

If prior belief = ECASS 2 prior density ⇒ posterior inference is the same

The neurologists could also combine their qualitative prior belief with the ECASS 2
data to construct a prior distribution ⇒ adjust the ECASS 2 prior

Concepts in Bayesian Inference 84


Example subjective prior

[Figure: ECASS 2 prior and posterior versus subjective prior and subjective posterior for θ.]


Concepts in Bayesian Inference 85


Bayesian approach: no prior information is available

Suppose no prior information is available

Need: a prior distribution that expresses ignorance
= noninformative (NI) prior

For the stroke study: NI prior
p(θ) = I[0,1](θ) = flat prior on [0,1]

Uniform prior on [0,1] = Beta(1,1)

[Figure: flat prior, with the posterior proportional to the likelihood.]

Concepts in Bayesian Inference 86


2.7 The Gaussian case

Example II.2: Dietary study Monitoring dietary behavior in Belgium

IBBENS study: dietary survey in Belgium

Of interest: intake of cholesterol

Monitoring dietary behavior in Belgium: IBBENS-2 study

Concepts in Bayesian Inference 87


Bayesian approach: prior obtained from the IBBENS study

1. Specifying the (IBBENS) prior distribution

2. Constructing the posterior distribution

3. Characteristics of the posterior distribution

4. Equivalence of prior information and extra data

Concepts in Bayesian Inference 88


1. Specifying the (IBBENS) prior distribution

Histogram of the dietary cholesterol of 563 bank employees ≈ normal

y ~ N(μ, σ²) when
f(y) = [1/(σ √(2π))] exp[−(y − μ)²/(2σ²)]

Sample y1, . . . , yn ⇒ likelihood

L(μ | y) ∝ exp[−(1/(2σ²)) Σ_{i=1}^{n} (yi − μ)²] ∝ exp[−(1/2) ((ȳ − μ)/(σ/√n))²]

Concepts in Bayesian Inference 89


Histogram and likelihood IBBENS study

[Figure: (a) histogram of dietary cholesterol intake (mg/day) in the IBBENS study; (b) likelihood of μ with MLE = 328.]

Concepts in Bayesian Inference 90


Denote the sample of n0 IBBENS observations: y0 ≡ {y0,1, . . . , y0,n0} with mean ȳ0

Likelihood ∝ N(μ0, σ0²) with

μ0 ≡ ȳ0 = 328
σ0 = σ/√n0 = 120.3/√563 = 5.072

IBBENS prior distribution:

p(μ) = [1/(σ0 √(2π))] exp[−(1/2) ((μ − μ0)/σ0)²]

with μ0 ≡ ȳ0

Concepts in Bayesian Inference 91


2. Constructing the posterior distribution

IBBENS-2 study:
sample y with n = 50
ȳ = 318 mg/day & s = 119.5 mg/day
95% confidence interval = [284.3, 351.9] mg/day ⇒ wide

Combine the IBBENS prior distribution with the IBBENS-2 normal likelihood:

IBBENS-2 likelihood: L(μ | y)
IBBENS prior density: N(μ0, σ0²)

Posterior distribution ∝ p(μ) L(μ | y):

p(μ | y) ∝ exp{ −(1/2) [ ((μ − μ0)/σ0)² + ((μ − ȳ)/(σ/√n))² ] }
Concepts in Bayesian Inference 92


. Integration constant to obtain a density?
. Recognize a standard distribution: exponent is a quadratic function of μ
. Posterior distribution:

p(μ | y) = N(μ̄, σ̄²)

with
μ̄ = [ (1/σ0²) μ0 + (n/σ²) ȳ ] / [ (1/σ0²) + (n/σ²) ]  and  σ̄² = 1 / [ (1/σ0²) + (n/σ²) ]

. Here: μ̄ = 327.2 and σ̄ = 4.79.
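These expressions translate directly into a few lines of R (a sketch using the rounded inputs quoted on the slides, so the posterior SD may differ slightly from the 4.79 above):

# IBBENS prior N(mu0, sigma0^2) combined with the IBBENS-2 sample (sigma known)
mu0 <- 328; sigma0 <- 5.072
n <- 50; ybar <- 318; sigma <- 120.3

w0 <- 1 / sigma0^2          # prior precision
w1 <- n / sigma^2           # sample precision

post.mean <- (w0 * mu0 + w1 * ybar) / (w0 + w1)
post.sd   <- sqrt(1 / (w0 + w1))
c(post.mean, post.sd)       # posterior mean ~ 327.2 mg/day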

Concepts in Bayesian Inference 93


IBBENS-2 posterior distribution

[Figure: IBBENS prior, IBBENS-2 likelihood and IBBENS-2 posterior density of μ (mg/day).]


Concepts in Bayesian Inference 94


3. Characteristics of the posterior distribution

. Posterior distribution: compromise between prior and likelihood

. Posterior mean: weighted average of the prior mean and the sample mean
  μ̄ = [w0/(w0 + w1)] μ0 + [w1/(w0 + w1)] ȳ
  with weights
  w0 = 1/σ0² & w1 = 1/(σ²/n)

. The posterior precision = 1/posterior variance:

  1/σ̄² = w0 + w1

  with w0 = 1/σ0² = prior precision and w1 = 1/(σ²/n) = sample precision

Concepts in Bayesian Inference 95


Posterior is always more peaked than prior and likelihood

When n → ∞ or σ0 → ∞: p(μ | y) → N(ȳ, σ²/n)

When the sample size increases, the likelihood dominates the prior

Posterior = normal = prior ⇒ conjugacy

Concepts in Bayesian Inference 96


4. Equivalence of prior information and extra data

Prior variance σ0² = σ² ⇒ σ̄² = σ²/(n + 1)

Prior information = adding one extra observation to the sample

General: σ0² = σ²/n0, with n0 general

μ̄ = [n0/(n0 + n)] μ0 + [n/(n0 + n)] ȳ

and

σ̄² = σ²/(n0 + n)

Concepts in Bayesian Inference 97


Bayesian approach: using a subjective prior

Discounted IBBENS prior: increase the IBBENS prior variance from 25 to 100

Discounted IBBENS prior + shift: increase μ0 from 328 to 340

Concepts in Bayesian Inference 98


Discounted priors
[Figure: (a) discounted IBBENS prior and (b) discounted + shifted prior, each shown with the IBBENS-2 likelihood and the resulting posterior.]

Concepts in Bayesian Inference 99


Bayesian approach: no prior information is available

Non-informative prior: σ0² → ∞

Posterior: N(ȳ, σ²/n)

Concepts in Bayesian Inference 100


Non-informative prior

[Figure: flat (non-informative) prior with the likelihood and the resulting posterior for μ.]


Concepts in Bayesian Inference 101


2.8 The Poisson case

Take y ≡ {y1, . . . , yn} independent counts ⇒ Poisson distribution

Poisson(θ):
p(y | θ) = θ^y e^(−θ) / y!

Mean and variance = θ

Poisson likelihood:
L(θ | y) ≡ Π_{i=1}^{n} p(yi | θ) = [Π_{i=1}^{n} (θ^yi / yi!)] e^(−nθ)

Concepts in Bayesian Inference 102


Example II.6: Describing caries experience in Flanders

The Signal-Tandmobiel® (STM) study:

Longitudinal oral health study in Flanders
Annual examinations from 1996 to 2001
4468 children (7% of children born in 1989)
Caries experience measured by the dmft-index (min = 0, max = 20)

[Figure: histogram (proportions) of the dmft-index, ranging from 0 to about 15.]

Concepts in Bayesian Inference 103


Frequentist and likelihood calculations

MLE of θ: θ̂ = ȳ = 2.24

Likelihood-based 95% confidence interval for θ: [2.1984, 2.2875]

Concepts in Bayesian Inference 104


Bayesian approach: prior distribution based on historical data

1. Specifying the prior distribution

2. Constructing the posterior distribution

3. Characteristics of the posterior distribution

4. Equivalence of prior information and extra data

Concepts in Bayesian Inference 105


1. Specifying the prior distribution

Information from literature:

Average dmft-index 4.1 (Liège, 1983) & 1.39 (Gent, 1994)
Oral hygiene has improved considerably in Flanders
Average dmft-index bounded above by 10

Candidate for prior: Gamma(α0, β0)

p(θ) = [β0^α0 / Γ(α0)] θ^(α0−1) e^(−β0 θ)

α0 = shape parameter & β0 = inverse of scale parameter

E(θ) = α0/β0 & var(θ) = α0/β0²

STM study: α0 = 3 & β0 = 1

Concepts in Bayesian Inference 106


Gamma prior for STM study

[Figure: Gamma(3,1) prior density for θ.]

Concepts in Bayesian Inference 107


2. Constructing the posterior distribution

Posterior:

p(θ | y) ∝ [Π_{i=1}^{n} (θ^yi / yi!)] e^(−nθ) × [β0^α0 / Γ(α0)] θ^(α0−1) e^(−β0 θ)
         ∝ θ^(Σ yi + α0 − 1) e^(−(n + β0) θ)

Recognize the kernel of a Gamma(Σ yi + α0, n + β0) distribution:

p(θ | y) = [β̄^ᾱ / Γ(ᾱ)] θ^(ᾱ−1) e^(−β̄ θ)

with ᾱ = Σ yi + α0 = 9758 + 3 = 9761 and β̄ = n + β0 = 4351 + 1 = 4352

STM study: the effect of the prior is minimal
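A short R sketch of this Poisson-gamma update with the STM totals (illustrative only):

# Gamma(alpha0, beta0) prior and the observed Poisson counts
alpha0 <- 3; beta0 <- 1
sum.y <- 9758; n <- 4351          # STM study totals

alpha.bar <- alpha0 + sum.y       # 9761
beta.bar  <- beta0 + n            # 4352

# Posterior mean and a 95% equal tail credible interval for theta
alpha.bar / beta.bar
qgamma(c(0.025, 0.975), shape = alpha.bar, rate = beta.bar)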

Concepts in Bayesian Inference 108


3. Characteristics of the posterior distribution

Posterior is a compromise between prior and likelihood

Posterior mode and mean demonstrate shrinkage

For the STM study the posterior is more peaked than prior & likelihood, but not in general

The prior is dominated by the likelihood for a large sample size

Posterior = gamma = prior ⇒ conjugacy

Concepts in Bayesian Inference 109


4. Equivalence of prior information and extra data

Prior = equivalent to an experiment of size β0 with counts summing up to α0 − 1

STM study: prior corresponds to an experiment of size 1 with count equal to 2

Concepts in Bayesian Inference 110


Bayesian approach: no prior information is available

Gamma with α0 ≈ 1 and β0 ≈ 0 = non-informative prior

Concepts in Bayesian Inference 111


2.9 The prior and posterior of h(θ)

h a monotone transformation of θ: ψ = h(θ) = new parameter

Transformation rule ⇒ density of ψ:

p(h⁻¹(ψ)) |dh(θ)/dθ|⁻¹, evaluated at θ = h⁻¹(ψ)

with p(θ) the prior or posterior of θ

Concepts in Bayesian Inference 112


Example II.4: Stroke study - Posterior distribution of log(θ)

The probability of success θ is often modeled on the log-scale

Posterior distribution of ψ = log(θ):

p(ψ | y) = [1/B(ᾱ, β̄)] exp(ᾱ ψ) (1 − exp ψ)^(β̄−1)

with ᾱ = 19 and β̄ = 133.
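The same posterior can be obtained (or checked) by sampling θ from Beta(19, 133) and transforming; a minimal R sketch:

# Sampling-based posterior of psi = log(theta), with theta | y ~ Beta(19, 133)
set.seed(1)
theta <- rbeta(10000, 19, 133)
psi   <- log(theta)

# Analytic density from the transformation rule, for comparison
dens.psi <- function(psi) exp(19 * psi) * (1 - exp(psi))^(133 - 1) / beta(19, 133)

hist(psi, freq = FALSE, breaks = 50, main = "", xlab = "log(theta)")
curve(dens.psi, add = TRUE)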

Concepts in Bayesian Inference 113


2.10 Bayesian versus likelihood approach

Bayesian approach satisfies the 1st likelihood principle: inference does not
depend on never observed results

Bayesian approach satisfies the 2nd likelihood principle:

p2(θ | y) = L2(θ | y) p(θ) / ∫ L2(θ | y) p(θ) dθ
          = c L1(θ | y) p(θ) / ∫ c L1(θ | y) p(θ) dθ
          = p1(θ | y)

In the Bayesian approach the parameter is stochastic

⇒ different effect of a transformation h(θ) in the Bayesian and likelihood approach

Concepts in Bayesian Inference 114


2.11 Bayesian versus frequentist approach

Frequentist approach:
. θ fixed and data are stochastic
. Many tests are based on asymptotic arguments
. Maximization is key tool
. Does depend on stopping rules

Bayesian approach:
. Condition on the observed data (data fixed), uncertainty about θ (θ stochastic)
. No asymptotic arguments are needed, all inference depends on posterior
. Integration is key tool
. Does not depend on stopping rules

Concepts in Bayesian Inference 115


Frequentist and Bayesian approach can give the same numerical output (with
different interpretation), but sometimes also very different inference

Frequentist ideas in Bayesian approaches (MCMC)

Concepts in Bayesian Inference 116


2.12 The different modes of the Bayesian approach

Subjectivity ⇔ objectivity

Subjective Bayesian ⇔ objective Bayesian

46656 varieties of Bayesians

Pragmatic Bayesian = Bayesian ??

Concepts in Bayesian Inference 117


2.13 An historical note on the Bayesian approach

. Thomas Bayes was probably born in 1701 and died on 7-4-1761.
. He was a Presbyterian minister, studied logic and theology at Edinburgh University,
and had strong mathematical interests.
. None of his work on mathematics was published during his life.
. Bayes' theorem was submitted posthumously by his friend Richard Price in 1763
and was entitled
"An Essay towards solving a Problem in the Doctrine of Chances".

Concepts in Bayesian Inference 118


. Up to 1950 Bayes theorem was called
Theorem of Inverse Probability
. The foundation of Bayesian theory was developed by
Pierre-Simon Laplace (1749-1827)
. Laplace first assumed indifference prior,
later he relaxed this assumption
. Much opposition:
e.g. Poisson, Fisher, Neyman and Pearson, etc

Concepts in Bayesian Inference 119


. Fisher was a strong opponent of Bayesian theory
. Because of his influence ⇒ dramatic negative effect
. He was opposed to the use of a flat prior and to the fact
that conclusions change when putting a flat prior
on h(θ) rather than on θ
. Some connection between
Fisher's (inductive) approach and the Bayesian approach,
but much difference with the N&P approach

Concepts in Bayesian Inference 120


Proponents of the Bayesian approach:

de Finetti: exchangeability

Jeffreys: noninformative prior, Bayes factor

Savage: theory of subjective and personal probability and statistics

Lindley: Gaussian hierarchical models

Geman & Geman: Gibbs sampling

Gelfand & Smith: introduction of Gibbs sampling into statistics

Spiegelhalter: (Win)BUGS

Concepts in Bayesian Inference 121


Recommendation

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma
Code, Hunted Down Russian Submarines & Emerged Triumphant from
Two Centuries of Controversy

McGrayne (2011)

Concepts in Bayesian Inference 122


Chapter 3
Introduction to Bayesian inference

We can start working NOW!

Concepts in Bayesian Inference 123


3.1 Introduction

In this chapter:

Exploration of the posterior distribution:


. Summary statistics for location and variability
. Interval estimation
. Predictive distribution

Normal approximation of posterior

Simple sampling procedures

Bayesian hypothesis tests

Concepts in Bayesian Inference 124


3.2 Summarizing the posterior with probabilities

Direct exploration of the posterior: P(a < θ < b | y) for different a and b

Example III.1: Stroke study - SICH incidence

θ = probability of SICH due to rt-PA at the first ECASS-3 interim analysis

p(θ | y) = Beta(19, 133) distribution

P(a < θ < b | y):

. a = 0.2, b = 1.0: P(0.2 < θ < 1 | y) = 0.0062
. a = 0.0, b = 0.08: P(0 < θ < 0.08 | y) = 0.033
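These probabilities follow directly from the Beta cdf; a one-line R check for each (illustrative):

# Posterior probabilities for theta | y ~ Beta(19, 133)
pbeta(1, 19, 133) - pbeta(0.2, 19, 133)     # P(0.2 < theta < 1 | y), about 0.006
pbeta(0.08, 19, 133) - pbeta(0, 19, 133)    # P(0 < theta < 0.08 | y), about 0.03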

Concepts in Bayesian Inference 125


3.3 Posterior summary measures

Posterior mode

Posterior mean

Posterior median

Posterior variance and standard deviation

Credible intervals: HPD and equal-tail

Concepts in Bayesian Inference 126


3.3.1 Measures of location & variability - Mode

Posterior mode θ̂M:

θ̂M = arg max_θ p(θ | y)

Properties:

. Posterior mode only involves maximization
. For p(θ) ≡ c, posterior mode = MLE
. ψ = h(θ) with h a monotone transformation: in general ψ̂M ≠ h(θ̂M)

Concepts in Bayesian Inference 127


Measures of location & variability - Mean

Posterior mean θ̄:

θ̄ = ∫ θ p(θ | y) dθ

Properties:
. θ̄ minimizes ∫ (θ̂ − θ)² p(θ | y) dθ over all estimators θ̂
. Posterior mean involves integration twice
. ψ = h(θ) with h a monotone transformation: ψ̄ ≠ h(θ̄)

Concepts in Bayesian Inference 128


Measures of location & variability - Median

Posterior median θM:

0.5 = ∫_{θM}^{∞} p(θ | y) dθ

Properties:
. θM minimizes ∫ a |θ̂ − θ| p(θ | y) dθ with a > 0 over all estimators θ̂
. For a symmetric posterior: posterior median = posterior mean = posterior mode
. Posterior median involves one integration and solving an integral equation
. ψ = h(θ) with h a monotone transformation: ψM = h(θM)

Concepts in Bayesian Inference 129


Measures of location & variability - Variance/SD

Posterior variance σ̄²:

σ̄² = ∫ (θ − θ̄)² p(θ | y) dθ

Posterior SD: σ̄

Calculations involve three integrals

Concepts in Bayesian Inference 130


Example III.2: Stroke study - Posterior summary measures

Posterior at the 1st interim ECASS 3 analysis: Beta(ᾱ, β̄) (ᾱ = 19 & β̄ = 133)

Posterior mode: maximize (ᾱ − 1) ln(θ) + (β̄ − 1) ln(1 − θ) wrt θ
θ̂M = (ᾱ − 1)/(ᾱ + β̄ − 2) = 18/150 = 0.12

Posterior mean: integrate [1/B(ᾱ, β̄)] ∫₀¹ θ θ^(ᾱ−1) (1 − θ)^(β̄−1) dθ
θ̄ = B(ᾱ + 1, β̄)/B(ᾱ, β̄) = ᾱ/(ᾱ + β̄) = 19/152 = 0.125

Posterior median: solve 0.5 = [1/B(ᾱ, β̄)] ∫_{θM}^{1} θ^(ᾱ−1) (1 − θ)^(β̄−1) dθ for θM
θM = 0.122 (R-function qbeta)

Posterior variance: calculate also [1/B(ᾱ, β̄)] ∫₀¹ θ² θ^(ᾱ−1) (1 − θ)^(β̄−1) dθ
σ̄² = ᾱ β̄ / [(ᾱ + β̄)² (ᾱ + β̄ + 1)] = 0.0267²
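All four summaries can be checked numerically in R; a sketch in which the qbeta() and integrate() calls mirror the analytical steps above:

a <- 19; b <- 133                         # posterior Beta(a, b)

(a - 1) / (a + b - 2)                     # mode  = 0.12
a / (a + b)                               # mean  = 0.125
qbeta(0.5, a, b)                          # median, about 0.12
a * b / ((a + b)^2 * (a + b + 1))         # variance, about 0.0267^2

# The mean can also be obtained by direct numerical integration
integrate(function(x) x * dbeta(x, a, b), 0, 1)$value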

Concepts in Bayesian Inference 131


Example III.3: Dietary study - Posterior summary measures

Posterior for μ (based on the IBBENS prior): Gaussian

Posterior mode = mean = median:
μ̂M = μ̄ = μM = 327.2 mg/dl

Posterior variance & SD: σ̄² = 22.99 & σ̄ = 4.79 mg/dl

Concepts in Bayesian Inference 132


3.3.2 Posterior interval estimation

Definition of a credible/credibility interval:

[a, b] = 100(1 − α)% credible interval for θ if

P(a ≤ θ ≤ b | y) = 1 − α

In terms of the cdf F(θ):

P(a ≤ θ ≤ b | y) = 1 − α = F(b) − F(a)

Concepts in Bayesian Inference 133


Two types of credible interval

The definition of a CI does not uniquely define the credible interval

Two special cases:

. 100(1 − α)% equal tail credible interval [a, b]:
P(θ ≤ a | y) ≡ F(a) = α/2
P(θ ≥ b | y) ≡ 1 − F(b) = α/2

. 100(1 − α)% highest posterior density (HPD) interval [a, b]:

a 100(1 − α)% credible interval such that
for all θ1 ∈ [a, b] and for all θ2 ∉ [a, b]: p(θ1 | y) ≥ p(θ2 | y)

Concepts in Bayesian Inference 134


Properties of HPD interval

HPD interval = an interval of evidence based on posterior

HPD interval explicitly needs a density

HDI = highest density interval (software FirstBayes)

100(1 − α)% HPD interval = shortest interval with size (1 − α) (Press, 2003)

Image of an HPD interval under a monotone transformation is not HPD anymore

Symmetric posterior: equal tail = HPD interval

Calculation of HPD interval: iterative procedure

Concepts in Bayesian Inference 135


95% confidence interval versus 95% credible interval

Given data set y: 95% credible interval contains the 95% most plausible
parameter values a posteriori

Given data set y: 95% confidence interval either contains or does not contain the
true value. Adjective 95% gets its meaning only in the long run

Interpretation of the classical confidence interval = Bayesian flavor

Concepts in Bayesian Inference 136


Example III.4: Dietary study - Interval estimation of dietary intake

Posterior = N(μ̄, σ̄²)

Obvious choice for a 95% CI: [μ̄ − 1.96 σ̄, μ̄ + 1.96 σ̄]

Equal tail 95% CI = 95% HPD interval

Results IBBENS-2 study:


. IBBENS prior distribution 95% CI = [317.8, 336.6] mg/dl
. N(328; 10,000) prior 95% CI = [285.6, 351.0] mg/dl
. Classical (frequentist) 95% confidence interval = [284.9, 351.1] mg/dl

Concepts in Bayesian Inference 137


Example III.5: Stroke study Interval estimation of probability of SICH

Posterior = Beta(19, 133)-distribution

95% equal tail CI (R function qbeta) = [0.077, 0.18] (see figure)

95% HPD interval = [0.075, 0.18] (see figure)

Computations HPD interval:


. 95% HPD interval: interval [a, a+h] with
F(a + h) − F(a) = 0.95 (F = cdf)
f(a + h) = f(a) (f = pdf)
. An optimization program is needed (R-function optimize) to find a and h
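A minimal R sketch of such an optimization for the Beta(19, 133) posterior, using the shortest-interval characterization of the HPD interval (our own illustrative code, not the course's):

a <- 19; b <- 133

# Width of the 95% credible interval that starts at lower endpoint 'lo'
width <- function(lo) qbeta(pbeta(lo, a, b) + 0.95, a, b) - lo

# The HPD interval is the 95% interval of minimal width
lo  <- optimize(width, interval = c(0, qbeta(0.05, a, b)))$minimum
c(lo, lo + width(lo))                      # about [0.075, 0.18]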

Concepts in Bayesian Inference 138


Stroke study: equal tail credible interval

[Figure: Beta(19, 133) posterior with the 95% equal tail credible interval; 0.025 probability in each tail.]


Concepts in Bayesian Inference 139


Stroke study: HPD of transformed parameter

HPD interval is not invariant to (monotone) transformations:

[Figure: (a) 95% HPD interval for the Beta(19, 133) posterior of θ; (b) the log-transformed image of this interval, which is no longer the HPD interval for log(θ).]


Concepts in Bayesian Inference 140


3.4 Predictive distributions

Predictive distribution = distribution of a future observation ỹ after having
observed the sample {y1, . . . , yn}

Assumption: ỹ is independent of y given θ and is drawn from the same
distribution as {y1, . . . , yn} (ỹ ~ p(y | θ))

We look at the (a) frequentist and (b) Bayesian approach

Three cases: (a) Gaussian, (b) binomial and (c) Poisson

Concepts in Bayesian Inference 141


3.4.1 Frequentist approach

Naive approach
. Estimate θ̂
. Predictive distribution of ỹ: p(ỹ | θ̂)
. 100(1 − α)% predictive interval (PI): interval containing 100(1 − α)% of the
future observations
. When based on θ̂ ⇒ 95% PI based on p(ỹ | θ̂) is too short

Realistic approach: allow for the sampling distribution of θ̂

Concepts in Bayesian Inference 142


3.4.2 Bayesian approach: posterior predictive distribution

Central idea: Take the posterior uncertainty into account

Three cases:
. All mass (AUC ≈ 1) of p(θ | y) at θ̂M ⇒ distribution of ỹ: p(ỹ | θ̂M)
. All mass at θ¹, . . . , θᴷ ⇒ distribution of ỹ: Σ_{k=1}^K p(ỹ | θᵏ) p(θᵏ | y)
. General case: posterior predictive distribution (PPD) ⇒ distribution of ỹ:

p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ

Concepts in Bayesian Inference 143


PPD

PPD = marginal distribution of a future observation taking posterior uncertainty into account

p(ỹ | y) = ∫ p(ỹ, θ | y) dθ = ∫ p(ỹ | θ, y) p(θ | y) dθ

Use hierarchical independence: p(ỹ | θ, y) = p(ỹ | θ)

100(1−α)% posterior predictive interval (PPI): P(a ≤ ỹ ≤ b | y) = 1 − α

Prior predictive distribution:

p(ỹ) = ∫ p(ỹ | θ) p(θ) dθ = averaged likelihood = integrated/marginal likelihood

Concepts in Bayesian Inference 144


3.4.3 Applications

Gaussian case (σ² known): 95% reference interval of alp

Binomial case: predicting the number of patients with SICH

Poisson case: predicting caries experience

Concepts in Bayesian Inference 145


1. The Gaussian case

Diagnostic screening tests 95% reference interval

Concepts in Bayesian Inference 146


Example III.6: SAP study 95% reference interval

Serum alkaline phosphatase (alp) was measured on a prospective set of 250


healthy patients by Topal et al (2003)

Concepts in Bayesian Inference 147


1. The Gaussian case (σ² known): Frequentist approach

yi = 100/√(alpi) (i = 1, . . . , 250) ≈ normal distribution

95% ref interval for N(μ, σ²) (μ & σ² known): [μ − 1.96 σ, μ + 1.96 σ]

Naive approach:
Replace μ, σ by ȳ = 7.11, s = 1.4 ⇒ 95% ref interval (alp scale) = [104.45, 508.95]

Realistic approach: take into account the sampling distribution of ȳ

. ỹ − ȳ ∼ N(0, σ²(1 + 1/n))
. 95% ref interval = [ȳ − 1.96 σ√(1 + 1/n), ȳ + 1.96 σ√(1 + 1/n)]
95% ref interval (alp scale) = [104.33, 510.18]

Concepts in Bayesian Inference 148


Example III.6: Histogram alp + 95% reference interval

[Figure: histogram of alp with the 95% reference range indicated.]

Concepts in Bayesian Inference 149


1. The Gaussian case (σ² known): Bayesian approach

Posterior PD: ỹ | y ∼ N(μ̄, σ̄² + σ²)

Prior PD: ỹ ∼ N(μ0, σ² + σ0²)

95% PPI (equal tail & HPD): [μ̄ − 1.96 √(σ̄² + σ²), μ̄ + 1.96 √(σ̄² + σ²)]

Prior variance σ0² large:

μ̄ ≈ ȳ, σ̄² ≈ σ²/n

⇒ PPD ỹ | y ≈ N(ȳ, σ²(1 + 1/n))
95% Bayesian ref interval = frequentist 95% ref interval = [104.3, 510.2]

Same numerical results BUT interpretation different

Concepts in Bayesian Inference 150


2. The binomial case

Prior to a clinical trial: reflect on various scenarios

Concepts in Bayesian Inference 151


Example III.7: Stroke study Predicting SICH incidence in interim analysis

Before interim analysis but given the pilot data:


Obtain an idea of the number of (future) rt-PA treated patients who will suffer
from SICH in sample of size m = 50

Distribution of ye (given the pilot data)?

Concepts in Bayesian Inference 152


2. The binomial case Frequentist approach

Naive approach
. MLE of θ (incidence SICH) = 8/100 = 0.08 for the (fictive) ECASS 2 study
. Predictive distribution: Bin(50, 0.08)
. 95% predictive set: {0, 1, . . . , 7} ⇒ ≈ 94% of the future counts
⇒ observed result of 10 SICH patients out of 50 is extreme

Realistic approach: take into account the sampling variability of θ̂

. Old problem formulated by Pearson in 1920
. Profile likelihood of ỹ: L(ỹ) = max_θ L(θ, ỹ) ⇒ distribution of ỹ (Pawitan, 2001)
. n large: use the asymptotic distribution of θ̂

Concepts in Bayesian Inference 153


Stroke study: Binomial predictive distribution

[Figure: Bin(50, 0.08) predictive distribution of the number of future rt-PA patients with SICH.]

Concepts in Bayesian Inference 154


2. The binomial case Bayesian approach

Posterior (ECASS 2 prior) = Beta(9,93)-distribution

PPD = beta-binomial distribution BB(m, ᾱ, β̄):

p(ỹ | y) = ∫₀¹ C(m, ỹ) θ^ỹ (1 − θ)^(m−ỹ) · θ^(ᾱ−1) (1 − θ)^(β̄−1)/B(ᾱ, β̄) dθ
         = C(m, ỹ) B(ỹ + ᾱ, m − ỹ + β̄)/B(ᾱ, β̄)

BB(m, ᾱ, β̄) shows more variability than Bin(m, θ̂)

94.4% PPS is {0, 1, . . . , 9} ⇒ 10 SICH patients out of 50 is less extreme

Prior predictive distribution = BB(m, α0, β0)
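A small R sketch (assumed code) evaluating this beta-binomial PPD for the stroke-study values m = 50, ᾱ = 9, β̄ = 93:

## Assumed R sketch: beta-binomial PPD BB(m = 50, 9, 93) for the number of
## future rt-PA patients with SICH, and the probability of the set {0,...,9}.
m <- 50; a <- 9; b <- 93
ytilde <- 0:m
ppd <- choose(m, ytilde) * beta(ytilde + a, m - ytilde + b) / beta(a, b)
sum(ppd)                     # should equal 1
cumsum(ppd)[10]              # P(ytilde <= 9 | y), about 0.944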

Concepts in Bayesian Inference 155


Stroke study: Binomial and Beta-binomial predictive distribution

[Figure: BB(50, 9, 93) and Bin(50, 0.08) predictive distributions of the number of future rt-PA patients with SICH.]

Concepts in Bayesian Inference 156


3. The Poisson case Bayesian approach

Poisson likelihood + gamma prior ⇒ gamma posterior

PPD = negative binomial distribution NB(ᾱ, β̄):

p(ỹ | y) = [Γ(ᾱ + ỹ)/(Γ(ᾱ) ỹ!)] (β̄/(β̄ + 1))^ᾱ (1/(β̄ + 1))^ỹ
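As a hedged illustration, this NB PPD is easy to evaluate in R; the Gamma(29, 11) posterior of the caries study (Example III.10) is used here purely as an example:

## Assumed R sketch: negative-binomial PPD for a Poisson likelihood with a
## Gamma(alpha, beta) posterior; alpha = 29, beta = 11 (caries study) are
## used only as an illustration.
alpha <- 29; beta <- 11
ytilde <- 0:20
ppd <- gamma(alpha + ytilde) / (gamma(alpha) * factorial(ytilde)) *
       (beta / (beta + 1))^alpha * (1 / (beta + 1))^ytilde
## equivalently: dnbinom(ytilde, size = alpha, prob = beta / (beta + 1))
round(ppd, 3)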

Concepts in Bayesian Inference 157


Example III.8: Caries study PPD for caries experience

[Figure: observed distribution of the dmft-index with the PPD overlaid.]

Concepts in Bayesian Inference 158


3.5 Exchangeability - 1

Independence:
p(y1, y2, . . . , yn | θ) = Π_{i=1}^n p(yi | θ)
Independence is defined conditional on θ

Exchangeability:
θ is never known, but given a prior distribution p(θ):

p(y1, y2, . . . , yn) = ∫ p(y1, y2, . . . , yn | θ) p(θ) dθ = ∫ Π_{i=1}^n p(yi | θ) p(θ) dθ

p(y1, y2, . . . , yn) = p(y_{π(1)}, y_{π(2)}, . . . , y_{π(n)}) for every permutation π of the indices

⇒ y1, y2, . . . , yn are exchangeable

Concepts in Bayesian Inference 159


Exchangeability - 2

Exchangeable ⇏ independence

Independence with the same marginal distribution ⇒ exchangeable

Exchangeability is an assumption that depends on context of experiment

Exchangeability is central in prediction (PPD)

Exchangeability was introduced by de Finetti (1937, 1974)


. finite and infinite exchangeability
. Representation theorem: Exchangeable random variables can be seen as
conditionally independent random variables with a given prior

Partial/conditional exchangeability

Concepts in Bayesian Inference 160


3.6 A normal approximation to the posterior

Up to now: choice of prior was taken such that posterior & posterior summary
measures are obtained analytically

In general: numerical techniques are needed, but complicated for


multiparameter case

Large sample size: Bayesian analysis simplifies considerably using normal


approximation to posterior

Concepts in Bayesian Inference 161


3.6.1 A normal approximation to the likelihood

Example III.9: Kaldor et al.'s case-control study

Case-control study introduced in Example I.4


Results (Chapter 1)
Treatment Controls Cases
No Chemo 160 11
Chemo 251 138
Total 411 149

θ0 (θ1) = probability of having the risk factor in the controls (cases)

n0 (n1) = number of controls (cases)
r0 (r1) = number of controls (cases) with the risk factor (chemotherapy)

Concepts in Bayesian Inference 162


Example III.9 (continued)

Logarithm of OR:

θ = log[ (θ1/(1 − θ1)) / (θ0/(1 − θ0)) ]

OR = e^θ

MLE of θ:

θ̂ = log[ r1(n0 − r0) / (r0(n1 − r1)) ]

θ̂ approximately ∼ N(θ, σ̂²) with

var(θ̂) ≡ σ̂² = 1/r0 + 1/(n0 − r0) + 1/r1 + 1/(n1 − r1)

Concepts in Bayesian Inference 163


Example III.9 (continued) - Frequentist approach

θ̂ = 2.08 and e^θ̂ = 8.0

Using asymptotic normality: P < 0.0001

95% confidence interval for θ = [1.43, 2.72]

95% confidence interval for OR = [4.2, 15.2]

Concepts in Bayesian Inference 164


Example III.9 (continued) - Bayesian approach - 1

Large sample inference for θ is not necessary in a Bayesian context

But the large sample result for θ̂ makes the Bayesian analysis easier

Chapter 2:
. Prior N(θ0, σ0²) for θ + normal likelihood of θ̂
⇒ Normal posterior for θ: N(θ̄, σ̄²) with

θ̄ = σ̄² (θ̂/σ̂² + θ0/σ0²)

σ̄² = (1/σ̂² + 1/σ0²)⁻¹
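A sketch in R (assumed code) reproducing these normal-approximation calculations from the Kaldor table, combined with the vague N(log 5, 10,000²) prior used below:

## Assumed R sketch: normal approximation to the log odds ratio (Kaldor data)
## combined with a normal prior; the vague prior N(log 5, 10000^2) is used.
r0 <- 251; n0 <- 411            # controls with chemo / total controls
r1 <- 138; n1 <- 149            # cases with chemo / total cases
theta.hat <- log(r1 * (n0 - r0) / (r0 * (n1 - r1)))          # about 2.08
sigma2.hat <- 1/r0 + 1/(n0 - r0) + 1/r1 + 1/(n1 - r1)
mu0 <- log(5); sigma2.0 <- 10000^2
post.var  <- 1 / (1/sigma2.hat + 1/sigma2.0)
post.mean <- post.var * (theta.hat/sigma2.hat + mu0/sigma2.0)
exp(post.mean + c(-1.96, 0, 1.96) * sqrt(post.var))  # about [4.2, 8.0, 15.2]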

Concepts in Bayesian Inference 165


Example III.9 (continued) - Bayesian approach - 2

Prior: N(θ0 = log(5), σ0² = 10,000²)

Posterior summary measures ≈ frequentist measures

[Figure (a): prior and posterior density of the odds ratio under the vague prior, with the 95% CI indicated.]

Concepts in Bayesian Inference 166


Example III.9 (continued) - Bayesian approach - 3

Expert opinion:
. Best guess (median value) for e^θ: 5
. 95% prior credible interval for OR = [1, 25]
⇒ Experts put 95% belief in N(1.6, 0.82²) for θ

Result: posterior median for OR = 7.5, 95% CI = [4.1, 13.6]

[Figure: expert prior and resulting posterior density of the odds ratio.]

Concepts in Bayesian Inference 167


3.6.2 Asymptotic properties of the posterior distribution

Normal posterior for a large sample size is justified even when the likelihood is
combined with a non-normal prior

Reflection that likelihood dominates prior for a large sample size


Theorem: Let y represent a sample of n i.i.d. observations with joint density p(y | θ) ≡ L(θ | y) and p(θ) > 0 a prior density for θ. Under suitable regularity conditions, the posterior p(θ | y) converges to the normal distribution N(θ̂, σ̂²) when n → ∞, where θ̂ is the MLE of θ and σ̂² = [−d² ln L(θ | y)/dθ² |_{θ=θ̂}]⁻¹.

Theorem does not play a central role in the Bayesian approach

Concepts in Bayesian Inference 168


Example III.10: Caries study Posterior of mean(dmft-index)

. θ = mean dmft-index
. Likelihood: dmft-index of ten children, Σ_i yi = 26
. Prior: Gamma(3, 1)
. Posterior: Gamma(29, 11) (solid)

Normal approximation (dotted) to the Poisson likelihood (BCLT):
. θ̂ = ȳ = 2.6
. σ̂² = ȳ/n = 0.26

[Figure: Gamma(29, 11) posterior (solid) and its normal approximation (dotted).]

Concepts in Bayesian Inference 169


3.7 Numerical techniques to determine the posterior

Numerical integration

Sampling from the posterior

Choice of posterior summary measures

Concepts in Bayesian Inference 170


3.7.1 Numerical integration

Take f(θ) = t(θ) p(y | θ) p(θ) or f(θ) = t(θ) p(θ | y)

Simple integration techniques: equidistant grid + approximate sub-integrals by a polynomial
Mid-point rule: constant
Trapezoidal rule: linear
Simpson's rule: quadratic
Result: ∫_a^b f(θ) dθ ≈ Σ_{m=0}^{M+1} w_m f(θ_m)

Gaussian quadrature
Non-adaptive
Adaptive (M = 1 ⇒ Laplace approximation)

Concepts in Bayesian Inference 171


Example III.11: Caries study Posterior distribution for a lognormal prior

Replace gamma prior (Example III.10) by lognormal prior

Posterior distribution:

∝ λ^(Σᵢ yᵢ − 1) e^(−nλ) exp[−(log(λ) − μ0)²/(2σ0²)], (λ > 0)

Posterior moments cannot be evaluated & AUC not known

Mid-point approach
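A small R sketch of the mid-point rule for this unnormalized posterior (assumed code; μ0 = log(2), σ0 = 0.5, n = 10 and Σyi = 26 are taken from Examples III.10 and III.13, and the grid limits are illustrative):

## Assumed R sketch: mid-point rule for the unnormalized posterior with a
## lognormal prior (mu0 = log(2), sigma0 = 0.5, n = 10, sum(y) = 26).
sumy <- 26; n <- 10; mu0 <- log(2); sigma0 <- 0.5
post.unnorm <- function(lam)
  lam^(sumy - 1) * exp(-n * lam) * exp(-(log(lam) - mu0)^2 / (2 * sigma0^2))
M <- 1000; a <- 0.01; b <- 10; h <- (b - a) / M
grid <- a + (0:(M - 1) + 0.5) * h                 # mid-points of the sub-intervals
auc  <- h * sum(post.unnorm(grid))                # approximate AUC
pmean <- h * sum(grid * post.unnorm(grid)) / auc  # approximate posterior mean
c(auc, pmean)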

Concepts in Bayesian Inference 172


Example III.11 (continued)

Calculation AUC using mid-point approach

[Figure: k × posterior density evaluated over a grid; mid-point approximation gives AUC ≈ 0.13.]

Concepts in Bayesian Inference 173


3.7.2 Sampling from the posterior distribution

Monte Carlo integration: usefulness of sampling idea

General purpose sampling algorithms

Concepts in Bayesian Inference 174


Monte-Carlo integration

Monte Carlo integration: replace the integral by a Monte Carlo sample {θ̃¹, . . . , θ̃ᴷ}

Approximate p(θ | y) by the sample histogram
Classical Strong Law of Large Numbers:

∫ t(θ) p(θ | y) dθ ≈ t̄ = (1/K) Σ_{k=1}^K t(θ̃ᵏ), for K large

Classical Central Limit Theorem: 95% confidence interval

[t̄ − 1.96 s_t/√K, t̄ + 1.96 s_t/√K]
95% equal tail CI: [2.5%, 97.5%] quantiles of the sample
95% HPD interval: approach of Tanner (1993)

Concepts in Bayesian Inference 175


Example III.12: Stroke study Sampling from the posterior distribution

Posterior for θ = probability of SICH with rt-PA = Beta(19, 133) (Example II.1)

5,000 sampled values of θ from the Beta(19, 133) distribution

Posterior of log(θ): one extra line in the R program

Sample summary measures ≈ true summary measures

95% equal tail CI for θ: [0.0782, 0.182]

95% equal tail CI for log(θ): [−2.56, −1.70]

Approximate 95% HPD interval for θ: [0.0741, 0.179]
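A minimal R sketch (assumed code) of this Monte Carlo exercise:

## Assumed R sketch: sampling the Beta(19, 133) posterior and log(theta).
set.seed(1)
theta <- rbeta(5000, 19, 133)
quantile(theta, c(0.025, 0.975))        # close to [0.078, 0.182]
quantile(log(theta), c(0.025, 0.975))   # close to [-2.56, -1.70]
c(mean(theta), sd(theta))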

Concepts in Bayesian Inference 176


Example III.12 (continued)

[Figure: sampled posterior histograms of (a) θ and (b) log(θ).]

Concepts in Bayesian Inference 177


General purpose sampling algorithms

Many algorithms are available to sample from standard distributions

Dedicated procedures/general purpose algorithms for non-standard distributions:

. Inverse cumulative distribution function (ICDF) method

. Accept-reject (AR) algorithm

. Importance sampling and the SIR algorithm

Concepts in Bayesian Inference 178


Inverse cumulative distribution function (ICDF) method

Random variable x with cdf F(x) ⇒ F(x) = u ∼ U(0, 1)

ICDF method
. Sample u from U(0, 1) ⇒ x = F⁻¹(u) ∼ F(x)

[Figure: cdf with a sampled u mapped back to x = F⁻¹(u).]
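A tiny R illustration of the ICDF method (assumed code; the exponential(1) distribution is used because its inverse cdf has a closed form):

## Assumed R sketch: ICDF method for the exponential(1) distribution,
## whose inverse cdf is F^{-1}(u) = -log(1 - u).
set.seed(1)
u <- runif(10000)
x <- -log(1 - u)
c(mean(x), var(x))    # both close to 1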

Concepts in Bayesian Inference 179


Accept-reject algorithm 1

Sampling in two steps:

. Stage 1: sample θ̃ from q(θ) (proposal distribution)

. Stage 2: accept or reject θ̃ ⇒ accepted values form a sample from p(θ | y) (target)

Assumption: p(θ | y) ≤ A q(θ) for all θ

. q = envelope distribution
. A = envelope constant

[Figure: N(0, 1) target density with envelope A q, A = 1.8.]

Concepts in Bayesian Inference 180


Accept-reject algorithm 2

Stage 1: θ̃ & u are drawn independently from q(θ) & U(0, 1)

Stage 2:

. Accept θ̃ when u ≤ p(θ̃ | y)/[A q(θ̃)]

. Reject θ̃ when u > p(θ̃ | y)/[A q(θ̃)]

Properties AR algorithm:
. Produces a sample from the posterior
. Only needs p(y | ) p()
. Probability of acceptance = 1/A
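A hedged R sketch of the accept-reject algorithm, using the Beta(19, 133) stroke posterior as target and a U(0, 1) envelope (this choice of target and envelope is illustrative, not from the course):

## Assumed R sketch: accept-reject sampling from the Beta(19, 133) posterior
## with a U(0, 1) envelope; A is the posterior density at its mode 18/150.
set.seed(1)
A <- dbeta(18/150, 19, 133)
K <- 20000
prop <- runif(K)                               # stage 1: draws from q = U(0, 1)
u    <- runif(K)
theta <- prop[u <= dbeta(prop, 19, 133) / A]   # stage 2: accept/reject
length(theta) / K                              # acceptance rate, about 1/A
quantile(theta, c(0.025, 0.975))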

Concepts in Bayesian Inference 181


Adaptive Rejection Sampling algorithms 1

Adaptive Rejection Sampling (ARS) algorithm:

Builds up envelope density in an adaptive manner

Builds up squeezing density in an adaptive manner

The logarithms of the envelope and squeezing densities are piecewise linear functions with knots at the sampled grid points

Two special cases:


. Tangent method of ARS
. Derivative-free method of ARS

Concepts in Bayesian Inference 182


Adaptive Rejection Sampling algorithms 2

Tangent ARS Derivative-free ARS


[Figure: (a) tangent ARS and (b) derivative-free ARS envelopes and squeezing densities on the log-posterior scale.]

Concepts in Bayesian Inference 183


Adaptive Rejection Sampling algorithms 3

Properties ARS algorithms:

Envelope density can be made arbitrarily close to target

5 to 10 grid points determine envelope density

Squeezing density avoids (many) function evaluations

Derivative-free ARS is implemented in WinBUGS

ARMS algorithm: combination with Metropolis algorithm for non log-concave


distributions, implemented in Bayesian SAS procedures

Concepts in Bayesian Inference 184


Importance sampling

Interest in E[t(θ) | y]:

E[t(θ) | y] = ∫ t(θ) p(θ | y) dθ = ∫ [t(θ) p(θ | y)/q(θ)] q(θ) dθ = E_q[t(θ) p(θ | y)/q(θ)]

Take a sample θ¹, . . . , θᴷ from q(θ)

Estimate E[t(θ) | y] by

(1/K) Σ_{k=1}^K t(θᵏ) p(θᵏ | y)/q(θᵏ) = (1/K) Σ_{k=1}^K t(θᵏ) w(θᵏ)

with importance weights w(θᵏ) = p(θᵏ | y)/q(θᵏ)

Normalized weights: t̂_{I,K} = Σ_{k=1}^K t(θᵏ) w(θᵏ) / Σ_{k=1}^K w(θᵏ)

Concepts in Bayesian Inference 185


Properties:

Posterior needs to be known only up to a constant

Does not generate a sample from the posterior

With normalized weights, the tails of q(θ) must be heavier than those of p(θ | y)
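A short R sketch of importance sampling for the posterior mean of the Beta(19, 133) stroke posterior; the Beta(2, 10) importance density is an illustrative choice with heavier tails on (0, 1):

## Assumed R sketch: importance-sampling estimate of the posterior mean of
## the Beta(19, 133) posterior, with a heavier-tailed Beta(2, 10) density q.
set.seed(1)
theta <- rbeta(10000, 2, 10)                       # sample from q
w <- dbeta(theta, 19, 133) / dbeta(theta, 2, 10)   # importance weights
sum(w * theta) / sum(w)                            # normalized estimate, ~0.125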

Concepts in Bayesian Inference 186


Weighted sampling-resampling (SIR) algorithm

Two-stage sampling:

. Stage 1: Draw θ̃¹, . . . , θ̃ᴶ from q(θ) and compute the weights

wⱼ = [p(θ̃ʲ | y)/q(θ̃ʲ)] / Σ_{i=1}^J [p(θ̃ⁱ | y)/q(θ̃ⁱ)]   (j = 1, . . . , J)

⇒ Multinomial distribution D(θ̃, w) with w = {w1, w2, . . . , wJ}

. Stage 2: take a sample of size K ≪ J from D(θ̃, w)

Posterior needs to be known only up to a constant

Useful in detecting influential observations (Chapter 10)
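A minimal R sketch of the SIR algorithm (assumed code; the Beta(19, 133) target and U(0, 1) instrumental density are illustrative):

## Assumed R sketch: SIR for the Beta(19, 133) posterior with a U(0, 1)
## instrumental density q.
set.seed(1)
J <- 20000; K <- 2000
prop <- runif(J)                                     # stage 1: draws from q
w <- dbeta(prop, 19, 133); w <- w / sum(w)           # normalized weights
theta <- sample(prop, K, replace = TRUE, prob = w)   # stage 2: resample
quantile(theta, c(0.025, 0.975))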

Concepts in Bayesian Inference 187


Example III.13: Caries study Sampling from posterior with lognormal prior

Accept-reject algorithm

The exponential kernel of the lognormal prior is maximized at log(λ) = μ0 (where it equals 1)

⇒ A q(λ) = λ^(Σᵢ yᵢ − 1) e^(−nλ), i.e. q(λ) ∝ the Gamma(Σᵢ yᵢ, n) density

Data from Example III.11

Prior: lognormal distribution with μ0 = log(2) & σ0 = 0.5

1000 λ-values sampled: 840 accepted

. Sampled posterior mean (median) = 2.50 (2.48)
. Posterior variance = 0.21
. 95% equal tail CI = [1.66, 3.44]

Concepts in Bayesian Inference 188


Example III.13 (continued)

[Figure: sampled posterior distribution of λ under the lognormal prior.]

Concepts in Bayesian Inference 189


Example III.13 (continued)

SIR algorithm (Smith & Gelfand, 1992)

p1(λ | y) = L1(λ | y) p1(λ)/p1(y)  &  p2(λ | y) = L2(λ | y) p2(λ)/p2(y)

p2(λ | y) ∝ [L2(λ | y) p2(λ) / (L1(λ | y) p1(λ))] p1(λ | y) ≡ v(λ) p1(λ | y)

Stage 1: Take a (large) sample λ̃¹, . . . , λ̃ᴶ from p1(λ | y)

Stage 2: Resample K (≪ J) λ-values with weights wⱼ = v(λ̃ʲ)/Σ_{i=1}^J v(λ̃ⁱ)

⇒ Random sample from p2(λ | y)

Useful to perform a sensitivity analysis

Concepts in Bayesian Inference 190


Example III.13 (continued)

Here:

p1(λ) = Gamma(3, 1) ⇒ p1(λ | y) = Gamma(29, 11) (= p1(λ) × Poisson likelihood)

p2(λ) ∝ (1/λ) exp[−(log(λ) − μ0)²/(2σ0²)] ⇒ p2(λ | y) = ?

Stage 1: Random sample of size J = 10,000 from p1(λ | y)

Stage 2:
. Determine weights v(λ̃ⁱ) from the prior ratio p2(λ̃ⁱ)/p1(λ̃ⁱ) (the likelihood stays the same)
. Take a weighted random sample of size K = 1,000 from this sample

⇒ same histogram

Concepts in Bayesian Inference 191


3.7.3 Choice of posterior summary measures

In practice: posterior summary measures are computed with sampling techniques

Choice driven by available software:


. Mean, median and SD because provided by WinBUGS
. Mode almost never reported, but useful to compare with frequentist solution
. Equal tail CI (WinBUGS) & HPD (CODA and BOA)

Concepts in Bayesian Inference 192


3.8 Bayesian hypothesis testing

Two Bayesian tools for hypothesis testing H0: θ = θ0

A true value of θ is assumed, also in the Bayesian context

Based on credible interval:


. Direct use of credible interval
. Contour probability: posterior evidence of H0 with HPD interval

Bayes factor: change of prior odds for H0 due to data

Concepts in Bayesian Inference 193


3.8.1 Inference based on credible intervals

First tool: Direct use of credible intervals:

Is θ0 contained in the 100(1−α)% CI? If not, then reject H0 in a Bayesian way!

Popular when many parameters need to be evaluated, e.g. in regression models

Has a frequentist flavor (with pros and cons)

Concepts in Bayesian Inference 194


Contour probability

Second tool: Make use of HPD interval:

Contour probability pB: P[p(θ | y) > p(θ0 | y)] ≡ 1 − pB

pB first suggested by Box & Tiao (1973)

Bayesian counterpart of 2-sided P -value

Bayesian P -value = see posterior predictive checking (Chapter 10)

Concepts in Bayesian Inference 195


Example III.14: Cross-over study Use of CIs in Bayesian hypothesis testing

30 patients with systolic hypertension in cross-over study:


Period 1: randomization to A or B
Washout period
Period 2: randomization to B or A

θ = P(A better than B) & H0: θ = θ0 (= 0.5)

Result: 21 patients better off with A than with B

Testing:
. Frequentist: 2-sided binomial test (P = 0.043)
. Bayesian: U(0, 1) prior + binomial likelihood (21 out of 30) ⇒ Beta(22, 10) posterior (pB = 0.023)

Concepts in Bayesian Inference 196


Example III.14: Cross-over study Graphical representation of pB

[Figure: Beta(22, 10) posterior with the smallest HPD interval containing θ0 = 0.5; pB = posterior mass outside this interval.]

Concepts in Bayesian Inference 197


3.8.2 The Bayes factor

Posterior probability for H0:

p(H0 | y) = p(y | H0) p(H0) / [p(y | H0) p(H0) + p(y | Ha) p(Ha)]

Bayes factor BF(y):

p(H0 | y)/[1 − p(H0 | y)] = [p(y | H0)/p(y | Ha)] × p(H0)/[1 − p(H0)]

Bayes factor =
factor that transforms the prior odds for H0 into the posterior odds after having observed the data

Central in Bayesian inference + indispensable in model/variable selection

Concepts in Bayesian Inference 198


Bayes factor - Jeffreys classification

Jeffreys classification for favoring H0 and against Ha


. Decisive (BF(y) > 100)
. Very strong (32 < BF(y) ≤ 100)
. Strong (10 < BF(y) ≤ 32)
. Substantial (3.2 < BF(y) ≤ 10)
. Not worth more than a bare mention (1 < BF(y) ≤ 3.2)

Concepts in Bayesian Inference 199


Example III.15: Cross-over study Use of the Bayes factor

Three scenarios for θ:

H0: θ = 0.5 versus Ha: θ = 0.8 (only 0.5 and 0.8 are possible for θ)

H0: θ ≤ 0.5 versus Ha: θ > 0.5

H0: θ = 0.5 versus Ha: θ ≠ 0.5

Concepts in Bayesian Inference 200


Scenario 1: H0 : = 0.5 versus Ha : = 0.8

p(H0) = p(Ha) = 0.5

Likelihoods under H0 and Ha:

. p(y = 21 | H0) = C(30, 21) 0.5^21 0.5^9 = 0.0133
. p(y = 21 | Ha) = C(30, 21) 0.8^21 0.2^9 = 0.0676

BF(21) ≈ 0.2 ⇒ substantial evidence to prefer θ = 0.8 over θ = 0.5

With equal prior probabilities: Bayes factor = posterior odds for H0

This scenario: Bayes factor = classical likelihood ratio

Concepts in Bayesian Inference 201


Scenario 2: H0 : 0.5 versus Ha : > 0.5

p(H0) = p(Ha) = 0.5

θ is continuous ⇒ needed:
p(y | H0) = weighted average of p(y | θ), weights from p(θ | H0) ≡ U(0, 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0.5, 1)

Averaged likelihoods under H0 and Ha:

. p(y = 21 | H0) = ∫_0^0.5 C(30, 21) θ^21 (1 − θ)^9 · 2 dθ = 2 C(30, 21) B(22, 10) × 0.01472
. p(y = 21 | Ha) = ∫_0.5^1 C(30, 21) θ^21 (1 − θ)^9 · 2 dθ = 2 C(30, 21) B(22, 10) × (1 − 0.01472)

BF(21) ≈ 0.15 ⇒ substantial evidence to favor θ > 0.5 over θ ≤ 0.5

Different priors p(θ | H0) & p(θ | Ha) give different Bayes factors

Concepts in Bayesian Inference 202


Scenario 3: H0 : = 0.5 versus Ha : 6= 0.5

Testing a sharp null hypothesis (θ = 0.5) = the most common hypothesis test

p(H0) = p(Ha) = 0.5 (???)

Needed:
p(y | H0) = p(y | 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0, 1)

Likelihood under H0 and averaged likelihood under Ha:

. p(y = 21 | H0) = C(30, 21) 0.5^21 0.5^9 = 0.0133
. p(y = 21 | Ha) = ∫_0^1 C(30, 21) θ^21 (1 − θ)^9 dθ = C(30, 21) B(22, 10)

BF(21) ≈ 0.41 ⇒ not worth more than a bare mention
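The three Bayes factors can be reproduced with a few lines of R (assumed code; the numerical comments for scenarios 1 and 3 correspond to the values quoted above):

## Assumed R sketch: Bayes factors for the three cross-over scenarios.
y <- 21; n <- 30
## scenario 1: theta = 0.5 versus theta = 0.8
BF1 <- dbinom(y, n, 0.5) / dbinom(y, n, 0.8)                   # about 0.2
## scenario 2: theta <= 0.5 versus theta > 0.5 (uniform prior on each half)
m0 <- integrate(function(t) 2 * dbinom(y, n, t), 0, 0.5)$value
m1 <- integrate(function(t) 2 * dbinom(y, n, t), 0.5, 1)$value
BF2 <- m0 / m1
## scenario 3: sharp theta = 0.5 versus theta ~ U(0, 1)
ma <- integrate(function(t) dbinom(y, n, t), 0, 1)$value
BF3 <- dbinom(y, n, 0.5) / ma                                  # about 0.41
c(BF1, BF2, BF3)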

Concepts in Bayesian Inference 203


Scenario 3: Comparison with frequentist test

Classical likelihood ratio: Λ = 0.5^21 0.5^9 / [(21/30)^21 (9/30)^9] = 0.0847 (P = 0.026)

Classical P-value and pB exaggerate (??) the evidence against H0

Frequentist: maximizing; Bayesian: averaging

Concepts in Bayesian Inference 204


3.8.3 Bayesian versus frequentist hypothesis testing

P -value, Bayes factor and posterior probability

P -value is NOT p(H0 | y)

P-value fallacy: interpreting P -value as p(H0 | y) or p(Ha | y)

In some cases P -value comes close to a Bayesian statement:


Scenario 2: min p(H0 | y) = P (y) (Casella and Berger, 1997)
Scenario 3: Gaussian case: P (y) = 0.05 corresponds to BFmin(y) 0.15
Extreme P -values for most common hypothesis are often not that extreme
when evaluated with Bayes factor

Similar issues with contour probability

Concepts in Bayesian Inference 205


Jeffreys-Lindley-Bartlett's paradox

Lindleys paradox:

The null hypothesis is rejected in a frequentist analysis with a small P-value while the Bayes factor favors H0.

Reason:

Marginal likelihood averages over many possible alternative hypotheses

Vaguely specified alternative hypothesis: average over many implausible hypotheses

Vaguely specified alternative hypothesis ⇒ marginal likelihood under the alternative is low

Concepts in Bayesian Inference 206


Example III.16: Illustration of Lindley's paradox (Press, 2003)

H0: μ = 0 versus Ha: μ ≠ 0 (testing a sharp hypothesis)

I.i.d. yi ∼ N(μ, σ²) (i = 1, . . . , n) (σ known)

P-value = 2[1 − Φ(z_obs)] (z_obs = observed z-statistic)

p(H0 | y) = 1 / {1 + (1 + n)^(−1/2) exp[z²_obs/(2(1 + 1/n))]}   (p(H0) = 0.5 & p(μ) = N(μ | 0, σ²))

For P = 0.05 (z_obs = 1.96):

p(H0 | y) increases from 0.33 (n = 5) to 0.82 (n = 1,000)

Explanation:
Averaging over a large number of unrealistic values under Ha
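A one-line R check of this behaviour (assumed code):

## Assumed R sketch: p(H0 | y) at a fixed P-value of 0.05 for growing n.
z <- 1.96
n <- c(5, 10, 50, 100, 1000)
pH0 <- 1 / (1 + (1 + n)^(-1/2) * exp(z^2 / (2 * (1 + 1/n))))
round(pH0, 2)       # rises from about 0.33 (n = 5) to 0.82 (n = 1000)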

Concepts in Bayesian Inference 207


Testing versus estimation

Frequentist approach: estimation and testing are complementary

Bayesian approach: estimation and testing are NOT complementary


. Testing: Bayesians put positive probabilities on sharp hypotheses
. Estimation: a zero probability is assigned to θ = θ0

. Estimation: priors often do not have a great impact on the posterior conclusion
. Testing: priors MAY have a great impact on the posterior conclusion

Concepts in Bayesian Inference 208


Chapter 4
More than one parameter

Towards practical applications

Concepts in Bayesian Inference 209


4.1 Introduction

In this chapter:

Derivation of multivariate (multi-parameter) posterior and its summary measures

Derivation of marginal posterior distributions

Examples
. (multivariate) Gaussian distribution
. Multinomial distribution data

Bayesian linear and generalized linear regression models

Multivariate sampling approach: Method of Composition

Concepts in Bayesian Inference 210


4.2 Joint versus marginal posterior inference

Joint posterior inference

Let
y = sample of n independent observations
θ = (θ1, θ2, . . . , θd)ᵀ
Likelihood: L(θ | y)
Multivariate prior: p(θ)
Multivariate posterior: p(θ | y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
Posterior mode: θ̂M
Posterior mean: θ̄
HPD region of content (1 − α)

Concepts in Bayesian Inference 211


Marginal posterior inference

Let
. θ = {θ1, θ2}
. Marginal posterior: p(θ1 | y) = ∫ p(θ1, θ2 | y) dθ2

Often θ1 = one-dimensional
Easy to graphically display the marginal posterior
Posterior summary measures based on p(θ1 | y) are convenient in practice
Marginal posterior mean of θ1 = corresponding component of the joint posterior mean

Alternatively: p(θ1 | y) = ∫ p(θ1 | θ2, y) p(θ2 | y) dθ2

θ2 = nuisance parameters ⇒ p(θ1 | y) = get rid of the nuisance parameters

In a non-Bayesian context this is done via the profile likelihood: pL(θ1) = max_{θ2} L(θ1, θ2)

Concepts in Bayesian Inference 212


4.3 The normal distribution with and 2 unknown

Acknowledging that μ and σ² are unknown

Sample y1, . . . , yn of independent observations from N(μ, σ²)

Joint likelihood of (μ, σ²) given y:

L(μ, σ² | y) = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (yi − μ)²]

Three priors:
. No prior knowledge is available
. Previous study is available
. Expert knowledge is available

Concepts in Bayesian Inference 213


4.3.1 No prior knowledge on and 2 is available

Noninformative joint prior p(μ, σ²) ∝ σ⁻² (μ and σ² a priori independent)

Posterior distribution:

p(μ, σ² | y) ∝ σ^(−(n+2)) exp{−[(n − 1)s² + n(ȳ − μ)²]/(2σ²)}

[Figure: joint posterior density of (μ, σ²).]
Concepts in Bayesian Inference 214


Marginal posterior distributions

Marginal posterior distributions are needed in practice

. p(μ | y)
. p(σ² | y)

Calculation of the marginal posterior distributions involves integration:

p(μ | y) = ∫ p(μ, σ² | y) dσ² = ∫ p(μ | σ², y) p(σ² | y) dσ²

Marginal posterior is a weighted sum of conditional posteriors with weights = uncertainty on the other parameter(s)

Concepts in Bayesian Inference 215


Marginal posterior distributions for the normal case

Conditional posterior for μ: p(μ | σ², y) = N(ȳ, σ²/n)

Marginal posterior for μ: p(μ | y) = t_{n−1}(ȳ, s²/n), i.e.

(μ − ȳ)/(s/√n) ∼ t(n − 1)

Marginal posterior for σ²: p(σ² | y) = Inv-χ²(n − 1, s²)

(scaled inverse chi-squared distribution)

(n − 1)s²/σ² ∼ χ²(n − 1)

= special case of IG(α, β) (α = (n − 1)/2, β = (n − 1)s²/2)

Concepts in Bayesian Inference 216


Joint posterior distribution

Joint posterior = multiplication of marginal with conditional posterior:

p(μ, σ² | y) = p(μ | σ², y) p(σ² | y) = N(ȳ, σ²/n) × Inv-χ²(n − 1, s²)

= Normal-scaled-inverse chi-square distribution N-Inv-χ²(ȳ, n, n − 1, s²)

[Figure: joint posterior density of (μ, σ²).]

A posteriori μ and σ² are dependent

Concepts in Bayesian Inference 217


Marginal posterior summary measures for

Posterior mean = mode = median = ȳ

Posterior variance = (n − 1)s²/[n(n − 3)]

95% equal tail credible and HPD interval =

[ȳ − t(0.025; n − 1) s/√n, ȳ + t(0.025; n − 1) s/√n]

Concepts in Bayesian Inference 218


Marginal posterior summary measures for 2

Posterior mean = (n − 1)s²/(n − 3)

Posterior mode = (n − 1)s²/(n + 1)

Posterior median = (n − 1)s²/χ²(0.5; n − 1)

Posterior variance = 2(n − 1)²s⁴/[(n − 3)²(n − 5)]

95% equal tail CI:

[(n − 1)s²/χ²(0.975; n − 1), (n − 1)s²/χ²(0.025; n − 1)]

95% HPD interval = computed iteratively

Concepts in Bayesian Inference 219


Posterior predictive distribution for normal distribution

μ and σ² known: distribution of ỹ = p(ỹ | μ, σ²)

μ and σ² unknown:

p(ỹ | y) = ∫∫ p(ỹ | μ, σ²) p(μ, σ² | y) dμ dσ²

= t_{n−1}(ȳ, s²(1 + 1/n)) distribution

Concepts in Bayesian Inference 220


Example IV.1: SAP study Noninformative prior

. Example III.6: normal range for alp is too narrow


. Joint posterior distribution (see before)
. Marginal posterior distributions:

[Figure: marginal posterior densities of μ and σ².]

Concepts in Bayesian Inference 221


Example IV.1 (continued)

For μ:
. μ̄ = μ̂M = μ_M = 7.11
. σ̄²_μ = 0.0075
. 95% (equal tail and HPD) CI = [6.94, 7.28]

For σ²:
. posterior mean = 1.88, mode = 1.85, median = 1.87
. posterior variance = 0.029
. 95% equal tail CI = [1.58, 2.24], 95% HPD interval = [1.56, 2.22]

PPD = t249(7.11, 1.37)-distribution

95% normal range for alp = [104.1, 513.2] (slightly wider)

Concepts in Bayesian Inference 222


4.3.2 An historical study is available

Posterior of the historical data ≡ prior to the likelihood of the current data

Prior = N-Inv-χ²(μ0, κ0, ν0, τ0²) distribution
. μ0 = ȳ0 & κ0 = n0
. ν0 = n0 − 1 & τ0² = s0²

Posterior = N-Inv-χ²(μ̄, κ̄, ν̄, τ̄²) distribution

. μ̄ = (κ0 μ0 + n ȳ)/(κ0 + n) & κ̄ = κ0 + n
. ν̄ = ν0 + n & ν̄ τ̄² = ν0 τ0² + (n − 1)s² + [κ0 n/(κ0 + n)](ȳ − μ0)²

Again shrinkage towards the prior mean + posterior variance = weighted combination of prior variance, sample variance and the distance between prior and sample mean
⇒ the posterior variance is not necessarily smaller than the prior variance!

Concepts in Bayesian Inference 223


Marginal posterior distributions & PPD

Marginal posterior distributions:

p(μ | σ², y) = N(μ | μ̄, σ²/κ̄)
p(μ | y) = t_ν̄(μ | μ̄, τ̄²/κ̄)
p(σ² | y) = Inv-χ²(σ² | ν̄, τ̄²)

PPD:

p(ỹ | y) = t_ν̄[μ̄, τ̄²(1 + 1/κ̄)]

Concepts in Bayesian Inference 224


Example IV.2: SAP study Conjugate prior

Retrospective study (Topal et al., 2003)


65 healthy subjects

Mean (SD) for y = alp = 5.25 (1.66)
Conjugate prior = N-Inv-2(5.25, 65, 64, 2.76)

Posterior = N-Inv-2(6.72, 315, 314, 2.61)


Posterior mean midway between prior mean & sample mean
Posterior precision 6= prior + sample precision
Posterior variance < prior variance
Posterior variance > sample variance
Posterior informative variance > NI variance
Prior information did not lower posterior uncertainty, reason: conflict of
likelihood with prior

Concepts in Bayesian Inference 225


Example IV.2 (continued)

[Figure: marginal posterior densities of μ and σ² under the conjugate prior.]

Concepts in Bayesian Inference 226


4.3.3 Expert knowledge is available

Expert knowledge available on each parameter separately

Joint prior N(μ0, σ0²) × Inv-χ²(ν0, τ0²) ≠ conjugate

Posterior cannot be derived analytically, but numerical/sampling techniques are available
. For μ: p(μ | σ², y) = N(μ̄, σ̄²) with

μ̄ = (μ0/σ0² + nȳ/σ²) / (1/σ0² + n/σ²)  and  σ̄² = (1/σ0² + n/σ²)⁻¹

. For σ²: the marginal posterior requires integrating μ out of the joint posterior,

p(σ² | y) ∝ Inv-χ²(σ² | ν0, τ0²) ∫ N(μ | μ0, σ0²) Π_{i=1}^n N(yi | μ, σ²) dμ

Concepts in Bayesian Inference 227


4.4 Multivariate distributions

Two distributions:

Multivariate normal distribution + related distributions

Multinomial distribution

Concepts in Bayesian Inference 228


4.4.1 The multivariate normal and related distributions

Multivariate normal distribution (MVN): N(, ) or Np(, )


For a p-dimensional continuous random vector y:

p(y | μ, Σ) = (2π)^(−p/2) |Σ|^(−1/2) exp[−½ (y − μ)ᵀ Σ⁻¹ (y − μ)]
Properties:

. Marginal distributions are normal


. Conditional distributions are normal
. Distributions of linear combinations of y are normal

Concepts in Bayesian Inference 229


Related distributions 1

Multivariate Students t-distribution: T (, )


For a p-dimensional continuous random vector y:

p(y | ν, μ, Σ) = {Γ[(ν + p)/2] / [Γ(ν/2) (νπ)^(p/2)]} |Σ|^(−1/2) [1 + (1/ν)(y − μ)ᵀ Σ⁻¹ (y − μ)]^(−(ν+p)/2)

Properties:

. Heavier tails than the MVN distribution


. Posterior in a classical Bayesian regression model (see below)
. Also used as a robust data distribution
. Multivariate extension of location-scale t-distribution with degrees of freedom

Concepts in Bayesian Inference 230


Related distributions 2

Wishart distribution: Wishart(, )


For a p × p-dimensional random matrix S:

p(S) = c |Σ|^(−ν/2) |S|^((ν−p−1)/2) exp[−½ tr(Σ⁻¹ S)]

with c⁻¹ = 2^(νp/2) π^(p(p−1)/4) Π_{j=1}^p Γ[(ν + 1 − j)/2]

Properties:

. Extension of the χ²(ν) distribution: for p = 1, S/σ² ∼ χ²(ν)

. Inverse Wishart distribution IW(D, ν):
R ∼ IW(D, ν) ⇔ R⁻¹ ∼ Wishart(D, ν) with D a precision matrix

Concepts in Bayesian Inference 231


4.4.2 The multinomial distribution

Multinomial distribution: Mult(n, θ)

For y = (y1, . . . , yk)ᵀ a vector of frequencies falling into k classes:

p(y | θ) = [n!/(y1! y2! · · · yk!)] Π_{j=1}^k θⱼ^(yⱼ)

with n = Σ_{j=1}^k yⱼ, θ = (θ1, . . . , θk)ᵀ, θⱼ > 0 (j = 1, . . . , k), Σ_{j=1}^k θⱼ = 1

Properties:

. Binomial distribution = special case of the multinomial distribution with k = 2

. Marginal distribution of yⱼ = binomial distribution Bin(n, θⱼ)
. Conditional distribution of the subvector y_S = {ym : m ∈ S} = Mult(y_S, θ_S), with θ_S = {θⱼ / Σ_{m∈S} θm : j ∈ S}

Concepts in Bayesian Inference 232


Example IV.3: Young adult study Smoking and alcohol drinking

Study examining life style among young adults


Smoking
Alcohol No Yes
No-Mild 180 41
Moderate-Heavy 216 64
Total 396 105

Of interest: association between smoking & alcohol-consumption

Concepts in Bayesian Inference 233


Example IV.3 (continued)

2×2 contingency table ⇒ multinomial model Mult(n, θ)

θ = {θ11, θ12, θ21, θ22 = 1 − θ11 − θ12 − θ21}

y = {y11, y12, y21, y22} and Σ_{i,j} θij = 1

Mult(n, θ) = [n!/(y11! y12! y21! y22!)] θ11^(y11) θ12^(y12) θ21^(y21) θ22^(y22)

Concepts in Bayesian Inference 234


Example IV.3 (continued)

Conjugate prior to the multinomial distribution = Dirichlet prior Dir(α):

p(θ) ∝ (1/B(α)) Π_{i,j} θij^(αij − 1)

α = {α11, α12, α21, α22}

B(α) = Π_{i,j} Γ(αij) / Γ(Σ_{i,j} αij)

Posterior distribution = Dir(α + y)

Note:
Dirichlet distribution = extension of the beta distribution to higher dimensions
Marginal distributions of a Dirichlet distribution = beta distributions

Concepts in Bayesian Inference 235


Example IV.3 (continued)

Association between smoking and alcohol consumption:

ψ = θ11 θ22 / (θ12 θ21)

Needed: p(ψ | y), but difficult to derive analytically

Alternatively, replace analytical calculations by a sampling procedure:

. Wij (i, j = 1, 2) distributed independently as Gamma(αij, 1)
. T = Σ_{i,j} Wij
. Zij = Wij/T have a Dir(α) distribution

Concepts in Bayesian Inference 236


Example IV.3 (continued)

Analysis of contingency table:

Prior distribution: Dir(1, 1, 1, 1)

Posterior distribution: Dir(180+1, 41+1,216+1, 64+1)

Sample of 10,000 generated values for the parameters

95% equal tail CI for ψ: [0.839, 2.014]

Equal to classically obtained estimate
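A hedged R sketch of this gamma-based Dirichlet sampler for the posterior Dir(181, 42, 217, 65):

## Assumed R sketch: sampling Dir(181, 42, 217, 65) via independent gammas
## and computing the cross-ratio psi.
set.seed(1)
alpha <- c(181, 42, 217, 65)             # Dir(1, 1, 1, 1) prior + observed counts
K <- 10000
W <- matrix(rgamma(4 * K, shape = rep(alpha, each = K)), nrow = K)
Z <- W / rowSums(W)                      # each row is a draw from the Dirichlet
psi <- Z[, 1] * Z[, 4] / (Z[, 2] * Z[, 3])
quantile(psi, c(0.025, 0.975))           # close to [0.839, 2.014]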

Concepts in Bayesian Inference 237


Example IV.3 (continued)

[Figure: sampled marginal posterior densities of the cell probabilities θij and of ψ.]

Concepts in Bayesian Inference 238


Example IV.4: A re-analysis of Example III.9

Case-control study (Kaldor et al., 1990) to examine the impact of chemotherapy


on leukaemia in Hodgkins survivors

149 cases (leukaemia) and 411 controls

Question: Does chemotherapy induce excess risk of developing solid tumors,


leukaemia and/or lymphomas?

Treatment Controls Cases


No Chemo 160 11
Chemo 251 138
Total 411 149

Concepts in Bayesian Inference 239


Example IV.4 (continued)

y = {251, 160, 138, 11}

Likelihood = product of binomial likelihoods:

L(θ1, θ2 | y) ∝ θ1^251 (1 − θ1)^160 θ2^138 (1 − θ2)^11

Concepts in Bayesian Inference 240


Example IV.4 (continued)

Howard (1998): comparing proportions from 2 independent binomial distributions

Check hypothesis H1: θ2 < θ1 versus H2: θ1 < θ2

Classical frequentist tests (Fisher's exact test, chi-square test, etc.) can be reproduced by Bayesian tests

A dependent prior p(θ1, θ2) is more natural than the product of p(θ1) and p(θ2)

Concepts in Bayesian Inference 241


Example IV.4 (continued)

Here: examine the effect of different independent priors on p(θ2 − θ1 | y)

Example of a sensitivity analysis

Considered priors (products of beta distributions):

. Uniform prior for θ1 and θ2
. Jeffreys prior: p(θ1, θ2) ∝ θ1^(−1/2) (1 − θ1)^(−1/2) θ2^(−1/2) (1 − θ2)^(−1/2)
. Haldane prior: p(θ1, θ2) ∝ θ1⁻¹ (1 − θ1)⁻¹ θ2⁻¹ (1 − θ2)⁻¹

Posteriors = products of beta distributions

A posteriori θ1 and θ2 are independent

Concepts in Bayesian Inference 242


Example IV.4 (continued)

To obtain p(θ2 − θ1 | y):

sample from p(θ2 | y) and p(θ1 | y) and take the difference of each pair of sampled values

Results: 95% equal tail CIs for θ2 − θ1:

. Uniform prior: [0.244, 0.372]
. Jeffreys prior: [0.253, 0.378]
. Haldane prior: [0.256, 0.381]

Concepts in Bayesian Inference 243


4.5 Frequentist properties of Bayesian inference

Not of prime interest to a Bayesian to know the sampling properties of Bayesian


estimators

But: good frequentist properties of Bayesian estimators adds to their credibility

For instance: interval estimators (correct coverage)

Bayesian approach offers alternative interval estimators may be also useful in


frequentist calculations
. Agresti and Min (2005): best frequentist properties for odds ratio when
Jeffreys prior for the binomial parameters is taken
. Rubin (1984): other examples where the Bayesian 100(1-)% CI gives at least
100(1-)% coverage even when the prior distribution is chosen incorrectly

Concepts in Bayesian Inference 244


4.6 The Method of Composition

A method to yield a random sample from a multivariate distribution

Stagewise approach

Based on factorization of the joint distribution into a marginal & several conditionals:

p(θ1, . . . , θd | y) = p(θd | y) p(θ_{d−1} | θd, y) · · · p(θ1 | θ2, . . . , θd, y)

Sampling approach:
. Sample θ̃d from p(θd | y)
. Sample θ̃_{d−1} from p(θ_{d−1} | θ̃d, y)
. ...
. Sample θ̃1 from p(θ1 | θ̃2, . . . , θ̃d, y)

Concepts in Bayesian Inference 245


Sampling from N(μ, σ²), both parameters unknown

Sampling approach for the posterior p(μ, σ² | y) of the normal model N(μ, σ²)

Three cases:
. No prior knowledge
. Historical data available
. Expert knowledge available

Concepts in Bayesian Inference 246


Case 1: No prior knowledge on and 2

Sample from p(μ, σ² | y): sample from p(σ² | y) & then from p(μ | σ², y)

1. Sample from p(σ² | y):

Sample χ̃ₖ from a χ²(n − 1) distribution

Solve σ̃²ₖ from (n − 1)s²/σ̃²ₖ = χ̃ₖ

2. Sample from p(μ | σ², y):

Sample μ̃ₖ from a N(ȳ, σ̃²ₖ/n) distribution

⇒ μ̃1, . . . , μ̃K = random sample from p(μ | y) (t_{n−1}(ȳ, s²/n) distribution)

Concepts in Bayesian Inference 247


Case 1 (continued)

To sample from the posterior predictive distribution p(ỹ | y), 2 approaches:

1. Sample directly from the t_{n−1}(ȳ, s²(1 + 1/n)) distribution

2. Use the Method of Composition:

. Sample σ̃²ₖ from Inv-χ²(σ² | n − 1, s²)
. Sample μ̃ₖ from N(μ | ȳ, σ̃²ₖ/n)
. Sample ỹₖ from N(y | μ̃ₖ, σ̃²ₖ)
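A short R sketch of this Method of Composition (assumed code; n = 250, ȳ = 7.11 and s² ≈ 1.88 are taken from the SAP summaries in the slides):

## Assumed R sketch: Method of Composition for (mu, sigma2) and for ytilde,
## using the SAP summaries n = 250, ybar = 7.11, s2 ~ 1.88 as inputs.
set.seed(1)
n <- 250; ybar <- 7.11; s2 <- 1.88; K <- 1000
sigma2 <- (n - 1) * s2 / rchisq(K, n - 1)    # draws from Inv-chi2(n-1, s2)
mu     <- rnorm(K, ybar, sqrt(sigma2 / n))   # draws from p(mu | sigma2, y)
ytilde <- rnorm(K, mu, sqrt(sigma2))         # draws from the PPD
quantile(mu, c(0.025, 0.975))
quantile(ytilde, c(0.025, 0.975))            # 95% range on the transformed scale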

Concepts in Bayesian Inference 248


Example IV.5: SAP study Sampling the posterior with NI prior

Sampled posterior distributions on next page (K = 1000)

Posterior mean (95% confidence interval)


μ: 7.11 ([7.106, 7.117])
σ²: 1.88 ([1.869, 1.890])

95% equal tail CI:

μ: [6.95, 7.27]
σ²: [1.58, 2.23]

Concepts in Bayesian Inference 249


Example IV.5 (continued)

[Figure: (a) sampled posterior of σ², (b) sampled posterior of μ, (c) scatterplot of the sampled (μ, σ²) pairs, (d) sampled posterior predictive distribution of ỹ.]

Concepts in Bayesian Inference 250


Case 2: Historical data are available

Same procedure as before!

Concepts in Bayesian Inference 251


Case 3: Expert knowledge is available

Problem: p(σ² | y) does not have a known distribution

For a given σ̃², sampling μ̃ is straightforward

Concepts in Bayesian Inference 252


Example IV.6: SAP study Sampling posterior with product of Inform priors


Priors for y (transformed alp):
μ ∼ N(5.25, 2.75/65) & σ² ∼ Inv-χ²(64, 2.75)

Method of Composition:
. Stage I: sample σ² from

p(σ² | y) ∝ Inv-χ²(σ² | ν0, τ0²) ∫ N(μ | μ0, σ0²) Π_{i=1}^n N(yi | μ, σ²) dμ

p(σ² | y) evaluated on a grid ⇒ mean and variance
Approximating distribution q(σ²) ≡ Inv-χ²(σ² | 294.2, 2.12)
Weighted resampling
. Stage II: sample μ from a normal distribution

Concepts in Bayesian Inference 253


Example IV.6 (continued)

[Figure: marginal posteriors of (a) σ² and (b) μ under the expert priors, compared with the conjugate-prior analysis.]

Concepts in Bayesian Inference 254


Example IV.6 (continued)

PPD can be obtained by sampling:

Sample ỹ from N(μ̃, σ̃²)

Based on the sample:
. 95% normal range for y: [4.05, 9.67]
. 95% normal range for alp: [106.84, 609.70]

Concepts in Bayesian Inference 255


4.7 Bayesian linear regression models

Frequentist multiple linear regression analysis

Non-informative Bayesian multiple linear regression analysis

Multivariate posterior summary measures

Sampling from the posterior

Informative Bayesian multiple linear regression analysis

Concepts in Bayesian Inference 256


4.7.1 The frequentist approach to linear regression

Classical regression model: y = Xβ + ε

. y = n × 1 vector of independent responses

. X = n × (d + 1) design matrix
. β = (d + 1) × 1 vector of regression parameters
. ε = n × 1 vector of random errors ∼ N(0, σ² I)

Likelihood:

L(β, σ² | y, X) = (2πσ²)^(−n/2) exp[−(1/(2σ²))(y − Xβ)ᵀ(y − Xβ)]

. MLE = LSE of β: β̂ = (XᵀX)⁻¹Xᵀy
. Residual sum of squares: S = (y − Xβ̂)ᵀ(y − Xβ̂)
. Mean residual sum of squares: s² = S/(n − d − 1)

Concepts in Bayesian Inference 257


Example IV.7: Osteoporosis study: a frequentist linear regression analysis

. Cross-sectional study (Boonen et al., 1996)


. 245 healthy elderly women in a geriatric hospital
. Aim: Find determinants for osteoporosis
. Average age women = 75 yrs with a range of 70-90 yrs
. Marker for osteoporosis = tbbmc (in kg) measured for 234 women
. Simple linear regression model: regressing tbbmc on bmi
. Classical frequentist regression analysis:
β̂0 = 0.813 (0.12)
β̂1 = 0.0404 (0.0043)
s² = 0.29, with n − d − 1 = 232
corr(β̂0, β̂1) = −0.99

Concepts in Bayesian Inference 258


Example IV.7 (continued)

[Figure: scatterplot of TBBMC (kg) versus BMI (kg/m²).]

Concepts in Bayesian Inference 259


4.7.2 A noninformative Bayesian linear regression model

Bayesian linear regression model = prior information on the regression parameters & the residual variance + normal regression likelihood

Noninformative prior for (β, σ²): p(β, σ²) ∝ σ⁻²

Notation: omit the design matrix X

Posterior distributions:

p(β, σ² | y) = N_{d+1}[β | β̂, σ²(XᵀX)⁻¹] × Inv-χ²(σ² | n − d − 1, s²)
p(β | σ², y) = N_{d+1}[β | β̂, σ²(XᵀX)⁻¹]
p(σ² | y) = Inv-χ²(σ² | n − d − 1, s²)
p(β | y) = t_{n−d−1}[β | β̂, s²(XᵀX)⁻¹]

Concepts in Bayesian Inference 260


4.7.3 Posterior summary measures for the linear regression model

Posterior summary measures of

(a) the regression parameters β
(b) the residual variance σ²

Univariate posterior summary measures:

. Marginal posterior mean (mode, median) of βj: MLE (LSE) β̂j
. 95% HPD interval for βj: β̂j ± s [(XᵀX)⁻¹]_{jj}^{1/2} t(0.025; n − d − 1)
. Marginal posterior mode of σ² = [(n − d − 1)/(n − d + 1)] s² ≠ MLE of σ²
. Posterior mean of σ²: [(n − d − 1)/(n − d − 3)] s²

. 95% HPD interval for σ²: algorithm on the Inv-χ²(n − d − 1, s²) distribution

Concepts in Bayesian Inference 261


Multivariate posterior summary measures

Multivariate posterior summary measures for β:

Posterior mean (mode) of β = β̂ (MLE = LSE)

100(1−α)% HPD region:

C(α) = {β : (β − β̂)ᵀ(XᵀX)(β − β̂) ≤ (d + 1) s² F_α(d + 1, n − d − 1)}

Contour probability for H0: β = β0

Concepts in Bayesian Inference 262


Posterior predictive distribution

PPD of ỹ at covariate vector x̃:
a t-distribution with (n − d − 1) degrees of freedom with
location parameter: β̂ᵀx̃
scale parameter: s²[1 + x̃ᵀ(XᵀX)⁻¹x̃]

Given y:  (ỹ − β̂ᵀx̃) / (s √(1 + x̃ᵀ(XᵀX)⁻¹x̃)) ∼ t_{n−d−1}

How to sample?
. Directly from the t-distribution
. Method of Composition

Concepts in Bayesian Inference 263


4.7.4 Sampling from the posterior distribution

Method of Composition

p( | y) = multivariate t-distribution: how to sample from it?

p( | 2, y) = multivariate normal distribution

Sample in two steps

Concepts in Bayesian Inference 264


Example IV.8: Osteoporosis study Sampling with Method of Composition

Sample σ̃² from p(σ² | y) = Inv-χ²(σ² | n − d − 1, s²)

Sample β̃ from p(β | σ̃², y) = N_{d+1}[β | β̂, σ̃²(XᵀX)⁻¹]

Sampled mean regression vector = (0.816, 0.0403)

95% equal tail CIs: β0: [0.594, 1.040] & β1: [0.0317, 0.0486]

Contour probability for H0: β1 = 0: < 0.001

Marginal posterior of (β0, β1) has a ridge (r(β0, β1) = −0.99)

Concepts in Bayesian Inference 265


Example IV.8 (continued)

Distribution of a future observation at bmi = 30

Sample a future observation ỹ from N(μ̃30, σ̃²30) with:

. μ̃30 = β̃ᵀ(1, 30)ᵀ
. σ̃²30 = σ̃²[1 + (1, 30)(XᵀX)⁻¹(1, 30)ᵀ]

Sampled mean and standard deviation = 2.033 and 0.282

Concepts in Bayesian Inference 266


Example IV.8 (continued)

[Figure: sampled marginal posteriors of (a) β0 and (b) β1, (c) scatterplot of (β0, β1) showing the ridge, and (d) sampled distribution of a future observation ỹ at bmi = 30.]

Concepts in Bayesian Inference 267


4.8 Bayesian generalized linear models

Generalized Linear Model (GLIM): extension of the linear regression model to a wide
class of regression models

Distributional part

Link function

Variance function

Bayesian Generalized Linear Model (BGLIM): GLIM + priors on parameters

Concepts in Bayesian Inference 268


Components of GLIM

Distributional part:

p(y | θ; φ) = exp{[yθ − b(θ)]/a(φ) + c(y; φ)}, with a(·), b(·), c(·) known functions

Often a(φ) = φ/w, with w a prior weight. For φ known and w = 1:
. E(y) ≡ μ = db(θ)/dθ
. Var(y) = a(φ) V(μ) with V(μ) = d²b(θ)/dθ²

Link function: g(μ) = η = xᵀβ, g = monotone (differentiable) function

When η = θ ⇒ link function = canonical, h = g⁻¹

Variance function: φ = extra dispersion or scale parameter

Var(y) = a(φ) V(μ) can depend on covariates via μ

Concepts in Bayesian Inference 269


Special cases of a GLIM

Independent yi (i = 1, . . . , n): p(yi | θi; φ) with E(yi) = μi & g(μi) = xiᵀβ

Distributional part of a GLIM = example of the one-parameter exponential family in the canonical parameter (but with a different notation than before)

Examples of a GLIM:
. Normal linear regression model with a normal distribution yi ∼ N(μi, σ²), identity link (g(μi) = μi), φ = 1 and V(μi) = σ² assumed known
. Poisson regression model with the Poisson distribution yi ∼ Poisson(μi), log link (g(μi) = log(μi)), φ = 1 and V(μi) = μi
. Logistic regression model with the Bernoulli (or binomial) distribution yi ∼ Bern(μi), logit link (g(μi) = logit(μi)), φ = 1 and V(μi) = μi(1 − μi)

Concepts in Bayesian Inference 270


4.8.1 More complex regression models

Considered multiparameter models are limited


. Gamma/Weibull distribution for alp?
. Censored/truncated data?
. Logistic/Cox regression?

Postpone to MCMC techniques

Concepts in Bayesian Inference 271


Chapter 5
Markov chain Monte Carlo sampling

Finally the real work can start

Concepts in Bayesian Inference 272


5.1 Introduction

. Solving the posterior distribution analytically is often not feasible due to the
difficulty in determining the integration constant
. Computing the integral using numerical integration methods is a practical
alternative if only a few parameters are involved
New computational approach is needed

. Sampling is the way to go!


. With Markov chain Monte Carlo (MCMC) methods:
1. Gibbs sampler
2. Metropolis-(Hastings) algorithm

MCMC approaches have revolutionized Bayesian methods!

Concepts in Bayesian Inference 273


5.2 The Gibbs sampler

Gibbs Sampler: introduced by Geman and Geman (1984) in the context of


image-processing for the estimation of the parameters of the Gibbs distribution

Gelfand and Smith (1990) introduced Gibbs sampling to tackle complex


estimation problems in a Bayesian manner

Concepts in Bayesian Inference 274


5.2.1 The bivariate Gibbs sampler 1

Method of Composition:

p(θ1, θ2 | y) is completely determined by:

. marginal p(θ2 | y)
. conditional p(θ1 | θ2, y)

Split-up yields a simple way to sample from joint distribution

Concepts in Bayesian Inference 275


The bivariate Gibbs sampler 2

Gibbs sampling:

p(θ1, θ2 | y) is completely determined by:

. conditional p(θ2 | θ1, y)
. conditional p(θ1 | θ2, y)

This property yields another simple way to sample from the joint distribution:

. Take starting values θ1⁰ and θ2⁰ (only one of the two is needed)
. Given θ1ᵏ and θ2ᵏ at iteration k, generate the (k + 1)-th value according to the iterative scheme:
1. Sample θ1^(k+1) from p(θ1 | θ2ᵏ, y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), y)

Concepts in Bayesian Inference 276


The bivariate Gibbs sampler 3

Result of Gibbs sampling:

Chain of vectors: θᵏ = (θ1ᵏ, θ2ᵏ)ᵀ, k = 1, 2, . . .

. Consists of dependent elements
. Markov property: p(θ^(k+1) | θᵏ, θ^(k−1), . . . , y) = p(θ^(k+1) | θᵏ, y)

Chain depends on the starting value ⇒ initial portion/burn-in part must be discarded

Under mild conditions: sample from the posterior distribution = target distribution

From k0 on: summary measures calculated from the chain consistently estimate the true posterior measures

⇒ the Gibbs sampler is called a Markov chain Monte Carlo method

Concepts in Bayesian Inference 277


Example VI.1: SAP study Gibbs sampling the posterior with NI priors 1

Example IV.5: sampling from posterior distribution of the normal likelihood based
on 250 alp measurements of healthy patients with NI prior for both parameters

Now using Gibbs sampler

Determine the two conditional distributions:

1. p(μ | σ², y): N(μ | ȳ, σ²/n)
2. p(σ² | μ, y): Inv-χ²(σ² | n, s²_μ) with s²_μ = (1/n) Σ_{i=1}^n (yi − μ)²

Iterative procedure: at iteration (k + 1):

1. Sample μ^(k+1) from N(ȳ, (σ²)ᵏ/n)
2. Sample (σ²)^(k+1) from Inv-χ²(n, s²_{μ^(k+1)})
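A hedged R sketch of this Gibbs sampler; since the alp data themselves are not listed in the slides, a simulated stand-in data set with comparable summary statistics is used:

## Assumed R sketch of this Gibbs sampler; the data are a simulated stand-in.
set.seed(1)
y <- rnorm(250, mean = 7.11, sd = 1.37); n <- length(y)
K <- 1500
mu <- numeric(K); sigma2 <- numeric(K); sigma2[1] <- var(y)
for (k in 1:(K - 1)) {
  mu[k + 1]     <- rnorm(1, mean(y), sqrt(sigma2[k] / n))
  s2mu          <- mean((y - mu[k + 1])^2)
  sigma2[k + 1] <- n * s2mu / rchisq(1, n)        # draw from Inv-chi2(n, s2mu)
}
keep <- 501:K                                     # discard the burn-in part
c(mean(mu[keep]), mean(sigma2[keep]))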

Concepts in Bayesian Inference 278


Example VI.1: SAP study Gibbs sampling the posterior with NI priors 2

[Figure: zigzag path of the Gibbs sampler in the (μ, σ²)-plane.]

Zigzag pattern in the (μ, σ²)-plane

1 complete step = 2 substeps (blue = genuine element)
Burn-in = 500, total chain = 1,500

Concepts in Bayesian Inference 279


Example VI.1: SAP study Gibbs sampling the posterior with NI priors 3

[Figure: sampled marginal posteriors of (a) μ and (b) σ².]

Solid lines = true posterior distributions

Concepts in Bayesian Inference 280


Example VI.2: Sampling from a discrete continuous distribution 1
Joint distribution: f(x, y) ∝ C(n, x) y^(x+α−1) (1 − y)^(n−x+β−1)
x a discrete random variable taking values in {0, 1, . . . , n}
y a random variable on the unit interval
α, β > 0 parameters

Question: f(x)? Use Gibbs sampling

Determine the two conditional distributions:

1. f(x | y): Bin(n, y)
2. f(y | x): Beta(x + α, n − x + β)

Iterative procedure: at iteration (k + 1):

1. Sample x^(k+1) from Bin(n, yᵏ)
2. Sample y^(k+1) from Beta(x^(k+1) + α, n − x^(k+1) + β)

Concepts in Bayesian Inference 281


Example VI.2: Sampling from a discrete continuous distribution 2

[Figure: sampled marginal distribution of x.]

Solid line = true posterior distribution


Burn-in = 500, total chain = 1,500

Concepts in Bayesian Inference 282


Example VI.3: SAP study Gibbs sampling the posterior with I priors 1

Example VI.1: now with independent informative priors (semi-conjugate prior):

μ ∼ N(μ0, σ0²)
σ² ∼ Inv-χ²(ν0, τ0²)

Posterior:

p(μ, σ² | y) ∝ exp[−(μ − μ0)²/(2σ0²)] × (σ²)^(−(ν0/2+1)) e^(−ν0τ0²/(2σ²)) × (1/σⁿ) Π_{i=1}^n exp[−(yi − μ)²/(2σ²)]

∝ Π_{i=1}^n exp[−(yi − μ)²/(2σ²)] exp[−(μ − μ0)²/(2σ0²)] (σ²)^(−((n+ν0)/2+1)) e^(−ν0τ0²/(2σ²))

Concepts in Bayesian Inference 283


Example VI.3: SAP study Gibbs sampling the posterior with I priors 2

Determine the two conditional distributions:

1. p(μ | σ², y) ∝ Π_{i=1}^n exp[−(yi − μ)²/(2σ²)] exp[−(μ − μ0)²/(2σ0²)] ≡ N(μ̄ᵏ, (σ̄²)ᵏ)
2. p(σ² | μ, y): Inv-χ²(ν0 + n, [Σ_{i=1}^n (yi − μ)² + ν0τ0²]/(ν0 + n))

Iterative procedure: at iteration (k + 1):

1. Sample μ^(k+1) from N(μ̄ᵏ, (σ̄²)ᵏ)
2. Sample (σ²)^(k+1) from Inv-χ²(ν0 + n, [Σ_{i=1}^n (yi − μ^(k+1))² + ν0τ0²]/(ν0 + n))

Concepts in Bayesian Inference 284


Example VI.3: SAP study Trace plots

[Figure: trace plots of (a) μ and (b) σ².]

Concepts in Bayesian Inference 285


5.2.2 The general Gibbs sampler 1

Starting position θ⁰ = (θ1⁰, . . . , θd⁰)ᵀ

Multivariate version of the Gibbs sampler, iteration (k + 1):

1. Sample θ1^(k+1) from p(θ1 | θ2ᵏ, . . . , θ_{d−1}ᵏ, θdᵏ, y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), θ3ᵏ, . . . , θdᵏ, y)
...
d. Sample θd^(k+1) from p(θd | θ1^(k+1), . . . , θ_{d−1}^(k+1), y)

Concepts in Bayesian Inference 286


5.2.3 The general Gibbs sampler 2

Full conditional distributions: p(θj | θ1, . . . , θ_{j−1}, θ_{j+1}, . . . , θd, y)

Also called: full conditionals

Under mild regularity conditions:

θᵏ, θ^(k+1), . . . ultimately are observations from the posterior distribution

Concepts in Bayesian Inference 287


Example VI.4: British coal mining disasters data 1

. British coal-mining disasters data set: # severe accidents in British coal mines
from 1851 to 1962
. Decrease in frequency of disasters from year 40 (+ 1850) onwards?
[Figure: number of severe coal-mining disasters per year (1850 + year).]

Concepts in Bayesian Inference 288


Example VI.4: British coal mining disasters data 2

Statistical model:

Likelihood: Poisson process with a change point at k:

. yi ∼ Poisson(θ) for i = 1, . . . , k
. yi ∼ Poisson(λ) for i = k + 1, . . . , n (n = 112)

Priors:
. θ: Gamma(a1, b1), (a1, b1 parameters)
. λ: Gamma(a2, b2), (a2, b2 parameters)
. k: p(k) = 1/n

. b1: Gamma(c1, d1), (c1, d1 parameters)

. b2: Gamma(c2, d2), (c2, d2 parameters)

Concepts in Bayesian Inference 289


Example VI.4: British coal mining disasters data 3

Full conditionals:

p(θ | y, λ, b1, b2, k) = Gamma(a1 + Σ_{i=1}^k yi, k + b1)
p(λ | y, θ, b1, b2, k) = Gamma(a2 + Σ_{i=k+1}^n yi, n − k + b2)
p(b1 | y, θ, λ, b2, k) = Gamma(a1 + c1, θ + d1)
p(b2 | y, θ, λ, b1, k) = Gamma(a2 + c2, λ + d2)
p(k | y, θ, λ, b1, b2) = π(y | k, θ, λ) / Σ_{j=1}^n π(y | j, θ, λ)

with π(y | k, θ, λ) = exp[k(λ − θ)] (θ/λ)^(Σ_{i=1}^k yi)

a1 = a2 = 0.5, c1 = c2 = 0, d1 = d2 = 1

Concepts in Bayesian Inference 290


Example VI.4: British coal mining disasters data 4

[Figure: sampled marginal posteriors of θ, λ and k.]

a1 = a2 = 0.5, c1 = c2 = 0, d1 = d2 = 1
Posterior mode of k: 1891
Posterior mean of θ/λ = 3.42 with 95% CI = [2.48, 4.59]

Concepts in Bayesian Inference 291


Example VI.5: Osteoporosis study Using the Gibbs sampler 1

Bayesian linear regression model with NI priors:

. Regression model: tbbmcᵢ = β0 + β1 bmiᵢ + εᵢ (i = 1, . . . , n = 234)

. Priors: p(β0, β1, σ²) ∝ σ⁻²
. Notation: y = (tbbmc1, . . . , tbbmc234)ᵀ, x = (bmi1, . . . , bmi234)ᵀ

Full conditionals:

p(σ² | β0, β1, y) = Inv-χ²(n, s²_β)
p(β0 | σ², β1, y) = N(r_{β1}, σ²/n)
p(β1 | σ², β0, y) = N(r_{β0}, σ²/xᵀx)

with
s²_β = (1/n) Σ (yi − β0 − β1 xi)²
r_{β1} = (1/n) Σ (yi − β1 xi)
r_{β0} = Σ (yi − β0) xi / xᵀx

Concepts in Bayesian Inference 292


Example VI.5: Osteoporosis study Using the Gibbs sampler 2

Method of Composition:
Parameter   2.5%    25%     50%     75%     97.5%   Mean    SD
β0          0.57    0.74    0.81    0.89    1.05    0.81    0.12
β1          0.032   0.038   0.040   0.043   0.049   0.040   0.004
σ²          0.069   0.078   0.083   0.088   0.100   0.083   0.008

Gibbs sampler:
Parameter   2.5%    25%     50%     75%     97.5%   Mean    SD
β0          0.67    0.77    0.84    0.91    1.10    0.77    0.11
β1          0.030   0.036   0.040   0.042   0.046   0.039   0.0041
σ²          0.069   0.077   0.083   0.088   0.099   0.083   0.0077

Method of Composition = 1,000 independently sampled values

Gibbs sampler: burn-in = 500, total chain = 1,500

Concepts in Bayesian Inference 293


Example VI.5: Osteoporosis study Index plot from Method of Composition

[Figure: index plots of (a) β1 and (b) σ² from the Method of Composition.]

Concepts in Bayesian Inference 294


Example VI.5: Osteoporosis study Trace plot from Gibbs sampler

[Figure: trace plots of (a) β1 and (b) σ² from the Gibbs sampler.]

Concepts in Bayesian Inference 295


Example VI.5: Osteoporosis study Trace plot

Comparison of index plot with trace plot shows:

σ²: index plot and trace plot similar ⇒ (almost) independent sampling

β1: trace plot shows slow mixing ⇒ quite dependent sampling

Method of Composition and Gibbs sampling: similar posterior measures for σ²

Method of Composition and Gibbs sampling: less similar posterior measures for β1

Concepts in Bayesian Inference 296


Example VI.5: Osteoporosis study Autocorrelation

Autocorrelation:
. Autocorrelation of lag 1: correlation of β1ᵏ with β1^(k−1) (k = 1, . . .)
. Autocorrelation of lag 2: correlation of β1ᵏ with β1^(k−2) (k = 1, . . .)
...
. Autocorrelation of lag m: correlation of β1ᵏ with β1^(k−m) (k = 1, . . .)

High autocorrelation:

burn-in part is larger takes longer to forget initial positions


remaining part needs to be longer to obtain stable posterior measures

Concepts in Bayesian Inference 297


5.2.4 Remarks

Full conditionals determine joint distribution

Generate joint distribution from full conditionals

Transition kernel

Concepts in Bayesian Inference 298


Remarks Full conditionals determine joint

Full conditionals determine the joint distribution: see Besag (1974) and
Hammersley and Clifford (1971)

Proofs in Robert and Casella (2004) for bivariate case (Theorem 9.3) and for
general case (Theorem 10.5)

Proof bivariate case:

1. p(θ1, θ2) = p(θ2 | θ1) p1(θ1) = p(θ1 | θ2) p2(θ2)
2. ∫ [p(θ2 | θ1)/p(θ1 | θ2)] dθ2 = ∫ [p2(θ2)/p1(θ1)] dθ2 = 1/p1(θ1)

⇒ p(θ1, θ2) = p(θ2 | θ1) / ∫ [p(θ2 | θ1)/p(θ1 | θ2)] dθ2

Concepts in Bayesian Inference 299


Remarks Generate joint from full conditionals

That the joint distribution exists is not enough to determine the joint

Bivariate case (Casella and George, 1992): compute p(θ1, θ2) from p(θ1 | θ2) & p(θ2 | θ1)

1. p1(θ1) = ∫ p(θ1 | θ2) p2(θ2) dθ2 and similarly for θ2
2. p1(θ1) = ∫ p(θ1 | θ2) [∫ p(θ2 | θ1′) p1(θ1′) dθ1′] dθ2
          = ∫ [∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2] p1(θ1′) dθ1′
          = ∫ K1(θ1, θ1′) p1(θ1′) dθ1′

with K1(θ1, θ1′) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2

Concepts in Bayesian Inference 300


If the conditionals are known, then p1(θ1) can be solved by finding the (fixed point) solution of an integral equation

Gibbs sampler = stochastic version of this iterative algorithm

Concepts in Bayesian Inference 301


Remarks Transition kernel

Transition kernel/function: engine that generates a move from θᵏ to θ^(k+1)

Gibbs sampler:
K(θ, θ′) = p(θ1′ | θ2, . . . , θd) p(θ2′ | θ1′, θ3, . . . , θd) · · · p(θd′ | θ1′, . . . , θ_{d−1}′)

K1(θ1, θ1′) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2:
transition kernel expressing that a move from θ1′ to θ1 can be made via all possible values of θ2

Concepts in Bayesian Inference 302


5.2.5 Review of Gibbs sampling approaches

Sampling the full conditionals is done via different algorithms depending on:
. Shape of full conditional (classical versus general purpose algorithm)
. Preference of software developer:
SASr procedures GENMOD, LIFEREG and PHREG: ARMS algorithm
WinBUGS: variety of samplers

Several versions of the basic Gibbs sampler:


. Deterministic- or systematic scan Gibbs sampler: d dims visited in fixed order
. Random-scan Gibbs sampler: d dims visited in random order
. Reversible Gibbs sampler: d dims visited in order and reversed order
. Block Gibbs sampler: d dims split up into m blocks of parameters and Gibbs
sampler applied to blocks

Concepts in Bayesian Inference 303


Review of Gibbs sampling approaches The block Gibbs sampler

Block Gibbs sampler:

Normal linear regression (a sketch of the two blocks follows below):

. p(σ2 | β0, β1, y)
. p(β0, β1 | σ2, y)

May speed up convergence considerably

WinBUGS: blocking option on

SAS® procedure MCMC: allows the user to specify the blocks
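
A rough sketch of these two blocks in R (my own illustration, assuming the non-informative prior p(β0, β1, σ²) ∝ 1/σ², which is an assumption here): (β0, β1) is drawn jointly from its bivariate normal full conditional and σ² from its inverse gamma full conditional.

# Block Gibbs sampler for y = beta0 + beta1 * x + N(0, sigma2) errors,
# assuming the prior p(beta0, beta1, sigma2) proportional to 1/sigma2
block_gibbs <- function(y, x, M = 2000) {
  X <- cbind(1, x); n <- length(y)
  XtX_inv  <- solve(t(X) %*% X)
  beta_hat <- XtX_inv %*% t(X) %*% y
  out <- matrix(NA, M, 3, dimnames = list(NULL, c("beta0", "beta1", "sigma2")))
  sigma2 <- var(y)                                          # starting value
  for (k in 1:M) {
    # block 1: (beta0, beta1) | sigma2, y  ~  N(beta_hat, sigma2 * (X'X)^-1)
    beta <- beta_hat + t(chol(sigma2 * XtX_inv)) %*% rnorm(2)
    # block 2: sigma2 | beta0, beta1, y  ~  Inv-Gamma(n/2, SSE/2)
    sse <- sum((y - X %*% beta)^2)
    sigma2 <- 1 / rgamma(1, shape = n / 2, rate = sse / 2)
    out[k, ] <- c(beta, sigma2)
  }
  out
}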

Concepts in Bayesian Inference 304


5.2.6 The Slice sampler* 1

Slice sampler: sampling from a density f(x) defined on [a, b] is equivalent to sampling
from the bivariate uniform density g(x, y) on the region A = { (x, y) : 0 < y < f(x) }

y = auxiliary variable

Three cases:
. [a, b] finite interval
Sample from [a, b] × [0, m = max f(x)] and reject if outside region A
. General unimodal case
. General multimodal case

Concepts in Bayesian Inference 305


The Slice sampler* 2

General unimodal case = example of a bivariate Gibbs algorithm

y | x ~ U(0, f(x))
x | y ~ U(min_y, max_y)
with min_y (max_y) the minimal (maximal) x-value of the solutions of y = f(x)

Stochastic intervals S(y) = [min_y, max_y] = slices

Implemented in WinBUGS when support is finite (first case)

General multimodal case = extension (slices are not intervals anymore)

Concepts in Bayesian Inference 306


Example VI.6: Slice sampling applied to the normal density

Sample from a standard normal density:

. y | x ~ U(0, exp(−x²/2))
. x | y ~ U(−√(−2 log(y)), √(−2 log(y)))   (S(y) = [−√(−2 log(y)), √(−2 log(y))])
[Figure: (a) standard normal density f(x) with a slice S(y) at height y; (b) density estimate of the sampled x-values]
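
A small R sketch of these two conditional draws (my own illustration of the example, using the unnormalized density exp(−x²/2)):

# Slice sampler for the (unnormalized) standard normal density f(x) = exp(-x^2/2)
slice_normal <- function(M = 10000, x0 = 0) {
  x <- numeric(M); x[1] <- x0
  for (k in 2:M) {
    y <- runif(1, 0, exp(-x[k - 1]^2 / 2))   # y | x ~ U(0, f(x))
    b <- sqrt(-2 * log(y))                   # slice S(y) = [-b, b]
    x[k] <- runif(1, -b, b)                  # x | y ~ U(-b, b)
  }
  x
}
x <- slice_normal()
c(mean = mean(x), sd = sd(x))                # should be close to 0 and 1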

Concepts in Bayesian Inference 307


5.3 The Metropolis(-Hastings) algorithm

Metropolis-Hastings (MH) algorithm = general Markov chain Monte Carlo


technique to sample from the posterior distribution but does not require full
conditionals

Special case: Metropolis algorithm proposed by Metropolis in 1953

General case: Metropolis-Hastings algorithm proposed by Hastings in 1970

Became popular only after the introduction of Gelfand & Smith's paper (1990)

Further generalization: Reversible Jump MCMC algorithm by Green (1995)

Concepts in Bayesian Inference 308


5.3.1 The Metropolis algorithm 1

Sketch of algorithm:

New positions are proposed by a proposal density q

Proposed positions will be:


. Accepted:
Proposed location has higher posterior probability: with probability 1
Otherwise: with probability proportional to posterior probability
. Rejected:
Otherwise

Algorithm again satisfies the Markov property ⇒ MCMC algorithm

Similarity with AR algorithm

Concepts in Bayesian Inference 309


The Metropolis(-Hastings) algorithm 2

Metropolis algorithm:
Chain is at θ^k ⇒ the Metropolis algorithm samples the value θ^(k+1) as follows:

1. Sample a candidate θ̃ from the symmetric proposal density q(θ̃ | θ), with θ = θ^k

2. The next value θ^(k+1) will be equal to:
   . θ̃ with probability α(θ^k, θ̃) (accept proposal),
   . θ^k otherwise (reject proposal),
with
   α(θ^k, θ̃) = min( r = p(θ̃ | y) / p(θ^k | y), 1 )

Function α(θ^k, θ̃) = probability of a move
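
A generic R sketch of these two steps (my own illustration; the one-dimensional N(7, 0.1²) toy target in log_post is an arbitrary choice, not from the course):

# Random-walk Metropolis with a symmetric normal proposal density
log_post <- function(theta) dnorm(theta, mean = 7, sd = 0.1, log = TRUE)  # toy log posterior

metropolis <- function(log_post, theta0, sd_prop = 0.05, M = 5000) {
  theta <- numeric(M); theta[1] <- theta0
  for (k in 2:M) {
    cand  <- rnorm(1, theta[k - 1], sd_prop)                     # step 1: propose
    alpha <- min(1, exp(log_post(cand) - log_post(theta[k - 1])))
    theta[k] <- if (runif(1) < alpha) cand else theta[k - 1]     # step 2: accept/reject
  }
  theta
}
chain <- metropolis(log_post, theta0 = 6.5)
mean(chain[-(1:500)])                                            # posterior mean after burn-in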

Concepts in Bayesian Inference 310


Example VI.7: SAP study Metropolis algorithm for NI prior case 1

Settings as in Example VI.1, now apply the Metropolis algorithm:

. Proposal density: N(θ^k, Σ) with θ^k = (μ^k, (σ²)^k)^T and Σ = diag(0.03, 0.03)

[Figure: panels (a) and (b) showing sampled positions in the (μ, σ²)-plane (μ: 6.6–7.4, σ²: 1.4–2.6)]

Jumps to any location in the (μ, σ²)-plane


Burn-in = 500, total chain = 1,500

Concepts in Bayesian Inference 311


Example VI.7: SAP study Metropolis algorithm for NI prior case 2

Marginal distributions:

[Figure: posterior marginal distributions of (a) μ (6.9–7.3) and (b) σ² (1.6–2.4)]

Acceptance rate = 40%


Burn-in = 500, total chain = 1,500

Concepts in Bayesian Inference 312


Example VI.7: SAP study Metropolis algorithm for NI prior case 3

Trace plots:

[Figure: trace plots of (a) μ and (b) σ² for iterations 600–1,400]

Accepted moves = blue color, rejected moves = red color

Concepts in Bayesian Inference 313


Example VI.7: SAP study Metropolis algorithm for NI prior case 4

. Proposal density: N(θ^k, Σ) with θ^k = (μ^k, (σ²)^k)^T and Σ = diag(0.001, 0.001)

[Figure: (a) sampled positions in the (μ, σ²)-plane; (b) posterior marginal distribution of σ² (1.5–2.1)]

Acceptance rate = 84%


Poor approximation of true distribution

Concepts in Bayesian Inference 314


5.3.2 The Metropolis-Hastings algorithm 1

Metropolis-Hastings algorithm:
Chain is at θ^k ⇒ the Metropolis-Hastings algorithm samples the value θ^(k+1) as follows:

1. Sample a candidate θ̃ from the (asymmetric) proposal density q(θ̃ | θ), with θ = θ^k

2. The next value θ^(k+1) will be equal to:
   . θ̃ with probability α(θ^k, θ̃) (accept proposal),
   . θ^k otherwise (reject proposal),
with
   α(θ^k, θ̃) = min( r = [p(θ̃ | y) q(θ^k | θ̃)] / [p(θ^k | y) q(θ̃ | θ^k)], 1 )

Concepts in Bayesian Inference 315


The Metropolis-Hastings algorithm 2

Reversibility condition: probability of a move from θ to θ̃ = probability of a move
from θ̃ to θ

Reversible chain: chain satisfying the reversibility condition

Example of an asymmetric proposal density: q(θ̃ | θ^k) ≡ q(θ̃) (Independent MH
algorithm)

WinBUGS makes use of a univariate MH algorithm to sample from some
non-standard full conditionals

Concepts in Bayesian Inference 316


Example VI.8: Sampling a t-distribution using Independent MH algorithm

Target distribution: t3(3, 2²)-distribution

(a) Independent MH algorithm with proposal density N(3, 4²)

(b) Independent MH algorithm with proposal density N(3, 2²)
[Figure: estimated densities of the sampled values under proposals (a) N(3, 4²) and (b) N(3, 2²)]
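
A sketch of setting (a) in R (my own illustration; the t3(3, 2²) density is evaluated as dt((x − 3)/2, df = 3)/2):

# Independent MH: target t3(3, 2^2), proposal N(3, 4^2)
dtarget <- function(x) dt((x - 3) / 2, df = 3) / 2
M <- 10000; x <- numeric(M); x[1] <- 3
for (k in 2:M) {
  cand <- rnorm(1, 3, 4)                       # proposal does not depend on the current value
  r <- (dtarget(cand) / dnorm(cand, 3, 4)) /
       (dtarget(x[k - 1]) / dnorm(x[k - 1], 3, 4))
  x[k] <- if (runif(1) < min(1, r)) cand else x[k - 1]
}
c(mean = mean(x), sd = sd(x))                  # mean close to 3, sd close to 2 * sqrt(3)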

Concepts in Bayesian Inference 317


5.3.3 Remarks* 1

The Gibbs sampler is a special case of the Metropolis-Hastings algorithm, but


Gibbs sampler is still treated differently

The transition kernel of the MH-algorithm

The reversibility condition

Difference with AR algorithm

Concepts in Bayesian Inference 318


Remarks* Gibbs sampler = special case of MH algorithm

Define d transition functions

q_j^G(θ̃ | θ^k) = p(θ̃_j | θ^k_(-j), y)  if θ̃_(-j) = θ^k_(-j)
               = 0  otherwise

with θ_(-j) = θ without the jth component.

Only possible jumps are to parameter vectors θ̃ that match θ^k on all components
other than the jth. Then the ratio r in the jth substep:

r = [ p(θ̃ | y) q_j^G(θ^k | θ̃) ] / [ p(θ^k | y) q_j^G(θ̃ | θ^k) ]
  = [ p(θ̃ | y) p(θ^k_j | θ̃_(-j), y) ] / [ p(θ^k | y) p(θ̃_j | θ^k_(-j), y) ]
  = [ p(θ̃ | y) / p(θ̃_j | θ^k_(-j), y) ] / [ p(θ^k | y) / p(θ^k_j | θ̃_(-j), y) ]
  = 1

⇒ each jump is accepted

Concepts in Bayesian Inference 319


Remarks* Transition kernel

Transition kernel = probability to move from θ ≡ θ^k to ψ ≡ θ^(k+1), with θ, ψ ∈ Θ.

Jump in 2 components:
. First component (move): K(θ, ψ) = α(θ, ψ) q(ψ | θ)
  move to ψ = θ̃ suggested by the proposal density q(ψ | θ) and accepted with
  probability α(θ, ψ)
. Second component (stay): r(θ) = 1 − ∫_(R^d) α(θ, ψ) q(ψ | θ) dψ
  probability that no move is made, i.e. ψ = θ

Probability that ψ ∈ B, with B ⊂ Θ:

p(θ, B) = ∫_B K(θ, ψ) dψ + r(θ) I(θ ∈ B)

Concepts in Bayesian Inference 320


Remarks* Reversibility condition

Reversibility condition:

= Probability to move from set A to set B = probability to move from set B to set
A (for any sets A and B in Θ)

∫_A p(θ, B) p(θ | y) dθ = ∫_B p(θ, A) p(θ | y) dθ

. Condition satisfied when the detailed balance condition is satisfied:

∫_A ∫_B K(θ, ψ) p(θ | y) dψ dθ = ∫_B ∫_A K(θ, ψ) p(θ | y) dψ dθ   for all A and B

. Detailed balance condition optimally satisfied for the MH acceptance probability
  (Hastings, 1970)

Concepts in Bayesian Inference 321


Remarks* Comparison with AR algorithm

AR algorithm:

. Makes use of instrumental distribution to propose values


. Proposed values are accepted or rejected
. Generates independent samples
. No trace of rejected values

MH algorithm:

. Makes use of instrumental distribution to propose values


. Proposed values are accepted or rejected
. Generates dependent samples
. Trace of rejected values (the chain repeats the current value)

Concepts in Bayesian Inference 322


5.3.4 Review of Metropolis(-Hastings) approaches

The Random-Walk Metropolis(-Hastings) algorithm

The Independent Metropolis-Hastings algorithm

The Block Metropolis-Hastings algorithm

The Reversible Jump MCMC (RJMCMC) algorithm

Concepts in Bayesian Inference 323


The Random-Walk Metropolis(-Hastings) algorithm

Proposal density: q(θ̃ | θ) = q(θ̃ − θ), e.g. q(θ̃ − θ) ≡ q(|θ̃ − θ|) ⇒ proposal
density is symmetric and gives the Metropolis algorithm
. Multivariate normal density: WinBUGS & SAS® procedures
. Multivariate t-distribution: SAS® PROC MCMC for long-tailed posteriors

Acceptance rate: 45% for d = 1 and 23.4% for d > 1

Tuning the proposal density (a rough illustration follows below):


. WinBUGS (one-dimensional MH algorithm): in first 4000 iterations to produce
an acceptance rate between 20% and 40%
. SAS® procedure MCMC: in several loops
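
The tuning idea can be mimicked by hand in R (a rough sketch only, not how WinBUGS or the SAS® procedure MCMC implement it): during burn-in the proposal standard deviation is inflated or deflated until the empirical acceptance rate falls in the desired range.

# Crude tuning of the random-walk proposal sd towards a 20-40% acceptance rate
tune_sd <- function(log_post, theta0, sd0 = 1, batches = 20, batch_size = 200) {
  sd_prop <- sd0; theta <- theta0
  for (b in 1:batches) {
    acc <- 0
    for (k in 1:batch_size) {
      cand <- rnorm(1, theta, sd_prop)
      if (runif(1) < exp(log_post(cand) - log_post(theta))) { theta <- cand; acc <- acc + 1 }
    }
    rate <- acc / batch_size
    if (rate > 0.40) sd_prop <- sd_prop * 1.2     # too many acceptances: take larger steps
    if (rate < 0.20) sd_prop <- sd_prop / 1.2     # too few acceptances: take smaller steps
  }
  sd_prop
}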

Concepts in Bayesian Inference 324


The Independent Metropolis-Hastings algorithm

Proposal density: does not depend on the position in the chain, e.g.
q(θ̃ | θ) = N_d(θ̃ | μ, Σ)

One of the possible samplers of the SAS® procedure MCMC

Similar to the AR algorithm, but accepts θ̃ when p(θ̃ | y)/q(θ̃) > p(θ^k | y)/q(θ^k)

A high acceptance rate is desirable and is obtained when the proposal density q(θ) is close to the
posterior density

If p(θ | y) ≤ A q(θ) for all θ, then the Markov chain generated by the Independent
MH algorithm enjoys excellent convergence properties (Theorem 7.8) and the
expected acceptance probability exceeds that of the AR algorithm (Lemma 7.9)

Concepts in Bayesian Inference 325


The Block Metropolis-Hastings algorithm

MH algorithm within Gibbs sampling: Metropolis-within-Gibbs

SAS® procedure MCMC: blocks specified by the user

WinBUGS: regression coefficients in one block (blocking option switched on) and
variance parameters in another block

Concepts in Bayesian Inference 326


The Reversible Jump MCMC (RJMCMC) algorithm

Special case of the MH algorithm

Jumps within space and between spaces

Important application: Bayesian variable selection

Concepts in Bayesian Inference 327


5.4 Justification of the MCMC approaches*

. Homogeneous Markov chain


. Reversible Markov chain
. Ergodicity: (irrespective of the starting position)
Irreducibility: the chain can reach each possible outcome
Aperiodicity: there is no cyclic behavior in the chain
Positive recurrence: the chain visits every possible outcome an infinite number
of times and the expected time to return to a particular outcome is finite.
ergodicity implies that the chain will explore the posterior distribution exhaustively
. Autocovariance and autocorrelation
. Extensions of classical theorems to ergodic Markov chains

Concepts in Bayesian Inference 328


Justification of the MCMC approaches* Autocorrelation

Elements of a Markov chain are conditionally independent given the


immediate past, but they are dependent unconditionally

The autocovariance of lag m (m ≥ 0) of the Markov chain (t_k)_k ≡ (t(θ^(k)))_k:

γ_m = cov(t_k, t_(k+m))

The variance of (t_k)_k: γ_0 = autocovariance for m = 0

The autocorrelation of lag m: ρ_m = γ_m / γ_0

Concepts in Bayesian Inference 329


Justification of the MCMC approaches* Ergodic theorems

Theorem: When (θ^(k))_k is an ergodic Markov chain with stationary distribution π,
then the limiting distribution is also π.

Theorem (Markov Chain Law of Large Numbers): For an ergodic Markov chain
with a finite expected value for t(θ), t̄_k converges to the true mean E[t(θ)].

Theorem (Markov Chain Central Limit Theorem): For a uniformly (or
geometrically) ergodic Markov chain with t²(θ) (or t^(2+ε)(θ), for some ε > 0 in the
geometric case) integrable with respect to π, then

( t̄_k − E_π[t(θ)] ) / (τ/√k)  converges in distribution to N(0, 1) as k → ∞,

with

τ² = γ_0 ( 1 + 2 Σ_(m=1)^∞ ρ_m ).
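
In practice τ² can be estimated from the sampled chain by truncating the sum over lags; the rule below (stop at the first non-positive autocorrelation) is a simple choice of my own, giving a crude Monte Carlo standard error for the posterior mean.

# Crude MCMC standard error based on tau^2 = gamma_0 * (1 + 2 * sum_m rho_m)
mcmc_se <- function(chain, max_lag = 100) {
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]
  cut <- which(rho <= 0)[1]                      # truncate at first non-positive autocorrelation
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  tau2 <- var(chain) * (1 + 2 * sum(rho))
  sqrt(tau2 / length(chain))
}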

Concepts in Bayesian Inference 330


5.4.1 Properties of the MH algorithm* 1

The MH algorithm creates a reversible Markov chain, i.e. a Markov chain that
satisfies the detailed balance condition.
Proof discrete case:

θ = discrete random variable with support S = {x_1, x_2, . . . , x_r}

π_j = p(θ = x_j)

Q = (q_ij)_ij: matrix that describes the move from x_i to x_j with probability q_ij

Probability that a move from x_i is made to x_j (≠ x_i): α_ij = min( 1, π_j q_ji / (π_i q_ij) )

Probability to move from x_i to x_j: p_ij = α_ij q_ij

Concepts in Bayesian Inference 331


Properties of the MH algorithm* 2

The detailed balance condition π_i p_ij = π_j p_ji is satisfied because

π_i p_ij = π_i α_ij q_ij = π_i min( 1, π_j q_ji / (π_i q_ij) ) q_ij
         = min( π_i q_ij, π_j q_ji )
         = min( π_i q_ij / (π_j q_ji), 1 ) π_j q_ji
         = π_j α_ji q_ji = π_j p_ji

⇒ π is the stationary distribution

MH algorithm creates a Markov chain where the target distribution is also the
stationary distribution

+ Extra verifications show that LLN and CLT for ergodic chains can be applied

Concepts in Bayesian Inference 332


Properties of the Gibbs sampler*

Verifications show that LLN and CLT for ergodic chains can be applied

Concepts in Bayesian Inference 333


5.5 Choice of the sampler

Choice of the sampler depends on a variety of considerations

Concepts in Bayesian Inference 334


Example VI.9: Caries study MCMC approaches for logistic regression

Subset of n = 500 children of the Signal-Tandmobiel® study at the 1st examination:

. Research questions:
Do girls have a different risk of developing caries experience (CE) than boys
(gender) in the first year of primary school?
Is there an east-west gradient (x-coordinate) in CE?
. Bayesian model: logistic regression + N(0, 100²) priors for regression coefficients
. No standard full conditionals
. Three algorithms:
Self-written R program: evaluate full conditionals on a grid + ICDF-method (sketched below)
WinBUGS program: multivariate MH algorithm (blocking mode on)
SAS® procedure MCMC: Random-Walk MH algorithm
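
The grid + ICDF step used in the R program can be sketched as follows (my own illustration with a hypothetical helper grid_icdf_draw; log_full_cond is any user-supplied log full conditional): the unnormalized full conditional of one coefficient is evaluated on a grid, turned into a discretized CDF and inverted at a uniform draw.

# One 'griddy' Gibbs step: draw a value from a full conditional known only up to a constant
grid_icdf_draw <- function(log_full_cond, lower, upper, n_grid = 400) {
  grid <- seq(lower, upper, length.out = n_grid)
  lp   <- sapply(grid, log_full_cond)
  p    <- exp(lp - max(lp))                 # unnormalized probabilities on the grid
  cdf  <- cumsum(p) / sum(p)
  grid[which(cdf >= runif(1))[1]]           # ICDF: first grid point where the CDF exceeds U(0,1)
}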

Concepts in Bayesian Inference 335


Program   Parameter   Mode      Mean      SD        Median    MCSE
MLE       Intercept   -0.5900             0.2800
          gender      -0.0379             0.1810
          x-coord      0.0052             0.0017
R         Intercept             -0.5880   0.2840    -0.5860   0.0104
          gender                -0.0516   0.1850    -0.0578   0.0071
          x-coord                0.0052   0.0017     0.0052   6.621E-5
WinBUGS   Intercept             -0.5800   0.2810    -0.5730   0.0094
          gender                -0.0379   0.1770    -0.0324   0.0060
          x-coord                0.0052   0.0018     0.0053   5.901E-5
SAS®      Intercept             -0.6530   0.2600    -0.6450   0.0317
          gender                -0.0319   0.1950    -0.0443   0.0208
          x-coord                0.0055   0.0016     0.0055   0.00016

Concepts in Bayesian Inference 336


Conclusions:

Posterior means/medians of the three samplers are close (to the MLE)

Precision with which the posterior mean was determined (high precision = low
MCSE) differs considerably

The clinical conclusion was the same

Samplers may have quite different efficiencies

Concepts in Bayesian Inference 337


5.6 The Reversible Jump MCMC algorithm*

Reversible Jump MCMC (RJMCMC): extension of the standard MH-algorithm


(Green, 1995) to allow sampling from target distributions on spaces of varying
dimension = trans-dimensional case

Applications:
Mixtures with an unknown # of components and hidden Markov models
Change-point problems with an unknown # of change-points/locations
Model and variable selection problems
Analysis of quantitative trait locus (QTL) data

Theory is complex:
Idea: create 1-to-1 function between the spaces of different dimensions
Construct a MH-algorithm that satisfies detailed-balance condition

Concepts in Bayesian Inference 338


Example VI.10: Caries study choosing between Poisson & NB with RJMCMC

We now look at the dmft-index of all children in the first year

Overdispersion relative to the Poisson model: mean dmft = 2.24 and variance = 7.93


Candidate model for fitting overdispersion: Negative binomial distribution
Choose between Poisson and Negative Binomial with RJMCMC

Poisson likelihood: L(θ | y) = ∏_(i=1)^n (θ^(y_i) / y_i!) exp(−nθ)

Negative binomial likelihood:
L(θ, κ | y) = ∏_(i=1)^n [ Γ(1/κ + y_i) / (Γ(1/κ) y_i!) ] (1/(1 + κθ))^(1/κ) (θ/(1/κ + θ))^(y_i)

θ = mean for both distributions

θ(1 + κθ) = variance of the NB distribution
κ measures overdispersion (κ = 0: Poisson model)
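
In R both likelihoods can be evaluated with built-in densities; dnbinom() with size = 1/κ and mu = θ corresponds to the parameterization above (a sketch with simulated counts standing in for the real dmft data).

# Log-likelihoods of the two competing models for a vector of counts y
loglik_pois <- function(theta, y)        sum(dpois(y, lambda = theta, log = TRUE))
loglik_nb   <- function(theta, kappa, y) sum(dnbinom(y, size = 1 / kappa, mu = theta, log = TRUE))

dmft <- rnbinom(500, size = 1 / 1.1, mu = 2.24)   # simulated stand-in, not the study data
loglik_pois(2.24, dmft)
loglik_nb(2.24, 1.1, dmft)                        # NB variance: theta * (1 + kappa * theta)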

Concepts in Bayesian Inference 339


Example VI.10 Choices of moves

Two models:

. Poisson: M1 = (1, ψ1) with ψ1 ≡ θ
. Negative binomial: M2 = (2, ψ2) with ψ2 = (ψ21, ψ22) ≡ (θ, κ)
. Trans-dimensional moves are between M1 = (1, ψ1) and M2 = (2, ψ2)
+ Necessary priors

There are four types of moves:

(1) Poisson to Poisson: classical


(2) Poisson to negative binomial: trans-dimensional
(3) Negative binomial to Poisson: trans-dimensional
(4) Negative binomial to negative binomial: classical

Concepts in Bayesian Inference 340


Example VI.10 Mixing behavior

Four settings
. Preference was measured by % of times the Markov chain was in model 2
(negative binomial): between 56% and 74%
. A suggested move from model 1 to model 2 was always accepted
. Percentage of trans-dimensional moves: between 32% and 55%

Concepts in Bayesian Inference 341


Example VI.10 Trace and density plots

[Figure: trace plots (top row) and posterior density plots (bottom row) of the model parameters; panels labeled Poisson, Negative Binomial and Negative Binomial]

Concepts in Bayesian Inference 342
