Bayesian Course Main
Christel Faes
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Hasselt University
christel.faes@uhasselt.be
Emmanuel Lesaffre
Department of Biostatistics, Erasmus Medical Center
Interuniversity Institute for Biostatistics and statistical Bioinformatics
Catholic University Leuven & Hasselt University
Three approaches:
. Frequentist approach
. (Likelihood approach)
. Bayesian approach
Classical approach:
Example: RCT
. Treatment A: mean μ1 & treatment B: mean μ2
. H0: Δ = μ1 − μ2 = 0
Completion of study:
. Δ̂ = 1.38 with tobs = 2.19 in the 0.05 rejection region ⟹ H0 rejected
The P-value depends on the sample space (Examples I.3 and I.4)
The P-value does not take all evidence into account (Example I.5)
P-value ≠ p(H0 | y)
Example I.2
[Figure: density of the t-statistic under H0, showing the observed t value and the two rejection areas (0.015 each) among the unobserved y values]
The possible samples are similar in some characteristics to the observed sample
(e.g. same sample size)
A small P-value does not necessarily indicate a large difference between treatments, a strong association, etc.
Interpretation: Δ most likely (with 0.95 probability) lies between 0.14 and 2.62
= a Bayesian interpretation
1. Fisher's approach
Inductive approach
Introduction of:
. Null-hypothesis (H0)
. Significance test
. P -value = evidence against H0
. Significance level
. NO alternative hypothesis
. NO power
2. Neyman & Pearson's approach
Deductive approach
Introduction of:
. Alternative hypothesis (HA)
. Type I error
. Type II error & power
. Hypothesis test
Other measures (e.g. the Bayes factor) have been proposed as a measure for or against a hypothesis
Notation:
Result on ith operation: success yi = 1, failure yi = 0
Total experiment: n operations with s successes
Sample {y1, . . . , yn} y
Probability of success = p(yi = 1) = θ
Binomial distribution:
Expresses probability of s successes out of n experiments.
f(s) = C(n, s) θ^s (1 − θ)^(n−s), with s = Σ_{i=1}^n yi and C(n, s) the binomial coefficient
[Figure: (a) likelihood and (b) log-likelihood of θ for the observed data]
ℓ(θ | s) = c + s ln θ + (n − s) ln(1 − θ)
dℓ(θ | s)/dθ = s/θ − (n − s)/(1 − θ) = 0 ⟹ θ̂ = s/n
LP 2: Two likelihood functions for θ contain the same information about θ if they are proportional to each other.
[Figure: binomial likelihood of θ with the MLE and a 95% likelihood-based CI indicated]
. Surgeon 1: s = Σ_{i=1}^n yi has a binomial distribution
⟹ binomial likelihood L1(θ | s) = C(n, s) θ^s (1 − θ)^(n−s)
. Surgeon 2: s has a negative binomial (Pascal) distribution
⟹ negative binomial likelihood L2(θ | s) = C(s + k − 1, s) θ^s (1 − θ)^k
[Figure: binomial and negative binomial likelihoods of θ with the common MLE]
Examples I.7 and I.8: combining information from a similar historical surgical technique could be used in the evaluation of the current technique = a Bayesian exercise
New mouthwash:
. Daily use of the new mouthwash before tooth brushing reduces plaque?
. Results: new mouthwash reduced 25% of plaque with a 95% CI = [10%, 40%]
. Previous trials: overall reduction in plaque in-between 5% and 15%
. Experts: plaque reduction will probably not exceed 30%
. What to conclude then?
Medical example: patients treated for CVA with a thrombolytic agent suffer from SBAs. Historical studies (20% = prior) + pilot study (10% = data) ⟹ posterior
p(B | A) = p(A | B) p(B) / p(A)
p(B | A) = p(A | B) p(B) / [p(A | B) p(B) + p(A | Bᶜ) p(Bᶜ)]
Folin-Wu blood test: screening test for diabetes (Boston City Hospital)
Se = 56/70 = 0.80
Sp = 461/510 = 0.90
prev = 70/580 = 0.12
Bayes theorem:
p(D⁺ | T⁺) = p(T⁺ | D⁺) p(D⁺) / [p(T⁺ | D⁺) p(D⁺) + p(T⁺ | D⁻) p(D⁻)]
pred⁺ = Se · prev / [Se · prev + (1 − Sp)(1 − prev)]
Folin-Wu blood test: prior (prevalence) = 0.10 & positive test ⟹ posterior = 0.47
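As a quick numerical check, the pred⁺ formula above can be evaluated in R (a minimal sketch; the rounded Se, Sp and prev values of the example are plugged in):

# Positive predictive value via Bayes theorem (Folin-Wu example)
se   <- 0.80   # sensitivity, 56/70
sp   <- 0.90   # specificity, 461/510 (rounded)
prev <- 0.10   # prior probability of diabetes
pred_pos <- (se * prev) / (se * prev + (1 - sp) * (1 - prev))
round(pred_pos, 2)   # 0.47 = posterior probability after a positive test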
Ioannidis (2005) explains why many medical research findings appear to be false
pred⁺ = (1 − β)R / [(1 − β)R + α]
If (1 − β)R > α:
⟹ posterior probability of finding a true relationship > 0.5
Power (1 − β) must be > 0.05/R to find a truly positive result with high likelihood, which is impossible for G large
Other (interesting!) conclusions, see Ioannidis (2005)
Bayes theorem will be further developed in the next chapter such that it
becomes useful in statistical practice
Figure 1.4 (a): AUC on the positive x-axis = our posterior belief that Δ is positive (= 0.98)
Figure 1.4 (b): AUC for the interval [1, ∞[ = our belief that the ratio > 1
Figure 1.4 (c): incorporation of a skeptical prior (Δ a priori around −0.5, with some uncertainty) ⟹ posterior belief that Δ is positive = 0.54
[Figure 1.4: posterior densities of (a) delta, (b) rat, (c) delta under the skeptical prior and (d) ratsig; 5,000 samples each]
Philosophical differences aside, there are practical reasons that drive the recent popularity of Bayesian analysis:
. Simplicity in thinking about problems and answering questions
. Flexibility in making inference on a wide range of models (data augmentation, hierarchical models)
. Incorporation of prior information
. Development of efficient inference and sampling tools
. Fast computers
Statistical world:
. Bayesian statistics has not been (widely) accepted for a long time.
. Frequentist world versus Bayesian (Likelihood) world
. From 1990: change in attitude of statistical community.
Medical world:
. The medical world is more conservative. Bayesian methods are better accepted in exploratory/epidemiological studies than in clinical trials. In clinical trials there seems to be a role for Bayesian methods, except in phase III studies.
A Bayesian and a frequentist were sentenced to death. When the judge asked what
their final wishes were, the Bayesian replied that he wished to teach the frequentist the
ultimate lesson. The judge granted his request and then repeated the question to the
frequentist. He replied that he wished to get the lesson again and again and again . . .
In this chapter:
A variety of examples
D⁺ ⟹ θ = 1 and D⁻ ⟹ θ = 0
T⁺ ⟹ y = 1 and T⁻ ⟹ y = 0
p(θ = 1 | y = 1) = p(y = 1 | θ = 1) p(θ = 1) / [p(y = 1 | θ = 1) p(θ = 1) + p(y = 1 | θ = 0) p(θ = 0)]
Shorthand notation:
p(θ | y) = p(y | θ) p(θ) / p(y)
Probability can have two meanings: limiting proportion (objective) or personal belief
(subjective)
Tour de France
Global warming
...
Ak: p(Ak) ≥ 0 (k = 1, . . . , K)
p(S) = 1
p(Aᶜ) = 1 − p(A)
p(θk | y) = p(y | θk) p(θk) / Σ_{k=1}^K p(y | θk) p(θk)
p(θ | y) = L(θ | y) p(θ) / p(y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
Frequentist
Likelihood
SICH incidence:
SICH: yi = 1, otherwise yi = 0
y = Σ_{i=1}^n yi has Bin(n, θ): p(y | θ) = C(n, y) θ^y (1 − θ)^(n−y)
MLE: θ̂ = 0.20
ECASS 2 likelihood: L(θ | y0) = C(n0, y0) θ^{y0} (1 − θ)^{n0−y0} (y0 = 8 & n0 = 100)
. As a function of θ: L(θ | y0) ≠ density (AUC ≠ 1)
. How to standardize? Numerically or analytically?
[Figure: ECASS 2 likelihood as a function of the proportion SICH θ]
p(θ) = [1/B(α0, β0)] θ^{α0−1} (1 − θ)^{β0−1}
with B(α0, β0) = Γ(α0)Γ(β0)/Γ(α0 + β0) and Γ(·) the gamma function
α0 = y0 + 1 = 9 & β0 = n0 − y0 + 1 = 93
[Figure: prior density proportional to the ECASS 2 likelihood (proportion SICH)]
Averaged likelihood:
p(y) = C(n, y) B(α0 + y, β0 + n − y) / B(α0, β0)
p(θ | y) = [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1}
with
ᾱ = α0 + y
β̄ = β0 + n − y
[Figure: prior (proportional to the likelihood) and the more peaked posterior]
Posterior mode: θ̄M = [n0/(n0 + n)] θ̂0 + [n/(n0 + n)] θ̂ (analogous result for the mean)
Here: posterior more peaked than prior & likelihood (not in general)
Posterior estimate = MLE of combined ECASS 2 data & interim data ECASS 3
Beta(α0, β0) prior ≡ binomial experiment with (α0 − 1) successes in (α0 + β0 − 2) experiments
Prior ⟹ adds extra data to the observed data set: (α0 − 1) successes and (β0 − 1) failures
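A minimal sketch in R of this conjugate updating, using the ECASS numbers from this chapter (Beta(9, 93) prior from ECASS 2; 10 SICH events out of 50 interim ECASS 3 patients):

# Conjugate beta-binomial updating (ECASS example)
alpha0 <- 9; beta0 <- 93      # prior = Beta(y0 + 1, n0 - y0 + 1) from ECASS 2
y <- 10; n <- 50              # interim ECASS 3 data
alpha_post <- alpha0 + y      # 19
beta_post  <- beta0 + n - y   # 133
(alpha_post - 1) / (alpha_post + beta_post - 2)   # posterior mode = 0.12
y / n                                             # MLE of the interim data = 0.20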
Suppose DSMB neurologists believe that SICH incidence is probably more than
5% but most likely not more than 20%
The neurologists could also combine their qualitative prior belief with the ECASS 2 data to construct a prior distribution ⟹ adjusted ECASS 2 prior
[Figure: ECASS 2 prior and posterior versus subjective prior and posterior]
For stroke study, NI prior:
p(θ) = I[0,1](θ) = flat prior on [0,1]
Uniform prior on [0,1] = Beta(1,1)
⟹ posterior proportional to the likelihood
[Figure: flat prior with posterior proportional to the likelihood]
y ~ N(μ, σ²) when
f(y) = [1/(√(2π) σ)] exp[−(y − μ)²/(2σ²)]
[Figure: (a) histogram of dietary cholesterol (mg/day) and (b) likelihood of μ with MLE = 328]
with prior mean μ0 = ȳ0 and prior variance σ0² = σ²/n0
IBBENS-2 study:
. sample y with n = 50
. ȳ = 318 mg/day & s = 119.5 mg/day
. 95% confidence interval = [284.3, 351.9] mg/day ⟹ wide
[Figure: IBBENS prior, IBBENS-2 likelihood and IBBENS-2 posterior]
and
σ̄² = σ² / (n0 + n)
[Figure: (a) and (b): prior, likelihood and posterior densities of μ]
Non-informative prior: σ0² → ∞
[Figure: vague prior with posterior coinciding with the likelihood]
y ~ Poisson(θ):
p(y | θ) = θ^y e^{−θ} / y!
Poisson likelihood:
L(θ | y) = ∏_{i=1}^n p(yi | θ) = [∏_{i=1}^n (θ^{yi}/yi!)] e^{−nθ}
. Annual examinations from 1996 to 2001
. 4468 children (7% of children born in 1989)
. Caries experience measured by dmft-index (min = 0, max = 20)
[Figure: observed distribution (proportions) of the dmft-index]
MLE of θ: θ̂ = ȳ = 2.42
p(θ) = [β0^{α0} / Γ(α0)] θ^{α0−1} e^{−β0 θ}
[Figure: Gamma(3,1) prior density]
Posterior:
p(θ | y) ∝ [∏_{i=1}^n (θ^{yi}/yi!) e^{−θ}] θ^{α0−1} e^{−β0 θ}
∝ θ^{(Σ yi + α0) − 1} e^{−(n + β0) θ}
⟹ recognize kernel of a Gamma(Σ yi + α0, n + β0) distribution
⟹ p(θ | y) = [β̄^{ᾱ}/Γ(ᾱ)] θ^{ᾱ−1} e^{−β̄ θ}
with ᾱ = Σ yi + α0 = 9758 + 3 = 9761 and β̄ = n + β0 = 4351 + 1 = 4352
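A short R sketch of this gamma-Poisson updating with the STM numbers above:

# Gamma-Poisson conjugate updating (caries example)
alpha0 <- 3; beta0 <- 1      # Gamma(3, 1) prior
sum_y <- 9758; n <- 4351     # total dmft count and number of children
alpha_post <- sum_y + alpha0                     # 9761
beta_post  <- n + beta0                          # 4352
alpha_post / beta_post                           # posterior mean
qgamma(c(0.025, 0.975), alpha_post, beta_post)   # 95% equal-tail CI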
For STM study: posterior more peaked than prior & likelihood, but not in general
Bayesian approach satisfies 1st likelihood principle in that inference does not
depend on never observed results
Frequentist approach:
. θ fixed and data are stochastic
. Many tests are based on asymptotic arguments
. Maximization is the key tool
. Does depend on stopping rules
Bayesian approach:
. Condition on observed data (data fixed), uncertainty about θ (θ stochastic)
. No asymptotic arguments are needed, all inference depends on the posterior
. Integration is the key tool
. Does not depend on stopping rules
Subjectivity versus objectivity
. Thomas Bayes was probably born in 1701 and died on 7-4-1761
. He was a Presbyterian minister, studied logic and theology at Edinburgh University, and had strong mathematical interests
. Nothing on mathematics was published during his life
. Bayes' theorem was submitted posthumously by his friend Richard Price in 1763 and was entitled "An Essay towards solving a Problem in the Doctrine of Chances"
de Finetti: exchangeability
Spiegelhalter: (Win)BUGS
"The theory that would not die. How Bayes' rule cracked the enigma code, hunted down Russian submarines & emerged triumphant from two centuries of controversy" (McGrayne, 2011)
In this chapter:
Direct exploration of the posterior: P (a < < b|y) for different a and b
Posterior mode
Posterior mean
Posterior median
Posterior mode θ̂M: maximizes the posterior: θ̂M = arg maxθ p(θ | y)
Properties:
Posterior mean θ̄:
θ̄ = ∫ θ p(θ | y) dθ
Properties:
. θ̄ minimizes ∫ (θ̂ − θ)² p(θ | y) dθ over all estimators θ̂
. Posterior mean involves integrating twice
. ψ = h(θ) with h a monotone transformation: ψ̄ ≠ h(θ̄)
Posterior median θM:
0.5 = ∫_{θM}^∞ p(θ | y) dθ
Properties:
. θM minimizes ∫ a |θ̂ − θ| p(θ | y) dθ with a > 0 over all estimators θ̂
Posterior variance σ̄²:
σ̄² = ∫ (θ − θ̄)² p(θ | y) dθ
Posterior SD: σ̄
Posterior mean: integrate θ · [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1} over [0, 1]
⟹ θ̄ = B(ᾱ + 1, β̄)/B(ᾱ, β̄) = ᾱ/(ᾱ + β̄) = 19/152 = 0.125
Posterior median: solve 0.5 = [1/B(ᾱ, β̄)] ∫_0^{θM} θ^{ᾱ−1} (1 − θ)^{β̄−1} dθ for θM
⟹ θM = 0.122 (R-function qbeta)
Posterior variance: calculate also ∫_0^1 θ² [1/B(ᾱ, β̄)] θ^{ᾱ−1} (1 − θ)^{β̄−1} dθ
⟹ σ̄² = ᾱ β̄ / [(ᾱ + β̄)² (ᾱ + β̄ + 1)] = 0.0267²
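The same summary measures in R (a short sketch):

# Posterior summary measures of the Beta(19, 133) posterior
a <- 19; b <- 133
a / (a + b)                          # posterior mean  = 0.125
qbeta(0.5, a, b)                     # posterior median = 0.122
(a - 1) / (a + b - 2)                # posterior mode   = 0.12
a * b / ((a + b)^2 * (a + b + 1))    # posterior variance = 0.0267^2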
Posterior mode = mean = median: μ̂M = μ̄ = μM = 327.2 mg/dl
P(a ≤ θ ≤ b | y) = 1 − α
With F the posterior cdf: 1 − α = F(b) − F(a)
Given data set y: 95% credible interval contains the 95% most plausible
parameter values a posteriori
Given data set y: 95% confidence interval either contains or does not contain the
true value. Adjective 95% gets its meaning only in the long run
Posterior = N(μ̄, σ̄²)
[Figure: Beta(19,133) posterior with (a) 95% equal-tail interval (0.025 in each tail) and 95% HPD interval, and (b) the 95% HPD interval after log transformation]
Naive approach:
. Estimate θ̂
. Predictive distribution of ỹ: p(ỹ | θ̂)
Three cases:
. All mass (AUC ≈ 1) of p(θ | y) at θ̂M ⟹ distribution of ỹ: p(ỹ | θ̂M)
. All mass at θ¹, . . . , θᴷ ⟹ distribution of ỹ: Σ_{k=1}^K p(ỹ | θᵏ) p(θᵏ | y)
. General case ⟹ posterior predictive distribution (PPD) of ỹ:
p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ
yi = 100/√alpi (i = 1, . . . , 250) ⟹ approximately normal distribution
Naive approach:
Replace μ, σ by ȳ = 7.11, s = 1.4 ⟹ 95% reference interval (alp) = [104.45, 508.95]
Naive approach:
. MLE of θ (incidence of SICH) = 8/100 = 0.08 for the (fictive) ECASS 2 study
. Predictive distribution: Bin(50, 0.08)
. 95% predictive set: {0, 1, . . . , 7} covers 94% of the future counts
⟹ observed result of 10 SICH patients out of 50 is extreme
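The PPD shown below can also be approximated by simulation rather than analytically; a sketch contrasting the naive Bin(50, 0.08) approach with draws from the beta-binomial PPD BB(50, 9, 93):

# Sampling the PPD of the number of future SICH patients out of 50
set.seed(1)
K <- 1e5
theta  <- rbeta(K, 9, 93)        # uncertainty about theta (ECASS 2 posterior)
ytilde <- rbinom(K, 50, theta)   # PPD draws, approximately BB(50, 9, 93)
mean(ytilde >= 10)               # P(10 or more future SICH patients)
sum(dbinom(0:7, 50, 0.08))       # naive approach: P(0-7 events) = 0.94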
[Figure: Bin(50, 0.08) distribution of the # future rt-PA patients with SICH]
[Figure: PPD BB(50, 9, 93) versus Bin(50, 0.08) for the # future rt-PA patients with SICH]
[Figure: observed distribution of the dmft-index versus the PPD]
Independence:
p(y1, y2, . . . , yn | θ) = ∏_{i=1}^n p(yi | θ)
⟹ independence is defined conditional on θ
Exchangeable:
θ is never known, but given a prior distribution p(θ):
p(y1, y2, . . . , yn) = ∫ p(y1, y2, . . . , yn | θ) p(θ) dθ = ∫ ∏_{i=1}^n p(yi | θ) p(θ) dθ
Exchangeable ⇏ independent
Partial/conditional exchangeability
Up to now: the prior was chosen such that the posterior & posterior summary measures are obtained analytically
Logarithm of OR:
ψ = log[ (θ1/(1 − θ1)) / (θ0/(1 − θ0)) ]
OR = e^ψ
MLE of ψ:
ψ̂ = log[ r1 (n0 − r0) / (r0 (n1 − r1)) ]
ψ̂ is approximately N(ψ, σ̂ψ²), with
var(ψ̂) ≈ σ̂ψ² = 1/r0 + 1/(n0 − r0) + 1/r1 + 1/(n1 − r1)
This large-sample result for ψ̂ makes the Bayesian analysis easier
Chapter 2 result:
. Prior N(ψ0, σ0²) for ψ + normal likelihood of ψ̂
⟹ normal posterior for ψ: N(ψ̄, σ̄²) with
ψ̄ = σ̄² (ψ0/σ0² + ψ̂/σ̂ψ²)
σ̄² = (1/σ̂ψ² + 1/σ0²)⁻¹
[Figure: prior and posterior density of the odds ratio, with 95% CI]
Expert opinion:
. Best guess (median value) for OR: e^{ψ0} = 5
. 95% prior credible interval for OR = [1, 25]
⟹ experts put 95% belief in N(1.6, 0.82²) for ψ
[Figure: expert prior and resulting posterior density of the odds ratio]
Normal posterior for a large sample size is justified even when the likelihood is
combined with a non-normal prior
. θ = mean dmft-index
. Likelihood: dmft-index of ten children, Σ_{i=1}^{10} yi = 26
. Prior: Gamma(3, 1)
. Posterior: Gamma(29, 11) (solid)
. θ̂ = ȳ = 2.6 and σ̂² = ȳ/n = 0.26
[Figure: Gamma(29, 11) posterior density]
Numerical integration
Gaussian quadrature
Non-adaptive
Adaptive (M = 1 = Laplace approximation)
Posterior distribution:
p(θ | y) ∝ θ^{Σ_{i=1}^n yi − 1} e^{−nθ} exp[−(log(θ) − μ0)²/(2σ0²)], (θ > 0)
Mid-point approach:
[Figure: mid-point approximation of the posterior density (AUC = 0.13)]
Posterior for = probability of SICH with rt-PA = Beta(19, 133) (Example II.1)
[Figure: (a) and (b): Beta(19, 133) posterior density and histogram of sampled values]
ICDF method:
. Sample u from U(0, 1) ⟹ x = F⁻¹(u) has distribution function F(x)
[Figure: cdf F with u mapped to x = F⁻¹(u)]
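In R the ICDF method is a one-liner for any distribution with a known quantile function, e.g. the Beta(19, 133) posterior (sketch):

# ICDF method: sample from Beta(19, 133) via the inverse cdf
set.seed(1)
u <- runif(5000)          # u ~ U(0, 1)
x <- qbeta(u, 19, 133)    # x = F^{-1}(u) is a draw from Beta(19, 133)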
. q = envelope distribution
. A = envelope constant
[Figure: envelope function A q (A = 1.8) covering the N(0, 1) density]
Stage 2:
Properties of the AR algorithm:
. Produces a sample from the posterior
. Only needs the unnormalized posterior p(y | θ) p(θ)
. Probability of acceptance = 1/A
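A minimal accept-reject sketch for the Beta(19, 133) posterior with a U(0, 1) envelope; taking the envelope constant A as the posterior density at its mode is an assumption made here for illustration:

# Accept-reject sampling from Beta(19, 133) with q = U(0, 1)
set.seed(1)
A <- dbeta(18/150, 19, 133)   # envelope constant: density at the mode 18/150
prop <- runif(20000)                              # proposals from q
keep <- runif(20000) <= dbeta(prop, 19, 133) / A  # accept with prob p/(A q)
draws <- prop[keep]
mean(keep)                    # empirical acceptance rate, approximately 1/A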
The logs of the envelope and squeezing densities are piecewise linear functions with knots at the sampled grid points
[Figure: tangent and derivative-free constructions of the envelope and squeezing functions of the log posterior]
Interest in E[t(θ) | y]:
E[t(θ) | y] = ∫ t(θ) p(θ | y) dθ = ∫ [t(θ) p(θ | y)/q(θ)] q(θ) dθ = Eq[t(θ) p(θ | y)/q(θ)]
Estimate E[t(θ) | y] by
(1/K) Σ_{k=1}^K t(θᵏ) p(θᵏ | y)/q(θᵏ) = (1/K) Σ_{k=1}^K t(θᵏ) w(θᵏ)
Two-stage sampling:
. Stage 1: draw θ̃¹, . . . , θ̃ᴶ from q(θ) and compute weights
wj = [p(θ̃ʲ | y)/q(θ̃ʲ)] / Σ_{i=1}^J [p(θ̃ⁱ | y)/q(θ̃ⁱ)] (j = 1, . . . , J)
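A sketch of both stages in R (stage 2, the weighted resampling, gives the SIR algorithm); the Beta(19, 133) posterior is again used as target, with a normal proposal q chosen here for illustration:

# Two-stage (SIR) sampling: Beta(19, 133) target, N(0.125, 0.027^2) proposal
set.seed(1)
J <- 10000
prop <- rnorm(J, 0.125, 0.027)    # stage 1: draws from q
w <- dbeta(prop, 19, 133) / dnorm(prop, 0.125, 0.027)
w <- w / sum(w)                   # normalized weights w_j
draws <- sample(prop, 1000, replace = TRUE, prob = w)  # stage 2: resample
# in practice q should dominate the tails of the target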
[Figure: accept-reject algorithm applied to the posterior]
p1(θ | y) = L1(θ | y) p1(θ) / p1(y) & p2(θ | y) = L2(θ | y) p2(θ) / p2(y)
⟹ p2(θ | y) ∝ [L2(θ | y) p2(θ) / (L1(θ | y) p1(θ))] p1(θ | y) = v(θ) p1(θ | y)
Here:
p2(θ) ∝ exp[−(log(θ) − μ0)²/(2σ0²)]
p2(θ | y) = ?
Stage 2:
. Determine weights v(θ̃ⁱ) based on p2(θ) (likelihood stays the same)
. Take a weighted random sample from this sample of size K = 1,000
⟹ same histogram
Testing:
. Frequentist: 2-sided binomial test (P = 0.043)
. Bayesian: U(0,1) prior + binomial data (21 successes out of 30) ⟹ Beta(22, 10) posterior
⟹ pB = P(θ ≤ 0.5 | y) = 0.023, contrasting the data with θ = 0.5
Bayes factor =
factor that transforms prior odds for H0 into posterior odds after observed the data
H0: θ = 0.5 versus Ha: θ = 0.8 (only 0.5 and 0.8 are possible for θ)
θ is continuous ⟹ needed:
p(y | H0) = weighted average of p(y | θ), weights from p(θ | H0) ≡ U(0, 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0.5, 1)
Needed:
p(y | H0) = p(y | θ = 0.5)
p(y | Ha) = weighted average of p(y | θ), weights from p(θ | Ha) ≡ U(0, 1)
Classical likelihood ratio test: Λ = (0.5²¹ · 0.5⁹) / [(21/30)²¹ (9/30)⁹] = 0.0847 (P = 0.026)
Lindley's paradox:
Reason/explanation: averaging over a large number of unrealistic θ values under Ha
. Estimation: priors often do not have a great impact on the posterior conclusion
. Testing: priors MAY have a great impact on the posterior conclusion
In this chapter:
Examples
. (multivariate) Gaussian distribution
. Multinomial distribution data
Let
. y = sample of n independent observations
. θ = (θ1, θ2, . . . , θd)ᵀ
. Likelihood: L(θ | y)
. Multivariate prior: p(θ)
. Multivariate posterior: p(θ | y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
Posterior mode: θ̂M = arg maxθ p(θ | y)
Posterior mean: θ̄ = ∫ θ p(θ | y) dθ
HPD region of content (1 − α)
Let
. θ = (θ1, θ2)
. Marginal posterior: p(θ1 | y) = ∫ p(θ1, θ2 | y) dθ2
Often θ1 = one-dimensional:
. Easy to graphically display the marginal posterior
. Posterior summary measures based on p(θ1 | y) convenient in practice
. Marginal posterior mean of θ1 = joint posterior mean
Alternatively: p(θ1 | y) = ∫ p(θ1 | θ2, y) p(θ2 | y) dθ2
Three priors:
. No prior knowledge is available
. Previous study is available
. Expert knowledge is available
Posterior distribution:
p(μ, σ² | y) ∝ (1/σ^{n+2}) exp{−[(n − 1)s² + n(ȳ − μ)²]/(2σ²)}
[Figure: contour plot of the joint posterior p(μ, σ² | y)]
. p(μ | y): tn−1(ȳ, s²/n)-distribution
. p(σ² | y): scaled inverse chi-squared distribution: (n − 1)s²/σ² ~ χ²(n − 1)
[Figure: contour plot of the joint posterior with the marginal posteriors]
For μ:
. Posterior variance = (n − 1) s² / (n(n − 3))
For σ²:
. Posterior mean = (n − 1) s² / (n − 3)
. Posterior mode = (n − 1) s² / (n + 1)
. Posterior median = (n − 1) s² / χ²(0.5, n − 1)
. Posterior variance = 2(n − 1)² s⁴ / [(n − 3)²(n − 5)]
μ and σ² unknown:
p(ỹ | y) = ∫∫ p(ỹ | μ, σ²) p(μ, σ² | y) dμ dσ²
= tn−1[ȳ, s²(1 + 1/n)]-distribution
[Figure: marginal posterior densities of μ and σ²]
For μ:
. μ̄ = μ̂M = μM = 7.11
. σ̄μ² = 0.0075
. 95% (equal tail and HPD) CI = [6.94, 7.28]
For σ²:
. mean = 1.88, mode = 1.85, median = 1.87
. variance = 0.029
. 95% equal tail CI = [1.58, 2.24], 95% HPD interval = [1.56, 2.22]
Prior = N-Inv-χ²(μ0, κ0, ν0, τ0²)-distribution
. μ0 = ȳ0 & κ0 = n0
. ν0 = n0 − 1 & τ0² = s0²
Posterior:
p(μ | σ², y) = N(μ | μ̄, σ²/κ̄)
p(μ | y) = tν̄(μ | μ̄, τ̄²/κ̄)
p(σ² | y) = Inv-χ²(σ² | ν̄, τ̄²)
PPD:
p(ỹ | y) = tν̄[μ̄, τ̄²(1 + 1/(n0 + n))]
[Figure: marginal posterior densities of μ and σ² under the conjugate prior]
. For σ²:
p(σ² | y) ∝ ∫ N(μ | μ0, σ0²) Inv-χ²(σ² | ν0, τ0²) ∏_{i=1}^n N(yi | μ, σ²) dμ
Two distributions:
. Multivariate t-distribution:
p(y | μ, Σ, ν) = {Γ[(ν + p)/2] / [Γ(ν/2) (νπ)^{p/2}]} |Σ|^{−1/2} [1 + (1/ν)(y − μ)ᵀ Σ⁻¹ (y − μ)]^{−(ν+p)/2}
. Multinomial distribution
Properties:
. Dirichlet prior density: [1/B(α)] ∏_{i,j} θij^{αij − 1}
Note:
Dirichlet distribution = extension of beta distribution to higher dimensions
Marginal distributions of a Dirichlet distribution = beta distribution
ψ = θ11 θ22 / (θ12 θ21)
[Figure: marginal posterior densities of θ11, θ12, θ21 and the cross ratio ψ]
Classical frequentist tests (Fisher's exact test, chi-squared test, etc.) can be reproduced by Bayesian tests
A dependent prior p(θ1, θ2) is more natural than the product of p(θ1) and p(θ2)
Stagewise approach
Sampling approach:
. Sample θ̃d from p(θd | y)
. Sample θ̃(d−1) from p(θ(d−1) | θ̃d, y)
. . . .
. Sample θ̃1 from p(θ1 | θ̃2, . . . , θ̃d, y)
Three cases:
. No prior knowledge
. Historical data available
. Expert knowledge available
Sample μ̃ᵏ from a N(ȳ, σ̃²ᵏ/n)-distribution
⟹ μ̃¹, . . . , μ̃ᴷ = random sample from p(μ | y) (the tn−1(ȳ, s²/n)-distribution)
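A sketch of this Method of Composition in R for (μ, σ²) under the NI prior; the summary statistics of the alp example are assumed here (ȳ = 7.11, s² taken as 1.87):

# Method of Composition for the normal posterior (NI prior)
set.seed(1)
K <- 5000; n <- 250
ybar <- 7.11; s2 <- 1.87                    # assumed summary statistics
sigma2 <- (n - 1) * s2 / rchisq(K, n - 1)   # stage 1: scaled-inv-chi2(n-1, s2)
mu <- rnorm(K, ybar, sqrt(sigma2 / n))      # stage 2: mu | sigma2 ~ N(ybar, sigma2/n)
# mu is then a sample from the t_{n-1}(ybar, s2/n) marginal posterior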
To sample from the posterior predictive distribution p(ỹ | y), 2 approaches:
1. Sample directly from the tn−1[ȳ, s²(1 + 1/n)]-distribution
2. Method of Composition: for each sampled (μ̃ᵏ, σ̃²ᵏ), sample ỹᵏ from N(μ̃ᵏ, σ̃²ᵏ)
[Figure: Method of Composition samples of (μ, σ²): marginal histograms and scatter plots]
For a given (μ̃ᵏ, σ̃²ᵏ), sampling ỹᵏ is straightforward
Priors for y = 100/√alp:
. μ ~ N(5.25, 2.75/65) & σ² ~ Inv-χ²(64, 2.75)
Method of Composition:
. Stage I: sample σ² from
p(σ² | y) ∝ ∫ N(μ | μ0, σ0²) Inv-χ²(σ² | ν0, τ0²) ∏_{i=1}^n N(yi | μ, σ²) dμ
p(σ² | y) evaluated on a grid ⟹ mean and variance
Approximating distribution q(σ²) ≡ Inv-χ²(σ² | 294.2, 2.12)
⟹ weighted resampling
. Stage II: sample μ from a normal distribution
[Figure: (a) and (b): posteriors of σ² and μ under the expert prior versus the conjugate prior]
Based on sample
. 95% normal range for y: [4.05, 9.67]
. 95% normal range for alp: [106.84, 609.70]
Likelihood:
L(β, σ² | y, X) = [1/(2πσ²)^{n/2}] exp[−(y − Xβ)ᵀ(y − Xβ)/(2σ²)]
. MLE = LSE of β: β̂ = (XᵀX)⁻¹Xᵀy
. Residual sum of squares: S = (y − Xβ̂)ᵀ(y − Xβ̂)
. Mean residual sum of squares: s² = S/(n − d − 1)
[Figure: scatter plot of TBBMC (kg) versus BMI (kg/m²)]
Posterior distributions:
p(β, σ² | y) = N(d+1)[β | β̂, σ²(XᵀX)⁻¹] Inv-χ²(σ² | n − d − 1, s²)
p(β | σ², y) = N(d+1)[β | β̂, σ²(XᵀX)⁻¹]
p(σ² | y) = Inv-χ²(σ² | n − d − 1, s²)
p(β | y) = tn−d−1[β | β̂, s²(XᵀX)⁻¹]
100(1 − α)% HPD region:
C(α) = {β : (β − β̂)ᵀ(XᵀX)(β − β̂) ≤ (d + 1) s² Fα(d + 1, n − d − 1)}
PPD of ỹ with covariate vector x̃:
t-distribution with (n − d − 1) degrees of freedom, with
. location parameter: β̂ᵀx̃
. scale parameter: s²[1 + x̃ᵀ(XᵀX)⁻¹x̃]
Given y:
(ỹ − β̂ᵀx̃) / (s √(1 + x̃ᵀ(XᵀX)⁻¹x̃)) ~ tn−d−1
How to sample?
. Directly from t-distribution
. Method of Composition
Method of Composition:
Sample future observation ỹ from N(μ̃30, σ̃30²):
. μ̃30 = β̃ᵀ(1, 30)ᵀ
. σ̃30² = σ̃²[1 + (1, 30)(XᵀX)⁻¹(1, 30)ᵀ]
[Figure: sampled marginal posteriors of β0 and β1, with scatter plots of the samples]
Generalized Linear Model (GLIM): extension of the linear regression model to a wide
class of regression models
. Distributional part
. Link function
. Variance function
Distributional part:
p(y | θ; φ) = exp{[yθ − b(θ)]/a(φ) + c(y; φ)}, with a(·), b(·), c(·) known functions
Often a(φ) = φ/w, with w a prior weight. For φ known and w = 1:
. E(y) = μ = db(θ)/dθ
. Var(y) = a(φ) V(μ) with V(μ) = d²b(θ)/dθ²
Examples of a GLIM (made concrete in the R sketch below):
. Normal linear regression model: normal distribution yi ~ N(μi, σ²), identity link (g(μi) = μi), V(μi) = 1 and φ = σ² assumed known
. Poisson regression model: Poisson distribution yi ~ Poisson(μi), log link (g(μi) = log(μi)), φ = 1 and V(μi) = μi
. Logistic regression model: Bernoulli (or binomial) distribution yi ~ Bern(πi), logit link (g(πi) = logit(πi)), φ = 1 and V(πi) = πi(1 − πi)
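To make the link and variance choices concrete, the three examples correspond to the following (frequentist) glm() calls in R; the data vectors y and x are hypothetical:

# The three GLIMs as R model fits (illustration only; y and x assumed to exist)
fit_normal   <- glm(y ~ x, family = gaussian(link = "identity"))
fit_poisson  <- glm(y ~ x, family = poisson(link = "log"))
fit_logistic <- glm(y ~ x, family = binomial(link = "logit"))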
. Solving the posterior distribution analytically is often not feasible due to the
difficulty in determining the integration constant
. Computing the integral using numerical integration methods is a practical
alternative if only a few parameters are involved
New computational approach is needed
Method of Composition:
Gibbs sampling:
Under mild conditions: sample from the posterior distribution = target distribution
From k0 on: summary measures calculated from the chain consistently estimate
the true posterior measures
Example IV.5: sampling from posterior distribution of the normal likelihood based
on 250 alp measurements of healthy patients with NI prior for both parameters
[Figure: (a) and (b): Gibbs sampling paths in the (μ, σ²)-plane]
[Figure: (a) and (b): marginal posterior densities of μ and σ² from the Gibbs output]
Posterior:
p(μ, σ² | y) ∝ exp[−(μ − μ0)²/(2σ0²)] (σ²)^{−(ν0/2+1)} exp[−ν0τ0²/(2σ²)] (1/σⁿ) ∏_{i=1}^n exp[−(yi − μ)²/(2σ²)]
∝ exp[−(μ − μ0)²/(2σ0²)] (σ²)^{−((n+ν0)/2+1)} exp{−[Σ_{i=1}^n (yi − μ)² + ν0τ0²]/(2σ²)}
[Figure: (a) and (b): trace plots of μ and σ²]
1. Sample θ1^(k+1) from p(θ1 | θ2^(k), . . . , θ(d−1)^(k), θd^(k), y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), θ3^(k), . . . , θd^(k), y)
. . .
d. Sample θd^(k+1) from p(θd | θ1^(k+1), . . . , θ(d−1)^(k+1), y)
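A concrete sketch of these steps: a two-parameter Gibbs sampler for the normal likelihood, assuming the NI prior p(μ, σ²) ∝ σ⁻² so that both full conditionals are standard distributions:

# Gibbs sampler for (mu, sigma2) of a normal likelihood, NI prior (sketch)
# Full conditionals: mu | sigma2, y ~ N(ybar, sigma2/n)
#                    sigma2 | mu, y ~ scaled-inv-chi2(n, mean((y - mu)^2))
gibbs_normal <- function(y, K = 5000) {
  n <- length(y); ybar <- mean(y)
  mu <- numeric(K); sigma2 <- numeric(K)
  s2 <- var(y)                                    # starting value
  for (k in 1:K) {
    mu[k] <- rnorm(1, ybar, sqrt(s2 / n))         # step 1: sample mu
    s2 <- n * mean((y - mu[k])^2) / rchisq(1, n)  # step 2: sample sigma2
    sigma2[k] <- s2
  }
  data.frame(mu = mu, sigma2 = sigma2)
}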
. British coal-mining disasters data set: # severe accidents in British coal mines
from 1851 to 1962
. Decrease in frequency of disasters from year 40 (+ 1850) onwards?
[Figure: # disasters per year (1850 + year)]
Statistical model:
yi ~ Poisson(θ) (i = 1, . . . , k) & yi ~ Poisson(λ) (i = k + 1, . . . , n)
Priors:
. θ: Gamma(a1, b1) (a1, b1 parameters)
. λ: Gamma(a2, b2) (a2, b2 parameters)
. k: p(k) = 1/n
Full conditionals:
p(θ | y, λ, b1, b2, k) = Gamma(a1 + Σ_{i=1}^k yi, k + b1)
p(λ | y, θ, b1, b2, k) = Gamma(a2 + Σ_{i=k+1}^n yi, n − k + b2)
p(b1 | y, θ, λ, b2, k) = Gamma(a1 + c1, θ + d1)
p(b2 | y, θ, λ, b1, k) = Gamma(a2 + c2, λ + d2)
p(k | y, θ, λ, b1, b2) = π(y | k, θ, λ) / Σ_{j=1}^n π(y | j, θ, λ)
with π(y | k, θ, λ) = (θ/λ)^{Σ_{i=1}^k yi} exp[k(λ − θ)]
a1 = a2 = 0.5, c1 = c2 = 0, d1 = d2 = 1
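A sketch of the corresponding Gibbs sampler in R; y is assumed to be the vector of yearly disaster counts, and cp_gibbs is a hypothetical helper name:

# Gibbs sampler for the Poisson change-point model (sketch)
cp_gibbs <- function(y, K = 5000, a1 = 0.5, a2 = 0.5,
                     c1 = 0, c2 = 0, d1 = 1, d2 = 1) {
  n <- length(y); S <- cumsum(y)      # S[k] = y_1 + ... + y_k
  theta <- mean(y); lambda <- mean(y); b1 <- 1; b2 <- 1; k <- n %/% 2
  out <- matrix(NA, K, 3, dimnames = list(NULL, c("theta", "lambda", "k")))
  for (m in 1:K) {
    theta  <- rgamma(1, a1 + S[k], k + b1)
    lambda <- rgamma(1, a2 + S[n] - S[k], n - k + b2)
    b1 <- rgamma(1, a1 + c1, theta + d1)
    b2 <- rgamma(1, a2 + c2, lambda + d2)
    # p(k | ...) proportional to (theta/lambda)^S[k] * exp(k (lambda - theta))
    logw <- S * log(theta / lambda) + (1:n) * (lambda - theta)
    k <- sample(1:n, 1, prob = exp(logw - max(logw)))
    out[m, ] <- c(theta, lambda, k)
  }
  out
}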
[Figure: marginal posteriors of θ, λ and k]
Posterior mode of k: 1891
Posterior mean for θ/λ: 3.42 with 95% CI = [2.48, 4.59]
with
r1 = (1/n) Σi (yi − β1 xi)
r0 = Σi (yi − β0) xi / xᵀx
[Figure: trace plots of β0 and β1]
Autocorrelation:
. Autocorrelation of lag 1: correlation of θ1^(k) with θ1^(k−1) (k = 1, . . .)
. Autocorrelation of lag 2: correlation of θ1^(k) with θ1^(k−2) (k = 1, . . .)
. . . .
. Autocorrelation of lag m: correlation of θ1^(k) with θ1^(k−m) (k = 1, . . .)
High autocorrelation:
Transition kernel
Full conditionals determine the joint distribution: see Besag (1974) and
Hammersley and Clifford (1971)
Proofs in Robert and Casella (2004) for bivariate case (Theorem 9.3) and for
general case (Theorem 10.5)
p(θ1, θ2) = p(θ2 | θ1) / ∫ [p(θ2 | θ1) / p(θ1 | θ2)] dθ2
That the joint distribution exists is, however, not guaranteed by the full conditionals alone
Bivariate case (Casella and George, 1992): compute p(θ1, θ2) from p(θ1 | θ2) & p(θ2 | θ1)
1. p1(θ1) = ∫ p(θ1 | θ2) p2(θ2) dθ2 and similar for p2(θ2)
2. p1(θ1) = ∫ p(θ1 | θ2) [∫ p(θ2 | θ1′) p1(θ1′) dθ1′] dθ2
= ∫ [∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2] p1(θ1′) dθ1′
= ∫ K1(θ1′, θ1) p1(θ1′) dθ1′
with K1(θ1′, θ1) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2
Gibbs sampler:
K(θ, θ′) = p(θ1′ | θ2, . . . , θd) p(θ2′ | θ1′, θ3, . . . , θd) · · · p(θd′ | θ1′, . . . , θ(d−1)′)
K1(θ1′, θ1) = ∫ p(θ1 | θ2) p(θ2 | θ1′) dθ2:
transition kernel expressing that a move from θ1′ to θ1 can be made via all possible values of θ2
Sampling the full conditionals is done via different algorithms, depending on:
. Shape of the full conditional (classical versus general-purpose algorithm)
. Preference of the software developer:
SAS® procedures GENMOD, LIFEREG and PHREG: ARMS algorithm
WinBUGS: variety of samplers
y = auxiliary variable
Three cases:
. [a, b] finite interval: sample from [a, b] × [0, m = max f(x)] and reject if outside region A
. General unimodal case
. General multimodal case
y | x ~ U(0, f(x))
x | y ~ U(miny, maxy)
with miny (maxy) the minimal (maximal) x-value of the solutions of y = f(x)
[Figure: slice sampler: density f(x) with a horizontal slice S(y)]
Became popular only after the introduction of Gelfand & Smith's paper (1990)
Sketch of algorithm:
Metropolis algorithm:
Chain is at θᵏ ⟹ the Metropolis algorithm samples the value θ^(k+1) as follows (see the sketch below):
Function α(θᵏ, θ̃) = probability of a move
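A random-walk Metropolis sketch in R for a one-dimensional posterior known up to a constant; the Beta(19, 133) posterior of Example II.1 is used as target, and the proposal sd of 0.02 is an arbitrary tuning choice:

# Random-walk Metropolis for a univariate target (log scale)
log_post <- function(theta) dbeta(theta, 19, 133, log = TRUE)
set.seed(1)
K <- 5000; chain <- numeric(K); chain[1] <- 0.1
for (k in 1:(K - 1)) {
  prop <- chain[k] + rnorm(1, 0, 0.02)          # symmetric proposal around theta^k
  logr <- log_post(prop) - log_post(chain[k])   # log posterior ratio
  # alpha(theta^k, theta~) = min(1, r): probability of a move
  chain[k + 1] <- if (log(runif(1)) < logr) prop else chain[k]
}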
[Figure: (a) and (b): Metropolis sampling paths in the (μ, σ²)-plane]
Marginal distributions:
[Figure: (a) and (b): marginal posterior densities of μ and σ²]
Trace plots:
[Figure: (a) and (b): trace plots of μ and σ²]
[Figure: (a) and (b): sampled posteriors of μ and σ²]
Metropolis-Hastings algorithm:
Chain is at θᵏ ⟹ the Metropolis-Hastings algorithm samples the value θ^(k+1) as follows:
Example of an asymmetric proposal density: q(θ̃ | θᵏ) = q(θ̃) (Independent MH algorithm)
[Figure: (a) and (b): asymmetric proposal densities on the t-scale]
Only possible jumps are to parameter vectors θ̃ that match θᵏ on all components other than the jth. Then the ratio r in the jth substep:
r = [p(θ̃ | y) qjG(θᵏ | θ̃)] / [p(θᵏ | y) qjG(θ̃ | θᵏ)]
= [p(θ̃ | y) p(θjᵏ | θ̃₋j, y)] / [p(θᵏ | y) p(θ̃j | θᵏ₋j, y)]
= [p(θ̃ | y)/p(θ̃j | θᵏ₋j, y)] / [p(θᵏ | y)/p(θjᵏ | θ̃₋j, y)] = 1
⟹ each jump is accepted
Jump in 2 components:
. First component (move): K(θ, θ̃) = α(θ, θ̃) q(θ̃ | θ)
⟹ move to θ′ = θ̃ suggested by the proposal density q(θ̃ | θ) and accepted with probability α(θ, θ̃)
. Second component (stay): r(θ) = 1 − ∫_{ℝᵈ} α(θ, θ̃) q(θ̃ | θ) dθ̃
⟹ probability that no move is made, i.e. θ′ = θ
Reversibility condition:
Probability to move from set A to set B = probability to move from set B to set A (any sets A and B in the parameter space):
∫_A p(θ, B) dθ = ∫_B p(θ, A) dθ
AR algorithm:
MH algorithm:
Proposal density: q(θ̃ | θ) = q(θ̃ − θ), e.g. q(θ̃ − θ) ∝ q(|θ̃ − θ|)
⟹ proposal density is symmetric and gives the Metropolis algorithm
. Multivariate normal density: WinBUGS & SAS® procedures
. Multivariate t-distribution: SAS® PROC MCMC for long-tailed posteriors
Proposal density does not depend on the position in the chain, e.g.
q(θ̃ | θ) = Nd(θ̃ | μ, Σ)
For the Independent MH algorithm a high acceptance rate is desirable; it is achieved when the proposal density q(θ) is close to the posterior density
If p(θ | y) ≤ A q(θ) for all θ, then the Markov chain generated by the Independent MH algorithm enjoys excellent convergence properties (Theorem 7.8) and the expected acceptance probability exceeds that of the AR algorithm (Lemma 7.9)
WinBUGS: regression coefficients in one block (blocking option switched on) and variance parameters in another block
Theorem (Markov chain Law of Large Numbers): for an ergodic Markov chain with a finite expected value for t(θ), the sample mean t̄K converges to the true posterior mean
The MH algorithm creates a reversible Markov chain, i.e. a Markov chain that
satisfies the detailed balance condition.
Proof (discrete case):
πj = p(θ = xj)
Q = (qij): matrix that describes the move from xi to xj with probability qij
Probability that a move from xi is made to xj (≠ xi): αij = min(1, πj qji / (πi qij))
⟹ π is the stationary distribution
⟹ the MH algorithm creates a Markov chain where the target distribution is also the stationary distribution
+ extra verifications show that the LLN and CLT for ergodic chains can be applied
. Research questions:
Do girls have a different risk for developing caries experience (CE) than boys (gender) in the first year of primary school?
Is there an east-west gradient (x-coordinate) in CE?
. Bayesian model: logistic regression + N(0, 1002) priors for regression coefficients
. No standard full conditionals
. Three algorithms:
Self-written R program: evaluate full conditionals on a grid + ICDF method
WinBUGS program: multivariate MH algorithm (blocking mode on)
SAS® procedure MCMC: random-walk MH algorithm
Posterior means/medians of the three samplers are close (to the MLE)
Precision with which the posterior mean was determined (high precision = low
MCSE) differs considerably
Applications:
Mixtures with an unknown # of components and hidden Markov models
Change-point problems with an unknown # of change-points/locations
Model and variable selection problems
Analysis of quantitative trait locus (QTL) data
Theory is complex:
Idea: create 1-to-1 function between the spaces of different dimensions
Construct an MH algorithm that satisfies the detailed-balance condition
Poisson likelihood: L(θ | y) = ∏_{i=1}^n (θ^{yi}/yi!) exp(−nθ)
Negative binomial model: L(θ, α | y) = ∏_{i=1}^n [Γ(1/α + yi)/(Γ(1/α) yi!)] [1/(1 + αθ)]^{1/α} [αθ/(1 + αθ)]^{yi}
Two models:
Four settings
. Preference was measured by % of times the Markov chain was in model 2
(negative binomial): between 56% and 74%
. A suggested move from model 1 to model 2 was always accepted
. Percentage of trans-dimensional moves: between 32% and 55%
[Figure: trace plots of the parameters across the reversible-jump iterations]
[Figure: posterior densities of the parameters under the two models]