Bayesian Data Analysis


Bayesian Data Analysis

An Introduction
By Martin Roa Villescas
What is Bayesian inference?

Bayesian inference is reallocation of credibility across possibilities.
[Figure: three pairs of bar charts of credibility (y-axis, 0.0 to 1.0) over possibilities A, B, C, D (x-axis). Each pair shows a prior and the posterior after learning, in turn, that A is impossible, that B is impossible, and that C is impossible; the credibility of each eliminated possibility is reallocated to the possibilities that remain.]
This reallocation of credibility is not only intuitive, it is also what the exact mathematics of Bayesian inference prescribe!
Foundational ideas

Bayesian data analysis has two foundational ideas:

1) Bayesian inference is reallocation of credibility across possibilities.

2) The possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models.

You can think of parameters as control knobs on mathematical devices that simulate data generation (see the sketch below).
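As an illustration of the control-knob metaphor (a minimal sketch of my own, not from the slides), here is a tiny Python "device" whose knob θ sets the probability of heads in simulated coin flips:

import random

def simulate_flips(theta, n, seed=0):
    # theta is the "control knob": the underlying probability of heads.
    # Each flip is 1 (heads) with probability theta, else 0 (tails).
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

# Turning the knob changes the kind of data the device generates.
print(sum(simulate_flips(0.5, 1000)))  # roughly 500 heads
print(sum(simulate_flips(0.9, 1000)))  # roughly 900 heads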
Bayesian probability

The mathematics that govern the reallocation of credibility boil down to one simple equation: Bayes’ rule!

Bayes’ rule is derived from:
– The conditional probability law
– The total probability theorem
Thomas Bayes

Thomas Bayes (1702-1761) was an English mathematician and Presbyterian minister.

His theorem was published posthumously in 1763 by Richard Price.

An alternative approach to statistical inference, known as frequentist inference, emerged in the 20th century, with Ronald Fisher (1890-1962) as its main figure.

Although the Fisherian approach was dominant in the 20th century, it is curious and reassuring that the older Bayesian approach of the 18th century is taking over in the 21st century.
Probabilistic models

A probabilistic model is a mathematical description of an uncertain situation.

It has a sample space Ω, which is the set of all possible outcomes of an experiment.

And it has a probability law, which assigns to a set A of possible outcomes a nonnegative number P(A) that encodes our belief about the collective “likelihood” of the elements of A.
Probabilistic models

Elements of the sample space must be distinct and mutually exclusive.

The sample space must be collectively exhaustive.

A small example of these ingredients is sketched below.
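Here is a minimal probabilistic model of a fair six-sided die in Python (my example, not from the slides): the sample space is a set of distinct, mutually exclusive outcomes, and the probability law assigns a nonnegative number to any event A ⊆ Ω:

from fractions import Fraction

# Sample space: distinct, mutually exclusive, collectively exhaustive outcomes.
omega = {1, 2, 3, 4, 5, 6}

# Probability law: a nonnegative number per outcome; the numbers sum to 1.
law = {outcome: Fraction(1, 6) for outcome in omega}

def P(event):
    # Probability of an event A, a subset of the sample space.
    assert event <= omega, "events must be subsets of the sample space"
    return sum(law[outcome] for outcome in event)

print(P({2, 4, 6}))  # probability of rolling an even number: 1/2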
Conditional probability law

For events A and B with p(B) > 0, the conditional probability of A given B is

p(A|B) = p(A, B) / p(B),

where p(A, B) is the probability that both A and B occur. Rearranged, this is the product rule: p(A, B) = p(A|B) p(B).
Total probability theorem

If A1, …, An are disjoint events that form a partition of the sample space, then for any event B,

p(B) = Σi p(B|Ai) p(Ai).
Bayes’ rule

Combining the two results yields Bayes’ rule:

Conditional probability law (product rule): p(Ai, B) = p(B|Ai) p(Ai)

Total probability theorem: p(B) = Σj p(B|Aj) p(Aj)

Bayes’ rule: p(Ai|B) = p(B|Ai) p(Ai) / Σj p(B|Aj) p(Aj)
Bayes’ rule

Bayes’ rule is merely the mathematical relation between the prior allocation of credibility and the posterior reallocation of credibility conditional on data.

What is inference?
– There are a number of “causes” θi that may result in an “effect” D.
– We observe the effect D, and we wish to infer the cause θi.
Bayes’ rule

p(θi|D) = p(D|θi) p(θi) / p(D)

– Likelihood function, p(D|θi): the probability that the data D could be generated by the model with parameter value θi. Although it specifies a probability at each value of θi, the likelihood function is not a probability distribution.

– Prior distribution, p(θi): the probability distribution that describes the credibility of the parameter values θi before the data D are taken into account.

– Evidence, p(D): the overall probability of the data according to the model, determined by averaging across all possible parameter values, weighted by the strength of belief in those parameter values.

– Posterior distribution, p(θi|D): the probability distribution that describes the credibility of the parameter values θi with the data D taken into account.

A small numerical sketch of these four pieces follows.
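To make the four pieces concrete, here is a minimal Python sketch of Bayes’ rule for a discrete set of candidate causes θi; the prior and likelihood numbers are made up for illustration:

# Candidate "causes" theta_i and our prior credibility in each.
prior = {"theta_1": 0.3, "theta_2": 0.5, "theta_3": 0.2}

# Likelihood p(D | theta_i): probability of the observed effect D
# under each candidate cause (illustrative numbers).
likelihood = {"theta_1": 0.10, "theta_2": 0.40, "theta_3": 0.80}

# Evidence p(D): the likelihood averaged over the prior.
evidence = sum(likelihood[t] * prior[t] for t in prior)

# Posterior p(theta_i | D): Bayes' rule.
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

print(evidence)   # 0.39
print(posterior)  # theta_1: 0.077, theta_2: 0.513, theta_3: 0.410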
Two-way discrete table

Example of Bayes’ rule in action. Joint and marginal probabilities of eye color and hair color:

                       Hair color
Eye color              Black   Brunette   Red    Blond   Marginal (eye color)
Brown                  0.11    0.20       0.04   0.01    0.37
Blue                   0.03    0.14       0.03   0.16    0.36
Hazel                  0.03    0.09       0.02   0.02    0.16
Green                  0.01    0.05       0.02   0.03    0.11
Marginal (hair color)  0.18    0.48       0.12   0.21    1.0
Two-way discrete table

Conditioning on blue eyes, i.e., dividing the “Blue” row of the table above by its marginal probability 0.36, reallocates credibility across hair colors:

Eye color   Black              Brunette           Red                Blond              Marginal
Blue        0.03/0.36 = 0.08   0.14/0.36 = 0.39   0.03/0.36 = 0.08   0.16/0.36 = 0.45   0.36/0.36 = 1.0
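The same conditioning in a short Python sketch (my code), using the joint table above:

# Joint probabilities p(eye color, hair color) from the table above.
joint = {
    "Brown": {"Black": 0.11, "Brunette": 0.20, "Red": 0.04, "Blond": 0.01},
    "Blue":  {"Black": 0.03, "Brunette": 0.14, "Red": 0.03, "Blond": 0.16},
    "Hazel": {"Black": 0.03, "Brunette": 0.09, "Red": 0.02, "Blond": 0.02},
    "Green": {"Black": 0.01, "Brunette": 0.05, "Red": 0.02, "Blond": 0.03},
}

# Marginal p(eye = Blue): sum across the Blue row.
p_blue = sum(joint["Blue"].values())
print(p_blue)  # approximately 0.36 (up to float rounding)

# Conditional p(hair | eye = Blue): divide the row by its marginal.
p_hair_given_blue = {hair: p / p_blue for hair, p in joint["Blue"].items()}
print(p_hair_given_blue)
# Black 0.083, Brunette 0.389, Red 0.083, Blond 0.444 (rounded on the slide)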
Bayes’ rule

Bayes’ rule in the context of continuous variables:

p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ′) p(θ′) dθ′

The sum over discrete possibilities in the evidence becomes an integral over the continuous parameter.
Bayes’ rule difficulty

For complex models, the integral in the denominator of Bayes’ rule (the evidence) is impossible to solve analytically!

How has this difficulty been addressed?
– Analytically:
  Restricting attention to relatively simple likelihood functions with conjugate priors.
  Variational approximation: approximating functions with others that are easier to work with.
– Numerically:
  Exhaustive summation over a grid of points covering the parameter space.
  Markov chain Monte Carlo (MCMC) methods: randomly sampling a large number of representative combinations of parameter values from the posterior distribution.

A minimal grid-approximation sketch follows this list.
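As an illustration of the grid approach (my sketch, not from the slides), the following Python code approximates the posterior for a coin’s bias by exhaustive summation over a grid of θ values:

import numpy as np

# Grid of candidate parameter values covering the space [0, 1].
theta = np.linspace(0, 1, 1001)

# Uniform prior over the grid.
prior = np.full_like(theta, 1.0 / len(theta))

# Bernoulli likelihood for z heads in N flips.
z, N = 17, 20
likelihood = theta**z * (1 - theta) ** (N - z)

# Evidence: sum of likelihood * prior over the grid; posterior by Bayes' rule.
evidence = np.sum(likelihood * prior)
posterior = likelihood * prior / evidence

print(theta[np.argmax(posterior)])  # posterior mode: 0.85 under this flat prior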
Example

Inferring a binomial probability using pure analytical mathematics, without any approximations.
Step 1: Data type

The first step is to identify the type of data being described.

In this example we will try to estimate the bias of a coin, i.e., the data can take one of two values: heads (1) or tails (0).
Step 2: Descriptive model

The next step is to create a descriptive model with meaningful parameters. This means coming up with a likelihood function.

In this example we will use the Bernoulli likelihood function:

p(D|θ) = θ^z (1 − θ)^(N − z),

where z is the number of heads and N − z is the number of tails.

In this function, θ represents the underlying probability of heads, and therefore it can only take values from 0 to 1.
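Expressed as a short Python function (my sketch), with a check that θ stays in [0, 1]:

def bernoulli_likelihood(theta, z, N):
    # Bernoulli likelihood p(D | theta) for z heads in N flips.
    if not 0 <= theta <= 1:
        raise ValueError("theta is a probability; it must lie in [0, 1]")
    return theta**z * (1 - theta) ** (N - z)

print(bernoulli_likelihood(0.85, 17, 20))  # maximal at theta = z/N = 0.85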
Step 3: Prior distribution

The next step is to establish a prior distribution over the parameter values.

Two desiderata for mathematical tractability:
– The product of the likelihood p(D|θ) and the prior p(θ) results in a function of the same form as p(θ).
– The denominator of Bayes’ rule, p(D) = ∫ p(D|θ) p(θ) dθ, can be solved analytically.

A prior with the first property is said to be conjugate to the likelihood.
Step 3: Prior distribution

Notice that if the prior is of the form

p(θ) ∝ θ^(a−1) (1 − θ)^(b−1),

then when multiplied with the likelihood function, the resulting function is of the same form, namely

θ^(z+a−1) (1 − θ)^(N−z+b−1).

A probability density of that form is called a beta distribution, and it is defined as

p(θ|a, b) = θ^(a−1) (1 − θ)^(b−1) / B(a, b),

where B(a, b) is a normalizing constant that ensures that the area under the beta density integrates to 1, namely

B(a, b) = ∫₀¹ θ^(a−1) (1 − θ)^(b−1) dθ.
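A quick numerical check of this conjugacy (my sketch, using scipy.stats.beta, whose pdf matches the definition above): the ratio of the unnormalized posterior to the beta(z + a, N − z + b) density is the same at every θ, confirming the two have the same form.

from scipy.stats import beta

a, b = 2.0, 2.0  # prior beta(theta | a, b); illustrative choice
z, N = 17, 20    # data: z heads in N flips

for theta in (0.3, 0.6, 0.9):
    # Unnormalized posterior: Bernoulli likelihood times beta prior.
    unnorm = theta**z * (1 - theta) ** (N - z) * beta.pdf(theta, a, b)
    # Ratio to the conjectured posterior density beta(z + a, N - z + b):
    # constant in theta, equal to B(z + a, N - z + b) / B(a, b).
    print(unnorm / beta.pdf(theta, z + a, N - z + b))  # same value each time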
Step 3: Prior distribution

3.0

3.0

3.0

3.0
a = 0.1, b = 0.1 a = 1, b = 0.1 a = 2, b = 0.1 a = 3, b = 0.1

p(θ|a, b)

p(θ|a, b)

p(θ|a, b)

p(θ|a, b)
2.0

2.0

2.0

2.0
1.0

1.0

1.0

1.0
0.0

0.0

0.0

0.0
0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8
θ θ θ θ
3.0

3.0

3.0

3.0
a = 0.1, b = 1 a = 1, b = 1 a = 2, b = 1 a = 3, b = 1

p(θ|a, b)

p(θ|a, b)

p(θ|a, b)
p(θ|a, b)
2.0

2.0

2.0

2.0
1.0

1.0

1.0

1.0
0.0

0.0

0.0

0.0
0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8
θ θ θ θ
3.0

3.0

3.0

3.0
a = 0.1, b = 2 a = 1, b = 2 a = 2, b = 2 a = 3, b = 2
p(θ|a, b)

p(θ|a, b)
p(θ|a, b)
p(θ|a, b)
2.0

2.0

2.0

2.0
1.0

1.0

1.0

1.0
0.0

0.0

0.0

0.0
0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8
θ θ θ θ
3.0

3.0

3.0

3.0
a = 0.1, b = 3 a = 1, b = 3 a = 2, b = 3 a = 3, b = 3
p(θ|a, b)

p(θ|a, b)

p(θ|a, b)
p(θ|a, b)
2.0

2.0

2.0

2.0
1.0

1.0

1.0

1.0
0.0

0.0

0.0

0.0

0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8
θ θ θ θ
34
Step 4: Bayesian inference

The next steps are collecting the data and applying Bayes’ rule to reallocate credibility across the possible parameter values:

p(θ|z, N) = p(z, N|θ) p(θ) / p(z, N)   [Bayes’ rule]

          = [θ^z (1 − θ)^(N−z)] · [θ^(a−1) (1 − θ)^(b−1) / B(a, b)] / p(z, N)   [by the definitions of the Bernoulli and beta distributions]

          = θ^z (1 − θ)^(N−z) θ^(a−1) (1 − θ)^(b−1) / [B(a, b) p(z, N)]   [by rearranging factors]

          = θ^(z+a−1) (1 − θ)^(N−z+b−1) / [B(a, b) p(z, N)]   [by collecting powers]

          = θ^(z+a−1) (1 − θ)^(N−z+b−1) / B(z + a, N − z + b)

The last step follows because the posterior must integrate to 1, which forces B(a, b) p(z, N) = B(z + a, N − z + b).
Step 4: Bayesian inference

If the prior distribution is beta(θ|a, b) and the data have z heads in N flips, then the posterior distribution is beta(θ|z + a, N − z + b).
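In Python (my sketch), the analytical update is just parameter arithmetic, and posterior summaries can be read off the resulting beta distribution. Note that scipy’s interval() gives an equal-tailed credible interval, not the HDI shown in the figure below:

from scipy.stats import beta

a, b = 2, 2    # prior beta(theta | a, b); illustrative choice
z, N = 17, 20  # observed data: z heads in N flips

# Conjugate update: posterior is beta(theta | z + a, N - z + b).
post = beta(z + a, N - z + b)

# Mode of a beta density, valid for shape parameters greater than 1.
mode = (z + a - 1) / (N + a + b - 2)
print(mode)                 # 0.818...
print(post.interval(0.95))  # equal-tailed 95% credible interval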
Step 4: Bayesian inference

[Figure: prior, likelihood, and posterior for the same data, z = 17 heads in N = 20 flips, under three different priors. The Bernoulli likelihood p(D|θ) is maximal at θ = 0.85 in every case. A strongly informed prior dbeta(θ|100, 100) (mode 0.5, 95% HDI [0.431, 0.569]) yields posterior dbeta(θ|117, 103) (mode 0.532, 95% HDI [0.466, 0.597]); a moderately informed prior dbeta(θ|18.25, 6.75) (mode 0.75, 95% HDI [0.558, 0.892]) yields posterior dbeta(θ|35.25, 9.75) (mode 0.797, 95% HDI [0.663, 0.897]); a uniform prior dbeta(θ|1, 1) yields posterior dbeta(θ|18, 4) (mode 0.85, 95% HDI [0.66, 0.959]).]
Step 5: Posterior predictive check

The final step is to check that the posterior predictions mimic the data with reasonable accuracy.
References

Doing Bayesian Data Analysis: A Tutorial with R and BUGS, by J. K. Kruschke

Introduction to Probability, by D. P. Bertsekas and J. N. Tsitsiklis
