
ST3189 Machine Learning

Block 4 - Bayesian Inference


Introduction video (transcript)
Click here for the link to the video.
[music]
Dr. Kostas Kalogeropoulos: Welcome to the video of this block, which introduces an entire branch
of statistics, that of Bayesian inference. Recall the example of the 'Advertising' dataset from the James and
colleagues' book that we studied in the linear regression block.
In addition to this data, suppose that an expert from the marketing department of the company that
supplies this product provides us with the following additional information. Let's assume an increase
of $1,000 in the advertising budget of either TV, radio or newspaper advertisement. The
extra information is that it is extremely unlikely that the sales will change, either upwards or downwards,
by more than 50 units as a result of this increased budget.
Now the question is, how can we incorporate this extra information in our conclusions? One way to
address this question formally is through an alternative school of statistics, that of Bayesian inference.
In addition to allowing us to incorporate additional information in a probabilistic manner, Bayesian inference
is also useful for other reasons, such as incorporating parameter uncertainty in prediction.
In this block we will review and apply its basic concepts, noting differences with the standard school of
statistics, that of frequentist inference. The material of this block will provide you with an
understanding of the basic concepts of Bayesian inference and also the ability to identify its differences
from frequentist inference, specifically in point and interval estimation, hypothesis testing and
prediction. Also, you should be able to implement Bayesian inference in various real-world examples,
including linear regression using R.
[00:02:11] [END OF AUDIO]

Learning outcomes
 Understand the basic concepts of Bayesian inference
 Identify differences between Frequentist and Bayesian Inference in point and interval
estimation, hypothesis testing and prediction
 Implement Bayesian Inference in various examples including linear regression using R.

Introductory example
Recall the example of the 'Advertising' dataset of the James et al book that we analysed in the Linear
Regression block.
Suppose that an expert from the marketing department of the company that supplies this product
provides us with the following additional information.
Let's assume an increase of $1,000 in the budget of either TV, radio or newspaper advertisement. It is
extremely unlikely that the sales will change (upwards or downwards) by more than 50 units as a
result of this increased budget.
How can we incorporate this extra information in our conclusions about the role of these types of
advertisement on the sales of this product?
One way to address this question formally is through an alternative school of statistics, that of
Bayesian inference. In addition to allowing us to incorporate additional information in a probabilistic
manner, Bayesian inference can be useful for other reasons, such as incorporating parameter
uncertainty in prediction (although this can perhaps also be done via the Bootstrap). In this block we will
review and apply its basic concepts, noting differences with frequentist inference.

Essentials of Bayesian Inference

Bayes Theorem for Events


In terms of events and their probabilities, let A and B be two events such that P(A)>0.
Then P(B|A) and P(A|B) are related by:

\[ P(B|A) = \frac{P(A|B)P(B)}{P(A)} = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)} \]

Note that the events B and B^c form a partition of the sample space (i.e. they are disjoint and their
union is equal to the sample space). The law of total probability then ensures that:

\[ P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) \]


More generally, if B₁, B₂, …, B_k form a partition (k can also be ∞), we can write for all j = 1,…,k:

\[ P(B_j|A) = \frac{P(A|B_j)P(B_j)}{P(A)}, \quad \text{where } P(A) = \sum_{i=1}^{k} P(A|B_i)P(B_i) \quad (4.1) \]

Bayes Theorem for Statistical Models


Let y denote the set of observed data. A statistical model is defined by viewing y as the realisation of
a random variable and assigning a probability model on it, typically governed by some unknown
parameters θ. Let f(y|θ) denote the joint density of the observations. Note that f(y|θ) has the same
expression as the likelihood function, a fundamental object in frequentist inference.
In Bayesian inference we also view θ as a random variable and assign on it the so-
called prior distribution π(θ). This distribution represents our uncertainty on the unknown
parameters θ before observing the data y (a-priori).
Assume that θ is a discrete variable, i.e. it takes values θ₁, θ₂, … and therefore the prior distribution is
defined by the probabilities π(θ_j) for each j. Then Bayes theorem for each θ_j is obtained by simply
matching θ_j with B_j and A with y in Equation 4.1:

\[ \pi(\theta_j|y) = \frac{f(y|\theta_j)\pi(\theta_j)}{f(y)}, \quad \text{where } f(y) = \sum_i f(y|\theta_i)\pi(\theta_i) \quad (4.2) \]

The distribution π(θ|y), formed by the probabilities π(θ_j|y) for each j, is termed the posterior
distribution and represents our uncertainty around θ after observing y (a-posteriori).
If θ is a continuous variable, i.e. it takes values on an interval such as the real line R, then π(θ) denotes the
probability density function (pdf) of the prior. The posterior pdf is then given by:

\[ \pi(\theta|y) = \frac{f(y|\theta)\pi(\theta)}{f(y)}, \quad \text{where } f(y) = \int f(y|\theta)\pi(\theta)\,d\theta \quad (4.3) \]
The term f(y) is known as marginal likelihood or evidence and reflects the density (probability for
discrete y) of the data under the adopted probability model. Note that it does not depend on θ, hence it
may be viewed as the normalising constant of π(θ|y) (to ensure that it integrates/sums to 1).

Bayes Estimators
In general, point estimators are functions of y and other known quantities (but not θ), the output of which
provides our 'educated guess' for θ. Examples of point estimators include the maximum likelihood
estimator (MLE), method of moments estimators, least squares estimators, etc.
Bayes estimators provide another alternative. Their exact description requires concepts from
statistical decision theory, such as the loss (utility) function and the frequentist, posterior and Bayes risk (in fact,
Bayes estimators minimise the posterior and Bayes risk), which are beyond the scope of the course. So
we will just focus on the following three Bayes estimators (a short R sketch for computing them from posterior samples follows the list):

 The posterior mean \(\hat{\theta} = E(\theta|y) = \int \theta\,\pi(\theta|y)\,d\theta\)

 The posterior median \(\hat{\theta} = q\) such that P(θ ≤ q|y) = P(θ ≥ q|y) = 0.5

 The posterior mode \(\hat{\theta} = q\) such that π(θ|y) ≤ π(q|y) for all θ, also known as the MAP (maximum a-
posteriori) estimator.
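As an illustration, here is a minimal R sketch showing how the three estimators can be approximated from posterior samples when closed-form expressions are not available. The Beta(3, 5) posterior used here is purely hypothetical; replace it with whatever posterior your model produces.

# Minimal sketch: approximate the three Bayes estimators from posterior samples.
# Assumes (hypothetically) that the posterior is Beta(3, 5).
set.seed(1)
N <- 100000
post_samples <- rbeta(N, 3, 5)          # draws from the posterior

post_mean   <- mean(post_samples)       # posterior mean
post_median <- median(post_samples)     # posterior median

# Posterior mode (MAP): locate the maximum of a kernel density estimate
dens <- density(post_samples)
post_mode <- dens$x[which.max(dens$y)]

c(mean = post_mean, median = post_median, mode = post_mode)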

(Bayesian) Credible Intervals


In frequentist inference the task is often to report a confidence interval with level 100(1−α)% for the
unknown parameter θ. If the experiment was repeated many times, a confidence interval with
level 100(1−α)% for θ would contain the true value of θ in 100(1−α)% of them.
In Bayesian inference the corresponding task is to report a 100(1−α)% credible interval: the unknown
parameter θ lies in a 100(1−α)% credible interval with (posterior) probability 1−α.
Note that there exist many 100(1−α)% credible intervals, in the same way that there exist many 100(1−α)%
confidence intervals. Usually, for a 95% credible interval, the interval from the 2.5% to the 97.5% point
of π(θ|y) is reported. A short R sketch is given below.
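For instance, if the posterior happened to be a Beta distribution (the Beta(17, 7) used here is the posterior that arises in Exercise 4 later in this block; any other posterior would work the same way), the equal-tailed 95% credible interval can be read off its quantiles:

# Equal-tailed 95% credible interval from a Beta(17, 7) posterior
alpha_post <- 17
beta_post  <- 7
qbeta(c(0.025, 0.975), alpha_post, beta_post)   # 2.5% and 97.5% posterior quantiles

# The same interval approximated by Monte Carlo sampling from the posterior
post_samples <- rbeta(100000, alpha_post, beta_post)
quantile(post_samples, c(0.025, 0.975))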

Bayesian Hypothesis Testing / Model Choice


Let H₀: θ ∈ Θ₀ and H₁: θ ∈ Θ₁. Frequentist hypothesis testing provides a test rule which is a function of
the data and other known quantities (but not θ), the output of which is either 'reject H₀' or 'do not
reject H₀'. A test rule with significance level α has the property that, if the experiment were repeated
many times, it would falsely reject H₀ with probability at most α.
In Bayesian hypothesis testing the optimal rule is determined by the highest posterior probability, i.e.
choose H₀ if P(H₀|y) = P(θ ∈ Θ₀|y) > P(H₁|y) = P(θ ∈ Θ₁|y).
An alternative method is provided by the Bayes factors. The Bayes factor against 𝐻0 is defined as:

\[ B_{10}(y) = \frac{P(\theta \in \Theta_1|y)\,/\,P(\theta \in \Theta_0|y)}{P(\theta \in \Theta_1)\,/\,P(\theta \in \Theta_0)} = \frac{P(H_1|y)\,/\,P(H_0|y)}{P(H_1)\,/\,P(H_0)} \quad (4.4) \]

and we choose H₁ if B₁₀(y) > 1. Note that this rule coincides with the previous one
if P(θ ∈ Θ₀) = P(θ ∈ Θ₁), in other words if H₀ and H₁ have equal prior probability.

The value of the Bayes factor can also be interpreted in terms of the strength of evidence
against H₀. In terms of wording, the following guidelines are available:

0 < log B₁₀(y) < 0.5: evidence against H₀ is poor

0.5 < log B₁₀(y) < 1: evidence against H₀ is substantial

1 < log B₁₀(y) < 2: evidence against H₀ is strong

log B₁₀(y) > 2: evidence against H₀ is decisive

Sometimes the following expression is used. Applying Bayes theorem on P(𝐻1 |y) and P(𝐻0 |y) in
(Equation 4.4) we get

\[ B_{10}(y) = \frac{\dfrac{P(y|H_1)P(H_1)}{P(y)\,P(H_1)}}{\dfrac{P(y|H_0)P(H_0)}{P(y)\,P(H_0)}} = \frac{P(y|H_1)}{P(y|H_0)} \]
which is also known as the ratio of marginal likelihoods (evidences) of 𝐻1 and 𝐻0 . Hence, Bayes
factor chooses the hypothesis with the higher marginal likelihood.
Bayesian model choice is a straightforward extension of Bayesian hypothesis testing. In the
presence of several models (there can be more than two), we choose the one with the highest marginal
likelihood.

Bayesian Prediction/Forecasting
Let yₙ denote a future observation. Under the assumption that yₙ comes from the same probability
model as y, we are interested in predicting its value. In Bayesian prediction/forecasting this is done
via the (posterior) predictive distribution, which combines the uncertainty about the unknown
parameters θ as well as the uncertainty of the future observation:

\[ f(y_n|y) = \int f(y_n|\theta)\,\pi(\theta|y)\,d\theta \]

The predictive distribution can be used in different ways (e.g. point prediction, interval prediction,
etc) depending on the forecasting task at hand.
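In practice the integral above is often approximated by Monte Carlo: draw parameter values from the posterior and then draw a future observation from the model for each of them. A minimal R sketch, assuming for illustration a Poisson model whose rate has a Gamma(16, 6) posterior (the posterior obtained in Consolidation Exercise 1 later in this block):

# Monte Carlo approximation of the posterior predictive distribution.
# Illustrative setting: Poisson model, Gamma(16, 6) posterior for its rate.
set.seed(1)
N <- 100000
lambda_draws <- rgamma(N, shape = 16, rate = 6)   # draws from the posterior of lambda
y_new <- rpois(N, lambda_draws)                   # one future observation per posterior draw

table(y_new)[1:6] / N              # approximate predictive probabilities for the smallest values of y_new
mean(y_new)                        # point prediction (predictive mean)
quantile(y_new, c(0.025, 0.975))   # 95% prediction interval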

Prior Specification
As mentioned earlier the prior distribution represents our uncertainty regarding the unknown
parameter θ prior to seeing the data y.
In some cases information may be available prior to the experiment about various features of the
distribution, such as moments, percentiles, probabilities of certain intervals, etc. This information can
then be summarised to define the prior distribution. For example, in an experiment where the
observations are independent Poisson random variables with unknown mean λ, it may be known from
previous experience that the mean of λ is around 5 and the variance around 4.
In this case we can use the Gamma(α,β) distribution, which takes values on the positive real line
as λ does. The prior parameters α and β can be determined using the fact that if λ is Gamma(α,β)
then E(λ) = α/β = 5 and Var(λ) = α/β² = 4. Combining these, we get α = 6.25 and β = 1.25, hence we can
use the Gamma(6.25, 1.25) as our prior. This procedure is known as prior elicitation.
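A quick R sketch of this elicitation, together with a simulation-based sanity check of the resulting prior; the numbers simply reproduce the calculation above.

# Prior elicitation for a Gamma(alpha, beta) prior with E(lambda) = 5 and Var(lambda) = 4.
# Solving alpha/beta = 5 and alpha/beta^2 = 4 gives beta = 5/4 and alpha = 5*beta.
prior_mean <- 5
prior_var  <- 4
beta  <- prior_mean / prior_var    # 1.25
alpha <- prior_mean * beta         # 6.25

# Sanity check via simulation: sample mean and variance should be close to 5 and 4.
draws <- rgamma(100000, shape = alpha, rate = beta)
c(mean(draws), var(draws))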
When there is no information about θ prior to the experiment, it is not straightforward how this
can be reflected via a distribution. Quite often low-informative priors are used, where the
variance of the distribution is set to a very large value. This is usually accompanied by a sensitivity
analysis at the end, to ensure that this choice (of a large value for the variance) does not affect the
conclusions much.

Jeffreys Prior
An alternative approach is the Jeffreys prior which is based on Fisher's information for θ. Given the
joint density (likelihood) f(y|θ), Fisher's information for θ is defined as:

\[ I(\theta) = E_Y\!\left[\left(\frac{\partial \log f(y|\theta)}{\partial \theta}\right)^{2}\right] = -E_Y\!\left[\frac{\partial^2 \log f(y|\theta)}{\partial \theta^2}\right] \]

Jeffreys suggested the prior π(θ) ∝ det(I(θ))^{1/2}, which in the 1-parameter case reduces to π(θ) ∝ I(θ)^{1/2}.
Theorem:
Jeffreys prior is invariant to transformations. This is an appealing property for a distribution claiming
to contain no information on θ in the sense that if no information is available on θ then no information
should be available for any deterministic function of θ either.
Lemma: Let θ = g(ϕ). Then if θ ∼ π_θ(θ), the pdf of ϕ is

\[ \pi_\phi(\phi) = \pi_\theta(g(\phi))\left|\frac{\partial g(\phi)}{\partial \phi}\right| = \pi_\theta(\theta)\left|\frac{\partial \theta}{\partial \phi}\right| \]

For example, for a random sample y = (y₁,…,yₙ) from the Poisson(λ) distribution, the Fisher information is I(λ) = n/λ.
Hence the Jeffreys prior is π(λ) ∝ 1/√λ.



Note that this expression is not a valid distribution, as the integral \(\int_0^\infty \lambda^{-1/2}\,d\lambda\) is equal to infinity.
Nevertheless, if we use this expression as a prior, simple calculations yield the posterior
distribution Gamma(½ + Σᵢ yᵢ, n). Using expressions that are not distributions but yield a
proper posterior distribution is known as using improper priors. Improper priors provide a
popular alternative to low-informative priors in cases where no information is available prior to the
experiment.

Bayesian Central Limit Theorem


The posterior distribution π(θ|x) is necessary for all tasks of Bayesian inference. There are many
cases however, especially in models with many unknown parameters, where this distribution is not
available in closed form. A widely used approximation of the posterior distribution is the Laplace
approximation. Let x = (x₁,…,xₙ) be a sample with joint density/likelihood f(x|θ), θ = (θ₁,…,θ_p), and
denote the prior by π(θ). Let π*(θ|x) = f(x|θ)π(θ) and denote by θ_M the posterior mode, which is (under
regularity conditions) a solution of

\[ \frac{\partial \log \pi^*(\theta|x)}{\partial \theta_i} = 0, \quad \text{for all } i = 1,\ldots,p. \]

Denote also by H(θ) the Hessian matrix, with entries

\[ [H(\theta)]_{ij} = -\frac{\partial^2 \log \pi^*(\theta|x)}{\partial \theta_i \partial \theta_j} \]

Then, as n → ∞,

\[ \pi(\theta|x) \to N_p\big(\theta_M,\; H^{-1}(\theta_M)\big) \quad (4.5) \]

where N_p denotes the p-dimensional Normal distribution. The proof of the above statement is similar to that of the
asymptotic distribution of maximum likelihood estimators; hence it is often referred to as the
'Bayesian Central Limit Theorem'.
The expression in Equation 4.5 offers a way to conduct Bayesian Inference in an 'asymptotic' manner,
i.e. based on an approximate posterior distribution whose approximation error is small when we have
a 'large' amount of data. Note however that this is just an approximation and it could perform quite
poorly, especially in relatively small datasets.

Examples of Bayesian Inference

Introduction
We will go through some examples of Bayesian inference. In most of these we will use the following
trick in order to find the posterior distribution without calculating the denominator in Bayes
theorem, which is generally a difficult integral.

A trick for finding posterior distributions


Since ∫f(x|θ)π(θ)dθ does not depend on θ, we can write π(θ|x) ∝ f(x|θ)π(θ), and the posterior can then
often be recognised from its kernel. For example:

 If θ|x ∼ N(μ, σ²), for σ² known, then

\[ \pi(\theta|x) \propto \exp\left(-\frac{(\theta-\mu)^2}{2\sigma^2}\right) \propto \exp\left(-\frac{\theta^2 - 2\theta\mu}{2\sigma^2}\right) \]

 If θ|x ∼ Gamma(α, β), then π(θ|x) ∝ θ^{α−1} e^{−βθ}

 If θ|x ∼ Beta(α, β), then π(θ|x) ∝ θ^{α−1}(1 − θ)^{β−1}


So, one approach is to work out f(x|θ)π(θ) in terms of θ and check with the above (or other known
distributions).

Beta-Binomial
Suppose that y is an observation from a Binomial(n, θ) random variable. The likelihood is given by the
probability of y given θ, which is provided by the Binomial distribution:

\[ f(y|\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \propto \theta^{y}(1-\theta)^{n-y} \]

As 0 < θ < 1, a distribution with this support must be chosen as the prior. A Beta distribution with hyper-
parameters α and β, denoted Beta(α, β), provides such an example:

\[ \pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1} \]

The posterior distribution can then be obtained as

\[ \pi(\theta|y) \propto \theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}, \]

that is, θ|y ∼ Beta(α + y, β + n − y).

Note that the posterior mean, one of the most commonly used Bayes estimators, which could be
used as an estimator for θ, is equal to \(\frac{\alpha+y}{\alpha+\beta+n}\). For α = β = 0 it coincides with the maximum
likelihood estimator (MLE) y/n. No such (proper) prior distribution exists — it is an improper prior — but the
posterior is still well defined.
Other Bayes estimators include the posterior mode, which is equal to \(\frac{\alpha+y-1}{\alpha+\beta+n-2}\), and the
posterior median, which is not available in closed form but can be calculated via Monte Carlo;
see the short R sketch below and the relevant section in this block.
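A minimal R sketch of these quantities, assuming for illustration the values y = 15, n = 20 and a Beta(2, 2) prior (the same numbers used in Exercise 4 later in this block):

# Beta-Binomial: posterior and Bayes estimators (illustrative values: y = 15, n = 20, Beta(2,2) prior)
y <- 15; n <- 20
alpha <- 2; beta <- 2

post_alpha <- alpha + y          # posterior is Beta(alpha + y, beta + n - y)
post_beta  <- beta + n - y

post_mean <- post_alpha / (post_alpha + post_beta)            # posterior mean
post_mode <- (post_alpha - 1) / (post_alpha + post_beta - 2)  # posterior mode (MAP)

# Posterior median via Monte Carlo (no closed form)
post_median <- median(rbeta(100000, post_alpha, post_beta))

c(mean = post_mean, mode = post_mode, median = post_median)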
Poisson-Gamma
Let y = (𝑦1 ,…, 𝑦𝑛 ) be a random sample (𝑦𝑖 's are independent and identically distributed) from the
Poisson(λ) population. The likelihood is given by the joint density of the sample

\[ f(y|\lambda) = \prod_{i=1}^{n} \frac{\exp(-\lambda)\lambda^{y_i}}{y_i!} \propto \exp(-n\lambda)\,\lambda^{\sum_i y_i} \]

As λ > 0, a distribution on the positive real line must be chosen as the prior. The Gamma distribution with hyper-
parameters α and β, denoted Gamma(α, β), provides such an example:

\[ \pi(\lambda) \propto \lambda^{\alpha-1}\exp(-\beta\lambda) \]

The posterior can then be obtained as

\[ \pi(\lambda|y) \propto \lambda^{\alpha + \sum_i y_i - 1}\exp\big(-(\beta + n)\lambda\big), \]

that is, λ|y ∼ Gamma(α + Σᵢ yᵢ, β + n).

Note that the posterior mean, one of the standard Bayes estimators, can be written as

\[ E(\lambda|y) = \frac{\alpha + \sum_i y_i}{\beta + n} = \frac{\beta}{\beta+n}\cdot\frac{\alpha}{\beta} + \frac{n}{\beta+n}\,\bar{y}, \]
which is a weighted average between the prior mean and 𝑦̅. As n→∞ the posterior mean converges
to 𝑦̅ which is the MLE in this case.
Let now yₙ denote a future observation from the same Poisson(λ) model (likelihood). The predictive
distribution for yₙ is

\[ f(y_n|y) = \int_0^\infty \frac{e^{-\lambda}\lambda^{y_n}}{y_n!}\,\pi(\lambda|y)\,d\lambda = \frac{\Gamma(\alpha^* + y_n)}{\Gamma(\alpha^*)\,y_n!}\left(\frac{\beta^*}{\beta^*+1}\right)^{\alpha^*}\left(\frac{1}{\beta^*+1}\right)^{y_n}, \]

for yₙ = 0, 1, …, where α* = α + Σᵢ yᵢ and β* = β + n. This is the Negative Binomial distribution with
size parameter α* and success probability β*/(β* + 1).
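A short R sketch, assuming for illustration α = β = 1 and the defect counts that appear in Consolidation Exercise 1 of this block, comparing the exact Negative Binomial predictive with a Monte Carlo approximation:

# Poisson-Gamma predictive distribution (illustrative data and prior: alpha = beta = 1)
y <- c(2, 2, 6, 0, 3)
alpha <- 1; beta <- 1
a_star <- alpha + sum(y)    # posterior shape
b_star <- beta + length(y)  # posterior rate

# Exact predictive probabilities: Negative Binomial(size = a_star, prob = b_star/(b_star+1))
dnbinom(0:5, size = a_star, prob = b_star / (b_star + 1))

# Monte Carlo check: sample lambda from the posterior, then a future count from Poisson(lambda)
lambda_draws <- rgamma(100000, shape = a_star, rate = b_star)
table(rpois(100000, lambda_draws))[1:6] / 100000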

Normal-Normal
Let y=(𝑦1 ,…, 𝑦𝑛 ) be a random sample (𝑦𝑖 's are independent and identically distributed) from a
N(θ, σ²) distribution with σ² known. The likelihood is given by the joint density of the sample,

\[ f(y|\theta) \propto \exp\left(-\frac{\sum_{i=1}^{n}(y_i-\theta)^2}{2\sigma^2}\right) \propto \exp\left(-\frac{(\theta-\bar{y})^2}{2\sigma^2/n}\right). \]

We assume another Normal prior for θ, N(μ, τ²), which gives

\[ \pi(\theta) \propto \exp\left(-\frac{(\theta-\mu)^2}{2\tau^2}\right). \]
The posterior can then be obtained as

\[ \pi(\theta|y) \propto \exp\left(-\frac{(\theta-\bar{y})^2}{2\sigma^2/n} - \frac{(\theta-\mu)^2}{2\tau^2}\right), \]

which, after completing the square, is the Normal distribution with mean \(\dfrac{\bar{y}\tau^2 + \mu\,\sigma^2/n}{\sigma^2/n + \tau^2}\) and variance \(\dfrac{(\sigma^2/n)\,\tau^2}{\sigma^2/n + \tau^2}\).

As with the previous example, note that the posterior mean can be written as a weighted average
between the prior mean μ and the MLE ȳ:

\[ E(\theta|y) = \left(1 - \frac{\tau^2}{\sigma^2/n + \tau^2}\right)\mu + \frac{\tau^2}{\sigma^2/n + \tau^2}\,\bar{y}. \]

Note that the posterior distribution in the Normal case is symmetric, so the posterior mean coincides with the posterior
median and mode. Hence all previously mentioned Bayes estimators coincide with the one above.
In case we want to obtain a symmetric 100(1−α)% credible interval for θ, we can use the mean and the
standard deviation of the posterior (similarly to confidence intervals) to get

\[ \frac{\bar{y}\tau^2 + \mu\,\sigma^2/n}{\sigma^2/n + \tau^2} \;\pm\; z_{\alpha/2}\sqrt{\frac{(\sigma^2/n)\,\tau^2}{\sigma^2/n + \tau^2}} \]

where z_{α/2} is the upper α/2 point of the N(0,1) distribution. If we assume the improper prior π(θ) ∝ 1 (which is the
Jeffreys prior in this case), we get as posterior the N(ȳ, σ²/n). The corresponding 100(1−α)% credible
interval will then be

\[ \bar{y} \pm z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \]

which is the same as the 100(1−α)% confidence interval. Their interpretation is, however, different;
make sure you can explain the difference.
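A minimal R sketch of these formulas; the numbers reproduce the IQ example of Self-test Activity 3 later in this block (a single observation x = 98 with σ² = 80 and prior N(110, 120)), so the output should match the N(102.8, 48) posterior quoted there.

# Normal-Normal: posterior mean/variance and a 95% credible interval
ybar <- 98; n <- 1; sigma2 <- 80    # data summary: sample mean, size, known variance
mu <- 110; tau2 <- 120              # prior mean and variance

post_var  <- (sigma2/n * tau2) / (sigma2/n + tau2)
post_mean <- (ybar * tau2 + mu * sigma2/n) / (sigma2/n + tau2)

c(post_mean, post_var)                                   # should give 102.8 and 48
post_mean + c(-1, 1) * qnorm(0.975) * sqrt(post_var)     # symmetric 95% credible interval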

Linear Regression
Consider the multiple linear regression model that we first discussed in Block 2.

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i \]

with ε_i ∼ N(0, σ²) for all i, and (ε₁, …, εₙ) being independent.

Alternatively, using matrix algebra notation, define ε = (ε₁, …, εₙ)ᵀ, β = (β₀, β₁, β₂, β₃)ᵀ and the
design matrix

\[ X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{p1} \\ 1 & X_{12} & \cdots & X_{p2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{pn} \end{pmatrix} \]
Then we can rewrite the regression equation above in matrix notation as

Y = Xβ + ϵ
where ϵ ∼ 𝑁𝑛 (0𝑛 ,𝜎 2 𝐼𝑛 ), with 𝑁𝑛 (⋅) denoting the multivariate Normal distribution of
dimension n, 0𝑛 being n-dimensional vector of zeros and 𝐼𝑛 being the identity matrix of dimension n.

We first need to specify a prior on the parameters θ = (β, σ²). We will factorise the joint density
of θ as

\[ \pi(\beta, \sigma^2) = \pi(\sigma^2)\,\pi(\beta|\sigma^2) \]

We assign the Inverse Gamma distribution as a prior on σ², with parameters α₀ and β₀, such
that

\[ \pi(\sigma^2) = \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\,(\sigma^2)^{-\alpha_0-1}\exp\left(-\frac{\beta_0}{\sigma^2}\right) \]

Regarding π(β|σ²), one option is the N(μ₀, σ²Ω₀) distribution.

With standard, yet tedious, matrix algebra calculations that are beyond the scope of this course, one
can then show that the posterior can again be factorised as

\[ \pi(\beta, \sigma^2|y, X) = \pi(\sigma^2|y, X)\,\pi(\beta|\sigma^2, y, X) \]

where the first term is the Inverse Gamma distribution with parameters α_n and β_n,
whereas β given σ² is the N(μ_n, σ²Ω_n), with

\[ \Omega_n = \left(\Omega_0^{-1} + X^T X\right)^{-1}, \qquad \mu_n = \Omega_n\left(\Omega_0^{-1}\mu_0 + X^T y\right) \]

(the expressions for α_n and β_n are omitted here).
Loosely speaking, one can note that as Ω₀ goes to infinity, μ_n converges to the MLE (XᵀX)⁻¹Xᵀy; in
other words, as the prior information vanishes, the Bayes estimator coincides with the MLE. But if there
is some prior information, reflected through the prior mean μ₀ and the prior variance Ω₀, then the
Bayes estimator combines information from the data and the prior beliefs.
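A small R sketch of the conjugate updating formula for μ_n above, on simulated data (all values purely illustrative); Consolidation Exercise 6 at the end of this block shows how the same kind of model can be fitted to the Advertising data with the MCMCpack package.

# Conjugate Bayesian linear regression: posterior mean of beta (simulated, illustrative data)
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))          # design matrix with intercept column
beta_true <- c(2, 0.03, 0.1, -0.02)
y <- X %*% beta_true + rnorm(n, sd = 0.5)

mu0    <- rep(0, p + 1)                            # prior mean for beta
Omega0 <- diag(c(1e6, rep(0.025^2, p)))            # prior scale matrix (vague on the intercept)

Omega_n <- solve(solve(Omega0) + t(X) %*% X)       # posterior scale matrix
mu_n    <- Omega_n %*% (solve(Omega0) %*% mu0 + t(X) %*% y)   # posterior mean

cbind(bayes = drop(mu_n), mle = drop(solve(t(X) %*% X) %*% t(X) %*% y))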

Bayesian Inference Video (Transcript)


To watch this video, click here.
[music]
Dr. Kostas Kalogeropoulos: Hi, there. This is an illustration video for the course on machine learning
and, in particular, its fourth block on Bayesian inference. We will see two things in this video. The first
one regards conjugate models. Conjugate models are very convenient models, often encountered in
Bayesian inference. They have the interesting feature that both the prior and the posterior
distributions come from the same family of distributions. For example, if you have a binomial likelihood
and you assume a beta prior on the unknown probability of success, then if you do the calculations, it
turns out that the posterior is another beta distribution, it just has different parameters.
Similarly, if the data come from a Poisson distribution and you have a random sample from there, if
you assign a gamma prior on the unknown mean of the Poisson, the resulting posterior will also be a
gamma distribution with just different parameters. Another example, of course, is given if you have a
normal likelihood and you want to estimate the unknown mean of this normal, assuming that the variance
is known, then assigning a normal prior on the unknown mean will lead to a normal posterior. In this
video, we'll see in detail the second example in the Poisson gamma case.
Assume that you have a random sample of X's, X1 up to Xn; they're all independent and they all come
from the Poisson distribution with mean lambda, and let X denote the vector of the X's. The first thing
you need to do, pretty much in every exercise in statistics, is to find the likelihood, which is just given
by the joint density of the sample X. You can do that like this: because this is a random sample,
the joint density is just the product of the individual densities, so we have just this product. This is the pdf of
the Poisson, and standard calculations here give you that this is proportional — you only need to
derive the likelihood up to proportionality — to e to the minus n lambda, lambda
to the sum of the observations. That's the likelihood.

The second thing you need to do then is to assign a prior on the unknown lambda parameter, and we
have to choose a suitable distribution, this is the case where we assign a gamma prior. We know that,
so we'll assume that a priori, our uncertainty knowledge about lambda is reflected by the gamma
distribution that has two hyperparameters, alpha and beta. Typically, this will be known and actually,
the researcher has to choose them, but in order to derive a general expression here, we just denote them
with the Greek letters alpha and beta, and we will give the result depending on these parameters at the
end.
Again, you only need to write down, and that saves you a lot of trouble, the
prior up to proportionality. That means that you only need to include the terms that involve lambda;
any other term that has nothing to do with lambda, you can skip it, remembering though to put the
proportionality sign, not the equality sign. That's step number two. Step number three now is to just derive
the posterior. As always, we know that the posterior of lambda given X, in this case, is equal to the product
of the likelihood and the prior, so just write things down. Then I'm going to substitute the
likelihood and the prior with the expressions in the previous slides.
The likelihood is this guy here and the prior is this guy here. That's all I did here. The next step here,
we just put together the lambdas and the exponentials. If we do that, we get lambda to the alpha plus sum of Xi
minus one, and then for the exponentials you can put them together as e to the minus lambda
times n plus beta. Now it's time for the standard trick. You have to think, does this expression here
resemble the density of a distribution that you know?
Actually, if we go a bit back to where lambda had a gamma prior, remember it was just
a gamma of alpha and beta: its pdf was lambda to the first parameter, alpha, minus one, and
then we had e to the minus lambda times beta, the second parameter. Now, let's
see, again, where we end up. Here we have lambda to something minus one and then e to the minus
lambda times something. Well, it's not very difficult now to see that this something and this something
could be the parameters of the new gamma posterior. Actually, that's exactly how we derive it. Here is
the derivation of the gamma posterior in the Poisson likelihood, the Poisson-gamma conjugate model.
That was the first thing we were going to talk about regarding Bayesian inference. The second thing I want to
talk about is the predictive distribution, in particular Bayesian forecasting. Essentially, in
Bayesian inference you do forecasting using the predictive distribution. What is the predictive
distribution? Well, you denote it with this f of Y given X. Y here represents the future observations that
we haven't seen, X represents the training data. Obviously, we have a model for the data, and the
predictive distribution is this integral: f of Y given theta here is the density of the future observation,
and pi of theta given X is the posterior of theta based on the training data X.
Now, you can see it as: what do we know about the future observation Y? Well, the first thing that
can express our uncertainty is its density, assuming that we knew theta; that's what f of Y given
theta does. Of course, this assumption is not correct, we don't know theta. What Bayesians do is
take an average, some sort of an average: you can see this integral as a weighted average where the
weights are given by the posterior. The most probable thetas, according to the posterior, will get more weight
and the less probable will get less weight. This averaging, that's how you can think of this integral, and this
averaging then gives you the predictive distribution.
I'm going to show you in the next slide an example and how to derive this predictive distribution. In
order to do this, to do this derivation, I'm going to use a special trick, which is a very handy trick and
you can use in other exercises. It's a neat way of calculating very difficult integrals without doing any
calculus, essentially just by remembering the formula of a probability density function. Here is the trick.
Let's say that lambda has a distribution, which is a gamma distribution with parameters A and B. Then,
we can write that because the pdf f of lambda is a pdf, then the integral over the range of lambda will
integrate to one. That's the definition of a pdf.

What I'm going to do here is I'm going to substitute f of lambda with its exact expression. Remember
this is the gamma distribution with parameters A and B. This is the pdf of lambda. Essentially, what I'm
going to do now is I'm going to take out the parts in this integral, this is an integral with respect to the
lambda, with lambda. Everything that doesn't involve lambda can come out of the integral. I will take
this part here, this fraction, out and bring it to this side.
Then, I will get this expression here and that tells you that this integral, which is by no means easy to
compute, can be expressed in that form. If you're going to do that with calculus, you have to do
integration by parts several times. It's a very neat trick to learn; if you just remember
that the pdf of lambda, which is the [unintelligible 00:09:13] one, integrates to one, you can immediately
find a solution. This holds for all values of A and B.
Now, let's go to the example. Let's derive the predictive distribution in a case where the likelihood is an
Exponential(lambda) distribution. If we assume a gamma prior and we do a similar calculation to
what we did earlier, then we get the posterior as a Gamma with parameters n plus alpha and n x-bar plus beta. We have
the posterior, we have the model, we have f of y given lambda; how do we find the predictive
distribution? Remember that the predictive distribution is for Y, which is a positive random variable, so the range of Y is
important, don't forget to write it down. What I'm going to do now is I'm going to-- This is a big integral
that can seem a bit scary, but it's not so scary. All this is doing is, essentially, if you go back to this
formula, I'm just substituting what is F of Y given theta, in this case, lambda, and what is the posterior.
In this integral, this part is coming from the fact that the new Y will be coming from an exponential
lambda distribution. That's the PDF of an exponential lambda, and then this part here is just a posterior,
which is gamma with these parameters. This is just a PDF of this gamma. Here it's important to write
everything to equality, no proportionality, all the terms have been used here.
Next step, we have an integral here which is the lambda. That means that anything, all the terms that do
not involve lambda can be taken out of the integral. This guy here, there's no lambda, I can take it out,
so that's what I'm going to do. I will also put together the lambdas and the Xs. Doing so will give me
this expression. This guy comes here. Here I had A+α-1+another lambda. That's coming here and
similarly for the Xs. All right. Where am I? Well, I still have this difficult integral to calculate. What
am I going to do? Well, I can use the trick in the previous slide. All I have to do is note that if I set
here this guy to be A and this guy to be B, then I'm exactly into the situation here.
For this A and B, I have exactly the same expression. All I have to do is-- then this guy will be equal to
this guy. All I have to do is make sure I put the correct A and B in here. This guy will be the A. This
guy will be the B. If you put it together, you'll get this complicated looking expression, that's what it is,
and we are able to derive it in just a few lines of calculation. That concludes the purpose of this video.
We were able to demonstrate these few types of calculations that we often encounter in Bayesian
inference.
[00:12:36] [END OF AUDIO]

Self-test Activities
1. Let x=(x1,…,xn) be a random sample from the Exponential(λ) distribution. Set the prior
for λ to be a Gamma(α,β) and derive its posterior distribution. Then find a Bayes estimator
for λ and the predictive distribution for a new observation y, which again comes from the
Exponential(λ) model.

Solution
The joint density (likelihood) can be written as

\[ f(x|\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n}\exp\left(-\lambda\sum_{i=1}^{n} x_i\right) \]
The prior is set to a Gamma(α,β), so we can write

\[ \pi(\lambda) \propto \lambda^{\alpha-1}\exp(-\beta\lambda) \]

The posterior is then proportional to

\[ \pi(\lambda|x) \propto \lambda^{n+\alpha-1}\exp\left(-\lambda\left(\beta + \sum_{i=1}^{n} x_i\right)\right), \]

that is, λ|x ∼ Gamma(n + α, β + Σᵢ xᵢ).

A Bayes estimator is provided by the posterior mean, which in this case is equal to

\[ E(\lambda|x) = \frac{n+\alpha}{\beta + \sum_{i=1}^{n} x_i} \]

The predictive distribution (for y > 0) is

\[ f(y|x) = \int_0^\infty \lambda e^{-\lambda y}\,\pi(\lambda|x)\,d\lambda = \frac{(n+\alpha)\left(\beta + \sum_i x_i\right)^{n+\alpha}}{\left(\beta + \sum_i x_i + y\right)^{n+\alpha+1}} \]
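A small R sketch that checks these closed-form answers by simulation, using made-up exponential data and α = β = 1 purely for illustration:

# Monte Carlo check of the Exponential-Gamma results (illustrative data, alpha = beta = 1)
set.seed(1)
x <- rexp(30, rate = 2)          # made-up sample of size n = 30
alpha <- 1; beta <- 1
A <- length(x) + alpha           # posterior shape:  n + alpha
B <- sum(x) + beta               # posterior rate:   beta + sum(x)

A / B                                        # closed-form posterior mean
lambda_draws <- rgamma(100000, A, rate = B)
mean(lambda_draws)                           # Monte Carlo posterior mean (should agree)

# Predictive density at y = 0.5: closed form vs Monte Carlo average of f(y|lambda)
y <- 0.5
A * B^A / (B + y)^(A + 1)                    # closed-form predictive density
mean(dexp(y, rate = lambda_draws))           # Monte Carlo approximation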

2. Let x=(𝑥1 ,…,𝑥𝑛 ) be a random sample from a N(0,𝜎 2 ) distribution. Set the prior for 𝜎 2 to be
IGamma(α,β) and derive its posterior distribution.

Solution
The joint density (likelihood) can be written as

\[ f(x|\sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{x_i^2}{2\sigma^2}\right) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{\sum_{i=1}^{n} x_i^2}{2\sigma^2}\right) \]
The prior is set to an IGamma(α,β), so we can write

\[ \pi(\sigma^2) \propto (\sigma^2)^{-\alpha-1}\exp\left(-\frac{\beta}{\sigma^2}\right) \]

The posterior is then proportional to

\[ \pi(\sigma^2|x) \propto (\sigma^2)^{-(\alpha+n/2)-1}\exp\left(-\frac{\beta + \tfrac{1}{2}\sum_{i=1}^{n} x_i^2}{\sigma^2}\right), \]

that is, σ²|x ∼ IGamma(α + n/2, β + ½Σᵢ xᵢ²).

3. A student uses Bayesian inference in modelling his IQ score x|θ, which is N(θ, 80). His prior is
N(110, 120). After a score of x = 98 the posterior becomes N(102.8, 48).
a. Find a 95% confidence interval and a 95% credible interval, and comment on their
differences.
b. The student claims it was not his day and his genuine IQ is at least 105. So we want
to test H₀: θ ≥ 105 vs H₁: θ < 105.

Solution
a. The frequentist 95% confidence interval is

[98 − 1.96√80, 98 + 1.96√80] = [80.5, 115.5]

The HPD (and symmetric) 95% credible interval is

[102.8 − 1.96√48, 102.8 + 1.96√48] = [89.2, 116.4]


which is shorter as it includes prior information.
The R code for this calculation is given below:
mupost = 102.8
sdpost = sqrt(48)
muprior = 110
sdprior = sqrt(120)
qnorm(c(0.025,0.975), mupost, sdpost)

b. P(θ ≥ 105|x) = P(Z ≥ (105 − 102.8)/√48) = 1 − Φ(0.3175) = 0.3754

Since this probability is less than 50% (in fact less than 38%), the claim of the student is rejected.
The Bayes factor against H₀ is log B₁₀(98) = 1.244, indicating strong evidence against H₀.
The R code for this calculation is given below:
mupost = 102.8
sdpost = sqrt(48)
muprior = 110
sdprior = sqrt(120)
postH0 = 1 - pnorm(105, mupost, sdpost)
postH1 = pnorm(105, mupost, sdpost)
priorH0 = 1 - pnorm(105, muprior, sdprior)
priorH1 = pnorm(105, muprior, sdprior)
logB10 = log(postH1/postH0) - log(priorH1/priorH0)

Consolidation Activities
Exercise 1
A big magnetic tape roll is checked for defects. An experiment is conducted in which each time 1 metre of
the tape is examined at random. The procedure is repeated 5 times and the number of defects is
recorded to be 2, 2, 6, 0 and 3 respectively. The researcher assumes a Poisson(λ) distribution for the
number of defects per metre. From previous experience, the beliefs of the researcher about λ can be expressed by a
Gamma distribution with mean and variance equal to 3. Derive the posterior distribution that will be
obtained. What would be the expected mean and variance of the number of defects per tape meter
after the experiment?

Solution
From the section on examples of Bayesian inference we get that for the likelihood

\[ f(y|\lambda) \propto \exp(-n\lambda)\,\lambda^{\sum_i y_i} \]

and a prior Gamma(α, β),

\[ \pi(\lambda) \propto \lambda^{\alpha-1}\exp(-\beta\lambda), \]

we get a posterior Gamma(α + Σᵢ yᵢ, n + β).
We know that the prior mean and variance are equal to 3, meaning α/β = 3 and α/β² = 3, which implies
that α = 3, β = 1. Also Σᵢ yᵢ = 2 + 2 + 6 + 0 + 3 = 13. Hence the posterior distribution is a
Gamma(16, 6) distribution, implying that the posterior mean is 16/6 = 8/3 ≈ 2.67. This is slightly lower than the
prior mean of 3 but higher than what the data suggest as the MLE ȳ = 2.6. The posterior variance is 16/36 = 4/9,
which is lower than the prior variance of 3, reflecting the fact that our uncertainty decreased in light of
the observed data y.
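A short R sketch reproducing these numbers and plotting the prior against the posterior (the data and the Gamma(3, 1) prior are those stated in the exercise):

# Exercise 1: Poisson-Gamma posterior for the defect data with a Gamma(3, 1) prior
y <- c(2, 2, 6, 0, 3)
alpha <- 3; beta <- 1                      # prior mean 3, prior variance 3
a_post <- alpha + sum(y)                   # 16
b_post <- beta + length(y)                 # 6

c(post_mean = a_post / b_post, post_var = a_post / b_post^2)    # 8/3 and 4/9

# Prior versus posterior density
lambda <- seq(0, 8, by = 0.01)
plot(lambda, dgamma(lambda, a_post, rate = b_post), type = "l", ylab = "density")
lines(lambda, dgamma(lambda, alpha, rate = beta), col = "red")   # prior in red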

Exercise 2
Let x = (x₁, …, xₙ) be a random sample from a N(θ, σ²) distribution with σ² known.
a. Show that the likelihood is proportional to

\[ f(x|\theta) \propto \exp\left(-\frac{n(\bar{x}-\theta)^2 + (n-1)S^2}{2\sigma^2}\right) \]

where x̄ is the sample mean and S² is the sample variance

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \]

Hence the likelihood simplifies to

\[ f(x|\theta) \propto \exp\left(-\frac{(\theta-\bar{x})^2}{2\sigma^2/n}\right). \]
Solution
The joint density of the sample x is

\[ f(x|\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\theta)^2}{2\sigma^2}\right) \propto \exp\left(-\frac{\sum_{i=1}^{n}(x_i-\theta)^2}{2\sigma^2}\right) \]

Hence, it suffices to show that

\[ \sum_{i=1}^{n}(x_i-\theta)^2 = n(\bar{x}-\theta)^2 + \sum_{i=1}^{n}(x_i-\bar{x})^2 \]

Note that

\[ \sum_{i=1}^{n}(x_i-\theta)^2 = \sum_{i=1}^{n}(x_i^2 - 2\theta x_i + \theta^2) = \sum_{i=1}^{n}x_i^2 - 2\theta n\bar{x} + n\theta^2 \]

\[ \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}(x_i^2 - 2\bar{x}x_i + \bar{x}^2) = \sum_{i=1}^{n}x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2 \]

Subtracting the second of these equations from the first yields

\[ \sum_{i=1}^{n}(x_i-\theta)^2 - \sum_{i=1}^{n}(x_i-\bar{x})^2 = n\bar{x}^2 - 2\theta n\bar{x} + n\theta^2 = n(\bar{x}-\theta)^2 \]

Since σ² is known, we are interested in f(x|θ) as a function of θ, which is proportional to

\[ f(x|\theta) \propto \exp\left(-\frac{n(\bar{x}-\theta)^2}{2\sigma^2}\right) = \exp\left(-\frac{(\theta-\bar{x})^2}{2\sigma^2/n}\right), \]

since the term involving (n−1)S² does not depend on θ.
b. Set the prior for θ to be N(μ,𝜏 2 ) and derive its posterior distribution. (You can use the above
result)

Solution
The prior for θ is set to be N(μ, τ²). Hence we can write:

\[ \pi(\theta) \propto \exp\left(-\frac{(\theta-\mu)^2}{2\tau^2}\right) \]

Using the result of part (a), the posterior is then proportional to

\[ \pi(\theta|x) \propto \exp\left(-\frac{(\theta-\bar{x})^2}{2\sigma^2/n} - \frac{(\theta-\mu)^2}{2\tau^2}\right), \]

which, after completing the square in θ, is the Normal distribution with mean \(\dfrac{\bar{x}\tau^2 + \mu\,\sigma^2/n}{\sigma^2/n + \tau^2}\) and variance \(\dfrac{(\sigma^2/n)\,\tau^2}{\sigma^2/n + \tau^2}\), as in the Normal-Normal example of this block.
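A brief R sketch (with made-up values for the data and prior) that checks the completed-square posterior against a direct numerical normalisation of likelihood × prior:

# Numerical check of the Normal-Normal posterior (all values purely illustrative)
set.seed(1)
n <- 25; sigma2 <- 4
x <- rnorm(n, mean = 1.5, sd = sqrt(sigma2))
mu <- 0; tau2 <- 1                                   # prior N(0, 1)

# Closed-form posterior mean and variance from the formulas above
post_var  <- (sigma2/n * tau2) / (sigma2/n + tau2)
post_mean <- (mean(x) * tau2 + mu * sigma2/n) / (sigma2/n + tau2)

# Direct numerical computation on a grid: normalise likelihood * prior
theta <- seq(-2, 4, by = 0.001)
unnorm <- exp(-n * (mean(x) - theta)^2 / (2 * sigma2)) * dnorm(theta, mu, sqrt(tau2))
w <- unnorm / sum(unnorm)
c(closed = post_mean, grid = sum(theta * w))                         # the two means should agree
c(closed = post_var, grid = sum((theta - sum(theta * w))^2 * w))     # and so should the variances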

Exercise 3
Consider a sample x = (x₁, …, xₙ) of independent random variables, identically distributed from the
Geometric(θ) distribution, i.e. the probability mass function for each xᵢ is

\[ f(x_i|\theta) = P(x = x_i|\theta) = \theta^{x_i}(1-\theta), \quad x_i = 0, 1, 2, \ldots, \]

where 0 < θ < 1 is an unknown parameter and \(E(x) = \dfrac{\theta}{1-\theta}\).

a. Choose a suitable proper prior for θ and derive the corresponding posterior. Justify your
choice of prior.

Solution
The likelihood for the sample is given by

\[ f(x|\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta) = \theta^{n\bar{x}}(1-\theta)^{n}. \]
Assume the Beta(α, β) as the prior on θ, i.e. π(θ) ∝ θ^{α−1}(1−θ)^{β−1}. The posterior is then

\[ \pi(\theta|x) \propto \theta^{n\bar{x}+\alpha-1}(1-\theta)^{n+\beta-1}, \]

that is, θ|x ∼ Beta(α + nx̄, β + n).

In our case there is no prior information available, so the prior parameters (α, β) can be chosen such
that the distribution is as flat as possible. Setting α = β = 1 corresponds to the Uniform(0,1)
distribution, and gives the posterior Beta(nx̄ + 1, n + 1).

b. Derive the Jeffreys prior for θ. Use it to obtain the corresponding posterior distribution.

Solution
To find the Jeffreys prior we need the Fisher information. Since log f(xᵢ|θ) = xᵢ log θ + log(1−θ),

\[ I(\theta) = -E\left[\frac{\partial^2 \log f(x|\theta)}{\partial \theta^2}\right] = \frac{n\,E(x_i)}{\theta^2} + \frac{n}{(1-\theta)^2} = \frac{n}{\theta(1-\theta)^2}, \]

so the Jeffreys prior is π(θ) ∝ I(θ)^{1/2} ∝ θ^{−1/2}(1−θ)^{−1}.
If assigned in our problem it gives the posterior Beta(1/2 + nx̄, n).


c. Provide a Normal approximation to the posterior distribution of θ that converges to the true
posterior as the sample size increases.

Solution
We will need the mode of the likelihood, θ_M, and the Fisher information. The latter was derived in
part (b) and is equal to \(\dfrac{n}{\theta(1-\theta)^2}\). For the former we set

\[ \frac{\partial \log f(x|\theta)}{\partial \theta} = \frac{n\bar{x}}{\theta} - \frac{n}{1-\theta} = 0, \quad \text{which gives } \theta_M = \frac{\bar{x}}{1+\bar{x}}. \]
The requested Normal approximation to the posterior has mean \(\dfrac{\bar{x}}{1+\bar{x}}\) and variance \(\dfrac{\theta_M(1-\theta_M)^2}{n}\).

d. Find the Bayes estimator corresponding to the posterior mean for the posteriors in parts (a),
(b) and (c).

Solution
For part (a) it is equal to

\[ \frac{n\bar{x}+1}{n\bar{x}+n+2}, \]

whereas for part (b) it is

\[ \frac{n\bar{x}+1/2}{n\bar{x}+n+1/2}, \]

and for (c) it is equal to

\[ \frac{\bar{x}}{\bar{x}+1}. \]

e. Let y represent a future observation from the same model.


Find the posterior predictive distribution of y for one of the posteriors in parts (a) or (b).

Solution
Note that from the pdf of the Beta distribution we get, for all positive α, β, that

\[ \int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta = B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}. \]
Consider the posterior π^J(θ|x) corresponding to the Jeffreys prior, i.e. Beta(nx̄ + 1/2, n). The predictive
distribution for y = 0, 1, 2, …, is

\[ f(y|x) = \int_0^1 \theta^{y}(1-\theta)\,\pi^J(\theta|x)\,d\theta = \frac{B(n\bar{x} + y + 1/2,\; n+1)}{B(n\bar{x} + 1/2,\; n)}. \]

Exercise 4
Let x be an observation from a Binomial (n,θ) and assign the prior for θ to be the Beta(α,β).
a. Find a Normal approximation to the posterior based on the mode and the Hessian matrix
of π∗(θ|x)=f(x|θ)π(θ).

Solution
We can write

\[ \log \pi^*(\theta|x) = \text{const} + (x+\alpha-1)\log\theta + (n-x+\beta-1)\log(1-\theta), \]

so that

\[ \frac{\partial \log \pi^*(\theta|x)}{\partial \theta} = \frac{x+\alpha-1}{\theta} - \frac{n-x+\beta-1}{1-\theta}, \qquad \frac{\partial^2 \log \pi^*(\theta|x)}{\partial \theta^2} = -\frac{x+\alpha-1}{\theta^2} - \frac{n-x+\beta-1}{(1-\theta)^2}, \]

which is negative for x ≠ 0 and x ≠ n, so the mode is at \(\theta_M = \dfrac{x+\alpha-1}{n+\alpha+\beta-2}\). The Hessian is

\[ H(\theta) = \frac{x+\alpha-1}{\theta^2} + \frac{n-x+\beta-1}{(1-\theta)^2}. \]

The Normal approximation to the posterior for θ will have mean \(\theta_M = \dfrac{x+\alpha-1}{n+\alpha+\beta-2}\) and
variance H(θ_M)⁻¹.
b. Assume x=15, n=20 and α=β=2. Write an R script to provide a graph of true posterior density
and the normal approximation derived in the previous part.
c. Repeat with x=75, n=100.

Solution
R code for parts (b) and (c)

theta=seq(from=0,to=1,by=0.001) #grid of points for theta


x=15
n=20
alpha=2
beta=2
mu=(x+alpha-1)/(n+alpha+beta-2) # posterior mode
hessian = ((x+alpha-1)/(mu^2))+((n-x+beta-1)/((1-mu)^2))
S=sqrt(1/hessian) # square root of inverse hessian evaluated at the mode
ftheta=dbeta(theta,x+alpha,n-x+beta) # true posterior density: Beta(x+alpha, n-x+beta)
fthetaclt=dnorm(theta,mu,S) # normal approximation
plot(theta,ftheta,type="l") # plot commands
lines(theta,fthetaclt,col="red")
Code for finding the posterior median via Monte Carlo:
x=15
n=20
alpha=2
beta=2
N=1000000
rx=rbeta(N,alpha+x,n-x+beta) # Generate N samples from the posterior
median(rx) # Find the median of the posterior samples
For part (c) just change to x=75 and n=100 in the previous code.

Exercise 5
Let x = (x₁, …, xₙ) be a random sample from a Poisson(λ) distribution and assign a Gamma(α, β) prior to λ.
a. Find a Normal approximation to the posterior based on the mode and the Hessian matrix
of π*(λ|x) = f(x|λ)π(λ).

Solution
We can write

\[ \log \pi^*(\lambda|x) = \log[f(x|\lambda)\pi(\lambda)] = \text{const} + \left(\alpha + \sum_i x_i - 1\right)\log\lambda - (n+\beta)\lambda \]

\[ \frac{\partial \log \pi^*(\lambda|x)}{\partial \lambda} = \frac{\alpha + \sum_i x_i - 1}{\lambda} - (n+\beta) = 0, \quad \text{which gives } \lambda_M = \frac{\alpha + \sum_i x_i - 1}{n+\beta}. \]

We also note that

\[ \frac{\partial^2 \log \pi^*(\lambda|x)}{\partial \lambda^2} = -\frac{\alpha + \sum_i x_i - 1}{\lambda^2}, \]

which is negative for α + Σᵢ xᵢ > 1, implying the mode is at λ_M in this case. The Hessian is

\[ H(\lambda) = \frac{\alpha + \sum_i x_i - 1}{\lambda^2}. \]

The normal approximation to the posterior for λ will have mean λ_M and variance H(λ_M)⁻¹.
b. Assume Σᵢ xᵢ = 95, n = 50 and α = β = 1. Write an R script to calculate the posterior
probability of λ being less than 1.7 using both the true posterior density and the normal
approximation derived in the previous part.
c. Using R, provide a 95% credible interval for λ using the true posterior.

Solution
Code for finding the probability and the credible interval:
sumx=95
n=50
alpha=1
beta=1
mu=(alpha+sumx-1)/(n+beta) # mode. substituted alpha=beta=1
S=sqrt(mu*mu/(alpha+sumx -1)) # square root of inverse hessian at the mode
pgamma(1.7,1+sumx,n+1) # true requested probability
pnorm(1.7,mu,S) # probability from normal approximation
N=1000000
rx=rgamma(N, alpha+sumx, n+beta) # Generate N posterior samples
quantile(rx,c(0.025,0.975)) # 2.5 and 97.5 posterior sample percentiles

Exercise 6
Consider the problem raised in the section 'Introductory Example' and provide estimates of the
parameters of the relevant linear regression model.

Solution
The additional information states that changes to the sales of more than 50 units per $1,000 spent are
unlikely. In other words, we would expect the regression coefficients of the three predictors to be
between −0.05 and 0.05. A prior distribution that reflects this is the N(0, 0.025²) for each
coefficient, as it implies that the coefficient will lie in that interval with probability roughly 95%. We complete the prior
specification by assuming independence between the β parameters a-priori. For the constant β₀ we
can assign a very large variance, say 10⁶, to reflect our ignorance about its value.
R code for fitting the model is given below. It will be necessary to install the R packages
'MCMCpack', 'MASS' and 'coda'. Note that the package requires the precision, being the inverse of
the variance, rather than the variance. In our case this will be 10⁻⁶ for β₀ and around 1540 for the
remaining β's (setting 1.96σ = 0.05 exactly gives σ ≈ 0.0255 and precision 1/σ² ≈ 1540).
library(MCMCpack) # initialise the MCMCpack package
advertising=read.csv("advertising.csv",header=T,na.strings="?") # load data
model.freq=lm(sales~TV+radio+newspaper,data=advertising) # fit frequentist linear regression
summary(model.freq)
# set prior parameters
mu=c(0,0,0,0)
Precision=diag(c(1e-6,1540,1540,1540))
# fit Bayesian Linear regression
model.bayes=MCMCregress(sales~TV+radio+newspaper, data=advertising,b0=mu,B0=Precision)
summary(model.bayes)
We note that the coefficient corresponding to radio is the highest under frequentist linear regression,
around 0.188. Bayesian linear regression provides a slightly smaller number, which is due to the prior
distribution pulling it towards its mean, 0. The results are otherwise similar.

Learning outcomes checklist


 Understand the basic concepts of Bayesian inference.
 Identify differences between Frequentist and Bayesian Inference in point and interval
estimation, hypothesis testing and prediction.
 Implement Bayesian Inference in various examples including linear regression using R.
