Lec5: Multivariate Gaussian, Student's 𝒯, Dirichlet
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
Taking the covariance matrix to be symmetric loses no generality: any antisymmetric component of $\boldsymbol{\Sigma}^{-1}$ does not affect the value of the quadratic form $(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})$. For symmetric covariance matrices we only need to specify $D(D+1)/2$ elements rather than $D^2$.
The Gaussian family is closed under linear transformations: for $\boldsymbol{A},\boldsymbol{B}\in\mathbb{R}^{M\times D}$, $\boldsymbol{c}\in\mathbb{R}^{M}$, and independent
$$\boldsymbol{X}_1\sim\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1),\qquad \boldsymbol{X}_2\sim\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2),$$
we have
$$\boldsymbol{A}\boldsymbol{X}_1+\boldsymbol{B}\boldsymbol{X}_2+\boldsymbol{c}\sim\mathcal{N}\!\left(\boldsymbol{A}\boldsymbol{\mu}_1+\boldsymbol{B}\boldsymbol{\mu}_2+\boldsymbol{c},\;\boldsymbol{A}\boldsymbol{\Sigma}_1\boldsymbol{A}^T+\boldsymbol{B}\boldsymbol{\Sigma}_2\boldsymbol{B}^T\right).$$
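This closure property is easy to check numerically. The sketch below (not from the lecture; all matrices and values are made up for illustration) compares the Monte Carlo mean and covariance of $\boldsymbol{A}\boldsymbol{X}_1+\boldsymbol{B}\boldsymbol{X}_2+\boldsymbol{c}$ with the theoretical ones:

```python
# Monte Carlo check: for independent X1 ~ N(mu1, S1), X2 ~ N(mu2, S2),
# A X1 + B X2 + c has mean A mu1 + B mu2 + c and covariance A S1 A^T + B S2 B^T.
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 2
mu1 = np.array([1.0, 0.0, -1.0])
mu2 = np.array([0.5, 0.5, 0.5])
L1 = rng.standard_normal((D, D)); S1 = L1 @ L1.T + D * np.eye(D)  # SPD covariance
L2 = rng.standard_normal((D, D)); S2 = L2 @ L2.T + D * np.eye(D)
A, B = rng.standard_normal((M, D)), rng.standard_normal((M, D))
c = np.array([2.0, -3.0])

n = 200_000
Y = (rng.multivariate_normal(mu1, S1, n) @ A.T
     + rng.multivariate_normal(mu2, S2, n) @ B.T + c)

mean_theory = A @ mu1 + B @ mu2 + c
cov_theory = A @ S1 @ A.T + B @ S2 @ B.T
print(np.abs(Y.mean(axis=0) - mean_theory).max())  # small sampling error
print(np.abs(np.cov(Y.T) - cov_theory).max())      # small sampling error
```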
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 4
Conditional and Marginal Probability Densities
[Figure: conditional bivariate normal pdf — equiprobability ellipsoids (contours of $p(x,y)$) and the conditional density $p(x\,|\,y=2)$; marginal bivariate normal pdf — the marginal density $p(x)$.]
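The conditional and marginal densities shown in the figure follow from the standard Gaussian conditioning formulas, $\mu_{x|y}=\mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y)$ and $\sigma^2_{x|y}=\Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$. A minimal sketch (the joint parameters below are made-up examples, not those used in the figure):

```python
import numpy as np
from scipy.stats import norm

# joint: (x, y) ~ N(mu, Sigma)
mu = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

# conditional p(x | y): mean mu_x + Sxy/Syy (y - mu_y), var Sxx - Sxy^2/Syy
y = 2.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
p_cond = norm(cond_mean, np.sqrt(cond_var))

# marginal p(x): simply drop y -> N(mu_x, Sxx)
p_marg = norm(mu[0], np.sqrt(Sigma[0, 0]))

print(cond_mean, cond_var)  # 2.0, 0.64
print(p_marg.mean(), p_marg.var())
```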
$$\mathcal{T}(x\,|\,\mu,\lambda,\nu)=\int_0^\infty \mathcal{N}\!\left(x\,|\,\mu,(\eta\lambda)^{-1}\right)\operatorname{Gam}(\eta\,|\,\nu/2,\nu/2)\,d\eta$$
This form is useful in providing a generalization to a multivariate Student's $\mathcal{T}$:
$$\mathcal{T}(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)=\int_0^\infty \mathcal{N}\!\left(\boldsymbol{x}\,|\,\boldsymbol{\mu},(\eta\boldsymbol{\Lambda})^{-1}\right)\operatorname{Gam}(\eta\,|\,\nu/2,\nu/2)\,d\eta$$
*Use change of variables for distributions, with $d\tau=\lambda\,d\eta$, and notice that the extra terms that appear cancel out.
Multivariate Student’s 𝓣 Distribution
$$\mathcal{T}(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)=\int_0^\infty \mathcal{N}\!\left(\boldsymbol{x}\,|\,\boldsymbol{\mu},(\eta\boldsymbol{\Lambda})^{-1}\right)\operatorname{Gam}(\eta\,|\,\nu/2,\nu/2)\,d\eta$$
This integral can be computed analytically as:
$$\mathcal{T}(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)
=\frac{\Gamma\!\left(\frac{\nu+D}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
\frac{|\boldsymbol{\Lambda}|^{1/2}}{(\nu\pi)^{D/2}}
\left(1+\frac{\Delta^2}{\nu}\right)^{-\frac{\nu+D}{2}},
\qquad \Delta^2=(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\boldsymbol{x}-\boldsymbol{\mu})\ \ (\text{squared Mahalanobis distance}).$$
One can derive the above form of the distribution by substituting the Gaussian and Gamma densities into the integral at the top:
$$\mathcal{T}(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)
=\frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\,\frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{D/2}}
\int_0^\infty \eta^{(\nu+D)/2-1}\,e^{-\eta(\Delta^2+\nu)/2}\,d\eta$$
With the change of variable $\tau=\eta(\Delta^2+\nu)/2$,
$$\int_0^\infty \eta^{(\nu+D)/2-1}\,e^{-\eta(\Delta^2+\nu)/2}\,d\eta
=\Gamma\!\left(\frac{\nu+D}{2}\right)\left(\frac{\Delta^2+\nu}{2}\right)^{-\frac{\nu+D}{2}},$$
and collecting the powers of $\nu/2$ yields the closed form of $\mathcal{T}(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Lambda},\nu)$ given above.
The Student's 𝒯 has fatter tails than a Gaussian; the smaller $\nu$ is, the fatter the tails. The distribution can also be written in terms of $\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{-1}$ (the scale matrix, not the covariance) or $\boldsymbol{V}=\nu\boldsymbol{\Sigma}$.
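The closed form can be checked against an existing implementation. A sketch (assuming SciPy ≥ 1.6, which provides `scipy.stats.multivariate_t`; its `shape` argument is the scale matrix $\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{-1}$, and the parameter values below are arbitrary):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def student_t_pdf(x, mu, Lam, nu):
    """Closed-form T(x | mu, Lambda, nu) with Delta^2 = (x-mu)^T Lambda (x-mu)."""
    D = len(mu)
    d = x - mu
    delta2 = d @ Lam @ d  # squared Mahalanobis distance
    log_pdf = (gammaln((nu + D) / 2) - gammaln(nu / 2)
               + 0.5 * np.linalg.slogdet(Lam)[1]
               - (D / 2) * np.log(nu * np.pi)
               - ((nu + D) / 2) * np.log1p(delta2 / nu))
    return np.exp(log_pdf)

mu = np.array([0.5, -1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])  # precision (inverse scale) matrix
nu = 4.0
x = np.array([1.0, 0.0])

p_closed = student_t_pdf(x, mu, Lam, nu)
p_scipy = multivariate_t(loc=mu, shape=np.linalg.inv(Lam), df=nu).pdf(x)
print(p_closed, p_scipy)  # the two values agree
```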
Dirichlet Distribution
We introduce the Dirichlet distribution as a family of conjugate priors (to be formally introduced in a follow-up lecture) for the parameters $\mu_k$ of the multinomial distribution.
The Dirichlet distribution is
$$\operatorname{Dir}(\boldsymbol{\mu}\,|\,\boldsymbol{\alpha})=\frac{1}{\operatorname{Beta}(\boldsymbol{\alpha})}\prod_{k=1}^{K}\mu_k^{\alpha_k-1},\qquad 0\le\mu_k\le 1,\quad\sum_{k=1}^{K}\mu_k=1,$$
for all $\mu_1,\ldots,\mu_{K-1}>0$ satisfying $\mu_1+\cdots+\mu_{K-1}<1$, where $\mu_K$ is an abbreviation for $1-\mu_1-\cdots-\mu_{K-1}$. The normalizing constant is the multinomial Beta function:
$$\operatorname{Beta}(\boldsymbol{\alpha})=\frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)},\qquad \boldsymbol{\alpha}=(\alpha_1,\alpha_2,\ldots,\alpha_K)^T.$$
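The normalization can be checked numerically for $K=3$ (a quick sanity check, not from the slides, with an arbitrary $\boldsymbol{\alpha}$): integrating $\prod_k\mu_k^{\alpha_k-1}$ over the simplex on a grid should reproduce $\operatorname{Beta}(\boldsymbol{\alpha})$.

```python
import numpy as np
from scipy.special import gamma

alpha = np.array([2.0, 3.0, 4.0])
beta_const = np.prod(gamma(alpha)) / gamma(alpha.sum())  # multinomial Beta function

# midpoint-rule integration of prod_k mu_k^(alpha_k - 1) over the simplex
n = 1000
h = 1.0 / n
m1, m2 = np.meshgrid(np.linspace(h / 2, 1 - h / 2, n),
                     np.linspace(h / 2, 1 - h / 2, n))
m3 = np.clip(1.0 - m1 - m2, 0.0, None)  # zero outside the simplex
integrand = m1 ** (alpha[0] - 1) * m2 ** (alpha[1] - 1) * m3 ** (alpha[2] - 1)
integral = integrand.sum() * h * h
print(integral, beta_const)  # the two values agree closely
```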
The Dirichlet distribution over $(\mu_1,\mu_2,\mu_3)$ is confined to a plane (the simplex $\sum_k\mu_k=1$), as shown.
To compute the moments $\mathbb{E}[\ln\mu_j]$, write the Dirichlet in exponential-family form,
$$\operatorname{Dir}(\boldsymbol{\mu}\,|\,\boldsymbol{\alpha})=K(\boldsymbol{\alpha})\prod_{k=1}^{K}\mu_k^{\alpha_k-1}
=K(\boldsymbol{\alpha})\exp\!\left(\sum_{k=1}^{K}(\alpha_k-1)\ln\mu_k\right),
\qquad \ln K(\boldsymbol{\alpha})=\ln\Gamma(\alpha_0)-\sum_{j=1}^{K}\ln\Gamma(\alpha_j).$$
Differentiating the normalization condition
$$K(\boldsymbol{\alpha})\int_0^1\!\!\cdots\!\int_0^1 \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\mu_1\cdots d\mu_K=1$$
with respect to $\alpha_j$ gives
$$\mathbb{E}[\ln\mu_j]=-\frac{\partial\ln K(\boldsymbol{\alpha})}{\partial\alpha_j}=\Psi(\alpha_j)-\Psi(\alpha_0),
\qquad \alpha_0=\sum_{j=1}^{K}\alpha_j,$$
where $\Psi(\alpha)=d\ln\Gamma(\alpha)/d\alpha$ is the digamma function.
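The identity $\mathbb{E}[\ln\mu_j]=\Psi(\alpha_j)-\Psi(\alpha_0)$ is easy to confirm by Monte Carlo (a sketch with an arbitrary $\boldsymbol{\alpha}$):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
alpha = np.array([1.5, 2.0, 5.0])
samples = rng.dirichlet(alpha, size=500_000)

mc = np.log(samples).mean(axis=0)              # Monte Carlo estimate of E[ln mu_j]
exact = digamma(alpha) - digamma(alpha.sum())  # Psi(alpha_j) - Psi(alpha_0)
print(mc)
print(exact)  # matches mc to within sampling error
```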
Dirichlet Distribution: Normalization
To show the normalization, we use induction. The case for 𝑀 = 2 was shown
earlier for the Beta distribution.
Assume the result holds for $M-1$ variables and integrate out $\mu_{M-1}$:
$$p_{M-1}(\mu_1,\ldots,\mu_{M-2})=C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}
\int_0^{1-\sum_{j=1}^{M-2}\mu_j}\mu_{M-1}^{\alpha_{M-1}-1}\left(1-\sum_{j=1}^{M-1}\mu_j\right)^{\alpha_M-1}d\mu_{M-1}.$$
Substitute $\mu_{M-1}=t\left(1-\sum_{j=1}^{M-2}\mu_j\right)$, so that $t$ ranges over $[0,1]$.
$$p_{M-1}(\mu_1,\ldots,\mu_{M-2})
=C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\left(1-\sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1}
\int_0^1 t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\,dt$$
$$=C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\left(1-\sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1}
\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}.$$
This is a Dirichlet over $M-1$ variables with parameters $(\alpha_1,\ldots,\alpha_{M-2},\alpha_{M-1}+\alpha_M)$, so by the induction hypothesis
$$C_M\,\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}
=\frac{\Gamma(\alpha_1+\cdots+\alpha_{M-2}+\alpha_{M-1}+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1}+\alpha_M)}
\;\Longrightarrow\;
C_M=\frac{\Gamma(\alpha_1+\cdots+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_M)}.$$
Dirichlet Distribution
Consider a data set $\mathcal{D}$ sampled i.i.d. from the multinomial. Recall that the likelihood is $p(\mathcal{D}\,|\,\boldsymbol{\mu})\propto\prod_{k=1}^{K}\mu_k^{m_k}$. Introducing the Dirichlet as the conjugate prior $p(\boldsymbol{\mu})$, Bayes' formula gives the posterior
$$p(\boldsymbol{\mu}\,|\,\mathcal{D})\propto \underbrace{p(\mathcal{D}\,|\,\boldsymbol{\mu})}_{\text{Multinomial}}\,\underbrace{p(\boldsymbol{\mu})}_{\text{Dirichlet}}
\;\Rightarrow\;
p(\boldsymbol{\mu}\,|\,\mathcal{D})\propto\prod_{k=1}^{K}\mu_k^{\alpha_k+m_k-1},$$
i.e. the posterior is again a Dirichlet, with parameters $\alpha_k+m_k$.
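The conjugate update is just an addition of counts. A minimal sketch with hypothetical counts $m_k$ (the numbers below are made up for illustration):

```python
import numpy as np

alpha_prior = np.array([2.0, 2.0, 2.0])  # Dirichlet prior parameters alpha_k
counts = np.array([10, 3, 7])            # m_k: category counts observed in D
alpha_post = alpha_prior + counts        # posterior is Dir(alpha + m)

post_mean = alpha_post / alpha_post.sum()  # posterior mean E[mu_k | D]
print(alpha_post)   # alpha + m = [12, 5, 9]
print(post_mean)
```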
[Figure: surface plots of $f(x_1,x_2,1-x_1-x_2)$ for the multinomial likelihood, the Dirichlet prior with parameters $\{\alpha_k\}$, and the resulting posterior. MATLAB code available.]
Dirichlet Distribution
The Dirichlet distribution over $(\mu_1,\mu_2,\mu_3)$, where the horizontal axes are $\mu_1$ and $\mu_2$ and the vertical axis is the density, plotted for $\{\alpha_k\}=\{0.1,0.1,0.1\}$, $\{2,2,2\}$ and $\{10,10,10\}$. If $\alpha_k<1$ for all $k$, we obtain spikes at the corners of the simplex.

[Figure: surface plots of the Dirichlet density for the three parameter settings above, and bar plots of samples $\boldsymbol{\mu}\sim\operatorname{Dir}(\boldsymbol{\alpha})$ with $K=5$. Run visDirichletGui and dirichlet3dPlot from PMTK.]
Often we use a symmetric Dirichlet with $\alpha_k=\alpha/K$.
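The effect of the concentration parameter can be seen by sampling (a sketch, not from the slides; the average largest component of $\boldsymbol{\mu}$ measures how close samples are to a corner of the simplex):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
avg_max = {}
for a in [0.1, 1.0, 10.0]:
    mu = rng.dirichlet(np.full(K, a / K), size=10_000)  # symmetric: alpha_k = a/K
    avg_max[a] = mu.max(axis=1).mean()  # near 1 => sparse (spiky) samples
    print(a, avg_max[a])
```

Small $\alpha$ concentrates mass on a few components (sparse samples); large $\alpha$ pushes samples toward the uniform vector $(1/K,\ldots,1/K)$.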