
Multivariate Gaussian and Student's 𝓣 and Dirichlet Distributions
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 27, 2020



Contents
• The multivariate Gaussian

• Multivariate Student's 𝒯 distribution

• Dirichlet distribution

• The goals of today's lecture are:

  • Familiarize ourselves with the multivariate Gaussian and Student's 𝒯

  • Learn about the Dirichlet distribution



References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2

• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability. 2nd Edition. Athena Scientific.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.



Multivariate Gaussian
• A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝑥0, 𝜎²), if:

$$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - x_0)^2\right) dx$$

• A multivariate variable 𝒙 ∈ ℝ𝐷 is Gaussian if its probability density is

$$p(\boldsymbol{x}) = \left(\frac{1}{(2\pi)^D \det\boldsymbol{\Sigma}}\right)^{1/2} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{x}_0)\right)$$
where 𝒙0 ∈ ℝ𝐷 , 𝜮 ∈ ℝ𝐷×𝐷 is symmetric positive definite (covariance matrix).

• Only the symmetric part of 𝜮 contributes to the value of (𝒙 − 𝒙0)ᵀ𝜮⁻¹(𝒙 − 𝒙0), so taking 𝜮 symmetric loses no generality. For a symmetric covariance matrix we only need to specify 𝐷(𝐷 + 1)/2 elements rather than 𝐷².
• The Gaussian family is closed under linear transformations, i.e. for 𝑨, 𝑩 ∈ ℝ𝑀×𝐷, 𝒄 ∈ ℝ𝑀:

$$X_1 \sim \mathcal{N}(\mu_1, \Sigma_1),\ X_2 \sim \mathcal{N}(\mu_2, \Sigma_2),\ X_1, X_2 \text{ independent} \;\Rightarrow\; A X_1 + B X_2 + c \sim \mathcal{N}\!\left(A\mu_1 + B\mu_2 + c,\; A\Sigma_1 A^T + B\Sigma_2 B^T\right)$$
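As a quick sanity check of this closure property, here is a minimal numpy sketch (all parameters are made up for illustration) that compares the sample moments of 𝑨𝑿₁ + 𝑩𝑿₂ + 𝒄 against the predicted mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 2

# Two independent Gaussians X1 ~ N(mu1, S1), X2 ~ N(mu2, S2) (made-up parameters)
mu1, mu2 = rng.normal(size=D), rng.normal(size=D)
L1, L2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
S1 = L1 @ L1.T + np.eye(D)            # symmetric positive definite covariances
S2 = L2 @ L2.T + np.eye(D)

A, B = rng.normal(size=(M, D)), rng.normal(size=(M, D))
c = rng.normal(size=M)

N = 500_000
X1 = rng.multivariate_normal(mu1, S1, size=N)
X2 = rng.multivariate_normal(mu2, S2, size=N)
Y = X1 @ A.T + X2 @ B.T + c           # samples of A X1 + B X2 + c

mean_th = A @ mu1 + B @ mu2 + c       # predicted mean
cov_th = A @ S1 @ A.T + B @ S2 @ B.T  # predicted covariance

print(np.abs(Y.mean(axis=0) - mean_th).max())  # ~ 0 up to Monte Carlo error
print(np.abs(np.cov(Y.T) - cov_th).max())      # ~ 0 up to Monte Carlo error
```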
Conditional and Marginal Probability Densities
[Figure: conditional and marginal bivariate normal pdfs. Top row: ellipsoidal equiprobability curves of 𝑝(𝑥, 𝑦) alongside the conditional density 𝑝(𝑥|𝑦 = 2). Bottom row: the same contours alongside the marginal density 𝑝(𝑥). Link here for a MatLab program to generate these figures.]
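The two panels follow from the standard Gaussian marginalization and conditioning formulas, 𝑝(𝑥) = 𝒩(𝜇ₓ, Σₓₓ) and 𝑝(𝑥|𝑦) = 𝒩(𝜇ₓ + Σₓᵧ Σᵧᵧ⁻¹(𝑦 − 𝜇ᵧ), Σₓₓ − Σₓᵧ Σᵧᵧ⁻¹ Σᵧₓ). A minimal Python sketch follows; the parameter values are hypothetical, as the slide does not list the ones used in the plots:

```python
import numpy as np

# Hypothetical bivariate normal over (x, y) (assumed values, for illustration)
mu = np.array([2.0, 2.5])                 # [mu_x, mu_y]
Sigma = np.array([[1.0, 0.6],
                  [0.6, 0.8]])            # joint covariance

# Marginal p(x) = N(mu_x, Sigma_xx): just read off the corresponding block
mu_x, var_x = mu[0], Sigma[0, 0]

# Conditional p(x | y = 2) = N(mu_x + Sxy/Syy (y - mu_y), Sxx - Sxy^2/Syy)
y = 2.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

print(f"marginal    p(x):     N({mu_x:.3f}, {var_x:.3f})")
print(f"conditional p(x|y=2): N({mu_cond:.3f}, {var_cond:.3f})")
```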


Multivariate Student’s 𝓣 Distribution

$$p(x \mid \mu, a, b) = \int_0^\infty \mathcal{N}\!\left(x \mid \mu, \tau^{-1}\right) \mathrm{Gamma}(\tau \mid a, b)\, d\tau$$

• If we return to the derivation of the univariate Student's 𝒯 distribution and substitute 𝜐 = 2𝑎, 𝜆 = 𝑎/𝑏, 𝜂 = 𝜏𝑏/𝑎, and use

$$\mathrm{Gamma}(\tau \mid a, b) = \frac{b^a}{\Gamma(a)} \tau^{a-1} e^{-b\tau}$$
we can write the Student's 𝒯 distribution as:*

$$\mathcal{T}(x \mid \mu, \lambda, \nu) = \int_0^\infty \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

• This form is useful in providing the generalization to a multivariate Student's 𝒯:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^\infty \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
*Use the change of variables formula for distributions, with 𝑑𝜏 = 𝜆𝑑𝜂, and notice that the extra terms that appear cancel out.
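A quick numerical check of this scale-mixture representation in the univariate case: the sketch below (with illustrative parameter values) integrates the mixture with scipy.integrate.quad and compares it to SciPy's closed-form Student's 𝒯 density, whose scale is 𝜆^(−1/2):

```python
import numpy as np
from scipy import integrate, stats

mu, lam, nu = 0.5, 2.0, 4.0   # location, precision, dof (example values)

def mixture_density(x):
    # Integrate N(x | mu, (eta*lam)^-1) Gamma(eta | nu/2, nu/2) over eta.
    # Gamma with shape nu/2 and rate nu/2 has scale 2/nu in scipy's convention.
    integrand = lambda eta: (
        stats.norm.pdf(x, mu, 1.0 / np.sqrt(eta * lam))
        * stats.gamma.pdf(eta, a=nu / 2, scale=2.0 / nu)
    )
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

# Closed-form Student's t with precision lam, i.e. scale = 1/sqrt(lam)
for x in [-1.0, 0.5, 3.0]:
    print(mixture_density(x),
          stats.t.pdf(x, df=nu, loc=mu, scale=1 / np.sqrt(lam)))
```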
Multivariate Student’s 𝓣 Distribution

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^\infty \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
• This integral can be computed analytically as:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$

$$\Delta^2 = (\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Lambda} (\boldsymbol{x}-\boldsymbol{\mu}) \quad (\text{squared Mahalanobis distance})$$

• One can derive the above form of the distribution by substitution in the equation at the top:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{(\nu/2)^{\nu/2}\, |\boldsymbol{\Lambda}|^{1/2}}{\Gamma(\nu/2)\, (2\pi)^{D/2}} \int_0^\infty \eta^{D/2+\nu/2-1} e^{-\eta(\nu+\Delta^2)/2}\, d\eta$$

Using $\int_0^\infty \eta^{a-1} e^{-b\eta}\, d\eta = \Gamma(a)/b^a$ with $a = D/2+\nu/2$ and $b = (\nu+\Delta^2)/2$:

$$\int_0^\infty \eta^{D/2+\nu/2-1} e^{-\eta(\nu+\Delta^2)/2}\, d\eta = \Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right) \left(\frac{2}{\nu+\Delta^2}\right)^{D/2+\nu/2}$$

and collecting terms gives the $\left[1+\Delta^2/\nu\right]^{-D/2-\nu/2}$ form above.
• The normalization proof is immediate from the normalization of the normal and Gamma distributions.
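The analytic density above transcribes directly into code. As a check, the sketch below compares it against scipy.stats.multivariate_t (available in SciPy ≥ 1.6), which parametrizes the distribution by the scale matrix 𝚺 = 𝜦⁻¹ rather than the precision; the parameter values are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def mvt_logpdf(x, mu, Lam, nu):
    """log T(x | mu, Lam, nu), with Lam a precision (inverse-scale) matrix."""
    D = len(mu)
    d = x - mu
    delta2 = d @ Lam @ d                       # squared Mahalanobis distance
    return (gammaln(D / 2 + nu / 2) - gammaln(nu / 2)
            + 0.5 * np.linalg.slogdet(Lam)[1]
            - 0.5 * D * np.log(nu * np.pi)
            - (D / 2 + nu / 2) * np.log1p(delta2 / nu))

mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])       # example precision matrix
nu = 5.0
x = np.array([0.5, -0.2])

print(mvt_logpdf(x, mu, Lam, nu))
# scipy parametrizes by the scale matrix Sigma = Lam^{-1}
print(multivariate_t(loc=mu, shape=np.linalg.inv(Lam), df=nu).logpdf(x))
```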
Multivariate Student’s T Distribution
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$
• Some useful results for the multivariate Student's 𝒯 are given below:

$$\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu} \;\;\text{if } \nu > 1, \qquad \operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1} \;\;\text{if } \nu > 2, \qquad \operatorname{mode}[\boldsymbol{x}] = \boldsymbol{\mu}$$
• One can show the expression for the mean easily by using 𝒙 = 𝒛 + 𝝁:

$$\mathbb{E}[\boldsymbol{x}] = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \int \left[1 + \frac{\boldsymbol{z}^T\boldsymbol{\Lambda}\boldsymbol{z}}{\nu}\right]^{-D/2-\nu/2} (\boldsymbol{z}+\boldsymbol{\mu})\, d\boldsymbol{z}$$

• The 1st term drops out since 𝒯(𝒛|𝟎, 𝜦, 𝝊) is an even function of 𝒛. The 2nd term gives 𝝁 from the normalization of the distribution.
• The covariance is computed as:

$$\operatorname{cov}[\boldsymbol{x}] = \int_0^\infty \left[\int \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right)(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^T\, d\boldsymbol{x}\right] \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \boldsymbol{\Lambda}^{-1} \int_0^\infty \eta^{-1}\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

The inner integral is the Gaussian covariance $(\eta\boldsymbol{\Lambda})^{-1}$, and

$$\int_0^\infty \eta^{-1}\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \int_0^\infty \eta^{\nu/2-2} e^{-\nu\eta/2}\, d\eta = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \frac{\Gamma(\nu/2-1)}{(\nu/2)^{\nu/2-1}} = \frac{\nu/2}{\nu/2-1} = \frac{\nu}{\nu-2}$$

so that $\operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1}$.
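These moments can also be verified by Monte Carlo, sampling from the scale mixture (𝜂 ~ Gamma(𝜈/2, 𝜈/2), then 𝒙|𝜂 ~ 𝒩(𝝁, (𝜂𝜦)⁻¹)); a short sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.5], [0.5, 1.5]])   # example precision matrix
nu, N = 6.0, 500_000

# eta ~ Gamma(nu/2, rate nu/2); scipy/numpy use the scale convention, scale = 2/nu
eta = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=N)
Sigma = np.linalg.inv(Lam)
z = rng.multivariate_normal(np.zeros(2), Sigma, size=N)
x = mu + z / np.sqrt(eta)[:, None]         # x | eta ~ N(mu, (eta*Lam)^-1)

print(x.mean(axis=0))              # ~ mu                      (nu > 1)
print(np.cov(x.T))                 # ~ nu/(nu-2) * Lam^{-1}    (nu > 2)
print(nu / (nu - 2) * Sigma)
```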
Multivariate Student’s T Distribution
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$
• Differentiation with respect to 𝒙 also shows that the mode is 𝝁:

$$\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu} \;\;\text{if } \nu > 1, \qquad \operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1} \;\;\text{if } \nu > 2, \qquad \operatorname{mode}[\boldsymbol{x}] = \boldsymbol{\mu}$$

• The Student's 𝒯 has fatter tails than a Gaussian. The smaller 𝜐 is, the fatter the tails.

 For 𝜐 ∞, the distribution approaches a Gaussian. Indeed note that:


 2 
 /2  D /2
   D    2         2 1   2 2    2 
1     exp      ln 1     exp          exp    O  1  
    2 2      2   2      2 
  

• The distribution can also be written in terms of 𝚺 = 𝜦⁻¹ (the scale matrix, not the covariance) or 𝑽 = 𝜐𝚺.
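Both the fat tails and the Gaussian limit are easy to see numerically; a short comparison of tail probabilities at an illustrative threshold 𝑥 = 4:

```python
from scipy import stats

# Tail mass P(X > 4) for a standard Student's t vs a standard normal.
# Small nu -> much heavier tail; large nu -> approaches the Gaussian value.
for nu in [1, 2, 5, 30, 1000]:
    print(f"nu = {nu:5d}: t tail = {stats.t.sf(4, df=nu):.2e}, "
          f"normal tail = {stats.norm.sf(4):.2e}")
```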
Dirichlet Distribution
• We introduce the Dirichlet distribution as a family of "conjugate priors" (to be formally introduced in a follow-up lecture) for the parameters 𝜇𝑘 of the multinomial distribution.

• The Dirichlet distribution 𝒟𝒾𝓇(𝜶) is a family of continuous multivariate probability distributions parametrized by a vector 𝜶 of positive reals.

• It is the multivariate generalization of the 𝓑𝓮𝓽𝓪 distribution.



Dirichlet Distribution
• Its probability density function returns the belief that the probabilities of 𝐾 rival events are 𝜇𝑘, given that each event has been observed 𝛼𝑘 − 1 times:

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad 0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K} \mu_k = 1$$

• The distribution over the space of 𝜇𝑘 is (𝐾 − 1)-dimensional due to the last constraint above.



Dirichlet Distribution
The Dirichlet distribution of order 𝐾 ≥ 2 with parameters 𝛼1, … , 𝛼𝐾 > 0 has a PDF with respect to Lebesgue measure on ℝ𝐾−1 given by

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{1}{\mathrm{Beta}(\boldsymbol{\alpha})} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

for all 𝜇1, … , 𝜇𝐾−1 > 0 satisfying 𝜇1 + ⋯ + 𝜇𝐾−1 < 1, where 𝜇𝐾 is an abbreviation for 1 − 𝜇1 − ⋯ − 𝜇𝐾−1. The normalizing constant is the multinomial Beta function:

$$\mathrm{Beta}(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}, \qquad \boldsymbol{\alpha} = \left(\alpha_1, \alpha_2, \ldots, \alpha_K\right)^T$$
[Figure: the Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3) is confined to a plane (the probability simplex), as shown.]
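A minimal sketch that evaluates this log-density via the multinomial Beta function (computed with gammaln for numerical stability), checked against scipy.stats.dirichlet; the values of 𝜶 and 𝝁 are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_logpdf(mu, alpha):
    # log p(mu | alpha) = sum_k (alpha_k - 1) log mu_k - log Beta(alpha)
    log_beta = gammaln(alpha).sum() - gammaln(alpha.sum())
    return np.sum((alpha - 1) * np.log(mu)) - log_beta

alpha = np.array([2.0, 3.0, 4.0])
mu = np.array([0.2, 0.3, 0.5])          # lies on the simplex: sums to 1

print(dirichlet_logpdf(mu, alpha))
print(dirichlet.logpdf(mu, alpha))      # scipy's implementation agrees
```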



Dirichlet Distribution
• We write the Dirichlet distribution as:

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = K(\boldsymbol{\alpha}) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}, \qquad \alpha_0 = \alpha_1 + \cdots + \alpha_K$$
• Note the following useful relation:

$$\frac{\partial}{\partial \alpha_j} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} = \frac{\partial}{\partial \alpha_j}\, e^{\sum_{k=1}^{K}(\alpha_k - 1)\ln\mu_k} = \ln\mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$
• From this we can derive an interesting expression for 𝔼[ln 𝜇𝑗]:

$$\mathbb{E}[\ln\mu_j] = K(\boldsymbol{\alpha}) \int_0^1\!\!\cdots\!\int_0^1 \ln\mu_j \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\mu_1\cdots d\mu_K = K(\boldsymbol{\alpha})\, \frac{\partial}{\partial\alpha_j} \int_0^1\!\!\cdots\!\int_0^1 \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\mu_1\cdots d\mu_K$$

$$= K(\boldsymbol{\alpha})\, \frac{\partial}{\partial\alpha_j}\frac{1}{K(\boldsymbol{\alpha})} = -\frac{\partial \ln K(\boldsymbol{\alpha})}{\partial\alpha_j} = \frac{\partial}{\partial\alpha_j}\Big(\ln\Gamma(\alpha_j) - \ln\Gamma(\alpha_0)\Big)$$

where Ψ(𝛼) = 𝑑 ln Γ(𝛼)/𝑑𝛼 is the digamma function, so that

$$\mathbb{E}[\ln\mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$$
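This identity is easy to verify by Monte Carlo; a short sketch comparing sample averages of ln 𝜇𝑘 against 𝜓(𝛼𝑘) − 𝜓(𝛼0), with an illustrative 𝜶:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)
alpha = np.array([1.5, 2.0, 5.0])
samples = rng.dirichlet(alpha, size=200_000)

print(np.log(samples).mean(axis=0))            # Monte Carlo E[ln mu_k]
print(digamma(alpha) - digamma(alpha.sum()))   # psi(alpha_k) - psi(alpha_0)
```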
Dirichlet Distribution: Normalization
• To show the normalization, we use induction. The case 𝑀 = 2 was shown earlier for the Beta distribution.

• Assume that the Dirichlet normalization formula is valid for 𝑀 − 1 terms. We will show the formula for 𝑀 terms:

$$p_M(\mu_1, \ldots, \mu_{M-1}) = C_M \prod_{k=1}^{M-1}\mu_k^{\alpha_k-1}\left(1 - \sum_{j=1}^{M-1}\mu_j\right)^{\alpha_M - 1}$$

• Let us integrate out 𝜇𝑀−1, substituting 𝜇𝑀−1 = 𝑡(1 − Σ𝑗=1..𝑀−2 𝜇𝑗):

$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \prod_{k=1}^{M-2}\mu_k^{\alpha_k-1} \int_0^{1-\sum_{j=1}^{M-2}\mu_j} \mu_{M-1}^{\alpha_{M-1}-1}\left(1 - \sum_{j=1}^{M-2}\mu_j - \mu_{M-1}\right)^{\alpha_M-1} d\mu_{M-1}$$

$$= C_M \prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\left(1 - \sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1} \int_0^1 t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\, dt$$
Dirichlet Distribution: Normalization
 M 1  M 1 1

M 2
 M 2 
pM 1 ( 1 ,...,  M  2 )  CM   k k 1  1    j  1  t 
 M 1

 M 1 1
t dt 
 k 1  j 1  0
 M 1  M 1
 M  2  k 1   M  2    M 1    M 
 CM    k   1    j 
 k 1  j 1    M 1   M 
Dirichlet ( M 1)

• The last step above comes from the normalization of the Beta distribution.

• What we have above is an (𝑀 − 1)-term Dirichlet distribution with coefficients 𝛼1, … , 𝛼𝑀−2, 𝛼𝑀−1 + 𝛼𝑀. Since we assumed that the normalization formula is valid for (𝑀 − 1) terms, we must have:

$$1 = C_M\, \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1}+\alpha_M)}{\Gamma(\alpha_1+\cdots+\alpha_M)}\, \frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}$$

$$C_M = \frac{\Gamma(\alpha_1+\cdots+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1})\,\Gamma(\alpha_M)}$$
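The normalization can also be checked numerically for 𝐾 = 3 by integrating over the simplex; a minimal sketch with an illustrative 𝜶, using scipy.integrate.dblquad for the triangular region:

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

alpha = np.array([2.0, 3.0, 1.5])
C = gamma(alpha.sum()) / gamma(alpha).prod()     # 1 / Beta(alpha)

# Density over (mu1, mu2) with mu3 = 1 - mu1 - mu2; dblquad integrates
# the inner variable (mu2) first, over [0, 1 - mu1].
pdf = lambda m2, m1: (C * m1**(alpha[0] - 1) * m2**(alpha[1] - 1)
                      * (1 - m1 - m2)**(alpha[2] - 1))
total, _ = integrate.dblquad(pdf, 0, 1,
                             lambda m1: 0.0, lambda m1: 1.0 - m1)
print(total)   # ~ 1.0
```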
Dirichlet Distribution
• Consider a data set 𝒟 i.i.d. sampled from the multinomial. Recall that the likelihood is 𝑝(𝒟|𝝁) ∝ ∏𝐾𝑘=1 𝜇𝑘^𝑚𝑘. Introduce the Dirichlet as "the conjugate prior" 𝑝(𝝁). From Bayes' formula, we arrive at the following "posterior":

$$p(\boldsymbol{\mu} \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid \boldsymbol{\mu})}_{\text{Multinomial}}\, \underbrace{p(\boldsymbol{\mu})}_{\text{Dirichlet}} \;\Rightarrow\; p(\boldsymbol{\mu} \mid \mathcal{D}) \propto \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$$

• Note this is a Dirichlet distribution 𝒟𝒾𝓇(𝝁|𝛼1 + 𝑚1, … , 𝛼𝐾 + 𝑚𝐾).

• The normalization factor is computed easily from the normalization factor of the Dirichlet (here 𝑁 = Σ𝑘 𝑚𝑘):

$$p(\boldsymbol{\mu} \mid \mathcal{D}) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k + N\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k + m_k)} \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$$

• 𝛼𝑘 can be interpreted as "the effective number of prior observations of 𝑥𝑘 = 1".
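The conjugate update itself is one line of code; a minimal sketch with made-up prior pseudo-counts 𝜶 and observed counts 𝒎:

```python
import numpy as np

alpha_prior = np.array([2.0, 2.0, 2.0])     # Dir(alpha) prior (example values)
counts = np.array([10, 3, 7])               # m_k: observed counts per category

alpha_post = alpha_prior + counts           # posterior is Dir(alpha + m)
post_mean = alpha_post / alpha_post.sum()   # posterior mean E[mu_k | D]
print(alpha_post, post_mean)
```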
Dirichlet Distribution
Examples of the Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), which can be plotted in 2𝐷 since 𝜇3 = 1 − 𝜇1 − 𝜇2.

[Figure: three panels — uniform; broad, centered at (1/3, 1/3, 1/3); narrow, centered at (1/3, 1/3, 1/3). 𝛼0 = 𝛼1 + ⋯ + 𝛼𝐾 controls how peaked the distribution is; the 𝛼𝑘's control the location of the peak.]


Dirichlet Distribution
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), where the horizontal axes are 𝜇1 and 𝜇2 and the vertical axis is the density.

[Figure: surface plots of the density for {𝛼𝑘} = 0.1, {𝛼𝑘} = 1, and {𝛼𝑘} = 10. MatLab code.]
Dirichlet Distribution
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), where the horizontal axes are 𝜇1 and 𝜇2 and the vertical axis is the density.

[Figure: {𝛼𝑘} = (0.1, 0.1, 0.1) — if 𝛼𝑘 < 1 for all 𝑘, we obtain spikes at the corners of the simplex; {𝛼𝑘} = (2, 2, 2) and {𝛼𝑘} = (10, 10, 10) — increasingly peaked at the center. Run visDirichletGui & dirichlet3dPlot from PMTK.]


Dirichlet Distribution
Samples from a 5-dimensional symmetric Dirichlet distribution.

[Figure: five bar-plot samples each from 𝒟𝒾𝓇(𝛼 = 5) (left column, near-uniform bars) and 𝒟𝒾𝓇(𝛼 = 0.1) (right column, sparse bars).]

{ k }  5,5,...,5 { k }  0.1, 0.1,..., 0.1


Run dirichletHistogramDemo
from PMTK
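The PMTK demo is in MatLab; an approximate Python equivalent using numpy's built-in Dirichlet sampler (a sketch, not a transcription of the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
# Five draws each from a symmetric 5-dimensional Dirichlet:
# large alpha -> near-uniform probability vectors, small alpha -> sparse ones.
for a in [5.0, 0.1]:
    samples = rng.dirichlet(np.full(5, a), size=5)
    print(f"alpha = {a}:")
    print(np.round(samples, 2))   # each row sums to 1
```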



Dirichlet Distribution
• In closing, we have the following properties (to prove them you only need the normalization of the Dirichlet distribution,

$$\int \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\boldsymbol{\mu} = \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}{\Gamma(\alpha_0)}, \qquad \alpha_0 = \alpha_1 + \cdots + \alpha_K,$$

and the property Γ(𝑥 + 1) = 𝑥Γ(𝑥)):

$$\mathbb{E}[\mu_k] = \frac{\alpha_k}{\alpha_0}, \quad \operatorname{mode}[\mu_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \quad \operatorname{var}[\mu_k] = \frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \quad \operatorname{cov}[\mu_j, \mu_l] = -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0+1)} \;\; (j \ne l)$$

$$\text{where: } \alpha_0 = \sum_{k=1}^{K}\alpha_k$$

• Often we use 𝛼𝑘 = 𝛼/𝐾. In this case (𝛼0 = 𝛼):

$$\mathbb{E}[\mu_k] = \frac{1}{K}, \qquad \operatorname{var}[\mu_k] = \frac{K-1}{K^2(\alpha+1)}$$

• Increasing 𝛼 increases the precision (peakedness) of the distribution.
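These formulas can be verified against sample moments; a short Monte Carlo sketch with an illustrative 𝜶:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
S = rng.dirichlet(alpha, size=500_000)

print(S.mean(axis=0), alpha / a0)                           # E[mu_k] = alpha_k / alpha_0
print(S.var(axis=0),
      alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))            # var[mu_k]
print(np.cov(S[:, 0], S[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))            # cov[mu_j, mu_l], j != l
```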


