
Multivariate Gaussian and Student's 𝓣 and Dirichlet Distributions
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 27, 2020



Contents
• The multivariate Gaussian

• Multivariate Student's 𝒯 distribution

• Dirichlet distribution

• The goals of today's lecture are:

  • Familiarize ourselves with the multivariate Gaussian and Student's 𝒯

  • Learn about the Dirichlet distribution



References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2

• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability. 2nd Edition. Athena Scientific.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.



Multivariate Gaussian
• A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝑥0, 𝜎²), if:

$$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - x_0)^2\right) dx$$

• A multivariate variable 𝒙 ∈ ℝ𝐷 is Gaussian if its probability density is

$$p(\boldsymbol{x}) = \left(\frac{1}{(2\pi)^D \det\boldsymbol{\Sigma}}\right)^{1/2} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{x}_0)\right)$$
where 𝒙0 ∈ ℝ𝐷 , 𝜮 ∈ ℝ𝐷×𝐷 is symmetric positive definite (covariance matrix).

• Only the symmetric part of 𝜮 contributes to the value of (𝒙 − 𝒙0)ᵀ𝜮⁻¹(𝒙 − 𝒙0), so taking 𝜮 symmetric loses no generality. For a symmetric covariance matrix we only need to specify 𝐷(𝐷 + 1)/2 elements rather than 𝐷².
• The Gaussian family is closed under linear transformations, i.e. for 𝑨, 𝑩 ∈ ℝ𝑀×𝐷, 𝒄 ∈ ℝ𝑀:

$$X_1 \sim \mathcal{N}(\mu_1, \Sigma_1),\ X_2 \sim \mathcal{N}(\mu_2, \Sigma_2),\ X_1, X_2 \text{ independent} \;\Rightarrow\; A X_1 + B X_2 + c \sim \mathcal{N}\!\left(A\mu_1 + B\mu_2 + c,\; A\Sigma_1 A^T + B\Sigma_2 B^T\right)$$
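As a quick sanity check of this closure property, here is a minimal numpy sketch (all parameters are made up for illustration) that compares the sample moments of 𝑨𝑿₁ + 𝑩𝑿₂ + 𝒄 against the predicted mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 3, 2

# Two independent Gaussians X1 ~ N(mu1, S1), X2 ~ N(mu2, S2) (made-up parameters)
mu1, mu2 = rng.normal(size=D), rng.normal(size=D)
L1, L2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
S1 = L1 @ L1.T + np.eye(D)            # symmetric positive definite covariances
S2 = L2 @ L2.T + np.eye(D)

A, B = rng.normal(size=(M, D)), rng.normal(size=(M, D))
c = rng.normal(size=M)

N = 500_000
X1 = rng.multivariate_normal(mu1, S1, size=N)
X2 = rng.multivariate_normal(mu2, S2, size=N)
Y = X1 @ A.T + X2 @ B.T + c           # samples of A X1 + B X2 + c

mean_th = A @ mu1 + B @ mu2 + c       # predicted mean
cov_th = A @ S1 @ A.T + B @ S2 @ B.T  # predicted covariance

print(np.abs(Y.mean(axis=0) - mean_th).max())  # ~ 0 up to Monte Carlo error
print(np.abs(np.cov(Y.T) - cov_th).max())      # ~ 0 up to Monte Carlo error
```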
Conditional and Marginal Probability Densities
[Figure: conditional and marginal bivariate normal pdfs. Top row: ellipsoidal equiprobability curves of 𝑝(𝑥, 𝑦) alongside the conditional density 𝑝(𝑥|𝑦 = 2). Bottom row: the same contours alongside the marginal density 𝑝(𝑥). Link here for a MatLab program to generate these figures.]
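The two panels follow from the standard Gaussian marginalization and conditioning formulas, 𝑝(𝑥) = 𝒩(𝜇ₓ, Σₓₓ) and 𝑝(𝑥|𝑦) = 𝒩(𝜇ₓ + Σₓᵧ Σᵧᵧ⁻¹(𝑦 − 𝜇ᵧ), Σₓₓ − Σₓᵧ Σᵧᵧ⁻¹ Σᵧₓ). A minimal Python sketch follows; the parameter values are hypothetical, as the slide does not list the ones used in the plots:

```python
import numpy as np

# Hypothetical bivariate normal over (x, y) (assumed values, for illustration)
mu = np.array([2.0, 2.5])                 # [mu_x, mu_y]
Sigma = np.array([[1.0, 0.6],
                  [0.6, 0.8]])            # joint covariance

# Marginal p(x) = N(mu_x, Sigma_xx): just read off the corresponding block
mu_x, var_x = mu[0], Sigma[0, 0]

# Conditional p(x | y = 2) = N(mu_x + Sxy/Syy (y - mu_y), Sxx - Sxy^2/Syy)
y = 2.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

print(f"marginal    p(x):     N({mu_x:.3f}, {var_x:.3f})")
print(f"conditional p(x|y=2): N({mu_cond:.3f}, {var_cond:.3f})")
```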


Multivariate Student’s 𝓣 Distribution

$$p(x \mid \mu, a, b) = \int_0^\infty \mathcal{N}\!\left(x \mid \mu, \tau^{-1}\right) \mathrm{Gamma}(\tau \mid a, b)\, d\tau$$

• If we return to the derivation of the univariate Student's 𝒯 distribution and substitute 𝜐 = 2𝑎, 𝜆 = 𝑎/𝑏, 𝜂 = 𝜏𝑏/𝑎, and use

$$\mathrm{Gamma}(\tau \mid a, b) = \frac{b^a}{\Gamma(a)} \tau^{a-1} e^{-b\tau}$$
we can write the Student's 𝒯 distribution as:*

$$\mathcal{T}(x \mid \mu, \lambda, \nu) = \int_0^\infty \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

• This form is useful in providing the generalization to a multivariate Student's 𝒯:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^\infty \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
*Use the change of variables formula for distributions, with 𝑑𝜏 = 𝜆𝑑𝜂, and notice that the extra terms that appear cancel out.
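A quick numerical check of this scale-mixture representation in the univariate case: the sketch below (with illustrative parameter values) integrates the mixture with scipy.integrate.quad and compares it to SciPy's closed-form Student's 𝒯 density, whose scale is 𝜆^(−1/2):

```python
import numpy as np
from scipy import integrate, stats

mu, lam, nu = 0.5, 2.0, 4.0   # location, precision, dof (example values)

def mixture_density(x):
    # Integrate N(x | mu, (eta*lam)^-1) Gamma(eta | nu/2, nu/2) over eta.
    # Gamma with shape nu/2 and rate nu/2 has scale 2/nu in scipy's convention.
    integrand = lambda eta: (
        stats.norm.pdf(x, mu, 1.0 / np.sqrt(eta * lam))
        * stats.gamma.pdf(eta, a=nu / 2, scale=2.0 / nu)
    )
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

# Closed-form Student's t with precision lam, i.e. scale = 1/sqrt(lam)
for x in [-1.0, 0.5, 3.0]:
    print(mixture_density(x),
          stats.t.pdf(x, df=nu, loc=mu, scale=1 / np.sqrt(lam)))
```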
Multivariate Student’s 𝓣 Distribution

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^\infty \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
• This integral can be computed analytically as:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$

$$\Delta^2 = (\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Lambda} (\boldsymbol{x}-\boldsymbol{\mu}) \quad (\text{squared Mahalanobis distance})$$

• One can derive the above form of the distribution by substitution in the equation at the top:

$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{(\nu/2)^{\nu/2}\, |\boldsymbol{\Lambda}|^{1/2}}{\Gamma(\nu/2)\, (2\pi)^{D/2}} \int_0^\infty \eta^{D/2+\nu/2-1} e^{-\eta(\nu+\Delta^2)/2}\, d\eta$$

Using $\int_0^\infty \eta^{a-1} e^{-b\eta}\, d\eta = \Gamma(a)/b^a$ with $a = D/2+\nu/2$ and $b = (\nu+\Delta^2)/2$:

$$\int_0^\infty \eta^{D/2+\nu/2-1} e^{-\eta(\nu+\Delta^2)/2}\, d\eta = \Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right) \left(\frac{2}{\nu+\Delta^2}\right)^{D/2+\nu/2}$$

and collecting terms gives the $\left[1+\Delta^2/\nu\right]^{-D/2-\nu/2}$ form above.
• The normalization proof is immediate from the normalization of the normal and Gamma distributions.
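The analytic density above transcribes directly into code. As a check, the sketch below compares it against scipy.stats.multivariate_t (available in SciPy ≥ 1.6), which parametrizes the distribution by the scale matrix 𝚺 = 𝜦⁻¹ rather than the precision; the parameter values are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def mvt_logpdf(x, mu, Lam, nu):
    """log T(x | mu, Lam, nu), with Lam a precision (inverse-scale) matrix."""
    D = len(mu)
    d = x - mu
    delta2 = d @ Lam @ d                       # squared Mahalanobis distance
    return (gammaln(D / 2 + nu / 2) - gammaln(nu / 2)
            + 0.5 * np.linalg.slogdet(Lam)[1]
            - 0.5 * D * np.log(nu * np.pi)
            - (D / 2 + nu / 2) * np.log1p(delta2 / nu))

mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])       # example precision matrix
nu = 5.0
x = np.array([0.5, -0.2])

print(mvt_logpdf(x, mu, Lam, nu))
# scipy parametrizes by the scale matrix Sigma = Lam^{-1}
print(multivariate_t(loc=mu, shape=np.linalg.inv(Lam), df=nu).logpdf(x))
```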
Multivariate Student’s T Distribution
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$
• Some useful results for the multivariate Student's 𝒯 are given below:

$$\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu} \;\;\text{if } \nu > 1, \qquad \operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1} \;\;\text{if } \nu > 2, \qquad \operatorname{mode}[\boldsymbol{x}] = \boldsymbol{\mu}$$
• One can show the expression for the mean easily by using 𝒙 = 𝒛 + 𝝁:

$$\mathbb{E}[\boldsymbol{x}] = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \int \left[1 + \frac{\boldsymbol{z}^T\boldsymbol{\Lambda}\boldsymbol{z}}{\nu}\right]^{-D/2-\nu/2} (\boldsymbol{z}+\boldsymbol{\mu})\, d\boldsymbol{z}$$

• The 1st term drops out since 𝒯(𝒛|𝟎, 𝜦, 𝝊) is an even function of 𝒛. The 2nd term gives 𝝁 from the normalization of the distribution.
• The covariance is computed as:

$$\operatorname{cov}[\boldsymbol{x}] = \int_0^\infty \left[\int \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\right)(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^T\, d\boldsymbol{x}\right] \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \boldsymbol{\Lambda}^{-1} \int_0^\infty \eta^{-1}\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

The inner integral is the Gaussian covariance $(\eta\boldsymbol{\Lambda})^{-1}$, and

$$\int_0^\infty \eta^{-1}\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \int_0^\infty \eta^{\nu/2-2} e^{-\nu\eta/2}\, d\eta = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \frac{\Gamma(\nu/2-1)}{(\nu/2)^{\nu/2-1}} = \frac{\nu/2}{\nu/2-1} = \frac{\nu}{\nu-2}$$

so that $\operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1}$.
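These moments can also be verified by Monte Carlo, sampling from the scale mixture (𝜂 ~ Gamma(𝜈/2, 𝜈/2), then 𝒙|𝜂 ~ 𝒩(𝝁, (𝜂𝜦)⁻¹)); a short sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.5], [0.5, 1.5]])   # example precision matrix
nu, N = 6.0, 500_000

# eta ~ Gamma(nu/2, rate nu/2); scipy/numpy use the scale convention, scale = 2/nu
eta = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=N)
Sigma = np.linalg.inv(Lam)
z = rng.multivariate_normal(np.zeros(2), Sigma, size=N)
x = mu + z / np.sqrt(eta)[:, None]         # x | eta ~ N(mu, (eta*Lam)^-1)

print(x.mean(axis=0))              # ~ mu                      (nu > 1)
print(np.cov(x.T))                 # ~ nu/(nu-2) * Lam^{-1}    (nu > 2)
print(nu / (nu - 2) * Sigma)
```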
Multivariate Student’s T Distribution
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{D}{2}+\frac{\nu}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2-\nu/2}$$
• Differentiation with respect to 𝒙 also shows that the mode is 𝝁:

$$\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu} \;\;\text{if } \nu > 1, \qquad \operatorname{cov}[\boldsymbol{x}] = \frac{\nu}{\nu-2}\,\boldsymbol{\Lambda}^{-1} \;\;\text{if } \nu > 2, \qquad \operatorname{mode}[\boldsymbol{x}] = \boldsymbol{\mu}$$

• The Student's 𝒯 has fatter tails than a Gaussian. The smaller 𝜐 is, the fatter the tails.

 For 𝜐 ∞, the distribution approaches a Gaussian. Indeed note that:


 2 
 /2  D /2
   D    2         2 1   2 2    2 
1     exp      ln 1     exp          exp    O  1  
    2 2      2   2      2 
  

• The distribution can also be written in terms of 𝚺 = 𝜦⁻¹ (the scale matrix, not the covariance) or 𝑽 = 𝜐𝚺.
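Both the fat tails and the Gaussian limit are easy to see numerically; a short comparison of tail probabilities at an illustrative threshold 𝑥 = 4:

```python
from scipy import stats

# Tail mass P(X > 4) for a standard Student's t vs a standard normal.
# Small nu -> much heavier tail; large nu -> approaches the Gaussian value.
for nu in [1, 2, 5, 30, 1000]:
    print(f"nu = {nu:5d}: t tail = {stats.t.sf(4, df=nu):.2e}, "
          f"normal tail = {stats.norm.sf(4):.2e}")
```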
Dirichlet Distribution
• We introduce the Dirichlet distribution as a family of "conjugate priors" (to be formally introduced in a follow-up lecture) for the parameters 𝜇𝑘 of the multinomial distribution.

• The Dirichlet distribution 𝒟𝒾𝓇(𝜶) is a family of continuous multivariate probability distributions parametrized by a vector 𝜶 of positive reals.

• It is the multivariate generalization of the 𝓑𝓮𝓽𝓪 distribution.



Dirichlet Distribution
• Its probability density function returns the belief that the probabilities of 𝐾 rival events are 𝜇𝑘, given that each event has been observed 𝛼𝑘 − 1 times:

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad 0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K} \mu_k = 1$$

• The distribution over the space of 𝜇𝑘 is (𝐾 − 1)-dimensional due to the last constraint above.



Dirichlet Distribution
The Dirichlet distribution of order 𝐾 ≥ 2 with parameters 𝛼1, … , 𝛼𝐾 > 0 has a PDF with respect to Lebesgue measure on ℝ𝐾−1 given by

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{1}{\mathrm{Beta}(\boldsymbol{\alpha})} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

for all 𝜇1, … , 𝜇𝐾−1 > 0 satisfying 𝜇1 + ⋯ + 𝜇𝐾−1 < 1, where 𝜇𝐾 is an abbreviation for 1 − 𝜇1 − ⋯ − 𝜇𝐾−1. The normalizing constant is the multinomial Beta function:

$$\mathrm{Beta}(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}, \qquad \boldsymbol{\alpha} = \left(\alpha_1, \alpha_2, \ldots, \alpha_K\right)^T$$
[Figure: the Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3) is confined to a plane (the probability simplex), as shown.]
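A minimal sketch that evaluates this log-density via the multinomial Beta function (computed with gammaln for numerical stability), checked against scipy.stats.dirichlet; the values of 𝜶 and 𝝁 are illustrative:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_logpdf(mu, alpha):
    # log p(mu | alpha) = sum_k (alpha_k - 1) log mu_k - log Beta(alpha)
    log_beta = gammaln(alpha).sum() - gammaln(alpha.sum())
    return np.sum((alpha - 1) * np.log(mu)) - log_beta

alpha = np.array([2.0, 3.0, 4.0])
mu = np.array([0.2, 0.3, 0.5])          # lies on the simplex: sums to 1

print(dirichlet_logpdf(mu, alpha))
print(dirichlet.logpdf(mu, alpha))      # scipy's implementation agrees
```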



Dirichlet Distribution
• We write the Dirichlet distribution as:

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = K(\boldsymbol{\alpha}) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}, \qquad \alpha_0 = \alpha_1 + \cdots + \alpha_K$$
• Note the following useful relation:

$$\frac{\partial}{\partial \alpha_j} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} = \frac{\partial}{\partial \alpha_j}\, e^{\sum_{k=1}^{K}(\alpha_k - 1)\ln\mu_k} = \ln\mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$
• From this we can derive an interesting expression for 𝔼[ln 𝜇𝑗]:

$$\mathbb{E}[\ln\mu_j] = K(\boldsymbol{\alpha}) \int_0^1\!\!\cdots\!\int_0^1 \ln\mu_j \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\mu_1\cdots d\mu_K = K(\boldsymbol{\alpha})\, \frac{\partial}{\partial\alpha_j} \int_0^1\!\!\cdots\!\int_0^1 \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\mu_1\cdots d\mu_K$$

$$= K(\boldsymbol{\alpha})\, \frac{\partial}{\partial\alpha_j}\frac{1}{K(\boldsymbol{\alpha})} = -\frac{\partial \ln K(\boldsymbol{\alpha})}{\partial\alpha_j} = \frac{\partial}{\partial\alpha_j}\Big(\ln\Gamma(\alpha_j) - \ln\Gamma(\alpha_0)\Big)$$

where Ψ(𝛼) = 𝑑 ln Γ(𝛼)/𝑑𝛼 is the digamma function, so that

$$\mathbb{E}[\ln\mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$$
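This identity is easy to verify by Monte Carlo; a short sketch comparing sample averages of ln 𝜇𝑘 against 𝜓(𝛼𝑘) − 𝜓(𝛼0), with an illustrative 𝜶:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)
alpha = np.array([1.5, 2.0, 5.0])
samples = rng.dirichlet(alpha, size=200_000)

print(np.log(samples).mean(axis=0))            # Monte Carlo E[ln mu_k]
print(digamma(alpha) - digamma(alpha.sum()))   # psi(alpha_k) - psi(alpha_0)
```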
Dirichlet Distribution: Normalization
• To show the normalization, we use induction. The case 𝑀 = 2 was shown earlier for the Beta distribution.

• Assume that the Dirichlet normalization formula is valid for 𝑀 − 1 terms. We will show the formula for 𝑀 terms:

$$p_M(\mu_1, \ldots, \mu_{M-1}) = C_M \prod_{k=1}^{M-1}\mu_k^{\alpha_k-1}\left(1 - \sum_{j=1}^{M-1}\mu_j\right)^{\alpha_M - 1}$$

• Let us integrate out 𝜇𝑀−1, substituting 𝜇𝑀−1 = 𝑡(1 − Σ𝑗=1..𝑀−2 𝜇𝑗):

$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \prod_{k=1}^{M-2}\mu_k^{\alpha_k-1} \int_0^{1-\sum_{j=1}^{M-2}\mu_j} \mu_{M-1}^{\alpha_{M-1}-1}\left(1 - \sum_{j=1}^{M-2}\mu_j - \mu_{M-1}\right)^{\alpha_M-1} d\mu_{M-1}$$

$$= C_M \prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\left(1 - \sum_{j=1}^{M-2}\mu_j\right)^{\alpha_{M-1}+\alpha_M-1} \int_0^1 t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\, dt$$
Dirichlet Distribution: Normalization
 M 1  M 1 1

M 2
 M 2 
pM 1 ( 1 ,...,  M  2 )  CM   k k 1  1    j  1  t 
 M 1

 M 1 1
t dt 
 k 1  j 1  0
 M 1  M 1
 M  2  k 1   M  2    M 1    M 
 CM    k   1    j 
 k 1  j 1    M 1   M 
Dirichlet ( M 1)

• The last step above comes from the normalization of the Beta distribution.

• What we have above is an (𝑀 − 1)-term Dirichlet distribution with coefficients 𝛼1, … , 𝛼𝑀−2, 𝛼𝑀−1 + 𝛼𝑀. Since we assumed that the normalization formula is valid for (𝑀 − 1) terms, we must have:

$$1 = C_M\, \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1}+\alpha_M)}{\Gamma(\alpha_1+\cdots+\alpha_M)}\, \frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}$$

$$C_M = \frac{\Gamma(\alpha_1+\cdots+\alpha_M)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1})\,\Gamma(\alpha_M)}$$
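The normalization can also be checked numerically for 𝐾 = 3 by integrating over the simplex; a minimal sketch with an illustrative 𝜶, using scipy.integrate.dblquad for the triangular region:

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

alpha = np.array([2.0, 3.0, 1.5])
C = gamma(alpha.sum()) / gamma(alpha).prod()     # 1 / Beta(alpha)

# Density over (mu1, mu2) with mu3 = 1 - mu1 - mu2; dblquad integrates
# the inner variable (mu2) first, over [0, 1 - mu1].
pdf = lambda m2, m1: (C * m1**(alpha[0] - 1) * m2**(alpha[1] - 1)
                      * (1 - m1 - m2)**(alpha[2] - 1))
total, _ = integrate.dblquad(pdf, 0, 1,
                             lambda m1: 0.0, lambda m1: 1.0 - m1)
print(total)   # ~ 1.0
```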
Dirichlet Distribution
• Consider a data set 𝒟 i.i.d. sampled from the multinomial. Recall that the likelihood is 𝑝(𝒟|𝝁) ∝ ∏𝐾𝑘=1 𝜇𝑘^𝑚𝑘. Introduce the Dirichlet as "the conjugate prior" 𝑝(𝝁). From Bayes' formula, we arrive at the following "posterior":

$$p(\boldsymbol{\mu} \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid \boldsymbol{\mu})}_{\text{Multinomial}}\, \underbrace{p(\boldsymbol{\mu})}_{\text{Dirichlet}} \;\Rightarrow\; p(\boldsymbol{\mu} \mid \mathcal{D}) \propto \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$$

• Note this is a Dirichlet distribution 𝒟𝒾𝓇(𝝁|𝛼1 + 𝑚1, … , 𝛼𝐾 + 𝑚𝐾).

• The normalization factor is computed easily from the normalization factor of the Dirichlet (here 𝑁 = Σ𝑘 𝑚𝑘):

$$p(\boldsymbol{\mu} \mid \mathcal{D}) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k + N\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k + m_k)} \prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$$

• 𝛼𝑘 can be interpreted as "the effective number of prior observations of 𝑥𝑘 = 1".
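The conjugate update itself is one line of code; a minimal sketch with made-up prior pseudo-counts 𝜶 and observed counts 𝒎:

```python
import numpy as np

alpha_prior = np.array([2.0, 2.0, 2.0])     # Dir(alpha) prior (example values)
counts = np.array([10, 3, 7])               # m_k: observed counts per category

alpha_post = alpha_prior + counts           # posterior is Dir(alpha + m)
post_mean = alpha_post / alpha_post.sum()   # posterior mean E[mu_k | D]
print(alpha_post, post_mean)
```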
Dirichlet Distribution
Examples of the Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), which can be plotted in 2𝐷 since 𝜇3 = 1 − 𝜇1 − 𝜇2.

[Figure: three panels — uniform; broad, centered at (1/3, 1/3, 1/3); narrow, centered at (1/3, 1/3, 1/3). 𝛼0 = 𝛼1 + ⋯ + 𝛼𝐾 controls how peaked the distribution is; the 𝛼𝑘's control the location of the peak.]


Dirichlet Distribution
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), where the horizontal axes are 𝜇1 and 𝜇2 and the vertical axis is the density.

[Figure: surface plots of the density for {𝛼𝑘} = 0.1, {𝛼𝑘} = 1, and {𝛼𝑘} = 10. MatLab code.]
Dirichlet Distribution
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3), where the horizontal axes are 𝜇1 and 𝜇2 and the vertical axis is the density.

[Figure: {𝛼𝑘} = (0.1, 0.1, 0.1) — if 𝛼𝑘 < 1 for all 𝑘, we obtain spikes at the corners of the simplex; {𝛼𝑘} = (2, 2, 2) and {𝛼𝑘} = (10, 10, 10) — increasingly peaked at the center. Run visDirichletGui & dirichlet3dPlot from PMTK.]


Dirichlet Distribution
Samples from a 5-dimensional symmetric Dirichlet distribution.

[Figure: five bar-plot samples each from 𝒟𝒾𝓇(𝛼 = 5) (left column, near-uniform bars) and 𝒟𝒾𝓇(𝛼 = 0.1) (right column, sparse bars).]

{ k }  5,5,...,5 { k }  0.1, 0.1,..., 0.1


Run dirichletHistogramDemo
from PMTK
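The PMTK demo is in MatLab; an approximate Python equivalent using numpy's built-in Dirichlet sampler (a sketch, not a transcription of the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
# Five draws each from a symmetric 5-dimensional Dirichlet:
# large alpha -> near-uniform probability vectors, small alpha -> sparse ones.
for a in [5.0, 0.1]:
    samples = rng.dirichlet(np.full(5, a), size=5)
    print(f"alpha = {a}:")
    print(np.round(samples, 2))   # each row sums to 1
```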



Dirichlet Distribution
• In closing, we have the following properties (to prove them you only need the normalization of the Dirichlet distribution,

$$\int \prod_{k=1}^{K}\mu_k^{\alpha_k-1}\, d\boldsymbol{\mu} = \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}{\Gamma(\alpha_0)}, \qquad \alpha_0 = \alpha_1 + \cdots + \alpha_K,$$

and the property Γ(𝑥 + 1) = 𝑥Γ(𝑥)):

$$\mathbb{E}[\mu_k] = \frac{\alpha_k}{\alpha_0}, \quad \operatorname{mode}[\mu_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \quad \operatorname{var}[\mu_k] = \frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \quad \operatorname{cov}[\mu_j, \mu_l] = -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0+1)} \;\; (j \ne l)$$

$$\text{where: } \alpha_0 = \sum_{k=1}^{K}\alpha_k$$

• Often we use 𝛼𝑘 = 𝛼/𝐾. In this case (𝛼0 = 𝛼):

$$\mathbb{E}[\mu_k] = \frac{1}{K}, \qquad \operatorname{var}[\mu_k] = \frac{K-1}{K^2(\alpha+1)}$$

• Increasing 𝛼 increases the precision (peakedness) of the distribution.
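These formulas can be verified against sample moments; a short Monte Carlo sketch with an illustrative 𝜶:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
S = rng.dirichlet(alpha, size=500_000)

print(S.mean(axis=0), alpha / a0)                           # E[mu_k] = alpha_k / alpha_0
print(S.var(axis=0),
      alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))            # var[mu_k]
print(np.cov(S[:, 0], S[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))            # cov[mu_j, mu_l], j != l
```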


