
Machine Learning

Exponential Family
Dmitrij Schlesinger
WS2013/2014, 22.11.2013
General form
p(x; θ) = h(x) · exp[ ⟨η(θ), T(x)⟩ − A(θ) ]
with
– x is the random variable
– θ is the parameter
– η(θ) is the natural parameter, a vector (often η(θ) = θ)
– T(x) is the sufficient statistic
– A(θ) is the log-partition function

Almost all probability distributions you can imagine are members of the exponential family.

Example – Gaussian (board)
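
A minimal numerical sketch of the board example (the parameter values and the use of NumPy/SciPy are my assumptions, not part of the slides): the univariate Gaussian fits the general form with h(x) = 1, T(x) = (x, x²), η = (μ/σ², −1/(2σ²)) and A = μ²/(2σ²) + ln σ + ½ ln 2π.

```python
import numpy as np
from scipy.stats import norm

# Univariate Gaussian N(mu, sigma^2) written in exponential-family form:
#   h(x) = 1,  eta = (mu/sigma^2, -1/(2 sigma^2)),  T(x) = (x, x^2),
#   A = mu^2/(2 sigma^2) + ln(sigma) + 0.5 ln(2 pi).
mu, sigma = 1.5, 0.7                          # illustrative parameter values
x = np.linspace(-2.0, 5.0, 7)

eta = np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])   # natural parameters
T = np.stack([x, x**2], axis=-1)                            # sufficient statistic
A = mu**2 / (2.0 * sigma**2) + np.log(sigma) + 0.5 * np.log(2.0 * np.pi)  # log-partition

p_expfam = np.exp(T @ eta - A)                # h(x) = 1
assert np.allclose(p_expfam, norm.pdf(x, loc=mu, scale=sigma))
```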



Our models
Let x be an observed variable and y a hidden one.

1. The joint probability distribution is in the exponential family (a generative model):

      p(x, y; w) = (1/Z(w)) · exp[ ⟨φ(x, y), w⟩ ]

      Z(w) = Σ_{x,y} exp[ ⟨φ(x, y), w⟩ ]

2. The conditional probability distribution is in the exponential family (a discriminative model):

      p(x, y; w) = p(x) · p(y|x; w)

      p(y|x; w) = (1/Z(w, x)) · exp[ ⟨φ(x, y), w⟩ ]

      Z(w, x) = Σ_y exp[ ⟨φ(x, y), w⟩ ]   ∀x
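
As a minimal sketch of the difference between the two normalizations (the toy state spaces, the feature map φ and the weights w below are illustrative assumptions, not from the slides): Z(w) sums over all pairs (x, y), while Z(w, x) sums over y only, separately for each x.

```python
import numpy as np

# Toy discrete model illustrating the two normalisations; the state spaces, the
# feature map phi and the weights w are illustrative assumptions, not from the slides.
X, Y, D = 3, 2, 4                      # |x|-range, |y|-range, feature dimension
rng = np.random.default_rng(0)
w = rng.normal(size=D)

def phi(x, y):
    """Joint feature vector phi(x, y) (an arbitrary toy choice)."""
    f = np.zeros(D)
    f[x % D] += 1.0
    f[(x + y) % D] += 1.0
    return f

# Generative model: Z(w) sums over all pairs (x, y).
Z_joint = sum(np.exp(phi(x, y) @ w) for x in range(X) for y in range(Y))
p_xy = lambda x, y: np.exp(phi(x, y) @ w) / Z_joint           # p(x, y; w)

# Discriminative model: Z(w, x) sums over y only, separately for every x.
Z_cond = lambda x: sum(np.exp(phi(x, y) @ w) for y in range(Y))
p_y_given_x = lambda y, x: np.exp(phi(x, y) @ w) / Z_cond(x)  # p(y | x; w)
```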



Our learning schemes

– Generative model, supervised → Maximum Likelihood, gradient

– Discriminative model, supervised → Maximum Conditional Likelihood, gradient

– Generative model, unsupervised → Maximum Likelihood, Expectation Maximization, gradient for the M-step



Generative model, supervised
Model:

   p(x, y; w) = (1/Z(w)) · exp[ ⟨φ(x, y), w⟩ ]

   Z(w) = Σ_{x,y} exp[ ⟨φ(x, y), w⟩ ]

Training set: L = { (x^l, y^l), … }

Maximum Likelihood:

   Σ_l [ ⟨φ(x^l, y^l), w⟩ − ln Z(w) ] → max_w

Gradient (of the average log-likelihood):

   ∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − ∂ ln Z(w)/∂w



Generative model, supervised
Partition function:

   Z(w) = Σ_{x,y} exp[ ⟨φ(x, y), w⟩ ]

Gradient of the log-partition function:

   ∂ ln Z(w)/∂w = (1/Z(w)) Σ_{x,y} exp[ ⟨φ(x, y), w⟩ ] · φ(x, y)
                = Σ_{x,y} p(x, y; w) · φ(x, y) = E_{p(x,y;w)}[φ]

The gradient is the difference of two expectations:

   ∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − E_{p(x,y;w)}[φ] = E_L[φ] − E_{p(x,y;w)}[φ]
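
A minimal gradient-ascent sketch for this supervised generative case (the toy feature map, the training pairs and the step size are illustrative assumptions): each update moves w along E_L[φ] − E_{p(x,y;w)}[φ].

```python
import numpy as np

# Gradient-ascent sketch for the supervised generative case on a toy discrete model.
# The feature map phi, the training pairs and the step size are illustrative assumptions.
X, Y, D = 3, 2, 4                      # |x|-range, |y|-range, feature dimension

def phi(x, y):
    f = np.zeros(D)
    f[x % D] += 1.0
    f[(x + y) % D] += 1.0
    return f

data = [(0, 1), (1, 0), (2, 1), (1, 1)]                   # training set L of (x^l, y^l)
E_data = np.mean([phi(x, y) for x, y in data], axis=0)    # E_L[phi], independent of w

w = np.zeros(D)
for _ in range(200):
    # E_{p(x,y;w)}[phi]: expectation of phi under the current joint model.
    scores = np.array([[phi(x, y) @ w for y in range(Y)] for x in range(X)])
    p = np.exp(scores - scores.max())
    p /= p.sum()                                          # normalise by Z(w)
    E_model = sum(p[x, y] * phi(x, y) for x in range(X) for y in range(Y))
    w += 0.5 * (E_data - E_model)                         # ascend the log-likelihood
```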



Discriminative model (posterior), supervised
Model:

   p(y|x; w) = (1/Z(w, x)) · exp[ ⟨φ(x, y), w⟩ ]

   Z(w, x) = Σ_y exp[ ⟨φ(x, y), w⟩ ]   ∀x

Training set: L = { (x^l, y^l), … }

Maximum Conditional Likelihood:

   Σ_l [ ⟨φ(x^l, y^l), w⟩ − ln Z(w, x^l) ] → max_w

Gradient (of the average conditional log-likelihood):

   ∂/∂w = (1/|L|) Σ_l φ(x^l, y^l) − (1/|L|) Σ_l ∂ ln Z(w, x^l)/∂w



Discriminative model (posterior), supervised
Partition function:

   Z(w, x) = Σ_y exp[ ⟨φ(x, y), w⟩ ]

Gradient of the log-partition function for a particular x^l:

   ∂ ln Z(w, x^l)/∂w = (1/Z(w, x^l)) Σ_y exp[ ⟨φ(x^l, y), w⟩ ] · φ(x^l, y)
                     = Σ_y p(y|x^l; w) · φ(x^l, y) = E_{p(y|x^l;w)}[φ(x^l, ·)]

The gradient is again a difference of expectations:

   ∂/∂w = E_L[φ] − (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, ·)]
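
The corresponding sketch for the discriminative case (same illustrative toy setup, an assumption rather than the slides' own example): the model expectation is now taken over p(y|x^l; w) separately for each training input.

```python
import numpy as np

# Conditional-likelihood gradient sketch for the discriminative case on a toy model;
# the feature map, training pairs and step size are illustrative assumptions.
Y, D = 2, 4

def phi(x, y):
    f = np.zeros(D)
    f[x % D] += 1.0
    f[(x + y) % D] += 1.0
    return f

data = [(0, 1), (1, 0), (2, 1), (1, 1)]
E_data = np.mean([phi(x, y) for x, y in data], axis=0)    # E_L[phi]

w = np.zeros(D)
for _ in range(200):
    # (1/|L|) sum_l E_{p(y|x^l;w)}[phi(x^l, .)]
    E_cond = np.zeros(D)
    for x, _y in data:
        scores = np.array([phi(x, y) @ w for y in range(Y)])
        p_y = np.exp(scores - scores.max())
        p_y /= p_y.sum()                                  # normalise by Z(w, x^l)
        E_cond += sum(p_y[y] * phi(x, y) for y in range(Y))
    E_cond /= len(data)
    w += 0.5 * (E_data - E_cond)            # ascend the conditional log-likelihood
```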



Generative model, unsupervised
Model:

   p(x, y; w) = (1/Z(w)) · exp[ ⟨φ(x, y), w⟩ ]

   Z(w) = Σ_{x,y} exp[ ⟨φ(x, y), w⟩ ]

Training set (incomplete): L = { x^l, … }

Expectation:

   α_l(y) = p(y|x^l; w)   ∀ l, y

Maximization:

   Σ_l Σ_y α_l(y) · ln p(x^l, y; w) → max_w



Generative model, unsupervised
Maximization:

   Σ_l Σ_y α_l(y) · ln p(x^l, y; w) =
   = Σ_l Σ_y α_l(y) · [ ⟨φ(x^l, y), w⟩ − ln Z(w) ] =
   = Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − Σ_l Σ_y α_l(y) · ln Z(w) =
   = Σ_l Σ_y α_l(y) · ⟨φ(x^l, y), w⟩ − |L| · ln Z(w)

(the last step uses Σ_y α_l(y) = 1 for every l)

The gradient is again a difference of expectations:

   ∂/∂w = (1/|L|) Σ_l Σ_y α_l(y) · φ(x^l, y) − E_{p(x,y;w)}[φ] =
        = (1/|L|) Σ_l E_{p(y|x^l;w)}[φ(x^l, ·)] − E_{p(x,y;w)}[φ]
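
A minimal EM sketch for the unsupervised case (the toy feature map and data are illustrative assumptions): the E-step computes α_l(y) = p(y|x^l; w), and the M-step runs a few gradient steps with these posteriors kept fixed.

```python
import numpy as np

# EM sketch for the unsupervised generative case: y is hidden, only x^l is observed.
# The toy feature map, data and step sizes are illustrative assumptions.
X, Y, D = 3, 2, 4

def phi(x, y):
    f = np.zeros(D)
    f[x % D] += 1.0
    f[(x + y) % D] += 1.0
    return f

data_x = [0, 1, 2, 1]                   # incomplete training set (x^l only)
w = np.zeros(D)

for _ in range(50):
    # E-step: alpha_l(y) = p(y | x^l; w) for every example and label.
    alphas = []
    for x in data_x:
        s = np.array([phi(x, y) @ w for y in range(Y)])
        a = np.exp(s - s.max())
        alphas.append(a / a.sum())

    # "Data" expectation under the posteriors: (1/|L|) sum_l sum_y alpha_l(y) phi(x^l, y).
    E_post = np.mean([sum(a[y] * phi(x, y) for y in range(Y))
                      for x, a in zip(data_x, alphas)], axis=0)

    # M-step by gradient: a few ascent steps with the posteriors alpha_l kept fixed.
    for _ in range(20):
        scores = np.array([[phi(x, y) @ w for y in range(Y)] for x in range(X)])
        p = np.exp(scores - scores.max())
        p /= p.sum()
        E_model = sum(p[x, y] * phi(x, y) for x in range(X) for y in range(Y))
        w += 0.5 * (E_post - E_model)
```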



Conclusion

In all variants the gradient of the log-likelihood is a difference between expectations of the sufficient statistic:

   ∂ ln L / ∂w = E_data[φ] − E_model[φ]

→ the likelihood is at an optimum when the two expectations coincide.

In supervised cases the “data” expectation is a simple average over the training set → E_data does not depend on w
→ the problem is concave → global optimum.
