All of Graphical Models
Xiaojin Zhu
Department of Computer Sciences
University of Wisconsin-Madison, USA
Tutorial at ICMLA 2011
The Whole Tutorial in One Slide
Do inference p(X_Q | X_E), in general with X_Q, X_E \subset \{x_1, \ldots, x_n\}
If p(x_1, \ldots, x_n) is not given, estimate it from data
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Life without Graphical Models
. . . is fine mathematically:
e.g., x_1, \ldots, x_{n-1} can be the discrete or continuous features
e.g., x_n \equiv y can be the discrete class label
Prediction: y^* = \arg\max_{x_n} p(x_n \mid x_1, \ldots, x_{n-1}), a.k.a. inference
By definition,
p(x_n \mid x_1, \ldots, x_{n-1}) = \frac{p(x_1, \ldots, x_{n-1}, x_n)}{\sum_v p(x_1, \ldots, x_{n-1}, x_n = v)}
Conclusion
Do inference p(X_Q | X_E), in general with X_Q, X_E \subset \{x_1, \ldots, x_n\}
If p(x_1, \ldots, x_n) is not given, estimate it from data
Can't do it either: without structure, representing and summing over p(x_1, \ldots, x_n) is exponential in n
Acknowledgments Before We Start
Much of this tutorial is based on
Directed Graphical Models (Bayesian Networks)
p(x_1, \ldots, x_n) = \prod_i p(x_i \mid Pa(x_i))
where Pa(x_i) is the set of parents of x_i.
p(x_i \mid Pa(x_i)) is the conditional probability distribution (CPD) at x_i.
Example: the Naive Bayes model
p(y, x_1, \ldots, x_d) = p(y) \prod_{i=1}^d p(x_i \mid y)
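To make the factorization concrete, here is a minimal numpy sketch (not from the tutorial) of Naive Bayes inference by enumeration; the CPD tables and the binary-feature setup are made-up assumptions:

import numpy as np

p_y = np.array([0.6, 0.4])            # p(y), a hypothetical prior
p_x_given_y = np.array([              # p(x_i = 1 | y), one row per feature
    [0.2, 0.7],
    [0.5, 0.9],
    [0.1, 0.3],
])

def joint(y, x):
    """p(y, x_1, ..., x_d) = p(y) * prod_i p(x_i | y)."""
    probs = np.where(x == 1, p_x_given_y[:, y], 1 - p_x_given_y[:, y])
    return p_y[y] * probs.prod()

def posterior(x):
    """p(y | x) by enumerating the two class values."""
    j = np.array([joint(0, x), joint(1, x)])
    return j / j.sum()

print(posterior(np.array([1, 1, 0])))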
Example: Latent Dirichlet Allocation (LDA)
[Plate diagram: word w and topic z inside plate N_d, documents plate D, topics plate T]
A generative model for p(\theta, \phi, z, w \mid \alpha, \beta):
For each topic t: \phi_t \sim Dirichlet(\beta)
For each document d: \theta \sim Dirichlet(\alpha)
  For each word position in d:
    topic z \sim Multinomial(\theta)
    word w \sim Multinomial(\phi_z)
Inference goals: p(z \mid w, \alpha, \beta), \arg\max_{\theta, \phi} p(\theta, \phi \mid w, \alpha, \beta)
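A short sketch (not from the tutorial) of the generative process above; the sizes T, D, W, N_d and the symmetric hyperparameters alpha, beta are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
T, D, W, N_d = 3, 5, 10, 20          # topics, documents, vocabulary, words per doc
alpha, beta = 0.1, 0.01

phi = rng.dirichlet(np.full(W, beta), size=T)        # phi_t ~ Dirichlet(beta)
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))         # theta ~ Dirichlet(alpha)
    z = rng.choice(T, size=N_d, p=theta)             # z ~ Multinomial(theta)
    w = np.array([rng.choice(W, p=phi[k]) for k in z])  # w ~ Multinomial(phi_z)
    docs.append(w)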
Some Topics by LDA on the Wish Corpus
[Figure: top words under p(word | topic) for several learned topics, e.g., "troops", "election", "love"]
Conditional Independence
Chain A -> C -> B: A, B in general dependent; independent given C
Fork A <- C -> B: A, B in general dependent; independent given C
Collider A -> C <- B: A, B in general independent; A, B dependent given C ("explaining away")
[Figure: example network on A, B, C, E, F]
d-Separation Example 2
[Figure-based example]
Undirected Graphical Models (Markov Random Fields)
p(X) = \frac{1}{Z} \prod_C \psi_C(X_C)
Z = \int \prod_C \psi_C(X_C) \, dX is the partition function
Example: A Tiny Markov Random Field
[Figure: two nodes x_1, x_2 joined by a single clique C]
x_1, x_2 \in \{-1, 1\}
A single clique \psi_C(x_1, x_2) = e^{a x_1 x_2}
p(x_1, x_2) = \frac{1}{Z} e^{a x_1 x_2}
Z = e^a + e^{-a} + e^{-a} + e^a = 2e^a + 2e^{-a}
p(1, 1) = p(-1, -1) = e^a / (2e^a + 2e^{-a})
p(-1, 1) = p(1, -1) = e^{-a} / (2e^a + 2e^{-a})
When a > 0, x_1 and x_2 prefer to agree.
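A quick numeric check of the tiny MRF (the value of a is an arbitrary choice here):

import numpy as np

a = 1.0
values = [-1, 1]
unnorm = {(x1, x2): np.exp(a * x1 * x2) for x1 in values for x2 in values}
Z = sum(unnorm.values())                  # = 2*e^a + 2*e^{-a}
p = {k: v / Z for k, v in unnorm.items()}
print(Z, p[(1, 1)], p[(1, -1)])           # agreeing states are more likely when a > 0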
Real-valued weights w_1, \ldots, w_k on feature functions f_1, \ldots, f_k give a log-linear model:
p(X) = \frac{1}{Z} \exp\left( \sum_{i=1}^k w_i f_i(X) \right)
Example: The Ising Model
[Figure: grid of nodes, e.g., x_s, x_t joined by an edge with parameter \theta_{st}]
This is an undirected model with x \in \{0, 1\}^n:
p_\theta(x) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)
A log-linear model with f_s(X) = x_s, f_{st}(X) = x_s x_t, w_s = \theta_s, w_{st} = \theta_{st}
Example: Image Denoising
[Figure: noisy image and its restoration \arg\max_X P(X \mid Y); from Bishop, PRML]
Example: Gaussian Random Field
p(X) \sim N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu)^\top \Sigma^{-1} (X - \mu) \right)
A multivariate Gaussian.
Let \Omega = \Sigma^{-1} be the precision matrix.
x_i, x_j are conditionally independent given all other variables, if and only if \Omega_{ij} = 0
When \Omega_{ij} \neq 0, there is an edge between x_i, x_j
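A sketch of reading conditional independence off the precision matrix; the 3-variable covariance below is an assumed example in which x_1 and x_3 interact only through x_2:

import numpy as np

Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 6))   # Omega[0, 2] == 0: x_1 indep. x_3 given x_2, so no edge (1,3)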
Conditional Independence in Markov Random Fields
X_A and X_B are conditionally independent given X_C iff removing C makes A, B become disconnected in the graph
[Figure: node sets A, C, B with C separating A from B]
Factor Graph
The same undirected graph over A, B, C can correspond to different factorizations; a factor graph makes the factors explicit.
[Figure: the triangle on A, B, C drawn with one factor node f for \psi(A, B, C), and with a factor node f for P(A)P(B)P(C|A, B)]
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Inference by Enumeration
Let X = (X_Q, X_E, X_O) for query, evidence, and other variables.
Infer P(X_Q | X_E)
By definition,
P(X_Q \mid X_E) = \frac{P(X_Q, X_E)}{P(X_E)} = \frac{\sum_{X_O} P(X_Q, X_E, X_O)}{\sum_{X_Q, X_O} P(X_Q, X_E, X_O)}
With k variables to sum out, each taking r values, this is exponential (r^k terms in \sum_{x_1 \ldots x_k})
Variable Elimination
We want \sum_{x_1 \ldots x_k} p(X), where the graphical model factors as
p(X) = \prod_{j=1}^m f_j(X^{(j)})
Each factor f_j operates on X^{(j)} \subseteq X
Eliminating a Variable
Rearrange factors \sum_{x_1 \ldots x_k} f_1 \cdots f_l f^+_{l+1} \cdots f^+_m by whether x_1 \in X^{(j)} (the + marks factors involving x_1)
Obviously equivalent: \sum_{x_2 \ldots x_k} f_1 \cdots f_l \left( \sum_{x_1} f^+_{l+1} \cdots f^+_m \right)
Introduce a new factor f_{m+1} = \sum_{x_1} f^+_{l+1} \cdots f^+_m
f_{m+1} contains the union of variables in f^+_{l+1} \cdots f^+_m except x_1
In fact, x_1 disappears altogether in \sum_{x_2 \ldots x_k} f_1 \cdots f_l f_{m+1}
Compute f_{m+1} once, use it thereafter
Hope: f_{m+1} contains very few variables
Example: Chain Graph
A -> B -> C -> D, binary variables. Compute \sum_{A,B,C} P(A)P(B|A)P(C|B)P(D|C)
Let f_1(A) = P(A). Note f_1 is an array of size two: (P(A=0), P(A=1))
f_2(A, B) is a table of size four: P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1)
\sum_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = \sum_{B,C} f_3(B,C) f_4(C,D) \left( \sum_A f_1(A) f_2(A,B) \right)
f_1(A) f_2(A,B) is an array of size four (match A values):
P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)
f_5(B) = \sum_A f_1(A) f_2(A,B) is an array of size two:
P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1)
P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)
Then \sum_{B,C} f_3(B,C) f_4(C,D) f_5(B) = \sum_C f_4(C,D) \left( \sum_B f_3(B,C) f_5(B) \right), and so on
In the end, f_7(D) = (P(D=0), P(D=1))
Cost: enumeration takes 48 multiplications and 14 additions; variable elimination needs far fewer.
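Below is a minimal numpy sketch (not from the tutorial) of this elimination order on the chain, with made-up CPTs; the matrix products are exactly the f_5, f_6, f_7 steps:

import numpy as np

pA = np.array([0.6, 0.4])                       # f1(A) = P(A)
pBgA = np.array([[0.7, 0.3], [0.2, 0.8]])       # f2[a, b] = P(B=b | A=a)
pCgB = np.array([[0.9, 0.1], [0.4, 0.6]])       # f3[b, c] = P(C=c | B=b)
pDgC = np.array([[0.5, 0.5], [0.1, 0.9]])       # f4[c, d] = P(D=d | C=c)

f5 = pA @ pBgA          # eliminate A: f5(B) = sum_A f1(A) f2(A, B)
f6 = f5 @ pCgB          # eliminate B: f6(C)
f7 = f6 @ pDgC          # eliminate C: f7(D) = (P(D=0), P(D=1))
print(f7, f7.sum())     # sums to 1

# The same answer by brute-force enumeration:
print(np.einsum('a,ab,bc,cd->d', pA, pBgA, pCgB, pDgC))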
Eliminate the variables X_O = X \setminus (X_E \cup X_Q) one by one;
normalize the remaining factor f(X_Q) over X_Q to obtain P(X_Q | X_E)
Summary: Exact Inference
Enumeration
Variable elimination
Estimate the query from samples:
P(X_Q = c_Q \mid X_E) \approx \frac{1}{m} \sum_{i=1}^m 1(x^{(i)}_Q = c_Q), e.g., \frac{1}{m} \sum_{i=1}^m 1(B^{(i)} = 1)
where m is the number of surviving samples
Gibbs Example 1: The Alarm Network
Initialization: fix the evidence, set the other variables arbitrarily,
e.g. X^{(0)} = (B = 0, E = 1, A = 0, J = 0, M = 1)
[Figure: the alarm network B -> A <- E, A -> J, A -> M, with P(B) = 0.001, P(E) = 0.002]
P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
P(J | A) = 0.9, P(J | ~A) = 0.05
P(M | A) = 0.7, P(M | ~A) = 0.01
Gibbs Update
This is equivalent to x_i \sim P(x_i \mid MarkovBlanket(x_i)), where
P(x_i \mid MarkovBlanket(x_i)) \propto P(x_i \mid Pa(x_i)) \prod_{y \in C(x_i)} P(y \mid Pa(y))
where Pa(x) are the parents of x, and C(x) the children of x.
For example, B \sim P(B \mid E = 1, A = 0) \propto P(B) P(A = 0 \mid B, E = 1)
(network and CPTs as above; this updates B, giving X^{(1)})
Gibbs Update
Starting from X^{(1)}, sample A \sim P(A \mid B = 1, E = 1, J = 0, M = 1) to get X^{(2)}
Gibbs Example 2: The Ising Model
p_\theta(x) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)
[Figure: node x_s with neighbors A, B, C, D]
The Gibbs update for x_s given its Markov blanket (its neighbors N(s)):
p(x_s = 1 \mid x_{N(s)}) = \frac{1}{\exp\left( -(\theta_s + \sum_{t \in N(s)} \theta_{st} x_t) \right) + 1}
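A small sketch (not from the tutorial) of this Gibbs update on an n x n grid, using numpy; the grid size and the parameter values theta_s, theta_st are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
n, theta_s, theta_st = 10, 0.0, 1.0
x = rng.integers(0, 2, size=(n, n))      # x in {0, 1}

def neighbors(i, j):
    return [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= i + di < n and 0 <= j + dj < n]

for sweep in range(100):
    for i in range(n):
        for j in range(n):
            field = theta_s + theta_st * sum(x[a, b] for a, b in neighbors(i, j))
            p1 = 1.0 / (np.exp(-field) + 1.0)   # p(x_s = 1 | neighbors)
            x[i, j] = rng.random() < p1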
Gibbs Sampling as a Markov Chain
The transition kernel is T((X_{-i}, x_i') \mid (X_{-i}, x_i)) = p(x_i' \mid X_{-i}), with stationary distribution p(x_Q \mid X_E)
After a burn-in of T steps, use all of X^{(T+1)}, \ldots for inference (they are correlated)
Do not thin
Collapsed Gibbs Sampling
In general, E_p[f(X)] \approx \frac{1}{m} \sum_{i=1}^m f(X^{(i)}) if X^{(i)} \sim p
Split X = (Y, Z), where Z can be integrated out analytically. If so,
E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] \approx \frac{1}{m} \sum_{i=1}^m E_{p(Z|Y^{(i)})}[f(Y^{(i)}, Z)]
if Y^{(i)} \sim p(Y)
Collapsed Gibbs runs the chain on Y only: T((Y_{-i}, y_i') \mid (Y_{-i}, y_i)) = p(y_i' \mid Y_{-i})
Note p(y_i' \mid Y_{-i}) = \int p(y_i', Z \mid Y_{-i}) \, dZ
Example: Collapsed Gibbs Sampling for LDA
[Plate diagram as before]
Collapse \theta, \phi; the Gibbs update is
P(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
(W is the vocabulary size, T the number of topics)
n^{(w_i)}_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
n^{(d_i)}_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
n^{(\cdot)}_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
n^{(d_i)}_{-i,\cdot}: length of document d_i, excluding the current position
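A sketch (not from the tutorial) of one collapsed Gibbs sweep with Griffiths-Steyvers-style count matrices; docs, z, and the sizes T, W are assumed inputs (e.g., from the earlier generative sketch):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.01

def gibbs_sweep(docs, z, T, W):
    """docs[d] is an array of word ids; z[d] the current topic assignments."""
    n_wt = np.zeros((W, T))
    n_dt = np.zeros((len(docs), T))
    for d, ws in enumerate(docs):
        for w, t in zip(ws, z[d]):
            n_wt[w, t] += 1
            n_dt[d, t] += 1
    for d, ws in enumerate(docs):
        for i, w in enumerate(ws):
            t = z[d][i]                       # remove the current assignment
            n_wt[w, t] -= 1
            n_dt[d, t] -= 1
            p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta)
                 * (n_dt[d] + alpha))         # doc-length denominator is constant in j
            t = rng.choice(T, p=p / p.sum())
            z[d][i] = t                       # put the new assignment back
            n_wt[w, t] += 1
            n_dt[d, t] += 1
    return z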
Summary: Markov Chain Monte Carlo
Gibbs sampling
Example: observing x_1 = R, x_2 = G in a two-position hidden Markov model. The directed graphical model is z_1 -> z_2 with emissions z_1 -> x_1 = R and z_2 -> x_2 = G.
As a factor graph, the chain is f_1 - z_1 - f_2 - z_2 with
f_1(z_1) = P(z_1) P(x_1 = R \mid z_1)
f_2(z_1, z_2) = P(z_2 \mid z_1) P(x_2 = G \mid z_2)
Messages
A message is a vector of length K, where K is the number of values x takes.
There are two types of messages:
1. \mu_{f \to x}: message from a factor node f to a variable node x; \mu_{f \to x}(i) is the ith element, i = 1 \ldots K.
2. \mu_{x \to f}: message from a variable node x to a factor node f
Leaf Messages
With prior P(z_1 = 1) = P(z_1 = 2) = 1/2 and emissions P(x \mid z = 1) = (1/2, 1/4, 1/4), P(x \mid z = 2) = (1/4, 1/2, 1/4) over (R, G, B):
\mu_{f_1 \to z_1}(z_1 = 1) = P(z_1 = 1) P(R \mid z_1 = 1) = 1/2 \cdot 1/2 = 1/4
\mu_{f_1 \to z_1}(z_1 = 2) = P(z_1 = 2) P(R \mid z_1 = 2) = 1/2 \cdot 1/4 = 1/8
Message from Variable to Factor
Let x be in factor f_s:
\mu_{x \to f_s}(x) = \prod_{f \in ne(x) \setminus f_s} \mu_{f \to x}(x)
ne(x) \setminus f_s are the factors connected to x, excluding f_s.
In the example (an empty product is 1):
\mu_{z_1 \to f_2}(z_1 = 1) = \mu_{f_1 \to z_1}(z_1 = 1) = 1/4
\mu_{z_1 \to f_2}(z_1 = 2) = \mu_{f_1 \to z_1}(z_1 = 2) = 1/8
Message from Factor to Variable
Let x be in factor f_s, and let the other variables in f_s be x_{1:M}:
\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m=1}^M \mu_{x_m \to f_s}(x_m)
In the example:
\mu_{f_2 \to z_2}(s) = \sum_{s'=1}^2 \mu_{z_1 \to f_2}(s') f_2(z_1 = s', z_2 = s)
= 1/4 \cdot P(z_2 = s \mid z_1 = 1) P(x_2 = G \mid z_2 = s) + 1/8 \cdot P(z_2 = s \mid z_1 = 2) P(x_2 = G \mid z_2 = s)
We get
\mu_{f_2 \to z_2}(z_2 = 1) = 1/32
\mu_{f_2 \to z_2}(z_2 = 2) = 1/8
Up to Root, Back Down
z_2 is the root; it now sends messages back down. It has no other incoming factors, so the product is empty:
\mu_{z_2 \to f_2}(z_2 = 1) = 1
\mu_{z_2 \to f_2}(z_2 = 2) = 1
Keep Passing Down
\mu_{f_2 \to z_1}(s) = \sum_{s'=1}^2 \mu_{z_2 \to f_2}(s') f_2(z_1 = s, z_2 = s')
= 1 \cdot P(z_2 = 1 \mid z_1 = s) P(x_2 = G \mid z_2 = 1) + 1 \cdot P(z_2 = 2 \mid z_1 = s) P(x_2 = G \mid z_2 = 2)
We get
\mu_{f_2 \to z_1}(z_1 = 1) = 7/16
\mu_{f_2 \to z_1}(z_1 = 2) = 3/8
From Messages to Marginals
Once a variable receives all incoming messages, we compute its marginal as
p(x) \propto \prod_{f \in ne(x)} \mu_{f \to x}(x)
In this example (with \circ the element-wise product):
P(z_1 \mid x_1, x_2) \propto \mu_{f_1 \to z_1} \circ \mu_{f_2 \to z_1} = (1/4, 1/8) \circ (7/16, 3/8) = (7/64, 3/64) \propto (0.7, 0.3)
P(z_2 \mid x_1, x_2) \propto \mu_{f_2 \to z_2} = (1/32, 1/8) \propto (0.2, 0.8)
One can also compute the marginal of the set of variables x_s involved in a factor f_s:
p(x_s) \propto f_s(x_s) \prod_{x \in ne(f_s)} \mu_{x \to f_s}(x)
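The whole example fits in a few lines of numpy. This is a sketch, not tutorial code; in particular, the transition matrix P(z_2 | z_1) is not stated explicitly in the slides and is inferred here from the example's arithmetic:

import numpy as np

pi = np.array([0.5, 0.5])                      # P(z_1)
trans = np.array([[0.25, 0.75], [0.5, 0.5]])   # P(z_2 | z_1), rows indexed by z_1 (inferred)
emitR = np.array([0.5, 0.25])                  # P(x = R | z)
emitG = np.array([0.25, 0.5])                  # P(x = G | z)

f1 = pi * emitR                                # f1(z1) = P(z1) P(R | z1) -> (1/4, 1/8)
f2 = trans * emitG                             # f2[z1, z2] = P(z2 | z1) P(G | z2)

m_f2_z2 = f1 @ f2                              # factor-to-variable: (1/32, 1/8)
m_f2_z1 = f2 @ np.ones(2)                      # back-down message: (7/16, 3/8)

print(f1 * m_f2_z1 / (f1 * m_f2_z1).sum())     # P(z1 | x) = (0.7, 0.3)
print(m_f2_z2 / m_f2_z2.sum())                 # P(z2 | x) = (0.2, 0.8)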
Handling Evidence
Observing x = v, set messages \mu_{x \to f}(x) = 0 for all x \neq v
Observing X_E, the unnormalized product then satisfies
\prod_{f \in ne(x)} \mu_{f \to x}(x) \propto p(x, X_E)
Loopy Belief Propagation
Sum-product is exact on trees; on loopy graphs, one runs the same message updates anyway.
Example: the Ising model. The random variables x take values in \{0, 1\}:
p_\theta(x) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)
The Conditional
For the Ising model this reduces to
p(x_s = 1 \mid x_{N(s)}) = \frac{1}{\exp\left( -(\theta_s + \sum_{t \in N(s)} \theta_{st} x_t) \right) + 1}
Replacing each neighbor value x_t by its mean \mu_t gives the update
\mu_s \leftarrow \frac{1}{\exp\left( -(\theta_s + \sum_{t \in N(s)} \theta_{st} \mu_t) \right) + 1}
Exponential Family
Let \phi(X) = (\phi_1(X), \ldots, \phi_d(X))^\top be d sufficient statistics, where \phi_i : \mathcal{X} \to \mathbb{R};
\phi_i(X) is sometimes called a feature function.
Let \theta = (\theta_1, \ldots, \theta_d)^\top \in \mathbb{R}^d be the canonical parameters.
The exponential family is the set of densities
p_\theta(x) = \exp\left( \langle \theta, \phi(x) \rangle - A(\theta) \right)
where the log partition function is A(\theta) = \log \int \exp\left( \langle \theta, \phi(x) \rangle \right) \nu(dx), i.e., A = \log Z
Minimal vs. Overcomplete Models
Example: Bernoulli with mean \mu. Minimal: \phi(x) = x.
Can be rewritten as
p(x) = \exp\left( x \log \mu + (1 - x) \log(1 - \mu) \right)
with \phi_1(x) = x, \phi_2(x) = 1 - x, \theta_1 = \log \mu, \theta_2 = \log(1 - \mu), and A(\theta) = 0.
Overcomplete: \phi_1 + \phi_2 = 1 makes the sufficient statistics affinely dependent, so \theta is not identifiable.
Exponential Family Example 2: Ising Model
p_\theta(x) = \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t - A(\theta) \right)
This is a regular (\Theta = \mathbb{R}^d), minimal exponential family.
Exponential Family Example 3: Potts Model
Indicator functions f_{sj}(x) = 1 if x_s = j and 0 otherwise, and
f_{stjk}(x) = 1 if x_s = j \wedge x_t = k, and 0 otherwise.
p_\theta(x) = \exp\left( \sum_{sj} \theta_{sj} f_{sj}(x) + \sum_{stjk} \theta_{stjk} f_{stjk}(x) - A(\theta) \right)
With r states per node, d = r|V| + r^2|E|.
This representation is overcomplete: \sum_{j=0}^{r-1} \phi_{sj}(x) = 1 for any s \in V and all x,
e.g., \phi_{sj}(x) = f_{sj}(x) = 1 if x_s = j and 0 otherwise.
The mean parameters have a nice interpretation: E[\phi_{sj}(x)] = P(x_s = j)
Mean Parameters
The mean parameter is \mu_i = E_p[\phi_i(x)] = \int \phi_i(x) p(x) \, dx
Let M be the set of realizable mean vectors \mu.
If \mu^{(1)}, \mu^{(2)} \in M, there must exist p^{(1)}, p^{(2)} realizing them; the mixture \lambda p^{(1)} + (1 - \lambda) p^{(2)} realizes \lambda \mu^{(1)} + (1 - \lambda) \mu^{(2)}.
Therefore M is convex.
Example: The First Two Moments
Let \phi_1(x) = x, \phi_2(x) = x^2
Since \mu_2 - \mu_1^2 = Var(x) \geq 0 for any p,
M is not \mathbb{R}^2 but rather the subset \{\mu_1 \in \mathbb{R}, \; \mu_2 \geq \mu_1^2\}.
The Marginal Polytope
Recall M = \{\mu \in \mathbb{R}^d \mid \mu = \sum_x \phi(x) p(x) \text{ for some } p\}
For discrete x, M is a convex polytope. Example: two binary variables, so there are only 4 different x = (x_1, x_2).
The coordinates of \mu are node marginals such as \mu_2 = E_p[1(x_2 = 1)] and edge marginals such as \mu_{12} = E_p[1(x_1 = x_2 = 1)],
hence the name.
The Log Partition Function A
Nice property:
\frac{\partial A(\theta)}{\partial \theta_i} = E_\theta[\phi_i(x)]
Conjugate Duality
The conjugate dual function A^* to A is defined as
A^*(\mu) = \sup_{\theta \in \Theta} \langle \theta, \mu \rangle - A(\theta)
Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
Then A^*(\mu) = -H(p_{\theta(\mu)}), the negative entropy,
where \theta(\mu) is the canonical parameter satisfying \mu = E_{\theta(\mu)}[\phi(x)].
Example: Conjugate Dual for Bernoulli
By definition,
A^*(\mu) = \sup_{\theta \in \mathbb{R}} \mu\theta - \log(1 + \exp(\theta))
The solution is A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu),
i.e., the negative entropy.
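A quick numeric sanity check (not from the tutorial) that the supremum equals the negative entropy, using scipy for the one-dimensional optimization:

import numpy as np
from scipy.optimize import minimize_scalar

mu = 0.3
sup = -minimize_scalar(lambda t: -(mu * t - np.log1p(np.exp(t)))).fun
print(sup, mu * np.log(mu) + (1 - mu) * np.log(1 - mu))  # both approx -0.6109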
Inference with Variational Representation
A(\theta) = \sup_{\mu \in M} \langle \theta, \mu \rangle - A^*(\mu), attained by \mu = E_\theta[\phi(x)].
Inference becomes optimization: the maximizer recovers the marginals \mu = E[\phi_{ij}(x)] under the standard overcomplete representation.
Two issues: M is difficult to characterize, and A^*(\mu) has no explicit form in general.
The Mean Field Algorithm
Restrict to a tractable subfamily F: set \theta_i = 0 whenever \phi_i involves edges, i.e., fully factorized distributions
p(x) = \prod_{s \in V} p(x_s; \mu_s)
The Geometry of M(F)
M(F), the set of mean parameters realizable by the tractable subfamily, is a nonconvex inner approximation to M.
Example: a boundary point \mu = 1 is realized as a limit of the series p(x) = \exp(\theta_1 x_1 + \theta_2 x_2 - A(\theta)) where \theta_1, \theta_2 \to \infty.
Mean field solves
\max_{\mu \in M(F)} \langle \theta, \mu \rangle - A^*(\mu)
attained by the solution to the inference problem \mu = E[\phi(x)]; because M(F) \subseteq M, this lower-bounds A(\theta).
For \mu \in M(F), A^*(\mu) = -H(p_\mu) is available in closed form since p_\mu factorizes.
Example: Mean Field for Ising Model
The mean parameters for the Ising model are the node and edge marginals:
\mu_s = p(x_s = 1), \quad \mu_{st} = p(x_s = 1, x_t = 1)
For fully factorized distributions, \mu_{st} = \mu_s \mu_t, and
-A^*(\mu) = \sum_{s \in V} H(\mu_s) = -\sum_{s \in V} \left( \mu_s \log \mu_s + (1 - \mu_s) \log(1 - \mu_s) \right)
The mean field problem is
A(\theta) \geq \max_{(\mu_1 \ldots \mu_m) \in [0,1]^m} \left( \sum_{s \in V} \theta_s \mu_s + \sum_{(s,t) \in E} \theta_{st} \mu_s \mu_t + \sum_{s \in V} H(\mu_s) \right)
Example: Mean Field for Ising Model
L(\mu) = \max_{(\mu_1 \ldots \mu_m) \in [0,1]^m} \left( \sum_{s \in V} \theta_s \mu_s + \sum_{(s,t) \in E} \theta_{st} \mu_s \mu_t + \sum_{s \in V} H(\mu_s) \right)
Coordinate ascent: fixing the other \mu's and setting \partial L / \partial \mu_s = 0 gives the update
\mu_s \leftarrow \frac{1}{1 + \exp\left( -(\theta_s + \sum_{(s,t) \in E} \theta_{st} \mu_t) \right)}
as we've seen before.
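A sketch (not from the tutorial) of this coordinate ascent loop on a small random Ising model; the graph, parameters, and iteration count are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
m = 9
theta_s = rng.normal(size=m)
theta_st = np.triu(rng.normal(size=(m, m)), k=1) * (rng.random((m, m)) < 0.3)
theta_st = theta_st + theta_st.T                 # symmetric edge weights

mu = np.full(m, 0.5)
for it in range(200):
    for s in range(m):
        field = theta_s[s] + theta_st[s] @ mu    # theta_s + sum_t theta_st mu_t
        mu[s] = 1.0 / (1.0 + np.exp(-field))
print(np.round(mu, 3))                           # approximate p(x_s = 1)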
The sum-product algorithm makes two approximations: it replaces M with an outer relaxation L, and -A^*(\mu) with an approximation:
A(\theta) \approx \sup_{\tau \in L} \langle \theta, \tau \rangle - A^*(\tau)
The Outer Relaxation
For true marginals, \tau_{sj} = p(x_s = j) and \tau_{stjk} = p(x_s = j, x_t = k).
Now consider \tau \in \mathbb{R}^d_+ satisfying only node normalization and edge-node marginal consistency conditions:
\sum_{j=0}^{r-1} \tau_{sj} = 1 \quad \forall s \in V
\sum_{k=0}^{r-1} \tau_{stjk} = \tau_{sj} \quad \forall s, t \in V, \; j = 0 \ldots r-1
\sum_{j=0}^{r-1} \tau_{stjk} = \tau_{tk} \quad \forall s, t \in V, \; k = 0 \ldots r-1
This defines the local consistency polytope L \supseteq M.
The Bethe Entropy Approximation
On a tree, the joint decomposes by node and edge marginals:
p_\tau(x) = \prod_{s \in V} \tau_{s x_s} \prod_{(s,t) \in E} \frac{\tau_{st x_s x_t}}{\tau_{s x_s} \tau_{t x_t}}
with entropy
H_{Bethe}(p_\tau) = -\sum_{s \in V} \sum_{j=0}^{r-1} \tau_{sj} \log \tau_{sj} - \sum_{(s,t) \in E} \sum_{j,k} \tau_{stjk} \log \frac{\tau_{stjk}}{\tau_{sj} \tau_{tk}}
On a loopy graph, H_{Bethe} is not a true entropy. The second approximation in sum-product is to replace -A^*(\tau) with H_{Bethe}(p_\tau).
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
A(\theta) \approx \sup_{\tau \in L} \langle \theta, \tau \rangle + H_{Bethe}(p_\tau)
The loopy sum-product updates are fixed-point iterations for this problem.
At the solution, the approximation to A(\theta) is not guaranteed to be either an upper or a lower bound of A(\theta).
Maximizing Problems
1. One can take the individually most likely states
z^*_n = \arg\max_k p(z_n = k \mid x_{1:N})
However z^*_{1:N} as a whole may not be the best;
in fact z^*_{1:N} can even have zero probability!
2. An alternative is to find
z^*_{1:N} = \arg\max_{z_{1:N}} p(z_{1:N} \mid x_{1:N})
with the max-product algorithm: replace sum with max in the factor-to-variable messages.
Intermediate: The Max-Product Algorithm
\mu_{f_s \to x}(x) = \max_{x_1} \cdots \max_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m=1}^M \mu_{x_m \to f_s}(x_m)
The variable-to-factor messages are unchanged:
\mu_{x_m \to f_s}(x_m) = \prod_{f \in ne(x_m) \setminus f_s} \mu_{f \to x_m}(x_m)
Leaf messages: \mu_{x \, leaf \to f}(x) = 1, \quad \mu_{f \, leaf \to x}(x) = f(x)
Intermediate: The Max-Product Algorithm
When computing
\mu_{f_s \to x}(x) = \max_{x_1} \cdots \max_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m=1}^M \mu_{x_m \to f_s}(x_m)
for each x value, we separately create M pointers back to the values of x_1, \ldots, x_M that achieve the maximum.
On the earlier chain example, the leaf and variable-to-factor messages are the same as in sum-product:
\mu_{f_1 \to z_1}(z_1 = 1) = P(z_1 = 1) P(R \mid z_1 = 1) = 1/2 \cdot 1/2 = 1/4
\mu_{f_1 \to z_1}(z_1 = 2) = P(z_1 = 2) P(R \mid z_1 = 2) = 1/2 \cdot 1/4 = 1/8
\mu_{z_1 \to f_2}(z_1 = 1) = 1/4
\mu_{z_1 \to f_2}(z_1 = 2) = 1/8
Intermediate: The Max-Product Algorithm
\mu_{f_2 \to z_2}(z_2 = 1)
= \max_{z_1} f_2(z_1, z_2 = 1) \, \mu_{z_1 \to f_2}(z_1)
= \max_{z_1} P(z_2 = 1 \mid z_1) P(x_2 = G \mid z_2 = 1) \, \mu_{z_1 \to f_2}(z_1)
= \max(1/4 \cdot 1/4 \cdot 1/4, \; 1/2 \cdot 1/4 \cdot 1/8) = 1/64
Back pointer for z_2 = 1: either z_1 = 1 or z_1 = 2 (a tie)
Intermediate: The Max-Product Algorithm
The other element of the same message:
\mu_{f_2 \to z_2}(z_2 = 2)
= \max_{z_1} f_2(z_1, z_2 = 2) \, \mu_{z_1 \to f_2}(z_1)
= \max_{z_1} P(z_2 = 2 \mid z_1) P(x_2 = G \mid z_2 = 2) \, \mu_{z_1 \to f_2}(z_1)
= \max(3/4 \cdot 1/2 \cdot 1/4, \; 1/2 \cdot 1/2 \cdot 1/8) = 3/32
Back pointer for z_2 = 2: z_1 = 1
Intermediate: The Max-Product Algorithm
\mu_{f_2 \to z_2} = (1/64 \; [z_1 = 1, 2], \; 3/32 \; [z_1 = 1])
At the root z_2,
\max_{s=1,2} \mu_{f_2 \to z_2}(s) = 3/32
so z^*_2 = 2, and following the back pointer, z^*_1 = 1:
z^*_{1:2} = \arg\max_{z_{1:2}} p(z_{1:2} \mid x_{1:2}) = (1, 2)
In this example, sum-product and max-product produce the same best sequence; in general they differ.
From Max-Product to Max-Sum
The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow.
\mu_{f_s \to x}(x) = \max_{x_1 \ldots x_M} \left( \log f_s(x, x_1, \ldots, x_M) + \sum_{m=1}^M \mu_{x_m \to f_s}(x_m) \right)
\mu_{x_m \to f_s}(x_m) = \sum_{f \in ne(x_m) \setminus f_s} \mu_{f \to x_m}(x_m)
Leaf messages: \mu_{x \, leaf \to f}(x) = 0, \quad \mu_{f \, leaf \to x}(x) = \log f(x)
When at the root,
\log p^{max} = \max_x \sum_{f \in ne(x)} \mu_{f \to x}(x)
The back pointers are the same.
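A sketch (not from the tutorial) of max-sum with back pointers on the same two-node example; as before, the transition matrix is inferred from the example's arithmetic:

import numpy as np

pi = np.array([0.5, 0.5])
trans = np.array([[0.25, 0.75], [0.5, 0.5]])   # P(z2 | z1), inferred assumption
emitR = np.array([0.5, 0.25])
emitG = np.array([0.25, 0.5])

m1 = np.log(pi * emitR)                        # mu_{f1 -> z1}: log(1/4), log(1/8)
scores = np.log(trans * emitG) + m1[:, None]   # scores[z1, z2] in log space
m2 = scores.max(axis=0)                        # mu_{f2 -> z2}: log(1/64), log(3/32)
back = scores.argmax(axis=0)                   # back pointers into z1

z2 = int(m2.argmax())                          # root decision
z1 = int(back[z2])                             # follow the back pointer
print(z1 + 1, z2 + 1, np.exp(m2.max()))        # states (1, 2), p_max = 3/32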
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Parameter Learning
Given iid data x_1, \ldots, x_n from an exponential family
p_\theta(x) = \exp\left( \langle \theta, \phi(x) \rangle - A(\theta) \right)
the scaled log-likelihood is
\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i) = \left\langle \theta, \frac{1}{n} \sum_{i=1}^n \phi(x_i) \right\rangle - A(\theta) = \langle \theta, \hat{\mu} \rangle - A(\theta)
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n \phi(x_i) is the mean parameter of the empirical distribution on x_1 \ldots x_n. Clearly \hat{\mu} \in M.
Maximum likelihood:
\theta_{ML} = \arg\sup_\theta \langle \theta, \hat{\mu} \rangle - A(\theta)
The solution satisfies \mu(\theta_{ML}) = \hat{\mu}: the exponential family density whose mean parameter matches \hat{\mu}.
When \hat{\mu} \in M^\circ and the family is minimal, there is a unique maximum likelihood solution \theta_{ML}.
Partially Observed Data
Full data (x_1, z_1) \ldots (x_n, z_n), but we only observe x_1 \ldots x_n.
The incomplete log-likelihood is \ell(\theta) = \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i), where p_\theta(x_i) = \int p_\theta(x_i, z) \, dz.
Can be written as \ell(\theta) = \frac{1}{n} \sum_{i=1}^n A_{x_i}(\theta) - A(\theta)
with one log partition function, associated with p_\theta(z \mid x_i), per item:
A_{x_i}(\theta) = \log \int \exp\left( \langle \theta, \phi(x_i, z) \rangle \right) dz
Each A_{x_i} has its own conjugate dual A^*_{x_i}, giving the lower bound
A_{x_i}(\theta) \geq \langle \mu_i, \theta \rangle - A^*_{x_i}(\mu_i), \quad \mu_i \in M_{x_i}
Plugging in,
\ell(\theta) = \frac{1}{n} \sum_{i=1}^n A_{x_i}(\theta) - A(\theta) \geq \frac{1}{n} \sum_{i=1}^n \left( \langle \mu_i, \theta \rangle - A^*_{x_i}(\mu_i) \right) - A(\theta) \equiv L(\mu_1, \ldots, \mu_n, \theta)
Exact EM: The E-Step
The EM algorithm is coordinate ascent on L(\mu_1, \ldots, \mu_n, \theta).
E-step:
\mu_i \leftarrow \arg\max_{\mu_i \in M_{x_i}} L(\mu_1, \ldots, \mu_n, \theta)
Equivalently, \arg\max_{\mu_i \in M_{x_i}} \langle \mu_i, \theta \rangle - A^*_{x_i}(\mu_i)
The solution is the mean parameter \mu_i(\theta) = E_\theta[\phi(x_i, z)]
Exact EM: The M-Step
\theta \leftarrow \arg\max_\theta L(\mu_1, \ldots, \mu_n, \theta) = \arg\max_\theta \langle \theta, \hat{\mu} \rangle - A(\theta)
where \hat{\mu} = \frac{1}{n} \sum_{i=1}^n \mu_i
Variational EM
The E-step may be intractable: can't maximize
\max_{\mu_i \in M_{x_i}} \langle \mu_i, \theta \rangle - A^*_{x_i}(\mu_i)
Mean field variational EM instead solves
\max_{\mu_i \in M_{x_i}(F)} \langle \mu_i, \theta \rangle - A^*_{x_i}(\mu_i)
up to a local maximum;
recall M_{x_i}(F) is an inner approximation to M_{x_i}
Structure Learning
Which features enter the model p(X) \propto \exp\left( \sum_{i \in M} \theta_i f_i(X) \right)? Score a candidate structure M by \ln P(Data \mid M, \theta).
Dropping an edge feature sets \theta_{ij} = 0; equivalently, x_i, x_j are conditionally independent given the other variables.
[Figure: candidate structures on x_1, x_2, x_3, x_4]
Structure Learning for Gaussian Random Fields
Let the data be X^{(1)}, \ldots, X^{(n)} \sim N(\mu, \Sigma)
The sample covariance is
S = \frac{1}{n} \sum_{i=1}^n (X^{(i)} - \bar{X})(X^{(i)} - \bar{X})^\top
where \bar{X} is the sample mean
S^{-1} is not a good estimate of \Omega when n is small
Structure Learning for Gaussian Random Fields
Instead, estimate a sparse precision matrix by \ell_1-penalized maximum likelihood:
\hat{\Omega} = \arg\max_\Omega \; \log \det \Omega - tr(S\Omega) - \lambda \sum_{i \neq j} |\Omega_{ij}|
Known as glasso (the graphical lasso)
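A sketch (not from the tutorial) using scikit-learn's GraphicalLasso; the chain-structured true precision matrix and the penalty value alpha are assumed for illustration:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
Omega = np.array([[2.0, 0.8, 0.0],
                  [0.8, 2.0, 0.8],
                  [0.0, 0.8, 2.0]])            # true precision: a chain x1 - x2 - x3
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega), size=500)

model = GraphicalLasso(alpha=0.05).fit(X)
print(np.round(model.precision_, 2))           # near-zero (1,3) entry: no edge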
Recap
Representation: BN or MRF
Conditional independence from the graph
Do inference p(X_Q | X_E), in general with X_Q, X_E \subset \{x_1, \ldots, x_n\}
If p(x_1, \ldots, x_n) is not given, estimate it from data