
All of Graphical Models

Xiaojin Zhu
Department of Computer Sciences
University of Wisconsin-Madison, USA
Tutorial at ICMLA 2011
The Whole Tutorial in One Slide

Given GM = joint distribution p(x_1, ..., x_n)
Do inference = p(X_Q | X_E), in general X_Q ∪ X_E ⊆ {x_1 ... x_n}
If p(x_1, ..., x_n) not given, estimate it from data
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Life without Graphical Models
. . . is fine mathematically:

- The universe is reduced to a set of random variables x_1, ..., x_n
  e.g., x_1, ..., x_{n-1} can be the discrete or continuous features
  e.g., x_n ≡ y can be the discrete class label
- The joint p(x_1, ..., x_n) completely describes how the universe works
- Machine learning: estimate p(x_1, ..., x_n) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_n^(i))
- Prediction: y* = argmax p(x_n | x_1*, ..., x_{n-1}*), a.k.a. inference
  by the definition of conditional probability
    p(x_n | x_1*, ..., x_{n-1}*) = p(x_1*, ..., x_{n-1}*, x_n) / Σ_v p(x_1*, ..., x_{n-1}*, x_n = v)
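To make the cost of this "no structure" approach concrete, here is a minimal sketch (not from the tutorial; the joint table and its values are made up) that stores the full joint over n binary variables and answers the prediction query by the conditional-probability definition above. The table has 2^n entries, which is exactly what graphical models help avoid.

```python
# A minimal sketch: brute-force prediction from a full joint table over n
# binary variables. Storage is 2^n, the exponential cost graphical models avoid.
import itertools
import random

n = 4  # number of binary variables; the table has 2^n entries

# A hypothetical joint distribution p(x_1, ..., x_n): random nonnegative
# weights, normalized to sum to 1.
random.seed(0)
weights = {x: random.random() for x in itertools.product([0, 1], repeat=n)}
total = sum(weights.values())
joint = {x: w / total for x, w in weights.items()}

def predict(x_rest):
    """Return argmax_v p(x_n = v | x_1, ..., x_{n-1} = x_rest) by enumeration."""
    scores = {v: joint[tuple(x_rest) + (v,)] for v in (0, 1)}
    z = sum(scores.values())              # = p(x_1, ..., x_{n-1})
    posterior = {v: s / z for v, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

print(predict([1, 0, 1]))
```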
Conclusion

Life without graphical models is just fine

So why are we still here?


Life can be Better for Computer Scientists

- Given GM = joint distribution p(x_1, ..., x_n)
  exponential naive storage (2^n for binary r.v.)
  hard to interpret (conditional independence)
- Do inference = p(X_Q | X_E), in general X_Q ∪ X_E ⊆ {x_1 ... x_n}
  Often can't do it computationally
- If p(x_1, ..., x_n) not given, estimate it from data
  Can't do it either
Acknowledgments Before We Start
Much of this tutorial is based on

- Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009
- Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. FTML, 2008
- Bishop, Pattern Recognition and Machine Learning. Springer, 2006
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Graphical-Model-Nots

- Graphical models are the study of probabilistic models
- Just because there is a graph with nodes and edges doesn't mean it's a GM
These are not graphical models:
  neural network, decision tree, network flow, HMM template
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Bayesian Network

- A directed graph has nodes X = (x_1, ..., x_n), some of them connected by directed edges x_i → x_j
- A cycle is a directed path x_1 → ... → x_k where x_1 = x_k
- A directed acyclic graph (DAG) contains no cycles
- A Bayesian network on the DAG is a family of distributions satisfying
    {p | p(X) = ∏_i p(x_i | Pa(x_i))}
  where Pa(x_i) is the set of parents of x_i.
- p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i
- By specifying the CPDs for all i, we specify a particular distribution p(X)
Example: Alarm
Binary variables

[Network: B → A ← E, A → J, A → M, with P(B) = 0.001, P(E) = 0.002]
P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001
P(J | A) = 0.9
P(J | ~A) = 0.05
P(M | A) = 0.7
P(M | ~A) = 0.01

P(B, ~E, A, J, ~M)
  = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
  = 0.001 × (1 - 0.002) × 0.94 × 0.9 × (1 - 0.7)
  ≈ 0.000253
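A small sketch of this computation (assuming the CPTs above, with variables coded as 0/1): it multiplies the five CPD entries and reproduces the value 0.000253.

```python
# Evaluate the factored joint of the alarm network from its CPTs.
P_B = {1: 0.001, 0: 0.999}
P_E = {1: 0.002, 0: 0.998}
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}   # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}   # P(M=1 | A)

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    return (P_B[b] * P_E[e]
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j)
            * bernoulli(P_M[a], m))

print(joint(1, 0, 1, 1, 0))  # ~0.000253
```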
Example: Naive Bayes

[Graphical model: class y with children x_1, ..., x_d; plate representation on the right]

- p(y, x_1, ..., x_d) = p(y) ∏_{i=1}^d p(x_i | y)
- Used extensively in natural language processing
- Plate representation on the right
No Causality Whatsoever

[Two networks: A → B with P(A) = a, P(B|A) = b, P(B|~A) = c, and the equivalent B → A with
  P(B) = ab + (1-a)c
  P(A|B) = ab / (ab + (1-a)c)
  P(A|~B) = a(1-b) / (1 - ab - (1-a)c)]

- The two BNs are equivalent in all respects
- Bayesian networks imply no causality at all
- They only encode the joint probability distribution (hence correlation)
- However, people tend to design BNs based on causal relations
Example: Latent Dirichlet Allocation (LDA)

[Plate diagram: T topics φ_t, D documents with per-document θ, and for each of the N_d word positions a topic z and word w]

A generative model for p(θ, φ, z, w | α, β):
  For each topic t
    φ_t ∼ Dirichlet(β)
  For each document d
    θ ∼ Dirichlet(α)
    For each word position in d
      topic z ∼ Multinomial(θ)
      word w ∼ Multinomial(φ_z)
Inference goals: p(z | w, α, β), argmax_{θ,φ} p(θ, φ | w, α, β)
Some Topics by LDA on the Wish Corpus

[Figure: top words under p(word | topic) for three example topics, about troops, the election, and love]
Conditional Independence

- Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or P(A|B) = P(A) (the two are equivalent)
- Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or P(A | B, C) = P(A | C) (the two are equivalent)
- This extends to groups of r.v.s
- Conditional independence in a BN is precisely specified by d-separation (directed separation)
d-Separation Case 1: Tail-to-Tail

[Figure: A ← C → B, with C unobserved (left) and observed (right)]

- A, B in general dependent
- A, B conditionally independent given C
- C is a tail-to-tail node, blocks the undirected path A-B
d-Separation Case 2: Head-to-Tail

[Figure: A → C → B, with C unobserved (left) and observed (right)]

- A, B in general dependent
- A, B conditionally independent given C
- C is a head-to-tail node, blocks the path A-B
d-Separation Case 3: Head-to-Head

[Figure: A → C ← B, with C unobserved (left) and observed (right)]

- A, B in general independent
- A, B conditionally dependent given C, or any of C's descendants
- C is a head-to-head node, unblocks the path A-B
d-Separation

- Any groups of nodes A and B are conditionally independent given another group C, if all undirected paths from any node in A to any node in B are blocked
- A path is blocked if it includes a node x such that either
    the path is head-to-tail or tail-to-tail at x and x ∈ C, or
    the path is head-to-head at x, and neither x nor any of its descendants is in C.
d-Separation Example 1

- The path from A to B is not blocked by either E or F
- A, B dependent given C

[Figure: graph on A, B, C, E, F]

d-Separation Example 2

- The path from A to B is blocked both at E and F
- A, B conditionally independent given F

[Figure: graph on A, B, C, E, F]
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Markov Random Fields

- The efficiency of the directed graphical model (acyclic graph, locally normalized CPDs) also makes it restrictive
- A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)
- Define a nonnegative potential function ψ_C : X_C → R^+
- An undirected graphical model (aka Markov Random Field) on the graph is a family of distributions satisfying
    {p | p(X) = (1/Z) ∏_C ψ_C(X_C)}
- Z = ∫ ∏_C ψ_C(X_C) dX is the partition function
Example: A Tiny Markov Random Field

[x_1 --- x_2]

- x_1, x_2 ∈ {-1, 1}
- A single clique ψ_C(x_1, x_2) = exp(a x_1 x_2)
- p(x_1, x_2) = (1/Z) exp(a x_1 x_2)
- Z = e^a + e^{-a} + e^{-a} + e^a
- p(1, 1) = p(-1, -1) = e^a / (2e^a + 2e^{-a})
- p(-1, 1) = p(1, -1) = e^{-a} / (2e^a + 2e^{-a})
- When the parameter a > 0, favor homogeneous chains
- When the parameter a < 0, favor inhomogeneous chains
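A minimal sketch of this two-node MRF: enumerate the four configurations, compute Z, and print the normalized probabilities for a chosen coupling a (the value a = 1 is an illustrative choice).

```python
# Enumerate the tiny two-node MRF, compute the partition function, and
# print p(x1, x2) for a chosen coupling a.
import math

a = 1.0  # a > 0 favors homogeneous configurations (x1 == x2)

configs = [(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)]
psi = {c: math.exp(a * c[0] * c[1]) for c in configs}   # single clique potential
Z = sum(psi.values())                                    # partition function

for c in configs:
    print(c, psi[c] / Z)
# With a = 1: p(1,1) = p(-1,-1) ~ 0.44, the two inhomogeneous
# configurations each get ~ 0.06.
```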


Log Linear Models

- Real-valued feature functions f_1(X), ..., f_k(X)
- Real-valued weights w_1, ..., w_k
    p(X) = (1/Z) exp( Σ_{i=1}^k w_i f_i(X) )
Example: The Ising Model

[Grid of binary nodes; x_s and x_t joined by an edge with parameter θ_st]

This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
- f_s(X) = x_s, f_st(X) = x_s x_t
- w_s = θ_s, w_st = θ_st
Example: Image Denoising

[Figure: a noisy image Y and the restored image argmax_X P(X | Y). From Bishop PRML]
Example: Gaussian Random Field

  p(X) ∼ N(μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( -(1/2) (X - μ)^T Σ^{-1} (X - μ) )

- Multivariate Gaussian
- The n × n covariance matrix Σ is positive semi-definite
- Let Ω = Σ^{-1} be the precision matrix
- x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0
- When Ω_ij ≠ 0, there is an edge between x_i, x_j
Conditional Independence in Markov Random Fields

- Two groups of variables A, B are conditionally independent given another group C if, after removing C and all edges involving C, A and B become disconnected

[Figure: node groups A and B separated by C]
Factor Graph

- For both directed and undirected graphical models
- Bipartite: edges between a variable node and a factor node
- Factors represent computation

[Figure: an undirected clique on A, B, C becomes a factor graph with a single factor f = ψ(A, B, C); the directed fragment A → C ← B becomes a factor graph with factor f = P(A)P(B)P(C | A, B)]
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Inference by Enumeration

- Let X = (X_Q, X_E, X_O) for query, evidence, and other variables.
- Infer P(X_Q | X_E)
- By definition
    P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O)
- Summing an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms
Details of the summing problem

- There are a bunch of other variables x_1, ..., x_k
- Each variable x_i is summed over the r values it can take
- This is exponential (r^k terms): Σ_{x_1 ... x_k}
- We want Σ_{x_1 ... x_k} p(X)
- For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^m f_j(X^(j))
- Each factor f_j operates on X^(j) ⊆ X
Eliminating a Variable

- Rearrange the factors in Σ_{x_1 ... x_k} f_1^- ... f_l^- f_{l+1}^+ ... f_m^+ by whether x_1 ∈ X^(j)
- Obviously equivalent: Σ_{x_2 ... x_k} f_1^- ... f_l^- ( Σ_{x_1} f_{l+1}^+ ... f_m^+ )
- Introduce a new factor f_{m+1} = Σ_{x_1} f_{l+1}^+ ... f_m^+
- f_{m+1} contains the union of the variables in f_{l+1}^+ ... f_m^+ except x_1
- In fact, x_1 disappears altogether in Σ_{x_2 ... x_k} f_1^- ... f_l^- f_{m+1}
- Dynamic programming: compute f_{m+1} once, use it thereafter
- Hope: f_{m+1} contains very few variables
- Recursively eliminate other variables in turn
Example: Chain Graph

A → B → C → D

- Binary variables
- Say we want P(D) = Σ_{A,B,C} P(A) P(B|A) P(C|B) P(D|C)
- Let f_1(A) = P(A). Note f_1 is an array of size two: (P(A=0), P(A=1))
- f_2(A, B) is a table of size four: (P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1))
- Σ_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = Σ_{B,C} f_3(B,C) f_4(C,D) ( Σ_A f_1(A) f_2(A,B) )
Example: Chain Graph

A → B → C → D

- f_1(A) f_2(A,B) is an array of size four (match A values):
    P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)
- f_5(B) ≡ Σ_A f_1(A) f_2(A,B) is an array of size two:
    P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1)
    P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)
- For this example, f_5(B) happens to be P(B)
- Σ_{B,C} f_3(B,C) f_4(C,D) f_5(B) = Σ_C f_4(C,D) ( Σ_B f_3(B,C) f_5(B) ), and so on
- In the end, f_7(D) = (P(D = 0), P(D = 1))
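A sketch of these elimination steps in code, with made-up CPT numbers (the slides do not specify any): P(D) is computed by summing out A, then B, then C, mirroring the f_5, f_6, f_7 factors above.

```python
# Variable elimination on the chain A -> B -> C -> D with illustrative CPTs.
# Each factor is a dict mapping value tuples to probabilities.
pA = {0: 0.6, 1: 0.4}
pB_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key (B, A)
pC_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (C, B)
pD_C = {(0, 0): 0.4, (1, 0): 0.6, (0, 1): 0.1, (1, 1): 0.9}  # key (D, C)

# f5(B) = sum_A f1(A) f2(A, B)
f5 = {b: sum(pA[a] * pB_A[(b, a)] for a in (0, 1)) for b in (0, 1)}
# f6(C) = sum_B f3(B, C) f5(B)
f6 = {c: sum(pC_B[(c, b)] * f5[b] for b in (0, 1)) for c in (0, 1)}
# f7(D) = sum_C f4(C, D) f6(C)
f7 = {d: sum(pD_C[(d, c)] * f6[c] for c in (0, 1)) for d in (0, 1)}

print(f7)  # P(D); already normalized since every CPT is normalized
```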
Example: Chain Graph

A → B → C → D

- Computation for P(D) by elimination: 12 multiplications, 6 additions
- Enumeration: 48 multiplications, 14 additions
- The saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.
- The saving depends more critically on the graph structure (tree width), and can be intractable
Handling Evidence

- For evidence variables X_E, simply plug in their value e
- Eliminate the variables X_O = X \ X_E \ X_Q
- The final factor will be the joint f(X_Q) = P(X_Q, X_E = e)
- Normalize to answer the query:
    P(X_Q | X_E = e) = f(X_Q) / Σ_{X_Q} f(X_Q)
Summary: Exact Inference

Enumeration

Variable elimination

Not covered: junction tree (aka clique tree)


Exact, but intractable for large graphs
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Inference by Monte Carlo

- Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1 ... x_n}
    p(X_Q = c_Q | X_E) = ∫ 1_{(x_Q = c_Q)} p(x_Q | X_E) dx_Q
- If we can draw samples x_Q^(1), ..., x_Q^(m) ∼ p(x_Q | X_E), an unbiased estimator is
    p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1_{(x_Q^(i) = c_Q)}
- The variance of the estimator decreases as V/m
- Inference reduces to sampling from p(x_Q | X_E)
Forward Sampling Example

[Alarm network with the CPTs from the earlier Alarm example: P(B) = 0.001, P(E) = 0.002, P(A | B, E), P(J | A), P(M | A)]

To generate a sample X = (B, E, A, J, M):
1. Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1 else B = 0
2. Sample E ∼ Ber(0.002)
3. If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on
4. If A = 1 sample J ∼ Ber(0.9), else J ∼ Ber(0.05)
5. If A = 1 sample M ∼ Ber(0.7), else M ∼ Ber(0.01)
Works for Bayesian networks.
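A sketch of this forward sampler (CPT values from the slide; the use of the samples to estimate P(J = 1) is just for illustration):

```python
# Forward sampling for the alarm network, following the five steps above.
import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)

def bern(p):
    return 1 if random.random() < p else 0

def sample_once():
    b = bern(0.001)
    e = bern(0.002)
    a = bern(P_A[(b, e)])
    j = bern(0.9 if a == 1 else 0.05)
    m = bern(0.7 if a == 1 else 0.01)
    return b, e, a, j, m

random.seed(0)
samples = [sample_once() for _ in range(100000)]
print(sum(s[3] for s in samples) / len(samples))  # Monte Carlo estimate of P(J = 1)
```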
Inference with Forward Sampling

- Say the inference task is P(B = 1 | E = 1, M = 1)
- Throw away all samples except those with (E = 1, M = 1)
    p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1_{(B^(i) = 1)}
  where m is the number of surviving samples
- Can be highly inefficient (note P(E = 1) is tiny)
- Does not work for Markov Random Fields
Gibbs Sampler Example: P(B = 1 | E = 1, M = 1)

- The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.
- Directly sample from p(x_Q | X_E)
- Works for both directed and undirected graphical models
- Initialization: fix the evidence; randomly set the other variables,
  e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

[Alarm network and CPTs as before]
Gibbs Update

- For each non-evidence variable x_i, fixing all other nodes X_{-i}, resample its value x_i ∼ P(x_i | X_{-i})
- This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i))
- For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children
    P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y∈C(x_i)} P(y | Pa(y))
  where Pa(x) are the parents of x, and C(x) the children of x.
- For many graphical models the Markov Blanket is small.
- For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)

[Alarm network and CPTs as before]
Gibbs Update

- Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1)
- Starting from X^(1), sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2)
- Move on to J, then repeat B, A, J, B, A, J, ...
- Keep all later samples. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.

[Alarm network and CPTs as before]
Gibbs Example 2: The Ising Model

[Grid: node x_s with neighbors A, B, C, D]

This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
Gibbs Example 2: The Ising Model

[Grid: node x_s with neighbors A, B, C, D]

- The Markov blanket of x_s is {A, B, C, D}
- In general for undirected graphical models
    p(x_s | x_{-s}) = p(x_s | x_{N(s)})
  where N(s) is the set of neighbors of s.
- The Gibbs update is
    p(x_s = 1 | x_{N(s)}) = 1 / ( exp(-(θ_s + Σ_{t∈N(s)} θ_st x_t)) + 1 )
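A sketch of this Gibbs update on a small Ising grid; the grid size and parameter values are illustrative choices.

```python
# Gibbs sampling for a small Ising grid using the update
# p(x_s = 1 | x_N(s)) = 1 / (exp(-(theta_s + sum_t theta_st x_t)) + 1).
import math
import random

L = 5                      # 5 x 5 grid
theta_s = 0.0              # node parameter, shared by all nodes
theta_st = 0.5             # edge parameter, shared by all edges
x = {(i, j): random.randint(0, 1) for i in range(L) for j in range(L)}

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < L and 0 <= j + dj < L:
            yield (i + di, j + dj)

def gibbs_sweep():
    for s in x:
        field = theta_s + theta_st * sum(x[t] for t in neighbors(*s))
        p_one = 1.0 / (math.exp(-field) + 1.0)
        x[s] = 1 if random.random() < p_one else 0

random.seed(0)
for sweep in range(200):   # burn-in plus a few extra sweeps
    gibbs_sweep()
print(sum(x.values()) / len(x))  # fraction of nodes currently set to 1
```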
Gibbs Sampling as a Markov Chain

- A Markov chain is defined by a transition matrix T(X' | X)
- Certain Markov chains have a stationary distribution π such that π = Tπ
- The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x'_i) | (X_{-i}, x_i)) = p(x'_i | X_{-i}), and stationary distribution p(x_Q | X_E)
- But it takes time for the chain to reach its stationary distribution (to mix)
- Can be difficult to assert mixing
- In practice, burn-in: discard X^(0), ..., X^(T)
- Use all of X^(T+1), ... for inference (they are correlated)
- Do not thin
Collapsed Gibbs Sampling

- In general, E_p[f(X)] ≈ (1/m) Σ_{i=1}^m f(X^(i)) if X^(i) ∼ p
- Sometimes X = (Y, Z) where Z has closed-form operations
- If so,
    E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) Σ_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]   if Y^(i) ∼ p(Y)
- No need to sample Z: it is collapsed
- Collapsed Gibbs sampler: T_i((Y_{-i}, y'_i) | (Y_{-i}, y_i)) = p(y'_i | Y_{-i})
- Note p(y'_i | Y_{-i}) = ∫ p(y'_i, Z | Y_{-i}) dZ
Example: Collapsed Gibbs Sampling for LDA

[LDA plate diagram as before]

Collapse θ, φ. The Gibbs update is
  P(z_i = j | z_{-i}, w) ∝ (n^{(w_i)}_{-i,j} + β) / (n^{(·)}_{-i,j} + Wβ) × (n^{(d_i)}_{-i,j} + α) / (n^{(d_i)}_{-i,·} + Tα)

- n^{(w_i)}_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
- n^{(d_i)}_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
- n^{(·)}_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
- n^{(d_i)}_{-i,·}: length of document d_i, excluding the current position
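A compact sketch of this collapsed Gibbs update on a toy corpus; the corpus, T, α, and β are made up, and only the count arrays are maintained since θ and φ are integrated out.

```python
# Collapsed Gibbs sampling for LDA on a tiny illustrative corpus.
import random

docs = [[0, 1, 2, 0, 1], [3, 4, 3, 4, 2], [0, 2, 4, 3, 1]]  # word ids per document
W, T = 5, 2                 # vocabulary size, number of topics
alpha, beta = 0.5, 0.1

random.seed(0)
z = [[random.randrange(T) for _ in d] for d in docs]
n_wt = [[0] * T for _ in range(W)]       # word-topic counts
n_dt = [[0] * T for _ in docs]           # document-topic counts
n_t = [0] * T                            # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

def gibbs_pass():
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                               # remove the current assignment
            n_wt[w][t] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
            weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta)
                       * (n_dt[d][j] + alpha)         # the document denominator is
                       for j in range(T)]             # constant in j, so it is dropped
            t = random.choices(range(T), weights=weights)[0]
            z[d][i] = t                               # add the new assignment back
            n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

for _ in range(100):
    gibbs_pass()
print(z)
```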
Summary: Markov Chain Monte Carlo

Gibbs sampling

Not covered: block Gibbs, Metropolis-Hastings


Unbiased (after burn-in), but can have high variance
To learn more, come to Prof. Prasad Tetali's tutorial "Markov Chain Mixing with Applications", 2pm Monday.
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
The Sum-Product Algorithm

Also known as belief propagation (BP)

Exact if the graph is a tree; otherwise known as loopy BP,


approximate

The algorithm involves passing messages on the factor graph

Alternative view: variational approximation (more later)


Example: A Simple HMM

- The Hidden Markov Model template (not a graphical model):
  [Two states 1, 2 with initial probabilities π_1 = π_2 = 1/2, self-transition probabilities 1/4 (state 1) and 1/2 (state 2), and emission distributions P(x | z=1) = (1/2, 1/4, 1/4), P(x | z=2) = (1/4, 1/2, 1/4) over the colors (R, G, B)]
- Observing x_1 = R, x_2 = G, the directed graphical model is z_1 → z_2 with observed emissions x_1 = R and x_2 = G
- Factor graph: f_1 --- z_1 --- f_2 --- z_2, with
    f_1 = P(z_1) P(x_1 | z_1),  f_2 = P(z_2 | z_1) P(x_2 | z_2)
Messages

A message is a vector of length K, where K is the number of values x takes.
There are two types of messages:
1. μ_{f→x}: message from a factor node f to a variable node x; μ_{f→x}(i) is the ith element, i = 1 ... K.
2. μ_{x→f}: message from a variable node x to a factor node f
Leaf Messages

- Assume a tree factor graph. Pick an arbitrary root, say z_2
- Start messages at the leaves.
- If a leaf is a factor node f, μ_{f→x}(x) = f(x)
- If a leaf is a variable node x, μ_{x→f}(x) = 1

[Factor graph f_1 --- z_1 --- f_2 --- z_2 and HMM parameters as before]

  μ_{f_1→z_1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 × 1/2 = 1/4
  μ_{f_1→z_1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 × 1/4 = 1/8
Message from Variable to Factor

- A node (factor or variable) can send out a message if all other incoming messages have arrived
- Let x be in factor f_s:
    μ_{x→f_s}(x) = ∏_{f∈ne(x)\f_s} μ_{f→x}(x)
  ne(x)\f_s are the factors connected to x, excluding f_s.

[Factor graph and HMM parameters as before]

  μ_{z_1→f_2}(z_1 = 1) = 1/4
  μ_{z_1→f_2}(z_1 = 2) = 1/8
Message from Factor to Variable

- Let x be in factor f_s. Let the other variables in f_s be x_{1:M}.
    μ_{f_s→x}(x) = Σ_{x_1} ... Σ_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M μ_{x_m→f_s}(x_m)

[Factor graph and HMM parameters as before]

  μ_{f_2→z_2}(s) = Σ_{s'=1}^2 μ_{z_1→f_2}(s') f_2(z_1 = s', z_2 = s)
                 = 1/4 × P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 × P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)
We get
  μ_{f_2→z_2}(z_2 = 1) = 1/32
  μ_{f_2→z_2}(z_2 = 2) = 1/8
Up to Root, Back Down

- The message has reached the root; pass it back down

[Factor graph and HMM parameters as before]

  μ_{z_2→f_2}(z_2 = 1) = 1
  μ_{z_2→f_2}(z_2 = 2) = 1
Keep Passing Down

[Factor graph and HMM parameters as before]

  μ_{f_2→z_1}(s) = Σ_{s'=1}^2 μ_{z_2→f_2}(s') f_2(z_1 = s, z_2 = s')
                 = 1 × P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 × P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)
We get
  μ_{f_2→z_1}(z_1 = 1) = 7/16
  μ_{f_2→z_1}(z_1 = 2) = 3/8
From Messages to Marginals

Once a variable receives all incoming messages, we compute its marginal as
  p(x) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)
In this example
  P(z_1 | x_1, x_2) ∝ μ_{f_1→z_1} μ_{f_2→z_1} = (1/4 × 7/16, 1/8 × 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)
  P(z_2 | x_1, x_2) ∝ μ_{f_2→z_2} = (1/32, 1/8) ∝ (0.2, 0.8)
One can also compute the marginal of the set of variables x_s involved in a factor f_s:
  p(x_s) ∝ f_s(x_s) ∏_{x∈ne(f_s)} μ_{x→f_s}(x)
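A sketch that reproduces these message computations numerically for the two-step HMM (the transition probabilities are read off the worked numbers in the slides):

```python
# Sum-product on the factor graph f1 -- z1 -- f2 -- z2 for the toy HMM.
pi = [0.5, 0.5]                                  # P(z_1)
trans = [[0.25, 0.75], [0.5, 0.5]]               # trans[i][j] = P(z_2=j+1 | z_1=i+1)
emit = {"R": [0.5, 0.25], "G": [0.25, 0.5], "B": [0.25, 0.25]}  # P(x | z)
x1, x2 = "R", "G"

# Leaf message from f_1 to z_1: P(z_1) P(x_1 | z_1)
mu_f1_z1 = [pi[i] * emit[x1][i] for i in range(2)]           # [1/4, 1/8]
mu_z1_f2 = mu_f1_z1[:]                                        # z_1 has only one other factor
# Factor-to-variable message f_2 -> z_2: sum over z_1
mu_f2_z2 = [sum(mu_z1_f2[i] * trans[i][j] * emit[x2][j] for i in range(2))
            for j in range(2)]                                # [1/32, 1/8]
# Back down: z_2 -> f_2 is all ones, then f_2 -> z_1 sums over z_2
mu_f2_z1 = [sum(trans[i][j] * emit[x2][j] for j in range(2)) for i in range(2)]  # [7/16, 3/8]

def normalize(v):
    s = sum(v)
    return [u / s for u in v]

print(normalize([mu_f1_z1[i] * mu_f2_z1[i] for i in range(2)]))  # P(z_1 | x) = (0.7, 0.3)
print(normalize(mu_f2_z2))                                       # P(z_2 | x) = (0.2, 0.8)
```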
Handling Evidence

Observing x = v:
- we can absorb it in the factor (as we did); or
- set the messages μ_{x→f}(x) = 0 for all x ≠ v
Observing X_E:
- multiplying the incoming messages to x ∉ X_E gives the joint (not p(x | X_E)):
    p(x, X_E) ∝ ∏_{f∈ne(x)} μ_{f→x}(x)
- The conditional is easily obtained by normalization:
    p(x | X_E) = p(x, X_E) / Σ_{x'} p(x', X_E)
Loopy Belief Propagation

So far, we assumed a tree graph

When the factor graph contains loops, pass messages indefinitely until convergence

But convergence may not happen

But in many cases loopy BP still works well, empirically


Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Example: The Ising Model

[Grid of binary nodes; x_s and x_t joined by an edge with parameter θ_st]

The random variables x take values in {0, 1}.
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
The Conditional

[Grid of binary nodes; x_s and x_t joined by an edge with parameter θ_st]

- Markovian: the conditional distribution for x_s is
    p(x_s | x_{-s}) = p(x_s | x_{N(s)})
  where N(s) is the set of neighbors of s.
- This reduces to
    p(x_s = 1 | x_{N(s)}) = 1 / ( exp(-(θ_s + Σ_{t∈N(s)} θ_st x_t)) + 1 )
- Gibbs sampling would draw x_s like this.
The Mean Field Algorithm for the Ising Model

  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(-(θ_s + Σ_{t∈N(s)} θ_st x_t)) + 1 )

- Instead of Gibbs sampling, let μ_s be the estimated marginal p(x_s = 1):
    μ_s ← 1 / ( exp(-(θ_s + Σ_{t∈N(s)} θ_st μ_t)) + 1 )
- The μ's are updated iteratively
- The Mean Field algorithm is coordinate ascent and is guaranteed to converge to a local optimum (more later).
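A sketch of this mean field update on a small Ising grid: each μ_s is updated using its neighbors' current estimates μ_t in place of sampled values; grid size and parameters are illustrative.

```python
# Mean field coordinate-ascent updates for a small Ising grid.
import math

L = 5
theta_s, theta_st = 0.0, 0.5
mu = {(i, j): 0.5 for i in range(L) for j in range(L)}   # initial marginal estimates

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < L and 0 <= j + dj < L:
            yield (i + di, j + dj)

for it in range(100):                    # coordinate-ascent sweeps
    for s in mu:
        field = theta_s + theta_st * sum(mu[t] for t in neighbors(*s))
        mu[s] = 1.0 / (math.exp(-field) + 1.0)

print(mu[(2, 2)])   # estimated p(x_s = 1) at the center node
```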
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Exponential Family

- Let φ(X) = (φ_1(X), ..., φ_d(X))^T be d sufficient statistics, where φ_i : X → R
- Note X is all the nodes in a graphical model
- φ_i(X) is sometimes called a feature function
- Let θ = (θ_1, ..., θ_d)^T ∈ R^d be the canonical parameters.
- The exponential family is a family of probability densities:
    p_θ(x) = exp( θ^T φ(x) - A(θ) )
Exponential Family

  p_θ(x) = exp( θ^T φ(x) - A(θ) )

- The key is the inner product between the parameters θ and the sufficient statistics φ.
- A is the log partition function,
    A(θ) = log ∫ exp( θ^T φ(x) ) ν(dx)
- A = log Z
Minimal vs. Overcomplete Models

- Parameters for which the density is normalizable: Ω = {θ ∈ R^d | A(θ) < ∞}
- A minimal exponential family is one where the φ's are linearly independent.
- An overcomplete exponential family is one where the φ's are linearly dependent:
    ∃ α ∈ R^d, α^T φ(x) = constant ∀x
- Both minimal and overcomplete representations are useful.
Exponential Family Example 1: Bernoulli

  p(x) = π^x (1 - π)^{1-x} for x ∈ {0, 1} and π ∈ (0, 1).

- Does not look like an exponential family!
- Can be rewritten as
    p(x) = exp( x log π + (1 - x) log(1 - π) )
- Now in exponential family form with φ_1(x) = x, φ_2(x) = 1 - x, θ_1 = log π, θ_2 = log(1 - π), and A(θ) = 0.
- Overcomplete: α_1 = α_2 = 1 makes α^T φ(x) = 1 for all x
Exponential Family Example 1: Bernoulli

  p(x) = exp( x log π + (1 - x) log(1 - π) )

- Can be further rewritten as
    p(x) = exp( θ x - log(1 + exp(θ)) )
- Minimal exponential family with
    φ(x) = x,  θ = log( π / (1 - π) ),  A(θ) = log(1 + exp(θ)).
Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family, but not all (e.g., the Laplace distribution).
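A small numeric check of the minimal Bernoulli representation: exp(θx - A(θ)) recovers π^x (1 - π)^{1-x}, and dA/dθ gives back the mean π.

```python
# Numeric check of the minimal exponential family form of Bernoulli.
import math

pi = 0.3
theta = math.log(pi / (1 - pi))
A = math.log(1 + math.exp(theta))

for x in (0, 1):
    exp_family = math.exp(theta * x - A)
    direct = pi ** x * (1 - pi) ** (1 - x)
    print(x, exp_family, direct)       # the two columns agree

# The mean parameter is the derivative of A: dA/dtheta = sigmoid(theta) = pi
print(1 / (1 + math.exp(-theta)))      # 0.3
```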
Exponential Family Example 2: Ising Model

[Grid of binary nodes; x_s and x_t joined by an edge with parameter θ_st]

  p_θ(x) = exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t - A(θ) )

- Binary random variables x_s ∈ {0, 1}
- d = |V| + |E| sufficient statistics: φ(x) = (... x_s ... x_s x_t ...)
- This is a regular (Ω = R^d), minimal exponential family.
Exponential Family Example 3: Potts Model

[Grid of nodes; x_s and x_t joined by an edge]

- Similar to the Ising model, but generalizing to x_s ∈ {0, ..., r - 1}.
- Indicator functions f_sj(x) = 1 if x_s = j and 0 otherwise, and f_stjk(x) = 1 if x_s = j and x_t = k, and 0 otherwise.
    p_θ(x) = exp( Σ_{s,j} θ_sj f_sj(x) + Σ_{s,t,j,k} θ_stjk f_stjk(x) - A(θ) )
- d = r|V| + r^2|E|
- Regular but overcomplete, because Σ_{j=0}^{r-1} f_sj(x) = 1 for any s ∈ V and all x.
- The Potts model is a special case where the parameters are tied: θ_stkk takes one common value, and θ_stjk (j ≠ k) takes another.
Important Relation

- For sufficient statistics defined by indicator functions
  e.g., φ_sj(x) = f_sj(x) = 1 if x_s = j and 0 otherwise
- The marginal can be obtained via the mean:
    E_θ[φ_sj(x)] = P(x_s = j)
- Since inference is about computing the marginal, in this case it is equivalent to computing the mean.
Mean Parameters

- Let p be any density (not necessarily in the exponential family).
- Given sufficient statistics φ, the mean parameters μ = (μ_1, ..., μ_d)^T are
    μ_i = E_p[φ_i(x)] = ∫ φ_i(x) p(x) dx
- The set of mean parameters
    M = {μ ∈ R^d | ∃p s.t. E_p[φ(x)] = μ}
- If μ^(1), μ^(2) ∈ M, there must exist corresponding p^(1), p^(2)
- The convex combinations of p^(1), p^(2) lead to another mean parameter in M
- Therefore M is convex
Example: The First Two Moments

- Let φ_1(x) = x, φ_2(x) = x^2
- For any p (not necessarily Gaussian) on x, the mean parameters are μ = (μ_1, μ_2) = (E(x), E(x^2))
- Note V(x) = E(x^2) - E^2(x) = μ_2 - μ_1^2 ≥ 0 for any p
- M is not R^2 but rather the subset μ_1 ∈ R, μ_2 ≥ μ_1^2.
The Marginal Polytope

- The marginal polytope is defined for discrete x_s
- Recall M = {μ ∈ R^d | μ = Σ_x φ(x) p(x) for some p}
- p can be a point mass function on a particular x.
- In fact any p is a convex combination of such point mass functions.
- M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope.
Marginal Polytope Example

Tiny Ising model: two nodes x_1, x_2 ∈ {0, 1} connected by an edge.

- minimal sufficient statistics φ(x_1, x_2) = (x_1, x_2, x_1 x_2)
- only 4 different x = (x_1, x_2).
- the marginal polytope is
    M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
- the convex hull is a polytope inside the unit cube.
- the three coordinates are the node marginals μ_1 = E_p[x_1 = 1], μ_2 = E_p[x_2 = 1] and the edge marginal μ_12 = E_p[x_1 = x_2 = 1], hence the name.
The Log Partition Function A

- For any regular exponential family, A(θ) is convex in θ.
- Strictly convex for a minimal exponential family.
- Nice property:
    ∂A(θ)/∂θ_i = E_θ[φ_i(x)]
- Therefore, ∇A = μ, the mean parameters of p_θ.
Conjugate Duality

The conjugate dual function A* to A is defined as
  A*(μ) = sup_θ ⟨θ, μ⟩ - A(θ)
Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.

- For any μ in M's interior, let θ(μ) satisfy E_{θ(μ)}[φ(x)] = ∇A(θ(μ)) = μ.
- Then A*(μ) = -H(p_{θ(μ)}), the negative entropy.
- The dual of the dual gives back A:
    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ - A*(μ)
- For all θ, the supremum is attained uniquely at the μ ∈ M^0 given by the moment matching conditions μ = E_θ[φ(x)].
Example: Conjugate Dual for Bernoulli

- Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
- By definition
    A*(μ) = sup_{θ∈R} μθ - log(1 + exp(θ))
- Taking the derivative and solving,
    A*(μ) = μ log μ + (1 - μ) log(1 - μ)
  i.e., the negative entropy.
Inference with Variational Representation

A(θ) = sup_{μ∈M} ⟨θ, μ⟩ - A*(μ) is attained by μ = E_θ[φ(x)].

- Want to compute the marginals P(x_s = j)? They are the mean parameters μ_sj = E_θ[φ_sj(x)] under the standard overcomplete representation.
- Want to compute the mean parameters μ_sj? They are the arg sup of the optimization problem above.
- This variational representation is exact, not approximate (we will relax it next to derive loopy BP and mean field)
The Difficulties with Variational Representation

  A(θ) = sup_{μ∈M} ⟨θ, μ⟩ - A*(μ)

- Difficult to solve even though it is a convex problem
- Two issues:
    Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)
    The dual function A*(μ) usually does not admit an explicit form.
- Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.
- Next, we cast the mean field and sum-product algorithms as variational approximations.
The Mean Field Method as Variational Approximation

- The mean field method replaces M with a simpler subset M(F) on which A*(μ) has a closed form.
- Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)
- Set θ_i = 0 for all φ_i that involve edges
- The densities in this sub-family are all fully factorized:
    p_θ(x) = ∏_{s∈V} p(x_s; θ_s)
The Geometry of M(F)

- Let M(F) be the mean parameters of the fully factorized sub-family. In general, M(F) ⊆ M
- Recall M is the convex hull of the extreme points {φ(x)}.
- It turns out the extreme points {φ(x)} are in M(F).
- Example:
    The tiny Ising model x_1, x_2 ∈ {0, 1} with φ = (x_1, x_2, x_1 x_2)
    The point mass distribution p(x = (0, 1)^T) = 1 is realized as a limit of the series p(x) = exp(θ_1 x_1 + θ_2 x_2 - A(θ)) where θ_1 → -∞ and θ_2 → ∞.
    This series is in F because θ_12 = 0.
    Hence the extreme point φ(x) = (0, 1, 0) is in M(F).
The Geometry of M(F)

- Because the extreme points of M are in M(F), if M(F) were convex, we would have M = M(F).
- But in general M(F) is a true subset of M
- Therefore, M(F) is a nonconvex inner set of M

[Figure: M(F) as a nonconvex inner subset of the polytope M, with extreme points φ(x) on the boundary]
The Mean Field Method as Variational Approximation

- Recall the exact variational problem
    A(θ) = sup_{μ∈M} ⟨θ, μ⟩ - A*(μ)
  attained by the solution to the inference problem μ = E_θ[φ(x)].
- The mean field method simply replaces M with M(F):
    L(θ) = sup_{μ∈M(F)} ⟨θ, μ⟩ - A*(μ)
- Obviously L(θ) ≤ A(θ).
- The original solution μ* may not be in M(F)
- Even if μ* ∈ M(F), we may hit a local maximum and not find it
- Why bother? Because A*(μ) = -H(p_{θ(μ)}) has a very simple form for μ ∈ M(F)
Example: Mean Field for Ising Model

- The mean parameters for the Ising model are the node and edge marginals:
    μ_s = p(x_s = 1),  μ_st = p(x_s = 1, x_t = 1)
- Fully factorized M(F) means no edges: μ_st = μ_s μ_t
- For μ ∈ M(F), the dual function A*(μ) has the simple form
    A*(μ) = -Σ_{s∈V} H(μ_s) = Σ_{s∈V} ( μ_s log μ_s + (1 - μ_s) log(1 - μ_s) )
- Thus the mean field problem is
    L(θ) = sup_{μ∈M(F)} ⟨θ, μ⟩ - Σ_{s∈V} ( μ_s log μ_s + (1 - μ_s) log(1 - μ_s) )
         = max_{(μ_1 ... μ_m) ∈ [0,1]^m} Σ_{s∈V} θ_s μ_s + Σ_{(s,t)∈E} θ_st μ_s μ_t + Σ_{s∈V} H(μ_s)
Example: Mean Field for Ising Model

  L(θ) = max_{(μ_1 ... μ_m) ∈ [0,1]^m} Σ_{s∈V} θ_s μ_s + Σ_{(s,t)∈E} θ_st μ_s μ_t + Σ_{s∈V} H(μ_s)

- Bilinear in μ, so not jointly concave
- But concave in a single dimension μ_s, fixing the others.
- Iterative coordinate-wise maximization: fix μ_t for t ≠ s and optimize μ_s.
- Setting the partial derivative w.r.t. μ_s to 0 yields:
    μ_s = 1 / ( 1 + exp( -(θ_s + Σ_{(s,t)∈E} θ_st μ_t) ) )
  as we've seen before.
- Caution: mean field converges to a local maximum depending on the initialization of μ_1, ..., μ_m.
The Sum-Product Algorithm as Variational Approximation

  A(θ) = sup_{μ∈M} ⟨θ, μ⟩ - A*(μ)

The sum-product algorithm makes two approximations:
- it relaxes M to an outer set L
- it replaces the dual A* with an approximation
so that the supremum is taken over L instead of M.
The Outer Relaxation

- For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals μ_sj = p(x_s = j), μ_stjk = p(x_s = j, x_t = k).
- The marginal polytope is M = {μ | ∃p with marginals μ}.
- Now consider τ ∈ R^d_+ satisfying node normalization and edge-node marginal consistency conditions:
    Σ_{j=0}^{r-1} τ_sj = 1            ∀s ∈ V
    Σ_{k=0}^{r-1} τ_stjk = τ_sj       ∀s, t ∈ V, j = 0 ... r-1
    Σ_{j=0}^{r-1} τ_stjk = τ_tk       ∀s, t ∈ V, k = 0 ... r-1
- Define L = {τ satisfying the above conditions}.

The Outer Relaxation

- If the graph is a tree then M = L
- If the graph has cycles then M ⊂ L
- L is too lax: it drops other constraints that true marginals need to satisfy
- Nice property: L is still a polytope, but much simpler than M.

[Figure: the marginal polytope M, with extreme points φ(x), sits inside the larger polytope L]

The first approximation in sum-product is to replace M with L.
Approximating A*

- Recall τ are the node and edge marginals
- If the graph is a tree, one can exactly reconstruct the joint probability
    p_τ(x) = ∏_{s∈V} τ_{s,x_s} ∏_{(s,t)∈E} τ_{st,x_s,x_t} / ( τ_{s,x_s} τ_{t,x_t} )
- If the graph is a tree, the entropy of the joint distribution is
    H(p_τ) = -Σ_{s∈V} Σ_{j=0}^{r-1} τ_sj log τ_sj - Σ_{(s,t)∈E} Σ_{j,k} τ_stjk log( τ_stjk / (τ_sj τ_tk) )
- Neither holds for graphs with cycles.

Approximating A*

- Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:
    H_Bethe(p_τ) = -Σ_{s∈V} Σ_{j=0}^{r-1} τ_sj log τ_sj - Σ_{(s,t)∈E} Σ_{j,k} τ_stjk log( τ_stjk / (τ_sj τ_tk) )
- Note H_Bethe is not a true entropy. The second approximation in sum-product is to replace A*(τ) with -H_Bethe(p_τ).
The Sum-Product Algorithm as Variational Approximation

With these two approximations, we arrive at the variational problem
  sup_{τ∈L} ⟨θ, τ⟩ + H_Bethe(p_τ)

- Optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.
- The sum-product algorithm can be derived as an iterative fixed point procedure to satisfy the optimality conditions.
- At the solution, the approximate value is not guaranteed to be either an upper or a lower bound of A(θ)
- τ may not correspond to a true marginal distribution
Summary: Variational Inference

- The sum-product algorithm (loopy belief propagation)
- The mean field method
- Not covered: Expectation Propagation
- Efficient computation, but often an unknown bias in the solution.
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Maximizing Problems

Recall the HMM example

[HMM parameters as before]

There are two senses of "best" states z_{1:N} given x_{1:N}:
1. So far we computed the marginal p(z_n | x_{1:N})
   - We can define "best" as z_n* = argmax_k p(z_n = k | x_{1:N})
   - However z*_{1:N} as a whole may not be the best
   - In fact z*_{1:N} can even have zero probability!
2. An alternative is to find
     z*_{1:N} = argmax_{z_{1:N}} p(z_{1:N} | x_{1:N})
   - This finds the most likely state configuration as a whole
   - The max-sum algorithm solves this
   - It generalizes the Viterbi algorithm for HMMs
Intermediate: The Max-Product Algorithm

Simple modification to the sum-product algorithm: replace Σ with max in the factor-to-variable messages.

  μ_{f_s→x}(x) = max_{x_1} ... max_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M μ_{x_m→f_s}(x_m)
  μ_{x_m→f_s}(x_m) = ∏_{f∈ne(x_m)\f_s} μ_{f→x_m}(x_m)
  μ_{x_leaf→f}(x) = 1
  μ_{f_leaf→x}(x) = f(x)
Intermediate: The Max-Product Algorithm

- As in sum-product, pick an arbitrary variable node x as the root
- Pass messages up from the leaves until they reach the root
- Unlike sum-product, do not pass messages back from the root to the leaves
- At the root, multiply the incoming messages:
    p_max = max_x ∏_{f∈ne(x)} μ_{f→x}(x)
- This is the probability of the most likely state configuration
Intermediate: The Max-Product Algorithm

- To identify the configuration itself, keep back pointers:
- When creating the message
    μ_{f_s→x}(x) = max_{x_1} ... max_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M μ_{x_m→f_s}(x_m)
  for each value of x, we separately create M pointers back to the values of x_1, ..., x_M that achieve the maximum.
- At the root, backtrack the pointers.
Intermediate: The Max-Product Algorithm

[Factor graph f_1 --- z_1 --- f_2 --- z_2 and HMM parameters as before]

- Message from the leaf f_1:
    μ_{f_1→z_1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 × 1/2 = 1/4
    μ_{f_1→z_1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 × 1/4 = 1/8
- The second message:
    μ_{z_1→f_2}(z_1 = 1) = 1/4
    μ_{z_1→f_2}(z_1 = 2) = 1/8
Intermediate: The Max-Product Algorithm

[Factor graph and HMM parameters as before]

  μ_{f_2→z_2}(z_2 = 1)
    = max_{z_1} f_2(z_1, z_2) μ_{z_1→f_2}(z_1)
    = max_{z_1} P(z_2 = 1 | z_1) P(x_2 = G | z_2 = 1) μ_{z_1→f_2}(z_1)
    = max(1/4 × 1/4 × 1/4, 1/2 × 1/4 × 1/8) = 1/64
Back pointer for z_2 = 1: either z_1 = 1 or z_1 = 2
Intermediate: The Max-Product Algorithm

[Factor graph and HMM parameters as before]

The other element of the same message:
  μ_{f_2→z_2}(z_2 = 2)
    = max_{z_1} f_2(z_1, z_2) μ_{z_1→f_2}(z_1)
    = max_{z_1} P(z_2 = 2 | z_1) P(x_2 = G | z_2 = 2) μ_{z_1→f_2}(z_1)
    = max(3/4 × 1/2 × 1/4, 1/2 × 1/2 × 1/8) = 3/32
Back pointer for z_2 = 2: z_1 = 1
Intermediate: The Max-Product Algorithm

[Factor graph and HMM parameters as before]

  μ_{f_2→z_2} = ( 1/64 with back pointer z_1 = 1 or 2,  3/32 with back pointer z_1 = 1 )
At the root z_2,
  max_{s=1,2} μ_{f_2→z_2}(s) = 3/32, attained at z_2 = 2, whose back pointer gives z_1 = 1
  z*_{1:2} = argmax_{z_{1:2}} p(z_{1:2} | x_{1:2}) = (1, 2)
In this example, sum-product and max-product produce the same best sequence; in general they differ.
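A sketch of max-product with back pointers on the same two-step HMM, reproducing the most likely configuration (z_1, z_2) = (1, 2):

```python
# Max-product with back pointers on the toy HMM (transition probabilities are
# read off the worked numbers in the slides).
pi = [0.5, 0.5]
trans = [[0.25, 0.75], [0.5, 0.5]]               # trans[i][j] = P(z_2=j+1 | z_1=i+1)
emit = {"R": [0.5, 0.25], "G": [0.25, 0.5], "B": [0.25, 0.25]}
x1, x2 = "R", "G"

mu_z1_f2 = [pi[i] * emit[x1][i] for i in range(2)]           # [1/4, 1/8]

# f_2 -> z_2: take max over z_1 instead of sum, and remember the argmax
mu_f2_z2, back = [], []
for j in range(2):
    scores = [mu_z1_f2[i] * trans[i][j] * emit[x2][j] for i in range(2)]
    mu_f2_z2.append(max(scores))
    back.append(max(range(2), key=lambda i: scores[i]))

z2_star = max(range(2), key=lambda j: mu_f2_z2[j])           # argmax at the root
z1_star = back[z2_star]                                      # follow the back pointer
print(mu_f2_z2[z2_star])                 # 3/32, probability of the best configuration
print(z1_star + 1, z2_star + 1)          # (1, 2)
```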
From Max-Product to Max-Sum

The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow.

  μ_{f_s→x}(x) = max_{x_1 ... x_M} [ log f_s(x, x_1, ..., x_M) + Σ_{m=1}^M μ_{x_m→f_s}(x_m) ]
  μ_{x_m→f_s}(x_m) = Σ_{f∈ne(x_m)\f_s} μ_{f→x_m}(x_m)
  μ_{x_leaf→f}(x) = 0
  μ_{f_leaf→x}(x) = log f(x)

When at the root,
  log p_max = max_x Σ_{f∈ne(x)} μ_{f→x}(x)
The back pointers are the same.
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Parameter Learning

- Assume the graph structure is given
- Learning in an exponential family: estimate θ from iid data x_1, ..., x_n.
- Principle: maximum likelihood
- Distinguish two cases:
    fully observed data: all dimensions of x are observed
    partially observed data: some dimensions of x are unobserved.
Fully Observed Data

  p_θ(x) = exp( θ^T φ(x) - A(θ) )

- Given iid data x_1, ..., x_n, the log likelihood is
    ℓ(θ) = (1/n) Σ_{i=1}^n log p_θ(x_i) = θ^T ( (1/n) Σ_{i=1}^n φ(x_i) ) - A(θ) = θ^T μ̂ - A(θ)
- μ̂ ≡ (1/n) Σ_{i=1}^n φ(x_i) is the mean parameter of the empirical distribution on x_1, ..., x_n. Clearly μ̂ ∈ M.
- Maximum likelihood: θ_ML = arg sup_θ θ^T μ̂ - A(θ)
- The solution is θ_ML = θ(μ̂), the exponential family density whose mean parameter matches μ̂.
- When μ̂ ∈ M^0 and φ is minimal, there is a unique maximum likelihood solution θ_ML.
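A tiny illustration of moment matching in the minimal Bernoulli family from earlier: the empirical mean μ̂ determines θ_ML in closed form, since E_θ[φ(x)] = sigmoid(θ) must equal μ̂.

```python
# Maximum likelihood by moment matching in the minimal Bernoulli family.
import math

data = [1, 0, 0, 1, 1, 1, 0, 1]               # illustrative iid 0/1 observations
mu_hat = sum(data) / len(data)                # empirical mean parameter
theta_ml = math.log(mu_hat / (1 - mu_hat))    # moment matching in closed form

# Check: the model mean at theta_ml matches the empirical mean
print(mu_hat, 1 / (1 + math.exp(-theta_ml)))
```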
Partially Observed Data

- Each item is (x, z) where x is observed and z is unobserved
- Full data (x_1, z_1), ..., (x_n, z_n), but we only observe x_1, ..., x_n
- The incomplete likelihood is ℓ(θ) = (1/n) Σ_{i=1}^n log p_θ(x_i), where p_θ(x_i) = ∫ p_θ(x_i, z) dz
- Can be written as ℓ(θ) = (1/n) Σ_{i=1}^n A_{x_i}(θ) - A(θ)
- New log partition function of p_θ(z | x_i), one per item:
    A_{x_i}(θ) = log ∫ exp( θ^T φ(x_i, z') ) dz'
- Expectation-Maximization (EM) algorithm: lower-bound A_{x_i}
EM as Variational Lower Bound

- Mean parameters realizable by any distribution on z while holding x_i fixed:
    M_{x_i} = {μ ∈ R^d | μ = E_p[φ(x_i, z)] for some p}
- The variational definition: A_{x_i}(θ) = sup_{μ∈M_{x_i}} ⟨θ, μ⟩ - A*_{x_i}(μ)
- Trivial variational lower bound:
    A_{x_i}(θ) ≥ ⟨θ, μ^i⟩ - A*_{x_i}(μ^i),   ∀ μ^i ∈ M_{x_i}
- Lower bound L on the incomplete log likelihood:
    ℓ(θ) = (1/n) Σ_{i=1}^n A_{x_i}(θ) - A(θ)
         ≥ (1/n) Σ_{i=1}^n ( ⟨θ, μ^i⟩ - A*_{x_i}(μ^i) ) - A(θ)
         ≡ L(μ^1, ..., μ^n, θ)
Exact EM: The E-Step

The EM algorithm is coordinate ascent on L(μ^1, ..., μ^n, θ).

- In the E-step, maximize each μ^i:
    μ^i ← argmax_{μ^i ∈ M_{x_i}} L(μ^1, ..., μ^n, θ)
- Equivalently, argmax_{μ^i ∈ M_{x_i}} ⟨θ, μ^i⟩ - A*_{x_i}(μ^i)
- This is the variational representation of the mean parameter μ^i(θ) = E_θ[φ(x_i, z)]
- The E-step is named after this expectation E_θ[·] under the current parameters θ
Exact EM: The M-Step

- In the M-step, maximize θ holding the μ's fixed:
    θ ← argmax_θ L(μ^1, ..., μ^n, θ) = argmax_θ ⟨θ, μ̄⟩ - A(θ)
  where μ̄ = (1/n) Σ_{i=1}^n μ^i
- The solution θ(μ̄) satisfies E_{θ(μ̄)}[φ(x)] = μ̄
- This is a standard fully observed maximum likelihood problem, hence the name M-step
Variational EM

For loopy graphs the E-step is often intractable.

- Can't maximize max_{μ^i ∈ M_{x_i}} ⟨θ, μ^i⟩ - A*_{x_i}(μ^i)
- Improve but not necessarily maximize: generalized EM
- The mean field method maximizes
    max_{μ^i ∈ M_{x_i}(F)} ⟨θ, μ^i⟩ - A*_{x_i}(μ^i)
  up to a local maximum
  recall M_{x_i}(F) is an inner approximation to M_{x_i}
- A mean field E-step leads to generalized EM
- The sum-product algorithm does not lead to generalized EM
Outline
Life without Graphical Models
Representation
Directed Graphical Models (Bayesian Networks)
Undirected Graphical Models (Markov Random Fields)
Inference
Exact Inference
Markov Chain Monte Carlo
Variational Inference
Loopy Belief Propagation
Mean Field Algorithm
Exponential Family
Maximizing Problems
Parameter Learning
Structure Learning
Score-Based Structure Learning

- Let a set of allowed candidate features be given
- Let M be a subset of them, defining a log-linear model structure
    P(X | M, θ) = (1/Z) exp( Σ_{i∈M} θ_i f_i(X) )
- A score for the model M can be max_θ ln P(Data | M, θ)
- The score is always better for a larger M, so regularization is needed; M and θ are treated separately
Structure Learning for Gaussian Random Fields

- Consider a p-dimensional multivariate Gaussian N(μ, Σ)
- The graphical model has p nodes x_1, ..., x_p
- The edge between x_i, x_j is absent if and only if Ω_ij = 0, where Ω = Σ^{-1}
- Equivalently, x_i, x_j are conditionally independent given the other variables

[Example graph on x_1, x_2, x_3, x_4]
Structure Learning for Gaussian Random Fields

- Let the data be X^(1), ..., X^(n) ∼ N(μ, Σ)
- The log likelihood is
    (n/2) log |Ω| - (1/2) Σ_{i=1}^n (X^(i) - μ)^T Ω (X^(i) - μ)
- The maximum likelihood estimate of Σ is the sample covariance
    S = (1/n) Σ_i (X^(i) - X̄)(X^(i) - X̄)^T
  where X̄ is the sample mean
- S^{-1} is not a good estimate of Ω when n is small
Structure Learning for Gaussian Random Fields

- For centered data, minimize a regularized problem instead:
    -log |Ω| + (1/n) Σ_{i=1}^n (X^(i))^T Ω X^(i) + λ Σ_{i≠j} |Ω_ij|
- Known as the graphical lasso (glasso)
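A sketch of sparse precision estimation with the graphical lasso, here using scikit-learn's GraphicalLasso (assuming scikit-learn is available; the data and the regularization strength are illustrative):

```python
# Sparse precision (graph structure) estimation with the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))            # 200 samples, 5 variables (centered)

model = GraphicalLasso(alpha=0.1).fit(X)
Omega = model.precision_                      # estimated precision matrix

# Zero off-diagonal entries of Omega correspond to absent edges in the Gaussian MRF
print(np.round(Omega, 2))
print(np.abs(Omega[np.triu_indices(5, k=1)]) < 1e-6)  # which candidate edges are pruned
```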
Recap

- Given GM = joint distribution p(x_1, ..., x_n)
    BN or MRF
    conditional independence
- Do inference = p(X_Q | X_E), in general X_Q ∪ X_E ⊆ {x_1 ... x_n}
    exact, MCMC, variational
- If p(x_1, ..., x_n) not given, estimate it from data
    parameter and structure learning
Much on-going research!
