BP Basics
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic
structures.
2. Algorithm: BP estimates the global effect of these interactions on
each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming
algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true
predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
Factor Graph Notation
• Variables: X1, X2, …, X9
• Factors: unary factors ψ{1}, ψ{2}, ψ{3}, …; pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, …; ternary factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}
• Joint Distribution: p(X=x) = (1/Z) ∏_A ψ_A(x_A), the normalized product of all the factor values
[Figure: a factor graph for tagging "time flies like an arrow": tag variables X1–X4 form a chain through the pairwise factors, each tag also has a unary factor over the tag set (e.g., v 3, n 4, p 0.1, d 0.1), and variables X6–X9 attach to the chain through the ternary factors.]
Inference
Given a factor graph, two common tasks …
– Compute the most likely joint assignment, x* = argmax_x p(X=x)
– Compute the marginal distribution of variable Xi: p(Xi=xi) for each value xi
[Figure: a linear-chain factor graph for tagging "time flies like an arrow": tag variables X0–X5 with X0 = <START>, pairwise factors ψ0, ψ2, ψ4, ψ6, ψ8 linking adjacent tags, and unary factors ψ1, ψ3, ψ5, ψ7, ψ9 linking each tag to its word.]
Marginals by Sampling on Factor Graph
The marginal p(Xi = xi) gives the probability that variable Xi takes value xi in a random sample. Draw samples from the joint distribution:
Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n
[Figure: the same linear-chain factor graph over "time flies like an arrow"; each sample is one joint tag assignment for X1–X5.]
Marginals by Sampling on Factor Graph
Estimate each marginal as the fraction of samples in which the variable takes that value:
X1: n 4/6, v 2/6   X2: v 3/6, n 3/6   X3: p 4/6, v 2/6   X4: d 6/6   X5: n 6/6
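To make the estimator concrete, here is a short Python sketch (mine, not the tutorial's code) that recovers these fractions by counting over the six samples above:

from collections import Counter

samples = [s.split() for s in [
    "n v p d n", "n n v d n", "n v p d n",
    "v n p d n", "v n v d n", "n v p d n",
]]

for i in range(len(samples[0])):
    counts = Counter(sample[i] for sample in samples)
    # Empirical marginal: fraction of samples with each tag at position i.
    marginal = {tag: c / len(samples) for tag, c in counts.items()}
    print(f"X{i + 1}:", marginal)
    # e.g. X1: {'n': 0.666…, 'v': 0.333…}, i.e. n 4/6 and v 2/6 as on the slide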
How do we get marginals without sampling?
[Figure: a line of soldiers; each passes a count down the line: "5 behind you", "4 behind you", "3 behind you", "2 behind you", "1 behind you".]
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers. Each soldier only sees his incoming messages. A soldier who hears "2 before you" and "3 behind you", and counts "there's 1 of me", forms the belief: there must be 2 + 1 + 3 = 6 of us.
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers. The next soldier does the same with his own incoming messages: "1 before you" and "4 behind you" give the belief 1 + 1 + 4 = 6 of us, just as "2 before you" and "3 behind you" gave 2 + 1 + 3 = 6 of us. Every soldier computes the same global count from purely local messages.
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree. A soldier whose side branches report "3 here" and "7 here" adds "1 of me" and passes "11 here (= 7 + 3 + 1)" up the remaining branch. The "7 here" report was itself computed the same way from smaller reports (= 3 + 3 + 1).
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree. Combining all incoming reports ("3 here", "7 here", "3 here") with "1 of me", the soldier forms the belief: there must be 3 + 7 + 3 + 1 = 14 of us.
adapted from MacKay (2003) textbook
Message Passing in Belief Propagation
The same idea, with multiplication in place of addition. A variable X and a factor Ψ exchange messages: vectors of nonnegative scores over X's possible values (here, tags v, n, a).
– X tells Ψ: "My other factors think I'm a noun" (the product of X's other incoming messages, e.g. v 1, n 6, a 3).
– Ψ replies: "But my other variables and I think you're a verb" (e.g. v 6, n 1, a 3).
Multiplying all of a variable's incoming messages pointwise gives its belief, here v 6, n 6, a 9.
[Figure: beliefs and messages illustrated on a small factor graph with variables X1, X2, X3 and factors ψ1, ψ2, ψ3.]
Sum-Product Belief Propagation
Variable Belief: the belief at a variable is the pointwise product of all its incoming messages. For X1 with neighboring factors ψ1, ψ2, ψ3:
  from ψ1: v 0.1, n 3, p 1
  from ψ2: v 1, n 2, p 2
  from ψ3: v 4, n 1, p 0
  belief: v 0.4, n 6, p 0   (e.g. 0.1 × 1 × 4 = 0.4)
Sum-Product Belief Propagation
Variable Message: the message from a variable to a factor is the product of the messages from its other factors only. For the message from X1 to ψ3:
  from ψ1: v 0.1, n 3, p 1
  from ψ2: v 1, n 2, p 2
  message: v 0.1, n 6, p 2
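As a sanity check, here is a small numpy sketch (mine, assuming the value order [v, n, p]) that reproduces both of these computations:

import numpy as np

msg_from_psi1 = np.array([0.1, 3.0, 1.0])  # message ψ1 -> X1
msg_from_psi2 = np.array([1.0, 2.0, 2.0])  # message ψ2 -> X1
msg_from_psi3 = np.array([4.0, 1.0, 0.0])  # message ψ3 -> X1

# Variable belief: pointwise product of ALL incoming messages.
belief = msg_from_psi1 * msg_from_psi2 * msg_from_psi3
print(belief)        # [0.4 6.  0. ]

# Variable message X1 -> ψ3: product over the OTHER factors only.
msg_to_psi3 = msg_from_psi1 * msg_from_psi2
print(msg_to_psi3)   # [0.1 6.  2. ]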
Sum-Product Belief Propagation
Factor Belief: the belief at a factor is its potential table multiplied by the incoming message from each of its variables. For the binary factor ψ1 over X1 ∈ {p, d, n} and X3 ∈ {v, n}:
  message from X1: p 4, d 1, n 0
  message from X3: v 8, n 0.2
  potential ψ1:      v     n
                p   0.1    8
                d   3      0
                n   1      1
  belief:            v     n
                p   3.2   6.4   (e.g. 4 × 0.1 × 8 = 3.2)
                d   24    0
                n   0     0
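The same computation in a short numpy sketch (mine; row order [p, d, n], column order [v, n]):

import numpy as np

psi1 = np.array([[0.1, 8.0],   # rows: X1 in [p, d, n]
                 [3.0, 0.0],   # cols: X3 in [v, n]
                 [1.0, 1.0]])
msg_from_x1 = np.array([4.0, 1.0, 0.0])  # message X1 -> ψ1
msg_from_x3 = np.array([8.0, 0.2])       # message X3 -> ψ1

# Factor belief: scale each table entry by the messages on both axes.
belief = psi1 * np.outer(msg_from_x1, msg_from_x3)
print(belief)
# [[ 3.2  6.4]
#  [24.   0. ]
#  [ 0.   0. ]]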
Sum-Product Belief Propagation
Factor Message: the message from a factor to a variable sums out the factor's other variables, weighting the potential table by the messages from those other variables. The message from ψ1 to X1 (summing out X3):
  p: 0.1 × 8 + 8 × 0.2 = 0.8 + 1.6 = 2.4
  d: 3 × 8 + 0 × 0.2 = 24 + 0 = 24
  n: 1 × 8 + 1 × 0.2 = 8 + 0.2 = 8.2
Sum-Product Belief Propagation
Factor Message: for a binary factor, this is just a matrix-vector product: the potential table (as a matrix) times the incoming message vector from the other variable.
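In numpy (same assumed orderings as above), the whole factor message is one line:

import numpy as np

psi1 = np.array([[0.1, 8.0],
                 [3.0, 0.0],
                 [1.0, 1.0]])
msg_from_x3 = np.array([8.0, 0.2])

msg_to_x1 = psi1 @ msg_from_x3   # matrix-vector product sums out X3
print(msg_to_x1)                 # [ 2.4 24.   8.2]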
Sum-Product Belief Propagation
Input: a factor graph with no cycles
Output: exact marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Send messages from the leaves to the root, then from the root back out to the leaves.
3. Compute the beliefs (unnormalized marginals).
4. Normalize the beliefs and return the exact marginals.
Sum-Product Belief Propagation
The four update equations, for variables and factors (N(·) denotes a node's neighbors in the factor graph):
– Variable belief: b_i(x_i) ∝ ∏_{α ∈ N(i)} μ_{α→i}(x_i)
– Variable message: μ_{i→α}(x_i) = ∏_{β ∈ N(i), β ≠ α} μ_{β→i}(x_i)
– Factor belief: b_α(x_α) ∝ ψ_α(x_α) ∏_{i ∈ N(α)} μ_{i→α}(x_i)
– Factor message: μ_{α→i}(x_i) = Σ_{x_α : x_α[i] = x_i} ψ_α(x_α) ∏_{j ∈ N(α), j ≠ i} μ_{j→α}(x_j)
CRF Tagging Model
[Figure: a linear-chain CRF over tag variables X1, X2, X3.]
CRF Tagging by Belief Propagation
[Figure: BP messages flowing forward and backward along the same chain X1, X2, X3.]
So Let's Review Forward-Backward …
[Figure: a tag lattice for X1, X2, X3 with states v, n, a at each position, plus START and END. Each path from START to END is one joint tag assignment, and its score is the product of the potentials along it: transition potentials such as ψ{0,1}(START, v) and ψ{3,4}(a, END), and unary potentials such as ψ{1}(v), ψ{2}(a), ψ{3}(n).]
Forward-Backward Algorithm: Finds Marginals
(found by dynamic programming: matrix-vector products)
[Figure: the same lattice. The forward score α2(v) sums the scores of all partial paths from START to state v at position 2; the backward score β2(v) sums the scores of all partial paths from that state to END.]
The "belief that X2 = v" is proportional to α2(v) · ψ{2}(v) · β2(v), and likewise for the beliefs that X2 = n or X2 = a. Summing the unnormalized beliefs over all values gives Z, the total probability of all paths.
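Here is a self-contained numpy sketch of forward-backward as matrix-vector products. The potential values are made up for illustration (they are not the tutorial's numbers), with tag order [v, n, a]:

import numpy as np

tags = ["v", "n", "a"]
T = np.array([[1.0, 6.0, 2.0],   # transition potentials ψ{t,t+1}[prev, next]
              [8.0, 1.0, 3.0],
              [2.0, 4.0, 1.0]])
E = np.array([[3.0, 5.0, 1.0],   # unary potentials ψ{t}, one row per position
              [1.0, 4.0, 2.0],
              [2.0, 1.0, 5.0]])
start = np.array([1.0, 2.0, 1.0])  # ψ{0,1}(START, ·)
end = np.array([2.0, 1.0, 1.0])    # ψ{3,4}(·, END)

n = len(E)
alpha = [start]                    # alpha[t]: total score of paths into position t
for t in range(1, n):
    alpha.append((alpha[t - 1] * E[t - 1]) @ T)

beta = [None] * n                  # beta[t]: total score of paths from position t to END
beta[n - 1] = end
for t in range(n - 2, -1, -1):
    beta[t] = T @ (E[t + 1] * beta[t + 1])

for t in range(n):
    belief = alpha[t] * E[t] * beta[t]   # belief that X_{t+1} takes each tag
    Z = belief.sum()                     # the same total at every position
    print(f"X{t + 1}:", dict(zip(tags, (belief / Z).round(3))), "Z =", Z)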
find preferred tags
[Figure: a factor graph over "time flies like an arrow": tag variables X1–X5 with unary factors ψ1, ψ3, ψ5, ψ7, ψ9 linking tags to words, plus variables X6–X9 attached above the chain through factors ψ10, ψ11, ψ12, ψ13.]
(Acyclic) Belief Propagation
In a factor graph with no cycles:
1. Pick any node to serve as the root.
2. Send messages from the leaves to the root.
3. Send messages from the root to the leaves.
A node computes an outgoing message along an edge
only after it has received incoming messages along all its other edges.
[Figure: the same tree-shaped factor graph; messages flow first from the leaves in toward the chosen root, then back out from the root to the leaves.]
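To make the schedule concrete, here is a compact Python sketch of exact sum-product BP on a tiny acyclic factor graph. It is my own illustration (unary and binary factors only), not the tutorial's software; the recursion performs the leaf-to-root bookkeeping implicitly, since each message depends only on messages from the subgraph behind it:

import numpy as np

# Tiny tree: psi_1 -- X1 -- psi_a -- X2, with value order [v, n].
domain = {"X1": 2, "X2": 2}
factors = {
    "psi_1": (("X1",), np.array([3.0, 1.0])),
    "psi_a": (("X1", "X2"), np.array([[1.0, 4.0],
                                      [2.0, 1.0]])),
}
var_neighbors = {"X1": ["psi_1", "psi_a"], "X2": ["psi_a"]}

def msg_var_to_fac(v, f):
    # Variable message: product of the messages from v's OTHER factors.
    out = np.ones(domain[v])
    for g in var_neighbors[v]:
        if g != f:
            out = out * msg_fac_to_var(g, v)
    return out

def msg_fac_to_var(f, v):
    # Factor message: multiply in the other variable's message, sum it out.
    scope, table = factors[f]
    if len(scope) == 1:
        return table
    if scope[0] == v:
        return table @ msg_var_to_fac(scope[1], f)
    return table.T @ msg_var_to_fac(scope[0], f)

def marginal(v):
    # Variable belief: product of all incoming messages, normalized.
    b = np.ones(domain[v])
    for f in var_neighbors[v]:
        b = b * msg_fac_to_var(f, v)
    return b / b.sum()

print(marginal("X1"))  # [0.8333 0.1667]  (∝ [15, 3])
print(marginal("X2"))  # [0.2778 0.7222]  (∝ [5, 13])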
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H.
[Figure: the tree-shaped factor graph around a variable Xi, divided into subgraphs F, G, and H, one per edge at Xi.]
Figure adapted from Burkett & Klein (2012)
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H. The marginal of Xi in that smaller model is the message sent to Xi from subgraph H, and likewise the marginals of Xi in the models built from subgraphs G and F are the messages sent to Xi from those subgraphs (the messages to a variable).
[Figure: the same decomposition, shown once per subgraph H, G, F.]
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph F ∪ H. The marginal of Xi in that smaller model is the message sent by Xi out of subgraph F ∪ H (the message from a variable).
[Figure: Xi passing the combined message out of F ∪ H toward subgraph G.]
Acyclic BP as Dynamic Programming
• If you want the marginal p_i(x_i) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.
• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.
[Figure: the same tree-shaped factor graph drawn once per cut; each cut edge defines one subgraph and hence one message.]
What can go wrong with loopy BP?
[Figure: four binary (T/F) variables arranged in a cycle; all 4 factors on the cycle enforce equality. One self-consistent configuration sets every variable to F, the other sets every variable to T.]
[Figure: the all-T configuration: "Obama owns it" because Alice says so, Bob says so, Charlie says so, and Kathy says so. Each variable's belief comes out as T 2, F 1; the messages circulating around the cycle keep reinforcing the same story.]
A lie told often enough becomes truth. -- Lenin
What can go wrong with loopy BP?
Better model: Rush can influence the conversation.
– Now there are 2 ways to explain why everyone's repeating the story: it's true, or Rush said it was. (Actually 4 ways, but "both" has a low prior and "neither" has a low likelihood, so only 2 good ways.)
– The model favors one solution (probably Rush).
– Yet BP has 2 stable solutions. Each solution is self-reinforcing around cycles; no impetus to switch.
If everyone blames Obama, then no one has to blame Rush. But if no one blames Rush, then everyone has to continue to blame Obama (to explain the gossip).
[Figure: the same cycle of "says so" variables (Alice, Bob, Charlie, Kathy) with an extra "Rush says so" variable; the unary tables shown include T 1, F 99 on "Obama owns it" and T 1, F 24 on "Rush says so", while the beliefs on the cycle variables come out undetermined (T ???, F ???).]
A lie told often enough becomes truth. -- Lenin
Loopy Belief Propagation Algorithm
• Run the BP update equations on a cyclic graph
– Hope it “works” anyway (good approximation)
• Though we multiply messages that aren’t independent
• No interpretation as dynamic programming
– If largest element of a message gets very big or small,
• Divide the message by a constant to prevent over/underflow
• Can update messages in any order
– Stop when the normalized messages converge
• Compute beliefs from final messages
– Return normalized beliefs as approximate marginals
e.g., Murphy, Weiss & Jordan (1999)
Loopy Belief Propagation
Input: a factor graph with cycles
Output: approximate marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Update messages (in any order) until the normalized messages converge.
3. Compute the beliefs from the final messages.
4. Normalize the beliefs and return them as the approximate marginals.
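To close the loop, here is a minimal Python sketch (mine, not the tutorial's software) of loopy BP on the 4-variable equality cycle from the earlier slides, with a single weak unary bias. It exhibits exactly the pathology those slides describe: the one piece of evidence is recounted on every trip around the cycle, so the beliefs become far more confident than the exact marginals:

import numpy as np

EQ = np.eye(2)   # equality factor over (T, F): ψ(x, y) = 1 iff x == y
n = 4
unary = [np.array([2.0, 1.0])] + [np.ones(2) for _ in range(n - 1)]

cw = [np.ones(2) / 2 for _ in range(n)]    # cw[i]: clockwise message into variable i
ccw = [np.ones(2) / 2 for _ in range(n)]   # ccw[i]: counterclockwise message into i

for _ in range(50):
    # Each new message multiplies the upstream variable's unary potential
    # into the message that arrived at it, then passes through the factor.
    new_cw = [EQ @ (unary[(i - 1) % n] * cw[(i - 1) % n]) for i in range(n)]
    new_ccw = [EQ @ (unary[(i + 1) % n] * ccw[(i + 1) % n]) for i in range(n)]
    cw = [m / m.sum() for m in new_cw]     # normalize to prevent underflow
    ccw = [m / m.sum() for m in new_ccw]

for i in range(n):
    b = unary[i] * cw[i] * ccw[i]
    print(f"X{i}: T = {b[0] / b.sum():.3f}")
# Every belief approaches T = 1.000, although the exact marginal is only
# T = 2/3 (the all-T configuration has weight 2, the all-F configuration 1).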