BP Basics
Outline
• Do you want to push past the simple NLP models (logistic
regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic
structures.
2. Algorithm: BP estimates the global effect of these interactions on
each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming
algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true
predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
Factor Graph Notation
• Variables: X1, X2, …, X9
• Factors: unary factors ψ{1}, ψ{2}, ψ{3}, …; pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, …; ternary factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}
• Joint Distribution: p(X=x) = (1/Z) ∏_A ψ_A(x_A), the normalized product of all the factor values
[Figure: a factor graph for tagging "time flies like an arrow": tag variables X1–X4 form a chain through the pairwise factors, each tag also has a unary factor over the tag set (e.g., v 3, n 4, p 0.1, d 0.1), and variables X6–X9 attach to the chain through the ternary factors.]
Inference
Given a factor graph, two common tasks …
– Compute the most likely joint assignment, x* = argmax_x p(X=x)
– Compute the marginal distribution of variable Xi: p(Xi=xi) for each value xi
[Figure: a linear-chain factor graph for tagging "time flies like an arrow": tag variables X0–X5 with X0 = <START>, pairwise factors ψ0, ψ2, ψ4, ψ6, ψ8 linking adjacent tags, and unary factors ψ1, ψ3, ψ5, ψ7, ψ9 linking each tag to its word.]
Marginals by Sampling on Factor Graph
The marginal p(Xi = xi) gives the probability that variable Xi takes value xi in a random sample. Draw samples from the joint distribution:
Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n
[Figure: the same linear-chain factor graph over "time flies like an arrow"; each sample is one joint tag assignment for X1–X5.]
Marginals by Sampling on Factor Graph
Estimate each marginal as the fraction of samples in which the variable takes that value:
X1: n 4/6, v 2/6   X2: v 3/6, n 3/6   X3: p 4/6, v 2/6   X4: d 6/6   X5: n 6/6
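To make the estimator concrete, here is a short Python sketch (mine, not the tutorial's code) that recovers these fractions by counting over the six samples above:

from collections import Counter

samples = [s.split() for s in [
    "n v p d n", "n n v d n", "n v p d n",
    "v n p d n", "v n v d n", "n v p d n",
]]

for i in range(len(samples[0])):
    counts = Counter(sample[i] for sample in samples)
    # Empirical marginal: fraction of samples with each tag at position i.
    marginal = {tag: c / len(samples) for tag, c in counts.items()}
    print(f"X{i + 1}:", marginal)
    # e.g. X1: {'n': 0.666…, 'v': 0.333…}, i.e. n 4/6 and v 2/6 as on the slide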
How do we get marginals without sampling?
[Figure: a line of soldiers; each passes a count down the line: "5 behind you", "4 behind you", "3 behind you", "2 behind you", "1 behind you".]
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers. Each soldier only sees his incoming messages. A soldier who hears "2 before you" and "3 behind you", and counts "there's 1 of me", forms the belief: there must be 2 + 1 + 3 = 6 of us.
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers. The next soldier does the same with his own incoming messages: "1 before you" and "4 behind you" give the belief 1 + 1 + 4 = 6 of us, just as "2 before you" and "3 behind you" gave 2 + 1 + 3 = 6 of us. Every soldier computes the same global count from purely local messages.
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree. A soldier whose side branches report "3 here" and "7 here" adds "1 of me" and passes "11 here (= 7 + 3 + 1)" up the remaining branch. The "7 here" report was itself computed the same way from smaller reports (= 3 + 3 + 1).
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree. Combining all incoming reports ("3 here", "7 here", "3 here") with "1 of me", the soldier forms the belief: there must be 3 + 7 + 3 + 1 = 14 of us.
adapted from MacKay (2003) textbook
Message Passing in Belief Propagation
The same idea, with multiplication in place of addition. A variable X and a factor Ψ exchange messages: vectors of nonnegative scores over X's possible values (here, tags v, n, a).
– X tells Ψ: "My other factors think I'm a noun" (the product of X's other incoming messages, e.g. v 1, n 6, a 3).
– Ψ replies: "But my other variables and I think you're a verb" (e.g. v 6, n 1, a 3).
Multiplying all of a variable's incoming messages pointwise gives its belief, here v 6, n 6, a 9.
[Figure: beliefs and messages illustrated on a small factor graph with variables X1, X2, X3 and factors ψ1, ψ2, ψ3.]
Sum-Product Belief Propagation
Variable Belief: the belief at a variable is the pointwise product of all its incoming messages. For X1 with neighboring factors ψ1, ψ2, ψ3:
  from ψ1: v 0.1, n 3, p 1
  from ψ2: v 1, n 2, p 2
  from ψ3: v 4, n 1, p 0
  belief: v 0.4, n 6, p 0   (e.g. 0.1 × 1 × 4 = 0.4)
Sum-Product Belief Propagation
Variable Message: the message from a variable to a factor is the product of the messages from its other factors only. For the message from X1 to ψ3:
  from ψ1: v 0.1, n 3, p 1
  from ψ2: v 1, n 2, p 2
  message: v 0.1, n 6, p 2
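As a sanity check, here is a small numpy sketch (mine, assuming the value order [v, n, p]) that reproduces both of these computations:

import numpy as np

msg_from_psi1 = np.array([0.1, 3.0, 1.0])  # message ψ1 -> X1
msg_from_psi2 = np.array([1.0, 2.0, 2.0])  # message ψ2 -> X1
msg_from_psi3 = np.array([4.0, 1.0, 0.0])  # message ψ3 -> X1

# Variable belief: pointwise product of ALL incoming messages.
belief = msg_from_psi1 * msg_from_psi2 * msg_from_psi3
print(belief)        # [0.4 6.  0. ]

# Variable message X1 -> ψ3: product over the OTHER factors only.
msg_to_psi3 = msg_from_psi1 * msg_from_psi2
print(msg_to_psi3)   # [0.1 6.  2. ]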
Sum-Product Belief Propagation
Factor Belief: the belief at a factor is its potential table multiplied by the incoming message from each of its variables. For the binary factor ψ1 over X1 ∈ {p, d, n} and X3 ∈ {v, n}:
  message from X1: p 4, d 1, n 0
  message from X3: v 8, n 0.2
  potential ψ1:      v     n
                p   0.1    8
                d   3      0
                n   1      1
  belief:            v     n
                p   3.2   6.4   (e.g. 4 × 0.1 × 8 = 3.2)
                d   24    0
                n   0     0
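The same computation in a short numpy sketch (mine; row order [p, d, n], column order [v, n]):

import numpy as np

psi1 = np.array([[0.1, 8.0],   # rows: X1 in [p, d, n]
                 [3.0, 0.0],   # cols: X3 in [v, n]
                 [1.0, 1.0]])
msg_from_x1 = np.array([4.0, 1.0, 0.0])  # message X1 -> ψ1
msg_from_x3 = np.array([8.0, 0.2])       # message X3 -> ψ1

# Factor belief: scale each table entry by the messages on both axes.
belief = psi1 * np.outer(msg_from_x1, msg_from_x3)
print(belief)
# [[ 3.2  6.4]
#  [24.   0. ]
#  [ 0.   0. ]]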
Sum-Product Belief Propagation
Factor Message: the message from a factor to a variable sums out the factor's other variables, weighting the potential table by the messages from those other variables. The message from ψ1 to X1 (summing out X3):
  p: 0.1 × 8 + 8 × 0.2 = 0.8 + 1.6 = 2.4
  d: 3 × 8 + 0 × 0.2 = 24 + 0 = 24
  n: 1 × 8 + 1 × 0.2 = 8 + 0.2 = 8.2
Sum-Product Belief Propagation
Factor Message: for a binary factor, this is just a matrix-vector product: the potential table (as a matrix) times the incoming message vector from the other variable.
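In numpy (same assumed orderings as above), the whole factor message is one line:

import numpy as np

psi1 = np.array([[0.1, 8.0],
                 [3.0, 0.0],
                 [1.0, 1.0]])
msg_from_x3 = np.array([8.0, 0.2])

msg_to_x1 = psi1 @ msg_from_x3   # matrix-vector product sums out X3
print(msg_to_x1)                 # [ 2.4 24.   8.2]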
Sum-Product Belief Propagation
Input: a factor graph with no cycles
Output: exact marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Send messages from the leaves to the root, then from the root back out to the leaves.
3. Compute the beliefs (unnormalized marginals).
4. Normalize the beliefs and return the exact marginals.
Sum-Product Belief Propagation
The four update equations, for variables and factors (N(·) denotes a node's neighbors in the factor graph):
– Variable belief: b_i(x_i) ∝ ∏_{α ∈ N(i)} μ_{α→i}(x_i)
– Variable message: μ_{i→α}(x_i) = ∏_{β ∈ N(i), β ≠ α} μ_{β→i}(x_i)
– Factor belief: b_α(x_α) ∝ ψ_α(x_α) ∏_{i ∈ N(α)} μ_{i→α}(x_i)
– Factor message: μ_{α→i}(x_i) = Σ_{x_α : x_α[i] = x_i} ψ_α(x_α) ∏_{j ∈ N(α), j ≠ i} μ_{j→α}(x_j)
CRF Tagging Model
[Figure: a linear-chain CRF over tag variables X1, X2, X3.]
CRF Tagging by Belief Propagation
[Figure: BP messages flowing forward and backward along the same chain X1, X2, X3.]
So Let's Review Forward-Backward …
[Figure: a tag lattice for X1, X2, X3 with states v, n, a at each position, plus START and END. Each path from START to END is one joint tag assignment, and its score is the product of the potentials along it: transition potentials such as ψ{0,1}(START, v) and ψ{3,4}(a, END), and unary potentials such as ψ{1}(v), ψ{2}(a), ψ{3}(n).]
Forward-Backward Algorithm: Finds Marginals
(found by dynamic programming: matrix-vector products)
[Figure: the same lattice. The forward score α2(v) sums the scores of all partial paths from START to state v at position 2; the backward score β2(v) sums the scores of all partial paths from that state to END.]
The "belief that X2 = v" is proportional to α2(v) · ψ{2}(v) · β2(v), and likewise for the beliefs that X2 = n or X2 = a. Summing the unnormalized beliefs over all values gives Z, the total probability of all paths.
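Here is a self-contained numpy sketch of forward-backward as matrix-vector products. The potential values are made up for illustration (they are not the tutorial's numbers), with tag order [v, n, a]:

import numpy as np

tags = ["v", "n", "a"]
T = np.array([[1.0, 6.0, 2.0],   # transition potentials ψ{t,t+1}[prev, next]
              [8.0, 1.0, 3.0],
              [2.0, 4.0, 1.0]])
E = np.array([[3.0, 5.0, 1.0],   # unary potentials ψ{t}, one row per position
              [1.0, 4.0, 2.0],
              [2.0, 1.0, 5.0]])
start = np.array([1.0, 2.0, 1.0])  # ψ{0,1}(START, ·)
end = np.array([2.0, 1.0, 1.0])    # ψ{3,4}(·, END)

n = len(E)
alpha = [start]                    # alpha[t]: total score of paths into position t
for t in range(1, n):
    alpha.append((alpha[t - 1] * E[t - 1]) @ T)

beta = [None] * n                  # beta[t]: total score of paths from position t to END
beta[n - 1] = end
for t in range(n - 2, -1, -1):
    beta[t] = T @ (E[t + 1] * beta[t + 1])

for t in range(n):
    belief = alpha[t] * E[t] * beta[t]   # belief that X_{t+1} takes each tag
    Z = belief.sum()                     # the same total at every position
    print(f"X{t + 1}:", dict(zip(tags, (belief / Z).round(3))), "Z =", Z)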
find preferred tags
[Figure: a factor graph over "time flies like an arrow": tag variables X1–X5 with unary factors ψ1, ψ3, ψ5, ψ7, ψ9 linking tags to words, plus variables X6–X9 attached above the chain through factors ψ10, ψ11, ψ12, ψ13.]
(Acyclic) Belief Propagation
In a factor graph with no cycles:
1. Pick any node to serve as the root.
2. Send messages from the leaves to the root.
3. Send messages from the root to the leaves.
A node computes an outgoing message along an edge
only after it has received incoming messages along all its other edges.
[Figure: the same tree-shaped factor graph; messages flow first from the leaves in toward the chosen root, then back out from the root to the leaves.]
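To make the schedule concrete, here is a compact Python sketch of exact sum-product BP on a tiny acyclic factor graph. It is my own illustration (unary and binary factors only), not the tutorial's software; the recursion performs the leaf-to-root bookkeeping implicitly, since each message depends only on messages from the subgraph behind it:

import numpy as np

# Tiny tree: psi_1 -- X1 -- psi_a -- X2, with value order [v, n].
domain = {"X1": 2, "X2": 2}
factors = {
    "psi_1": (("X1",), np.array([3.0, 1.0])),
    "psi_a": (("X1", "X2"), np.array([[1.0, 4.0],
                                      [2.0, 1.0]])),
}
var_neighbors = {"X1": ["psi_1", "psi_a"], "X2": ["psi_a"]}

def msg_var_to_fac(v, f):
    # Variable message: product of the messages from v's OTHER factors.
    out = np.ones(domain[v])
    for g in var_neighbors[v]:
        if g != f:
            out = out * msg_fac_to_var(g, v)
    return out

def msg_fac_to_var(f, v):
    # Factor message: multiply in the other variable's message, sum it out.
    scope, table = factors[f]
    if len(scope) == 1:
        return table
    if scope[0] == v:
        return table @ msg_var_to_fac(scope[1], f)
    return table.T @ msg_var_to_fac(scope[0], f)

def marginal(v):
    # Variable belief: product of all incoming messages, normalized.
    b = np.ones(domain[v])
    for f in var_neighbors[v]:
        b = b * msg_fac_to_var(f, v)
    return b / b.sum()

print(marginal("X1"))  # [0.8333 0.1667]  (∝ [15, 3])
print(marginal("X2"))  # [0.2778 0.7222]  (∝ [5, 13])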
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H.
[Figure: the tree-shaped factor graph around a variable Xi, divided into subgraphs F, G, and H, one per edge at Xi.]
Figure adapted from Burkett & Klein (2012)
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph H. The marginal of Xi in that smaller model is the message sent to Xi from subgraph H, and likewise the marginals of Xi in the models built from subgraphs G and F are the messages sent to Xi from those subgraphs (the messages to a variable).
[Figure: the same decomposition, shown once per subgraph H, G, F.]
Acyclic BP as Dynamic Programming
Subproblem: inference using just the factors in subgraph F ∪ H. The marginal of Xi in that smaller model is the message sent by Xi out of subgraph F ∪ H (the message from a variable).
[Figure: Xi passing the combined message out of F ∪ H toward subgraph G.]
Acyclic BP as Dynamic Programming
• If you want the marginal p_i(x_i) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.
• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.
[Figure: the same tree-shaped factor graph drawn once per cut; each cut edge defines one subgraph and hence one message.]
What can go wrong with loopy BP?
[Figure: four binary (T/F) variables arranged in a cycle; all 4 factors on the cycle enforce equality. One self-consistent configuration sets every variable to F, the other sets every variable to T.]
[Figure: the all-T configuration: "Obama owns it" because Alice says so, Bob says so, Charlie says so, and Kathy says so. Each variable's belief comes out as T 2, F 1; the messages circulating around the cycle keep reinforcing the same story.]
A lie told often enough becomes truth. -- Lenin
What can go wrong with loopy BP?
Better model: Rush can influence the conversation.
– Now there are 2 ways to explain why everyone's repeating the story: it's true, or Rush said it was. (Actually 4 ways, but "both" has a low prior and "neither" has a low likelihood, so only 2 good ways.)
– The model favors one solution (probably Rush).
– Yet BP has 2 stable solutions. Each solution is self-reinforcing around cycles; no impetus to switch.
If everyone blames Obama, then no one has to blame Rush. But if no one blames Rush, then everyone has to continue to blame Obama (to explain the gossip).
[Figure: the same cycle of "says so" variables (Alice, Bob, Charlie, Kathy) with an extra "Rush says so" variable; the unary tables shown include T 1, F 99 on "Obama owns it" and T 1, F 24 on "Rush says so", while the beliefs on the cycle variables come out undetermined (T ???, F ???).]
A lie told often enough becomes truth. -- Lenin
Loopy Belief Propagation Algorithm
• Run the BP update equations on a cyclic graph
– Hope it “works” anyway (good approximation)
• Though we multiply messages that aren’t independent
• No interpretation as dynamic programming
– If largest element of a message gets very big or small,
• Divide the message by a constant to prevent over/underflow
• Can update messages in any order
– Stop when the normalized messages converge
• Compute beliefs from final messages
– Return normalized beliefs as approximate marginals
e.g., Murphy, Weiss & Jordan (1999)
Loopy Belief Propagation
Input: a factor graph with cycles
Output: approximate marginals for each variable and factor
Algorithm:
1. Initialize the messages to the uniform distribution.
2. Update messages (in any order) until the normalized messages converge.
3. Compute the beliefs from the final messages.
4. Normalize the beliefs and return them as the approximate marginals.
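To close the loop, here is a minimal Python sketch (mine, not the tutorial's software) of loopy BP on the 4-variable equality cycle from the earlier slides, with a single weak unary bias. It exhibits exactly the pathology those slides describe: the one piece of evidence is recounted on every trip around the cycle, so the beliefs become far more confident than the exact marginals:

import numpy as np

EQ = np.eye(2)   # equality factor over (T, F): ψ(x, y) = 1 iff x == y
n = 4
unary = [np.array([2.0, 1.0])] + [np.ones(2) for _ in range(n - 1)]

cw = [np.ones(2) / 2 for _ in range(n)]    # cw[i]: clockwise message into variable i
ccw = [np.ones(2) / 2 for _ in range(n)]   # ccw[i]: counterclockwise message into i

for _ in range(50):
    # Each new message multiplies the upstream variable's unary potential
    # into the message that arrived at it, then passes through the factor.
    new_cw = [EQ @ (unary[(i - 1) % n] * cw[(i - 1) % n]) for i in range(n)]
    new_ccw = [EQ @ (unary[(i + 1) % n] * ccw[(i + 1) % n]) for i in range(n)]
    cw = [m / m.sum() for m in new_cw]     # normalize to prevent underflow
    ccw = [m / m.sum() for m in new_ccw]

for i in range(n):
    b = unary[i] * cw[i] * ccw[i]
    print(f"X{i}: T = {b[0] / b.sum():.3f}")
# Every belief approaches T = 1.000, although the exact marginal is only
# T = 2/3 (the all-T configuration has weight 2, the all-F configuration 1).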