Inference On Relational Models Using Markov Chain Monte Carlo
Brian Milch
Massachusetts Institute of Technology
UAI Tutorial
July 19, 2007
Example 1: Bibliographies
[Figure: noisy citation strings (“Rus...”, “AI...”, “AI: A...”, “Seuss”, “Shak...”, “Hamlet”, “Tempest”, ...) grouped by the underlying authors and publications]
• Approximate P(Q|E) as the fraction of samples s1, s2, ... satisfying E that also satisfy query Q
5
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
6
Simple Example: Clustering
[Figure: histogram of bird wingspans forming three clusters with parameters θ1 = 22, θ2 = 49, θ3 = 80; x-axis: Wingspan (cm), 10–100]
7
Simple Bayesian Mixture Model
• Number of latent objects is known to be k
• For each latent object i, have parameter
  θi ~ Uniform[0, 100]
• For each data point j, have object selector
  Cj ~ Uniform({1, ..., k})
and observable value
  Xj ~ Normal(θCj, 5²)
8
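As a concrete sketch, the generative process above fits in a few lines of Python (standard library only; the values k = 3, n = 100, and the seed are illustrative choices, not from the slides):

```python
import random

def sample_mixture(k=3, n=100, seed=0):
    """Forward-sample the slide's mixture model:
    theta_i ~ Uniform[0, 100], C_j ~ Uniform({1, ..., k}),
    X_j ~ Normal(theta_{C_j}, 5^2)."""
    rng = random.Random(seed)
    theta = [rng.uniform(0, 100) for _ in range(k)]   # latent parameters
    C = [rng.randrange(k) for _ in range(n)]          # object selectors
    X = [rng.gauss(theta[c], 5.0) for c in C]         # observed wingspans
    return theta, C, X

theta, C, X = sample_mixture()
```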
BN for Mixture Model
[BN: parameter nodes θ1, θ2, ..., θk; selector nodes C1, C2, C3, ..., Cn; each observation Xj (X1, X2, X3, ..., Xn) has parents Cj and θ1, ..., θk]
9
Context-Specific Dependencies
[BN as on the previous slide, with context-specific edges: given Cj = i, Xj depends only on θi; e.g. with C1 = 2, C2 = 1, C3 = 2, the active edges into X1, X2, X3 come from θ2, θ1, θ2]
10
Extensions to Mixture Model
• Random number of latent objects k, with
distribution p(k) such as:
– Uniform({1, ..., 100})
– Geometric(0.1) (unbounded!)
– Poisson(10) (unbounded!)
• Random distribution for selecting objects
– p(· | k) ~ Dirichlet(α1, ..., αk)
(Dirichlet: distribution over probability vectors)
– Still symmetric: each αi = α/k
11
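A sketch of these extensions in Python (standard library only; the gamma-normalization construction is the standard way to draw from a Dirichlet, and the total concentration α = 1 is an illustrative choice):

```python
import random

rng = random.Random(1)

def sample_k_geometric(p=0.1):
    """Geometric(p): number of trials until first success.
    Support {1, 2, ...} is unbounded."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def sample_dirichlet(alphas):
    """Draw a probability vector: independent Gammas, then normalize."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

k = sample_k_geometric()
alpha = 1.0                               # total concentration (illustrative)
p = sample_dirichlet([alpha / k] * k)     # symmetric: each alpha_i = alpha / k
```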
Existence versus Observation
• A latent object can exist even if no observations
correspond to it
– Bird species may not be observed yet
– Aircraft may fly over without yielding any blips
• Two questions:
– How many objects correspond to observations?
– How many objects are there in total?
– Observed 3 species, each 100 times: probably no more exist
– Observed 200 species, each 1 or 2 times: probably more exist
12
Expecting Additional Objects
[Figure: r observed species, plus possible additional unobserved species; will we observe more later?]
15
Mistake 1: Ignoring
Interchangeability
• Which birds are in species S1?
[Figure: birds B1–B5 plotted by wingspan, forming clusters]
• Latent object indices are
interchangeable
– Posterior on selector variable CB1 is uniform
– Posterior on θS1 has a peak for each cluster of birds
• Really care about partition of observations
{{1, 3}, {2}, {4, 5}}
• Partition with r blocks corresponds to k! / (k-r)!
instantiations of the Cj variables
(1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), …
16
Ignoring Interchangeability, Cont’d
• Say k = 4. What’s prior probability that
B1, B3 are in one species, B2 in another?
• Multiply probabilities for CB1, CB2, CB3:
(1/4) x (1/4) x (1/4)
• Not enough! Partition {{B1, B3}, {B2}}
corresponds to 12 instantiations of C’s
(S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2)
(S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
• Partition with r blocks corresponds to kPr
instantiations
17
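This count is easy to verify by brute force; a small Python check (observations and cluster labels are 0-based here):

```python
from itertools import product

def partition_of(assignment):
    """Map a tuple of selector values to the partition of
    observations it induces."""
    blocks = {}
    for obs, cluster in enumerate(assignment):
        blocks.setdefault(cluster, set()).add(obs)
    return frozenset(frozenset(b) for b in blocks.values())

k = 4
# Partition {{B1, B3}, {B2}} with observations numbered 0, 1, 2
target = frozenset({frozenset({0, 2}), frozenset({1})})
matches = [a for a in product(range(k), repeat=3)
           if partition_of(a) == target]
count = len(matches)   # kPr with k = 4, r = 2 blocks: 4 * 3 = 12
```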
Mistake 2: Underestimating the
Bayesian Ockham’s Razor Effect
• Say k = 4. Are B1 and B2 in same species?
[Figure: observations XB1 = 50, XB2 = 52 on the wingspan axis (cm)]
18
Bayesian Ockham’s Razor
[Figure: XB1 = 50, XB2 = 52 on the wingspan axis]

H1: Partition is {{B1, B2}}
p(H1, data) = 4P1 · (1/4)² · ∫₀¹⁰⁰ p(θ1) p(x1 | θ1) p(x2 | θ1) dθ1 ≈ 1.3 × 10⁻⁴
(here p(θ1) = 0.01, the density of Uniform(0, 100))

H2: Partition is {{B1}, {B2}}
p(H2, data) = 4P2 · (1/4)² · ∫₀¹⁰⁰ p(θ1) p(x1 | θ1) dθ1 · ∫₀¹⁰⁰ p(θ2) p(x2 | θ2) dθ2 ≈ 7.5 × 10⁻⁵
Don’t use more latent objects than necessary to explain your data
[MacKay 1992] 19
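The two joint probabilities can be reproduced numerically; a sketch in Python (midpoint-rule integration is my choice; the constants 4P1 = 4, 4P2 = 12, and p(θ) = 0.01 come from the slide):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule; adequate for these smooth integrands."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

x1, x2, sigma = 50.0, 52.0, 5.0
p_theta = 0.01   # density of Uniform(0, 100)

# H1: one shared parameter, integrated out
pH1 = 4 * (1 / 4) ** 2 * integrate(
    lambda t: p_theta * normal_pdf(x1, t, sigma) * normal_pdf(x2, t, sigma), 0, 100)

# H2: two independent parameters, each integrated out
pH2 = 12 * (1 / 4) ** 2 * (
    integrate(lambda t: p_theta * normal_pdf(x1, t, sigma), 0, 100)
    * integrate(lambda t: p_theta * normal_pdf(x2, t, sigma), 0, 100))
```

This reproduces the slide's numbers: pH1 ≈ 1.3 × 10⁻⁴ beats pH2 = 7.5 × 10⁻⁵, so the one-cluster hypothesis wins.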
Mistake 3: Comparing Densities
Across Dimensions
[Figure: XB1 = 50, XB2 = 52 on the wingspan axis (cm)]

H1: Partition is {{B1, B2}}, θ = 51
p(H1, data) = 4P1 · (1/4)² · 0.01 · N(50; 51, 5²) · N(52; 51, 5²) ≈ 1.5 × 10⁻⁵ (H1 wins by a greater margin)

H2: Partition is {{B1}, {B2}}
p(H2, data) = 4P2 · (1/4)² · 0.01 · N(50; 50, 5²) · 0.01 · N(52; 52, 5²) ≈ 4.8 × 10⁻⁷
20
What If We Change the Units?
[Figure: XB1 = 0.50, XB2 = 0.52 on the wingspan axis (m)]

H1: Partition is {{B1, B2}}, θ = 0.51
p(H1, data) = 4P1 · (1/4)² · 1 · N(0.50; 0.51, 0.05²) · N(0.52; 0.51, 0.05²) ≈ 15
(density of Uniform(0, 1) is 1!)

H2: Partition is {{B1}, {B2}}
p(H2, data) = 4P2 · (1/4)² · 1 · N(0.50; 0.50, 0.05²) · 1 · N(0.52; 0.52, 0.05²) ≈ 48 (now H2 wins by a landslide)
21
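The flip can be reproduced directly; a Python sketch (fixed-parameter scoring as on the slides; the helper names are mine):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def score_one_cluster(x1, x2, prior_density, sigma):
    """H1: shared parameter fixed at the midpoint of the observations."""
    mid = (x1 + x2) / 2
    return 4 * (1 / 4) ** 2 * prior_density * normal_pdf(x1, mid, sigma) * normal_pdf(x2, mid, sigma)

def score_two_clusters(x1, x2, prior_density, sigma):
    """H2: each parameter fixed at its own observation."""
    return 12 * (1 / 4) ** 2 * (prior_density * normal_pdf(x1, x1, sigma)) \
                             * (prior_density * normal_pdf(x2, x2, sigma))

cm = (score_one_cluster(50, 52, 0.01, 5.0), score_two_clusters(50, 52, 0.01, 5.0))
m = (score_one_cluster(0.50, 0.52, 1.0, 0.05), score_two_clusters(0.50, 0.52, 1.0, 0.05))
# In cm, H1's density is higher; in m, H2's is: same model, different "winner"
```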
Lesson: Comparing Densities
Across Dimensions
• Densities don’t behave like probabilities
(e.g., they can be greater than 1)
• Heights of density peaks in spaces of
different dimension are not comparable
• Work-arounds:
– Find most likely partition first, then most likely
parameters given that partition
– Find region in parameter space where most of
the posterior probability mass lies
22
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
23
Why Not Exact Inference?
• Number of possible partitions is
superexponential in n
• Variable elimination?
– Summing out θi couples all the Cj's
– Summing out Cj couples all the θi's
[BN: θ1, θ2, ..., θk; X1, X2, X3, ..., Xn; C1, C2, C3, ..., Cn]
24
Markov Chain Monte Carlo
(MCMC)
• Start in arbitrary state (possible world) s1
satisfying evidence E
• Sample s2, s3, ... according to transition
kernel T(si, si+1), yielding a Markov chain
• Approximate p(Q | E) by the fraction of
s1, s2, ..., sL that are in Q
[Figure: the set of worlds satisfying E, with query region Q and the chain moving among worlds in E]
25
Why a Markov Chain?
• Why use Markov chain rather than
sampling independently?
– Stochastic local search for high-probability s
– Once we find such s, explore around it
26
Convergence
• A stationary distribution π is one such that
  Σs π(s) T(s, s′) = π(s′) for all s′
29
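The stationarity condition is easy to check numerically for a small chain; a Python sketch (the symmetric 3-state kernel is an illustrative example, and its stationary distribution is uniform):

```python
def step(pi, T):
    """One application of the kernel: pi'(s') = sum_s pi(s) T(s, s')."""
    n = len(pi)
    return [sum(pi[s] * T[s][sp] for s in range(n)) for sp in range(n)]

# A symmetric 3-state kernel; rows sum to 1
T = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
pi = [1 / 3, 1 / 3, 1 / 3]          # candidate stationary distribution

# pi is unchanged by one transition
assert all(abs(a - b) < 1e-12 for a, b in zip(step(pi, T), pi))

# From an arbitrary start, repeated transitions converge to pi
p = [1.0, 0.0, 0.0]
for _ in range(100):
    p = step(p, T)
```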
Gibbs on Bayesian Mixture Model
• Given current state s:
– Resample each θi given its prior and
{Xj : Cj = i in s}
– Resample each Cj given Xj and θ1:k
(its context-specific Markov blanket)
[BN: θ1, θ2, ..., θk; X1, X2, X3, ..., Xn; C1, C2, C3, ..., Cn]
[Neal 2000] 30
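A compact Gibbs sampler for this model, as a sketch: the grid approximation for θ's conditional, the informative starting state, and the toy data are my choices (a conjugate prior would allow exact resampling):

```python
import math
import random

def gibbs_mixture(X, k, iters=200, sigma=5.0, seed=0):
    """Gibbs sampling for the mixture model: alternately resample each
    C_j given X_j and theta_1..k (its context-specific Markov blanket),
    and each theta_i given {X_j : C_j = i}, via a grid approximation
    of its posterior over [0, 100]."""
    rng = random.Random(seed)
    # Arbitrary but informative starting state (any start is valid)
    theta = [X[int(len(X) * (i + 0.5) / k)] for i in range(k)]
    C = [0] * len(X)
    grid = [g + 0.5 for g in range(100)]
    for _ in range(iters):
        for j, x in enumerate(X):
            w = [math.exp(-(x - t) ** 2 / (2 * sigma ** 2)) for t in theta]
            C[j] = rng.choices(range(k), weights=w)[0]
        for i in range(k):
            xs = [x for x, c in zip(X, C) if c == i]
            logw = [sum(-(x - t) ** 2 / (2 * sigma ** 2) for x in xs) for t in grid]
            m = max(logw)
            w = [math.exp(l - m) for l in logw]
            theta[i] = rng.choices(grid, weights=w)[0]
    return theta, C

X = [18, 19, 20, 21, 22, 78, 79, 80, 81, 82]   # two well-separated clusters
theta, C = gibbs_mixture(X, k=2)
```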
Sampling Given Markov Blanket
p(v | s−V) ∝ p(v | s[Pa(V)]) · ∏Y∈Ch(V) p(s[Y] | v, s[Pa(Y)])

[Figure: wingspan data (cm) that should be two clusters]
33
Metropolis-Hastings
[Metropolis et al. 1953; Hastings 1970]
35
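Metropolis-Hastings accepts a proposal s′ ~ q(· | s) with probability min(1, p(s′) q(s | s′) / (p(s) q(s′ | s))). A generic sketch in Python (the 1-D standard-normal target and random-walk proposal are illustrative choices):

```python
import math
import random

def metropolis_hastings(log_p, propose, log_q_ratio, x0, iters, seed=0):
    """Generic MH: accept x' ~ q(.|x) with probability
    min(1, p(x') q(x|x') / (p(x) q(x'|x))), computed in log space."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(iters):
        xp = propose(x, rng)
        log_alpha = log_p(xp) - log_p(x) + log_q_ratio(x, xp)
        if math.log(rng.random()) < log_alpha:
            x = xp
        samples.append(x)
    return samples

# Illustrative target: standard normal (log density up to a constant)
log_p = lambda x: -0.5 * x * x
propose = lambda x, rng: x + rng.uniform(-1.0, 1.0)   # symmetric random walk
log_q_ratio = lambda x, xp: 0.0                       # log[q(x|x')/q(x'|x)] = 0
samples = metropolis_hastings(log_p, propose, log_q_ratio, 0.0, 20000)
```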
Split-Merge Proposals
• Choose two observations i, j
• If Ci = Cj = c, then split cluster c
– Get an unused latent object c′
– For each observation m such that Cm = c,
change Cm to c′ with probability 0.5
– Propose new values for θc, θc′
• Else merge clusters Ci and Cj
– For each m such that Cm = Cj, set Cm = Ci
– Propose a new value for θCi
[Jain & Neal 2004] 36
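A sketch of the proposal step only, in Python: the Metropolis-Hastings acceptance test is omitted, uniform parameter proposals are illustrative, and keeping i in the old cluster while j moves to the new one follows Jain & Neal's convention:

```python
import random

def propose_split_merge(C, theta, rng):
    """Propose new (C, theta) following the slide's procedure.
    Returns proposed copies; the acceptance test is not shown."""
    C, theta = list(C), dict(theta)
    i, j = rng.sample(range(len(C)), 2)
    if C[i] == C[j]:
        # Split cluster c: j moves to a fresh cluster, others coin-flip
        c = C[i]
        c_new = max(theta) + 1                 # an unused latent object
        C[j] = c_new
        for m in range(len(C)):
            if C[m] == c and m not in (i, j) and rng.random() < 0.5:
                C[m] = c_new
        theta[c] = rng.uniform(0, 100)         # propose new parameters
        theta[c_new] = rng.uniform(0, 100)
    else:
        # Merge: move everything in j's cluster into i's cluster
        ci, cj = C[i], C[j]
        for m in range(len(C)):
            if C[m] == cj:
                C[m] = ci
        del theta[cj]
        theta[ci] = rng.uniform(0, 100)
    return C, theta

rng = random.Random(3)
C0, theta0 = [0, 0, 0, 1, 1], {0: 20.0, 1: 80.0}
```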
Split-Merge Example
[Figure: split-merge example on the wingspan axis (cm), with cluster parameters θ1 = 20, θ2 = 27, θ2′ = 90]
37
Mixtures of Kernels
• If T1, ..., Tm all have stationary distribution
π, then so does the mixture
  T(s, s′) = Σi=1..m wi Ti(s, s′)
38
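This is easy to verify for small kernels; a Python check (the two 2-state kernels are illustrative, each with uniform stationary distribution):

```python
def is_stationary(pi, T, tol=1e-12):
    """Check sum_s pi(s) T(s, s') = pi(s') for every s'."""
    n = len(pi)
    return all(abs(sum(pi[s] * T[s][sp] for s in range(n)) - pi[sp]) < tol
               for sp in range(n))

T1 = [[0.9, 0.1], [0.1, 0.9]]    # slow, local kernel
T2 = [[0.2, 0.8], [0.8, 0.2]]    # bold, far-jumping kernel
pi = [0.5, 0.5]
w1, w2 = 0.3, 0.7
# Mixture kernel: pick T1 with prob w1, T2 with prob w2
T_mix = [[w1 * T1[s][sp] + w2 * T2[s][sp] for sp in range(2)]
         for s in range(2)]
```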
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
39
MCMC States in Split-Merge
• Not complete instantiations!
– No parameters for unobserved species
• States are partial instantiations of random
variables
k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
40
MCMC over Events
[Milch & Russell 2006]
• Theorem: the fraction of visited events
that are in Q converges to p(Q|E) if:
– Each event is either a subset of Q
or disjoint from Q
– The events form a partition of E
[Figure: E partitioned into events, some contained in Q]
41
Computing Probabilities of Events
• Engine needs to compute p(σ′) / p(σn)
efficiently (without summations)
• Use instantiations that
include all active parents
of the variables they
instantiate
• Then the probability is a product of CPDs:
  p(σ) = ∏X∈vars(σ) p(σ[X] | σ[Pa(X)])
42
States That Are Even More Abstract
• Typical partial instantiation:
k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
44
Representative Applications
• Tracking cars with cameras [Pasula et al. 1999]
• Segmentation in computer vision [Tu & Zhu 2002]
• Citation matching [Pasula et al. 2003]
• Multi-target tracking with radar [Oh et al. 2004]
45
Citation Matching Model
[Pasula et al. 2003; Milch & Russell 2006]
#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();
#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();
PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar
(Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));
46
Citation Matching
• Elaboration of generative model shown earlier
• Parameter estimation
– Priors for names, titles, citation formats learned
offline from labeled data
– String corruption parameters learned with Monte
Carlo EM
• Inference
– MCMC with split-merge proposals
– Guided by “canopies” of similar citations
– Accuracy stabilizes after ~20 minutes
47
[Pasula et al., NIPS 2002]
Citation Matching Results
[Chart: error (fraction of clusters not recovered correctly, scale 0–0.25) on four citation datasets (Reinforce, Face, Reason, Constraint), comparing phrase matching [Lawrence et al. 1999] with the model-based approach]
49
Preliminary Experiments:
Information Extraction
• P(citation text | title, author names)
modeled with simple HMM
• For each paper: recover title, author
surnames and given names
• Fraction whose attributes are recovered
perfectly in last MCMC state:
– among papers with one citation: 36.1%
– among papers with multiple citations: 62.6%
Can use inferred knowledge for disambiguation
50
Multi-Object Tracking
Unobserved
Object
False
Detection
51
State Estimation for “Aircraft”
#Aircraft ~ NumAircraftPrior();
State(a, t)
if t = 0 then ~ InitState()
else ~ StateTransition(State(a, Pred(t)));
#Blip(Source = a, Time = t)
~ NumDetectionsCPD(State(a, t));
#Blip(Time = t)
~ NumFalseAlarmsPrior();
ApparentPos(r)
if (Source(r) = null) then ~ FalseAlarmDistrib()
else ~ ObsCPD(State(Source(r), Time(r)));
52
Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior();
Exits(a, t)
if InFlight(a, t) then ~ Bernoulli(0.1);
InFlight(a, t)
if t < EntryTime(a) then = false
elseif t = EntryTime(a) then = true
else = (InFlight(a, Pred(t)) & !Exits(a, Pred(t)));
State(a, t)
if t = EntryTime(a) then ~ InitState()
elseif InFlight(a, t) then
~ StateTransition(State(a, Pred(t)));
#Blip(Source = a, Time = t)
if InFlight(a, t) then
~ NumDetectionsCPD(State(a, t));
…plus last two statements from previous slide
53
MCMC for Aircraft Tracking
• Uses generative model from previous slide
(although not with BLOG syntax)
• Examples of Metropolis-Hastings proposals:
[Figure: example proposal moves]
56
General MCMC Engine
[Milch & Russell 2006]
• Model (in declarative language)
– Defines p(s)
– MCMC states are partial worlds
• Custom proposal distribution (Java class)
– Proposes MCMC state s′ given sn
– Computes ratio q(sn | s′) / q(s′ | sn)
• MCMC engine
– Computes acceptance probability based on the model
– Sets sn+1
58
References
• Blei, D. M. and Jordan, M. I. (2005) “Variational inference for Dirichlet process
mixtures”. J. Bayesian Analysis 1(1):121-144.
• Casella, G. and Robert, C. P. (1996) “Rao-Blackwellisation of sampling schemes”.
Biometrika 83(1):81-94.
• Ferguson, T. S. (1983) “Bayesian density estimation by mixtures of normal
distributions”. In Rizvi, M. H. et al., eds. Recent Advances in Statistics: Papers in
Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pages
287-302.
• Geman, S. and Geman, D. (1984) “Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images”. IEEE Trans. on Pattern Analysis and Machine
Intelligence 6:721-741.
• Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) “A language and program for
complex Bayesian modelling”. The Statistician 43(1):169-177.
• Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain
Monte Carlo in Practice. Chapman and Hall.
• Green, P. J. (1995) “Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination”. Biometrika 82(4):711-732.
59
References
• Hastings, W. K. (1970) “Monte Carlo sampling methods using Markov chains and
their applications”. Biometrika 57:97-109.
• Jain, S. and Neal, R. M. (2004) “A split-merge Markov chain Monte Carlo procedure
for the Dirichlet process mixture model”. J. Computational and Graphical Statistics
13(1):158-182.
• Jordan, M. I. (2005) “Dirichlet processes, Chinese restaurant processes, and all that”.
Tutorial at the NIPS Conference, available at
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
• MacKay, D. J. C. (1992) “Bayesian Interpolation”. Neural Computation 4(3):414-447.
• MacEachern, S. N. (1994) “Estimating normal means with a conjugate style Dirichlet
process prior”. Communications in Statistics: Simulation and Computation 23:727-741.
• Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E.
(1953) “Equations of state calculations by fast computing machines”. J. Chemical
Physics 21:1087-1092.
• Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005)
“BLOG: Probabilistic Models with Unknown Objects”. In Proc. 19th Int’l Joint Conf. on
AI, pages 1352-1359.
• Milch, B. and Russell, S. (2006) “General-purpose MCMC inference over relational
structures”. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.
60
References
• Neal, R. M. (2000) “Markov chain sampling methods for Dirichlet process mixture
models”. J. Computational and Graphical Statistics 9:249-265.
• Oh, S., Russell, S. and Sastry, S. (2004) “Markov chain Monte Carlo data association
for general multi-target tracking problems”. In Proc. 43rd IEEE Conf. on Decision and
Control, pages 734-742.
• Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) “Tracking many objects
with many sensors”. In Proc. 16th Int’l Joint Conf. on AI, pages 1160-1171.
• Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) “Identity
uncertainty and citation matching”. In Advances in Neural Information Processing
Systems 15, MIT Press, pages 1401-1408.
• Richardson, S. and Green, P. J. (1997) “On Bayesian analysis of mixtures with an
unknown number of components”. J. Royal Statistical Society B 59:731-792.
• Sethuraman, J. (1994) “A constructive definition of Dirichlet priors”. Statistica Sinica
4:639-650.
• Sudderth, E. (2006) “Graphical models for visual object recognition and tracking”.
Ph.D. thesis, Dept. of EECS, Massachusetts Institute of Technology, Cambridge, MA.
• Tu, Z. and Zhu, S.-C. (2002) “Image segmentation by data-driven Markov chain
Monte Carlo”. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.
61