Inference On Relational Models Using Markov Chain Monte Carlo


Inference on Relational Models

Using Markov Chain Monte Carlo

Brian Milch
Massachusetts Institute of Technology

UAI Tutorial
July 19, 2007
Example 1: Bibliographies

Stuart Russell Peter Norvig

Artificial Intelligence: A Modern Approach

Russell, Stuart and Norvig, Peter. Artificial Intelligence. Prentice-Hall, 1995.

S. Russel and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
Example 2: Aircraft Tracking

(Figure: radar blips at time steps t = 1, 2, 3, each labeled with an estimated 3-D position such as (1.9, 9.0, 2.1) or (0.9, 5.8, 3.1).)
Inference on Relational Structures
(Figure: several candidate relational structures over authors, papers, and observed citation strings, each assigned a posterior probability ranging from about 2.3 × 10^-12 down to 5.0 × 10^-20.)


Markov Chain Monte Carlo
(MCMC)
• Markov chain s1, s2, ... over worlds where evidence E is true
• Approximate P(Q | E) as the fraction of s1, s2, ... that satisfy query Q
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

Simple Example: Clustering

(Figure: bird wingspans from 10 to 100 cm, forming three clusters with means μ = 22, μ = 49, and μ = 80.)
Simple Bayesian Mixture Model
• Number of latent objects is known to be k
• For each latent object i, have parameter
  μi ~ Uniform[0, 100]
• For each data point j, have object selector
  Cj ~ Uniform({1, ..., k})
and observable value
  Xj ~ Normal(μ_Cj, 5^2)
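To make the generative process concrete, here is a minimal Python sketch that samples a synthetic data set from this model (k = 3, n = 100, and the seed are arbitrary choices; the standard deviation of 5 and the Uniform[0, 100] prior come from the slide):

import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(k=3, n=100):
    """Sample data from the simple Bayesian mixture model above."""
    mu = rng.uniform(0.0, 100.0, size=k)    # mu_i ~ Uniform[0, 100]
    c = rng.integers(0, k, size=n)          # C_j ~ Uniform({1, ..., k})
    x = rng.normal(mu[c], 5.0)              # X_j ~ Normal(mu_{C_j}, 5^2)
    return mu, c, x

mu, c, x = sample_mixture()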
BN for Mixture Model

(Figure: Bayesian network with parameters μ1, …, μk, selectors C1, …, Cn, and observations X1, …, Xn; each Xj has its selector Cj and all the μi as parents.)
Context-Specific Dependencies

(Figure: the same network with selector values filled in, e.g. C1 = 2, C2 = 1, C3 = 2; given Cj, each Xj depends only on the single μ that Cj selects.)
Extensions to Mixture Model
• Random number of latent objects k, with distribution p(k) such as:
– Uniform({1, …, 100})
– Geometric(0.1) (unbounded!)
– Poisson(10) (unbounded!)
• Random distribution θ for selecting objects
– θ | k ~ Dirichlet(α1, ..., αk)
(Dirichlet: distribution over probability vectors)
– Still symmetric: each αi = α/k
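A hedged sketch of this extended generative process in Python (the Poisson(10) choice for p(k), α = 1, and the guard against k = 0 are illustrative choices, not prescribed by the slides):

import numpy as np

rng = np.random.default_rng(1)

def sample_extended_mixture(n=100, alpha=1.0):
    k = max(1, int(rng.poisson(10)))              # random number of latent objects
    theta = rng.dirichlet(np.full(k, alpha / k))  # symmetric Dirichlet: alpha_i = alpha / k
    mu = rng.uniform(0.0, 100.0, size=k)
    c = rng.choice(k, size=n, p=theta)            # selectors now drawn from theta
    x = rng.normal(mu[c], 5.0)
    return k, theta, mu, c, x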
Existence versus Observation
• A latent object can exist even if no observations
correspond to it
– Bird species may not be observed yet
– Aircraft may fly over without yielding any blips
• Two questions:
– How many objects correspond to observations?
– How many objects are there in total?
• Observed 3 species, each 100 times: probably no more exist
• Observed 200 species, each 1 or 2 times: probably more exist
Expecting Additional Objects
(Figure: r observed species; will we observe more later?)

• P(ever observe a new species | seen r so far) is bounded by P(k > r)
• So as the number of species observed → ∞, the probability of ever seeing more → 0
• What if we don't want this?
Dirichlet Process Mixtures
• Set k = ∞; let θ be an infinite-dimensional probability vector with a stick-breaking prior
(Figure: stick-breaking weights θ1, θ2, θ3, θ4, θ5, …)
• Another view: define the prior directly on partitions of data points, allowing an unbounded number of blocks
• Drawback: can't ask about the number of unobserved latent objects (always infinite)
[Ferguson 1983; Sethuraman 1994]
[tutorials: Jordan 2005; Sudderth 2006]
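For intuition, a truncated stick-breaking sketch in Python (the concentration parameter alpha = 1 and the truncation level are my choices; the exact construction has infinitely many sticks):

import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(alpha=1.0, truncation=1000):
    """Approximate the infinite stick-breaking weights by truncation."""
    betas = rng.beta(1.0, alpha, size=truncation)                    # beta_i ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining             # theta_i = beta_i * prod_{j<i} (1 - beta_j)

theta = stick_breaking()   # weights decay rapidly; their sum approaches 1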
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

Mistake 1: Ignoring Interchangeability
• Which birds are in species S1?
(Figure: five birds B1-B5 plotted on the wingspan axis.)
• Latent object indices are interchangeable
– Posterior on selector variable C_B1 is uniform
– Posterior on μ_S1 has a peak for each cluster of birds
• Really care about the partition of observations
{{1, 3}, {2}, {4, 5}}
• Partition with r blocks corresponds to k! / (k-r)! instantiations of the Cj variables
(1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), …
Ignoring Interchangeability, Cont’d
• Say k = 4. What’s prior probability that
B1, B3 are in one species, B2 in another?
• Multiply probabilities for CB1, CB2, CB3:
(1/4) x (1/4) x (1/4)
• Not enough! Partition {{B1, B3}, {B2}}
corresponds to 12 instantiations of C’s
(S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2)
(S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
• Partition with r blocks corresponds to kPr = k! / (k-r)! instantiations (see the sketch below)
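A small sketch of this correction in Python: the prior probability of a given partition is the number of consistent selector instantiations, kPr, times (1/k)^n, assuming the uniform selector prior from these slides.

from math import perm

def partition_prior(block_sizes, k):
    """Prior probability of a set partition of n observations when each
    C_j ~ Uniform({1, ..., k}): kPr consistent instantiations, each (1/k)^n."""
    r = len(block_sizes)
    n = sum(block_sizes)
    return perm(k, r) * (1.0 / k) ** n

# Slide example: k = 4, partition {{B1, B3}, {B2}} of 3 birds
print(partition_prior([2, 1], 4))   # 12 * (1/4)^3 = 0.1875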
Mistake 2: Underestimating the
Bayesian Ockham’s Razor Effect
• Say k = 4. Are B1 and B2 in same species?
(Figure: two observations on the wingspan axis, X_B1 = 50 cm and X_B2 = 52 cm.)

• Maximum-likelihood estimation would yield one species with μ = 50 and another with μ = 52
• But the Bayesian model trades off likelihood against the prior probability of getting those μ values
Bayesian Ockham’s Razor
(Figure: X_B1 = 50 and X_B2 = 52 on the wingspan axis, 10-100 cm.)

H1: Partition is {{B1, B2}}
  p(H1, data) = 4P1 · (1/4)^2 · ∫_0^100 p(μ1) p(x1 | μ1) p(x2 | μ1) dμ1 ≈ 1.3 × 10^-4
  (where p(μ1) = 0.01, the Uniform[0, 100] density)

H2: Partition is {{B1}, {B2}}
  p(H2, data) = 4P2 · (1/4)^2 · [∫_0^100 p(μ1) p(x1 | μ1) dμ1] · [∫_0^100 p(μ2) p(x2 | μ2) dμ2] ≈ 7.5 × 10^-5

Don't use more latent objects than necessary to explain your data
[MacKay 1992]
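These two quantities can be checked numerically; a sketch assuming the mixture model above (Uniform[0, 100] prior on μ, σ = 5, k = 4) and simple grid integration:

import numpy as np
from math import perm
from scipy.stats import norm

x1, x2, sigma, k = 50.0, 52.0, 5.0, 4
mus = np.linspace(0.0, 100.0, 100001)
dmu = mus[1] - mus[0]
p_mu = 0.01                                           # Uniform[0, 100] density

# H1: one cluster explains both observations
lik_h1 = np.sum(p_mu * norm.pdf(x1, mus, sigma) * norm.pdf(x2, mus, sigma)) * dmu
p_h1 = perm(k, 1) * (1 / k) ** 2 * lik_h1

# H2: two separate clusters
lik_x1 = np.sum(p_mu * norm.pdf(x1, mus, sigma)) * dmu
lik_x2 = np.sum(p_mu * norm.pdf(x2, mus, sigma)) * dmu
p_h2 = perm(k, 2) * (1 / k) ** 2 * lik_x1 * lik_x2

print(p_h1, p_h2)   # roughly 1.3e-4 and 7.5e-5, as on the slide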
Mistake 3: Comparing Densities
Across Dimensions
(Figure: X_B1 = 50 and X_B2 = 52 on the wingspan axis, 10-100 cm.)

H1: Partition is {{B1, B2}}, μ = 51
  p(H1, data) = 4P1 · (1/4)^2 · 0.01 · N(50; 51, 5^2) · N(52; 51, 5^2) ≈ 1.5 × 10^-5
  (H1 wins by a greater margin)

H2: Partition is {{B1}, {B2}}, μ_B1 = 50, μ_B2 = 52
  p(H2, data) = 4P2 · (1/4)^2 · 0.01 · N(50; 50, 5^2) · 0.01 · N(52; 52, 5^2) ≈ 4.8 × 10^-7
What If We Change the Units?
(Figure: X_B1 = 0.50 and X_B2 = 0.52 on the wingspan axis, 0.1-1.0 m.)

H1: Partition is {{B1, B2}}, μ = 0.51
  p(H1, data) = 4P1 · (1/4)^2 · 1 · N(0.50; 0.51, 0.05^2) · N(0.52; 0.51, 0.05^2) ≈ 15
  (the density of Uniform(0, 1) is 1!)

H2: Partition is {{B1}, {B2}}, μ_B1 = 0.50, μ_B2 = 0.52
  p(H2, data) = 4P2 · (1/4)^2 · 1 · N(0.50; 0.50, 0.05^2) · 1 · N(0.52; 0.52, 0.05^2) ≈ 48

Now H2 wins by a landslide
Lesson: Comparing Densities
Across Dimensions
• Densities don’t behave like probabilities
(e.g., they can be greater than 1)
• Heights of density peaks in spaces of
different dimension are not comparable
• Work-arounds:
– Find most likely partition first, then most likely
parameters given that partition
– Find region in parameter space where most of
the posterior probability mass lies

Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

Why Not Exact Inference?
• Number of possible partitions is
superexponential in n
• Variable elimination?
– Summing out μi couples all the Cj's
– Summing out Cj couples all the μi's
(Figure: the mixture-model Bayesian network, illustrating these couplings.)
Markov Chain Monte Carlo
(MCMC)
• Start in an arbitrary state (possible world) s1 satisfying evidence E
• Sample s2, s3, ... according to transition kernel T(si, si+1), yielding a Markov chain
• Approximate p(Q | E) by the fraction of s1, s2, …, sL that are in Q
Why a Markov Chain?
• Why use Markov chain rather than
sampling independently?
– Stochastic local search for high-probability s
– Once we find such s, explore around it

Convergence
• Stationary distribution π is such that
  Σ_s π(s) T(s, s′) = π(s′)
• If the chain is ergodic (can get to anywhere from anywhere*), then:
– It has a unique stationary distribution π
– The fraction of s1, s2, ..., sL in Q converges to π(Q) as L → ∞
• We'll design T so that π(s) = p(s | E)
* and it's aperiodic
Gibbs Sampling
• Order the non-evidence variables V1, V2, ..., Vm
• Given state s, sample from T as follows:
– Let s′ = s
– For i = 1 to m
  • Sample vi from p(Vi | s′_{-i}), the conditional for Vi given the other variables in s′
  • Let s′ = (s′_{-i}, Vi = vi)
– Return s′
• Theorem: the stationary distribution is p(s | E)
[Geman & Geman 1984]
Gibbs on Bayesian Network
• Conditional for V depends only on the factors that contain V:
  p(v | s_{-V}) ∝ p(v | s[Pa(V)]) · ∏_{Y ∈ ch(V)} p(s[Y] | v, s[Pa(Y) \ {V}])
• So condition on V's Markov blanket mb(V): parents, children, and co-parents
(Figure: a node V with its Markov blanket highlighted.)
Gibbs on Bayesian Mixture Model
• Given the current state s:
– Resample each μi given its prior and {Xj : Cj = i in s}
– Resample each Cj given Xj and μ1, …, μk (its context-specific Markov blanket)
One full sweep is sketched in code below.
[Neal 2000]
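A minimal sketch of one Gibbs sweep for this model in Python (fixed k, uniform selector prior, σ = 5 as above; for brevity the μ update ignores the [0, 100] truncation that the Uniform prior would impose):

import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(x, c, mu, sigma=5.0):
    """One Gibbs sweep for the fixed-k mixture; x, c, mu are NumPy arrays."""
    k = len(mu)
    # Resample each mu_i given the points currently assigned to it
    for i in range(k):
        xi = x[c == i]
        if len(xi) == 0:
            mu[i] = rng.uniform(0.0, 100.0)            # empty cluster: sample from the prior
        else:
            mu[i] = rng.normal(xi.mean(), sigma / np.sqrt(len(xi)))
    # Resample each C_j given x_j and mu_1..mu_k (its context-specific Markov blanket)
    for j in range(len(x)):
        logp = -0.5 * ((x[j] - mu) / sigma) ** 2       # uniform selector prior cancels
        p = np.exp(logp - logp.max())
        c[j] = rng.choice(k, p=p / p.sum())
    return c, mu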
Sampling Given Markov Blanket
p(v | s_{-V}) ∝ p(v | s[Pa(V)]) · ∏_{Y ∈ ch(V)} p(s[Y] | v, s[Pa(Y) \ {V}])

• If V is discrete, just iterate over its values, normalize, and sample from the resulting discrete distribution
• If V is continuous:
– Simple if the child distributions are conjugate to V's prior: the posterior has the same form as the prior, with different parameters
– In general, even sampling from p(v | s_{-V}) can be hard
[See BUGS software: http://www.mrc-bsu.cam.ac.uk/bugs]
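For the conjugate case mentioned above: if μ had a Normal(μ0, τ0^2) prior (rather than this tutorial's Uniform prior) and its m assigned observations were Normal(μ, σ^2), the full conditional would again be Normal. A small helper, as a sketch:

import numpy as np

def normal_normal_posterior(xs, mu0, tau0, sigma):
    """Posterior over mu when mu ~ N(mu0, tau0^2) and each x in xs ~ N(mu, sigma^2).
    Same Normal form as the prior, with updated parameters."""
    m = len(xs)
    precision = 1.0 / tau0**2 + m / sigma**2
    mean = (mu0 / tau0**2 + np.sum(xs) / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)   # posterior mean and standard deviation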
Convergence Can Be Slow

(Figure: the data should form two clusters, but the current state has μ1 = 20 and μ2 = 90, with species 2 far from any data.)

• Cj's won't change until μ2 is in the right area
• μ2 does an unguided random walk as long as no observations are associated with it
– Especially bad in high dimensions
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

Metropolis-Hastings
[Metropolis et al. 1953; Hastings 1970]

• Define T(si, si+1) as follows:
– Sample s′ from proposal distribution q(s′ | si)
– Compute acceptance probability
  α = min(1, [p(s′ | E) q(si | s′)] / [p(si | E) q(s′ | si)])
  (ratio of posterior probabilities, times ratio of backward to forward proposal probabilities)
– With probability α, let si+1 = s′; else let si+1 = si
• Can show that p(s | E) is the stationary distribution for T
A generic loop is sketched in code below.
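A generic Metropolis-Hastings loop, as a sketch: log_p evaluates the unnormalized log posterior, and propose returns a proposed state together with the forward and backward log proposal probabilities (all of these names are mine, not part of any particular system):

import math, random

def metropolis_hastings(s, log_p, propose, num_steps=10000):
    """Generic MH; propose(s) returns (s_new, log_q_forward, log_q_backward)."""
    samples = []
    for _ in range(num_steps):
        s_new, log_q_fwd, log_q_bwd = propose(s)
        # alpha = min(1, p(s') q(s | s') / (p(s) q(s' | s))), computed in log space
        log_alpha = (log_p(s_new) + log_q_bwd) - (log_p(s) + log_q_fwd)
        if math.log(random.random()) < min(0.0, log_alpha):
            s = s_new
        samples.append(s)
    return samples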
Metropolis-Hastings
• Benefits
– Proposal distribution can propose big steps involving several variables
– Only need to compute the ratio p(s′ | E) / p(s | E), ignoring normalization factors
– Don't need to sample from conditional distributions
• Limitations
– Proposals must be reversible, else q(s | s′) = 0 and the move is never accepted
– Need to be able to compute q(s | s′) / q(s′ | s)
Split-Merge Proposals
• Choose two observations i, j
• If Ci = Cj = c, then split cluster c
– Get an unused latent object c′
– For each observation m such that Cm = c, change Cm to c′ with probability 0.5
– Propose new values for μc, μc′
• Else merge clusters Ci and Cj
– For each m such that Cm = Cj, set Cm = Ci
– Propose a new value for μ_Ci
(The split step is sketched in code below.)
[Jain & Neal 2004]
Split-Merge Example

(Figure: continuing the previous example, μ1 = 20 and μ2 = 90; the split proposal resamples μ2 to about 27, near the two split-off birds.)

• Split two birds off from species 1
• Resample μ2 to match these two birds
• Move is likely to be accepted
Mixtures of Kernels
• If T1, …, Tm all have stationary distribution π, then so does the mixture
  T(s, s′) = Σ_{i=1..m} wi Ti(s, s′)
• Example: mixture of split-merge and Gibbs moves
• Point: faster convergence
38
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

MCMC States in Split-Merge
• Not complete instantiations!
– No parameters for unobserved species
• States are partial instantiations of random
variables
k = 12, C_B1 = S2, C_B2 = S8, μ_S2 = 31, μ_S8 = 84
– Each state corresponds to an event: the set of outcomes satisfying this description
MCMC over Events
[Milch & Russell 2006]

• Markov chain over events σ, with stationary distribution proportional to p(σ)
• Theorem: the fraction of visited events in Q converges to p(Q | E) if:
– Each σ is either a subset of Q or disjoint from Q
– The events form a partition of E
Computing Probabilities of Events
• Engine needs to compute p(σ′) / p(σn) efficiently (without summations)
• Use instantiations that include all active parents of the variables they instantiate
• Then the probability is a product of CPDs:
  p(σ) = ∏_{X ∈ vars(σ)} p_X(σ(X) | σ(Pa_σ(X)))
States That Are Even More Abstract
• Typical partial instantiation:
  k = 12, C_B1 = S2, C_B2 = S8, μ_S2 = 31, μ_S8 = 84
– Specifies particular species numbers, even though species are interchangeable
• Let states be abstract partial instantiations:
  ∃x ∃y ≠ x [k = 12, C_B1 = x, C_B2 = y, μ_x = 31, μ_y = 84]
• See [Milch & Russell 2006] for conditions under which we can compute probabilities of such events
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking

Representative Applications
• Tracking cars with cameras [Pasula et al. 1999]
• Segmentation in computer vision [Tu & Zhu 2002]
• Citation matching [Pasula et al. 2003]
• Multi-target tracking with radar [Oh et al. 2004]

Citation Matching Model
[Pasula et al. 2003; Milch & Russell 2006]

#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();
#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();
PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar(Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));
Citation Matching
• Elaboration of generative model shown earlier
• Parameter estimation
– Priors for names, titles, citation formats learned
offline from labeled data
– String corruption parameters learned with Monte
Carlo EM
• Inference
– MCMC with split-merge proposals
– Guided by “canopies” of similar citations
– Accuracy stabilizes after ~20 minutes

[Pasula et al., NIPS 2002]
Citation Matching Results
(Figure: error, measured as the fraction of clusters not recovered correctly (0 to 0.25), on four data sets (Reinforce, Face, Reason, Constraint) for Phrase Matching [Lawrence et al. 1999], Conditional Random Field [Wellner et al. 2004], and Generative Model + MCMC [Pasula et al. 2002].)

Four data sets of ~300-500 citations, referring to ~150-300 papers


Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language
Input with a Graphical User Interface. NRL Report
NRL/FR/5510-94-9711 (1994).

Is "Eucalyptus" part of the title, or is the author


named K. Eucalyptus Wauchope?
Kenneth Wauchope (1994). Eucalyptus: Integrating
natural language input with a graphical user
interface. NRL Report NRL/FR/5510-94-9711, Naval
Research Laboratory, Washington, DC, 39pp.

Second citation makes it clear how to parse the first one

Preliminary Experiments:
Information Extraction
• P(citation text | title, author names)
modeled with simple HMM
• For each paper: recover title, author
surnames and given names
• Fraction whose attributes are recovered
perfectly in last MCMC state:
– among papers with one citation: 36.1%
– among papers with multiple citations: 62.6%
Can use inferred knowledge for disambiguation
Multi-Object Tracking

(Figure: observed blips over time, with an unobserved object and a false detection labeled.)
State Estimation for “Aircraft”

#Aircraft ~ NumAircraftPrior();
State(a, t)
if t = 0 then ~ InitState()
else ~ StateTransition(State(a, Pred(t)));
#Blip(Source = a, Time = t)
~ NumDetectionsCPD(State(a, t));
#Blip(Time = t)
~ NumFalseAlarmsPrior();
ApparentPos(r)
if (Source(r) = null) then ~ FalseAlarmDistrib()
else ~ ObsCPD(State(Source(r), Time(r)));

Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior();
Exits(a, t)
if InFlight(a, t) then ~ Bernoulli(0.1);
InFlight(a, t)
if t < EntryTime(a) then = false
elseif t = EntryTime(a) then = true
else = (InFlight(a, Pred(t)) & !Exits(a, Pred(t)));
State(a, t)
if t = EntryTime(a) then ~ InitState()
elseif InFlight(a, t) then
~ StateTransition(State(a, Pred(t)));
#Blip(Source = a, Time = t)
if InFlight(a, t) then
~ NumDetectionsCPD(State(a, t));
…plus last two statements from previous slide
MCMC for Aircraft Tracking
• Uses generative model from previous slide
(although not with BLOG syntax)
• Examples of Metropolis-Hastings proposals:

[Figures by Songhwai Oh] [Oh et al., CDC 2004]


Aircraft Tracking Results
(Figures: estimation error and running time. MCMC has the smallest error and hardly degrades at all as tracks get dense; it is nearly as fast as the greedy algorithm and much faster than MHT.)

[Figures by Songhwai Oh] [Oh et al., CDC 2004]
Toward General-Purpose Inference
• Currently, each new application requires
new code for:
– Proposing moves
– Representing MCMC states
– Computing acceptance probabilities
• Goal:
– User specifies model and proposal distribution
– General-purpose code does the rest

General MCMC Engine
[Milch & Russell 2006]
(Diagram: three interacting components.)
• Model (in declarative language): defines p(s)
• Custom proposal distribution (Java class): proposes an MCMC state s′ given sn; computes the ratio q(sn | s′) / q(s′ | sn)
• General-purpose engine (Java code): represents MCMC states as partial worlds; computes the acceptance probability based on the model; sets sn+1; handles arbitrary proposals efficiently using context-specific structure
Summary
• Models for relational structures go beyond
standard probabilistic inference settings
• MCMC provides a feasible path for
inference
• Open problems
– More general inference
– Adaptive MCMC
– Integrating discriminative methods

References
• Blei, D. M. and Jordan, M. I. (2005) “Variational inference for Dirichlet process
mixtures”. J. Bayesian Analysis 1(1):121-144.
• Casella, G. and Robert, C. P. (1996) “Rao-Blackwellisation of sampling schemes”.
Biometrika 83(1):81-94.
• Ferguson T. S. (1983) “Bayesian density estimation by mixtures of normal
distributions”. In Rizvi, M. H. et al., eds. Recent Advances in Statistics: Papers in
Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pages
287-302.
• Geman, S. and Geman, D. (1984) “Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images”. IEEE Trans. on Pattern Analysis and Machine
Intelligence 6:721-741.
• Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) “A language and program for
complex Bayesian modelling”. The Statistician 43(1):169-177.
• Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain
Monte Carlo in Practice. Chapman and Hall.
• Green, P. J. (1995) “Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination”. Biometrika 82(4):711-732.

References
• Hastings, W. K. (1970) “Monte Carlo sampling methods using Markov chains and
their applications”. Biometrika 57:97-109.
• Jain, S. and Neal, R. M. (2004) “A split-merge Markov chain Monte Carlo procedure
for the Dirichlet process mixture model”. J. Computational and Graphical Statistics
13(1):158-182.
• Jordan M. I. (2005) “Dirichlet processes, Chinese restaurant processes, and all that”.
Tutorial at the NIPS Conference, available at
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
• MacKay D. J. C. (1992) “Bayesian Interpolation” Neural Computation 4(3):414-447.
• MacEachern, S. N. (1994) “Estimating normal means with a conjugate style Dirichlet
process prior” Communications in Statistics: Simulation and Computation 23:727-741.
• Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E.
(1953) “Equations of state calculations by fast computing machines”. J. Chemical
Physics 21:1087-1092.
• Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005)
“BLOG: Probabilistic Models with Unknown Objects”. In Proc. 19th Int’l Joint Conf. on
AI, pages 1352-1359.
• Milch, B. and Russell, S. (2006) “General-purpose MCMC inference over relational
structures”. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.

References
• Neal, R. M. (2000) “Markov chain sampling methods for Dirichlet process mixture
models”. J. Computational and Graphical Statistics 9:249-265.
• Oh, S., Russell, S. and Sastry, S. (2004) “Markov chain Monte Carlo data association
for general multi-target tracking problems”. In Proc. 43rd IEEE Conf. on Decision and
Control, pages 734-742.
• Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) “Tracking many objects
with many sensors”. In Proc. 16th Int’l Joint Conf. on AI, pages 1160-1171.
• Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) “Identity
uncertainty and citation matching”. In Advances in Neural Information Processing
Systems 15, MIT Press, pages 1401-1408.
• Richardson, S. and Green, P. J. (1997) “On Bayesian analysis of mixtures with an
unknown number of components”. J. Royal Statistical Society B 59:731-792.
• Sethuraman, J. (1994) “A constructive definition of Dirichlet priors”. Statistica Sinica
4:639-650.
• Sudderth, E. (2006) “Graphical models for visual object recognition and tracking”.
Ph.D. thesis, Dept. of EECS, Massachusetts Institute of Technology, Cambridge, MA.
• Tu, Z. and Zhu, S.-C. (2002) “Image segmentation by data-driven Markov chain
Monte Carlo”. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.

