IN APPLIED MATHEMATICS
A series of lectures on topics of current research interest in applied mathematics under the direction of
the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and
published by SIAM.
Probabilistic Expert
Systems
SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS
PHILADELPHIA
Copyright 1996 by the Society for Industrial and Applied Mathematics.
All rights reserved. Printed in the United States of America. No part of this book may be
reproduced, stored, or transmitted in any manner without the written permission of the
publisher. For information, write to the Society for Industrial and Applied Mathematics,
3600 University City Science Center, Philadelphia, PA 19104-2688.
Preface
have added a brief chapter on resources, which gives information on software and
includes an annotated bibliography. I have also added some exercises that will
help the reader begin to explore the problem of generalizing from probability to
broader domains of recursive computation.
The resulting monograph should be useful to scholars and students in artificial
intelligence, operations research, and the various branches of applied statistics
that use probabilistic methods. Probabilistic expert systems are now used in
areas ranging from diagnosis (in medicine, software maintenance, and space ex-
ploration) and auditing to tutoring, and the computational methods described
here are basic to nearly all implementations in all these areas.
I wish to thank Lonnie Winnrich, who organized the conference in North
Dakota, as well as the other participants. They made the week very pleasant
and productive for me. I also wish to thank the many students and colleagues,
at the University of Kansas and around the world, who helped me learn about
expert systems in the late 1980s and early 1990s. Foremost among them is
Prakash P. Shenoy, my colleague in the School of Business at the University
of Kansas from 1984 to 1992. I am grateful for his steadfast friendship and
indispensable collaboration.
Augustine Kong and A. P. Dempster, who joined with Shenoy and me in
the early 1980s in the study of join-tree computation for belief functions, were
also important in the development of the ideas reported here. Section 3.1 is in-
spired by an unpublished memorandum by Kong. Other colleagues and students
with whom I collaborated particularly closely during this period include Khalid
Mellouli, Debra K. Zarley, and Rajendra P. Srivastava.
Special thanks are due Niven Lianwen Zhang, Chingfu Chang, and the late
George Kryrollos, all of whom made useful comments on the 1992 draft of the
monograph.
I would also like to acknowledge the friendship and encouragement of many
other scholars whose work is reported here, especially A. P. Dawid, Finn V.
Jensen, Steffen L. Lauritzen, Judea Pearl, and David Spiegelhalter. The field of
probabilistic expert systems has benefited not only from their energy, intellect,
and vision, but also from their generosity and good humor.
Finally, at an even more personal level, I would like to thank my wife, Nell
Irvin Painter, who has supported this and my other scholarly work through thick
and thin.
CHAPTER 1
Multivariate Probability
This chapter reviews the basic ingredients of the theory of multivariate proba-
bility: marginals, conditionals, and expectations. These will be familiar topics
for many readers, but our approach will take us down some relatively unex-
plored paths. One of these paths opens when we develop an explicit notation
for marginalization. This notation allows us to recognize properties of marginal-
ization that are shared by many types of recursive computation. Another path
opens when we distinguish among probability distributions on the basis of how
they are stored. We distinguish between tabular distributions, which are sim-
ply tables of probabilities, and algorithmic distributions, which are algorithms
for computing probabilities. A parametric distribution is a special kind of al-
gorithmic distribution; it consists of a few numerical parameters and a rela-
tively simple algorithm, usually a formula, for computing probabilities from those
parameters.
The most complex topic in this chapter is conditional probability. Our pur-
poses require that we understand conditional probability from several viewpoints,
and we rely on some careful terminology to keep the viewpoints distinct. We
distinguish between conditional probabilities in general, which can stand on
their own, without reference to any prior probability distribution, and poste-
rior probabilities, which are conditional probabilities obtained by conditioning
a probability distribution on observations. And we distinguish two kinds of
tables of conditional probabilities: conditionals and posterior distributions. A
conditional consists of many probability distributions for a set of variables (the
conditional's head), one for each configuration of another set of variables (its
tail). A posterior distribution is a single probability distribution consisting of
posterior probabilities.
In the next chapter, we study how to construct a probability distribution
by multiplying conditional probabilities or, more precisely, by multiplying
conditionals. When we multiply the conditionals in an appropriate order, each
multiplication produces a larger marginal of the final distribution. This means
that each conditional is a continuer for the final distribution; it continues it from
a smaller to a larger set of variables. The concept of a continuer will help us
minimize complications arising from the presence of zero probabilities, which are
unavoidable in expert systems, where much of our knowledge is in the form of
rules that do not admit exceptions. Continuers will also help us, in Chapter 3,
to understand architectures for recursive computation.
This chapter is about multivariate probability, not about probability in
general. Not all probability models are multivariate. The chapter concludes with a
brief explanation of why multivariate models are sometimes inadequate.

TABLE 1.1
A discrete tabular probability distribution for three variables.

                   female               male
               Dem   ind   Rep     Dem   ind   Rep
young          .08   .16   .08     .02   .04   .02
middle-aged    .05   .05   .05     .00   .00   .00
old            .05   .05   .05     .10   .10   .10

for each configuration c of w. Here x \ w consists of the variables in x but not
w, and c.d is the configuration of x that we get by combining the configuration c
of w and the configuration d of x \ w. For example, if x = {Age,Sex,Party} and
w = {Age,Party}, then x \ w = {Sex}; if c = (old,Democrat) and d = (male),
then c.d = (old,male,Democrat).
The arrow notation emphasizes the variables that remain when we marginalize.
Sometimes we use instead a notation that emphasizes the variables we sum
out: $P^{-y}$ is the marginal obtained when we sum out the variables in y. Thus
when x = w ∪ y, where w and y are disjoint sets of variables, and P is a
probability distribution on x, both $P^{\downarrow w}$ and $P^{-y}$ will represent P's marginal on w.
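
As an illustration of the two notations, the following Python sketch marginalizes Table 1.1 both ways. The dictionary layout and the helper names `marginal_on` and `sum_out` are our own assumptions, not notation from the text.

```python
from itertools import product

# Frames (possible values) for the three variables of Table 1.1.
FRAMES = {
    "Age": ("young", "middle-aged", "old"),
    "Sex": ("female", "male"),
    "Party": ("Dem", "ind", "Rep"),
}
VARS = ("Age", "Sex", "Party")

# Table 1.1, one row per Age; columns run female Dem/ind/Rep, then male Dem/ind/Rep.
ROWS = {
    "young": (.08, .16, .08, .02, .04, .02),
    "middle-aged": (.05, .05, .05, .00, .00, .00),
    "old": (.05, .05, .05, .10, .10, .10),
}
P = {}
for age, row in ROWS.items():
    cells = product(FRAMES["Sex"], FRAMES["Party"])
    for (sex, party), p in zip(cells, row):
        P[(age, sex, party)] = p


def marginal_on(table, variables, keep):
    """Arrow notation: keep the variables in `keep`, summing over the rest."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def sum_out(table, variables, drop):
    """Minus notation: name the variables to sum out instead."""
    keep = tuple(v for v in variables if v not in drop)
    return marginal_on(table, variables, keep)


age_sex = marginal_on(P, VARS, ("Age", "Sex"))   # marginal on w = {Age, Sex}
same = sum_out(P, VARS, ("Party",))              # sum out y = {Party}
```

Both calls describe the same table on {Age, Sex}; only the way the request is phrased differs.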
Though we are concerned primarily with probability distributions, any
numerical² function f on a set of variables x has a marginal $f^{\downarrow w}$ for every subset
w of x. The function f need not be nonnegative or sum to one. If w is not
empty, then $f^{\downarrow w}$ is a function on w.

² A numerical function is one that takes real numbers as values. We will consider only
numerical functions in this monograph.
³ In order to understand this equation, we must recognize that the product fg is a function
on x ∪ y. Its value for a configuration c of x ∪ y is given by $(fg)(c) = f(c^{\downarrow x})\,g(c^{\downarrow y})$, where
$c^{\downarrow x}$ is the result of dropping from c the values for variables not in x. For example, if f is a
function on {Age,Party} and g is a function on {Sex,Party}, then (fg)(old, male, Democrat) =
f(old, Democrat) g(male, Democrat).
FIG. 1.1. Removing y \ x from y leaves x ∩ y; removing y \ x from x ∪ y leaves x.

This version of Property 2 makes it clear that we are summing out the same
variables on both sides of the equation $(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y})$. Summing these
variables out of fg, which is a function on x ∪ y, leaves the variables in x, but
summing them out of g, which is a function on y, leaves the variables in x ∩ y
(see Figure 1.1).
The second version of Property 2 also suggests the following generalization:
Property 3. If f is a function on x, g is a function on y, and
We leave it to the reader to derive this property also from equation (1.2).
As we will see in Chapter 3, Properties 1 and 2 are responsible for the pos-
sibility of recursively computing marginals of probability distributions given as
products of tables. These properties also hold and justify recursive computation
in other domains, where we work with different objects and different meanings
for marginalization and multiplication. Because of their generality, we call Prop-
erties 1 and 2 axioms; Property 1 is the transitivity axiom, and Property 2 is the
combination axiom.
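
The two axioms can be checked numerically on small tables. The sketch below assumes that the transitivity axiom says that marginalizing in two steps gives the same result as marginalizing in one, and that the combination axiom is the equation $(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y})$ quoted above. The frames, the integer-valued tables, and the function names are hypothetical choices made for illustration.

```python
from itertools import product

# Small frames chosen for illustration (an assumption, not the book's example).
FRAMES = {"Age": ("young", "old"), "Sex": ("female", "male"), "Party": ("Dem", "Rep")}


def make_table(variables, fn):
    """Build a table (variables, values) by evaluating fn on every configuration."""
    configs = product(*(FRAMES[v] for v in variables))
    return variables, {c: fn(dict(zip(variables, c))) for c in configs}


def marginal(table, keep):
    """Sum out every variable of the table that is not in `keep`."""
    variables, values = table
    kept = tuple(v for v in variables if v in keep)
    idx = [variables.index(v) for v in kept]
    out = {}
    for config, value in values.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0) + value
    return kept, out


def multiply(f, g):
    """Pointwise product, defined on the union of the two domains."""
    (fv, fvals), (gv, gvals) = f, g
    variables = fv + tuple(v for v in gv if v not in fv)

    def value(cfg):
        return fvals[tuple(cfg[v] for v in fv)] * gvals[tuple(cfg[v] for v in gv)]

    return make_table(variables, value)


# f lives on x = {Age, Party}; g lives on y = {Sex, Party}.
f = (("Age", "Party"), {("young", "Dem"): 1, ("young", "Rep"): 2,
                        ("old", "Dem"): 3, ("old", "Rep"): 4})
g = (("Sex", "Party"), {("female", "Dem"): 5, ("female", "Rep"): 6,
                        ("male", "Dem"): 7, ("male", "Rep"): 8})
fg = multiply(f, g)

# Transitivity: marginalizing in two steps equals marginalizing in one.
two_steps = marginal(marginal(fg, {"Age", "Sex"}), {"Age"})
one_step = marginal(fg, {"Age"})

# Combination: marginalizing fg to x equals f times g's marginal on the overlap.
lhs = marginal(fg, {"Age", "Party"})
rhs = multiply(f, marginal(g, {"Party"}))
```

The right-hand side of the combination check never builds the table on x ∪ y, which is the saving that recursive computation exploits.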
The definition of marginalization, equation (1.2), together with the proofs
of Properties 1, 2, and 3, can be adapted to the continuous case by replacing
summation with integration. We leave this to the reader. We also leave aside
complications that arise if infinities are allowed, that is, if the sum or integral is over an
infinite frame or an unbounded function. Our primary interest is in distributions
given by tables, and here the frames are both discrete and finite.
1.3. Conditionals.
Table 1.5 gives conditional probabilities for Party given Age and Sex. We call
these numbers conditional probabilities because they are nonnegative and each
group of three (the three probabilities for Party given each Age-Sex configura-
tion) sums to one. In other words, the marginal for {Age,Sex}, Table 1.6, consists
of ones.
We call Table 1.5 as a whole a conditional. We call {Party} its head, and
we call {Age,Sex} its tail. In general, a conditional is a nonnegative function Q
on the union of two disjoint sets of variables, its head h and its tail t, with the
property that $Q^{\downarrow t} = 1_t$, where $1_t$ is the function on t that is identically equal to
one.

TABLE 1.5
A conditional for Party given Age and Sex.

                   female               male
               Dem   ind   Rep     Dem   ind   Rep
young          1/4   1/2   1/4     1/4   1/2   1/4
middle-aged    1/3   1/3   1/3     1/5   1/5   3/5
old            1/3   1/3   1/3     1/3   1/3   1/3

TABLE 1.6
The marginal of Table 1.5 on its tail.

               female   male
young             1       1
middle-aged       1       1
old               1       1

Two special cases deserve mention. If t is empty, then Q is a probability
distribution for h. If h is empty, then $Q = 1_t$. We are interested in conditionals
not for their own sake but because we can multiply them together to construct
probability distributions. This is the topic of the next chapter.
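
The defining property of a conditional is easy to test mechanically. The following Python sketch checks it for Table 1.5, using exact fractions; the function name `is_conditional` and the positional encoding of the tail are our own assumptions.

```python
from fractions import Fraction as F
from itertools import product

SEXES = ("female", "male")
PARTIES = ("Dem", "ind", "Rep")

# Table 1.5, one row per Age; columns run female Dem/ind/Rep, then male Dem/ind/Rep.
ROWS = {
    "young": (F(1, 4), F(1, 2), F(1, 4), F(1, 4), F(1, 2), F(1, 4)),
    "middle-aged": (F(1, 3), F(1, 3), F(1, 3), F(1, 5), F(1, 5), F(3, 5)),
    "old": (F(1, 3),) * 6,
}
Q = {}
for age, row in ROWS.items():
    for (sex, party), q in zip(product(SEXES, PARTIES), row):
        Q[(age, sex, party)] = q


def is_conditional(table, tail_positions):
    """True if the table is nonnegative and its marginal on the tail is identically one."""
    sums = {}
    for config, value in table.items():
        if value < 0:
            return False
        key = tuple(config[i] for i in tail_positions)
        sums[key] = sums.get(key, 0) + value
    return all(total == 1 for total in sums.values())


tail_age_sex = is_conditional(Q, (0, 1))   # head {Party}, tail {Age, Sex}
tail_party = is_conditional(Q, (2,))       # the same numbers with tail {Party}
```

The second call shows that being a conditional depends on which variables are designated as the tail: the same table fails the test when {Party} is treated as the tail.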
Frequently, we are interested only in a subtable of a conditional. In Table 1.5,
for example, we might be interested only in the conditional probabilities for
females, the subtable shown in Table 1.7. We call such a subtable a slice. In
general, if f is a table on x and c is a configuration of a subset w of x, then we
write $f|_{w=c}$ for the table on x \ w given by

$$f|_{w=c}(d) = f(c.d) \quad \text{for each configuration } d \text{ of } x \setminus w,$$

and we call $f|_{w=c}$ the slice of f on w = c. We leave it to the reader to verify the
following proposition.
PROPOSITION 1.1. Suppose Q is a conditional with head h and tail t, and
suppose w ⊆ t. Then $Q|_{w=c}$ is a conditional with head h and tail t \ w.
Table 1.7 illustrates Proposition 1.1; it is a conditional with {Party} as its
head and {Age} as its tail.
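
Slicing amounts to selecting the entries that agree with the fixed values and dropping the fixed variables. Here is a minimal Python sketch applied to Table 1.5; the helper `slice_table` and the extra variable name `Income` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

VARS = ("Age", "Sex", "Party")
# Table 1.5, one row per Age; columns run female Dem/ind/Rep, then male Dem/ind/Rep.
ROWS = {
    "young": (F(1, 4), F(1, 2), F(1, 4), F(1, 4), F(1, 2), F(1, 4)),
    "middle-aged": (F(1, 3), F(1, 3), F(1, 3), F(1, 5), F(1, 5), F(3, 5)),
    "old": (F(1, 3),) * 6,
}
CELLS = list(product(("female", "male"), ("Dem", "ind", "Rep")))
Q = {(age, sex, party): q
     for age, row in ROWS.items()
     for (sex, party), q in zip(CELLS, row)}


def slice_table(variables, table, fixed):
    """Return (remaining variables, slice of the table on the assignment `fixed`)."""
    keep = [i for i, v in enumerate(variables) if v not in fixed]
    out = {}
    for config, value in table.items():
        agrees = all(config[i] == fixed[v]
                     for i, v in enumerate(variables) if v in fixed)
        if agrees:
            out[tuple(config[i] for i in keep)] = value
    return tuple(variables[i] for i in keep), out


rest, females = slice_table(VARS, Q, {"Sex": "female"})

# Fixing a variable outside the table's domain has no effect on the result.
_, same = slice_table(VARS, Q, {"Sex": "female", "Income": "high"})

# Each Age row of the slice still sums to one, as Proposition 1.1 asserts.
row_sums = {}
for (age, party), q in females.items():
    row_sums[age] = row_sums.get(age, 0) + q
```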
We will sometimes find it convenient to generalize the notation for slicing by
allowing the variables whose values we fix to include variables that are outside
the domain of the table and hence have no effect on the result. In general, if f is
a table on x, w is a set of variables, and c is a configuration of w, then we write
$f|_{w=c}$ for the table on x \ w given by
TABLE 1.7
The slice of Table 1.5 on Sex = female.

               Dem   ind   Rep
young          1/4   1/2   1/4
middle-aged    1/3   1/3   1/3
old            1/3   1/3   1/3

TABLE 1.8
The marginal of Table 1.1 for Age and Sex.

               female   male
young            .32     .08
middle-aged      .15     .00
old              .15     .30

1.4. Continuation.
If f is a function on x, w ⊆ x, and
or
⁴ Bear in mind that $P = P^{\downarrow w} Q$ means $P(c) = P^{\downarrow w}(c^{\downarrow w})\,Q(c)$. Thus each entry in Table 1.8
multiplies a whole row (three entries) in Table 1.5.
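
The multiplication described in this footnote can be carried out directly. The Python sketch below multiplies Table 1.8 by Table 1.5 and compares the result with Table 1.1, using exact fractions; the helper names `build` and `hundredths` are our own.

```python
from fractions import Fraction as F
from itertools import product

CELLS = list(product(("female", "male"), ("Dem", "ind", "Rep")))


def build(rows):
    """Turn rows laid out as in the printed tables into {(age, sex, party): value}."""
    return {(age, sex, party): value
            for age, row in rows.items()
            for (sex, party), value in zip(CELLS, row)}


def hundredths(*ns):
    """Exact fractions n/100 for each integer n given."""
    return tuple(F(n, 100) for n in ns)


# Table 1.5: the conditional Q for Party given Age and Sex.
Q = build({
    "young": (F(1, 4), F(1, 2), F(1, 4), F(1, 4), F(1, 2), F(1, 4)),
    "middle-aged": (F(1, 3), F(1, 3), F(1, 3), F(1, 5), F(1, 5), F(3, 5)),
    "old": (F(1, 3),) * 6,
})

# Table 1.8: the marginal for Age and Sex.
P_AGE_SEX = {
    ("young", "female"): F(32, 100), ("young", "male"): F(8, 100),
    ("middle-aged", "female"): F(15, 100), ("middle-aged", "male"): F(0),
    ("old", "female"): F(15, 100), ("old", "male"): F(30, 100),
}

# Each entry of the marginal multiplies the three Party entries that share its Age and Sex.
P = {c: P_AGE_SEX[c[:2]] * q for c, q in Q.items()}

# Table 1.1, for comparison.
TABLE_1_1 = build({
    "young": hundredths(8, 16, 8, 2, 4, 2),
    "middle-aged": hundredths(5, 5, 5, 0, 0, 0),
    "old": hundredths(5, 5, 5, 10, 10, 10),
})
```

Because the marginal probability for middle-aged males is zero, the three entries of Table 1.5 for that configuration have no influence on the product.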
and
Moreover,
4. A probability distribution is its own unique continuer from the empty set
to its domain.
Proof. Statement 1 follows directly from the definition of marginalization,
equation (1.2).
To prove statement 2, we substitute kg for f in equation (1.9), obtaining
$(kg)^{\downarrow x} = (kg)^{\downarrow w} Q$. By the combination axiom, this becomes $k g^{\downarrow x} = k g^{\downarrow w} Q$, or
$g^{\downarrow x} = g^{\downarrow w} Q$.
Again by the combination axiom, P = kf implies $P^{\downarrow \emptyset} = k f^{\downarrow \emptyset}$. Since P
is a probability distribution, $P^{\downarrow \emptyset} = 1$, whence $k = 1/f^{\downarrow \emptyset}$. So equation (1.10)
holds. Since $f^{\downarrow \emptyset}$ is a positive number, equation (1.10) is the unique solution of
equation (1.11); P is the unique continuer of f from ∅ to x.
To prove statement 4, substitute P for f in equation (1.11) and again apply
the combination axiom.
Equations (1.6) and (1.9) do not require that Q be a function on x. They
require only that Q's domain, say v, should satisfy x = w ∪ v or, equivalently,
x \ w ⊆ v ⊆ x. In some cases (when the right-hand side of equation (1.8) does
not depend on all the coordinates of c), there is a continuer with a domain v
that is smaller than x. The situation is illustrated in Figure 1.2, where we have
written $u_1$ for w \ v, $u_2$ for w ∩ v, and $u_3$ for v \ w. We may say, in this situation,
that $u_2$ is sufficient for the continuation from w to x; the other variables in w,
those in $u_1$, can be neglected.
If the function f that we are continuing is a probability distribution, then
the idea of sufficiency can be elaborated in terms of the meaning of the
probabilities. If we give the probabilities an objective interpretation, then we can say
that once the configuration of $u_2$ is determined, the configuration of $u_1$ will not
affect the determination of the configuration of $u_3$. If we give the probabilities a
subjective interpretation, then we can say that once we know the configuration
of $u_2$, information about the configuration of $u_1$ will not affect our beliefs about
the configuration of $u_3$.
The philosophy of probability that underlies this monograph is neither strictly
objective nor strictly subjective. Instead, it is constructive. We see a probability
distribution as something we deliberately construct in order to make predictions.
Though these predictions may be the best we can do, we need not be fully
committed to them as beliefs. And though they should be evaluated empirically,
they need not individually represent stable frequencies. In terms of this
constructive interpretation, sufficiency simply means adequacy for prediction. Once
the configuration of $u_2$ is specified, we ignore information about $u_1$ when we
predict $u_3$.
Instead of saying that $u_2$ is sufficient for the continuation from w to x, we
may say that $u_3$ is independent of $u_1$ given $u_2$. The concept of conditional
independence thus defined is mathematically interesting. Its properties include the
symmetry suggested by Figure 1.2: if $u_3$ is independent of $u_1$ given $u_2$, then $u_1$
is independent of $u_3$ given $u_2$ (see Dawid [27], Pearl [8], or Appendix F of Shafer
[9]). Conditional independence is an important concept for both the objective
and subjective interpretations of probability. In the objective interpretation, a
We call this number P's posterior probability for d given c. It exists only if
$P^{\downarrow w}(c) > 0$, but we may suppose that if $P^{\downarrow w}(c)$ is zero we will not observe
w = c.
Equation (1.13) defines a whole probability distribution, a distribution on
x \ w that we may designate by $P^{x \setminus w \mid w=c}$:
Equation (1.15) says that $P^{|w=c}$ is equal to the product of P and the function
on w that assigns the value $1/P^{\downarrow w}(c)$ to the configuration c and the value 0 to
all other configurations. It follows that $P^{|w=c}$ is proportional to the product
of P and a function on w that assigns 1 to c and 0 to all other configurations.
This point is sufficiently important to merit being stated in symbols. To this
end, we write $I_{w=c}$ for the function on w that assigns 1 to c and 0 to all other
configurations:
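
To make the role of the indicator function concrete, the following Python sketch computes a posterior of Table 1.1 by multiplying by an indicator and then renormalizing. The function name `posterior` and the choice to raise an error for zero-probability observations are our own assumptions.

```python
from fractions import Fraction as F
from itertools import product

VARS = ("Age", "Sex", "Party")
CELLS = list(product(("female", "male"), ("Dem", "ind", "Rep")))
# Table 1.1 in hundredths, one row per Age.
ROWS = {
    "young": (8, 16, 8, 2, 4, 2),
    "middle-aged": (5, 5, 5, 0, 0, 0),
    "old": (5, 5, 5, 10, 10, 10),
}
P = {(age, sex, party): F(n, 100)
     for age, row in ROWS.items()
     for (sex, party), n in zip(CELLS, row)}


def posterior(table, variables, observed):
    """Multiply by the indicator of the observation, then renormalize."""
    def indicator(config):
        return int(all(config[variables.index(v)] == value
                       for v, value in observed.items()))

    weighted = {c: p * indicator(c) for c, p in table.items()}
    total = sum(weighted.values())   # the marginal probability of the observation
    if total == 0:
        raise ValueError("the observed configuration has probability zero")
    return {c: w / total for c, w in weighted.items()}


given_male = posterior(P, VARS, {"Sex": "male"})
```

In this sketch the posterior is kept as a table on all three variables, with zeros for configurations that disagree with the observation. Conditioning on Age = middle-aged and Sex = male raises an error, since Table 1.1 gives that configuration probability zero.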
where the $f_i$ are tables of reasonable size, but the number of variables involved
altogether is too large to allow the actual computation and storage of the table P.
(It will not be difficult to compute the value of P for a particular configuration,
at least if we know the constant of proportionality. But there may be too many
configurations for us to compute the value of P for all of them.) In this situation,
as we will see, we can often work from the factorization to find marginals for P,
even though we cannot compute P itself. We may also be interested in computing
marginals for posteriors of P, and therefore we will be interested in transforming
and
1.6. Expectation.
Most readers will be familiar with the idea of the expectation of a function
V on x with respect to a probability distribution P on x. This is a number,
usually denoted by $E_P(V)$. In the discrete case, it is obtained by multiplying
corresponding values of P and V and adding the products. Thus

$$E_P(V) = \sum_{c} P(c)\,V(c),$$

where the sum runs over the configurations c of x.
1.8. A limitation.
Though the multivariate framework for probability is widely used, it has its
limitations. A principal limitation is that it requires every variable to have a
value no matter how matters come out. This is often appropriate in statistical
work; in our example, every individual has an age and a sex, and we invent the
category "independent" so that every individual will have a party affiliation. It is
less appropriate in expert-system work, where the meaningfulness of a variable
often depends on the values of other variables. A particular medical test or
procedure only has a result if it is carried out, and we carry it out only for
some patients. A particular phoneme has a certain characteristic in the seventh
millisecond only if it lasts that long, and sometimes it may not. "Number of
pregnancies" is applicable only to women, not to men and children. We can
pretend that these variables always have values, but when there are many of
them, this is computationally awkward as well as artificial.
It is one thing to recognize this limitation and another to correct it. The
multivariate framework is flexible as well as expressive, and the obvious
alternatives lack much of its flexibility. A tree, for example, allows us to represent
some variables as being meaningful only if others have certain values but
allows access to the variables only in a certain order. Consequently, most work in
probability, both theory and application, is carried out within the multivariate
framework, and extensions to the framework are developed and used on a fairly
ad hoc basis.
The graphical models that we will study in the following chapters are squarely
within the multivariate framework. For some ideas about going beyond it, see
Dempster [16] and Chapter 16 of Shafer [9].
Exercises.
EXERCISE 1.1. Derive the three properties of marginalization listed in Section 1.2
from equation (1.2).
EXERCISE 1.2. Here are some familiar problems, each with its own concept
of combination and its own concept of marginalization. Discuss, in each case,
how to formalize the problems so that the axioms of transitivity and combination
are satisfied.
Show that these operations satisfy the axioms of transitivity and combination.
(Compare equation (1.22).) This example, suggested to the author by Robert
Cowell, is relevant to computation in decision theory, where f may represent
a probability distribution and V may represent a utility function.
EXERCISE 1.4. Consider a function f on a set of variables x, together with a
collection $h_X$, X ∈ x, of functions on the individual variables in x. For each subset
w of x, let f^w be the marginal on w of the function obtained by multiplying f
by the hx for X not in w. In symbols,
CHAPTER 2
Construction Sequences
Under certain conditions on the heads and tails of a sequence of conditionals, the
product of the conditionals will be a probability distribution. We call a sequence
of conditionals satisfying these conditions a construction sequence.
As we will see, the conditionals in a construction sequence are continuers for
the probability distribution obtained by multiplying them together. Initial
segments of the sequence produce marginals of this probability distribution. Thus
the construction sequence represents a step-by-step construction of the
probability distribution.
After constructing a probability distribution, we may want to find a marginal
for it or one of its posteriors. This may be difficult computationally, especially
if the joint frame of all the variables is too large to permit us to carry out the
multiplication of the conditionals. Were we able to carry out this multiplication,
we could store the resulting table and work directly with it to find marginals.
But if we are obliged to keep the probability distribution stored as a product of
tables, then we must look for less direct methods.
In some cases, as we will see in this chapter, a computationally inexpensive
adaptation of a construction sequence will produce a construction sequence for
the marginal we desire. To obtain the marginal for the variables in an initial
segment of a construction sequence, we need only omit the later factors from the
construction sequence. To obtain the posterior for later variables given values
of the variables in an initial segment, we need only slice the later factors. If the
construction sequence is a chain, then we can find a construction sequence for
the variables in a final segment by a simple forward propagation. The general
case, however, requires the more general methods that we will study in the next
chapter: methods that apply to any distribution stored as a product of tables,
whether or not the tables form a construction sequence.
If each new conditional in a construction sequence involves a single new
variable, then the most essential qualitative aspects of the construction sequence
can be represented by a directed acyclic graph (DAG). Such graphs have been
widely used for knowledge acquisition for probabilistic expert systems, and on
the theoretical side, they have been studied as a representation of conditional
independence relations (Pearl [8]). Here we emphasize the value of DAGs for
representing alternative construction sequences: construction sequences that use the
same conditionals but order them differently. By bringing these alternative
orderings into the picture, a DAG enlarges the number of marginals and posteriors
that we can find by simple manipulations. In the general case, where each new
conditional is allowed to involve more than one new variable, we can similarly
indicate alternative orderings with a bubble graph, which is slightly more general
than a DAG.

TABLE 2.1
$Q_1$, a probability distribution for Age. (This is a conditional with an empty tail and with
Age as its head.)

young          .40
middle-aged    .15
old            .45

TABLE 2.2
$Q_2$, a conditional with Age as its tail and Sex as its head.

               female   male
young            4/5     1/5
middle-aged       1       0
old              1/3     2/3

TABLE 2.3
$Q_1 Q_2$, a probability distribution for Age and Sex.

               female   male
young            .32     .08
middle-aged      .15     .00
old              .15     .30

Table 2.1 gives a probability distribution $Q_1$ for Age (its single column adds to
one), and Table 2.2 gives a conditional $Q_2$ for Sex given Age (each row adds to
one). When we multiply these two tables, we get Table 2.3, which qualifies as a
probability distribution for Age and Sex (its six entries add to one). Notice that
$Q_1$ is a marginal of this probability distribution and hence $Q_2$ is a continuer.
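
The numerical multiplication can be reproduced in a few lines. The following Python sketch multiplies Tables 2.1 and 2.2 with exact fractions and checks the claims just made; the nested-dictionary layout is our own choice.

```python
from fractions import Fraction as F

# Table 2.1: Q1, a probability distribution for Age.
Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}

# Table 2.2: Q2, a conditional for Sex given Age (one inner dict per Age).
Q2 = {
    "young": {"female": F(4, 5), "male": F(1, 5)},
    "middle-aged": {"female": F(1), "male": F(0)},
    "old": {"female": F(1, 3), "male": F(2, 3)},
}

# The product Q1 Q2 is a table on {Age, Sex}.
product_table = {(age, sex): Q1[age] * q
                 for age, row in Q2.items()
                 for sex, q in row.items()}

# Its six entries add to one ...
total = sum(product_table.values())

# ... and summing out Sex gives back Q1.
marginal_on_age = {age: sum(p for (a, _), p in product_table.items() if a == age)
                   for age in Q1}
```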
We need not carry out the numerical multiplication in order to see that the
product $Q_1 Q_2$ is a probability distribution. We can instead perform an abstract
computation:
Here we have first broken the summation into a summation over Sex followed
by a summation over Age. Since $Q_1$ does not involve Sex, it can be factored out
of the first summation, leaving $Q_2$, which sums to one over Sex because it is a
conditional. This leaves us with the sum of $Q_1$ over Age, which is one because
$Q_1$ is a probability distribution.
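
The steps just described can be written out as follows. This is a reconstruction from the surrounding prose rather than a quotation; the summation indices a (for Age) and s (for Sex) are our own labels.

```latex
\begin{align*}
\sum_{a,s} Q_1(a)\,Q_2(a,s)
  &= \sum_{a} \sum_{s} Q_1(a)\,Q_2(a,s) \\
  &= \sum_{a} Q_1(a) \sum_{s} Q_2(a,s) \\
  &= \sum_{a} Q_1(a) \cdot 1 \\
  &= 1.
\end{align*}
```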
Consider more generally any two conditionals $Q_1$ and $Q_2$. Write $t_i$ for the
tail, $h_i$ for the head, and $d_i$ for the domain of $Q_i$. (Recall that $d_i = t_i \cup h_i$.) Our
example generalizes to the following proposition.
PROPOSITION 2.1. Suppose $t_1$ is empty, $t_2$ is contained in $d_1$, and $h_2$ is
disjoint from $d_1$.
1. The product $Q_1 Q_2$ is a probability distribution on $d_1 \cup d_2$.
2. The conditional $Q_1$ is $Q_1 Q_2$'s marginal on $d_1$.
3. The conditional $Q_2$ continues $Q_1 Q_2$ from $d_1$ to $d_1 \cup d_2$.
Proof. Since we do not have symbols for individual variables, we will not use
summations like those in equation (2.1); instead, we will use our notation for
marginalization. We prove statement 1 by writing
Here we have used both the transitivity and the combination axioms.
Since $Q_1$ has an empty tail, it is a probability distribution. By the
combination axiom,
FIG. 2.1. Left: the first tail is empty. The second tail is contained in the first domain,
and the second head is disjoint from the first domain. Right: two more head-tail pairs have
been added. Each time, the new tail is contained in the existing domain, and the new head is
disjoint from it.
and we say that the construction sequence represents this probability
distribution. The restrictions on the head-tail structure of a construction sequence are
illustrated in Figure 2.1.
Statement 2 of Proposition 2.2 indicates one way that we can exploit a
construction sequence. If we are interested only in the variables in $d_1 \cup \cdots \cup d_i$ and
not in the remaining variables, those in $h_{i+1} \cup \cdots \cup h_n$, then we can simply
omit the last n - i conditionals from the construction sequence: $Q_1, \ldots, Q_i$ is a
construction sequence for the marginal probability distribution on $d_1 \cup \cdots \cup d_i$.
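
As a small check of this idea, the sketch below treats Tables 2.1, 2.2, and 1.5 as a three-step construction sequence for Age, Sex, and Party, and confirms that dropping the last conditional gives the marginal on {Age, Sex}. Treating these three tables as one construction sequence, and the variable names used, are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import product

CELLS = list(product(("female", "male"), ("Dem", "ind", "Rep")))

# Q1: distribution for Age (Table 2.1).
Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}

# Q2: conditional for Sex given Age (Table 2.2), keyed by (age, sex).
Q2 = {
    ("young", "female"): F(4, 5), ("young", "male"): F(1, 5),
    ("middle-aged", "female"): F(1), ("middle-aged", "male"): F(0),
    ("old", "female"): F(1, 3), ("old", "male"): F(2, 3),
}

# Q3: conditional for Party given Age and Sex (Table 1.5).
Q3_ROWS = {
    "young": (F(1, 4), F(1, 2), F(1, 4), F(1, 4), F(1, 2), F(1, 4)),
    "middle-aged": (F(1, 3), F(1, 3), F(1, 3), F(1, 5), F(1, 5), F(3, 5)),
    "old": (F(1, 3),) * 6,
}
Q3 = {(age, sex, party): q
      for age, row in Q3_ROWS.items()
      for (sex, party), q in zip(CELLS, row)}

# The full product Q1 Q2 Q3, a table on {Age, Sex, Party}.
full = {(a, s, p): Q1[a] * Q2[(a, s)] * q for (a, s, p), q in Q3.items()}

# Omitting the last conditional leaves Q1 Q2 ...
initial_segment = {(a, s): Q1[a] * q for (a, s), q in Q2.items()}

# ... which should equal the marginal of the full product on {Age, Sex}.
marginal = {}
for (a, s, p), value in full.items():
    marginal[(a, s)] = marginal.get((a, s), 0) + value
```

The point of the shortcut is that `initial_segment` is obtained without ever forming `full`; the full table is built here only to verify the claim.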
Another way to exploit a construction sequence is to fix the values of variables
we have observed. If these variables appear at the beginning of the construction
sequence, then this produces a construction sequence for the posterior distribu-
tion.
PROPOSITION 2.3. Suppose $Q_1, \ldots, Q_n$ is a construction sequence. Suppose
1 < i < n. Write d for $\bigcup_{j=1}^{n} h_j$, the domain of $Q_1 \cdots Q_n$, and write t for $\bigcup_{j=1}^{i} h_j$,
the domain of $Q_1 \cdots Q_i$. Suppose c is a configuration of t. Then
TABLE 2.4
A conditional for Party given Age.
Notice that if we use instead the conditional $Q_3'$ given by Table 2.4, then we
obtain the same probability distribution $P_{Age,Sex,Party}$.
The middle graph in Figure 2.2 and the graph in Figure 2.3 both have cycles,
but not cycles following the arrows. The cycle $X_1, X_3, X_4, X_1$ in Figure 2.3, for
example, goes against an arrow on its last step.
A belief net is a finite DAG with variables as nodes, together with, for each
node X, a conditional that has X as its head and X's immediate predecessors
⁵ Some authors prefer the name acyclic directed graph in order to emphasize that only
directed cycles are forbidden; a path that does not always follow the arrows is allowed to be a
cycle. But the name directed acyclic graph and the acronym DAG are strongly established in
the literature.
⁶ A variety of other names are also in use, including Bayesian network and graphical model.
Every DAG construction ordering for the DAG of a belief net gives, of course,
an ordering of its conditionals that is a construction sequence for the
probability distribution represented by the belief net. Thus the five DAG construction
orderings we just listed produce five construction sequences for the
probability distribution in equation (2.6), that is, five ways to permute the $Q_i$ and still have a
construction sequence.
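
For a small DAG, the alternative orderings can be listed mechanically. The sketch below assumes that a DAG construction ordering is an ordering of the nodes in which every node comes after all of its immediate predecessors. The four-node DAG is hypothetical and is not the DAG of equation (2.6).

```python
def construction_orderings(parents):
    """Yield every ordering in which each node follows all its immediate predecessors."""
    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for node in sorted(remaining):
            # A node may be added once all of its predecessors are already placed.
            if parents[node] <= set(prefix):
                yield from extend(prefix + [node], remaining - {node})

    yield from extend([], frozenset(parents))


# Hypothetical DAG: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4.
PARENTS = {
    "X1": set(),
    "X2": {"X1"},
    "X3": {"X1"},
    "X4": {"X2", "X3"},
}
orderings = list(construction_orderings(PARENTS))
```

Each ordering found this way tells us how to permute the conditionals attached to the nodes while keeping a construction sequence.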
We can talk about a belief net representing a probability distribution, without
reference to any particular construction sequence: a belief net represents a prob-
ability distribution P if P is equal to the product of the conditionals attached
to its DAG. We can also talk about a DAG by itself representing a probability
distribution: a DAG represents P if by attaching appropriate conditionals we
can make it into a belief net representing P, i.e., if P factors into conditionals
in the way indicated by the DAG.
Considered abstractly, a belief net represents a probability distribution more
concisely than a construction sequence does. It provides the same conditionals,
but it refrains from ordering them completely. For this reason, belief nets are
considered more fundamental than construction sequences in much of the litera-
ture on probabilistic expert systems. As a practical matter, however, belief nets
arise from a step-by-step construction that provides a complete ordering, and
we usually preserve this ordering when we store a belief net. Moreover, as we
will see in the next section, there is no practical advantage in considering only
construction sequences that introduce one new variable at a time. So in this
monograph, we take construction sequences as fundamental, and we treat belief
nets as secondary tools, tools that help us see alternative orderings for
particular one-new-variable-at-a-time construction sequences. In small problems, where
we can actually draw the DAG, it enables us to see alternative orderings at a
glance. In larger problems, the idea of the DAG reminds us of the existence of
alternative orderings.
for some k.
We call a DAG a chain if its nodes can be ordered, as in Figure 2.4, so that
the first has no immediate predecessors in the DAG and each of the others has
its predecessor in the ordering as its only immediate predecessor in the DAG.
Notice that a chain has only one DAG construction ordering: $X_1, \ldots, X_n$ is the
unique DAG construction ordering for the chain $X_1 \to \cdots \to X_n$.
We call a belief net a belief chain if its DAG is a chain. Thus a belief chain
consists of a chain $X_1 \to \cdots \to X_n$ and corresponding conditionals $Q_1, \ldots, Q_n$.
The first conditional has $X_1$ as its head and an empty tail; the ith conditional
has $X_i$ as its head and $X_{i-1}$ as its tail. The idea of forward propagation in such
a chain is based on the following lemma.
LEMMA 2.3. In a belief chain,
Markov chains and hidden Markov models. Readers familiar with the
theory of Markov chains may find it illuminating to note that a finite Markov
chain is a special kind of belief net. It is a belief chain such that each variable has
the same frame and all the conditionals after the first are identical. Figure 2.5
shows a simple Markov chain.
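
For a Markov chain, forward propagation reduces to repeatedly multiplying the current marginal by the common conditional and summing out the earlier variable. The following Python sketch illustrates this on a hypothetical two-state chain; the states, the numbers, and the function names are assumptions, and the chain is not the one in Figure 2.5.

```python
from fractions import Fraction as F

STATES = ("a", "b")                          # hypothetical common frame
INITIAL = {"a": F(1), "b": F(0)}             # the first conditional: a distribution for X1
STEP = {                                     # the common conditional for the next variable
    "a": {"a": F(1, 2), "b": F(1, 2)},
    "b": {"a": F(0), "b": F(1)},
}


def propagate(marginal):
    """Multiply the current marginal by the conditional and sum out the current variable."""
    return {j: sum(marginal[i] * STEP[i][j] for i in STATES) for j in STATES}


def marginals(n):
    """Return the marginals for X1, ..., Xn, computed one step at a time."""
    result = [INITIAL]
    for _ in range(n - 1):
        result.append(propagate(result[-1]))
    return result


m = marginals(3)
```

Each step handles a table on only two variables, so the joint table on all n variables is never formed.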
Most of the theory of Markov chains is concerned with their repetitive nature
and hence does not extend to belief nets in general or even to belief chains in
general. For example, a Markov chain is sometimes described in terms of its state
graph. This is a directed graph (not usually acyclic) with the states (elements of
the common frame) as nodes and with an arrow from state i to state j whenever
the (i,j)th entry of the common conditional is positive. (Figure 2.6 shows the
state graph for the Markov chain of Figure 2.5.) In general, we cannot draw a
state graph for a belief chain because the successive variables may have different
frames. Even if the frames are the same, the possible transitions or at least their
probabilities will vary.

FIG. 2.6. The state graph for the Markov chain in Figure 2.5.
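
The arrows of a state graph can be read off the common conditional directly. A minimal sketch, using a hypothetical two-state conditional rather than the one in Figure 2.5:

```python
def state_graph(conditional):
    """Return the arrows (i, j) for which the (i, j)th entry is positive."""
    return sorted((i, j)
                  for i, row in conditional.items()
                  for j, p in row.items()
                  if p > 0)


# Hypothetical common conditional: row i gives the probabilities of the next state.
STEP = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.0, "b": 1.0}}
arrows = state_graph(STEP)
```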
In recent years, considerable use has been made of belief nets of a type
slightly more general than Markov chains: hidden Markov models. To form a
hidden Markov model, we begin with a Markov chain, say $X_1 \to \cdots \to X_n$, and
from each node $X_i$ we add an arrow to a new node, say $Y_i$, so as to obtain a
DAG as in Figure 2.7. All the $Y_i$ have the same frame (possibly different from the
frame for the $X_i$) and the same conditional. In applications, the $Y_i$ are observed,
while the $X_i$ are not; the Markov chain $X_1 \to \cdots \to X_n$ is hidden. We are
interested in finding posterior probabilities for the $X_i$. We may, for example,
want to find the most likely configuration of $X_1, \ldots, X_n$. Since the $Y_i$ do not
form an initial segment of the belief net, we cannot use Proposition 2.4 to find
posterior probabilities for the $X_i$. But efficient methods for finding posterior
probabilities (and for finding most likely configurations) have been developed in
the literature on hidden Markov models, and these methods, as it turns out, are
special cases of more general methods that we will study in Chapter 3.
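To make the posterior computation concrete, here is a minimal sketch of the forward-backward recursion that is standard in the hidden Markov model literature. It is not taken from this monograph. The function name, the two-state numbers, and the 0-based indexing of states and observations are assumptions made for illustration.

```python
def hmm_posteriors(prior, trans, emit, observations):
    """Posterior P(X_t = s | all observations) for each t, by forward-backward."""
    states = range(len(prior))
    T = len(observations)
    # Forward pass: alpha[t][s] = P(y_0..y_t, X_t = s).
    alpha = [[prior[s] * emit[s][observations[0]] for s in states]]
    for t in range(1, T):
        y = observations[t]
        alpha.append([emit[s][y] * sum(alpha[t - 1][r] * trans[r][s] for r in states)
                      for s in states])
    # Backward pass: beta[t][s] = P(y_{t+1}..y_{T-1} | X_t = s).
    beta = [[1.0 for _ in states] for _ in range(T)]
    for t in range(T - 2, -1, -1):
        y = observations[t + 1]
        beta[t] = [sum(trans[s][r] * emit[r][y] * beta[t + 1][r] for r in states)
                   for s in states]
    # Combine and normalize at each position of the chain.
    posteriors = []
    for t in range(T):
        joint = [alpha[t][s] * beta[t][s] for s in states]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    return posteriors

# Hypothetical model: two hidden states, two possible observations.
prior = [0.5, 0.5]                    # conditional for the first hidden variable
trans = [[0.7, 0.3], [0.4, 0.6]]      # common conditional for later hidden variables
emit = [[0.9, 0.1], [0.2, 0.8]]       # common conditional for the observed variables
posteriors = hmm_posteriors(prior, trans, emit, [0, 1])
```

Each row of `posteriors` is the posterior distribution of one hidden variable given all the observations.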
Figure 2.7 represents only the simplest type of hidden Markov model; in
practice, the model is elaborated in various ways. One common elaboration
involves attaching more than one observable variable to each $X_i$. There may be
a fixed number of observable variables for each $X_i$, or this number itself may be
an observable variable. In speech recognition, for example, each $X_i$ represents
Though the visual clarity of belief nets is very attractive, there is no practi-
cal reason to limit ourselves to construction sequences involving only one new
variable at a time. All the computational ideas we considered in the preceding
section generalize to the general case, and we can also generalize the graphical
representation itself.
The simplest graphical representation of a general construction sequence is
the bubble graph. This graph has a node for each conditional. This node, called
a bubble, contains all the variables in the head and has an arrow to it from each
variable in the tail. Figure 2.8 shows a bubble graph for a construction sequence
for ten variables:
A bubble graph is acyclic in the same sense that a DAG is acyclic: we cannot
go in a cycle following the arrows. Moreover, a bubble graph, like a DAG,
permits us to pick out alternative construction orderings for the nodes, i.e.,
alternative construction sequences for the probability distribution. In Figure 2.8,
for example, the bubbles can be ordered in seven different ways:
And hence there are seven ways of ordering the conditionals to form a construc-
tion sequence:
We are particularly interested in the marginal of this posterior for the variable
N, which corresponds to an overall judgment that the financial statement is fairly
stated. Since the observed variables do not form an initial segment of the bubble
graph, we cannot find this marginal using the methods we have studied in this
chapter. Instead, we must use the methods of the next chapter, which apply to
arbitrary factorizations.
Exercises.
EXERCISE 2.1. The idea of a construction sequence for a probability distri-
bution generalizes to the idea of a construction sequence for a conditional. In
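Then consider a sequence of conditionals $Q_1, \ldots, Q_n$. Under the hypothesis that
$h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$, prove the following statements:
1. The product $Q_1 \cdots Q_n$ is a conditional with head $h_1 \cup \cdots \cup h_n$
and domain $d_1 \cup \cdots \cup d_n$.
2. For $i = 2, \ldots, n$, $Q_1 \cdots Q_{i-1} 1_{(d_1 \cup \cdots \cup d_n) \setminus (h_1 \cup \cdots \cup h_n)}$ is the
marginal of $Q_1 \cdots Q_n$ on $(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)$.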
FIG. 2.16. Here we ask only that the second head be disjoint from the first domain.
7 The name "join tree" was coined in the theory of relational databases in the early 1980s
(Beeri et al. [22]). An alternative, "junction tree," is also current in the literature on belief
nets.
The clusters of variables involved in the tables are shown in Panel 1 of Figure 3.1.
Let us imagine summing the variables out in the reverse of the order in which
they are numbered, keeping track as we go of the new clusters we create.
Summing $X_7$ out yields
comes was created. The node to which an arrow points always includes all the
variables in the node from which the arrow comes, except the variable that was
summed out. For any particular variable X, any node n containing X must be
connected to the node n' created when X is summed out, because the tables
created as we go downward from n continue to contain X until it is summed out.
It follows that all the nodes containing X are connected in the tree (i.e., form
a subtree), and this is equivalent to the tree being a join tree.
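The subtree condition can be checked mechanically. The sketch below assumes that the given edges already form a tree and tests, for each variable, whether the nodes containing it are connected. The node list is our reading of the tree described in the text for Figure 3.2 (nodes 23, 57, 1245, 146, 124, 12, and 1); the function name and the `broken` variant are illustrative only.

```python
def is_join_tree(nodes, edges):
    """nodes: list of frozensets of variables; edges: index pairs forming a tree."""
    adjacent = {i: set() for i in range(len(nodes))}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    for var in set().union(*nodes):
        holders = {i for i, n in enumerate(nodes) if var in n}
        # Walk from one holder, stepping only through nodes that contain var.
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adjacent[i]:
                if j in holders and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holders:
            return False  # the nodes containing var do not form a subtree
    return True

fig32 = [frozenset(s) for s in
         ({2, 3}, {5, 7}, {1, 2, 4, 5}, {1, 4, 6}, {1, 2, 4}, {1, 2}, {1})]
edges = [(1, 2), (2, 4), (3, 4), (4, 5), (0, 5), (5, 6)]
print(is_join_tree(fig32, edges))

# Replacing 1245 by 1234 separates the two nodes that contain variable 3.
broken = fig32[:2] + [frozenset({1, 2, 3, 4})] + fig32[3:]
print(is_join_tree(broken, edges))
```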
The join tree that we construct in this way is interesting because it can be
interpreted as a picture of the computations involved in the variable-by-variable
summing out. We interpret a node x as a register that can store a table for its
variables, and we interpret an arrow from x to y as an instruction to sum out a
variable from x's table and multiply y's table by the result.
We begin by putting tables in the storage registers; in Figure 3.2, for example,
we put the table $f_1$ in 23, the table $f_2$ in 57, the product $f_3 f_4$ in 1245, and the
table $f_5$ in 146. We put tables of ones in the other three nodes. The number
beside each arrow tells us which variable to sum out of the table in the node
preceding the arrow. Figure 3.3 shows the summations we perform when we
follow these instructions.
We summed the variables out in the reverse of the order in which they were
numbered: 7, 6, 5, 4, 3, 2. Figures 3.2 and 3.3 make it clear, however, that
this order can be varied to some extent without changing the join tree or the
computations performed. The only constraint is that we sum out of a given node
only after the node has absorbed messages from all nodes with arrows pointing
to it. Only the three nodes 23, 57, and 146 can begin the computation, 1245 can
act after 57, 124 can act after 1245 and 146, and so on.
We do not need the numbers beside the arrows in Figure 3.2. These numbers
tell us which variable to sum out, but we can also find this information by
comparing the node sending the message to the node receiving it. The sender
always sums out the variable it has that its neighbor does not have. In other
words, it marginalizes to its intersection with the neighbor.
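This observation is easy to verify on the arrows described in the text. In the sketch below, each node is written as a string of digits, one digit per variable; this encoding and the list of arrows are our reading of Figure 3.2, not a reproduction of it.

```python
# Arrows (sender, receiver) of the join tree, read off from the description in the text.
arrows = [("57", "1245"), ("146", "124"), ("1245", "124"),
          ("124", "12"), ("23", "12"), ("12", "1")]

def summed_out(sender, receiver):
    """Variables the sender has that the receiver does not have."""
    return sorted(set(sender) - set(receiver))

labels = {(s, r): summed_out(s, r) for s, r in arrows}
for (s, r), variables in labels.items():
    print(f"{s} -> {r}: sum out {variables}")
```

Each arrow recovers exactly one variable, and together the arrows recover the variables 7, 6, 5, 4, 3, 2 that were summed out.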
The final result of the computation is $f^{\downarrow X_1}$, the marginal of $f$ for $X_1$. If we
continue by summing $X_1$ out of this table, then we obtain $f^{\downarrow \emptyset}$, the marginal of
$f$ on the empty set. Figure 3.2 can be extended to include this final summation;
we simply add $\emptyset$ as a node, with an arrow to it from 1.
Rule 1. Each node waits to send its message to its neighbor nearer
to r until it has received messages from all its other neighbors.
Rule 2. When a node is ready to send its message, it computes the
message by summing out of its current table any variables it has but
the neighbor to whom it is sending the message does not have. (This
was always a single variable in Figure 3.1, but it could be several
variables or none.) In other words, it marginalizes its current table
to its intersection with the neighbor.
Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
Eventually, all the nodes except r will have sent messages, and r will have re-
ceived a message from each of its neighbors and will have multiplied its original
table by all these messages.
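Rules 1 to 3 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's implementation: every variable is binary, a table is a pair (variables, dictionary of values), the helper names are ours, and the tables are made-up positive numbers placed on our reading of the join tree of Figure 3.2. The result is compared with a brute-force marginal of the full product.

```python
from itertools import product

FRAME = (0, 1)  # assumption: every variable is binary

def table(variables, fn):
    """Build a table on `variables`; fn maps an assignment dict to a number."""
    vs = tuple(sorted(variables))
    return vs, {vals: fn(dict(zip(vs, vals)))
                for vals in product(FRAME, repeat=len(vs))}

def multiply(t1, t2):
    (v1, f1), (v2, f2) = t1, t2
    return table(set(v1) | set(v2),
                 lambda a: f1[tuple(a[v] for v in v1)] * f2[tuple(a[v] for v in v2)])

def marginalize(t, keep):
    vs, f = t
    out = {}
    for vals, p in f.items():
        key = tuple(x for v, x in zip(vs, vals) if v in keep)
        out[key] = out.get(key, 0.0) + p
    return tuple(v for v in vs if v in keep), out

def propagate_inward(tables, edges, root):
    tables = dict(tables)  # leave the caller's tables untouched
    neighbors = {n: set() for n in tables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = set(tables)
    while remaining != {root}:
        # Rule 1: a nonroot node with one remaining neighbor has heard from all the others.
        sender = next(n for n in remaining
                      if n != root and len(neighbors[n] & remaining) == 1)
        (receiver,) = neighbors[sender] & remaining
        message = marginalize(tables[sender], sender & receiver)   # Rule 2
        tables[receiver] = multiply(tables[receiver], message)     # Rule 3
        remaining.remove(sender)  # peel the sender away from the tree
    return tables[root]

nodes = [frozenset(s) for s in
         ({2, 3}, {5, 7}, {1, 2, 4, 5}, {1, 4, 6}, {1, 2, 4}, {1, 2}, {1})]
edges = [(nodes[1], nodes[2]), (nodes[2], nodes[4]), (nodes[3], nodes[4]),
         (nodes[4], nodes[5]), (nodes[0], nodes[5]), (nodes[5], nodes[6])]
# Hypothetical positive tables, one per node.
tables = {n: table(n, lambda a, k=k: 1.0 + (k + sum(v * x for v, x in a.items())) % 3)
          for k, n in enumerate(nodes)}
root = nodes[6]
result = propagate_inward(tables, edges, root)

# Brute-force check: multiply everything, then marginalize to the root.
full = table((), lambda a: 1.0)
for t in tables.values():
    full = multiply(full, t)
expected = marginalize(full, root)
```

Peeling the sender away after it sends its message mirrors the device used in the proof of Proposition 3.1.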
Here is the proposition we need to prove.
PROPOSITION 3.1. At the end of the algorithm just described, the table on r
will be $\varphi^{\downarrow r}$, the marginal on r of the product of the initial tables.
Proof. Imagine for the moment that the nodes are peeled away from the join
tree as they send their messages, so that in the end only r remains. Thus a single
step of the algorithm consists of three parts: (1) a node t computes the marginal
of its table to $b \cap t$, (2) the neighbor b multiplies this marginal into its current
table, and (3) the node t is removed from the tree. This allows us to state the
following lemma.
LEMMA 3.1. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the tables before the step.
To see that Lemma 3.1 is true, write $N_1$ for the set of nodes in the tree before
the step, $N_2$ for the set of nodes in the tree after the step, and $\psi_x$ for the table
in node x before the step. Thus the product of the tables before the step is
$\prod_{x \in N_1} \psi_x$, and the product of the tables after the step is
$\left(\prod_{x \in N_2} \psi_x\right) \psi_t^{\downarrow b \cap t}$ (see
Figure 3.4). Since the tree is a join tree, $b \cap t = (\cup N_2) \cap t$. So we find, using the
combination axiom, that
which is a restatement of Lemma 3.1. Lemma 3.1, together with the transitivity
axiom, yields the next lemma.
LEMMA 3.2. After each step, the product of the tables that remain is the
marginal to the variables that remain of the product of the initial tables.
FIG. 3.4. The loaded join tree before and after t sends its inward message to b.
At the end of the algorithm, we have only one table, the table on the root,
and so we obtain Proposition 3.1 as a special case of Lemma 3.2.
We can gain some further insight into the algorithm by noting that when a
node b receives a message from a neighbor t, it is also receiving, indirectly, infor-
mation from the nodes on the other side of t. After any step (message-passing
and multiplication) in the algorithm, we can identify the nodes from which a
given node b has received information, either directly or indirectly. These nodes,
together with b itself, form a subtree, which we may call b's information
branch at that point (see Figure 3.5). The steps we have taken within this sub-
tree are the same as the steps we would have taken had we implemented the
algorithm on it alone, with b as the root. So as a corollary of Proposition 3.1,
we have the following proposition.
PROPOSITION 3.2. After each step, the table on a given node b will be the
marginal on b of the product of the initial tables in b's current information branch.
This is a generalization of Proposition 3.1, because at the end of the algo-
rithm, the root's information branch is the whole tree.
In the course of explaining our algorithm, we have found ourselves talking
about the nodes of the join tree as storage registers and even as individual
processors. Each node can store tables for a certain set of variables, multiply
such tables, and marginalize them. In effect, we have made the join tree, together
with the algorithm, into an architecture for marginalization. We call it the
elementary architecture. In the next few sections, we consider some alternative
architectures, based on the same join tree, that are able to compute marginals
for all the nodes, not merely for a single root node.
FIG. 3.5. The dashed arrows are those over which messages have already been sent. The
circled subtree is b's information branch at this point.
Join-tree architectures are potentially applicable to any instance of the gen-
eral problem of computing marginals of a function given as a product of tables,
as in equation (3.1), but in order to apply a join-tree architecture to such a prob-
lem, we first find a join tree that covers the product, one that includes for each
factor a node containing the domain of that factor. (If we want the marginal for a
cluster of variables that is not the domain of one of the factors, then we must
make sure that the join tree also has a node containing this cluster.) Once we
have such a join tree, we place each factor in a node containing its domain. If
a node x receives more than one factor, we multiply them together, and we also
multiply by $1_x$ if necessary in order to obtain a table that involves all the vari-
ables in x. If a node x does not receive a factor, we simply assign it the table $1_x$.
If the join tree has more than one node containing the domain of a particular
factor, we can put the factor in whichever of these nodes we please. In Figure 3.2,
for example, we have two different nodes that can accept a table on 124. To
minimize computation, we should choose the node with the smaller frame size,
but this is a minor consideration.
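The placement rule can be sketched as follows. The function name, the factor names `g` and `h`, and the frame sizes are hypothetical; the nodes are our reading of the join tree of Figure 3.2. Among the nodes that contain a factor's domain, the sketch picks one with the smallest frame size.

```python
from math import prod

def place_factors(nodes, factor_domains, frame_sizes):
    """Assign each factor to a covering node with the smallest frame size."""
    placement = {}
    for name, domain in factor_domains.items():
        covering = [n for n in nodes if domain <= n]
        if not covering:
            raise ValueError(f"no node of the join tree contains the domain of {name}")
        # The frame size of a node is the product of the frame sizes of its variables.
        placement[name] = min(covering,
                              key=lambda n: prod(frame_sizes[v] for v in n))
    return placement

nodes = [frozenset(s) for s in
         ({2, 3}, {5, 7}, {1, 2, 4, 5}, {1, 4, 6}, {1, 2, 4}, {1, 2}, {1})]
frame_sizes = {v: 2 for v in range(1, 8)}                 # assumption: binary variables
factors = {"g": frozenset({1, 2, 4}), "h": frozenset({5, 7})}  # hypothetical factors
placement = place_factors(nodes, factors, frame_sizes)
print(placement)
```

Here the factor on 124 could go in either 124 or 1245, and the sketch chooses 124.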
The choice of the join tree is much more important. We want a join-tree
cover with nodes small enough to permit computation. If such a join-tree cover
does not exist, we will have to turn to alternative methods for marginalization,
such as Markov-chain Monte Carlo.
As we noted at the beginning of the chapter, there are heuristics that do
produce reasonable choices for join-tree covers. Some of these heuristics do
involve choosing an order for eliminating (summing out) the variables. This not
only produces a join-tree cover; it also determines a placement of the factors in
the join tree: each factor goes as close as possible to the root.
FIG. 3.6. The partial Shafer-Shenoy architecture. Like the elementary architecture, it
finds the marginal for a single root node. In each separator, we have indicated the set of
variables involved in the messages that will be stored there; this is always the intersection of
the two neighboring nodes.
as the root. This usually involves a great deal of duplication. In Figure 3.4, for
example, most of the steps for computing the marginal on w will be the same as
those for computing the marginal on r.
The Shafer-Shenoy architecture provides one way to eliminate much of this
duplication. In this architecture, each node sends messages in all directions. It
is allowed to send its message to a particular neighbor as soon as it has messages
from all its other neighbors. In order that the computations for a message in one
direction should not interfere with those for a message in another direction, a
node no longer replaces its table each time it receives a message. Instead, it keeps
its initial table, stores the incoming messages, and performs multiplications only
as needed for computing outgoing messages.
As a first step in describing the Shafer-Shenoy architecture, we will describe
a partial version, in which, as in the elementary architecture, messages are prop-
agated only to a single root r. Figure 3.6 shows this partial architecture. The
squares on the arrows in this figure are called separators; they contain storage
registers for storing the messages sent in the direction of the arrows. As in the
elementary architecture, we begin with a table $\varphi_x$ on each node x and we want
to find $\varphi^{\downarrow r}$ for a particular node r, where $\varphi$ is the product of the $\varphi_x$. The storage
registers in the separators are initially empty.
Here are the rules for propagation in the partial Shafer-Shenoy architecture:
Rule 1. Each node waits to send its message to its neighbor nearer to
r until it has received messages from all its other neighbors. (More
precisely, it waits until messages have been received by the separators
between it and these other neighbors.)
Here, as in the partial architecture, the tables on the nodes do not change. At
the end of the propagation, each node x still has its initial table $\varphi_x$. The only
effect of the propagation is to fill all the storage registers in the separators.
A comparison of the rules for the full and partial architectures makes it clear
that the full architecture produces the same messages towards any particular
node as the partial architecture with that node as root. So once we have com-
pleted the propagation in the full architecture, we can find the marginal for
any particular node by collecting all its incoming messages and multiplying the
node's table by them.
PROPOSITION 3.4. At the end of the full Shafer-Shenoy propagation, we can
get $\varphi^{\downarrow x}$ for any node x by collecting all of x's incoming messages and multiplying
x's table by them.
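Proposition 3.4 can be checked numerically with a short sketch of the full propagation. As before, this is an illustration under assumptions: binary variables, tables as (variables, dictionary) pairs, helper names of our own choosing, and made-up positive tables on our reading of the join tree of Figure 3.2. The node tables are never overwritten; the messages are kept in a dictionary that plays the role of the separators.

```python
from itertools import product

FRAME = (0, 1)  # assumption: every variable is binary

def table(variables, fn):
    vs = tuple(sorted(variables))
    return vs, {vals: fn(dict(zip(vs, vals)))
                for vals in product(FRAME, repeat=len(vs))}

def multiply(t1, t2):
    (v1, f1), (v2, f2) = t1, t2
    return table(set(v1) | set(v2),
                 lambda a: f1[tuple(a[v] for v in v1)] * f2[tuple(a[v] for v in v2)])

def marginalize(t, keep):
    vs, f = t
    out = {}
    for vals, p in f.items():
        key = tuple(x for v, x in zip(vs, vals) if v in keep)
        out[key] = out.get(key, 0.0) + p
    return tuple(v for v in vs if v in keep), out

def shafer_shenoy(tables, edges):
    neighbors = {n: set() for n in tables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    messages = {}  # (sender, receiver) -> message stored in the separator
    pending = {(a, b) for a, b in edges} | {(b, a) for a, b in edges}
    while pending:
        for sender, receiver in list(pending):
            others = neighbors[sender] - {receiver}
            # A node may send to a neighbor once it has messages from all the others.
            if all((o, sender) in messages for o in others):
                t = tables[sender]  # the initial table; it is never replaced
                for o in others:
                    t = multiply(t, messages[(o, sender)])
                messages[(sender, receiver)] = marginalize(t, sender & receiver)
                pending.remove((sender, receiver))
    marginals = {}
    for x in tables:
        t = tables[x]
        for o in neighbors[x]:
            t = multiply(t, messages[(o, x)])  # collect all incoming messages
        marginals[x] = t
    return marginals

nodes = [frozenset(s) for s in
         ({2, 3}, {5, 7}, {1, 2, 4, 5}, {1, 4, 6}, {1, 2, 4}, {1, 2}, {1})]
edges = [(nodes[1], nodes[2]), (nodes[2], nodes[4]), (nodes[3], nodes[4]),
         (nodes[4], nodes[5]), (nodes[0], nodes[5]), (nodes[5], nodes[6])]
tables = {n: table(n, lambda a, k=k: 1.0 + (k + sum(v * x for v, x in a.items())) % 3)
          for k, n in enumerate(nodes)}  # hypothetical positive tables
marginals = shafer_shenoy(tables, edges)

# Brute-force product of all the tables, used only for checking.
full = table((), lambda a: 1.0)
for t in tables.values():
    full = multiply(full, t)
```

The loop does not fix a root in advance; whichever node first holds all its incoming messages plays that role.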
FIG. 3.7. The full Shafer-Shenoy architecture. The arrow in each storage register indi-
cates the direction of the message to be stored there.
FIG. 3.8. One order in which messages might be sent in the full Shafer-Shenoy architec-
ture.
If the computations are performed serially, there will necessarily be one node,
such as 1 in Figure 3.8, that is the first to receive messages from all its neighbors.
This node can be considered the root. The propagation consists of a pass inward
to the root and another pass back outward. It is not necessary, however, to
specify the root in advance. If the computations are performed in parallel (a
possibility suggested when we talk as if the nodes were individual processors),
then which node is the first to receive all its messages will depend on the pace
of the computations for the different nodes farther out in the tree, and it is even
possible that two nodes will tie for first. This happens in Figure 3.9, where
the computations proceed in parallel and in synchrony, and 124 and 12 receive
messages from each other simultaneously on the third step of the computation.
By comparing Figures 3.6 and 3.8, we can understand better why the Shafer-
Shenoy architecture stores so many messages. The elementary architecture uses
and discards each message when it is sent. But what would happen if we were
to follow the inward pass of the elementary architecture with an outward pass?
In the case of Figures 3.6 and 3.8, this means that after 1 absorbed the message
from 12, it would send a message back to 12. By the usual rule, the message back
would simply be its current table, which was obtained by multiplying its original
table by the message (no marginalization is needed, because the intersection of
1 with 12 is simply 1). Intuitively, this is wrong, because it forces 12 to
absorb again the message it just sent, effectively counting it twice. The Shafer-
Shenoy architecture sends instead only the original table, uncontaminated with
the message from 12. It is able to do this because it has kept both its original
table and the message. The same thing happens at each further step on the
outward pass. Node 12, for example, since it still has both its original table and
the messages from 23 and 1, is able to send a message back to 124 that is not
contaminated with the message it received from 124.
Roughly speaking, the Shafer-Shenoy architecture computes marginals for
all the nodes at about three times the price for a single marginal. We double
the computation because we compute two messages instead of one for each link,
and then we increase it by about the same amount again when we do the final
multiplications to get the marginal for each node. This contrasts with repeat-
ing the elementary architecture for each node, which multiplies the amount of
computation for a single marginal by the number of nodes.
Unfortunately, the Shafer-Shenoy architecture is still rather wasteful in its
demand for multiplication. Each node computes a message for each of its neigh-
bors only once (in contrast to what happens if we use the elementary architecture
over and over), but the multiplication a node performs to compute the message
to one neighbor still duplicates much of the multiplication it performs to com-
pute the message to another. In Figure 3.7, for example, node 124 will multiply
its original table by the message from 1245 once when it sends its message to
146 and again when it sends a message to 12. With yet more storage, we could
reduce this remaining duplication somewhat, but it is more effective to take an-
other tack. Instead of trying to keep the message a node sends on the inward
pass from being included in the message it gets back, we can allow for the mes-
sage's later return by dividing it out of the node's current table as it is sent.
This is the tack taken by the Lauritzen-Spiegelhalter architecture.
FIG. 3.10. Rules for the Lauritzen-Spiegelhalter architecture. The message, In or Out, is
always the marginal of the sender's current table to the sender's intersection with the receiver.
Rule 3. When a node receives a message, it replaces its current table
with the product of that table and the message.
Since each node received messages from all its outward neighbors on the inward
pass, we can restate Rule 1 for the outward pass in a simpler way: Each node
waits to send its messages outward until it has received a message from its unique
neighbor nearer to r. (This neighbor may be r itself; r must begin the outward
pass by sending one or more messages.)
Let us check that the Lauritzen-Spiegelhalter architecture produces the ap-
propriate marginals for all the nodes.
PROPOSITION 3.5. At the end of the Lauritzen-Spiegelhalter propagation, the
table on each node x is $\varphi^{\downarrow x}$.
Proof. First consider the situation at the end of the inward pass. On the
inward pass, the messages sent are the same as in the elementary architecture
and hence also the same as in the Shafer-Shenoy architecture. If x is not equal to
r, then during the inward pass, x sends its inward neighbor w the Shafer-Shenoy
message $m_{x \to w}$. At the end of the inward pass, x has received messages from
all its own outward neighbors (if any) and has sent only the message to w. This
gives the following lemma.
LEMMA 3.3. At the end of the inward pass, a node x not equal to the root
has as its table
LEMMA 3.4. At the end of the inward pass, the table on r is $\varphi^{\downarrow r}$.
Now consider the outward pass. On the outward pass, each node except the
root receives just one message: the message from its inward neighbor. The root
itself sends messages but does not receive any. So the table on the root does not
change, and each of the other tables changes exactly once, when it is multiplied
by the message from its inward neighbor. Since the propagation moves outward
from the root, Proposition 3.5 follows by induction from Lemma 3.4 together
with the following lemma.
LEMMA 3.5. Suppose w has $\varphi^{\downarrow w}$ as its table when it sends its message to out-
ward neighbor x. Then after absorbing the message, x will have $\varphi^{\downarrow x}$ as its table.
To prove Lemma 3.5, we need a formula for the message w sends to x.
LEMMA 3.6. If w has $\varphi^{\downarrow w}$ as its table when it sends its message to outward
neighbor x, then the message it sends is the product of the Shafer-Shenoy mes-
sages in both directions: $m_{w \to x} m_{x \to w}$.
To prove Lemma 3.6, we note that by its hypothesis and equation (3.3), the
table on w is
marginal of $\varphi$ on w as its table before sending the message, and it computes the
message by marginalizing this table to $w \cap x$.
Using continuers. The alert reader will have noticed that we glossed over the
problem of zero probabilities in our description of the Lauritzen-Spiegelhalter
architecture. If the table $m_{x \to w}$ has zero values, then we will not be able to
perform the division in equation (3.4). Fortunately, it is not really necessary to
perform this division. The reasoning with which we proved Proposition 3.5 will
work if we can find a continuer, say $Q_{x \cap w \to x}$, of $\varphi_x \prod_{n \in \mathcal{N}_x \setminus \{w\}} m_{n \to x}$ from $x \cap w$
to x, for we can use $Q_{x \cap w \to x}$ as x's table after it has sent its message inward
to w, and this will have the same effect as the division. When the message
$m_{w \to x} m_{x \to w}$ comes back, we obtain
as our table on x, so that Lemma 3.3 and Proposition 3.5 still hold.
The requirement that continuers should exist makes the Lauritzen-
Spiegelhalter architecture slightly less general than the Shafer-Shenoy architec-
ture, which allows negative entries in the tables $\varphi_x$. Continuers may fail to exist
when negative values are allowed. But if the product of the $\varphi_x$ is proportional
to a probability distribution, then we can take it for granted that all the entries
are nonnegative, because dropping minus signs will not change the product.
And, in this case, continuers exist by Proposition 1.1.
Notice the other implication of Proposition 1.1: we can choose the continuers
to be conditionals. More precisely, we can choose the continuer $Q_{x \cap w \to x}$ to be a
conditional with head $x \setminus w$ and tail $x \cap w$.
When we look beyond probability to other problems satisfying the transitiv-
ity and combination axioms (see the exercises at the end of Chapter 1 and at the
end of this chapter), we find that the Shafer-Shenoy and Lauritzen-Spiegelhalter
architectures have overlapping but distinct ranges of application. The Shafer-
Shenoy architecture works whenever there are no restrictions on multiplication
and marginalization, even if continuers do not exist. The Lauritzen-Spiegelhalter
architecture, on the other hand, can sometimes work under restrictions on
multiplication or marginalization that prevent the use of the Shafer-Shenoy
architecture.
where w(x) is x's inward neighbor. This new factorization, as it turns out, can
be interpreted as a construction sequence.
In order to make the interpretation as a construction sequence precise, let us
take one more step, continuing the inward pass, as it were, from r to the empty
set $\emptyset$. In other words, we factor the marginal $\varphi^{\downarrow r}$ into the product of $\varphi^{\downarrow \emptyset}$ and
a continuer from $\emptyset$ to r. Since $\varphi$ is proportional to a probability distribution
P, $\varphi^{\downarrow \emptyset} \neq 0$, and hence the continuer is unique; it is the marginal $P^{\downarrow r}$. So
equation (3.6) becomes
If we imagine a node $\emptyset$ added to the join tree, with an arrow to it from r,
then at the end of the inward pass, we have the factors on the right-hand side
of equation (3.7) on the nodes of the tree (see Figure 3.12).
By Proposition 1.3, the probability distribution P is equal to $\varphi / \varphi^{\downarrow \emptyset}$. So
equation (3.7) tells us that
It is the conditionals on the right-hand side of this equation that can be arranged
in a construction sequence for P. Indeed, suppose $x_1, \ldots, x_m$ is an ordering of
the nodes of the join tree that moves outward from the root, i.e., such that $x_1$
is the root and each later $x_i$ is an outward neighbor of one of $x_1, \ldots, x_{i-1}$. (Such
orderings exist in any tree.) Write $Q_i$ for $Q_{x_i \cap w(x_i) \to x_i}$, for $i = 2, \ldots, m$. Then
we have the following lemma.
LEMMA 3.7. $P^{\downarrow r}, Q_2, \ldots, Q_m$ is a construction sequence for P.
Proof. Equation (3.8) says that P is the product of $P^{\downarrow r}, Q_2, \ldots, Q_m$, and
the union of their heads is clearly equal to the domain of P. So to prove
the lemma, we need only show that the head of each conditional is disjoint from
the domain of the preceding ones. But this is an obvious property of join trees:
whenever we order the nodes in a sequence moving outward from a root, the
intersection of each node $x_i$ with the preceding nodes is always contained in its
inward neighbor $w(x_i)$, and hence $x_i \setminus w(x_i)$ is disjoint from $x_1 \cup \cdots \cup x_{i-1}$.
Lemma 3.7 says that at the end of the inward pass, the tables on the nodes
are conditionals, and any outward sequence is a construction sequence.
FIG. 3.14. The inward and outward action of the Aalborg architecture between x and its
inward neighbor w. Here $\psi_x$ and $\psi_w$ are the tables on x and w, respectively, just before x
computes its message to w, and $\psi_x$ and $\psi'_w$ are the tables just before w computes a message
to send back. The table on w may have changed one or more times as a result of messages
from other outward neighbors and its own inward neighbor.
Since we are more interested in this marginal than in the Shafer-Shenoy message,
we store it in the separator after we forward its quotient by the old message.
The action of the separator on the inward pass seems different from its action
on the outward pass, but Figure 3.15 shows how to describe it in a way that
makes it similar. Instead of beginning with the separator empty, we begin with
it containing $1_{w \cap x}$, a table of ones. Since In is the same as In/$1_{w \cap x}$, we can
say that here too the separator is sending forward a quotient rather than merely
sending forward the message it receives. Thus we have the uniform action shown
in Figure 3.16; the separator always stores New but sends forward New/Old.
In summary, the Aalborg architecture uses a rooted tree with a separator
between each pair of nodes. Initially, each node x has a table $\varphi_x$, and each
separator has a table of ones. The propagation follows these rules:
Rule 1. Each nonroot node waits to send its message to a given
neighbor until it has received messages from all its other neighbors.
Rule 2. The root waits to send messages to its neighbors until it has
received messages from them all.
Rule 3. When a node is ready to send its message to a particular
neighbor, it computes the message by marginalizing its current table
to its intersection with this neighbor, and then it sends the message
to the separator between it and the neighbor.
Rule 4. When a separator receives a message New from one of its two
nodes, it divides the message by its current table Old, sends the quo-
tient New/Old on to the other node, and then replaces Old with New.
Rule 5. When a node receives a message, it replaces its current table
with the product of that table and the message.
FIG. 3.15. If we suppose that the separator begins with a table of ones, then the inward
action is the same as the outward.
FIG. 3.16. The uniform action of the Aalborg architecture: When u sends New to its
neighbor v, the message is intercepted by the separator, which divides it by Old and passes the
quotient on.
Rules 1 and 2 force the propagation to move in to the root and then back out.
At the end of the propagation, the tables on all the nodes and separators are
marginals of $\varphi$, where $\varphi = \prod_x \varphi_x$.
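The five rules can be sketched as follows, again as an illustration under assumptions rather than as the Aalborg group's implementation: binary variables, tables as (variables, dictionary) pairs, helper names of our own, and made-up tables on our reading of the join tree of Figure 3.2. One table is given zeros on purpose, and the quotient is taken to be zero wherever Old is zero; this is one admissible convention for nonnegative tables.

```python
from itertools import product

FRAME = (0, 1)  # assumption: every variable is binary

def table(variables, fn):
    vs = tuple(sorted(variables))
    return vs, {vals: fn(dict(zip(vs, vals)))
                for vals in product(FRAME, repeat=len(vs))}

def multiply(t1, t2):
    (v1, f1), (v2, f2) = t1, t2
    return table(set(v1) | set(v2),
                 lambda a: f1[tuple(a[v] for v in v1)] * f2[tuple(a[v] for v in v2)])

def marginalize(t, keep):
    vs, f = t
    out = {}
    for vals, p in f.items():
        key = tuple(x for v, x in zip(vs, vals) if v in keep)
        out[key] = out.get(key, 0.0) + p
    return tuple(v for v in vs if v in keep), out

def divide(new, old):
    """New/Old on a separator; the quotient is set to zero where Old is zero."""
    vs, f = new
    _, g = old
    return vs, {k: (f[k] / g[k] if g[k] != 0 else 0.0) for k in f}

def aalborg(tables, edges, root):
    tables = dict(tables)
    neighbors = {n: set() for n in tables}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seps = {frozenset((a, b)): table(a & b, lambda _: 1.0) for a, b in edges}
    # Order the nodes outward from the root, recording each inward neighbor.
    order, parent, stack = [], {root: None}, [root]
    while stack:
        n = stack.pop()
        order.append(n)
        for m in neighbors[n]:
            if m not in parent:
                parent[m] = n
                stack.append(m)

    def send(u, v):
        key = frozenset((u, v))
        new = marginalize(tables[u], u & v)           # Rule 3
        quotient = divide(new, seps[key])             # Rule 4: forward New/Old
        seps[key] = new                               # Rule 4: store New
        tables[v] = multiply(tables[v], quotient)     # Rule 5

    for n in reversed(order[1:]):   # inward pass (Rule 1)
        send(n, parent[n])
    for n in order[1:]:             # outward pass, begun by the root (Rule 2)
        send(parent[n], n)
    return tables, seps

nodes = [frozenset(s) for s in
         ({2, 3}, {5, 7}, {1, 2, 4, 5}, {1, 4, 6}, {1, 2, 4}, {1, 2}, {1})]
edges = [(nodes[1], nodes[2]), (nodes[2], nodes[4]), (nodes[3], nodes[4]),
         (nodes[4], nodes[5]), (nodes[0], nodes[5]), (nodes[5], nodes[6])]

def value(k, node, a):
    # Hypothetical entries; the table on 23 vanishes whenever variable 2 equals 1.
    if node == frozenset({2, 3}) and a[2] == 1:
        return 0.0
    return 1.0 + (k + sum(v * x for v, x in a.items())) % 3

tables = {n: table(n, lambda a, k=k, n=n: value(k, n, a)) for k, n in enumerate(nodes)}
final_tables, final_seps = aalborg(tables, edges, nodes[6])

# Brute-force product of the initial tables, used only for checking.
full = table((), lambda a: 1.0)
for t in tables.values():
    full = multiply(full, t)
```

In this example the separator between 23 and 12 holds a zero when 12 sends its message back, so the zero convention in `divide` is actually exercised.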
Dealing with zeros. We have again been making the simplifying assumption
that there are no negative or zero values in the $\varphi_x$, so that division is always
possible. Now let us relax this to the assumption that there are no negative
values, which is sufficient for continuers to exist.
When zeros are not allowed in the table Old, the quotient New/Old is the
unique solution $\psi$ of the equation $\text{Old} \cdot \psi = \text{New}$. As it turns out, this equation
can still be solved when we allow zeros; the solution is not unique, but it does
not matter what solution we use. So there are two ways we can proceed. We
can stop talking about division; we can talk instead about solving the equation
$\text{Old} \cdot \psi = \text{New}$. Or we can extend the definition of division by picking out a
particular solution of the equation $\text{Old} \cdot \psi = \text{New}$ and calling it the quotient
New/Old.
We will explore both approaches. First, let us see what happens when we
drop talk about division. Since division appears only in Rule 4, all we need to
do is replace that rule with the following rule:
for $\psi$ and sends $\psi$ on to its other node. It then discards Old and
stores New in its place.
As the following proposition shows, this works; it is always possible to solve
equation (3.9), and doing so produces the result we want.
PROPOSITION 3.6. If there are no negative values in the initial tables on the
nodes, then propagation under Rules 1, 2, 3, 4', and 5 will result in each node and
separator containing its marginal of $\varphi$.
Proof. Since the propagation proceeds inward just as in the elementary archi-
tecture, the root will have its marginal at the end of the inward pass. So we can
prove the proposition by induction on the outward pass. Suppose propagation
to w on the outward pass has resulted in the table $\varphi^{\downarrow w}$ on w, and let us show
that the next step will produce $\varphi^{\downarrow x}$ on w's outward neighbor x. On the inward
pass, x had sent in $m_{x \to w}$, and w now sends back $\varphi^{\downarrow x \cap w}$, or $m_{x \to w} m_{w \to x}$. So
equation (3.9) can be rewritten as
or
Equation (3.11) obviously has a solution, but it may have more than one. We
need to show that any solution will produce the marginal on x when it multiplies
the table now on x. To this end, let $Q_{x \cap w \to x}$ be a Lauritzen-Spiegelhalter contin-
uer for x. The current table on x is $Q_{x \cap w \to x} m_{x \to w}$, so the result of multiplying
it by any solution of equation (3.10) is
In the case at hand, we want to divide one table by another of the same
size, but with an eye to further developments, let us consider a more general
situation, where we want to divide one table by another of the same or possibly
smaller size. Say we want to divide a table B on y by a table A on x, where
$x \subseteq y$. We will show how to do so under the assumption that whenever an entry
in A is zero, everything in the corresponding row in B is zero, i.e.,
or, equivalently,
We will say that A supports B when this condition is met. Given a table A on
x that supports a table B on y, we define a table B/A on y by
Here we have set the value of the quotient equal to zero when the value of
the denominator is zero. Any other number would do just as well for our immediate
purpose, but zero will prove convenient later.
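The extended division can be sketched directly from this description. The representation of a table as a pair (variables, dictionary), the function names, and the small tables A and B are assumptions made for illustration.

```python
def project(config, y_vars, x_vars):
    """Restrict a configuration of y_vars to the variables in x_vars."""
    lookup = dict(zip(y_vars, config))
    return tuple(lookup[v] for v in x_vars)

def supports(A, B):
    """True if every zero entry of A forces zeros in the corresponding row of B."""
    (x_vars, a), (y_vars, b) = A, B
    return all(val == 0 or a[project(c, y_vars, x_vars)] != 0
               for c, val in b.items())

def divide(B, A):
    """B/A on y, with the quotient set to zero where the denominator is zero."""
    if not supports(A, B):
        raise ValueError("A does not support B")
    (x_vars, a), (y_vars, b) = A, B
    out = {}
    for c, val in b.items():
        d = a[project(c, y_vars, x_vars)]
        out[c] = val / d if d != 0 else 0.0
    return y_vars, out

# Hypothetical tables: A on x = {u}, B on y = {u, v}.
A = (("u",), {(0,): 2.0, (1,): 0.0})
B = (("u", "v"), {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): 0.0})
quotient = divide(B, A)
print(quotient[1])
```

Multiplying the quotient back by A recovers B entry by entry in this example, including in the row where A is zero.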
This extended definition of division immediately yields the following lemma.
LEMMA 3.8. If $A$ supports $B$, then

$$A \cdot (B/A) = B.$$
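The extended division can be tried out on small tables. The following is a minimal sketch, not taken from the text: tables are represented as Python dictionaries keyed by configurations, and the names `supports`, `divide`, and `proj` are hypothetical helpers introduced only for this illustration. It checks the support condition, forms the quotient with the zero convention described above, and confirms that multiplying the quotient by the denominator recovers the numerator.

```python
def supports(A, B, proj):
    """True if every zero entry of A forces zeros in the matching entries of B."""
    return all(B[d] == 0 for d in B if A[proj(d)] == 0)

def divide(B, A, proj):
    """Extended quotient B/A on y: zero wherever the denominator is zero."""
    if not supports(A, B, proj):
        raise ValueError("A does not support B")
    return {d: (B[d] / A[proj(d)] if A[proj(d)] != 0 else 0.0) for d in B}

def proj(d):
    """Project a configuration (x_val, y_val) of y = {X, Y} onto x = {X}."""
    return (d[0],)

# Hypothetical tables: A on x = {X}, B on y = {X, Y}.
A = {(0,): 0.0, (1,): 0.5}
B = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.25, (1, 1): 0.75}

Q = divide(B, A, proj)
# Multiply the quotient back by A, entry by entry.
recovered = {d: A[proj(d)] * Q[d] for d in B}
```

Here `recovered` coincides with `B`, including in the row where `A` is zero, because the support condition guarantees that `B` is zero there as well.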
The Aalborg formula. Let us return, for just a moment, to the assumption
that our tables never have zero entries. Write $N$ for the set of nodes, $S$ for the
set of separators, $T_x$ for the current table on the node $x$, and $U_s$ for the current
table on the separator $s$. At the beginning of the propagation, $T_x = \varphi_x$, $U_s = 1_s$,
and hence

$$\varphi = \frac{\prod_{x\in N} T_x}{\prod_{s\in S} U_s}.$$
At each step, we change the table on one node and on one separator. The table
on the node is multiplied by New/Old, and the table on the separator is changed
from Old to New; i.e., it also is multiplied by New/Old. Since the table on the
node is multiplied by the same factor as the table on the separator, the ratio

$$\frac{\prod_{x\in N} T_x}{\prod_{s\in S} U_s} \tag{3.16}$$

does not change; it remains equal to $\varphi$.
This is the Aalborg formula. In words, the function whose marginals we want is
always the ratio of the product of the tables on the nodes to the product of the
tables on the separators.
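The invariance of this ratio can be checked numerically. The sketch below is an illustration under stated assumptions, not the book's implementation: it uses a hypothetical join tree with two nodes, {A, B} and {B, C}, joined by one separator {B}, with arbitrary positive tables, and a hypothetical helper `step` that applies the New/Old update. Exact fractions are used so that equality can be tested without rounding.

```python
from fractions import Fraction as F
from itertools import product

vals = (0, 1)  # every variable is binary in this toy example

# Hypothetical positive initial tables on the nodes {A,B} and {B,C}.
T1 = {k: F(n) for k, n in zip(product(vals, vals), (1, 2, 3, 4))}  # keys (a, b)
T2 = {k: F(n) for k, n in zip(product(vals, vals), (5, 1, 2, 6))}  # keys (b, c)
U = {b: F(1) for b in vals}  # the separator {B} starts as the table of ones

def ratio():
    """Product of the node tables divided by the separator table."""
    return {(a, b, c): T1[a, b] * T2[b, c] / U[b]
            for a, b, c in product(vals, repeat=3)}

phi = ratio()  # the function whose marginals we want

def step(source, target, src_axis, tgt_axis):
    """Marginalize source to B (New), multiply target by New/Old, store New."""
    new = {b: sum(v for k, v in source.items() if k[src_axis] == b) for b in vals}
    for k in target:
        target[k] *= new[k[tgt_axis]] / U[k[tgt_axis]]
    U.update(new)

step(T1, T2, 1, 0)      # inward pass: {A,B} sends to the root {B,C}
after_inward = ratio()
step(T2, T1, 0, 1)      # outward pass: the root sends back to {A,B}
after_outward = ratio()
```

In this example `after_inward` and `after_outward` both equal `phi`, and at the end `T1`, `T2`, and `U` hold the marginals of `phi` on {A, B}, {B, C}, and {B}.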
The Aalborg formula still holds even if zero entries are allowed in our tables,
but the reasoning with which we established it holds only if we plug a couple of
holes.
First, we must check that $\prod_{s\in S} U_s$ always supports $\prod_{x\in N} T_x$, so that the
ratio (3.16) is defined. To check this, we write $x(s)$ for the outward neighbor
of the separator $s$. Since $U_s$, if it is not equal to $1_s$, is a marginal of $T_{x(s)}$, $U_s$
supports $T_{x(s)}$ (statement 4 of Lemma 3.9). Hence $\prod_{s\in S} U_s$ supports $\prod_{s\in S} T_{x(s)}$
(statement 3) and also $T_r \prod_{s\in S} T_{x(s)}$ (statement 2), which is equal to $\prod_{x\in N} T_x$.
Second, we must check that multiplying the top and bottom of the
ratio (3.16) by New/Old will not change it. This follows from statements 6 and 8
of Lemma 3.9, together with the fact that New/Old supports the numerator. We
know that New/Old supports the numerator because New is a marginal of one
of its factors, and by statement 1 of Lemma 3.9, New/Old supports whatever
New supports.
There is one point of notation that should be clarified in connection with the
Aalborg formula. For simplicity, we have been using a notation that identifies
each node $x$ with a set of variables. We could also identify each separator with a
set of variables; we could say that the separator $s$ between the nodes $u$ and $v$ is
equal to $u \cap v$. It is better, however, to assume that the names of the separators
are distinct from the sets of variables involved, for two or more separators might
involve the same set of variables. (We might have one pair of neighboring nodes
$u_1$ and $v_1$ and another pair $u_2$ and $v_2$ with $u_1 \cap v_1 = u_2 \cap v_2$.) It would burden
our notation unnecessarily for us to introduce distinct symbols for the separator
and its set of variables, but the distinction should be kept in mind, even when,
as will happen shortly, we write as if they are the same.
where $x(s)$ is the outward neighbor of the separator $s$. This suggests that we
compare propagation with $U_s$ on $s$ and $T_x$ on $x$ to propagation with $1_s$ on $s$, $T_r$
on $r$, and $T_{x(s)}/U_s$ on $x(s)$. Call the former the loaded propagation (because the
separators are loaded at the beginning) and the latter the adjusted propagation
(because the tables on the nodes are adjusted). We know that the adjusted
propagation results in the marginals of $\varphi$ on all the nodes and separators; let us
show that the loaded propagation gives the same results.
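The comparison can be illustrated on the smallest possible case. The following sketch rests on assumptions that are not in the text: a tree with a single non-root node x on {A, B}, a root r on {B, C}, one separator s on {B}, hypothetical positive tables (so ordinary division suffices), and a hypothetical function `propagate` that performs one inward and one outward pass with the New/Old update.

```python
from fractions import Fraction as F

vals = (0, 1)  # every variable is binary in this toy example

def propagate(Tx, Tr, U):
    """One inward and one outward pass on the tree x -- s -- r."""
    # Inward: x sends its marginal on B; r is multiplied by New/Old.
    new = {b: sum(Tx[a, b] for a in vals) for b in vals}
    Tr = {(b, c): v * new[b] / U[b] for (b, c), v in Tr.items()}
    U = new
    # Outward: r sends its marginal on B; x is multiplied by New/Old.
    new = {b: sum(Tr[b, c] for c in vals) for b in vals}
    Tx = {(a, b): v * new[b] / U[b] for (a, b), v in Tx.items()}
    return Tx, Tr, new

# Hypothetical initial tables for the loaded propagation.
Tx = {(0, 0): F(1), (0, 1): F(2), (1, 0): F(3), (1, 1): F(4)}  # keys (a, b)
Tr = {(0, 0): F(5), (0, 1): F(1), (1, 0): F(2), (1, 1): F(6)}  # keys (b, c)
Us = {0: F(2), 1: F(3)}                                        # loaded separator
ones = {b: F(1) for b in vals}

loaded = propagate(Tx, Tr, Us)
# Adjusted propagation: separator of ones, with x's table divided by Us.
Tx_adjusted = {(a, b): v / Us[b] for (a, b), v in Tx.items()}
adjusted = propagate(Tx_adjusted, Tr, ones)
```

With these inputs `loaded` and `adjusted` are identical triples of tables, which is the agreement the argument in the text establishes in general.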
For the moment, we reserve $T_x$ and $U_s$ for the initial tables in the loaded
propagation; we write $T_x^{\mathrm{loaded}}$ and $U_s^{\mathrm{loaded}}$ for the current tables in the loaded
propagation and $T_x^{\mathrm{adjusted}}$ and $U_s^{\mathrm{adjusted}}$ for the current tables in the adjusted
propagation. Initially,
and
These equations will hold throughout the inward pass, for if they hold before an
inward step, they hold after it. To see this, write $M_{x(s)\to s}$ for the message from
the inward loaded message from x(s) is multiplied by Us in comparison with the
inward adjusted message. Since this is the new table for s, equation (3.20) will
still hold. But the loaded propagation divides Us out before sending the message
on to the neighbor w; hence the message multiplied into w is the same in the two
propagations, and the relation between $T^{\mathrm{loaded}}$ and $T^{\mathrm{adjusted}}$ (equation (3.19) or
(3.21)) will also be unaffected.
Since the root has the same table at the end of the inward pass in the two
propagations, it sends the same messages back out. So we can complete the
proof by induction on the outward pass. We need only show that if the message
from $w$ back out to $s$ is the same in the two propagations, then the table on $x(s)$
will end up the same. But if we write $M_{w\to s}$ for the message from $w$ back to $s$,
then the table we get on $x(s)$ in the loaded propagation is
FIG. 3.17. After COLLECT is called outward from the root, messages move inward.
FIG. 3.18. As DISTRIBUTE is called outward from the root, messages move outward.
Exercises.
EXERCISE 3.1. How great is the computational advantage of the Lauritzen-
Spiegelhalter architecture over the Shafer-Shenoy architecture? For a first pass
at answering this question, you may wish to assume that each nonleaf in the join
tree has the same number of neighbors (the tree's "branching factor"), that each
variable has the same number of elements in its frame, and that each node has
the same number of variables in common with its branch as well as the same
number of new variables.
EXERCISE 3.2. Compare the three architectures on the basis of their storage
requirements. Consider the case where we need to keep the initial inputs and the
case where we do not.
EXERCISE 3.3. Show how to use join-tree computation to find $P^{\downarrow w}(x)$ for
any set $w$ of variables and any single configuration $x$ of $w$, even if $w$ is too large
to be contained in any node of the join tree. (Hint: Pretend $x$ is observed, and
exploit the fact that $P^{\downarrow w}(x)$ is the inverse of the normalizing constant for the
posterior probabilities.)
EXERCISE 3.4. Discuss ways of measuring the amount of computation re-
quired by a join tree. (In the introduction to Chapter 3, two measures were
suggested: the sum of the sizes of the frames, and the size of the largest frame.)
Discuss the issue separately for probability propagation and for each of the prob-
lems listed in Exercise 1.2.
EXERCISE 3.5. Verify that the elementary and Shafer-Shenoy architectures
always work in the abstract framework you formulated in Exercise 1.5.
EXERCISE 3.6. Explore the analogy between the outward pass of the
Lauritzen-Spiegelhalter architecture and the outward pass in recursive dynamic
programming, in which solutions of reduced problems are used to build up an
overall solution (Mitten [40], Bertele and Brioschi [1], Shenoy [46]). Formulate
an abstract theory that includes both examples as special cases.
4.1. Meetings.
The annual Conference on Uncertainty in Artificial Intelligence (UAI) plays a
leading role in the development of probabilistic, belief-function, fuzzy, and qual-
itative expert systems. Papers given in its first six years (1985-1990) were col-
lected and published by North-Holland in a series entitled Uncertainty in Arti-
ficial Intelligence. Proceedings of subsequent meetings have been published by
Morgan Kaufmann. The Association for Uncertainty in Artificial Intelligence,
the sponsor of the conference, has a site on the World-Wide Web:
http://www.auai.org/
This site gives instructions for subscribing to the association's electronic mailing
list and includes links to many other sources of information about the manage-
ment of uncertainty in expert systems.
The biennial International Workshop on Artificial Intelligence and Statistics
is also devoted in part to uncertainty in expert systems. The Web site for its
sponsor, the Society for Artificial Intelligence and Statistics, is
http://www.vuse.vanderbilt.edu/~dfisher/ai-stats/society.html
This site is maintained by Douglas H. Fisher at Vanderbilt University.
Another important conference for this community is the International Confer-
ence on Information Processing and Management of Uncertainty in Knowledge-
Based Systems (IPMU), which has been held biennially since 1986. The
proceedings of the most recent conference, held in Paris in 1994, were published
by Springer-Verlag in 1995 under the title Advances in Intelligent Computing,
edited by Bernadette Bouchon-Meunier, Ronald R. Yager, and Lotfi A. Zadeh.
4.2. Software.
A number of software packages for probabilistic expert systems are available.
The most highly developed is the commercial product HUGIN. Developed at
Aalborg, Denmark, it uses the Aalborg architecture described in Chapter 3.
Information on HUGIN can be obtained at:
http://www.hugin.dk
http://iridia.ulb.ac.be/pulcinella/
Further information on these and other packages, some commercial and some
free, is available at a Web site maintained by Russell Almond:
http://bayes.stat.washington.edu/almond/belief.html
4.3. Books.
There are now many excellent books on probabilistic expert systems and related
topics.
[1] Bertele, Umberto, and Francesco Brioschi (1972). Nonserial Dynamic
Programming. Academic Press. New York. A readable treat-
ment of join-tree computation for decomposable dynamic program-
ming problems.
[2] Diestel, R. (1990). Graph Decompositions. Clarendon Press. Ox-
ford. A general perspective on decompositions of the type exemplified
by join trees, with hints at the diversity of the applied problems that
inspire these decompositions.
[3] Jensen, Finn V. (1996). An Introduction to Bayesian Networks.
University College Press. London. An engaging and readable intro-
duction to probabilistic networks, with an emphasis on construction
and computation within the Aalborg architecture.
[4] Judd, J. Stephen (1990). Neural Network Design and the Com-
plexity of Learning. MIT Press. Cambridge. This interesting and
readable book demonstrates the relevance of join-tree ideas to the
problem of learning in neural networks.
[5] Lauritzen, Steffen L. (1996). Graphical Models. Oxford Univer-
sity Press. London. A superb treatment of probabilistic networks as
models for data, this book marries probabilistic expert systems with
up-to-date statistical methodology. Relatively comprehensive, it cov-
ers undirected as well as directed graphs, and continuous (normal)
as well as discrete probability distributions. Its greatest originality
lies in its treatment of mixed cases: chain graphs, which combine
directed and undirected graphs, and models with both discrete and
continuous variables.
[6] Neapolitan, E. (1990). Probabilistic Reasoning in Expert Systems.
John Wiley. New York. This readable book covered the state of the
[13] Besag, Julian, Peter Green, David Higdon, and Kerrie Mengersen
(1995). Bayesian computation and stochastic systems (with
[22] Beeri, Catriel, Ronald Fagin, David Maier, and Mihalis Yan-
nakakis (1983). On the desirability of acyclic database schemes.
Journal of the Association for Computing Machinery. 30, pp. 479-
513. This very widely cited paper first introduced the idea of a join
tree into the literature on relational databases. It is also responsible
for the name "join tree."
[23] Cano, Jose, Miguel Delgado, and Serafin Moral (1993). An ax-
iomatic framework for propagating uncertainty in directed acyclic
networks. International Journal of Approximate Reasoning. 8, pp.
253-280. This article extends the axioms for join-tree computation,
discussed in Chapter 1 and in Shenoy and Shafer [48], to computa-
tion within directed acyclic graphs, in the style developed in Pearl's
Probabilistic Reasoning in Intelligent Systems [8].
[24] Cooper, Gregory F., and Edward Herskovits (1992). A Bayesian
method for the induction of probabilistic networks from data. Ma-
chine Learning. 9, pp. 309-347. An influential exposition of a
straightforward Bayesian approach to choosing and parametrizing a
DAG from data for a given set of variables. The method developed
in this article can be contrasted with the non-Bayesian methods
developed in Spirtes, Glymour, and Scheines's Causation, Prediction,
and Search [11].
[25] Cowell, Robert G., and A. Philip Dawid (1992). Fast retraction
of evidence in a probabilistic expert system. Statistics and Com-
puting. 2, pp. 37-40. Using out-marginalization (see Exercise 1.4),
this article gives a quick join-tree algorithm for adjusting marginal
probabilities to allow for the omission of previously included obser-
vations. The algorithm allows efficient computation of statistics for
monitoring the performance of a belief net.
[26] Cox, David R., and Nanny Wermuth (1993). Linear dependencies
represented by chain graphs (with discussion). Statistical Science. 8,
pp. 204-283. Taking DAGs and chain graphs as a starting point,
this article discusses a wide variety of graphical representations of
multivariate probability distributions.
[27] Dawid, A. Philip (1980). Conditional independence for statistical
operations. Annals of Statistics. 8, pp. 598-617. This pioneering
article studies general properties of conditional independence that
were later studied as axioms by Judea Pearl.