
Generation of Random Bayesian Networks with Constraints on Induced Width, with Application to the Average Analysis of d-Connectivity, Quasi-random Sampling, and Loopy Propagation

Jaime S. Ide, Fabio G. Cozman, Fábio T. Ramos
Escola Politécnica, Universidade de São Paulo
Departamento de Engenharia Mecatrônica, Decision Making Lab
Av. Prof. Mello Moraes, 2231 - CEP 05508900, São Paulo, SP - Brazil
jaime.ide@poli.usp.br, fgcozman@usp.br, fabioram@usp.br

Abstract. We present algorithms for the generation of uniformly distributed Bayesian networks
with constraints on induced width. The algorithms use ergodic Markov chains to generate sam-
ples, building upon previous algorithms by the authors. The introduction of constraints on
induced width leads to more realistic results but requires new techniques. We discuss three ap-
plications of randomly generated networks: we study the average number of nodes d-connected
to a query, the effectiveness of quasi-random samples in approximate inference, and the conver-
gence of loopy propagation for non-extreme distributions.

1. INTRODUCTION
Certain questions involving Bayesian networks are quite difficult to answer exactly. For example, what is
the average number of variables that are d-connected to each other in a Bayesian network? Or, how do
quasi-random importance sampling and quasi-random Gibbs sampling compare? Significant insight into
such theoretical and empirical questions could be obtained by analyzing large samples of Bayesian networks.
However, it does not seem likely that hundreds of “real” Bayesian networks will be freely distributed in the
near future. We are left with an obvious solution: we must be able to randomly generate Bayesian networks
for our experiments and comparisons. Many researchers have used ad hoc random processes to generate
networks in the past (for example, [19, 22]); however we would like to produce networks that are uniformly
distributed in some suitable space. The most challenging aspect of Bayesian network generation is exactly
to define this space; that is, to define the properties that characterize “realistic” Bayesian networks. Our
first attempt at generating Bayesian networks was to control the maximum node degree and the maximum
number of edges, aiming to generate relatively sparse graphs [7]. We found that such a strategy is reasonable
but not perfect. Restrictions solely on node degree and number of edges lead to “overly random” edges
— real networks often have their variables distributed in groups, with few edges between groups.1 We
then considered a number of other distinguishing properties of “real” networks. For example, one option,
suggested by T. Kocka, would be to produce graphs with a large number of equivalent graphs (a property
observed in real networks). However we wanted to use properties with clear intuitive meaning — so that
users of our algorithms would quickly grasp the properties of generated networks.
We eventually found that the most appropriate quantity to control is the induced width of net-
works.2 The induced width of a network can be easily explained and understood; it conveys the algorithmic
complexity of inferences and, indirectly, it captures how dense the network is. Our tests indicate (rather
subjectively) that networks with low induced width “look like” real networks in the literature. Besides, it
makes sense to control induced width, as we are usually interested in comparing algorithms or parameteriz-
ing results with respect to the complexity of the underlying network, and induced width is the main indicator
of such complexity. Unfortunately, the generation of random graphs with constraints on induced width is
significantly more involved than the generation of graphs with constraints on node degree and number of
1 Tomas Kocka brought this fact to our attention.
2 Carlos Brito suggested the idea.
Figure 1: (a) Network with nodes B, F, L, D, H; (b) moral graph; (c) induced graph for ordering F, L, D, H, B; (d) induced graph for ordering L, H, D, B, F. Dashed lines represent induced connections.

edges. In this paper we report on new algorithms that accomplish generation of graphs with simultaneous
constraints on all these quantities: induced width, node degree, and number of edges.
Following our previous work [7], we divide the generation of random Bayesian networks into
two steps. First we generate a random directed acyclic graph that satisfies constraints on induced width,
node degree, and number of edges; then we generate probability distributions for the graph. To generate
the random graph, we construct ergodic Markov chains with appropriate stationary distributions, so that
successive sampling from the chains leads to the generation of properly distributed networks. The necessary
theory is presented in Section 2. Our algorithms are described in Section 3. A freely distributed program for Bayesian network generation is presented in Section 4.
To illustrate how random Bayesian networks can be used, we describe three applications of obvious
practical significance in Section 5. First, we study the average number of d-connected nodes in Bayesian
networks. Second, we analyze the performance of quasi-random samples (as opposed to pseudo-random
samples) in two inference algorithms: importance sampling and Gibbs sampling. Third, we investigate the
convergence of loopy propagation on networks that have non-extreme distributions (that is, most probability
values are far from zero and one).

2. BASIC CONCEPTS
This section summarizes material from [7] and [5].
A directed graph is composed of a set of nodes and a set of edges. An edge (u, v) goes from a
node u (the parent) to a node v (the child). A path is a sequence of nodes such that each pair of consecutive
nodes is adjacent. A path is a cycle if it contains more than two nodes and the first and last nodes are the
same. A cycle is directed if it can be traversed while following all arcs in a consistent direction.
A directed graph is acyclic (a DAG) if it contains no directed cycles. A graph is connected if there exists a
path between every pair of nodes. A graph is singly-connected, also called a polytree, if there exists exactly
one path between every pair of nodes; otherwise, the graph is multiply-connected (or multi-connected for
short). An extreme sub-graph of a polytree is a sub-graph that is connected to the remainder of the polytree
by a single path. In an undirected graph, the direction of the edges is ignored. An ordered graph is a pair
containing an undirected graph and an ordering of nodes. The width of a node in an ordered graph is the
number of its neighbors that precede it in the ordering. The width of an ordering is the maximum width
over all nodes. The induced width of an ordered graph is the width of the ordered graph obtained as follows:
nodes are processed from last to first; when node X is processed, all of its preceding neighbors are connected
to each other (call these connections induced connections and the resulting graph the induced graph). An
example is presented in Figure 1. The induced width of a graph is the minimum induced width over all
orderings; the computation
of induced width is an NP-hard problem [5], and computations are usually based on heuristics [9].
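The definition above is procedural, so it translates directly into code. The following sketch (our illustration, not part of the algorithms discussed later) computes the induced width of an undirected graph for a given ordering; the NP-hard part is minimizing this value over all orderings. The example graph is hypothetical.

from itertools import combinations

def induced_width(adj, ordering):
    """adj: dict mapping each node to its set of neighbors (undirected);
    ordering: list of all nodes."""
    adj = {v: set(ns) for v, ns in adj.items()}   # copy; induced edges are added
    pos = {v: i for i, v in enumerate(ordering)}
    width = 0
    for v in reversed(ordering):
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(earlier))          # width of node v
        for a, b in combinations(earlier, 2):     # induced connections
            adj[a].add(b)
            adj[b].add(a)
    return width

# Hypothetical example: a 4-cycle has induced width 2 under this ordering.
cycle = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c', 'a'}}
print(induced_width(cycle, ['a', 'b', 'c', 'd']))   # prints 2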
A Bayesian network represents a joint probability density over a set of variables X [8]. The density
is specified through a directed acyclic graph; every node in the graph is associated with a variable Xi in X,
and with a conditional probability density p(Xi | pa(Xi)), where pa(Xi) denotes the parents of Xi in the
graph. A Bayesian network represents a unique joint probability density [15]: p(X) = ∏i p(Xi | pa(Xi))
(a consequence of the Markov condition). The moral graph of a Bayesian network is obtained by connecting
parents of any variable and ignoring direction of edges. The induced width of a Bayesian network is the
induced width of its moral graph. An inference is a computation of a posterior probability density for
a query variable given observed variables; the complexity of inferences is directly related to the induced
width of the underlying Bayesian network [5].
We use Markov chains to generate random graphs, following [12]. Consider a Markov chain
{Xt, t ≥ 0} over a finite state space S, and let P = (pij) be the M × M matrix of transition probabilities,
where M is the number of states and pij = Pr(Xt+1 = j | Xt = i) for all t [16, 18]. The s-step transition
probabilities are given by P^s = (pij^(s)), with pij^(s) = Pr(Xt+s = j | Xt = i), independent of t. A
Markov chain is irreducible if for every pair i, j there exists s such that pij^(s) > 0; equivalently, a chain is
irreducible if and only if all pairs of states intercommunicate. A Markov chain is positive recurrent if every
state i ∈ S is returned to in a finite expected number of steps; a finite irreducible chain is positive recurrent.
A Markov chain is aperiodic if, for every state i, the greatest common divisor of all s for which pii^(s) > 0
is equal to one (that is, g.c.d.{s : pii^(s) > 0} = 1); aperiodicity is ensured if pii > 0 (pii is a self-loop
probability). A Markov chain is ergodic if there exists a vector π (the stationary distribution) satisfying
lim_{s→∞} pij^(s) = πj for all i and j; a finite aperiodic, irreducible and positive recurrent chain is ergodic.
A transition matrix is doubly stochastic if its rows and columns each sum to one (that is, Σj pij = 1 and
Σi pij = 1). A Markov chain with a doubly stochastic transition matrix has a uniform stationary distribution
[16].
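The last property is easy to check numerically. The following sketch (illustration only; the matrix is an arbitrary example) raises a small doubly stochastic transition matrix to a high power and shows every row approaching the uniform distribution (1/3, 1/3, 1/3).

import numpy as np

# Rows and columns of P each sum to one (doubly stochastic).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
assert np.allclose(P.sum(axis=0), 1.0) and np.allclose(P.sum(axis=1), 1.0)

Ps = np.linalg.matrix_power(P, 50)   # s-step transition matrix P^s
print(Ps)                            # every row is close to [1/3, 1/3, 1/3]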

3. GENERATING RANDOM DAGS


In this section we show how to generate random DAGs with constraints on induced width, node degree
and number of edges. After such a random DAG is generated, it is easy to construct a complete Bayesian
network by randomly generating associated probability distributions — if all variables in the Bayesian net-
work are categorical, probability distributions are produced by sampling Dirichlet distributions (for details
see [7]).
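For concreteness, here is a minimal sketch of this second step for categorical variables: one Dirichlet draw per configuration of the parents. The symmetric parameter alpha and the function name are our assumptions here; [7] describes the exact setup.

import numpy as np

def random_cpt(k, parent_cards, alpha=1.0, rng=None):
    """k: number of values of the variable; parent_cards: cardinalities of
    its parents. Returns a (configurations x k) table whose rows sum to one."""
    rng = rng or np.random.default_rng()
    n_configs = int(np.prod(parent_cards)) if parent_cards else 1
    return rng.dirichlet(np.full(k, alpha), size=n_configs)

cpt = random_cpt(k=3, parent_cards=[2, 2])   # 4 parent configurations, 3 values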
To generate random DAGs with specific constraints, we construct an ergodic Markov chain with
uniform limiting distribution, such that every state of the chain is a DAG satisfying the constraints. By
running the chain for many iterations, eventually we obtain a satisfactory DAG.
Algorithm PMMixed produces an ergodic Markov chain with the required properties (Figure 2).
The algorithm is significantly more complex than the algorithms presented in [7]. The added complexity
comes from the constraints on induced width — however, this price is worth paying, as the resulting DAGs
are much more “realistic” representatives of Bayesian networks.
The algorithm works as follows. We create a set of n nodes (from 0 to n − 1) and a simple
network to start. The loop between lines 03 and 08 constructs the next state (next DAG) from the current
state. Lines 05 and 08 verify whether the induced width of the current DAG satisfies the maximum value
allowed; constraints on maximum node degree and maximum number of edges must also be checked there.
If the current DAG is a polytree, the next DAG is constructed in lines 04 and 05; if the current DAG is
multi-connected, the next DAG is constructed in lines 07 and 08. Depending on the current graph, different
operations are performed (the procedures AorR and AR correspond to the valid operations). Note that the
particular procedure to be performed and the acceptance (or not) of the resulting DAG are probabilistic,
parameterized by p.
Algorithm PMMixed is essentially a mixture of procedures AorR and AR. The first procedure is
used in [7] to produce multi-connected graphs with constraints on node degree; the second procedure is
used in [7] to produce polytrees with constraints on node degree. We need both to guarantee irreducibility
of Markov chains when constraints on induced width are present; the procedure AR creates a needed “path”
in the space of polytrees that is used in Theorem 3. The mixture of procedures has two other benefits: first,
it creates more complex transitions, hopefully speeding up the convergence of the chain; second, it eliminates
a restriction on node degree that was needed in [7].
The PMMixed algorithm can be understood as a sequence of probabilistic transitions that follow
the scheme in Figure 3.
Algorithm PMMixed: Generating DAGs with induced width control
Input: Number of nodes (n), number of iterations (N), maximum induced width, and possibly constraints on node degree and number of edges.
Output: A connected DAG with n nodes.
01. Create a network with n nodes, where every node has exactly one parent, except the first node, which has no parent;
02. Repeat N times:
03. If current graph is a polytree:
04. With probability p, call Procedure AorR; with
probability (1 − p), call Procedure AR.
05. If the resulting graph satisfies imposed
constraints, accept the graph;
otherwise, keep previous graph;
06. else (graph is multi-connected):
07. Call Procedure AorR.
08. If the resulting graph is a polytree and satisfies
imposed constraints, accept with probabil-
ity p; else accept if it satisfies imposed
constraints; otherwise keep previous graph.
09. Return current graph after N iterations.

Procedure AR: Add and Remove


01. Generate uniformly a pair of distinct nodes i, j;
02. If the arc (i, j) exists in the current graph, keep the same state; else
03. Invert the arc with probability 1/2 to (j, i), and then
04. Find the predecessor node k in the path between i and j, remove the arc between k and j, and add
an arc (i, j) or arc (j, i) depending on the result of line 03.

Procedure AorR: Add or Remove


01. Generate uniformly a pair of distinct nodes i, j;
02. If the arc (i, j) exists in the current graph, delete the arc, provided that the underlying graph remains
connected; else
03. Add the arc if the underlying graph remains acyclic, otherwise keep same state.

Figure 2: Algorithm for generating DAGs, mixing operations AR and AorR.
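In outline, the main loop of Figure 2 can be sketched as follows; is_polytree, AorR, AR and satisfies_constraints stand for the procedures above and for the checks on induced width, node degree and number of edges. This is a schematic rendering under those assumptions, not the BNGenerator implementation.

import random

# is_polytree, AorR, AR and satisfies_constraints are placeholders for the
# procedures and constraint checks described above (schematic sketch only).
def pmmixed(graph, N, p, constraints):
    for _ in range(N):
        if is_polytree(graph):
            # lines 03-05: mix the two operations, then test the constraints
            candidate = AorR(graph) if random.random() < p else AR(graph)
            if satisfies_constraints(candidate, constraints):
                graph = candidate
        else:
            # lines 06-08: AorR; a move back into the space of polytrees is
            # accepted only with probability p (keeping transitions symmetric)
            candidate = AorR(graph)
            if satisfies_constraints(candidate, constraints):
                if not is_polytree(candidate) or random.random() < p:
                    graph = candidate
    return graph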

Figure 3: Structure of PMMixed.


Figure 4: Simple trees used in our proofs: (a) simple tree, (b) simple polytree, (c) simple sorted tree.

We now establish ergodicity of Algorithm PMMixed.


Theorem 1 The Markov chain generated by Algorithm PMMixed is aperiodic.
Proof. It is always possible to stay in the same state for procedures AR and AorR; therefore, all states have
a self-loop probability greater than zero. QED
Theorem 2 The transition matrix defined by Algorithm PMMixed is doubly stochastic.
Proof. If transition probabilities between neighboring states are symmetric, then rows and columns sum to
one, because the self-loop probabilities are complementary to all other probabilities. Procedure AorR is
clearly symmetric; procedure AR is also symmetric [7]. We just have to check that transitions between
polytrees and multi-connected graphs are symmetric; this is true because transitions from polytrees to
multi-connected graphs are accepted with probability p, and transitions from multi-connected graphs to
polytrees are accepted with the same probability. QED
We need the following lemma to prove Theorem 3.
Lemma 1 After removal of an arc from a multi-connected DAG, its induced width does not increase.
Proof. When we remove an arc, the moral graph stays the same or contains fewer arcs; by just keeping the
same ordering, the induced width cannot increase. QED
Theorem 3 The Markov chain generated by Algorithm PMMixed is irreducible.
Proof. Suppose that we have a multi-connected DAG with n nodes; if we prove that from this graph we can
reach a simple sorted tree (Figure 4(c)), then the opposite transformation is also possible, because of the
symmetry of our transition matrix — and therefore we could reach any state from any other (during these
transitions, graphs must remain acyclic and connected, and must satisfy the imposed constraints). We start
by finding a loop cutset and removing enough arcs to obtain a polytree from the multi-connected DAG [15];
the induced width does not increase during removal operations, by Lemma 1. From a polytree we can move
to a simple polytree (Figure 4(b)) in a recursive way. For each pair of extreme sub-graphs of the polytree
(call them branches), it is possible to “cut” one branch and append it to the other, by procedure AR, without
ever increasing the induced width; doing this we obtain a single branch. If more than two branches are
connected to a node, we repeat this process by pairs, recursively, until we get a simple polytree. From a
simple polytree we obtain a simple tree (Figure 4(a)) just by inverting arcs so that they all point in the same
direction, never producing an induced width greater than two. The last step is to obtain a simple sorted tree
(Figure 4(c)) from the simple tree; the idea is illustrated in Figure 5. We want to sort the labelled nodes from
1 to n. Start by removing arc (n, k) and adding arc (l, i) (steps 1 to 2). Remove arc (j, n) and add arc
(n − 1, n) (steps 2 to 3); note that in this configuration the induced width is one. Now remove arc (n − 1, o)
and add arc (j, k) (steps 3 to 4). Repeat the moves of steps 2 to 4 for all remaining nodes. So, from any
multi-connected DAG it is possible to reach a simple sorted tree; the opposite path is clearly analogous, so
we can go from any DAG to any other DAG, and the chain is irreducible. Note that constraints on node
degree and maximum number of edges can be handled within the same process. QED
By the previous theorems we obtain:
Theorem 4 The Markov chain generated by Algorithm PMMixed is ergodic, and its unique stationary distribution is uniform.
The algorithm PMMixed can be implemented quite efficiently, except for the computation of induced
width — finding this value is an NP-hard problem with no easy solution. There are heuristics for computing
induced width, some of which have been found to be of high quality [9]. Consequently, we must change our
goal: instead of adopting constraints on exact induced width, we assume that the user specifies
Figure 5: Basic moves to obtain a simple sorted tree.

Procedure J: Sequence of AorR


01. If the current graph is a polytree, repeat
with probability (1 − p − q):
02. Generate uniformly a pair of distinct nodes i, j;
03. If arc (i, j) does not exist in current graph,
add the arc; otherwise, keep the same state.
04. If the current graph is multi-connected, repeat
with probability (1 − p − q):
05. Generate uniformly a pair of distinct nodes i, j.
06. If arc (i, j) exists in current graph, remove the arc;
otherwise, keep the same state.
07. If the new graph satisfies imposed constraints, accept the graph; otherwise, keep previous graph.

Figure 6: Procedure J.

a maximum width given a particular heuristic. We call this width the heuristic width. Our goal then is to
produce random DAGs over the space of DAGs that satisfy constraints on heuristic width.
Apparently we could still use the PMMixed algorithm here, with the obvious change that lines
05 and 08 must check heuristic width instead of induced width. However, such a simple modification is
not sufficient: because heuristic width is usually computed with local operations, we cannot predict the
effect of adding and removing edges on it. That is, we cannot adapt Lemma 1 to heuristic width in general,
and thus we cannot predict whether a “path” between DAGs can in fact be followed by the chain without
violating heuristic width constraints. We must create a mechanism that would allow the chain to transit
between arbitrary DAGs regardless of the adopted heuristic. Our solution is to add a new type of operation,
specified by procedure J (Figure 6) — this procedure allows “jumps” from arbitrary multi-connected DAGs
to polytrees. We also assume that any adopted heuristic is such that, if the DAG is a polytree, then the
heuristic width is equal to the induced width. Even if a given heuristic does not satisfy this property, the
heuristic can be easily modified to do so: test whether the DAG is a polytree and, if so, return the induced
width of the polytree (the maximum number of parents amongst all nodes).
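A sketch of this modification, assuming a representation that maps each node to its set of parents (and recalling that the chain only visits connected graphs):

def heuristic_width(dag, base_heuristic):
    """dag: dict mapping each node to its set of parents; the graph is
    assumed connected, as maintained by the chain."""
    n_edges = sum(len(parents) for parents in dag.values())
    if n_edges == len(dag) - 1:   # a connected DAG with n-1 edges is a polytree
        return max(len(parents) for parents in dag.values())
    return base_heuristic(dag)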
Procedure J must be called with probability (1 − p − q) both after line 04 and after line 07 of
algorithm PMMixed. The complete algorithm can be understood as a sequence of probabilistic transitions
that follow the scheme in Figure 7. All previous theorems can be easily extended to this new situation; the
only one that must be substantially modified is Theorem 3. Transitions from polytrees to multi-connected
DAGs are performed with probability (1 − q); transitions from multi-connected DAGs to polytrees are
performed with probability 1 − (p + q) · q/(p + q) = 1 − q. The values of p and q control the mixing rate of
the chain; we have observed remarkable insensitivity to these values.
Figure 7: Structure of PMMixed with procedure J.

4. BNGENERATOR
The algorithm PMMixed (with the modifications indicated in Figure 7) can be efficiently implemented with
existing ordering heuristics, and the resulting DAGs are quite similar to existing Bayesian networks. We
have implemented the algorithm using an O(n log n) implementation of the minimum weight heuristic. The
result is the BNGenerator package, freely distributed under the GNU license (at http://www.pmr.poli.usp.br/
ltd/Software/BNGenerator).3 An example network generated by BNGenerator is depicted in Figure 8.
BNGenerator accepts specifications of the number of nodes, maximum node degree, maximum
number of edges, and maximum heuristic width (for the minimum weight heuristic; other heuristics can be
added). The software also performs uniformity tests using a χ² test. Such tests can be performed only for a
relatively small number of nodes (as the number of possible DAGs grows extremely quickly [17]), but they
allowed us to test the algorithm and its procedures. We have observed relatively fast mixing of the chain
with the transitions we have designed.
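For illustration, such a uniformity test can be sketched as follows; sample_dag and all_dags are hypothetical stand-ins for a generator returning a hashable canonical encoding of each sampled DAG and for the enumeration of all DAGs satisfying the constraints (for example, there are only 25 labeled DAGs on three nodes).

from collections import Counter
from scipy.stats import chisquare

def uniformity_test(sample_dag, all_dags, n_samples):
    """sample_dag: hypothetical generator returning a hashable canonical
    encoding of a DAG; all_dags: enumeration of every DAG in the space."""
    counts = Counter(sample_dag() for _ in range(n_samples))
    observed = [counts.get(dag, 0) for dag in all_dags]
    expected = [n_samples / len(all_dags)] * len(all_dags)
    return chisquare(observed, expected)   # chi-square statistic and p-value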

5. APPLICATIONS
To show how our previous results can be used, we have selected three problems that have received some
analysis in the literature but no conclusive solution yet. Due to lack of space, we present a brief summary
of rather extensive tests.

5.1. D-CONNECTIVITY
Given a node Xq and a set of nodes XE , the number of nodes d-connected to Xq given XE is a fundamental
quantity in Bayesian networks. This quantity indicates how many variables are “requisite” for an inference;
it also indicates how large a sensitivity analysis must be [2]; it essentially controls the complexity of
the EM algorithm when learning networks [23]; and it can be used to eliminate redundant computations in
learning [3]. Despite the importance of d-connectivity, there seems to be no result in the literature indicating
the relationship between heuristic width (the “complexity” of the network) and the expected number of d-
connected nodes. We illustrate how such a relationship could be investigated with random networks.
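To make the quantity concrete, the following sketch counts the nodes d-connected to a query by traversing active trails, in the spirit of standard “Bayes-Ball” reachability; the representation (a dict mapping each node to its set of parents) is our assumption.

from collections import deque

def d_connected(dag, query, evidence):
    """dag: dict mapping each node to its set of parents; evidence: set of
    observed nodes. Returns the set of nodes d-connected to query."""
    children = {v: set() for v in dag}
    for v, parents in dag.items():
        for p in parents:
            children[p].add(v)
    # ancestors of evidence nodes (needed to activate v-structures)
    anc, stack = set(), list(evidence)
    while stack:
        v = stack.pop()
        for p in dag[v]:
            if p not in anc:
                anc.add(p)
                stack.append(p)
    # traverse states (node, direction); 'up' means we arrived from a child
    reached, visited = set(), set()
    queue = deque([(query, 'up')])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v not in evidence:
            reached.add(v)
        if d == 'up' and v not in evidence:
            for p in dag[v]:
                queue.append((p, 'up'))        # continue upward
            for c in children[v]:
                queue.append((c, 'down'))      # turn at a fork
        elif d == 'down':
            if v not in evidence:
                for c in children[v]:
                    queue.append((c, 'down'))  # pass through a chain
            if v in evidence or v in anc:
                for p in dag[v]:
                    queue.append((p, 'up'))    # activated v-structure
    reached.discard(query)
    return reached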
We took an ensemble of random networks with a medium number of nodes and medium heuristic width,
as these networks are the ones most often used (currently, at least) in exact inference and in learning. For each
3 The software uses the facilities in the JavaBayes system, including the efficient implementation of ordering heuristics

(http://www.cs.cmu.edu/~javabayes).
Figure 8: Bayesian network generated with BNGenerator and opened with JavaBayes: 30 nodes, max-
imum degree 20, maximum induced width 2.

Table 1: Average number of d-connected nodes, given the number of nodes (N) and heuristic width (HW); columns indicate the percentage of observed nodes.

(N, HW)    3%        14%       28%       85%
(35, 4)    13 ± 6    19 ± 5    21 ± 7    13 ± 9
(35, 8)    17 ± 7    23 ± 5    27 ± 4    20 ± 9
(70, 4)    23 ± 9    36 ± 10   39 ± 16   13 ± 12
(70, 8)    34 ± 11   49 ± 7    55 ± 7    31 ± 17

network, we computed the number of d-connected nodes for a random queried variable, varying the number
of observed nodes. We considered situations where 3%, 14%, 28% and 85% of the nodes are observed. The
first two situations correspond to typical queries in Bayesian networks, while the last situation corresponds
to a typical inference in an EM algorithm [3]. The situation of 28% observed nodes is an intermediate case.
Table 1 presents results for four different sets of networks, for different sizes and heuristic width (each entry
summarizes 5000 networks). Note the interesting fact that the number of d-connected nodes is largest in the
“intermediate” case, decreasing in the other cases. Note also the relatively large number of d-connected nodes:
for a heuristic width of 4, about 35% of the nodes are d-connected even when a small number of variables
are observed.

5.2. QUASI-RANDOM INFERENCES


In this section we look into the behavior of Monte Carlo methods associated with quasi-random numbers;
that is, numbers that form low discrepancy sequences — numbers that progressively cover the space in
the “most uniform” manner [14]. There have been quite successful applications of quasi-Monte Carlo
methods for integration in low-dimensional problems; in high-dimensional problems, there has been con-
flicting evidence regarding the performance of quasi-Monte Carlo methods. As a positive example, Cheng
and Druzdzel obtained good results in Bayesian network inference with importance sampling using quasi-
random numbers [1].
There has been little work combining quasi-random numbers with Gibbs sampling; finding efficient
methods is still an open problem [6]. The difficulty seems to be the deleterious interaction between corre-
lations in quasi-random sequences and Markov chains. Liao applied quasi-random numbers for variance
reduction in a Gibbs sampler for classical statistical models [10], obtaining good but not conclusive results
through randomly permuted quasi-random sequences.
We have investigated the following question: how do quasi-random numbers affect standard
Table 2: Average mean square error (multiplied by 10³) for importance sampling (top) and Gibbs sampling (bottom), with 0, 10 and 20 observed nodes.

Imp. Samp.    0            10          20
Pseudo        1.2 ± 0.07   5.5 ± 2.8   19.0 ± 2.6
Quasi         1.9 ± 0.4    7.3 ± 3.3   17.3 ± 13.3

Gibbs Samp.   0            10          20
Pseudo        3.4 ± 0.5    3.4 ± 0.6   3.2 ± 0.8
Quasi         79.3 ± 8.2   81.6 ± 10   82.6 ± 13

importance sampling and Gibbs sampling algorithms in Bayesian networks? We have used the importance
sampling scheme derived by Dagum and Luby [4], and have investigated networks with Halton sequences
as quasi-random numbers.
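For reference, Halton sequences are simple to generate: the d-th coordinate of the n-th point is the radical inverse of n in the d-th prime base. A minimal sketch (ours, for illustration):

def radical_inverse(n, base):
    """Reflect the base-b digits of n around the radix point."""
    x, f = 0.0, 1.0 / base
    while n > 0:
        x += (n % base) * f
        n //= base
        f /= base
    return x

def halton(index, dim, primes=(2, 3, 5, 7, 11, 13)):
    """The index-th point of the dim-dimensional Halton sequence."""
    return [radical_inverse(index, primes[d]) for d in range(dim)]

points = [halton(i, 2) for i in range(1, 100)]   # 99 points in [0,1)^2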
The summary of our investigation is as follows. First, pseudo-random numbers are clearly better
than quasi-random numbers in medium-sized networks for Gibbs sampling. Second, pseudo-random numbers
have a small edge over quasi-random numbers for importance sampling; however, the differences are so
small that both can be used. In fact it is not hard to find networks that behave better under quasi-random
importance sampling than under pseudo-random importance sampling.4 It should also be noted that both
importance sampling and Gibbs sampling are affected in different ways by observations, as predicted by
the theory [4]. As we increase the number of observations, importance sampling eventually turns into a
much worse option than Gibbs sampling (corroborating results by Cheng and Druzdzel, as their importance
sampling algorithm displays a notable degradation in performance when observations are added [1, Figures
7 and 8]).
These conclusions are illustrated by Table 2, where we show the Mean Square Error (as defined in
[1]) for sets of networks under various algorithms. All networks had 100 nodes and heuristic width 5. Each
column corresponds to a number of observed nodes (0, 10 and 20).

5.3. LOOPY PROPAGATION


Loopy propagation algorithms are based on the iterative application of Pearl’s propagation algorithm in
multi-connected graphs [11]. These algorithms have been observed to often converge quickly to correct
answers. In fact, little is known about convergence, except for a few special cases [21, 20]. Existing
analyses of loopy propagation have identified a few situations where the algorithm does not converge, but
have not found a definite explanation for the lack of convergence [13]. One of the essential aspects of the “non-
convergent” networks in [13] was the presence of logical nodes, with zero/one probability values. So, we
pose the question: how does loopy propagation behave when probabilities are not extreme, as we perform
inference in large and dense networks?
We have never observed, in a rather comprehensive exploration, situations where loopy propaga-
tion failed to converge when probabilities are larger than zero. We have varied the number of nodes (going
up to 500 nodes), maximum heuristic width (up to 28), number of values for variables, and even tried to in-
troduce somewhat extreme distributions into our networks; we never forced loopy propagation to oscillate.
Even for large networks where no known exact algorithm could succeed, loopy propagation converged. In
most cases convergence led to correct (or approximately correct) values; however, the quality of approximations
is not always excellent — and it seems that the longer the algorithm takes to converge, the worse
the result.
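For illustration only, the convergence test can be phrased as iterating message updates until the largest message change falls below a tolerance. The sketch below does this for sum-product propagation on a pairwise model, which is not the directed-network formulation of Pearl's algorithm used in our experiments:

import numpy as np

def loopy_bp(unary, pairwise, edges, max_iter=200, tol=1e-6):
    """unary[v]: potential vector (numpy array) for node v; pairwise[(u, v)]:
    matrix indexed [value of u, value of v]; edges: list of pairs (u, v)."""
    msgs = {}
    for u, v in edges:
        msgs[(u, v)] = np.ones(len(unary[v]))
        msgs[(v, u)] = np.ones(len(unary[u]))
    for it in range(max_iter):
        delta = 0.0
        for (u, v) in list(msgs):
            # product of u's potential and all messages into u except v's
            belief = unary[u].astype(float)
            for (a, b) in msgs:
                if b == u and a != v:
                    belief = belief * msgs[(a, b)]
            psi = pairwise[(u, v)] if (u, v) in pairwise else pairwise[(v, u)].T
            new = psi.T @ belief            # sum-product update for m_{u->v}
            new = new / new.sum()           # normalize to avoid under/overflow
            delta = max(delta, float(np.abs(new - msgs[(u, v)]).max()))
            msgs[(u, v)] = new
        if delta < tol:
            return it + 1                   # iterations until convergence
    return None                             # did not converge within max_iter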
To illustrate an interesting experiment with random networks, in Figure 9 we show the relationship
between heuristic width and number of iterations to convergence for a fixed family of networks. The number
of nodes is 100, and all variables are binary. We took rather complex networks, in the sense that heuristic
width was quite high (up to 28). The absolute error is just the absolute value of the difference between the
correct inference and the inference generated by loopy propagation. Note that the absolute error and the
number of iterations to convergence are almost independent of heuristic width. We would have expected
4 As a notable (not randomly generated) example of this phenomenon, the Alarm network does behave slightly better with quasi-

random importance sampling than with pseudo-random importance sampling (corroborating results by Cheng and Druzdzel [1]).
Figure 9: Absolute error (top) and number of iterations to convergence (bottom) as functions of heuristic width, from 10 to 28 (tests with 200 random networks; bars indicate variance within 10 random networks).

that the more dense the network, the more difficult it would be for loopy propagation to converge — but
loopy propagation seems to be incredibly robust in this respect. The absolute error is considerably large in
many of these tests.

6. CONCLUSION
In this paper we have presented what we believe is the best available solution for the generation of random
Bayesian networks. The key idea is to generate DAGs with constraints on induced width, and then gen-
erate distributions associated with the generated DAG. Given the NP-hardness of induced width, we have
resorted to “heuristic width” — the width produced by one of the many high-quality heuristics available.
We generate DAGs using Markov chains, and the need to guarantee heuristic width constraints leads to a
reasonably complex transition scheme encoded by algorithm PMMixed and procedure J. The algorithm can
be easily changed to accommodate a number of other constraints (say constraints on the maximum number
of parents).
We have observed that this strategy does produce “realistic” Bayesian networks. We have pre-
sented three applications of these random networks. First, we have investigated the relationship between
induced width and d-connectivity, verifying that the number of d-connected nodes tends to be rather large
in realistic networks. Such a quantity impacts inference algorithms, sensitivity analysis, and learning algo-
rithms such as EM. Second, we have confirmed comments in the literature that suggest that standard Gibbs
sampling cannot profit from quasi-random samples, while straightforward importance sampling presents es-
sentially the same behavior under pseudo- and quasi-random sampling for medium-sized networks. Third,
we have shown that loopy propagation converges with extraordinary regularity and robustness when non-
extreme distributions are used, and that the convergence of the algorithm does not seem to be affected by
the complexity of the underlying networks.

Acknowledgements
We thank Carlos Brito for suggesting the use of induced width, Robert Castelo for pointing us to Melançon
et al’s work, Guy Melançon for confirming some initial thoughts, Nir Friedman for indicating how to
generate distributions, and Haipeng Guo for testing the BNGenerator. We also thank Jaap Suermondt,
Alessandra Potrich and Tomas Kocka for providing important ideas, and Y. Xiang, P. Smets, D. Dash, M.
Horsh, E. Santos, and B. D’Ambrosio for suggesting valuable procedures.
The first author was supported by FAPESP grant 00/11067-9. The third author was supported
by HP Labs and was responsible for investigating loopy propagation; we thank Marsha Duro from HP
Labs for establishing the funding and Edson Nery from HP Brazil for managing it. The work received
substantial and generous support from HP Labs. The work was also partially supported by CNPq through
grant 300183/98-4.

References
[1] Jian Cheng and Marek J. Druzdzel. Computational investigation of low-discrepancy sequences in simula-
tion algorithms for Bayesian networks. In Craig Boutilier and Moises Goldszmidt, editors, Proceed-
ings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 72–81, San
Francisco, CA, June 30–July 3, 2000. Morgan Kaufmann Publishers.
[2] Veerle Marleen H. Coupé, Linda L. C. van der Gaag, and J. D. F. Habbema. Sensitivity analysis: an aid for
belief-network quantification. Knowledge Engineering Review, 15:1–18, 2000.
[3] Fabio Gagliardi Cozman. Removing redundancy in the expectation-maximization algorithm for Bayesian
network learning. In Leliane Nunes de Barros, Roberto Marcondes, Fabio Gagliardi Cozman, and
Anna Helena Reali Costa, editors, Workshop on Probabilistic Reasoning in Artificial Intelligence,
pages 21–26, São Paulo, 2000. Editora Tec Art.
[4] Paul Dagum and Michael Luby. An optimal approximation algorithm for Bayesian inference. Artificial
Intelligence, 93(1–2):1–27, 1997.
[5] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Eric Horvitz
and Finn Jensen, editors, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence
(UAI-96), pages 211–219, San Francisco, August 1–4 1996. Morgan Kaufmann Publishers.
[6] K. T. Fang and Y. Wang. Number Theoretic Methods in Statistics. Chapman & Hall, New York, 1994.
[7] J. S. Ide and F. G. Cozman. Random generation of Bayesian networks. In Proc. of XVI Brazilian Symposium
on Artificial Intelligence. Springer-Verlag, 2002.
[8] F. V. Jensen. An Introduction to Bayesian Networks. Springer-Verlag, New York, 1996.
[9] U. Kjaerulff. Triangulation of graphs — algorithms giving small total state space. Technical Report R-90-
09, Department of Mathematics and Computer Science, Aalborg University, Denmark, March 1990.
[10] J. G. Liao. Variance reduction in Gibbs sampler using quasi random numbers. Journal of Computational
and Graphical Statistics, 7(3):253–266, September 1998.
[11] R. J. McEliece, D. J. C. MacKay, and J. F. Cheng. Turbo decoding as an instance of Pearl's “belief propaga-
tion” algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140–152, 1998.
[12] G. Melançon and M. Bousquet-Mélou. Random generation of DAGs for graph drawing. Technical Report
INS-R0005, Dutch Research Center for Mathematical and Computer Science (CWI), 2000.
[13] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical
study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages
467–475, 1999.
[14] Harald Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods, volume 63 of CBMS-
NSF regional conference series in Appl. Math. SIAM, Philadelphia, 1992.
[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[16] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhäuser, Cambridge, MA, USA; Berlin, Ger-
many; Basel, Switzerland, 1992.
[17] R. W. Robinson. Counting labeled acyclic digraphs. In F. Harary, editor, New Directions in the Theory of
Graphs, pages 28–43, Michigan, 1973. Academic Press.
[18] S. M. Ross. Stochastic Processes. John Wiley and Sons; New York, NY, 1983.
[19] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search (second edition).
MIT Press, 2000.
[20] Sekhar C. Tatikonda and Michael I. Jordan. Loopy belief propagation and Gibbs measures. In Adnan
Darwiche and Nir Friedman, editors, Conference on Uncertainty in Artificial Intelligence, pages 493–
500, San Francisco, California, 2002. Morgan Kaufmann.
[21] Yair Weiss and William T. Freeman. Correctness of belief propagation in Gaussian graphical models of
arbitrary topology. Technical Report CSD-99-1046, CS Department, UC Berkeley, 1999.
[22] Y. Xiang and T. Miller. A well-behaved algorithm for simulating dependence structures of Bayesian net-
works. International Journal of Applied Mathematics, 1:923–932, 1999.
[23] Nevin Lianwen Zhang. Irrelevance and parameter learning in Bayesian networks. Artificial Intelligence,
88:359–373, 1996.
