IdeCozman TR2003 quasiMC
Abstract. We present algorithms for the generation of uniformly distributed Bayesian networks
with constraints on induced width. The algorithms use ergodic Markov chains to generate sam-
ples, building upon previous algorithms by the authors. The introduction of constraints on
induced width leads to more realistic results but requires new techniques. We discuss three ap-
plications of randomly generated networks: we study the average number of nodes d-connected
to a query, the effectiveness of quasi-random samples in approximate inference, and the conver-
gence of loopy propagation for non-extreme distributions.
1. INTRODUCTION
Certain questions involving Bayesian networks are quite difficult to answer exactly. For example, what is
the average number of variables that are d-connected to each other in a Bayesian network? Or, how do
quasi-random importance sampling and quasi-random Gibbs sampling compare? Significant insight into
such theoretical and empirical questions could be obtained by analyzing large samples of Bayesian networks.
However, it does not seem likely that hundreds of “real” Bayesian networks will be freely distributed in the
near future. We are left with an obvious solution: we must be able to randomly generate Bayesian networks
for our experiments and comparisons. Many researchers have used ad hoc random processes to generate
networks in the past (for example, [19, 22]); however we would like to produce networks that are uniformly
distributed in some suitable space. The most challenging aspect of Bayesian network generation is exactly
to define this space; that is, to define the properties that characterize “realistic” Bayesian networks. Our
first attempt in generating Bayesian networks was to control the maximum node degree and the maximum
number of edges, aiming to generate relatively sparse graphs [7]. We found that such a strategy is reasonable
but not perfect. Restrictions solely on node degree and number of edges lead to “overly random” edges
— real networks often have their variables distributed in groups, with few edges between groups.1 We
then considered a number of other distinguishing properties of “real” networks. For example, one option,
suggested by T. Kocka, would be to produce graphs with a large number of equivalent graphs (a property
observed in real networks). However we wanted to use properties with clear intuitive meaning — so that
users of our algorithms would quickly grasp the properties of generated networks.
We eventually found that the most appropriate quantity to control is the induced width of net-
works.2 The induced width of a network can be easily explained and understood; it conveys the algorithmic
complexity of inferences and, indirectly, it captures how dense the network is. Our tests indicate (rather
subjectively) that networks with low induced width “look like” real networks in the literature. Besides, it
makes sense to control induced width, as we are usually interested in comparing algorithms or parameteriz-
ing results with respect to the complexity of the underlying network, and induced width is the main indicator
of such complexity. Unfortunately, the generation of random graphs with constraints on induced width is
significantly more involved than the generation of graphs with constraints on node degree and number of
1 Tomas Kocka brought this fact to our attention.
2 Carlos Brito suggested the idea.
Figure 1: a) Network, b) moral graph, c) induced graph for ordering F, L, D, H, B, and d) induced
graph for ordering L, H, D, B, F . Dashed lines represent induced connections.
edges. In this paper we report on new algorithms that accomplish generation of graphs with simultaneous
constraints on all these quantities: induced width, node degree, and number of edges.
Following our previous work [7], we divide the generation of random Bayesian networks into
two steps. First we generate a random directed acyclic graph that satisfies constraints on induced width,
node degree, and number of edges; then we generate probability distributions for the graph. To generate
the random graph, we construct ergodic Markov chains with appropriate stationary distributions, so that
successive sampling from the chains leads to the generation of properly distributed networks. The necessary
theory is presented in Section 2. Our algorithms are described in Section 3. A freely distributed program
for Bayesian network generation is presented in Section 4.
To illustrate how random Bayesian networks can be used, we describe three applications of obvious
practical significance in Section 5. First, we study the average number of d-connected nodes in Bayesian
networks. Second, we analyze the performance of quasi-random samples (as opposed to pseudo-random
samples) in two inference algorithms: importance sampling and Gibbs sampling. Third, we investigate the
convergence of loopy propagation on networks that have non-extreme distributions (that is, most probability
values are far from zero and one).
2. BASIC CONCEPTS
This section summarizes material from [7] and [5].
A directed graph is composed of a set of nodes and a set of edges. An edge (u, v) goes from a
node u (the parent) to a node v (the child). A path is a sequence of nodes such that each pair of consecutive
nodes is adjacent. A path is a cycle if it contains more than two nodes and the first and last nodes are the
same. A cycle is directed if all of its edges point in the same direction along the cycle.
A directed graph is acyclic (a DAG) if it contains no directed cycles. A graph is connected if there exists a
path between every pair of nodes. A graph is singly-connected, also called a polytree, if there exists exactly
one path between every pair of nodes; otherwise, the graph is multiply-connected (or multi-connected for
short). An extreme sub-graph of a polytree is a sub-graph that is connected to the remainder of the polytree
by a single path. In an undirected graph, the direction of the edges is ignored. An ordered graph is a pair
containing an undirected graph and an ordering of nodes. The width of a node in an ordered graph is the
number of its neighbors that precede it in the ordering. The width of an ordering is the maximum width
over all nodes. The induced width of an ordered graph is the width of the ordered graph obtained as follows:
nodes are processed from last to first; when node X is processed, all its preceding nodes are connected (call
these connections induced connections and the resulting graph induced graph). An example is presented in
Figure 1. The induced width of a graph is the minimal induced width over all orderings; the computation
of induced width is an NP-hard problem [5], and computations are usually based on heuristics [9].
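The width computation for a fixed ordering follows directly from this definition. The sketch below (a Python illustration; the function and argument names are ours) processes nodes from last to first, records each node's preceding neighbors, and adds the induced connections:

```python
def induced_width_for_ordering(adj, order):
    """Width of the induced graph for a fixed ordering.

    adj: dict mapping each node to the set of its neighbors
         (undirected graph); order: list of all nodes.
    Nodes are processed from last to first; when a node is
    processed, all of its preceding neighbors are pairwise
    connected (the induced connections)."""
    pos = {v: i for i, v in enumerate(order)}
    g = {v: set(ns) for v, ns in adj.items()}  # working copy
    width = 0
    for v in reversed(order):
        preceding = [u for u in g[v] if pos[u] < pos[v]]
        width = max(width, len(preceding))
        # connect all preceding neighbors of v (induced connections)
        for a in preceding:
            for b in preceding:
                if a != b:
                    g[a].add(b)
    return width
```

For the four-node cycle a-b-c-d-a under the ordering a, b, c, d, the procedure adds the induced connection between a and c and reports width 2. The induced width of the graph is the minimum of this quantity over all orderings, which is what makes the exact computation hard.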
A Bayesian network represents a joint probability density over a set of variables X [8]. The density
is specified through a directed acyclic graph; every node in the graph is associated with a variable Xi in X,
and with a conditional probability density p(Xi |pa(Xi )), where pa(Xi ) denotes the parents of Xi in the
graph. A Bayesian network represents a unique joint probability density [15]: p(X) = ∏i p(Xi |pa(Xi ))
(consequence of a Markov condition). The moral graph of a Bayesian network is obtained by connecting
parents of any variable and ignoring direction of edges. The induced width of a Bayesian network is the
induced width of its moral graph. An inference is a computation of a posterior probability density for
a query variable given observed variables; the complexity of inferences is directly related to the induced
width of the underlying Bayesian network [5].
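Moralization is a purely local operation and can be sketched in a few lines. The following Python sketch (representation and names are ours, for illustration) takes a DAG as a child-to-parents map, connects the parents of every variable, and drops edge directions:

```python
def moral_graph(parents):
    """Moral graph of a DAG given as {child: list of parents}.

    Connects the parents of every variable and ignores edge
    directions, yielding an undirected adjacency structure."""
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:
            adj[p].add(child)   # original edge, now undirected
            adj[child].add(p)
        for a in ps:            # "marry" the parents
            for b in ps:
                if a != b:
                    adj[a].add(b)
    return adj
```

For the collider A → C ← B, the moral graph contains the additional edge between A and B, since both are parents of C.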
We use Markov chains to generate random graphs, following [12]. Consider a Markov chain
{Xt , t ≥ 0} over a finite state space S, and let P = (pij ) be the M × M matrix of transition probabilities,
where M is the number of states and pij = Pr(Xt+1 = j|Xt = i), for all t [16, 18]. The s-step transition
probabilities are given by P s = (pij(s) ), where pij(s) = Pr(Xt+s = j|Xt = i), independent of t. A
Markov chain is irreducible if for all i, j there exists s that satisfies pij(s) > 0; equivalently, a Markov
chain is irreducible if and only if every pair of states intercommunicates. A Markov chain is positive
recurrent if every state i ∈ S can be returned to in a finite expected number of steps; it follows that a
finite irreducible chain is positive recurrent. A Markov chain is aperiodic if, for every state i, the greatest
common divisor of all those s for which pii(s) > 0 is equal to one (that is, G.C.D.(s|pii(s) > 0) = 1).
Aperiodicity is ensured if pii > 0 (pii is a self-loop probability). A Markov chain is ergodic if there exists
a vector π (the stationary distribution) satisfying lims→∞ pij(s) = πj , for all i and j; a finite aperiodic,
irreducible and positive recurrent chain is ergodic. A transition matrix is called doubly stochastic if all of
its rows and columns sum to one (that is, if Σj pij = 1 and Σi pij = 1). A Markov chain with such a
transition matrix has a uniform stationary distribution [16].
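The last fact is easy to check numerically. The sketch below iterates a distribution through a small doubly stochastic matrix and converges to the uniform distribution; the matrix is an arbitrary example, not one of the chains constructed in this paper:

```python
def iterate_distribution(P, pi, steps=200):
    """Apply pi <- pi P repeatedly; P is a row-major list of rows."""
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(len(P)))
              for j in range(len(P))]
    return pi

# A doubly stochastic 3x3 matrix: every row and column sums to one.
P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
# Starting from a point mass, the distribution approaches [1/3, 1/3, 1/3].
pi = iterate_distribution(P, [1.0, 0.0, 0.0])
```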
Figure 4: Simple trees used in our proofs: (a) Simple tree, (b) Simple polytree, (c) Simple sorted tree.
Figure 6: Procedure J.
a maximum width given a particular heuristic. We call this width the heuristic width. Our goal then is to
produce random DAGs uniformly distributed on the space of DAGs that satisfy constraints on heuristic width.
Apparently we could still use the PMMixed algorithm here, with the obvious change that lines
05 and 08 must check heuristic width instead of induced width. However such a simple modification is
not sufficient: because heuristic width is usually computed with local operations, we cannot predict the
effect of adding and removing edges on it. That is, we cannot adapt Lemma 1 to heuristic width in general,
and thus we cannot predict whether a “path” between DAGs can in fact be followed by the chain without
violating heuristic width constraints. We must create a mechanism that would allow the chain to transit
between arbitrary DAGs regardless of the adopted heuristic. Our solution is to add a new type of operation,
specified by procedure J (Figure 6) — this procedure allows “jumps” from arbitrary multi-connected DAGs
to polytrees. We also assume that any adopted heuristic is such that, if the DAG is a polytree, then the
heuristic width is equal to the induced width. Even if a given heuristic does not satisfy this property, the
heuristic can be easily modified to do so: test whether the DAG is a polytree and, if so, return the induced
width of the polytree (the maximum number of parents amongst all nodes).
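This modification can be sketched directly. The Python sketch below (names are ours, for illustration) checks whether a DAG is a polytree, that is, whether its underlying undirected graph is connected and has exactly n − 1 edges, and if so returns the induced width as the maximum number of parents:

```python
from collections import deque

def polytree_width(parents):
    """If the DAG {child: list of parents} is a polytree, return its
    induced width (the maximum number of parents); otherwise None."""
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
    n_edges = sum(len(ps) for ps in parents.values())
    # a polytree's underlying undirected graph is a tree:
    # connected, with exactly n - 1 edges
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for u in adj[queue.popleft()]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    if len(seen) != len(nodes) or n_edges != len(nodes) - 1:
        return None
    return max(len(ps) for ps in parents.values())
```

For the collider A → B ← C this returns 2; for the diamond A → B, A → C, B → D, C → D it returns None, since the underlying graph has a cycle.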
Procedure J must be called with probability (1 − p − q) both after line 04 and after line 07 in the
algorithm PMMixed. The complete algorithm can be understood as a sequence of probabilistic transitions
that follow the scheme in Figure 7. All previous theorems can be easily extended to this new situation; the
only one that must be substantially modified is Theorem 3. Transitions from polytree to multi-connected
DAGs are performed with probability (1 − q); transitions from multi-connected DAGs to polytrees are
performed with probability 1 − (p + q) × q/(p + q) = 1 − q. The values of p and q control the mixing rate
of the chain; we have observed remarkable insensitivity to these values.
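The three-way choice among operations can be sketched as follows; the operation names mirror Figure 7, but the dispatch itself is only an assumption about how one might organize such a sampler, not the actual BNGenerator code:

```python
import random

def choose_operation(p, q):
    """Select the next chain operation: AorR with probability p,
    AR with probability q, and Jump with probability 1 - p - q."""
    assert 0.0 <= p and 0.0 <= q and p + q <= 1.0
    u = random.random()
    if u < p:
        return "AorR"
    if u < p + q:
        return "AR"
    return "Jump"
```

Because the branch probabilities sum to one, every step of the chain performs exactly one of the three operations.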
[Figure 7: transition scheme of the complete algorithm. From a polytree: AorR with probability p, AR with
probability q, Jump with probability 1 − p − q. From a multi-connected DAG: AorR with probability p + q
(if the result is a polytree, it is accepted with probability p/(p + q) and rejected with probability q/(p + q)),
Jump with probability 1 − p − q.]
4. BNGENERATOR
The algorithm PMMixed (with the modifications indicated in Figure 7) can be efficiently implemented with
existing ordering heuristics, and the resulting DAGs are quite similar to existing Bayesian networks. We
have implemented the algorithm using an O(n log n) implementation of the minimum weight heuristic. The
result is the BNGenerator package, freely distributed under the GNU license (at http://www.pmr.poli.usp.br/
ltd/Software/BNGenerator).3 An example network generated by BNGenerator is depicted in Figure 8.
The BNGenerator accepts specification of the number of nodes, maximum node degree, maximum
number of edges, and maximum heuristic width (for the minimum weight heuristic, but other heuristics can be
added). The software also performs uniformity tests using a χ2 test. Such tests can be performed only for
a relatively small number of nodes (as the number of possible DAGs grows extremely quickly [17]), but they
allowed us to test the algorithm and its procedures. We have observed the relatively fast mixing of the chain
with the transitions we have designed.
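The uniformity test amounts to a standard χ2 goodness-of-fit computation over the counts of each distinct DAG in a sample. A generic sketch (not the BNGenerator code; the critical value would come from a χ2 table with k − 1 degrees of freedom):

```python
def chi2_uniformity(counts):
    """Chi-square statistic for the hypothesis that the k observed
    counts come from a uniform distribution over k categories."""
    n, k = sum(counts), len(counts)
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)
```

For example, counts [26, 24, 25, 25] over four equally likely DAGs give a statistic of 0.08, far below the 5% critical value of 7.81 for three degrees of freedom, so uniformity is not rejected.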
5. APPLICATIONS
To show how to use our previous results, we have selected three problems that have received some analysis
in the literature but have no conclusive solution yet. Due to the lack of space, we present a brief summary
of rather extensive tests.
5.1. D-CONNECTIVITY
Given a node Xq and a set of nodes XE , the number of nodes d-connected to Xq given XE is a fundamental
quantity in Bayesian networks. This quantity indicates how many variables are “requisite” for an inference;
it also indicates how large a sensitivity analysis must be [2]; it essentially controls the complexity of
the EM algorithm when learning networks [23]; and it can be used to eliminate redundant computations in
learning [3]. Despite the importance of d-connectivity, there seems to be no result in the literature indicating
the relationship between heuristic width (the “complexity” of the network) and the expected number of d-
connected nodes. We illustrate how such a relationship could be investigated with random networks.
We took an ensemble of random networks with medium number of nodes and heuristic width, as
these networks are the ones most often used (currently at least) in exact inference and in learning. For each
3 The software uses the facilities in the JavaBayes system, including the efficient implementation of ordering heuristics
(http://www.cs.cmu.edu/˜javabayes).
Figure 8: Bayesian network generated with BNGenerator and opened with JavaBayes: 30 nodes, maximum
degree 20, maximum induced width 2.
Table 1: Average number of d-connected nodes given number of nodes (N ) and heuristic width (HW ).
(N, HW ) ↓ 3% obs. 14% obs. 28% obs. 85% obs.
35,4 13 ± 6 19 ± 5 21 ± 7 13 ± 9
35,8 17 ± 7 23 ± 5 27 ± 4 20 ± 9
70,4 23 ± 9 36 ± 10 39 ± 16 13 ± 12
70,8 34 ± 11 49 ± 7 55 ± 7 31 ± 17
network, we computed the number of d-connected nodes for a random queried variable, varying the number
of observed nodes. We considered situations where 3%, 14%, 28% and 85% of the nodes are observed. The
first two situations correspond to typical queries in Bayesian networks, while the last situation corresponds
to a typical inference in an EM algorithm [3]. The situation of 28% observed nodes is an intermediate case.
Table 1 presents results for four different sets of networks, for different sizes and heuristic width (each entry
summarizes 5000 networks). Note the interesting fact that the number of d-connected nodes is larger for the
“intermediate” case, decreasing in other cases. Note also the relatively large number of d-connected nodes:
for a heuristic width of 4, about 35% of the nodes are d-connected even when a small number of variables
are observed.
5.2. QUASI-RANDOM SAMPLING
Table 2: Mean Square Error (as defined in [1]) for Gibbs sampling with pseudo-random and quasi-random
samples; columns give the number of observed nodes.
Gibbs Samp.    0             10            20
Pseudo         3.4 ± 0.5     3.4 ± 0.6     3.2 ± 0.8
Quasi          79.3 ± 8.2    81.6 ± 10     82.6 ± 13
How do quasi-random samples compare to pseudo-random samples in
importance sampling and Gibbs sampling algorithms in Bayesian networks? We have used the importance
sampling scheme derived by Dagum and Luby [4], and have investigated networks with Halton sequences
as quasi-random numbers.
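Halton sequences are built from the radical-inverse function: the index is written in a prime base and its digits are mirrored about the radix point. A minimal sketch:

```python
def halton(index, base):
    """index-th element (index >= 1) of the Halton sequence in
    the given base, via the radical-inverse function."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

# First base-2 Halton points: 1/2, 1/4, 3/4, 1/8, ...
points = [halton(i, 2) for i in range(1, 5)]
```

For multi-dimensional points, coordinate i conventionally uses the i-th prime as its base; see [14] for discrepancy properties of such sequences.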
The summary of our investigation is as follows. First, pseudo-random numbers are clearly better
than quasi-random numbers in medium-sized networks for Gibbs sampling. Second, pseudo-random numbers
have a small edge over quasi-random numbers for importance sampling; however the differences are so
small that both can be used. In fact it is not hard to find networks that behave better under quasi-random
importance sampling than under pseudo-random importance sampling.4 It should also be noted that both
importance sampling and Gibbs sampling are affected in different ways by observations, as predicted by
the theory [4]. As we increase the number of observations, importance sampling eventually turns into a
much worse option than Gibbs sampling (corroborating results by Cheng and Druzdzel, as their importance
sampling algorithm displays a notable degradation in performance when observations are added [1, Figures
7 and 8]).
These conclusions are illustrated by Table 2, where we show the Mean Square Error (as defined in
[1]) for sets of networks under various algorithms. All networks had 100 nodes and heuristic width 5. Each
row shows the MSE for different numbers of observed nodes (0, 10 and 20 nodes).
4 In our tests, several networks displayed smaller errors under quasi-random importance sampling than with pseudo-random importance sampling (corroborating results by Cheng and Druzdzel [1]).
[Plots omitted: absolute error and number of iterations versus heuristic width.]
Figure 9: Absolute error, number of iterations, and induced width (test with 200 random networks;
bars indicate variance within 10 random networks).
We expected that the denser the network, the more difficult it would be for loopy propagation to converge,
but loopy propagation seems to be remarkably robust in this respect. The absolute error is considerably large
in many of these tests.
6. CONCLUSION
In this paper we have presented what we believe is the best available solution for the generation of random
Bayesian networks. The key idea is to generate DAGs with constraints on induced width, and then gen-
erate distributions associated with the generated DAG. Given the NP-hardness of computing induced width,
we have resorted to “heuristic width”, the width produced by one of the many high-quality heuristics available.
We generate DAGs using Markov chains, and the need to guarantee heuristic width constraints leads to a
reasonably complex transition scheme encoded by algorithm PMMixed and procedure J. The algorithm can
be easily changed to accommodate a number of other constraints (say constraints on the maximum number
of parents).
We have observed that this strategy does produce “realistic” Bayesian networks. We have pre-
sented three applications of these random networks. First, we have investigated the relationship between
induced width and d-connectivity, verifying that the number of d-connected nodes tends to be rather large
in realistic networks. Such a quantity impacts inference algorithms, sensitivity analysis, and learning algo-
rithms such as EM. Second, we have confirmed comments in the literature that suggest that standard Gibbs
sampling cannot profit from quasi-random samples, while straightforward importance sampling presents es-
sentially the same behavior under pseudo- and quasi-random sampling for medium-sized networks. Third,
we have shown that loopy propagation converges with extraordinary regularity and robustness when non-
extreme distributions are used, and that the convergence of the algorithm does not seem to be affected by
the complexity of the underlying networks.
Acknowledgements
We thank Carlos Brito for suggesting the use of induced width, Robert Castelo for pointing us to Melançon
et al.'s work, Guy Melançon for confirming some initial thoughts, Nir Friedman for indicating how to
generate distributions, and Haipeng Guo for testing the BNGenerator. We also thank Jaap Suermondt,
Alessandra Potrich and Tomas Kocka for providing important ideas, and Y. Xiang, P. Smets, D. Dash, M.
Horsh, E. Santos, and B. D’Ambrosio for suggesting valuable procedures.
The first author was supported by FAPESP grant 00/11067-9. The third author was supported
by HP Labs and was responsible for investigating loopy propagation; we thank Marsha Duro from HP
Labs for establishing the funding and Edson Nery from HP Brazil for managing it. The work received
substantial and generous support from HP Labs. The work was also partially supported by CNPq through
grant 300183/98-4.
References
[1] Jian Cheng and Marek J. Druzdzel. Computational investigation of low-discrepancy sequences in simula-
tion algorithms for Bayesian networks. In Craig Boutilier and Moises Goldszmidt, editors, Proceed-
ings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 72–81, San
Francisco, CA, June 30–July 3 2000. Morgan Kaufmann Publishers.
[2] Veerle Marleen H. Coupé, Linda L. C. van der Gaag, and J. D. F. Habbema. Sensitivity analysis: an aid for
belief-network quantification. Knowledge Engineering Review, 15:1–18, 2000.
[3] Fabio Gagliardi Cozman. Removing redundancy in the expectation-maximization algorithm for Bayesian
network learning. In Leliane Nunes de Barros, Roberto Marcondes, Fabio Gagliardi Cozman, and
Anna Helena Reali Costa, editors, Workshop on Probabilistic Reasoning in Artificial Intelligence,
pages 21–26, São Paulo, 2000. Editora Tec Art.
[4] Paul Dagum and Michael Luby. An optimal approximation algorithm for Bayesian inference. Artificial
Intelligence, 93(1–2):1–27, 1997.
[5] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Eric Horvitz
and Finn Jensen, editors, Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence
(UAI-96), pages 211–219, San Francisco, August 1–4 1996. Morgan Kaufmann Publishers.
[6] K. T. Fang and Y. Wang. Number Theoretic Methods in Statistics. Chapman & Hall, New York, 1994.
[7] J. S. Ide and F. G. Cozman. Random generation of Bayesian networks. In Proc. of XVI Brazilian Symposium
on Artificial Intelligence. Springer-Verlag, 2002.
[8] F. V. Jensen. An Introduction to Bayesian Networks. Springer-Verlag, New York, 1996.
[9] U. Kjaerulff. Triangulation of graphs — algorithms giving small total state space. Technical Report R-90-
09, Department of Mathematics and Computer Science, Aalborg University, Denmark, March 1990.
[10] J. G. Liao. Variance reduction in Gibbs sampler using quasi random numbers. Journal of Computational
and Graphical Statistics, 7(3):253–266, September 1998.
[11] R. J. McEliece, D. J. C. MacKay, and J. F. Cheng. Turbo decoding as an instance of Pearl's 'belief propaga-
tion' algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140–152, 1998.
[12] G. Melançon and M. Bousquet-Mélou. Random generation of DAGs for graph drawing. Technical Report
INS-R0005, Dutch Research Center for Mathematical and Computer Science (CWI), 2000.
[13] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical
study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages
467–475, 1999.
[14] Harald Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods, volume 63 of CBMS-
NSF regional conference series in Appl. Math. SIAM, Philadelphia, 1992.
[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan-Kaufman, 1988.
[16] Sidney I. Resnick. Adventures in Stochastic Processes. Birkhäuser, Cambridge, MA, USA; Berlin, Ger-
many; Basel, Switzerland, 1992.
[17] R. W. Robinson. Counting labeled acyclic digraphs. In F. Harary, editor, New Directions in the Theory of
Graphs, pages 28–43, Michigan, 1973. Academic Press.
[18] S. M. Ross. Stochastic Processes. John Wiley and Sons; New York, NY, 1983.
[19] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search (second edition).
MIT Press, 2000.
[20] Sekhar C. Tatikonda and Michael I. Jordan. Loopy belief propagation and Gibbs measures. In Adnan
Darwiche and Nir Friedman, editors, Conference on Uncertainty in Artificial Intelligence, pages 493–
500, San Francisco, California, 2002. Morgan Kaufmann.
[21] Yair Weiss and William T. Freeman. Correctness of belief propagation in Gaussian graphical models of
arbitrary topology. Technical Report CSD-99-1046, CS Department, UC Berkeley, 1999.
[22] Y. Xiang and T. Miller. A well-behaved algorithm for simulating dependence structure of Bayesian net-
works. International Journal of Applied Mathematics, 1:923–932, 1999.
[23] Nevin Lianwen Zhang. Irrelevance and parameter learning in Bayesian networks. Artificial Intelligence,
88:359–373, 1996.