
NETEXTRACT - Extracting Belief Networks in Telecommunications Data

Mary Shapcott, Roy Sterritt, Kenneth Adamson and Edwin Curran University of Ulster Shore Road, Newtownabbey, Co. Antrim, N. Ireland Phone: +44-1232-365131, Fax: +44-366-068 email:{cm.shapcott, k.adamson, r.sterritt, ep.curran}@ulst.ac.uk

ABSTRACT: NETEXTRACT was a collaborative project carried out by the University of Ulster in conjunction with Northern Telecom. Nortel were experiencing problems in making sense of the behaviour of telecommunications networks. The response of networks to fault conditions was poorly understood. Nortel had developed a transputer-based simulation model of a multiplexer product and wished to be able to evaluate the model against live data. A generic architecture has been developed. The general structure is that of an integrated knowledge extraction/expert system. It includes components for extracting relation data from text-based files, for net extraction (machine learning) and for use of extracted nets in an expert system. Several algorithms for the extraction of Bayesian nets from data have been prototyped and typical results can be seen in this paper. We also give a brief overview of directed and undirected Bayesian nets.

Keywords: telecommunications, network, management, event, belief, probabilistic, cause, alarm, temporal.

1. INTRODUCTION
The NETEXTRACT project originated with a problem that Northern Telecom (Nortel) encountered when developing a telecommunications product (Moore et al. 1996). The product was a multiplexer designed to be used in high-speed optical fibre backbone networks using Synchronous Digital Hierarchy (SDH) equipment. In principle the multiplexer's behaviour in the presence of fault conditions such as broken cables or faulty cards was deterministic. In practice it was not possible to predict the behaviour under fault conditions from knowledge of the design of the multiplexer, because there were too many complex hardware and software interactions for a detailed model of the system to be constructed. Nortel had developed a transputer-based simulation model that was designed to emulate the behaviour of multiplexer networks. However, Nortel had no easy way of evaluating the model against live data obtained from real SDH networks. The original aim of the NETEXTRACT project was to develop an architecture that would allow for the extraction of interesting patterns of relationship from a database, the patterns taking the form of cause and effect nets (called Bayesian nets below). These patterns could then be used to refine the simulation model iteratively. An important aspect of the project was to be the use of a parallel computing platform in developing high performance algorithms. It is of course very common to find real-world systems that are deterministic at a detailed level of understanding but non-deterministic in the behaviour that can be observed. During the course of the project it became clear that the general architecture fitted into the area of data mining, and the approach taken was that of data mining.
This paper contains a description of the telecommunications context, the NETEXTRACT architecture, a brief description of Bayesian networks and implementation details and results, concluding with a discussion of a new project and the continuing association with Nortel.

2. THE PROBLEM AND THE APPROACH TAKEN


Both the original transputer simulation model and real tests carried out by Nortel provided data for evaluation in the form of event logs. Event logs are very large text files recorded by the network management software, rather similar to transaction logs in database systems. Event logs store records of individual events such as the raising of alarm conditions in network elements, and they also record control activities carried out by the human operators in charge of the network. Each record contains information about an event: the location of the event (usually the component in the network in which it occurred), the time at which it occurred, the type of event (normally either an alarm or a human activity on the network controller), and other details.

Because a telecommunications network consists of a set of (relatively) autonomous interacting components, such as chips, cards, shelves, multiplexers, gateways and managers, the observed behaviour reflects both the underlying fault conditions that are being tested and the response by the error handling hardware and software in the network. Quite often there are delays of many seconds between the time when an error occurs and the time it has finished propagating around the network and working its way up and down the protocol stack. Information about the abnormal condition is passed between hardware layers and software layers as it travels as a message to the network manager. There was a need to characterise the observed behaviour so that the behaviour of the simulation model could be compared with the results of real tests.

It was decided to investigate the use of graphical statistical models in this study. As will be discussed later, such models provide high-level views of relationships between observational and explanatory variables, and they can mix both domain knowledge and real-world data in the elucidation of the model structure. The team evaluated several algorithms for inducing Bayesian nets and decided to concentrate on two of them: OMI (optimisation of mutual information) and a genetic algorithm. A research tool for extracting chain graphs from data, BIFROST, was also used. All three gave similar results when used with the test data.
The problem of model extraction for Bayesian nets is discussed in some detail later in this report. It was intended that output from the transputer model should be compared with live data. Unfortunately it was not possible to carry out this phase to any serious extent, because the transputer-based simulation model became out of date very early on in the project and Nortel decided that the effort of changing it was not justified. However, they remain convinced of the value of using the output as summary data for fault behaviour, and this is how the information is being used at present (Sterritt et al. 1997a, 1997b and 1997c).

3. THE NETEXTRACT ARCHITECTURE


The realisation of the NETEXTRACT architecture for Nortel's situation is shown in Figure 1. The input data is available in the form of log data. The team identified a need for efficient preparatory processing of data that appears in the event log format. This led to the design and preliminary implementation of a data cleaner and a data pre-processor. The data cleaner allows a user to specify the format of a text document in a generic way using a template, and to specify filtering conditions on the output. In the telecommunications application the text document is an event log, and the template file can be altered if the structure of the event records changes. The data cleaner parses a log file and passes the resulting events to the pre-processor, which time-slices the information and creates an intermediate file for use in the induction module. The Bayesian net created as a result of induction is potentially useful in a fault situation, where the faults most likely to be responsible for observed alarms can be computed from the net and relayed to a human operator. For this reason there is a deduction module in the architecture, whereby observed conditions in the telecommunications network can be fed into the Bayesian net and changes in the probabilities of underlying fault conditions can be computed (McClean and Shapcott, 1997). The components are also able to operate in isolation, as long as they are supplied with files in the correct input format. In particular the induction component and the deduction component use data in Microsoft's Bayesian Interchange format (Microsoft, 1996).

[Figure 1: NETEXTRACT Architecture. Boxes shown in the original figure: Nortel test equipment, Nortel live equipment, data cleaner, preprocessor, induction, graphical network, deduction.]
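The cleaning and time-slicing stages described above can be illustrated with a minimal sketch. The record layout, the field names (`NE`, `TYPE`, `NAME`), the alarm records and the 10-second window are all assumptions made for illustration; the paper does not give the actual Nortel event log format or the template syntax used by the data cleaner.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical record template: the real event log layout is not given in the paper.
TEMPLATE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"NE=(?P<location>\S+) TYPE=(?P<kind>\S+) NAME=(?P<name>\S+)"
)

def clean(lines, keep_kinds=("ALARM",)):
    """Data cleaner: keep records that match the template and pass the filter."""
    for line in lines:
        match = TEMPLATE.match(line.strip())
        if match and match["kind"] in keep_kinds:
            stamp = datetime.strptime(match["time"], "%Y-%m-%d %H:%M:%S")
            yield stamp, match["name"]

def time_slice(events, width_seconds=10):
    """Pre-processor: group events into fixed-width windows.

    Returns the sorted alarm names and one 0/1 row per non-empty window,
    where 1 means the alarm was raised at least once in that window.
    """
    events = sorted(events)
    if not events:
        return [], []
    start = events[0][0]
    windows = defaultdict(set)
    for stamp, name in events:
        index = int((stamp - start).total_seconds() // width_seconds)
        windows[index].add(name)
    names = sorted({name for _, name in events})
    rows = [[int(n in windows[i]) for n in names] for i in sorted(windows)]
    return names, rows

# Invented sample records, including one operator action and one malformed line.
log = [
    "1997-03-01 02:15:07 NE=mux3 TYPE=ALARM NAME=PPI-AIS",
    "1997-03-01 02:15:09 NE=mux3 TYPE=ALARM NAME=LP-EXC",
    "1997-03-01 02:15:12 NE=mgr1 TYPE=OPERATOR NAME=login",
    "1997-03-01 02:15:31 NE=mux4 TYPE=ALARM NAME=LP-PLM",
    "malformed line",
]
names, rows = time_slice(clean(log))
print(names)
print(rows)
```

Each row of the result is one observation of the alarm variables, which is the kind of tabular input an induction algorithm needs. Changing the regular expression plays the role of editing the template file when the record structure changes.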

4. PROBABILISTIC GRAPHICAL MODELS


Graphs in which relationships between variables can be represented by the existence of links between them have an intuitive appeal. They are easy to read if represented on paper and can summarise fairly complex relationships succinctly. Probabilistic nets are a special case of such graphs (others include decision trees and neural networks). A probabilistic network defines graphical relationships that express various independence properties possessed by the variables. As a very simple example, assume that we have three variables containing information about the state of our home: t, the outside temperature, which is either warm or cold; r, the occurrence of rain (yes or no); and c, the state of the central heating system thermostat (running or switched off). If we know that it is cold, this changes our expectations both of rain (r) and of the central heating (c) being switched on, but we don't expect the central heating (c) to depend directly on the rain (r). It is intuitive to express this relationship as a directed graph with t as the parent of r and c. The joint probability distribution for the three variables, t, r and c, can be represented by the product of conditional probability tables, one table for each variable. Each table specifies the conditional probability that the variable takes each of its possible values given the values taken by its parents. In the simple case in point the joint distribution takes the form:

Pr(t, r , c ) = Pr(r | t ) Pr(c | t ) Pr(t )
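As a small numerical illustration of this factorisation, the sketch below builds the joint distribution from three conditional probability tables. All probability values are invented for illustration, since the paper gives no numbers for this example.

```python
from itertools import product

# Illustrative numbers only; the paper gives no probability tables for this example.
p_t = {"cold": 0.3, "warm": 0.7}
p_r_given_t = {
    "cold": {"yes": 0.6, "no": 0.4},
    "warm": {"yes": 0.2, "no": 0.8},
}
p_c_given_t = {
    "cold": {"on": 0.9, "off": 0.1},
    "warm": {"on": 0.1, "off": 0.9},
}

def joint(t, r, c):
    """Pr(t, r, c) = Pr(r | t) Pr(c | t) Pr(t)."""
    return p_r_given_t[t][r] * p_c_given_t[t][c] * p_t[t]

# Enumerate all eight combinations of values.
table = {
    (t, r, c): joint(t, r, c)
    for t, r, c in product(p_t, ("yes", "no"), ("on", "off"))
}
total = sum(table.values())

# Marginally, rain and central heating are dependent (both respond to temperature).
p_rain = sum(p for (t, r, c), p in table.items() if r == "yes")
p_heat = sum(p for (t, r, c), p in table.items() if c == "on")
p_rain_and_heat = sum(p for (t, r, c), p in table.items() if r == "yes" and c == "on")
print(total, p_rain, p_heat, p_rain_and_heat)
```

With these numbers the product Pr(r = yes) Pr(c = on) differs from Pr(r = yes, c = on), so r and c are dependent when t is unknown, even though the factorisation makes them independent once t is given.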

The graph is shown in Figure 2. It is a directed graph, with the arrowheads pointing from parent to child.

[Figure 2: A Very Simple Recursive Graphical Model. Node t (temperature drop) is the parent of node r (rain) and node c (central heating).]

A recursive graphical model is represented by a directed acyclic graph and its joint probability distribution (which describes the behaviour of all the variables) can be expressed as a product of conditional distributions. There is one term for each node in the distribution, namely the conditional probability of a particular value of the node variable, given the values of its parent variables. Hence the joint distribution has the form:

$$\Pr(X_V) = \prod_{v \in V} \Pr\{X_v \mid X_a \text{ where } a \text{ is a parent of } v\}$$

There are also graphical models in which the edges of the graph are undirected. In such cases the absence of an edge, rather than its presence, is of most interest. The emphasis is on the independence of sets of variables. First of all consider the graph in Figure 3, the subset A = {a, b, c} and the subset B = {d, e}. It is clear that every path between a node in A and a node in B must go through the subset C = {f, g}. In such a case we say that C separates A and B. Now we can define the important structure: the I-map.

[Figure 3: An Undirected Graph Showing Separation. Node labels visible in the original figure: a, b, c, d, e, f, g, h, k.]
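Separation is easy to test mechanically: remove the nodes in C and check whether any node of B can still be reached from A. The sketch below does this with a breadth-first search. The edge list is hypothetical, chosen only so that {f, g} separates {a, b, c} from {d, e} as in the text; the actual edges of Figure 3 are not recoverable here.

```python
from collections import deque

def separates(graph, a_set, b_set, c_set):
    """Return True if every path from a_set to b_set passes through c_set."""
    blocked = set(c_set)
    seen = set(a_set) - blocked
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in b_set:
            return False  # reached B without crossing C
        for neighbour in graph[node]:
            if neighbour not in seen and neighbour not in blocked:
                seen.add(neighbour)
                queue.append(neighbour)
    return True

# Hypothetical edges (an assumption, not read from Figure 3).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"), ("c", "f"), ("b", "g"), ("f", "g"),
    ("f", "d"), ("g", "e"), ("d", "e"), ("e", "h"), ("h", "k"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

A, B = {"a", "b", "c"}, {"d", "e"}
print(separates(graph, A, B, {"f", "g"}))
print(separates(graph, A, B, {"f"}))
```

In this hypothetical graph, blocking only f leaves the path b, g, e open, so {f} alone does not separate A and B.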

If the undirected graph G is an I-map then it has the following property. Where A, B and C are subsets of variables, if C separates A and B then A and B are conditionally independent given C. In terms of information, if we already know the values of the variables in C, then knowing A gives us no more information about B, and vice versa. This property is known as the global Markov property. The global Markov property is a strong statement about a graph, and it turns out that a weaker property can often be used. This is the pairwise Markov property, which simply requires that Xa and Xb are conditionally independent given all the other random variables if there is no edge from node a to node b. If the distribution is strictly positive and pairwise Markov then it can be shown that it is global Markov (Clifford, 1990).

How are directed and undirected graphical models related? It turns out that it is possible to convert a directed (recursive) graph into an equivalent undirected graph by dropping all the directions on the edges and connecting all the parents of each child with edges. The graph thus formed is called the moral graph of the DAG. It can be shown by using properties of the moral graph that all recursive models on DAGs are global Markov.

There is a very large literature on the subject of graphical models, which has been approached from the point of view of both statistics and artificial intelligence. Russell and Norvig (1995) provide a good introduction from the point of view of artificial intelligence, and Ripley (1996) gives a more mathematical treatment. For a comprehensive treatment from the statistical point of view the reader is invited to refer to Whittaker (1990), in which both directed and undirected graphs are treated.

It is probably true to say that on the whole statisticians prefer to deal with undirected graphs, as capturing the independence properties of the system under consideration, and to avoid notions of causality as being rather abstract and not to be inferred directly from a graphical structure. However, if there is a good understanding of the underlying real-world situation, as in the simple example shown above, then it may be appropriate to put directions on the edges in the graph to indicate cause-effect relationships. For example, Druzdzel and Simon (1993) have argued that if the variables can be regarded as structural parameters in some sort of mechanism, then a probabilistic net in which the variables are connected in a directed acyclic graph can be created to represent the behaviour of the mechanism.
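The moralisation step described above (drop the directions and connect the parents of each child) can be sketched in a few lines. The child-to-parents dictionary is an assumed representation, and the nodes x, y and z are hypothetical additions to the home example, included only to show a case in which two parents become connected.

```python
from itertools import combinations

def moral_graph(parents):
    """Return the undirected edges of the moral graph of a DAG.

    `parents` maps each node to the list of its parents.
    Each edge is a frozenset of two node names.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))   # drop the direction
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))       # connect co-parents
    return edges

# The home example (t is the parent of r and c) plus a hypothetical
# node z with two parents x and y.
dag = {"t": [], "r": ["t"], "c": ["t"], "x": [], "y": [], "z": ["x", "y"]}
moral = moral_graph(dag)
print(sorted(sorted(edge) for edge in moral))
```

For the home example alone no extra edges appear, because no node has more than one parent. The edge between x and y is added only because they share the child z.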
The simple example can be readily generalised to more complex nets involving more variables, and many other examples in which causes and effects can be linked in this way have been used to good effect in the statistical and artificial intelligence literature. In their classic paper Lauritzen and Spiegelhalter (1988) illustrate how such nets can be used as expert systems, and Spiegelhalter et al. (1993) also apply the theory to medical diagnosis. The prime advantage of having such a representation is that it reduces the computational effort in using Bayes' law for diagnosis. The brute force effort involved in computing the conditional probability of unobserved variables, given the known values of observed variables, is cut down considerably by judicious use of the known properties of the structure. See our later discussion on deduction.

5. GENERATING PROBABILISTIC NETS FROM DATABASES


In many cases the structure of the graphical model is not known in advance, but there is a database of information concerning the frequencies of occurrence of combinations of different variable values. In such a case the problem is one of induction: to induce the structure from the data. Heckerman (1996) has a good description of the problem. There has been a lot of work in the literature in the area, including that of Cooper and Herskovits (1992). Unfortunately the general problem is NP-hard (Chickering and Heckerman, 1994). For a given number of variables there is a very large number of potential graphical structures which can be induced. To determine the best structure, in theory one should fit the data to each possible graphical structure, score the structure, and then select the structure with the best score. There are $2^{k(k-1)/2}$ distinct possible independence graphs for a k-dimensional random vector: this translates to 64 probabilistic models for k = 4, and 32,768 models for k = 6. Consequently algorithms for learning networks from data are usually heuristic once the number of variables gets to be of reasonable size. In the NETEXTRACT project several different algorithms were tested using data supplied by Nortel.

5.1 THE CHOW AND LIU ALGORITHM

In this algorithm, due to Chow and Liu (1968), the mutual information between pairs of variables is calculated and those variables with the highest value are connected. The algorithm continues with successive elimination of variables. It has the advantage of simplicity but generates only tree structures. The mutual information between two variables, a and b, is defined as:

$$I_{ab} = \sum_{i,j} \Pr(a = i, b = j) \log \frac{\Pr(a = i, b = j)}{\Pr(a = i)\,\Pr(b = j)}$$
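The mutual information can be estimated directly from observed counts by replacing each probability with a relative frequency. The sketch below assumes the data arrive as a list of (a, b) value pairs and uses natural logarithms; both choices are assumptions, since the paper does not state the data layout or the base of the logarithm.

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Estimate the mutual information of two variables from (a, b) observations.

    Probabilities are replaced by relative frequencies. Combinations that
    never occur contribute nothing to the sum and are simply skipped.
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )

independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
identical = [(0, 0), (1, 1), (0, 0), (1, 1)]
print(mutual_information(independent))
print(mutual_information(identical))
```

Two variables that vary independently score zero, while two balanced binary variables that always agree score log 2, the largest value possible for a binary pair.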

It can be shown that if the independence graph is a tree (i.e. if there are no cycles in the I-map) the best fit to the data is obtained by the Chow and Liu algorithm. The main advantage of using tree-structured graphs is that they are simple to compute. The probability tables can be calculated directly from the marginal counts of the variables - maximum likelihood estimators are obtained by simply computing the pairwise marginal counts of the variables.
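The usual formulation of Chow and Liu's method is a maximum-weight spanning tree over the pairwise mutual information values. The sketch below follows that textbook formulation using Kruskal's procedure with a small union-find; it is not the adaptation used in NETEXTRACT, and the variable names and data rows are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Mutual information of two equally long columns, from relative frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def chow_liu_tree(names, rows):
    """Return tree edges (u, v, weight), strongest first."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    weighted = sorted(
        ((mutual_information(columns[u], columns[v]), u, v)
         for u, v in combinations(names, 2)),
        reverse=True,
    )
    root = {name: name for name in names}  # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    tree = []
    for weight, u, v in weighted:
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge does not create a cycle
            root[ru] = rv
            tree.append((u, v, weight))
    return tree

# Invented data: alarm B always matches alarm A, alarm C is only weakly related.
names = ["A", "B", "C"]
rows = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
tree = chow_liu_tree(names, rows)
for u, v, weight in tree:
    print(u, v, round(weight, 3))
```

The edge weights returned here correspond to the edge widths drawn in a graph such as the first one in Figure 4, where a wider edge indicates higher mutual information.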

5.2 BIFROST

BIFROST (Hojsgaard and Thiesson, 1995) generates chain graphs, a particular form of probabilistic net, allowing for undirected links between variables at the same level in the chain. The user has various choices for deciding on goodness of fit, and can apply penalty functions for over-complex graphs. BIFROST was found to give good results. In the absence of explicit knowledge of the causal relationships between the variables it was assumed that all variables were at the same level in the chain. The main advantage of using BIFROST (or CoCo, its inference engine; see Badsberg, 1991) is that it is able to fit graphical models which are not fully triangulated, that is, graphs containing a cycle of length at least four which lacks a chord. The maximum likelihood estimators for the probability tables in such graphs cannot be computed directly but must be estimated using an iterative method such as the iterative proportional fitting algorithm.

5.3 CAEGA

This algorithm is a genetic algorithm in which good networks are bred in order to produce new, potentially better nets. It was found to give good solutions and was selected as an innovative algorithm for use in NETEXTRACT. The algorithm takes as input a modified form of the Bayesian Belief Network Interchange Format suggested by Heckerman. In this format each variable is specified with its name, a description, the number of values that it can have, and a descriptor for each possible value. In the data table each tuple is replaced with a vector of integers, one integer for each value in the tuple, and the number of occurrences of each distinct vector is also recorded. This format considerably reduces the storage requirement for the input file. The genetic algorithm also accepts a user-specified list of net structures (defined by child-parent lists) which may have been generated by other algorithms such as OMI or one of the BIFROST procedures.

The algorithm works in the following way. Internally, breeders are specified by a square matrix of binary values in which a unit value indicates a parent-child relationship between two variables and a zero indicates that there is no direct relationship. An initial breeding stock is created from the breeders specified in the input data file and by random creation of graphs, which represent the net structure. Because the graph must be acyclic, not all binary matrices define valid graphs, and any invalid matrices created by breeding or mutation are pruned until they are valid. At each generation pairs of breeders are selected randomly to exchange genetic material and thereby create new breeders. Others are selected for random mutations. Scoring was done using the posterior probability function of Cooper and Herskovits (1992). These models suffer from a problem of overfitting (the saturated model, in which one node is picked arbitrarily and taken as a child of all the others, is a perfect match to the data), so some penalty must be applied to the score. In our implementation the penalty effect is obtained by limiting the number of parents each child may possess. The algorithms all produce output files containing belief net specifications in the standard format.
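The breeder representation and the repair step can be sketched as follows. The paper does not say how invalid matrices are pruned or how genetic material is exchanged, so the choices here are assumptions: pruning removes randomly chosen edges, crossover copies each child's parent set from one of the two breeders, and mutation flips a single off-diagonal entry. Scoring is omitted.

```python
import random

def is_acyclic(matrix):
    """matrix[p][c] == 1 means p is a parent of c. Uses repeated removal of parentless nodes."""
    n = len(matrix)
    indegree = [sum(matrix[p][c] for p in range(n)) for c in range(n)]
    ready = [c for c in range(n) if indegree[c] == 0]
    visited = 0
    while ready:
        p = ready.pop()
        visited += 1
        for c in range(n):
            if matrix[p][c]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
    return visited == n

def prune(matrix, max_parents, rng):
    """Repair a breeder in place: enforce the parent limit, then remove edges until acyclic."""
    n = len(matrix)
    for i in range(n):
        matrix[i][i] = 0  # no self-loops
    for c in range(n):
        parents = [p for p in range(n) if matrix[p][c]]
        rng.shuffle(parents)
        for p in parents[max_parents:]:
            matrix[p][c] = 0
    while not is_acyclic(matrix):
        edges = [(p, c) for p in range(n) for c in range(n) if matrix[p][c]]
        p, c = rng.choice(edges)  # assumed pruning rule: drop a random edge
        matrix[p][c] = 0
    return matrix

def crossover(first, second, rng):
    """Assumed scheme: each child's parent set (one column) comes from one breeder."""
    n = len(first)
    child = [[0] * n for _ in range(n)]
    for c in range(n):
        source = first if rng.random() < 0.5 else second
        for p in range(n):
            child[p][c] = source[p][c]
    return child

def mutate(matrix, rng):
    """Flip one randomly chosen off-diagonal entry in place."""
    p, c = rng.sample(range(len(matrix)), 2)
    matrix[p][c] ^= 1
    return matrix

rng = random.Random(0)
forward = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]   # 0 -> 1 -> 2 -> 3
backward = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # 3 -> 2 -> 1 -> 0
child = prune(mutate(crossover(forward, backward, rng), rng), max_parents=2, rng=rng)
print(child)
```

Mixing columns from the two chains can produce a cycle such as 0 -> 1 -> 0, which is why every offspring passes through `prune` before it joins the breeding stock. The `max_parents` argument plays the role of the complexity penalty described in the text.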

6. RESULTS OF INDUCTION
Typical results of the application of the algorithms described are shown below. The data shown results from an overnight run of automated testing, but does not show dependencies on underlying faults. About 12,000 individual events were recorded. The upper graph in Figure 4 shows a directed graph generated by an adaptation of Chow and Liu's algorithm, in which the width of the edge between two nodes (alarms) is proportional to the mutual information of the two variables. It can be seen that PPI-AIS and LP-EXC have the strongest relationship, followed by the relationship between PPI-Unexp_Signal and LP-PLM. Note that the directions of the arrows are not important as causal indicators, but variables sharing the same parents do form a group. The second graph in Figure 4 (created using BIFROST) shows an edge as stronger the longer it survives as the models become progressively less complex, that is, as the penalty for complexity increases. It can be seen that the broad patterns are the same in the two graphs but that the weaker edges differ. In the second graph the node NE-Unexpected_Card shows links to three other nodes, whereas it has no direct links in the first graph.

Figure 4: Two Graphs Output from Alarm Test Data by NETEXTRACT

7. DEDUCTION
Bayesian nets have the important property that they simplify the operation of belief updating (this corresponds to conditioning in the probabilist's world). The deduction component in NETEXTRACT allows users to explore the properties of the nets they have extracted and, if desired, to use them as expert systems. The deduction module reads in a network specification file and then carries out various internal operations. It can cope with multiply connected networks by identifying cliques of the network's moral graph and creating local probability tables on these cliques, and also on the connectors between the cliques. Whenever the value of a node is changed, the effect is propagated as messages around the junction tree of cliques. The user is able to alter the value of a node and thereby view the set of attributes (nodes) with updated probabilities for their attribute values. The user interface for NETEXTRACT has been developed in Java. Users can alter values of nodes and see how the effects of the changes propagate around the network. They can view the marginal probabilities of states of nodes that have not been directly observed but which are influenced by other variables.
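Belief updating can be illustrated on the three-variable home example of Section 4. The sketch below conditions by brute-force enumeration of the joint distribution, which is feasible only for very small nets; it is not the junction tree propagation used by the deduction module, and the probability values are invented for illustration.

```python
from itertools import product

# Illustrative tables for the home example; the numbers are assumptions.
VALUES = {"t": ("cold", "warm"), "r": ("yes", "no"), "c": ("on", "off")}
P_T = {"cold": 0.3, "warm": 0.7}
P_R = {"cold": {"yes": 0.6, "no": 0.4}, "warm": {"yes": 0.2, "no": 0.8}}
P_C = {"cold": {"on": 0.9, "off": 0.1}, "warm": {"on": 0.1, "off": 0.9}}

def joint(state):
    """Pr(t, r, c) = Pr(r | t) Pr(c | t) Pr(t) for one full assignment."""
    t, r, c = state["t"], state["r"], state["c"]
    return P_R[t][r] * P_C[t][c] * P_T[t]

def posterior(query, evidence):
    """Distribution of `query` given observed values, by summing the joint."""
    scores = {value: 0.0 for value in VALUES[query]}
    names = list(VALUES)
    for combo in product(*(VALUES[name] for name in names)):
        state = dict(zip(names, combo))
        if all(state[k] == v for k, v in evidence.items()):
            scores[state[query]] += joint(state)
    total = sum(scores.values())
    return {value: score / total for value, score in scores.items()}

prior_rain = posterior("r", {})
updated_cold = posterior("t", {"c": "on"})
updated_rain = posterior("r", {"c": "on"})
print(prior_rain, updated_cold, updated_rain)
```

Observing that the heating is on raises the belief that it is cold, and through t it also raises the belief in rain, even though there is no edge between r and c. This is the kind of indirect effect that a user of the deduction module sees when altering the value of a node.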

8. CONCLUSIONS
The association between the University of Ulster and Nortel in the area of fault analysis has continued with the Garnet project. Garnet is using the techniques developed in NETEXTRACT to develop useful tools in the area of live testing. The basic idea is to regard the Bayesian nets as abstract views of the test network's response to stimuli. In a further development, the Jigsaw project entails the construction of a data warehouse for the storage of the test data, with a direct link to data mining algorithms. We propose to incorporate the NETEXTRACT algorithms into a data mining framework as described elsewhere by Anand et al. (1996, 1997). Although this report has described the work completed with reference to Nortel's telecommunications network, the NETEXTRACT architecture is generic in that it can extract cause-and-effect nets from large, noisy databases, with data arising in many areas of science and technology, industry and business, and in social and medical domains. A hypothesis corresponding to the aim of this project could be proposed: that cause and effect graphs can be derived to simulate domain experts' knowledge and even extend it.

The authors would like to thank Nortel Networks and the EPSRC for their support during the project.

REFERENCES
Anand S.S., Bell D.A., Hughes J.G., 1996. EDM: A General Framework for Data Mining based on Evidence Theory. Data & Knowledge Engineering Journal, 18, pp 189-223.
Anand S.S., Scotney S.W., Tan M.G., McClean S.I., Bell D.A., Hughes J.G., 1997. Designing a Kernel for Data Mining. IEEE Expert.
Badsberg J.H., 1991. A Guide to CoCo. Technical Report R 91-43, Institute for Electronic Systems, Aalborg University.
Chickering D.M. and Heckerman D., 1994. Learning Bayesian Networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research, Microsoft Corporation.
Chow C.J.K. and Liu C.N., 1968. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, Vol IT-14:3, pp 462-467.
Clifford P., 1990. Markov Random Fields in Statistics. In Grimmett G.R. and Welsh D.J.A. (Eds.), Disorder in Physical Systems: A Volume in Honour of John M. Hammersley. Clarendon Press, Oxford.
Cooper G.F. and Herskovits E., 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9, pp 309-347.
Druzdzel M.J. and Simon H.A., 1993. Causality in Bayesian Belief Networks. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, pp 3-11.
Heckerman D., 1996. Bayesian Networks for Knowledge Discovery. In Fayyad U.M., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, pp 273-305.
Hojsgaard S. and Thiesson B., 1995. BIFROST: Block Recursive Models Induced from Relevant Knowledge, Observations and Statistical Techniques. Computational Statistics and Data Analysis, Vol 19, pp 155-175.
Lauritzen S.L. and Spiegelhalter D.J., 1988. Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems. J. Roy. Statist. Soc. B, 50, pp 157-224.
McClean S.I. and Shapcott C.M., 1997. Developing Bayesian Belief Networks for Telecommunications Alarms. Proceedings of Conference on Causal Models and Statistical Learning, pp 123-128, UNICOM, London.
Microsoft Decision Theory and Adaptive Systems Group, 1996. Proposal for a Bayesian Network Interchange Format. Internal Microsoft Document.
Moore P., Shao J., Adamson K., Hull M.E.C., Bell D.A., Shapcott M., 1996. An Architecture for Modelling Non-Deterministic Systems using Bayesian Analysis. Proceedings of the Fourteenth IASTED International Conference on Applied Informatics, pp 254-257.
Ripley B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, England.
Russell S. and Norvig P., 1995. Artificial Intelligence: A Modern Approach. Prentice-Hall International, London.
Sterritt R., Daly M., Adamson K., Shapcott M., Bell D.A., McErlean F., 1997a. NETEXTRACT: An Architecture for the Extraction of Cause and Effect Networks from Complex Systems. Proceedings of the 15th IASTED International Conference on Applied Informatics, pp 55-57.
Sterritt R., Adamson K., Shapcott C.M., Bell D.A., McErlean F., 1997b. Using A.I. for the Analysis of Complex Systems. Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing (in co-operation with AAAI), pp 113-116.
Sterritt R., Adamson K., Shapcott M., Wells N., Bell D.A., Liu W., 1997c. P-CAEGA: A Parallel Genetic Algorithm for Cause and Effect Networks. Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing (in co-operation with AAAI), pp 105-108.
Whittaker J., 1990. Graphical Models in Applied Multivariate Statistics. Wiley, Chichester, England.
