Ciriello - A Review On Models and Algorithms For Motif Discovery in Protein-Protein Interaction Networks

B RIEFINGS IN FUNC TIONAL GENOMICS AND P ROTEOMICS . VOL 7 . NO 2.
147^156
doi:10.1093/bfgp/eln015
A review on models and algorithms for motif discovery in protein^protein interaction networks
Giovanni Ciriello and Concettina Guerra
Advance Access publication date 28 April 2008
Abstract
Several algorithms have been recently designed to identify motifs in biological networks, particularly in protein ^ protein interaction networks. Motifs correspond to repeated modules in the network that may be of biological interest. The approaches proposed in the literature often differ in the definition of a motif, the way the occurrences of a motif are counted and the way their statistical significance is assessed. This has strong implications on the computational complexity of the discovery process and on the type of results that can be expected. This review presents in a systematic way the different computational settings outlining their main features and limitations. Keywords: network motifs; PPI networks; computational biology; graph mining
Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013
INTRODUCTION
Much data has become recently available about proteinprotein interactions (PPI) in selected organisms [13]. The interactions have been determined by a number of experimental [46] and computational [79] techniques designed to systematically identify both the candidate and actual protein interactions. A PPI network is modelled by a graph whose nodes correspond to proteins and edges to interactions between two proteins. Much work has been devoted to the characterization and modelling of biological networks in terms of global properties of the underlying graphs, such as diameter, degree distribution and clustering coefficient. Recently, there has been a shift of attention of researchers from global properties to local properties that provide a concise description of a network in terms of appearances of a given number of small subgraphs. It has been shown that PPI networks exhibit interesting features in terms of repeated occurrences of certain modules and that these modules have biological meaning, since they may represent evolutionary conserved topological units of cellular networks. Thus, much attention has been paid to the
identification of small subgraphs, particularly those occurring significantly often within the biological networks. These subgraphs, referred to as network motifs or simply motifs, can be considered as basic building blocks of complex networks [10]. The PPI data available are certainly incomplete, thus many false negatives affect the studies on these networks. Nevertheless, the current research effort on the characterization of biological networks may ultimately aid in understanding the cell organization. This review focuses on the computational aspects of the motif discovery process in biological networks. Although we will consider PPI networks in this review, most of the algorithms that will be reviewed have been applied also to other types of biological networks, such as transcriptionregulation [1114], and to networks of different nature, as social [15, 16] and WWW [17]. A discovery algorithm typically consists of three steps: (i) find the occurrences of subgraphs of a given size in a graph; (ii) determine which subgraphs are isomorphic and group them into classes, sometimes referred to as graphlets; and (iii) determine the statistical significance of the graphlets by comparing their frequencies to
Corresponding author. Giovanni Ciriello, Department of Information Engineering, University of Padova Via Gradenigo 6a, 35131 Padova, Italy. Tel: 39 049 827 7928; Fax: 39 049 827 7799; E-mail: ciriello@dei.unipd.it Giovanni Ciriello is a PhD student in Computer Science at the University of Padova, Italy. Concettina Guerra is a Professor at the University of Padova, Italy, and at Georgia Institute of Technology, Georgia, USA.
The Author 2008. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oxfordjournals.org
148
Ciriello and Guerra

efficient approaches exist for small subgraphs. Canonical labelling of a graph has been successfully used in heuristic approaches to graph isomorphism to reduce the computation time. As already said, graph isomorphism is a required step in most motif-discovery algorithms for counting the occurrences of subgraphs and distinguishing non-isomorphic subgraphs. The canonical label is a unique label associated to a graph that is independent on the permutation of the nodes, and therefore invariant to isomorphism. A graph with n nodes has up to n! different adjacency matrices, obtained from the set of permutations of the n nodes. A code is associated to each adjacency matrix, M, defined as the string of values obtained by the concatenation of the rows of the lower triangular sub-matrix of M, including the diagonal, i.e. code(M) m11m12m22. . .m1n. . .mnn. Note that M is symmetric because G is assumed to be an indirect graph. The canonical labelling of the graph G denoted by cl(G) is the labelling that generates the code that is maximal over all adjacency matrices of G according to a given order among strings, e.g. lexicographic order. A property of this labelling is that, given two subgraphs G1 and G2, cl(G1) cl(G2) iff G1 is isomorphic to G2. Thus, determining whether two graphs have the same canonical label is as difficult as the graph isomorphism problem is. For specific instances of graphs, efficient heuristics can be set up to reduce the search space of canonical labelling. One such heuristics consists of grouping the nodes of each graph based on vertex invariant properties such as degree and/or labels [19] and then restricting the permutations to nodes within the same group. A generalization of standard-canonical labelling applied to PPI networks is defined in [20] where labels are attached to all nodes of a graph. The label of node vi depends on its degree mi and the distances dij to all connected vertices j. Specifically:
labelvi X
j
those of an ensemble of random networks. In Graphs: definitions section, we give basic definitions of graphs and graph isomorphism. In Network motifs: definitions and properties section, we present various definitions of motifs, concentrating on different ways of counting subgraphs, and show how they affect the computational complexity of the discovery process. Models of biological networks are reviewed in Model of random graphs section, since they provide a framework for the evaluation of the statistical significance of the obtained subgraphs. Finally, in Motif discovery algorithms section, we give a short outline of some of the algorithms proposed in the literature to address this problem. We conclude with a short survey of the literature on the controversial biological interpretation of motifs.
GRAPHS: DEFINITIONS
In this section we will briefly review basic definitions of graphs. A graph is a pair G (V, E ) where V {v} is the set of nodes or vertices and E {(u, v)} is the set of edges each connecting a pair of nodes. A subgraph GS (VS, ES) of G (V, E ) consists of a subset of nodes VS " V and of a subset of edges ES " E connecting the nodes of VS in the original graph. The graph induced by VS is the subgraph GS that includes all the edges of G which connect the vertices of VS. In most of the algorithms we will review, the graphs are represented by adjacency matrices. Assume that the nodes of a graph are numbered in some arbitrary order from 1 to n, that is V {v1, . . . ,vn}. The adjacency matrix M {mij}, i, j 1, . . . ,n, is then defined as a two-dimensional square matrix whose element mij is 1 if there is an edge connecting nodes vi and vj and 0 otherwise. Two graphs G1 and G2 are isomorphic if there is a one-to-one mapping f between vertices of G1 and those of G2 with the property that two vertices u and v of G1 are adjacent if and only if f(u) and f(v) are adjacent in G2 [18]. In other words, two graphs are isomorphic if it is possible to permute the vertices so that G1 coincides with G2. Determining graph isomorphism is not known to be either in P or in NP-complete. On the other hand, given two graphs G and H determining whether G contains a subgraph isomorphic to H is an NP-complete problem, thus no polynomial time algorithm is known to solve it. In practice, several
2ddij mi :
where d is the graph diameter, i.e. the max length of a graph path. The product of all these labels provides a code for the entire graph. Although such a code does not uniquely characterize a graph, i.e. nonisomorphic graphs may have the same code, in practice it allows to distinguish between graphs of size up to 8 quite well.
Motif discovery in PPI networks
149
NETWORK MOTIFS: DEFINITIONS AND PROPERTIES

Two distinct definitions of a motif have been used in the literature based on frequency and statistical significance. Definition 1. A motif is a subgraph that appears more than a threshold number of times. Definition 2. A motif is a subgraph that appears more often than expected by chance. A motif of size k, i.e. containing k nodes, is called k-motif. A motif according to definition 2 is often said to be over-represented. It is possible that an overrepresented motif does not occur more than the threshold number of times required by definition 1 and vice versa. In the following sections, we will specify how to count the frequency of a graphlet, i.e. the number of its occurrences, and how to evaluate its significance, i.e. the degree of surprise of such frequency. We stress that another important distinction among the different definitions of a motif depends on whether or not the subgraph has to be induced, i.e. it must contain all edges that connect the nodes of the subgraph in the original graph. The number of possible k-motifs grows very fast with k. This is one of the reasons why only small size k-motifs have been studied in PPI networks. To overcome in part this problem two different types of motifs are defined in [21]: (i) subgraphs that are generated by a random walk of a given length (8 in [21]) and 2. subgraphs with a given number of edges. These choices allow to extend the exact search to larger subgraphs (k 14). An alternative definition of motif is also provided in [22]: here a subgraph alignment is performed across a set of small graphs in order to extract an average structure called consensus pattern. This definition seems to be less affected by the incompleteness of the data, on the other hand an alignment among many different subgraphs may lose interesting informations.
Figure 1: Given the graph G let a and b be two subgraphs, respectively, of size 3 and 4. Their frequencies vary according to whether or not overlap is allowed in counting.
Frequency
The frequency of a subgraph is strictly related to the kind of node and edge overlap that is allowed in counting the occurrences of the subgraph [23, 24], and this affects in a significant way both the results and the complexity of the counting. Indeed, a distinction can be made between
non-identical subgraphs, i.e. subgraphs that differ in at least one node or one edge and edge-disjoint subgraphs, i.e. subgraphs that do not share any edges. This distinction is crucial in the counting. For example, for the network G of Figure 1, the counting of all edge-disjoint subgraphs leads to only one occurrence of both subgraphs a and b. However, if we count all non-identical subgraphs, subgraph b turns out to appear six times while subgraph a still only once. This na ve example well explains the most important difference between the two ways of counting: the edge-disjoint counting leads to a downward closed frequency, i.e. the frequency of a subgraph monotonically decreases with its size. The downward closure property does not hold for the case of non-identical counting. In other words, if arbitrary overlaps are allowed, a subgraph of size k can be, and in general is, more frequent than the subgraphs of size <k it contains. The downward closure is a highly desirable property from a computational point of view: first, it reduces considerably the number of subgraphs to process; second, as we will see later, it allows to use the A-priori algorithm to further prune the number of subgraphs to be examined. A further distinction is made in [24] where the case of only overlaps of nodes is considered. In [24] indeed three frequency concepts, denoted by F 1, F 2 and F 3, are identified: F 1: arbitrary overlaps of nodes and edges (nonidentical case), F 2: only overlaps of nodes (edge-disjoint case), F 3: no overlaps (edge and vertex-disjoint case). It has to be noticed that the downward closure property of the frequency holds for both F 2 and F 3, while it does not for F 1. Hence the definition of the frequency of a subgraph is a key point when designing algorithms for
150
Ciriello and Guerra
motif discovery. Several approaches proposed by the computational biology community have considered arbitrary overlaps of motifs [10, 22, 2532]. In PPI networks proteins often participate into different modules where they perform different functions; thus it is conceivable that subgraphs with overlapping vertices or edges are simultaneously active at any time. In contrast, in the data mining literature, most algorithms consider appearances of subgraphs that do not share edges nor vertices [23, 24, 3335]. In such context, the search for recurring subgraphs is performed not in a single graph, as we have discussed so far, but in a dataset consisting of many graphs, also called transactions. The frequency of a subgraph is the number of graphs of the dataset that contains it. If this number is above a given threshold, the subgraph is denoted as a motif. Thus one can distinguish between a single-graph setting, as we have considered so far for biological networks, and a graph-transaction setting [23]. An attempt at combining these two settings is given by the algorithm NeMoFINDER for motif discovery in PPI networks [25], where a different definition of motif is used. A motif is a subgraph (not necessarily an induced subgraph) that is frequent and unique. A subgraph is frequent if it occurs more than tf times in the real PPI network. A frequent subgraph is unique if its frequency in the real network is greater than that of at least tu random networks among an ensemble of n random networks, where tf , tu and n are given parameters. The requirement that the frequency in the real network has to be greater than tf excludes the possibility that a relatively unfrequent motif may turn out to be overrepresented. An important characterization of a motif is that of maximality [28, 36]. A motif GS is a maximal motif if it is not contained in any subgraph of G that is a motif, i.e. none of the super graphs of GS are frequent. Detecting only maximal motifs has the advantage of a reduced computation time since a smaller number of subgraphs are produced and stored. The concentration is another interesting measure associated to a subgraph that has been used mostly in conjunction with sampling methods for motifs detection. For a given subgraph type Gk of size t the P concentration is given by C(Gk) N(Gk)/ iN(Gi), where PN(Gk) is the number of occurrences of Gk and iN(Gi) is the total number of occurrences of all the subgraphs Gi of the same size t as Gk.
Statistical significance
We now turn to the definition of statistical significance of the frequency of a subgraph. According to definition 2, motifs have to be overrepresented, i.e. appear significantly more often than what would be expected by chance. The overrepresentation of a subgraph is established on the basis of its frequency compared to the average frequency of the same subgraph in a set of random networks. The obtained values of the frequencies for the observed and random networks are compared through either the Z-score or the abundance () as defined in [10, 27]:
freal Gk f rand Gk stdfrand Gk freal Gk f rand Gk Gk freal Gk f rand Gk ZGk
where freal(Gk) is the frequency of the subgraph Gk into the real network, while f rand Gk and std(frand(Gk)) are the average and the standard deviation of the frequency of Gk in the set of random networks, respectively. The value is a small constant that prevents from growing too much when the number of occurrences of the subgraph is very small in both the real and random networks. The Z-score is high if the subgraph is overrepresented, negative if it is under-represented and close to zero otherwise. The abundance varies between 1 (under-represented) and 1 (overrepresented). Even though rarely used in the biological literature, the measure of abundance could be more practical than the Z-score when evaluating and comparing the appearances of small subgraphs in different networks, such as biological, social and internet networks, since it is less affected by the different network sizes. An interesting characterization of networks was introduced in [27] based on network motifs: the significance profile (SP). Basically the SP of a network is a vector of normalized Z-scores that gives a histogram of the subgraph over-representation values over a network; the Z-scores are normalized to the unit length such that negative values, especially those close to 1, are associated to under-represented subgraphs, while positive ones, especially those close to 1, allow to recognize the motifs. The use of Z-score is criticized when indiscriminately applied to scientific or real world problems irrespective of whether or not they have a Gaussian behaviour. A popular and entertaining book on

randomness [37] lists several real life examples with non-Gaussian properties that are mistakenly interpreted in terms of Z-score. In the biological literature on networks a similar concern has been raised. In [38] it is observed that many network properties are not Gaussian and their distribution should be instead estimated from data. Thus a method from machine learning, namely kernel density estimation and cross validation [39], is proposed to evaluate the statistical significance of motifs. Once the distribution is learnt from sample data, the deviation is measured as the likelihood that an observation was drawn from random networks.
151
MODELS OF RANDOM GRAPHS

An important issue at this point is how to model random networks to be used for comparison and evaluation of motif occurrences. There is a large body of literature on network models [15, 16, 4049] in different fields from physics to sociology to internet and biology. An extensive mathematical-oriented survey of network models can be found in [50] where the Poisson random graphs, scale-free, exponential and Markov graphs are reviewed and their relations to various global and local network properties are discussed in depth. A much cited work by Barabasi and Oltvai [51] provides a short introduction to network models and their properties with focus on the principles that govern the growth and evolution of biological networks, especially scale-free networks. We will briefly review the random graph models that are relevant to motif discovery. In the pioneering work of [10, 27, 52] random networks preserve the same degree distribution of biological networks. Degree distribution is perhaps the most important global property of a biological network. The degree distribution P(k) gives the probability that a node has degree k, i.e. has k adjacent nodes. Much work have been devoted to the characterization of biological networks in terms of degree distribution. In [51] biological networks and PPI networks are described as having a power-law degree distribution, i.e. P(k) $ k , with 2 < < 3. These networks are known as scale-free networks. The power-law distribution differs from the Poisson degree distribution typical of ErdosRenyi (ER) random networks [43, 44] where the probability P of two nodes being connected is uniform across the network. In ER random networks, most nodes have a degree close to the average degree of the network and the
probability P(k) decreases exponentially with k. In contrast, in a network with powerlaw degree distribution there is a long tail in the distribution implying a higher probability of nodes with high degrees with respect to ER models. In [29], an extension of the idea of a random graph model preserving degree sequence is proposed for the search of n-node motifs. The random graph preserves the distribution of the (n 1)-node subgraphs of the original graph. In their implementation n 4. The same random graph model is suggested in a recent paper [53] to evaluate motifs of size up to 8. However, the generated random graphs preserve the distribution of the 3-node subgraphs only, since it is found computationally infeasible to preserve the subgraph distribution for larger values of n. Recently, a new model of PPI networks has been proposed [54, 55] based on the notion of geometric random networks [56] and Poisson distribution of the degrees. A random geometric graph G(n, r) with radius r has n nodes that are independently and uniformly randomly distributed in a metric space. An edge connects two nodes u, v of G if their distance is less than the radius r. The results in [54] show that, despite the fact that the degree distribution of currently available PPI networks tends to be powerlaw, as in the scale-free model, other global properties such as diameter and clustering coefficient seem to better fit the geometric model. The prediction in [54] is that as the number of new experimentally and computationally determined protein interactions will increase, the degree distribution of PPI networks will be better approximated by a Poisson distribution. A promising new research direction tries to improve the fit by incorporating node clustering into the model to account for different patterns of connectivity in different network modules, as in assortative mixing models [50]. One such scheme, called ERMG for Erdos-Renyi Mixtures for Graphs, is proposed in [57, 58] based on mixture distributions. As observed in [21], for most models a perfect fit with a real network can be obtained in terms of a single selected property, either local or global, by accurately tuning certain model parameters. However, none of the proposed models can reproduce all the statistical properties of a naturally occurring network. Before concluding this section, it is important to stress that a major problem in modelling arises from the incompleteness and noise
152
Ciriello and Guerra

starting with individual nodes and adding one node at a time until the required size k is reached. During construction, for each partial subgraph the algorithm keeps an auxiliary list of nodes that are candidates for future additions. Such list is dynamically updated: for each new node w added to the subgraph, a node is added to the candidate list if it has a label higher than the first node selected for the subgraph construction and if it is adjacent to w but not to any of the nodes already in the subgraph. As we will see later, ESU, when combined to sampling, allows the detection of subgraphs of size up to 14. Compact topological motifs The motif discovery problem is solved exactly in [28] where a motif is considered to be a subgraph that occurs at least t times. The algorithm extends the notion of maximality typical of string analysis [62] to graphs. Edge-maximality (vertex-maximality) ensures that no more edges (vertices) can be added to a motif without altering the occurrences of the motif in the graph. The combinatorial explosion of the motifs, due to arbitrary overlaps, is handled by the introduction of a compact graph representation, obtained by grouping together maximal sets of nodes that are indistinguishable. The notion of maximal indistinguishability is here presented by means of an example. Formal definitions are in the article. In Figure 2 the nodes of the set U1 are indistinguishable with respect to the set U2. The nodes within each set have the same label or colour; furthermore, there is an edge connecting each pair of nodes of U1 U2. The nodes in U1 may produce three distinct occurrences of a motif that includes also U2. The enumeration of all such occurrences can be done more efficiently if a compact representation is
of the PPI data. The impact of false positives and negatives has been recently studied in [59, 60] where models of random link removal that reproduce incomplete networks have been analysed. Their findings indicate no significant dependence of degree distribution on the parameters of the models. On the other hand, motif frequencies seem to be affected by link removal strength. Further investigation is certainly needed in this area.
MOTIF DISCOVERY ALGORITHMS

In this section, we will review some of the existing algorithms for motif discovery. This is a complex task from a computational point of view, since it involves graph and subgraph isomorphism. We first review exact approaches to identify motifs that are typically limited to 3- and 4-motifs. Then we survey approximate methods that extend the analysis to motifs of size up to 1214; they are either based on sampling or resort to heuristics that may help reduce the amount of computation. We also briefly review graph data mining approaches, based on the A-priori algorithm. Even though they have been introduced for a different problem, namely the search for a subgraph in a dataset of graphs, they can provide good insights into the motif discovery problem. Note that with the exception of the A-priori, the algorithms presented here adopt the F 1 frequency, consistently with the biological meaning discussed in Frequency section.
Exact algorithms
Exhaustive search Pioneering work on motif discovery was presented in [10, 29] where an exhaustive search for motifs is performed across networks of different nature: transcriptionregulation, neuron-synaptic connection, food webs, WWW. The algorithm has been applied also to compute the SP of the Saccharomyces cerevisiae network [31] and of various other different networks, including WWW and social networks [27]. This exhaustive recursive search ERS finds all connected induced subgraphs of 3 and 4 nodes. Edge and node overlaps of subgraphs are allowed. The input network is represented by an adjacency matrix M. The n-graphlets are extracted by enumerating all the n n submatrices induced by n connected nodes. ERS does not scale to motifs of size bigger than 4. A faster exact search algorithm is ESU [61] that incrementally builds subgraphs of a certain size k
Figure 2: The graph on the left shows the sets U1 and U2 mutually maximally indistinguishable, on the right the graph after the substitution of U1 and U2 as compact nodes and U1U2 as compact edge.

used for the nodes of U1. The resulting algorithm achieves a drastic reduction of the size of the output motifs.
153
Approximate algorithms
Search algorithms based on sampling A non-exact search was proposed by [26], where an algorithm based on sampling, called MFINDER, was developed. A sample subgraph of size k is extracted by picking at random edges of the input graph until a set of k nodes is obtained. The sampling procedure iteratively expands a subgraph by one by randomly picking a new edge in the set of edges adjacent to nodes already in the sample. When the sample reaches the required size, all the edges that connect its nodes in the original graph are added to it, thus obtaining an induced subgraph. Since some subgraphs have higher probability of being picked, an important part of the process is to assign weights to the samples to correct the non-uniform sampling. Then MFINDER computes the score of each type of subgraph of size k as the sum of the weights of all samples of the same type. Finally, the value of concentration defined in the previous section is used to estimate the significance of a motif. Interesting features of this work are the ability to extract rare motifs with high probability and the low number of samples needed to obtain satisfactory results (even with 5000 samples MFINDER obtains concentrations very similar to ERSs). This algorithm scales well with large networks since the runtime is independent of the network size. However, as noticed in [25], it does not scale well with large motifs (it takes about 2 h to extract the 8-motifs in the transcriptional network of Escherichia coli, with W V W 423 and W E W 519). One drawback of this approach is the extra time needed to compute the weights of all samples. This issue is addressed in [61, 63] where a new sampling method, Rand-ESU, is proposed to overcome this problem. Rand-ESU is based on the exact search algorithm ESU reviewed in the previous section. The recursive procedure ESU builds a tree whose leaves correspond to subgraphs of size k while internal nodes correspond to subgraphs of size 1 up to k 1, depending on the tree level. Rather than traversing the entire tree, the sampling procedure visits parts of it and guarantees that all leaves are visited with uniform probability. This is achieved by assigning to each level in the tree a probability that the nodes are further explored. Although this
approach is certainly faster and more intuitive, there is still the problem of how to empirically choose such probabilities. A simpler strategy that produces uniform sampling consists of randomly picking subsets of k-nodes [20] instead of edges. Obviously, there is a possibility that the picked nodes do not belong to a connected component and do not identify a subgraph, therefore. NeMoFINDER A more recent paper [25] proposes a combination of approaches developed within the data mining and computational biology communities. The algorithm, NeMoFINDER, applies to unlabelled undirected graphs. A subgraph needs not be an induced subgraph. Furthermore, a motif is frequent and unique, as described in Frequency section. NeMoFINDER adopts the idea in SPIN [36] to search for repeated trees and then extend them to subgraphs. In short, the procedure first builds the set Tk of frequent trees of size up to k that appear in G more than a given threshold number; then, based on such trees, the graph G is divided into a set GD of transactions or graphs, such that each tree in Tk can be embedded in at least one of these transactions. The frequency of a motif g is computed as the number of transactions of GD into which g can be embedded. To compute such frequency, NeMoFINDER first determines the trees of Tk that are embedded in at least tf transactions of GD, where tf is the given threshold. Then each such tree is expanded by one edge and the frequency of the obtained graphlet is computed. The definition of motifs adopted in the article leads to a reduction of the number of graphlets to be examined and, consequently, of the computation time thus enabling the discovery of larger motifs, but at the cost of missing some potentially interesting subgraphs. An extension of this work is in [64]. Subgraph counting by scalar computation An alternative interesting approach to characterize a biological network is by a set of measures (or an embedding space for a graph) based on scalars and functionals of the adjacency matrix A associated to the network [38]. The functionals are matrix multiplications of A and its transpose AT, corresponding to moving forward or backward in the graph, respectively. The scalars are the results of an operation, such as a sum, on either the diagonal terms or the off-diagonal terms of the matrix
154
Ciriello and Guerra

Perhaps the most important question is: Are motifs the result of evolutionary selection for function? This question relates to the other important question: Which mechanisms regulate PPI networks growth and evolution? We cannot treat these controversial issues in depth here and only mention a few interesting points. The selection of network motifs for their function is a theory proposed originally in [52] where the over abundance of certain motifs is interpreted as manifestation of evolutionary design principles. A criticism to this interpretation has been raised in [65] where it is asserted that the choice of the random network model of [52] is not adequate, and the null hypothesis is not well posed. The obtained motifs may be a consequence of some topological characteristics (such as spatial aggregation) present only in real networks and not in random networks, rather than a result of evolutionary design principles. While this comment poses an interesting issue related to the randomization process, it has been shown [29, 66] that random networks built under this additional topological constraint reveal a profile that is different from that of a real network. Another study in support of the theory of evolutionary conservation of motifs is in [67]. It shows how ortholog proteins from different species are often found in the same network motifs, indicating that motifs are evolutionary conserved topological units. This study has been conducted for proteins in S. cerevisiae and their orthologs in networks of five higher eukaryotes (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens). The over-representation of certain motifs in PPI networks, namely triangles and squares, has been also explained by the duplication-mutation model of network growth [6870]. In this probabilistic model, the network evolves by the addition of a new node that duplicates a randomly picked node of the original network and inherits all its links. Then, during evolution some of the new links may disappear and a new link between the copied and the original node may be added. Seven different evolutionary mechanisms are investigated in [21] to determine which model best reproduces the structure of the PPI network of D. melanogaster in terms of motifs. The result of the study suggests that the duplicationmutation-complementation mechanism shaped Drosophila melanogaster. Another interesting outcome of this work is that the ER random model
obtained by the application of the functionals. As a simple example, the sum of the diagonal terms of the product AT A2 gives the number of subgraphs of 3 nodes i, j and k with edges (i, j ), ( j, k) and (i, k). However, in most cases there is not an obvious meaningful correspondence of a scalar measure with a subgraph. In fact, the mapping between scalars and subgraphs is many-to-many. One of the major advantages of this approach, besides its mathematical elegance, is its computational efficiency when compared with other approaches, especially on dense scale-free networks. However, in the original paper [38] the time performance is only compared to that of the ERS and not to the latest versions of the exhaustive search, which are significantly more efficient. A-priori-based motif detection When no overlap of subgraphs in the original network is considered, the complexity of the discovery process can be handled more efficiently by a levelwise search, as in the A-priori algorithm for graphtransactions. In fact, in that case the downward closure property holds for the frequency. In other words, if a subgraph is frequent so are all its subgraphs. A typical levelwise search builds candidate motifs of size k by joining motifs of size k 1 and then evaluating their frequency. Two subgraphs of size k 1 are joined if they share a core of size k 2. The core can consist of a set of vertices [34] or a set of edges [23]. Then the process of growing subgraphs may proceed by either adding a node at a time or an edge at a time. The issues of how to join separate motifs to obtain larger candidate motifs and how to count their occurrence are far from trivial, since they involve subgraph isomorphism. However, this setting allows the development of efficient pruning schemes, based on the canonical labelling of the subgraphs [20, 35] reviewed in Graphs: definitions section. An overall significant speed-up over previous methods is achieved by this approach because fewer subgraphs need to be explored and stored.
CONCLUSIONS
This survey has focused on the algorithmic aspects of motif discovery and on the graph models that best describe biological networks. We have not considered so far some very important biological questions that motivate the search for motifs and explain the attention paid to this problem.

shows a better fit score than other popular growing graph mechanisms, such as preferential attachment [50, 51], when the analysis is restricted to the so-called core network, containing edges that have a high probability of occurrence in vivo [4]. In conclusion, caution should be taken when it comes to a biological interpretation because of the observational incompleteness and noise. Hopefully, future research on the discovery of larger composite motif combined with the availability of more reliable PPI data will provide answers to the very important biological questions.
155
Key Points
A comprehensive review of the Network Motif concept is provided on the basis of the different definitions of frequency and significance in computational biology and data mining. The importance of network motifs in biological networks, such as protein-protein interaction (PPI) networks, is related to their ability of being a manifestation of evolutionary design principles. The network motif discovery and classification problem is difficult since it involves the subgraph isomorphism problem. We analyze and compare several exact and approximate algorithmic approaches.
References
Han K, Park B, Kim H, et al. HPID: The Human Protein Interaction Database. Bioinformatics 2004;20(15):246670. 2. Mewes HW, Frishman D, Gruber C, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000; 28:3740. 3. Xenarios I, Rice DW, Salwinski L, et al. DIP: the database of interacting proteins. Nucleic Acid Res 2000;28(1):28991. 4. Giot L, Bader JS, Brouwer C, et al. A Protein interaction map of Drosophila melanogaster. Science 2004;302:172736. 5. Ito T, Chiba T, Ozawa R, et al. A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci 2001;98:456974. 6. Uetz P, Giot L, Cagney G, et al. A comprehensive analysis of proteinprotein interactions in Saccharomyces cerevisiae. Nature 2002;403:6237. 7. Aloy P, Russell RB. Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci USA 2002;99:5896901. 8. Jansen R, Yu H, Greenbaum D, et al. A Bayesian networks approach for predicting proteinprotein interactions from genomic data. Science 2003;302(5644):44953. 9. Marcotte EM, Pellegrini M, Ng HL, et al. Detecting protein function and proteinprotein interactions from genome sequences. Science 1999;285:7513. 10. Milo R, Shen-Orr S, Itzkovitz S, et al. Network motifs: simple building blocks of complex networks. Science 2002; 298:8247. 11. Alon U. Network motifs: theory and experimental approaches. Nature 2007;8:45061. 1.
12. Dobrin R, Beg QK, Barabasi A-L, et al. Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 2004;5:10. 13. Lee TI, Rinaldi NJ, Robert F, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002;298:799804. 14. Mazurie A, Bottani S, Vergassola M. An evolutionary and functional assessment of regulatory network motifs. Genome Biol 2005;6(4):R35. 15. Holme P, Edling CR, Liljeros F. Structure and timeevolution of n dating Internet community. Soc Networks 2004;26:15574. 16. Scott J. Social Network Analysis: A Handbook, 2nd edn. Sage Publications, London: 2000. 17. Kautz H, Selman B, Shah M. Referral web: combining social networks and collaborative filtering. Commun ACM 1997;40:635. 18. Ullman JR. An algorithm for subgraph isomorphism. JAssoc Comp Mach 1976;23(1):3142. 19. Read RC, Corneil DG. The graph isomorph desease. J Graph Theory 1977;1:33963. 20. Baskerville K, Paczuski M. Subgraph ensembles and motif discovery using a new heuristic for graph isomorphism. Phys Rev 2006;E 74:051903. 21. Middendorf M, Ziv E, Wiggins CH. Inferring network mechanisms: the Drosophila melanogaster protein interaction network. PNAS 2005;102(9):31927. 22. Berg J, Lassig M. Local graph alignment and motif search in biological networks. PNAS 2004;101(41):1468994. 23. Kuramochi M, Karypis G. Finding frequent patterns in a large sparse graph. In: SIAM International Conference on Data Mining (SDM04) 2004. SIAM, Lake Buena Vista, Florida, USA. pp. 24371. 24. Schreiber F, Schwo bbermeyer H. Frequency concepts and pattern detection for the analysis of motifs in networks. T rans Comput Syst Biol III 2005;LNBI 3737:89104. 25. Chen J, Hsu W, Lee ML, et al. NeMoFinder: dissecting genome-wide proteinprotein interactions with meso-scale network motifs. In: KDD 2006. ACM, Philadelphia, USA. pp. 106115. 26. Kashtan N, Itzkovitz S, Milo R, et al. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004; 20(11):174658. 27. Milo R, Itzkovitz S, Kashtan N, et al. Superfamilies of evolved and designed networks. Science 2004;303:153842. 28. Parida L. Discovering Topological motifs using a compact notation. J Comput Biol 2007;4(3):30023. 29. Shen-Orr SS, Milo R, Mangan S, et al. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 2002;31:648. 30. Schreiber F, Schwobbermeyer H. MAVisto: a tool for the exploration of network motifs. Bioinform Appl Note 2005; 21(17):35724. 31. Yeger-Lotem E, Sattath S, Kashtan N, etal. Network motifs in integrated cellular networks of transcription-regulation and proteinprotein interaction. Proc Natl Acad Sci 2004;101: 59349. 32. Zhang LV, King OD, Wong SL, et al. Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network. J Biol 2005;4:6.
156
Ciriello and Guerra

53. Grochow JA, Kellis M. Network motif discovery using subgraph enumeration and symmetry-breaking. In: RECOMB 2007, Lecture Notes in Computer Science 4453. Oakland, CA: Springer. pp. 92106. 54. Przulj N, Corneil DG, Jurisica I. Modeling interactome: scale-free or geometric? Bioinformatics 2004;20(18):350815. 55. Przulj N, Corneil DG, Jurisica I. Efficient estimation of graphlet frequency distributions in proteinprotein interaction networks. Bioinformatics 2006;22(8):97480. 56. Penrose M. Geometric Random Graphs. Oxford: Oxford Univeristy Press, 2003. 57. Lacroix V, Fernandes CG, Sagot MF. Motif search in graphs: application to metabolic networks. IEEE/ACMT rans Comput Biol Bioinform (TCBB) 2006;3(4):3608. 58. Daudin JJ, Lacroix V, Picard F, et al. Uncovering structure in biological networks. In: JOBIM 2006. Bordeaux, France. 59. Kuhnt M, Glauchec I, Greinerb M. Impact of observational incompleteness on the structural properties of protein interaction networks. Physica A 2007;373:75969. 60. Scholz J, Dejori M, Stetter M, Greiner M. Noisy scale-free networks. Physica 2005;A 350:62242. 61. Wernicke S. Efficient detection of network motifs. IEEE/ACM T rans Comput Biol Bioinform 2006;3(4):34759. 62. Apostolico A, Comin M, Parida L. Bridging lossy and lossless compression by motif pattern discovery. general theory of information transfer and combinatorics. Lect. Notes Comput Sci 2006;4123:793813. 63. Wernicke S, Rasche F. FANMOD: a tool for fast network motif detection. Bioinformatics 2006;22:11523. 64. Chen J, Hsu W, Lee ML, et al. Labeling network motifs in protein interactomes for protein function prediction. In: IEEE 23rd International Conference on Data Engineering (ICDE07) 2007. IEEE Computer Society, Istanbul, Turkey. pp. 54655. 65. Atzy-Randrup Y, Fleishman SJ, Ben-Tal N, et al. Comment on Network Motifs: Simple Building Blocks of Complex Networks and Superfamilies of Evolved and Designed Networks. Science 2004;305:1107c. 66. Milo R, Itzkovitz S, Kashtan N, etal. Response to comment on Network Motifs: Simple Building Blocks of Complex Networks and Superfamilies of Evolved and Designed Networks. Science 2004;305:1107d. 67. Wuchty S, Oltvai ZN, Barabasi A-L. Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet 2003;35:1769. 68. Ispolatov I, Krapivsky PL, Yuryev A. Duplicationdivergence model of protein interaction network. Phys Rev 2005;E 71:061911. 69. Ispolatov I, Krapivsky PL, Mazo I, et al. Cliques and duplication-divergence network growth. NewJ Phys 2005;7: 14559. 70. Rice JJ, Kershenbaum A, Stolovitzky G. Lasting impressions: motifs in proteinprotein maps may provide footprints of evolutionary events. Proc Natl Acad Sci 2005; 102(9):31734.
33. Huan J, Wang W, Prins J, et al. Efficient mining of frequent subgraphs in the presence of isomorphism. In: Third IEEE International Conference on Data Mining (ICDM03) 2003. IEEE Computer Society, Melbourne, Florida, USA. pp. 54960. 34. Inokuchi A, Washio T, Motoda H. An apriori-based algorithm for mining frequent substructures from graph data. In: PKDD 2000. Springer, Lyon, France LNAI 1910. pp. 1323. 35. Kuramochi M, Karypis G. Frequent subgraph discovery. In: IEEE International Conference on Data Mining (ICDM) 2001. IEEE Computer Society, San Jose, California, USA. pp. 31320. 36. Huan J, Wang W, Prins J, et al. Spin: mining maximal frequent subgraphs from graph databases. In: KDD 2004. ACM, Seattle, Washington, USA. pp. 5816. 37. Taleb NN. The Black Swan: The Impact of the Highly Improbable. New York: Random House Publishing Group, 2007. 38. Ziv E, Koytcheff E, Middendorf M, et al. Systematic identification of statistically significant network. Physc Rev 2005;E 71:016110. 39. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer, 2001. 40. Alm E, Arkin AP. Biological networks. Curr Opin Struct Biol 2002;13:193202. 41. Barabasi A.L. Linked: The New Science of Networks. Cambridge: Perseus, 2002. 42. Dorogovtsev SN, Mendes JFF. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford: Oxford University Press, 2003. 43. Erdos P, Renyi A. On random graphs. Publicationes Mathematicae 1959;6:2907. 44. Erdos P, Renyi A. On the evolution of random graphs. Publ Math Inst Hungarian Acad Sci 1960;5:1761. 45. Newman MEJ, Barabasi A-L, Watts DJ. The Structure and Dynamics of Networks. Princeton: Princeton University Press, 2003. 46. Strogatz SH. Exploring complex networks. Nature 2003; 410:26876. 47. Vazquez A, Flammini A, Maritan A, et al. Modeling of protein interaction networks. Complexus 2003;1:3844. 48. Vazquez A. Growing networks with local rules: preferential attachment, clustering hierarchy and degree correlations. Phys Rev 2003;E 67(5):056104. 49. Vespignani A. Evolution thinks modular. Nat Genet 2003; 35:11819. 50. Newman MEJ. The structure and function of complex networks. SIAM Rev 2003;45(2):167256. 51. Barabasi A-L, Oltvai ZN. Network biology: understanding the cells functional organization. Nat Rev 2004;5: 10113. 52. Milo R, Kashtan N, Itzkovitz S, et al. On the uniform generation of random graphs with prescribed degree sequences. http://arxiv.org/abs/cond-mat/0312028, 2003.

Ciriello - A Review On Models and Algorithms For Motif Discovery in Protein-Protein Interaction Networks

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ciriello - A Review On Models and Algorithms For Motif Discovery in Protein-Protein Interaction Networks

Uploaded by

Copyright:

Available Formats

B RIEFINGS IN FUNC TIONAL GENOMICS AND P ROTEOMICS . VOL 7 . NO 2.

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Ciriello and Guerra

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Motif discovery in PPI networks

NETWORK MOTIFS: DEFINITIONS AND PROPERTIES

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Ciriello and Guerra

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Motif discovery in PPI networks

MODELS OF RANDOM GRAPHS

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Ciriello and Guerra

MOTIF DISCOVERY ALGORITHMS

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Motif discovery in PPI networks

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Ciriello and Guerra

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Motif discovery in PPI networks

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

Ciriello and Guerra

Downloaded from http://bfg.oxfordjournals.org/ by guest on May 28, 2013

You might also like