Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 35

abstract

Availability of large-scale experimental data for cell biology is enabling computational methods to
systematically model the behaviour of cellular networks. This review surveys the recent advances in the
field of graph-driven methods for analysing complex cellular networks. The methods are outlined on three
levels of increasing complexity, ranging from methods that can characterize global or local structural
properties of networks to methods that can detect groups of interconnected nodes, called motifs or clusters,
potentially involved in common elementary biological functions. We also briefly summarize recent
approaches to data integration and network inference through graph-based formalisms. Finally, we highlight
some challenges in the field and offer our personal view of the key future trends and developments in
graph-based analysis of large-scale datasets.

Keywords: graph algorithms, data integration, cellular networks, protein–protein interactions,


transcriptional regulatory networks, network modularity

INTRODUCTION

Recent advances in large-scale experimental technologies have resulted in an accumulation of experimental


data that reflect the interplay between biomolecules on a global scale. Due to the complexity of the control
mechanisms involved, and the large number of possible interactions, there is a great need for computer-
assisted tools to manage, query and interpret the experimental observations with formal network models. In
their most basic abstraction level, cellular networks can be represented as mathematical graphs, using nodes
to represent cellular components, and edges to represent their various types of interactions [1]. For
instance, protein–protein interaction (PPI) networks are conveniently modelled by undirected graphs,
where the nodes are proteins and two nodes are connected by an undirected edge if the corresponding
proteins physically bind. In contrast, transcriptional regulatory networks can be modelled as directed
weighted graphs, where the weights of directed edges capture the degree of the regulatory effect of the
transcription factors (source nodes) to their regulated genes (sink nodes). Metabolic networks generally
require more complex representations, such as hypergraphs, as reactions in metabolic networks generally
convert multiple reaction inputs into multiple outputs with the help of other components. An alternative,
reduced representation for a metabolic network, is a weighted bipartite graph, where two types of nodes are
used to represent reactions and compounds, respectively, and the edges connect nodes of different types,
representing either substrate or product relationships.

The representation of complex cellular networks as graphs has made it possible to systematically
investigate the topology and function of these networks using well-understood graph-theoretical concepts
that can be used to predict the structural and dynamical properties of the underlying network. Such
predictions can suggest new biological hypotheses regarding, for instance, unexplored new interactions of
the global network or the function of individual cellular components that are testable with subsequent
experimentation. Even a simplistic dynamical system originating from small Boolean network models,
where nodes represent discrete biological entities (i.e. mRNA or protein), that can be thought to be either on
or off and edges their Boolean relationships (‘genotypes’), can give rise to a multitude of designable
dynamical outputs (‘phenotypes’) [2]. Mathematical modelling also enables an iterative process of network
reconstruction, where model simulations and predictions are closely coupled with new experiments chosen
systematically to maximize their information content for subsequent model adjustments, providing
increasingly more accurate descriptions of the network properties [3]. The topological relations underlying
graph-based methods can also convey structure to putative pathways. This helps avoiding approaches that
test many known sets of molecules without causal interactions [4]. Furthermore, graph formalisms may
provide powerful tools for ‘omics’ data integration to address fundamental biological questions at the
systems level [5*].
This review describes network analysis approaches in which the concept of a graph is a key component,
together with a large collection of recently introduced methods and available tools. This excludes some
related problems, such as hierarchical clustering or phylogenetic footprinting, where typically only the
result of the computational analysis is represented in the form of a specific graph, such as a dendrogram [6]
or a phylogenetic tree [7]. As substantial efforts have recently been devoted to develop graph-based
methods for a wide range of computational and biological tasks, only representative examples of different
approaches can be surveyed here, with an emphasis on methods related to concrete biological questions,
rather than computational issues. The selected methods are presented in the broader context of network
analysis, summarizing some of the basic concepts and themes such as scale-free networks, pathways and
modules (Table 1). The order of sections roughly reflects the increasing demands placed for the type and
amount of data the methods require and their applicability to address more targeted problems in cell
biology. Accordingly, the methods reviewed range from very elementary measures that characterize the
global topological structure and require only general assumptions about the underlying network model to
recent software systems available for integrating multiple types of cellular data within a graph-based
framework that can be applied to solve concrete biological problems.

CHARACTERIZATION OF NETWORK TOPOLOGY

Perhaps the most general level of network analysis comes from global network measures that allow us to
characterize and compare the given network topologies (i.e. the configuration of the nodes and their
connecting edges). Global measures such as the degree distribution (the degree of a node is the number of
edges it participates in) and the clustering coefficient (the number of edges connecting the neighbours of the
node divided by the maximum number of such edges) have recently been thoroughly reviewed in the
context of cellular networks [8**] and in proteomics [9]. It has been proposed that these quantitative graph
concepts can efficiently capture the cellular network organization, providing insights into their evolution,
function, stability and dynamic responses [10**]. For instance, several types of surveyed biological
networks, such as PPI, gene regulation and metabolic networks, are thought to display scale-free topologies
(i.e. most nodes have only a few connections whereas some nodes are highly connected), characterized by a
power-law degree distribution that decays slower than exponential. This particular type of network
topology is also frequently observed in numerous non-biological networks and it can be generated by
simple and elegant evolutionary models, where new nodes attach preferentially to sites that are already
highly connected. Numerous improvements to this generic model include, for instance, iterative network
duplication and integration to its original core, leading to hierarchical network topologies, which are
characterized by non-constant clustering coefficient distribution [8, 10].

It should be noticed, however, that, in practice, the architecture of large-scale biological networks is
determined with sampling methods, resulting in subnets of the true network, and only these partial networks
can be applied to characterize the topology of the underlying, hidden network [11]. It has recently been
recognized that it is possible to extrapolate from subnets to the properties of the whole network only if the
degree distributions of the whole network and randomly sampled subnets share the same family of
probability distributions [12]. While this is the case in specific classes of network graph models, including
classical Erdös–Rényi and exponential random graphs, the condition is not satisfied for scale-free degree
distributions. Accordingly, recent studies in interactome networks have revealed that the commonly
accepted scale-free model for PPI networks may fail to fit the data [13]. Moreover, limited sampling alone
may as well give rise to apparent scale-free topologies, irrespective of the original network topology [14].
These results suggest that interpretation of the global properties of the complete network structure based on
the current—still limited—accuracy and coverage of the observed networks should be made with caution.
Moreover, while the scale-free and hierarchical graph properties can efficiently characterize some large-
scale attributes of networks, the local modularity and network clustering is likely to be the key concept in
understanding most cellular mechanisms and functions.
GRAPH ANALYSIS OF INTERACTION PATTERNS

As an alternative to the study of global graph characteristics, elementary graph algorithms have been used
to characterize local interconnectivity and more detailed relationships between nodes. Such graph methods
can facilitate addressing fundamental biological concepts, such as essentiality and pathways, especially
when additional biological information is incorporated into the analysis in addition to the primary data. For
instance, while gene expression clustering traditionally makes the assumption that genes with similar
expression profiles have similar functions in cells, a more targeted approach could aim at identifying
the genes participating in a particular cellular pathway where not every components has a similar
transcriptional profile [15*]. Once the network of interest has been represented as a graph, the conventional
graph-driven analysis work-flow involves the following two steps: (i) applying suitable graph algorithms to
compute the local graph properties, such as the number and complexity of given subgraphs, the shortest
path length of indirectly connected nodes or the presence of central nodes of the network and (ii) evaluating
the sensitivity and specificity of the model predictions using curated databases of known positive examples
or random models of synthetic negative examples, respectively. We start by surveying the basic graph
concepts used in network analysis together with corresponding recent work.

Subgraphs and centrality statistics


A subgraph represents a subset of nodes with a specific set of edges connecting them. As the number of
distinct subgraphs grows exponentially with the number of nodes, efficient and scalable heuristics have
been developed and applied for detecting the given subgraphs and their frequencies in large networks. In
contrast to network motif searches, Przulj et al. [13] argued that it is equally important to understand the
organization of infrequently observed subgraphs as the frequently observed ones. Graphlets are defined as
small induced subgraphs, consisting of all edges of the original graph that connect a given group of nodes,
regardless of whether or not they appear at significantly higher frequencies than expected in randomized
networks. Since exhaustive searches become computationally infeasible even when applied to rather small
networks, Przulj et al. [16] designed sampling heuristics for finding graphlets in high-confidence PPI
networks that concentrate on specific parts of the graph, depending on the particular model (either
geometric random graph model or more general sampling strategy). Under the random graph model, it is
also possible to calculate analytically the estimated distribution of different subgraphs with given number of
nodes, edges and their specific global properties like degree distribution and clustering coefficient [17].
Such approximate analytical expressions will save substantial amounts of computing time when analysing
e.g. lists of proteins with large undirected graphs representing their known functional relationships [18].

Centrality is a local quantitative measure of the position of a node relative to the other nodes, and can be
used to estimate its relative importance or role in global network organization. Different flavors of centrality
are based on the node's connectivity (degree centrality), its shortest paths to other nodes (closeness
centrality) or the number of shortest paths going through the node (betweenness centrality). Estrada [19]
recently showed that centrality measures based on graph spectral properties can distinguish essential
proteins in PPI network of yeast Saccharomyces cerevisiae (essential genes are those upon which the cell
depends for viability). In particular, the best performance in identifying essential proteins was obtained with
a novel measure introduced to account for the participation of a given node in all subgraphs of the network
(subgraph centrality), which gives more weight to smaller subgraphs. It was proposed that ranking proteins
according to their centrality measures could offer a means to selecting possible targets for drug discovery
[19]. A similar approach to characterize the importance of individual nodes, based on trees of shortest paths
and concepts of ‘bottleneck’ nodes, demonstrated that 70% of the top 10 most frequent ‘bottleneck’ proteins
were inviable and structural proteins that do not participate in cellular signaling [20*]. With degree
centrality analyses in the metabolic networks of Escherichia coli, S. cerevisiae and Staphylococcus aureus,
it was also demonstrated that most reactions identified as essential turned out to be those involving the
production or consumption of low-degree metabolites [21].

Paths and pathways


In the theory of directed graphs, a path is a chain of distinct nodes, connected by directed edges, without
branches or cycles. Such pathways in cellular network graphs can represent, for instance, a transformation
path from a nutrient to an end product in a metabolic network, or a chain of post-translational modifications
from the sensing of a signal to its intended target in a signal transduction network [10**]. Pathway
redundancy (the presence of multiple paths between the same pair of nodes) is an important local property
that is thought to be one of the reasons for the robustness of many cellular networks. Betweenness centrality
can be used to measure the effect of node perturbations on pathway redundancy, whereas path lengths
characterize the response times under perturbations. With shortest paths and centrality-based predictions in
the S. cerevisiae PPI and metabolic networks, respectively, the existence of alternate paths that bypass
viable proteins can be demonstrated, whereas lethality corresponds to the lack of alternative pathways in the
perturbed network [20*, 22]. Besides the various commercial software packages for pathway analysis there
exist also freely available tools for some specific graph queries, such as finding shortest paths between two
specified seed nodes on degree-weighted metabolic networks [23] or searching for linear paths that are
similar to query pathways in terms of their composition and interaction patterns on a given PPI network
[24].

The relatively high degree of noise inherent in the interactions data in current PPI databases can make
pathway modelling very challenging. Integration of prior biological knowledge, such as Gene Ontology
(GO), can be used to make the process of inferring models more robust by providing complementary
information on protein function. GO terms and their relationships are encoded in the form of directed
acyclic graph (DAG). Guo et al. [25] recently assessed the capability of both GO graph structure-based and
information content-based similarity measures on DAG to evaluate the PPIs involved in human regulatory
pathways. They also showed how the functional similarity of proteins within known pathways decays
rapidly as their path length increases. While most of the analysis methods designed for PPI networks
consider unweighted graphs, where each pairwise interaction is considered equally important, Scott et al.
[26] recently presented linear-time algorithms for finding paths and more general graph structures such as
trees that can also consider different reliability scores for PPIs. By exploiting a powerful randomized graph
algorithm, called color coding, they efficiently recovered several known S. cerevisiae signaling pathways
such as MAPK, and showed that in general the pathways they detected score higher than those found in
randomized networks. In addition to known pathways, they also predicted (by unsupervised learning) novel
putative pathways in the PPI network that are functionally enriched (i.e. share significant number of
common GO annotations) [26].

NETWORK DECOMPOSITION INTO FUNCTIONAL MODULES

The decomposition of large networks into distinct components, or modules, has come to be regarded as a
major approach to deal with the complexity of large cellular networks [27–29]. This topic has witnessed
great progress lately, and only representative examples of different approaches are presented here. In
cellular networks, a module refers to a group of physically or functionally connected biomolecules
(nodes in graphs) that work together to achieve the desired cellular function [8**]. To investigate the
modularity of interaction networks, tools and measures have been developed that can not only identify
whether a given network is modular or not, but also detect the modules and their relationships in the
network. By subsequently contrasting the found interaction patterns with other large-scale functional
genomics data, it is possible to generate concrete hypotheses for the underlying mechanisms governing e.g.
the signaling and regulatory pathways in a systematic and integrative fashion. For instance, interaction data
together with mRNA expression data can be used to identify active subgraphs, that is, connected regions of
the network that show significant changes in expression over particular subsets of experimental conditions
[30].

Motifs
Motifs are subgraphs of complex networks that occur significantly more frequently in the given network
than expected by chance alone [29]. Consequently, the basic steps of motif analyses are (i) estimating the
frequencies of each subgraph in the observed network, (ii) grouping them into subgraph classes consisting
of isomorphic subgraphs (topologically equivalent motifs) and (iii) determining which subgraph classes are
displayed at much higher frequencies than in their random counterparts (under a specified random graph
model). While analytical calculations from random models can assist in the last step, exhaustive
enumeration of all subgraphs with a given number of nodes in the observed network is impossible in
practice. Kashtan et al. [31] therefore developed a probabilistic algorithm that allows estimation of
subgraph densities, and thereby detection of network motifs, at a time complexity that is asymptotically
independent of the network size. The algorithm is based on a subgraph importance sampling strategy,
instead of standard Monte Carlo sampling. They noticed that, network motifs could be detected already with
a small number of samples in a wide variety of biological networks, such as the transcriptional regulatory
network of E. coli [31]. Recently, efficient alternatives together with graphical user interfaces have also
been implemented to facilitate fast network motif detection and visualization in large network graphs [32,
33].

Many of the methodologies recently introduced in network analysis are inspired by established approaches
from sequence analysis. The concepts utilized in both fields include approximate similarity, motifs and
alignments. As network motifs represent a higher-order biological structure than protein sequences, graph-
based methods can be used to improve the homology detection of standard sequence-based algorithms, such
as PSI-BLAST, by exploiting relationships between proteins and their sequence motif-based features in a
bipartite graph representing protein-motif network [34]. The definition of network motifs can be enriched
by concepts from probability theory. The motivation is that if the network evolution involves elements of
randomness and the currently available interaction data is imperfect, then functionally related subgraphs do
not need to be exactly identical. Accordingly, Berg and Lässig [35] devised a local graph alignment
algorithm, which is conceptually similar to sequence alignment methodologies. The algorithm is based on a
scoring function measuring the statistical significance for families of mutually similar, but not necessarily
identical, subgraphs. They applied the algorithm to the gene regulatory network of E. coli [35].

Motifs have increasingly been found in a number of complex biological and non-biological networks, and
the observed over-representation have been interpreted as manifestations of functional constraints and
design principles that have shaped network architecture at the local level. Significance of motifs is typically
assessed statistically by comparing the distribution of subgraphs in an observed network with that found in a
particular computer-generated sample of randomized networks that destroy the structure of the network
while preserving the number of nodes, edges and their degree distribution. It can be argued what kind of
random model provides the most appropriate randomization, and especially whether it is realistic to assume
that the edges in the randomized network are connected between the nodes globally at random and without
any preference [36]. However, the principal application of network motif discovery should not originate
from a rigorous statistical testing of a suitable null hypothesis, but from the possibility to reduce the
complexity of large networks to smaller number of more homogeneous components. Analogously with gene
expression cluster analysis, where statistical testing is also difficult because of the lack of an established
null model, network decomposition may be used as a tool to identify biologically significant modules,
irrespective of their statistical significance.

Clusters
An alternative approach to the identification of functional modules in complex networks is discovering
similarly or densely connected subgraphs of nodes (clusters), which are potentially involved in common
cellular functions or protein complexes [37]. As in expression clustering, the application of graph clustering
is based on the assumption that a group of functionally related nodes are likely to highly interact with each
other while being more separate from the rest of the network. The challenges of clustering network graphs
are similar to those in the cluster analysis of gene expression data [6]. In particular, the results of most
methods are highly sensitive to their parameters and to data quality, and the predicted clusters can vary
from one method to another, especially when the boundaries and connections between the modules are not
clear-cut. This seems to be the case at least in the PPI network of S. cerevisiae [38]. Moreover, it should be
noted that modules are generally not isolated components of the networks, but they share nodes, links and
even functions with other modules as well [8**]. Such hierarchical organization of modules into smaller,
perhaps overlapping and functionally more coherent modules should be considered when designing
network clustering algorithms. The functional homogeneity of the nodes in a cluster with known
annotations can be assessed against the cumulative hypergeometric distribution that represents the null
model of random function label assignments [20*].

Highly connected clusters


Most algorithms for determining highly connected clusters in PPI networks yield disjoint modules [39]. For
instance, King et al. [40] partitioned the nodes of a given graph into distinct clusters, depending on their
neighbouring interactions, with a cost-based local search algorithm that resembles the tabu-search heuristic
(i.e. it updates a list of already explored clusters that are forbidden in later iteration steps). Clusters with
either low functional homogeneity, cluster size or edge density were filtered out. After optimizing the
filtering cut-off values according to the cluster properties of known S. cerevisiae protein complexes from
MIPS database, their methods could accurately detect the known and predict new protein complexes [40].
Other local properties such as centrality measures can be used for clustering purposes as well. A recent
algorithm by Dunn et al. [41], for example, divides the network into clusters by removing the edges with
the highest betweenness centralities, then recalculating the betweenness and repeating until a fixed number
of edges have been removed. They applied the clustering method to a set of human and S. cerevisiae PPIs,
and found out that the protein clusters with significant enrichment for GO functional annotations included
groups of proteins known to cooperate in cell metabolism [41].

Overlapping clusters
Corresponding to the fact that proteins frequently have multiple functions, some clustering approaches,
such as the local search strategy by Farutin et al. [42], also allow overlapping clusters. Like in motif
analysis, the score for an individual cluster in the PPI network graph is assessed against a null model of
random graph that preserves the expected node degrees. They also derived analytical expressions that allow
for efficient statistical testing [18]. It was observed that many of the clusters on human PPI network are
enriched for groups of proteins without clear orthologues in lower organisms, suggesting functionally
coherent modules [42]. Pereira-Leal et al. [43] used the line graph of the network graph (where nodes
represent an interaction between two proteins and edges represent shared interactors between interactions)
to produce an overlapping graph partitioning of the original PPI network of S. cerevisiae. Recently,
Adamcsek et al. [44] provided a program for locating and visualizing overlapping, densely interconnected
groups of nodes in a given undirected graph. The program interprets as motifs all the k-clique percolation
clusters in the network (all nodes that can be reached via chains of adjacent k-cliques). Larger values of k
provide smaller groups resulting in higher edge densities. Edge weights can additionally be used to filter out
low-confidence connections in the graphs [44].

Distance-based clusters
Another approach to decompose biological networks into modules applies standard clustering algorithms on
vectors of nodes’ attributes, such as their shortest path distances to other nodes [45]. As the output then
typically consists of groups of similarly linked nodes, the approach can be seen as complementary to the
above clustering strategies that aim at detecting highly connected subgraphs. To discover hierarchical
relationships between modules of different sizes in PPI graphs, Arnau et al. [46] explored the use of
hierarchical clustering of proteins in conjunction with the pairwise path distances between the nodes. They
considered the problem of lacking resolution caused by the ‘small world’ property (relatively short—and
frequently identical—path length between any two nodes) by defining a new similarity measure on the basis
of the stability of node pair assignments among alternative clustering solutions from resampled node sets.
As ties in such bootstrapped distances are rare, standard hierarchical clustering algorithms yield clusters
with a higher resolution. The clusters obtained in S. cerevisiae PPI data were validated using GO
annotations and compared with those refined from gene expression microarray data [46]. A similar
approach was also applied to decompose metabolic network of E. coli into functional modules, based on the
global connectivity structure of the corresponding reaction graph [47].

Supervised clustering
Provided that the eventual aim of module analysis is function prediction, it can be argued that supervised
clustering (or classification), rather than unsupervised clustering methods, should be employed. In the
context of cellular networks, classification aims at constructing a discriminant rule (classifier) that can
accurately predict the functional class of an unknown node based on the annotation of neighbouring nodes
and connections between them. To this end, Tsuda and Noble [48] considered a binary classification
problem, and calculated pairwise distances on undirected graphs with a locally constrained diffusion kernel.
They demonstrated a good protein function prediction with a support vector machine (SVM) classifier from
S. cerevisiae PPI and metabolic networks. Supervised clustering methods in function prediction are
challenged by their notorious dependence on the quality of the training examples [49]. As fully curated
databases are rarely available, especially for less-studied organisms, the applicability of such methods is
still limited. Therefore, an intermediate method between the two extremes of supervised and unsupervised
clustering may be preferable. The protein function prediction algorithm by Nabieva et al. [50] suggests
such an approach that exploits both global and local properties of the network graphs. They demonstrated
better predictions than previous methods in cross-validation testing on the unweighted S. cerevisiae PPI
graph. More importantly, they showed that the performance could be substantially improved further by
weighting the edges of the interaction network according to information from multiple data sources and
types [50].

CURRENT CHALLENGES AND FUTURE TRENDS

Functional modules across multiple data sources


As the high-throughput assays are inherently noisy and biased in their nature, and each single data source or
type can describe only a limited scope of a system, it is evident that integrative analysis of data from such
measurements will be essential in order to fully understand the system's behaviour on a global scale [51*].
In many biological applications, it is beneficial to perform the network analysis in a truly integrated
manner, simultaneously rather than sequentially, like when validating the results against external data
sources. Graph-based frameworks can also be used in such an integrative analysis of data from different
sources. The composition of data sources required depends naturally on the specific biological goals of
the study. In the analysis of transcriptional regulatory networks, for instance, clustering becomes a
problem of dissecting genes into regulatory modules (sets of coexpressed genes regulated by common
transcription factors). It has been shown that the identification of regulatory modules can be improved by
combining gene expression data (inferred e.g. from microarrays) with the knowledge of transcription factor
binding to the DNA motifs (extracted e.g. from chromatin immunoprecipitation ChIPs) [52–54]. Recently,
Tanay et al. [55] identified modules across diverse genome-wide data sources and types, including gene
expression, protein interactions, growth phenotype data and transcription factor binding. By modelling
genomic information as properties of a weighted bipartite graph in the yeast system, they defined clusters of
genes with a common behaviour across a set of the experiments. This provides a general data processing
and integration framework for revealing a detailed view of both the global and local organization of a
molecular network even in higher organisms [55].

In interaction networks, the detection of modules, motifs or clusters has also been performed on multiple
graphs simultaneously using efficient algorithms for exact or approximate pattern mining across a set of
graphs constructed from same data type [56, 57]. Towards integrated graph analysis of heterogeneous
genome-wide data sources, Yeger-Lotem et al. [58] developed algorithms for detecting composite network
motifs with two or more types of interactions and applied them to a combined data set of PPIs and
transcription-regulation interactions in S. cerevisiae. Similarly, Moon et al. [59] built a unified network
model for both protein–protein and domain–domain interactions to detect network motifs between proteins
and their domains by applying a colored vertex graph model. The module searches can also be extended to
incorporate more than one species in order to elucidate the evolution of cellular machinery or to predict
more reliably the protein functions. Sharan and Ideker [60**] recently reviewed the computational
approaches available to comparative biological network analysis, that is, contrasting two or more interaction
networks representing different species, conditions, interaction types or time points. In particular, network
integration can assist in predicting protein interactions or uncovering protein modules that are supported by
interactions of different types. Besides interaction data sets, Chen and Xu [61] encoded into their functional
linkage graph also other genome-scale data types, including microarray gene expression profiles, and used
them simultaneously when annotating S. cerevisiae proteins into multiple GO categories. Future studies
involving a blend of multiple experimental and computational approaches will hopefully provide added
insights into the biological roles of network motifs and clusters [62–64].

Software tools for graph-based network analysis


The availability of genome-scale data sets has increased the need for software tools that can integrate,
construct, analyze and visualize the high-dimensional data effectively. Several such software packages
developed for these challenging tasks along with their specific functionalities were recently listed by Joyce
and Palsson [5*]. Publicly available software systems that use graph-based data integrating visual
frameworks for networks include e.g. Cytoscape together with its recent plug-ins [65–67], Osprey [68],
GiGA [69], megNet [70], VisANT [71], BioPIXIE [72], Pointillist [73, 74], PIANA [75] and PathSys [76].
An important component of such systems is the possibility to visualize the graphs under analysis. This can
be regarded as a fundamental tool in explorative network analysis; even if one wants to address only a very
specific question within the given network graph, it may be helpful to visualize the result to discern possible
flaws or follow-up questions. Recently introduced graph drawing tools include e.g. WebInterViewer [77],
CADLIVE [78] and PATIKAweb [79]. By meeting the challenges of automated construction and
simultaneous visualization of multiple pathways, such software tools can be of great help in relating the
selected node sets and their interconnections to the underlying biological significance.

Bioconductor project incorporates also open source tools to support computational analysis of graphical
data structures (http://www.bioconductor.org/). The available packages implement not only algorithms for
efficient graph visualization (AT&T Graphviz), but also the C++ Boost Graph Library for basic graph
algorithms (RBGL package). At present, procedures that can be interfaced in the R environment include
minimum spanning tree construction, shortest path finding, depth-first search, topological sorting, edge-
connectivity measurement and connected component decomposition [80]. The number of packages
available within Bioconductor grows rapidly as many authors make their R source codes freely available for
academic use [81–83]. These advanced graph algorithms and post-processing tools can be used also in
conjunction with more specific, freely available software packages such as GenePattern
(http://www.broad.mit.edu/genepattern). One major challenge of computational network analysis deals with
the selection of model types appropriate for analysing data from different experimental approaches [84**].
In particular, accurate modelling and integration of protein interactions measured with yeast two-hybrid and
affinity purification/mass spectrometry can be very critical e.g. for understanding the physical properties
and functional operation of local protein complexes [82].

Network graph reconstruction by reverse engineering


A number of computational approaches have been tried to reconstruct the underlying global network
structure or even the causal regulatory relationships between the nodes from the experimental data sets. This
challenging problem is often referred to as network inference or reverse engineering [85, 86]. For instance,
several works have dealt with gene regulatory networks inference from gene expression microarray data
alone. In such a hypothetical network, the nodes conventionally correspond to both the particular gene and
the protein it encodes, and the edges to the statistical relations between the genes. Bayesian network offers a
convenient probabilistic model, where nodes represent gene expression levels as random variables, edges
represent their conditional dependence relations and the corresponding DAG the joint probability
distributions of the observed expression patterns. However, it has been recognized that these data are
sufficient for reconstruction of only relatively small networks and that even in idealized situations the
estimated models contain many false edges because the expression data alone cannot unambiguously
distinguish the underlying target network [87]. Further challenges are faced when applying the inference
algorithms to limited quantities of experimentally collected noisy data from real biological systems [88].
Suggested solutions to tackle these problems include the usage of gene network motifs [87], network
pruning methods [88] or reduced network models [89]. One way to further refine these hypothetical models
is to conduct an automated design of new experiments, enabling both iterative model building and
candidate model discrimination [90, 91]. Such reverse-engineered gene networks could be of great medical
significance, for instance, in identification of drug targets [92].

Computational methods have also been used in assisting the completion of the existing PPI networks by
prioritizing the interactions, either observed or missing, that warrant further experimental confirmation. For
instance, Yu et al. [93] first searched for defective cliques in the incomplete network graphs (nearly
complete groups of pairwise interacting nodes), and then they predicted new interaction that can complete
these cliques. Albert and Albert [94] showed that machine-learning algorithms can achieve success rates
between 20 and 40% for predicting the correct interaction partner of a protein based solely on the presence
of conserved interaction motifs within the given network. Ultimately, however, integrated usage of multiple
large-scale data types together with local and global topological properties will likely be essential for
effective prediction of networks and their functions [51*, 84**]. Towards such integrative approaches,
several groups have recently combined multiple heterogeneous data sources to construct global models of
gene regulatory networks or PPIs [95–97]. Qi et al. [98*] showed that in supervised protein interaction
prediction, some of the most important features are actually derived from indirect information sources, such
as gene expression measurements. Both indirect statistical relations and direct physical interactions can also
be used when predicting or interpreting genetic interactions, observed by comparing phenotypic variations,
which are involved in many complex human diseases [99, 100]. However, while most studies have
concentrated on snapshots of interactions under particular conditions, it is likely that only by coupling
interactions from several functional and temporal states we can reveal truly significant dynamic
reconstructions of cellular networks in the future.

CONCLUSION

The large-scale data on biomolecular interactions that is becoming available at an increasing rate enables a
glimpse into complex cellular networks. Mathematical graphs are a straightforward way to represent this
information, and graph-based models can exploit global and local characteristics of these networks relevant
to cell biology. Most current research activities concern the dissection of networks into functional modules,
a principal approach attempting to bridge the gap between our very detailed understanding of network
components in isolation and the ‘emergent’ behaviour of the network as a whole, which is frequently
the phenotype of interest on a cellular level. Approaches developed for DNA and protein sequence
analysis, such as multiple alignment and statistical over-representation of parts, are being carried over
to address these problems. Network graphs have the advantage that they are very simple to reason about,
and correspond by and large to the information that is globally available today on the network level.
However, while binary relation information does represent a critical aspect of interaction networks, many
biological processes appear to require more detailed models. Therefore, we expect that one of the main
directions in the development of graph-based methods will be their extension to other types of large-scale
data from existing and new experimental technologies. This may eventually prove mathematical models of
large-scale data sets valuable in medical problems, such as identifying the key players and their
relationships responsible for multifactorial behaviour in human disease networks.
Gene network analysis in plant development by genomic technologies
ABSTRACT The analysis of the gene regulatory networks underlying development is of central importance for a
better understanding of the mechanisms that control the formation of the different cell-types, tissues or organs
of an organism. The recent invention of genomic technologies has opened the possibility of studying these
networks at a global level. In this paper, we summarize some of the recent advances that have been made in the
understanding of plant development by the application of genomic technologies. We focus on a few specific
processes, namely flower and root development and the control of the cell cycle, but we also highlight landmark
studies in other areas that opened new avenues of experimentation or analysis. We describe the methods and
the strategies that are currently used for the analysis of plant development by genomic technologies, as well as
some of the problems and limitations that hamper their application. Since many genomic technologies and
concepts were first developed and tested in organisms other than plants, we make reference to work in non-plant
species and compare the current state of network analysis in plants to that in other multicellular organisms.

Network Role Analysis in the Study of Food Webs:


An Application of Regular Role Coloration[1]

ABSTRACT: In the last fifteen years, ecosystem ecologists have


developed a theoretical approach and a set of computational methods
called “ecological network analysis” (Ulanowicz, 1986; Kay et al. 1996).
Ecological network analysis is based on input/output models of energy
or material flows (e.g., carbon compound flows) through a trophic
network (e.g., a food web describing which species eats which other
species). Mathematically and conceptually, this ecological network
analysis approach is strikingly similar to work in the field of social
network analysis, particularly the influence models of Hubbell (1965),
Katz (1963), and Friedkin and Johnsen (1990). In food web research,
Yodzis and Winemiller (1999), have recently proposed a new way to
operationalize the concept of a "trophospecies", which is a set of
species with similar foods or predators. Their definition turns out to be
identical to the notion of structural equivalence (Lorrain and White,
1971) in social network analysis, particularly as conceived by Burt
(1976) and Burt and Talmud (1993). The striking convergence to date of
the fields of ecology and sociology via independent invention of network
concepts suggests that there may be considerable value in cross-
fertilization of the two fields. With this paper we hope to begin a
dialogue between the two fields, by applying advanced social role
theory and methods to the study of food webs. In social network
analysis, the introduction of the notion of structural equivalence thirty
years ago was followed by the development of regular coloration (White
& Reitz, 1983; Everett & Borgatti, 1991), an important advance over
structural equivalence for modeling social roles. The objective of our
paper is to answer a call in the ecological literature for greater clarity in
thinking about the role of species in ecosystems (Simberloff and Dayan,
1991), by applying the notion of regular coloration to food webs.
Introduction
Ecologists have long been interested in interactions among species,
particularly in the sub-fields of community and ecosystem ecology. In
the last fifteen years, ecosystem ecologists have developed a
theoretical approach and a set of computational methods called
“ecological network analysis” (Ulanowicz, 1986; Kay et al. 1996).
Ecological network analysis is based on input/output models of energy
or material flows (e.g., carbon compound flows) through a trophic
network (e.g., a food web describing which species eats which other
species). In addition, several other network approaches to studying food
webs are beginning to be developed in ecology, especially in terrestrial
ecosystems (see Polis and Winemiller 1995 for a review of the various
approaches).

Research efforts using network analyses in ecology have produced


methodological, theoretical and empirical advances. Two main software
packages have been developed to perform ecological network analysis:
NETWRK4 (Ulanowicz 1987; software downloads available at
http://www.cbl.cees.edu/~ulan/ntwk/network.html) and ECOPATH
(Christensen and Pauly 1992). These network analysis programs have
been used to model trophic relationships among primary producers,
herbivores, intermediate consumers, top predators, and dead material
(detritus and carrion) in food webs, providing opportunities for
comparative analysis of whole ecosystems measured at different places
and times. The food web is modeled as a series of n compartments into
which energy or materials flow and leave, with input from and export to
other ecosystems, and respiration to the atmosphere for each
compartment incorporated into the network. Each species in a food web
can be assigned its own compartment, or more typically, species are
grouped into compartments based on similarity of diet. An n x n matrix
of compartments with material or energy flows moving from producers
to consumer compartments is then set up. This matrix can be analyzed
for total system throughput, average path length, the presence of
cycles, the sources or sinks of material, and the trophic level of any
given compartment.

Comparative ecosystem analysis using ecological network models


derived from empirical data has been used to study whole ecosystem
productivity, growth and development of ecosystems, and to provide a
measure of ecosystem stress. For example, Baird and Ulanowicz (1989)
used the network analysis approach to compare the seasonal changes
in the Chesapeake Bay, clarifying the importance of the planktonic and
benthic subsystems in the network to overall productivity at different
times of the year. Baird and Ulanowicz (1993) compared four estuary
ecosystems in Europe and South Africa, including one polluted
ecosystem, using network analysis to analyze the trophic structure of
each ecosystem and assess the relative amount of stress or ecosystem
maturity. Monaco and Ulanowicz (1997) employed ecological network
analysis to compare trophic structure and total system productivity of
three estuaries in the mid-Atlantic region of the USA: Narragansett Bay,
Delaware Bay and Chesapeake Bay, listed here in decreasing order of
overall production and increasing order of stress. Baird et al. (1998)
described the trophic relationships and cycles of carbon fluxes in a
seagrass meadow using network analysis, comparing variations in the
carbon flow dynamics across four sampling sites within a single estuary.
Christian and Luczkovich (1999) used ecological network analysis to
produce a ranking of each species in a food web by their effective
trophic levels (Levine 1980), which is a fractional measure of the
number of trophic links carbon compounds had passed through after
being fixed by primary producers in a seagrass ecosystem. A
compilation of many such network analysis models and their application
to aquatic ecosystems, including studies of fisheries and aquaculture
systems, was recently produced (Christensen and Pauly 1993). Thus,
ecological network analysis has been widely adopted in aquatic
ecosystem analysis as an empirical tool for following carbon and other
nutrient flows, describing carbon sources and sinks, mapping trophic
levels and energy flow among these levels, and examining the
dependency of various species on particular sources of energy as it
changes over space and time.

Network analysis has been used to develop ecosystem theory, notably


in the area of ecosystem maturity. The theory of ecosystem
“ascendancy” has been articulated by Ulanowicz (1997), which states
that as an ecosystem network develops through time in a stable
environment, it becomes more hierarchical and has fewer redundant
links. A state of high ecosystem ascendancy -- a network-based
measure that has been proposed by Ulanowicz (1986) -- has been used
to describe this trend. In an unstable environment or in an early stage of
network development, there are many redundant links, a lack of
hierarchical structure, and thus low ecosystem ascendancy. The
network measure of ecosystem stress is based on the frequency of
redundant connections; whereas a mature or non-stressed network
have few redundant connections, a polluted, stressed, or frequently
disturbed network will have many redundant connections. Thus, a new
ecological theory has been proposed based on network studies that
suggest mature ecosystem networks have higher ascendancy than
immature ecosystems -- a measure that has been derived from network
analyses.

Mathematically and conceptually, ecological network analysis is


strikingly similar to work in the field of social network analysis,
particularly the influence models of Hubbell (1965), Katz (1963), and
Friedkin and Johnsen (1990). In food web research, ecologists Yodzis
and Winemiller (1999) have recently proposed a new definition and
operationalization of the concept of a "trophospecies," which is a
collection of species in a food web that have similar foods and
predators. Their definition turns out to be identical to the notion of
structural equivalence (Lorrain and White, 1971) in social network
analysis, particularly as conceived by Burt (1976) and Burt and Talmud
(1993).

The striking convergence to date of the fields of ecology and sociology


via independent invention of network concepts suggests that there may
be considerable value in cross-fertilization of the two fields. In particular,
sociology in general and social network analysis in particular has
focused heavily on structural analysis (Mayhew, 1980), dedicating
whole conferences and journals (such as the Journal of Social
Structure) to the topic. As a result, it has developed a rich set of
structural concepts and computational procedures. Structural concepts
are equally important in ecology, but have received somewhat less
methodological attention, particularly concerning some of the role
approaches discussed in this article. Consequently, ecology may be in a
position to benefit from the extraordinary development of structural
methodologies in the social sciences whereas the social sciences may
be in a position to benefit from some of the network theoretical
advancements in ecology.

With this paper we hope to begin a dialogue between the two fields, by
applying advanced social role theory and methods to the study of food
webs. In social network analysis, the introduction of the notion of
structural equivalence thirty years ago was followed by the development
of regular coloration (White & Reitz, 1983; Everett & Borgatti, 1991), an
important advance over structural equivalence for modeling social roles.
The objective of our paper is to answer a call in the ecological literature
for greater clarity in thinking about the role of species in ecosystems
(Simberloff and Dayan, 1991), by applying the notion of regular
coloration to food webs.

In Search of a Formal Definition of Role


In recent years, the social sciences have paid much attention to the
importance of structure in understanding human behavior at both the
macro and micro levels of analysis (Mayhew, 1980). One area of
particular interest has been the study of social roles. Social scientists
have devoted a considerable amount of effort to understand the
relationship between the position of an actor in set of structured
relations and the role (or roles) of that actor within the system. Such
work has been influenced by a general concern in the social sciences
for producing theoretical definitions of a social role. Earlier social
theorists such as Homans (1967), Goodenough (1969), Nadel (1957),
and Merton (1957) produced conceptual definitions of social roles that
went beyond simple inventories of rights and obligations, and instead
defined roles relationally in terms of their characteristic pattern of ties
with other roles. For these theorists, what defines a person as playing
the role of doctor is the characteristic set of relations they maintain with
persons who are playing related roles such as patients, nurses, medical
record keepers, pharmaceutical sales people, receptionists, etc. The
relations associated with these roles are structured and patterned in
predictable ways and may vary only slightly from one medical practice
to another.

This relational conception of the notion of social role lends itself well to
formal modeling. Early network researchers such as White (1963) and
Boyd (1969) initiated what might be termed formal network role analysis
in their attempts to model kinship systems using algebraic approaches.
This was followed by Lorraine and White’s (1971) seminal work on
structural equivalence that sought to not only formally define roles in
terms of the structure of human relations, but to reduce the complexity
of a network of relations in terms of sets of simplified role structures
(i.e., images). This work has been followed by the formulation of a
variety of equivalence approaches that define in various ways the
relational properties that sets of actors must have in order for them to be
equivalent in terms of the roles and positions they occupy (Wasserman
and Faust 1995). We will return to this in more detail in a later section of
the paper.

The Idea of a Species Role in Ecology


An important issue for ecologists has been to conceptualize the roles of
species in an ecological community. The niche concept, for example,
was defined early on by Elton (1927) as the "fundamental role" of an
organism in a community; its relationship to predators and prey. In fact,
Elton's original discussion of an organism's role was based explicitly on
social roles in human communities (ibid. 63-64):

“It is therefore convenient to have some term to describe the status of


an animal in its community, to indicate what it is doing and not merely
what it looks like, and the term used is "niche." Animals have all manner
of external factors acting upon them -- chemical, physical, and biotic --
and the "niche" of an animal means its place in the biotic environment,
its relations to food and enemies. The ecologist should cultivate the
habit of looking at animals from this point of view as well as from the
ordinary standpoints of appearance, names, affinities, and past history.
When an ecologist says "there goes a badger" he should include in his
thoughts some definite idea of the animal's place in the community to
which it belongs, just as if he had said "there goes the vicar."”
(Emphasis is Elton's.)

Role-like concepts have become a central concern for researchers


interested in trophic (i.e., eating-based) views of ecosystems (Yodzis
and Winemiller, 1999). Dissatisfied with early approaches to the trophic
level concept, which are now seen as too coarse, broad, or even
arbitrary to be useful, some ecologists have sought to develop new
trophic concepts that better capture the inherent complexity of systems
of trophic interactions (Cousins 1987; Polis 1991). These newer
approaches often refer to the functional role (Cummins, 1974) of a
species (or sets of related species) in a community, and also tend to
see these roles as “basic building blocks” (Hawkins and MacMahon,
1989) of communities and therefore aids in reducing the complexity of
an ecological system.
One of the most widely used trophic role concepts is the guild. Guilds
are subdivisions of units of analysis that share similar prey. They are
groups of species exploiting a common resource base in a similar
fashion (Root, 1967). What is important to note here is that two species
are members of the same guild to the extent that they have exactly the
same prey. Although this captures an aspect of the role of a species or
class of species in a food web, it only considers a species role with
regard to downward trophic interactions; predators are not considered.
Yodzis (1988:508) discussed a related concept called trophospecies,
which he defined as “a lumping together of biospecies with similar
[emphasis ours] feeding habits.” As we shall see, this downward
orientation of the trophospecies concept was later extended to include
trophic interactions in both directions.

Several ecologists have attempted to expand the idea of a species


trophic role to include simultaneously interactions involving predators as
well as prey. For example, Pimm et al. (1991) advanced the
trophospecies concept as the aggregation of species that have identical
predators and prey. Goldwasser and Roughgarden (1993) have
suggested the concept of trophically equivalent species or species that
share the same sets of predators and prey. An important part of the
logic of Goldwasser and Roughgarden is the intuitive notion that
similarities in diets may or may not imply similarities in predators.
Similarly, Persson et al. (1996) sought to expand the trophic role
concept to include both upward and downward interactions. One of the
concepts they suggested was that of the trophic group in which species,
or clusters of species, have similar dynamics because they share the
same predators and the same prey. Although all of these role concepts
can be considered conceptual advancements, the authors provide little
in the way of how these concepts can be formally operationalized.
Further, clarification of these concepts is hampered by the confusion
surrounding what is meant by the terms similar and same. As Yodzis
(1988:508) notes in a discussion of some of the problems with methods
for species aggregations in food webs[2] “of course a shortcoming of
existing food web data is the lack of a consistent usage for 'similar' in
this context; for the foreseeable future we are just going to have to live
with this.”

In response to these basic problems, Yodzis and Winemiller (1999)


have recently provided a formalization of the notion of trophospecies --
species aggregated on the basis of having similar roles in a food web --
that takes into account linkages to both predators and prey and more
clearly delineates what is meant by similar. They recommend using the
Jaccard (1900) similarity index for binary data (as well as an analogous
index for continuous data) in which the similarity between species i and
j is

Where a refers to the total of both predator and prey that species i and j
share, b represents the number of species of both predators and prey
that i has but not j, and c gives the number of species of both predators
and prey that j has but not i. They then employ cluster analysis of the
matrix of Jaccard coefficients in order to produce aggregations of
species based on trophic similarity.

Role Coloration and Trophic Roles


To any social network researcher, it is obvious that Yodzis and
Winemiller’s approach to trophospecies in a food web is essentially the
same as the notion of structural equivalence in a social network
introduced in the sociological literature thirty years ago by Lorrain and
White (1971), and is virtually identical to the version popularized by Burt
(1976).[3] In fact, the Jaccard coefficient proposed by Yodzis and
Winemiller is an option in the structural equivalence procedure found in
the UCINET 5 social network analysis software package (Borgatti,
Everett and Freeman, 1999).

However, in the social networks literature, the introduction of structural


equivalence eventually gave rise to a more general notion -- regular
coloration -- that better captured the theoretical notion of social role
(White and Reitz, 1983; Everett and Borgatti, 1991). Although structural
equivalence captures perfectly the notion of competition, it has
limitations when used to define social role. In structural equivalence, two
actors are equivalent if they have the same relations with the same
others. That means that two mothers, each with three children, and
each with two parents, are not seen as playing the same role (i.e., are
not structurally equivalent), since (presumably) they are not mothers to
the same children and not born of the same parents. Similarly, if we
divide up a network of people into structurally equivalent classes, then if
a person Bill in one structural class has a certain kind of tie to a person
Mary in another class, then everyone in Bill’s class must have that same
kind of tie with Mary, and Bill must have that tie with everyone in Mary’s
class. But this strange restriction is not part of the theoretical concept of
role, since we recognize that two corporate Presidents play the same
role, even though the boards they report to are different and they have
different subordinates. Consequently, a new approach was needed to
model the notion of social role, and this approach was regular
coloration.

In the regular coloration model, we assign colors to nodes such that if


two persons are assigned the same color (i.e., are playing the same
role), then they are connected to equivalent (but not necessarily the
same) others. To define the concept more formally, we introduce some
notation. Let G(V,E) denote a directed graph with set of nodes V and set
of ties E. We write (u,v)∈E if there is a tie from u to v. Let the out-
neighborhood of a node v be denoted No(v) and defined as the set of
nodes that have a tie from v to them. That is, No(v) = {x | (v,x) ∈ E}.
Similarly, the in-neighborhood of v is denoted Ni(v) and defined as Ni(v)
= {x | (x,v) ∈ E}. The notions of in- and out-neighborhoods are paralleled
in ecology by the concepts of “input environs” and “output environs”
proposed by Patten (1978). A coloration C is defined as an assignment
of colors (classes) to nodes. The color of a node v is denoted by C(v). A
coloration is termed regular if it satisfies the following condition:

C(u) = C(v) -> C(No(u)) = C(No(v)) and C(Ni(u)) = C(Ni(v))

Thus, if two nodes are members of the same class, then they have ties
to members of the same classes, and receive ties from the same
classes, but not necessarily the same individual nodes (see Figure 1).
Applied to a food web, a regular coloration places species in color
classes such that if two species (A and B) are members of the same
class, then the set of species that A preys upon belong to the same set
of classes that the prey of B belong to, and the predators of A belong to
the same classes as do the predators of B. Yet A and B need not share
any of the same individual species of prey or predators.
It should be noted that regular coloration actually defines a family of
colorations[4], of which structural equivalence (also known as structural
coloration) is a member. In this paper, we focus on the most inclusive
member of this family, technically known as the maximal regular
coloration although, for simplicity of exposition, we shall usually omit the
"maximal."

Figure 1. Regular and non-regular role colorations.

Regular coloration can be seen as providing a principled homomorphic


reduction or simplified model of the system of dyadic ties in a network.
For example, the regular coloration in Figure 1 indicates that although
there are eight nodes, there are only five kinds of nodes, and the way
they are related is given in Figure 2.

Figure 2. Image graph for regular coloration in Figure 1.


In other words, what is characteristic of red nodes is that they receive
ties from green and blue nodes, and give ties to white nodes. Looking at
Figure 1, it can be seen that in fact every red node receives a tie from at
least one blue node and one green node, and gives a tie to at least one
white node. Similarly what characterizes blue nodes is that they receive
from yellow nodes (and no others) and give to red nodes (and no
others). We refer to the reduced model shown in Figure 2 as an image
graph. In a food web, the image graph of a regular coloration simplifies
system complexity by providing a reduced model of a food web that
loses very little information, and only in a strictly defined way. These
aggregations of “trophically analogous” or, what we term isotrophic
species are useful because they reduce complexity without combining
species that are playing different relational roles. Thus, regular
coloration captures some of the functional ideas discussed by ecologists
as it relates to the general trophic roles of various species.

The regular coloration model may be seen as providing an alternative to


the aggregation model proposed by Hirata and Ulanowicz (1985). In
their model, the information in a network is defined as the average
amount of uncertainty resolved by the knowledge of the network
structure. They then use a stepwise heuristic algorithm -- much like a
hierarchical clustering algorithm -- to agglomerate species into an
arbitrary number of desired classes. The result is a partition that
minimizes information loss. In practice, it is possible that in some cases
this model will yield answers that resemble regular colorations.
However, the answers can also be very different. A key difference
between the Hirata-Ulanowicz model and the regular coloration model is
that the Hirata-Ulanowicz model is not constrained to collapse only
those species that have ties to equivalent others. It is this recursive
feature that makes regular coloration distinctive and which makes it so
suitable for capturing the notion of role, which carries with it the idea
that occupants of the same role have characteristic relations with the
occupants of certain other roles.

Regular coloration is also related to the ecological notion of trophic level


or trophic position. Trophic level normally refers to the number of links
that a species is from the bottom (primary producers) of a food web. It
can be proved that in a maximal regular coloration of digraphs without
cycles or self-loops, nodes colored the same are equally positioned
along corresponding paths passing through them from sources to sinks.
[5]
In such graphs, then, regularly equivalent nodes will be similarly
positioned with respect to distances to (corresponding) producers and
also from (corresponding) top predators. For example, in Figure 3,
nodes d, e, and g are all regularly equivalent, and they are the same
distances from the top (a) and from the bottom (nodes i, j, k) nodes.
Similarly, f and h are colored the same, and are the same distances
from the top and bottom of the web.

Figure 3. Regular coloration showing source to sink chains.

The relationship between regular coloration and trophic level, while


complex mathematically, is easy to understand at a theoretical level.
Fundamentally, regular coloration is about structural (and presumably)
functional redundancy. As noted by Hirata and Ulanowicz (1985),
functional redundancy is implicit in the notion of trophic level. In
digraphs without cycles or loops, regularly equivalent nodes occupy
identical positions in parallel pathways from sources to sinks; what a
regular homomorphic reduction accomplishes is the elimination of these
parallel pathways to yield a simple model of the original network. [6]
Application of Regular Coloration to Food Web
Examples
In this section we examine the application of the regular coloration
model to food webs found in the ecological literature. To be useful, the
model needs to be able to handle certain features of empirical food
webs. For example, the relations among the nodes (called
“compartments” by ecologists) of a food web can be either binary (either
a species eats something or it doesn’t) or real valued (as in the amount
of carbon (mgC/m2) exchanged between compartments). In addition,
cannibalism (self-loops) exists, as do cycles (species A eats species B
which eats Species A). Fortunately, none of these features is a problem
for the regular coloration model.

Another issue pertaining to real-life food webs is the fact that the
theoretical models are unlikely to fit perfectly. That is, no species will be
perfectly equivalent to another, if only because of measurement error.
Consequently, the algorithms used to detect regular coloration must be
capable of finding classes of “nearly equivalent” species. There are
several approaches available. The one we use in this paper is the
profile similarity method embodied in the well-known REGE algorithm
(White and Reitz, 1985; Borgatti, Everett and Freeman, 1999). In this
approach, we calculate for each pair of nodes the degree of regular
equivalence. Details of how this is done are beyond the scope of this
paper, but are discussed in Borgatti and Everett (1993). The matrix of
regular similarity coefficients is then multidimensionally scaled to
provide a visual, spatial representation of the web structure, and cluster-
analyzed to produce classes of nearly equivalent nodes. An alternative
approach is to use a combinatorial optimization procedure to directly
classify nodes into groups so as to minimize a penalty measure based
on the regular coloration concept (Batagalj, Doreian and Ferligoj, 1992).
The advantage of this approach is that it provides a readily interpretable
measure of how well the data actually fit the model. The disadvantage
of this approach, and the reason we have not used it here, is that it
provides no information (i.e., pairwise similarities or distances) for a
spatial representation.

The food webs and isotrophic classes were visualized using MAGE, a
three dimensional interactive program originally developed for use in
protein chemistry (Richardson and Richardson, 1992,
http://kinemage.biochem.duke.edu/).
For each of the MAGE graphs, referred to
as kinemages, three-dimensional coordinates were obtained through a
multidimensional scaling of the regular coefficients similarity matrix.
Also, in one of the examples, the Pajek network analysis program
(Batagelj and Mrvar, n.d., http://vlado.fmf.uni-lj.si/pub/networks/pajek/) is used to
help illustrate the importance of the direction of trophic linkages.

The Mukkaw Bay Intertidal Subweb

Our first simple example is taken from a classic paper on food web
complexity and species diversity by Robert Paine (1966). His work was
instrumental in illustrating the relationship between the presence of
predators and species diversity in an ecosystem. Claiming that subwebs
or "groups of organisms capped by terminal carnivores and trophically
interrelated in such a way that at higher levels there is little transfer of
energy to co-occurring subwebs," are the basic units within biological
communities he conducted experiments in the Mukkaw Bay intertidal
zone of Washington state (ibid: 66).

Figure 4 provides a representation of Paine's Figure 1 of the feeding


relationships of the Pisaster-dominated subweb at Mukkaw Bay (Paine
1966:67).
Figure 4. Paine's (1966) conceptualization of the feeding relationships of a subweb in Mukkaw
Bay.

Note that this subweb has three levels: herbivorous invertebrates


(chitons [2 species], limpets [2 species], bivalves [1 species], acorn
barnacles [3 species], and goose-necked barnacles [Mitella polymerus])
at the basal level, and two levels of carnivorous predators (Thais
emarginata, a snail, as an intermediate predator, and Pisaster
ochraceus, a sea star, as top predator). It is important to recognize that
Paine did not include primary producers in his depiction of the subweb.

A matrix of trophic relations for this subweb was created and subjected
to REGE and graphed using Mage as described above. Figure 5 is an
interactive kinemage of the subweb and is strikingly similar to Paine's
intuitive graph, with one major exception.

Figure 5. Kinemage of the regular role coloration of the Mukkaw Bay subweb.
To animate Mage, click in the figure (may take a few seconds to
load).

Using an average linkage clustering of the regular equivalence


coefficients (see Figure 6) the basal species are aggregated into two
isotrophic classes based on their trophic roles in the subweb (excluding
primary producers). Thus, based on the structure of these trophic
relations there are 4 trophic roles clearly evident. The first of these
includes barnacles and bivalves that are preyed upon by both Thais and
Pisaster, while the second group (chitons, limpets and goose-necked
barnacles) are preyed on exclusively by Pisaster. The remaining two
groups include both the intermediate (Thais) and top predators
(Pisaster).

Figure 6. Average linkage clustering of REGE coefficients for


Mukkaw Bay subweb with isotrophic classes identified.

The El Verde Rain Forest


The second example comes from Reagan and Waide (1996) in their
examination of the El Verde rain forest food web in Puerto Rico. The
matrix contains 156 compartments each containing various levels of
species aggregations. For example, some compartments (nodes)
contain a single species while others may contain as many as 429
different species. Compartment ties were determined through direct
observations while others were inferred from the literature. Species
aggregation and lack of direct observation may be problematic in terms
the validity of comparisons across systems as well as causing problems
with developing and testing theories concerning food web structure.
There is an on going debate in ecology concerning these issues, but
this is beyond the scope of this paper [see Yodzis and Winemiller
(1999) and Polis (1995) for a more detailed discussion].

Based on a hierarchical clustering of REGE coefficients, we have


chosen to partition the 156 compartments into six isotrophic classes. Of
course, the identification of isotrophic groupings can be at various levels
of similarity based on the cluster analysis. In this case, we have chosen
few classes in order to explore the structure of the graph at a very
general level. The 6-class image graph is shown in Figure 7. This image
graph helps in showing both the direction of the trophic relations and
which isotrophic classes contain cycles and loops.
Figure 7. An image graph of the El Verde rain forest food web.

We now turn to a visual representation of the 156-node food web as a


whole. The kinemage in Figure 8 shows four main isotrophic groups and
two isolated classes (class anomalies).
Figure 8. Kinemage of the regular role coloration of the El Verde rain forest.

To animate Mage, click in the figure (may take a few seconds to


load).

Top predators (class 4, red), those consumed by no other


compartments, are at one end of the graph. Basal (consumers of
producers) and intermediate consumers (consumers of basal and
producers -- classes 3 [blue] and 2 [black, when a white background is
selected, as shown here; this class will be either black or white
depending on background]) are in the middle of the graph, while the
primary producers (class 1, green), which do not consume members of
any other compartments, are at the other extreme. If we move through
the graph using the animation utility in Mage, the red class (class=4)
contains top predators such as Puerto Rican screech owls, Puerto
Rican lizard cuckoos, hawks (red-tailed, broad-winged, and sharp-
shinned), Puerto Rican boa snakes, colubrid snakes, ringed lizards,
various predatory snails, parasitic insects, and house cats. The
basal/intermediate consumers are divided into two primary classes. The
first of these, the black class (class=2) contains specialist herbivores
and detritivores such as decomposers (fungi and bacteria), various
insects [ants and bees (Hymenoptera), bark and book lice (Psocoptera),
beetles (Coleoptera), bugs (Hemiptera), butterflies and moths
(Lepidoptera), cockroaches, mantids, and walking sticks (Orthoptera),
earwigs (Dermaptera), flies (Diptera), Homopterans (cicadas),
springtails (Collembola), termites (Isoptera), thrips (Thysanptera), and
webspinners (Embioptera)], pseudoscorpions, whip-scorpions,
sowbugs, pillbugs, bristletails, daddy longlegs, various snails, snakes
and toads, various bats, various birds (Antillian euphonia, Puerto Rican
parrots, ruddy quail-doves, striped-headed tanagers, bananaquits,
Puerto Rican bullfinches, ovenbirds, pigeons, thrushes, warblers, and
woodpeckers), and some mammals (black rats, Indian mongoose). The
second isotrophic class colored blue (class=3) contains primarily
generalist omnivores such as various insects (ants, flies, and
orthopterans), spiders, mites and ticks (Arachnids), scorpions,
centipedes, nematodes, various tree frogs, anole lizards (Anolis sp.),
and birds (prairie and black-throated blue warblers, black-cowled
orioles, Puerto Rican emeralds, green mangoes, Puerto Rican todies). It
is important to note the differences in the density (or linkage density in
ecology) of within class connections between the two isotrophic classes.
The blue class, in which members are omnivorous, have higher within-
class density than the black/white class and also exhibits loops and
cycles. The green class (class=1) contains the primary producers and
detritus such as plants, algae, live wood, sap, nectar, pollen, dead
wood, leaves, and detritus (with some exceptions discussed below).

Of interest are the two remaining isolated classes and some other
notable anomalies. The maroon class (class=5) contains ostrocods and
slime molds. Although they appear to play roles as top predators in the
food web, by all accounts slime mold, for example, has rarely been
conceived of as a top predator in ecology. In addition, the members of
the orange class (class=6), which are both warblers, appear to function
more as primary producers rather than in their true role as intermediate
consumers. Finally, there is a small cluster of insects found among the
plants and detritus in the green class (e.g., beetles, dragonflies,
caddisflies, mayflies, vespids). Is the model not performing well, or is
there some other explanation for these anomalous classifications? In
reality the fault lies not with the model, but with the input data itself. The
original food web matrix of trophic interactions contains incomplete data
(e.g., we know warblers and mayflies must eat something -- they're not
photosynthetic). Thus, the model places species into classes according
to the nature of their trophic linkages as reported in the input data
matrix.

One interesting outcome with respect to the discovery of these


anomalies is the usefulness of an exploratory visualization tool such as
Mage. Exploratory visualization programs such as Mage allow for an in-
depth exploratory examination of the modeled data that facilitates both
an interpretation of the model results and a cursory assessment of how
well the model actually performed. Freeman (2000) discusses some of
these issues in more depth in the first issue of the Journal of Social
Structure.

The Coachella Valley Desert Food Web

We now turn to a more complex example of a food web. Although


simpler with regards to the total number of compartments in the
network, it is more complex in that the food web contains a high degree
of cycles (A eats B, B eats C and C eats A), loops (A eats B and B eats
A) and self loops (cannibalism). Polis (1995) provides a set of highly
aggregated data for the Coachella Valley desert in California as an
example of the importance and frequency of cycles and loops in
understanding true food web dynamics, something he feels has been
overlooked in most food web research. The matrix contains 30
compartments in which thousands of species were lumped together to
produce compartments representing “kinds of organisms” and it is not
clear how these aggregations were done. As with rain forest example,
the problems with aggregation are possibly even more acute here.

These problems notwithstanding, we analyzed the Coachella Valley


dataset using the methods presented earlier. Figure 9 shows the web
with isotrophic classes indicated by color.
Figure 9. Pajek diagram of the regular role coloration of the Coachella Valley desert food web.

Figure 10 gives an interactive kinemage representation of the same


web.
Figure 10. Kinemage of the regular role coloration of the Coachella Valley desert food web.

To animate Mage, click in the figure (may take a few seconds to


load).

Once again we can orient the kinemage in trophic terms with primary
producers (green class) at the bottom and animate through the various
classes. However, unlike the other two examples, there are no clear top
predators due to the high degree of cycles, loops, and self loops found
in the food web. As Polis notes, "…the Coachella Valley ecosystem
represents a community and food web with potentially no top predators"
(535). This is precisely what the regular role coloration found. In fact, 66
percent of the compartments involves cannibalism and, as we shall see,
one of the isotrophic groups contains trophic generalists with a high
degree of cycles and loops. The primary producers (green class=1)
contain the highly aggregated compartments plants, carrion, and
detritus. The basal and intermediate consumers are found in four
distinct isotrophic classes (classes 2, 6, 7, 8). The first of these, and the
largest, contains primarily herbivorous specialists such as herbivorous
mammals and reptiles, surface arthropod herbivores and detritivores,
soil macroarthropods, and soil macroarthropod predators (yellow, class
2). The remaining three classes are all found in the soil and include soil
microbes (black/white, class 6), soil microarthropods and nematodes
(orange, class 7) and soil micro predators (purple, class 8).

Finally, there are three isotrophic classes making up the highest level in
the food web. These include large, primarily predaceous mammals and
primarily herbivores mammals and birds (red, class 3), hyperparasitoids
and spider parasitoids (magenta, class 5), and a larger class containing
golden eagles, large, primarily predaceous birds, various carnivorous
snakes, birds and mammals, and various species of parasitoids (blue,
class 4). Isotrophic class 4 constitutes what Polis (1995) refers to as
carnivorous generalists as well as trophic generalists and also displays
a high degree of looping and cycling. Hence, there are no clear top
predators found in this community food web, something Polis (1995)
notes may be more prevalent than catalogued food webs would have us
believe.

Figure 11 summarizes the discussion above with an image graph,


clarifying the overall structure.
Figure 11. An image graph of the Coachella Valley food web showing cycles and loops.

Summary
This paper responds to the call by Yodzis and Winemiller (1999:339) “to
further evaluate objective criteria for aggregation into trophospecies.”
We have proposed that regular coloration models provide objective and
formalized means for aggregating species into “isotrophic” classes that
play similar structural roles in a network of trophic interactions. These
isotrophic classes are a close match to the ecological notion of
trophospecies.

Regular coloration models provide a rigorous way to describe the


structure of food webs and formalize familiar notions such as trophic
group, trophic equivalence, trophic position, complexity, niche, and
niche overlap without necessarily losing too much information. Such
models identify isotrophic species that face analogous environments,
have similar functional roles, and are equally positioned in terms of
trophic position, yet the regular coloration model does not presuppose a
single linear trophic dimension nor does it forbid omnivory, cycles,
loops, reflexive ties, cannibalism and other features of real food webs.

Basic regular coloration can be extended and adapted to examine


special purpose problems in ecology. For example, to cluster species
that exploit equivalent resources but do not necessarily have equivalent
predators (such as a guild), it is possible to adjust the definitions to
consider only outgoing (or incoming) ties (Pattison, 1988). Further,
algorithms exist that relax the idealized mathematical definitions for
application to noisy real world data of all kinds, including binary relations
and quantitative energy flows. These models have the ability to handle
multiple relations simultaneously and may facilitate the modeling of
complex trophic and non-trophic interactions such as habitat facilitation
and different types of competition.

You might also like