BBT 034
1093/bib/bbt034
Advance Access published on 21 May 2013
Abstract
Corresponding author. Mark Ragan. The University of Queensland, Institute for Molecular Bioscience and ARC Centre of
Excellence in Bioinformatics, Brisbane, QLD 4072, Australia, Tel.: 61 7 3346 2616; Fax: 61 7 3346 2101; E-mail: m.ragan@uq.edu.au
Stefan Maetschke is a computer scientist, and his current research focus is on machine learning techniques and their application to the
inference of gene regulatory and protein–protein interaction networks.
Piyush Madhamshettiwar is currently completing a PhD in bioinformatics. His work is focused on the inference and analysis of gene
regulatory networks, and the development of integrative analytical workflows for cancer transcriptomics.
Melissa Davis is a computational cell biologist with research interests in genome-scale biological network and pathway analysis, and
the integration of experimental data with knowledge-based modeling.
Mark Ragan has research interests in comparative mammalian and bacterial genomics, genome-scale bioinformatics and computational systems biology.
The most recent and largest comparison so far has been performed by Madhamshettiwar et al. [3]. They compared the prediction accuracy of eight unsupervised and one supervised method on 38 simulated data sets. The methods showed large differences in prediction accuracy but the supervised method was found to perform best, despite the parameters of the unsupervised methods having been optimized. Here we extend this study to 17 unsupervised methods and include a direct comparison with supervised and semi-supervised methods on a wide range of networks and experimental data types (knockout, knockdown and multifactorial).
[...] unsupervised methods, distinguishes between experimental types and performs repeats, resulting in a more complete picture. A related evaluation by Schaffter et al. [12] compared six unsupervised methods on larger networks with 100, 200 and 500 nodes and simulated expression data. Again the z-score method was found to be one of the top performers in knockout experiments.
Several smaller evaluations have been performed but are largely restricted to four unsupervised methods (ARACNE, CLR, MRNET and RN) in comparisons with a novel approach on small data sets. The ARACNE method was introduced by [...]
[...] ordinary differential equations but on random networks and simulated expression data.
In the following sections, we first describe the different inference methods in detail, before evaluating their prediction accuracies on simulated and experimental gene expression data and regulatory networks of varying size. We continue with a discussion of the prediction results and conclude with guidelines for the use of the evaluated methods.
METHODS
We compared the prediction accuracy of unsupervised [...]
[...] of the true network. In the Supplementary Material, we nonetheless report results based on F1 score and Matthews correlation.
To avoid discrepancies between the gene expression values generated by true regulatory networks and the actually known, partial networks, we performed evaluations on simulated, steady-state expression data, generated from subnetworks extracted from E. coli and Saccharomyces cerevisiae networks. This allowed us to assess the accuracy of an algorithm against a perfectly known true network [21]. We used GeneNetWeaver [12, 24] and SynTReN [25] to extract subnetworks and to simulate gene expression [...]
[...] training data (known interactions) and render those methods supervised. For the supervised and semi-supervised methods, 5-fold cross-validation was applied and parameters were optimized on the training data only.
In addition to simulated data, we also evaluated all methods on two experimental data sets originating from the fifth DREAM systems biology challenge [8]. Specifically, we downloaded an E. coli network with 296 regulators, 4297 genes and the corresponding expression data with 487 samples, and an S. cerevisiae network with 183 regulators, 5667 genes and expression data with 321 samples. Both data sets are described [...]
[...] being the most common. Spearman's method is simply Pearson's correlation coefficient for the ranked expression values, and Kendall's $\tau$ coefficient is computed as
\[ \tau(X_i, X_j) = \frac{\mathrm{con}(X_i^r, X_j^r) - \mathrm{dis}(X_i^r, X_j^r)}{\frac{1}{2}\, n (n - 1)} \tag{3} \]
where $X_i^r$ and $X_j^r$ are the ranked expression profiles of genes $i$ and $j$. $\mathrm{con}(\cdot,\cdot)$ denotes the number of concordant and $\mathrm{dis}(\cdot,\cdot)$ the number of discordant value pairs in $X_i^r$ and $X_j^r$, with both profiles being of length $n$.
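As an illustration of Equation (3), the following minimal Python sketch counts concordant and discordant pairs directly. It works on the raw expression values rather than on explicit ranks, because ranking preserves the ordering of values and therefore yields the same counts. The function name `kendall_tau` and the handling of ties (counted as neither concordant nor discordant) are assumptions made for this example, not details taken from the evaluated implementation.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau as in Equation (3): (con - dis) / (n (n - 1) / 2)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    con = dis = 0
    # Visit every unordered pair of samples (a, b) exactly once.
    for a, b in combinations(range(n), 2):
        sign = (x[a] - x[b]) * (y[a] - y[b])
        if sign > 0:
            con += 1      # both profiles order the pair the same way
        elif sign < 0:
            dis += 1      # the profiles order the pair differently
        # sign == 0 is a tie; assumed here to count as neither
    return (con - dis) / (0.5 * n * (n - 1))
```

For two profiles with identical ordering the function returns 1, and for exactly reversed ordering it returns -1.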
[...] to infer interactions. The MI $I$ between discrete variables $X_i$ and $X_j$ is defined as
\[ I(X_i, X_j) = \sum_{x_i \in X_i} \sum_{x_j \in X_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \tag{7} \]
where $p(x_i, x_j)$ is the joint probability distribution of $X_i$ and $X_j$, and $p(x_i)$ and $p(x_j)$ are the marginal probabilities. $X_i$ and $X_j$ are required to be discrete variables. We used equal-width binning for discretization and empirical entropy estimation as described by Meyer et al. [26].
[...] filter indirect interactions, but uses partial correlation coefficients instead of MI as interaction weights. The partial correlation coefficient $\mathrm{corr}_{ij}^{\mathrm{partial}}$ between two genes $i$ and $j$ within an interaction triangle $(i, j, k)$ is defined as
\[ \mathrm{corr}_{ij}^{\mathrm{partial}} = \frac{\mathrm{corr}(X_i, X_j) - \mathrm{corr}(X_i, X_k)\, \mathrm{corr}(X_j, X_k)}{\sqrt{\left(1 - \mathrm{corr}(X_i, X_k)^2\right) \left(1 - \mathrm{corr}(X_j, X_k)^2\right)}} \tag{11} \]
where $\mathrm{corr}(\cdot,\cdot)$ is Pearson's correlation coefficient. The partial correlation coefficient aims to eliminate the effect of the third gene $k$ on the correlation of genes $i$ and $j$.
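The partial correlation within one interaction triangle can be sketched in a few lines of Python. The sketch uses the standard first-order form, in which each correlation in the denominator is squared before being subtracted from one. The helper names `pearson` and `partial_corr` are chosen for this example only, and the code covers a single triangle, not the complete PCIT procedure.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xi, xj, xk):
    """Correlation of genes i and j after removing the effect of gene k."""
    r_ij = pearson(xi, xj)
    r_ik = pearson(xi, xk)
    r_jk = pearson(xj, xk)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))
```

If gene k is uncorrelated with both other genes, the result equals the plain correlation of i and j. If the correlation of i and j is explained entirely by their shared dependence on k, the result is close to zero.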
The background correction step aims to reduce the prediction of false interactions based on spurious correlations and indirect interactions.
Algorithm for the reconstruction of accurate cellular networks
ARACNE [13] is another modification of the RN that applies the Data Processing Inequality (DPI) to filter out indirect interactions. The DPI states that, if gene $i$ interacts with gene $j$ via gene $k$, then the following inequality holds:
\[ I(X_i, X_j) \le \min\left(I(X_i, X_k),\, I(X_j, X_k)\right) \tag{10} \]
ARACNE considers all possible triplets of genes (interaction triangles) and computes the MI values for each gene pair within the triplet. Interactions within an interaction triangle are assumed to be indirect and are therefore pruned if they violate the DPI beyond a specified tolerance threshold eps. We used a threshold of eps = 0.2 for our evaluations.
Partial correlation and information theory
PCIT [29] is similar to ARACNE. PCIT extracts all possible interaction triangles and applies the DPI to [...]
[...]
\[ i_{\mathrm{MRMR}} = \operatorname*{arg\,max}_{i \in V \setminus S} (s_i) \tag{12} \]
with $s_i = u_i - r_i$. The relevance term $u_i = I(X_i, X_j)$ is thereby the MI between gene $i$ and target $j$, and the redundancy term $r_i$ is defined as
\[ r_i = \frac{1}{|S|} \sum_{k \in S} I(X_i, X_k) \tag{13} \]
Interaction weights $w_{ij}$ are finally computed as $w_{ij} = \max(s_i, s_j)$.
MRNET-B
MRNET-B is a modification of MRNET that replaces the forward selection strategy to identify the best subset of regulator genes by a backward selection strategy followed by a sequential replacement [32].
Gene network inference with ensemble of trees
GENIE is similar to MRNET in that it also lets each gene take on the role of a target regulated by the remaining genes and then uses a feature selection procedure to identify the best subset of regulator genes. In contrast to MRNET, Random Forests
200 Maetschke et al.
and Extra-Trees are used for regression and feature selection [27] rather than MI and MRMR.
SIGMOID
SIGMOID models the regulation of a gene by a linear combination with soft thresholding. The predicted expression value $X'_{ik}$ of gene $i$ at time point $k$ is described by the sum over the weighted expression values $X_{jk}$ of the remaining genes, constrained by a sigmoid function $\sigma(\cdot)$.
\[ X'_{ik} = \sigma\Big( \sum_{j \ne i}^{n} X_{jk} w_{ij} + b_i \Big) \tag{14} \]
[...]
Mutual rank
MR by Obayashi and Kinoshita [34] uses ranked Pearson's correlation as a measure to describe gene co-expression. For a gene $i$, first Pearson's correlation with all other genes $k$ is computed and ranked. Then the rank achieved for gene $j$ is taken as score to describe the similarity of the gene expression profiles $X_i$ and $X_j$:
\[ \mathrm{rank}_{ij} = \operatorname{rank}_j\left(\mathrm{corr}(X_i, X_k),\ \forall k \ne i\right) \tag{20} \]
with $\mathrm{corr}(\cdot,\cdot)$ being Pearson's correlation coefficient. The final interaction weight $w_{ij}$ is calculated as the [...]
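The ranking step of Equation (20) can be sketched as follows. The sketch assumes that rank 1 is assigned to the most strongly (positively) correlated partner gene, which is an assumption of this example and is not stated in the text. It stops at the rank matrix and does not attempt the final combination of ranks into $w_{ij}$. The names `pearson` and `rank_matrix` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_matrix(profiles):
    """ranks[i][j]: rank of gene j among all partners k != i of gene i.

    Assumption: rank 1 is the partner with the highest correlation.
    Diagonal entries are left at 0 because a gene is not ranked against itself.
    """
    n = len(profiles)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        partners = [k for k in range(n) if k != i]
        corr = {k: pearson(profiles[i], profiles[k]) for k in partners}
        ordered = sorted(partners, key=lambda k: corr[k], reverse=True)
        for position, k in enumerate(ordered, start=1):
            ranks[i][k] = position
    return ranks
```

Note that the rank matrix is generally not symmetric: the rank of gene j from the viewpoint of gene i can differ from the rank of gene i from the viewpoint of gene j.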
Mass-distance
MD by Yona et al. [33] is a similarity measure for expression profiles. It estimates the probability of observing a profile inside the volume delimited by the profiles. The smaller the volume, the more similar are the two profiles. Given two expression profiles $X_i$ and $X_j$, the total probability mass of samples whose $k$-th feature is bounded between the expression values $X_{ik}$ and $X_{jk}$ is calculated as
\[ \mathrm{MASS}_k(X_i, X_j) = \sum_{\min(X_{ik}, X_{jk}) \le x \le \max(X_{ik}, X_{jk})} \mathrm{freq}(x) \tag{18} \]
where $\mathrm{freq}(x)$ is the empirical frequency. The mass distance $\mathrm{MD}_{ij}$ is defined as the total volume of profiles bounded between the two expression profiles $X_i$ and $X_j$ and is estimated by the product over all coordinates $k$.
\[ \mathrm{MD}_{ij} = \prod_{k}^{n} \mathrm{MASS}_k(X_i, X_j) \tag{19} \]
where $n$ is the length of the expression profiles. Because $\mathrm{MD}_{ij}$ is symmetric and positive, we interpret it directly as an interaction weight $w_{ij}$.
EUCLID
EUCLID is a simple method that uses the Euclidean distance between the normalized expression profiles $X'_i$ and $X'_j$ of two genes as interaction weights
\[ w_{ij} = \sqrt{\sum_{k} \left(X'_{ik} - X'_{jk}\right)^2} \tag{23} \]
where profiles are normalized by computing the absolute difference of expression values $X_{ik}$ to the median expression in profile $X_i$
\[ X'_{ik} = \left| X_{ik} - \mathrm{median}(X_i) \right| \tag{24} \]
Z-SCORE
Z-SCORE is a network inference strategy by Prill et al. [7] that takes advantage of knockout data. It assumes that a knockout affects directly interacting genes more strongly than others. The z-score $z_{ij}$ describes the effect of a knockout of gene $i$ in the $k$-th experiment on gene $j$ as the normalized deviation of the expression level $X_{jk}$ of gene $j$ for
Inference of gene regulatory networks 201
experiment $k$ from the average expression $\mu(X_j)$ of gene $j$:
\[ z_{ij} = \left| \frac{X_{jk} - \mu(X_j)}{\sigma(X_j)} \right| \tag{25} \]
The original Z-SCORE method requires knowledge of the knockout experiment $k$ and is therefore not directly applicable to data from multifactorial experiments. The method, however, can easily be generalized by assuming that the minimum expression value within a profile indicates the knockout experiment [$\min(X_j) = X_{jk}$]. Equation (25) then becomes
\[ z_{ij} = \left| \frac{\min(X_j) - \mu(X_j)}{\sigma(X_j)} \right| \]
[...] The distance $d(x)$ can be interpreted as a confidence value. The larger the absolute distance, the more confident the prediction, and similar to a correlation value we interpret the distance as an interaction weight.
In contrast to unsupervised methods, e.g. correlation methods, the supervised approach does not directly operate on pairs of expression profiles but on feature vectors that can be constructed in various ways. We computed the outer product of two gene expression profiles $X_i$ and $X_j$ to construct feature vectors:
\[ x = X_i X_j^T \tag{30} \]
Figure 1: Extraction of samples for the training and test set from a gene interaction network.
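The z-score of Equation (25) and its generalization to experiments without a known knockout can be sketched in Python as follows. The data layout (`expr[j][k]` holding the expression of gene j in experiment k, and `knocked_out[k]` holding the index of the gene knocked out in experiment k), the use of the population standard deviation and the function names are all assumptions made for this illustration.

```python
from statistics import mean, pstdev


def zscore_weights(expr, knocked_out):
    """z[i][j]: effect of knocking out gene i on gene j, as in Equation (25)."""
    n = len(expr)
    z = [[0.0] * n for _ in range(n)]
    for j, profile in enumerate(expr):
        mu, sigma = mean(profile), pstdev(profile)
        if sigma == 0:
            continue  # a constant profile carries no information
        for k, i in enumerate(knocked_out):
            if i != j:
                z[i][j] = abs((profile[k] - mu) / sigma)
    return z


def zscore_generalized(profile):
    """Generalized score: the minimum of the profile stands in for X_jk."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return 0.0
    return abs((min(profile) - mu) / sigma)
```

In the generalized form the score depends only on the profile of the target gene, so it no longer identifies which regulator was perturbed.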
[...] parts is used as a test set and the remaining four parts serve as a training set. The total prediction accuracy is averaged over the prediction accuracies achieved during the five iterations (5-fold cross-validation).
Semi-supervised
Data describing regulatory networks are sparse and typically only a small fraction of the true interactions is known. The situation is even worse for negative data (non-interactions) because experimental validation largely aims to detect but not exclude interactions. The case that all samples within a training data set can be labeled as positive or negative is therefore [...]
[...] only a portion of the training samples is labeled. To enable the SVM training, which requires all samples to be labeled, all unlabeled samples within the semi-supervised training data are relabeled as negatives [10]. This approach enables a direct comparison of the same prediction algorithm trained with fully or partially labeled data.
We assigned different percentages (10%, ..., 100%) of true positive and negative or positive-only labels to the training set. The prediction performance of the different approaches was then evaluated by 5-fold cross-validation, with equal training/test set sizes for the supervised, semi-supervised, positives-only and [...]
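The labeling scheme can be illustrated with a small Python sketch. It keeps a given fraction of the true labels and relabels every remaining sample as a negative, and it also shows one way to build a feature vector from the outer product of two expression profiles. Flattening the outer product into a single vector, the random selection of labeled samples and the names `outer_product_features` and `relabel_for_training` are assumptions of this example and may differ from the evaluated implementation.

```python
import random


def outer_product_features(xi, xj):
    """Outer product of two profiles, flattened row by row (assumed layout)."""
    return [a * b for a in xi for b in xj]


def relabel_for_training(true_labels, fraction, positives_only, seed=0):
    """Keep `fraction` of the true labels; all other samples become negatives.

    true_labels maps a gene pair to 1 (interaction) or 0 (non-interaction).
    With positives_only=True, only positive samples are eligible to keep
    their label.
    """
    rng = random.Random(seed)
    if positives_only:
        candidates = sorted(s for s, y in true_labels.items() if y == 1)
    else:
        candidates = sorted(true_labels)
    kept = set(rng.sample(candidates, round(fraction * len(candidates))))
    # Unlabeled samples are relabeled as negatives (0).
    return {s: (y if s in kept else 0) for s, y in true_labels.items()}
```

Because unlabeled samples are treated as negatives, the output always assigns a label to every sample, which is what a standard SVM requires.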
[...] corrected Spearman method (SPEARMAN-C) achieving the highest mean AUC. However, when focusing on networks of specific size, the best performance is achieved by the EUCLID method for small networks with 10 nodes. Other methods also show different behaviors with respect to network size. Correlation methods clearly achieve higher AUCs for large networks. Similar trends can be observed for MR, MINE, GENIE, MRNET, MRNET-B and CLR. In contrast, SIGMOID, PCIT and MD decrease in prediction accuracy for growing network sizes, while the performance of RN and ARACNE is seemingly unaffected by network [...]
[...] different network sizes and sample numbers. SynTReN simulates expression data for multifactorial experiments only, and networks were extracted from E. coli. All experiments were repeated 10 times. The results show the expected trend of improving accuracy with increasing number of samples and decreasing size of network.
However, the absolute improvements in prediction accuracy are rather small with additional data, most likely because unsupervised methods can infer only simple network topologies reliably and small sample sets are sufficient for this purpose. For instance, networks with 50 nodes are predicted with [...]
Figure 4: Prediction accuracy (AUC) of unsupervised methods on multifactorial data for different network sizes (nodes). Data generated by GeneNetWeaver and extracted from E. coli and S. cerevisiae.
Figure 5: Prediction accuracy (AUC) averaged over all unsupervised methods on multifactorial data for different network sizes (nodes) and sample numbers. Data generated by SynTReN and extracted from E. coli. Ten repeats.
[...] unsupervised methods (Z-SCORE, SPEARMAN) in our evaluation of supervised methods.
Supervised and semi-supervised methods are labeled ‘SVM’ followed by the percentage of labeled data (10, 30, 50, 70, 90, 100%). The suffix ‘+’ indicates that only positive data were used and ‘-’ indicates that positive and negative data were used. For instance, ‘SVM-70’ describes an SVM trained on 70% of labeled data (positive and negative). All evaluations are 5-fold cross-validated and the complexity parameter C of the SVM was optimized via grid search (0.1 ... 100) for each training fold.
The results show good prediction accuracies for [...]
[...] data. Apparently, supervised methods can be trained effectively even when only a portion of network interactions (positives) is known.
Even with as little as 10% of known interactions, semi-supervised methods still outperform unsupervised methods for multifactorial data. The Z-SCORE method is still the top-performing method on knockout data, but supervised methods are not far behind and considerably outperform Spearman correlation. For knockdown data, the Z-SCORE method loses its top rank, and semi-supervised methods perform better when at least 70% of the data are labeled.
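The tuning of the SVM complexity parameter C within each training fold can be sketched as a grid search that never touches the held-out test samples. The sketch below makes several assumptions that are not stated in the text: the grid is taken to be powers of ten between 0.1 and 100, candidate values are compared by an inner cross-validation on the training samples, and `score_fn` is a hypothetical callback that would train an SVM with the given C and return a validation score such as the AUC.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test


def select_c(train_idx, score_fn, grid=(0.1, 1.0, 10.0, 100.0), k=5):
    """Return the C with the best mean inner-CV score on training data only.

    score_fn(c, fit_idx, val_idx) is a hypothetical callback supplied by the
    caller; it is expected to return a score where higher is better.
    """
    best_c, best_score = None, float("-inf")
    for c in grid:
        scores = []
        for inner_fit, inner_val in k_fold_indices(len(train_idx), k):
            fit_idx = [train_idx[i] for i in inner_fit]
            val_idx = [train_idx[i] for i in inner_val]
            scores.append(score_fn(c, fit_idx, val_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_c, best_score = c, mean_score
    return best_c
```

In an outer loop, `k_fold_indices` would provide the five training/test splits, `select_c` would be called on each training part, and the selected C would then be used to train the model that is evaluated on the corresponding test part.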
Figure 6: Prediction accuracy (AUC) of supervised methods on multifactorial, knockout, knockdown and averaged (all) data generated by GeneNetWeaver. Results are for 5-fold cross-validation and 10 repeats over networks with 30 nodes, extracted from E. coli. Error bars show standard deviation.
Figure 7: Prediction accuracy (AUC) of supervised and unsupervised methods for networks with 30 nodes
extracted from E. coli and S. cerevisiae. The first 30 samples of the corresponding experimental expression data are
used. AUCs are averaged over 10 repeats and error bars show standard deviation. Results for supervised methods
are 5-fold cross-validated.
[...] accuracy of the best-performing unsupervised methods (see Supplementary Material).
Marbach et al. [8] furthermore suggest that incorporating additional information such as transcription-factor binding and chromatin modification data or promoter sequences might improve the accuracy of prediction. We have performed evaluations with added transcription-factor binding data but could not improve prediction accuracies (data not shown).
Applications of inference methods
The inference of regulatory interactions from expression data has the potential to capture important associations between key genes regulating transcription [...]
[...] benchmarks, but methods that fail for simulated data are unlikely to succeed in the inference of real biological networks [21].
These examples illustrate how GRN inference can productively be combined with downstream analysis: a GRN is initially inferred from experimental data, features of interest are identified based on network topology and interesting features are then ranked by applying additional criteria derived from experimental evidence, functional annotation or literature. Both He et al. [39] and Della Gatta et al. [41] used GRNs to identify highly connected nodes as features of interest, and then either ranked these hubs or refined the networks by use of additional evidence. Interesting hubs prioritized in this way then form the basis of hypotheses that can be subjected to ex[...]
[...] trained on positively labeled data only, compared with training on positive and negative samples. Apparently semi-supervised methods can effectively be trained on partial interaction data, and non-interaction data are not essential.
These results have important implications for the application of network inference methods in systems biology. Even the best methods are accurate only for small networks of relatively simple topology, which means that large-scale or genome-scale regulatory network inference from expression data alone is currently not feasible. If inference methods are to be applied to data of the scale generated by [...]
Unsupervised inference methods are accurate only for networks with simple topologies.
An exception is the unsupervised Z-SCORE method that shows the highest accuracy on simulated knockout data.
Knockout data are more informative for network inference than knockdown or multifactorial data.
FUNDING
Australian Research Council (DP110103384 and CE0348221).
14. Meyer P, Kontos K, Lafitte F, Bontempi G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol 2007;2007:79879.
15. Altay G, Emmert-Streib F. Revealing differences in gene network inference algorithms on the network level by ensemble methods. Bioinformatics 2010;26(14):1738–44.
16. Lopes FM, Martins DC, Cesar RM. Comparative study of GRNS inference methods based on feature selection by mutual information. In: IEEE International Workshop on Genomic Signal Processing and Statistics. Guadalajara, JA, Mexico, 2009.
17. Haynes B, Brent M. Benchmarking regulatory network reconstruction with GRENDEL. Bioinformatics 2009;25(6):801–7.
18. Werhli A, Grzegorczyk M, Husmeier D. Comparative [...]
28. Reshef DN. Detecting novel associations in large data sets. Science 2011;334:1518–24.
29. Reverter A, Chan EK. Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks. Bioinformatics 2008;24:2491–7.
30. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.
31. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000;1:418–29.
32. Patrick M, Daniel M, Sushmita R, et al. Information-theoretic inference of gene networks using backward elimination. In: The 2010 International Conference on Bioinformatics and Computational Biology 2010.