Professional Documents
Culture Documents
NetCoMi Network Construction and Comparison For
NetCoMi Network Construction and Comparison For
NetCoMi Network Construction and Comparison For
https://doi.org/10.1093/bib/bbaa290
Method Review
Abstract 从高通量测序数据中估计微生物联合网络是一种常见的探索性数据分析方法,旨在了解微生物群落在其自然栖息地中的复杂相互作用
Motivation: Estimating microbial association networks from high-throughput sequencing data is a common exploratory
data analysis approach aiming at understanding the complex interplay of microbial communities in their natural habitat.
Statistical network estimation workf lows comprise several analysis steps, including methods for zero handling, data 由于微生物之间的相互作
normalization and computing microbial associations. Since microbial interactions are likely to change between conditions,用可能会在不同的条件下
发生变化,例如在健康人
e.g. between healthy individuals and patients, identifying network differences between groups is often an integral 和患者之间,识别群体之
secondary analysis step. Thus far, however, no unifying computational tool is available that facilitates the whole analysis 间的网络差异通常是一个
不可或缺的二次分析步骤
workf low of constructing, analysing and comparing microbial association networks from high-throughput sequencing data.
Results: Here, we introduce NetCoMi (Network Construction and comparison for Microbiome data), an R package that
integrates existing methods for each analysis step in a single reproducible computational workf low. The package offers
functionality for constructing and analysing single microbial association networks as well as quantifying network
该包提供了构建和分
析单个微生物协会网 differences. This enables insights into whether single taxa, groups of taxa or the overall network structure change between
络以及量化网络差异 groups. NetCoMi also contains functionality for constructing differential networks, thus allowing to assess whether single
的功能
pairs of taxa are differentially associated between two groups. Furthermore, NetCoMi facilitates the construction and
analysis of dissimilarity networks of microbiome samples, enabling a high-level graphical summary of the heterogeneity of
an entire microbiome sample collection. We illustrate NetCoMi’s wide applicability using data sets from the GABRIELA study
to compare microbial associations in settled dust from children’s rooms between samples from two study centers (Ulm and
Munich). Availability: R scripts used for producing the examples shown in this manuscript are provided as supplementary
Stefanie Peschel is a PhD student at the Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München (Neuherberg, Germany) with background
in statistics. Her current research focuses on understanding the role of the human microbiome for the development of asthma and allergies using network
analysis methods.
Christian L. Müller is group leader at the Institute of Computational Biology, Helmholtz Zentrum München (Munich, Germany), professor at the Department
of Statistics, LMU Munich (Germany), and researcher at the Center for Computational Mathematics, Flatiron Institute, New York. His research interests
include computational statistics, optimization, and microbiome data analysis.
Erika von Mutius, MD MSc, Professor of Pediatric Allergology, is head of the Asthma and Allergy Department at the Dr von Hauner Children’s Hospital
in Munich and leading the Institute of Asthma and Allergy Prevention at Helmholtz Zentrum München (Neuherberg, Germany). Her research group is
focusing on the determinants of childhood asthma and allergies.
Anne-Laure Boulesteix is an associate professor at the Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich (Germany).
Her activities mainly focus on computational statistics, biostatistics and research on research methodology.
Martin Depner is a postdoctoral research fellow at the Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München (Neuherberg, Germany)
with background in medical biometry. His current research focuses on studies of the microbiome (16S rRNA) and the protective effects of growing up on a
farm on asthma and allergies.
Submitted: 17 July 2020; Received (in revised form): 24 September 2020
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
1
2 S. Peschel et al.
Key words: compositional data; microbial association estimation; sample similarity network; differential association;
network analysis; network comparison. 成分数据;微生物关联度估计;样本相似网络;差异关联;网络分析;网络比较数据
Introduction
Current approaches for comparing networks between two 比较两种条件下的
The rapid development of high-throughput sequencing tech- conditions can be divided into two types: (i) differential association 网络的方法可分为
测序技术的发展(3种) 两种类型
niques [1–4] offers new possibilities for investigating the micro- analysis focusing on differences in the strength of single associ-
数据处理(1,2,3步) 网络构建(估计关联、建立邻接矩阵)
差异网络使用的是差异关联
网络分析 网络可视化
Figure 1. The proposed workflow for constructing, analysing and comparing microbial association networks, implemented in the R package NetCoMi. The main
framework (displayed as continuous lines) requires a n × p read count matrix as input. The data preparation step includes sample and taxa filtering, zero replacement
and normalization (step 1). Associations are calculated and stored in an adjacency matrix (step 2). Alternatively, an association matrix is accepted as input, from which
the adjacency matrix is determined. A more detailed chart describing step 2 is given in Figure 2. In step 3, network metrics are calculated, which can be visualized in the
network plot (step 4). If two networks are constructed (by passing a binary group vector, two count matrices or two user-defined association matrices to the function),
their properties can be compared (step 5). Besides the main workflow, a differential network can be constructed from the association matrix.
数据中过多的零点是 The excess number of zeros in the data is a major challenge (VSTs) [49] as well as the clr-transformation produce very similar
分析微生物组数据的
主要挑战,因为参数 for analysing microbiome data because parametric as well as Pearson correlation estimates, which are more consistent across
模型和非参数模型对 non-parametric models may become invalid for data with a large different sample sizes than TSS, CSS and COM methods.
于含有大量零点的数
据可能变得无效 amount of zeros [39]. Moreover, many compositionally aware
measures are based on so-called log-ratios. Log-ratios have been
proposed by Aitchison [45] as the basis for statistical analyses Measuring associations between taxa
of compositional data as they are independent of the total sum Association estimation is the next step in our workflow (step 2a
of counts m. More precisely, for two variables i and j the log- in Figure 1) to obtain statistical relations between the taxa. Com-
常见的关联测量包括
ratio of relative abundances log( xxi ) = log( ωωi /m
/m
) is equal to the log- mon association measures include correlation, proportionality 相关性、相称性和条
j j
ratio of the absolute abundances log( ωωi ) [28]. However, log-ratios 件依赖
j
and conditional dependence. For all three types of association,
cannot be computed if the count matrix contains any zeros, compositionally aware approaches have been proposed, which
making zero handling necessary (step 1b in Figure 1). Several are summarized in Table 3. Further information on these mea-
zero replacement strategies have been proposed [40–42, 44, 46]. sures is available in Supplementary material 1. To ensure wide
Table 1 gives an overview of the different types of zeros that applicability of NetCoMi, the package comprises also traditional
have been suggested as well as existing approaches for their association measures, which are not suitable for application on
treatment. read count data in their original form.为了确保NetCoMi的广泛适用性,该包还包括传统的关联
措施,这些措施不适合应用于原始形式的读取计数数据
需要使用归一化技术 Normalization techniques are required to make read counts
来使读取计数在不同 comparable across different samples [47, 48] (step 1c in Figure 1).
样本之间具有可比性 相关性
Correlations
The normalization approaches included in NetCoMi are summa-
rized in Table 2. A description of these methods is available in Compositionality implies that if the absolute abundance of a
[47, 48]. Note that forcing the read counts of each sample to a single taxon in the sample increases, the perceived relative
unique sum (as done with total sum scaling) does not change the abundance of all other taxa decreases. Applying traditional cor-
传统相关测量方法的
compositional structure. Rather, proportions are always com- relation measures such as Pearson’s correlation coefficient to 不足:成分影响
positional, even if the original data are not [11]. Aitchison [45] compositions can lead to spurious negative correlations, which
suggested using the centered log-ratio (clr) transformation to do not reflect underlying biological relationships. It has been
move compositional data from the simplex to real space. Badri shown that the resulting bias is stronger, the lower the diversity
et al. [47] have shown that variance-stabilizing transformations of the data [27].
4 S. Peschel et al.
为测序数据定义了不同的零类型,并采用了适当的处理方法:3种零值及其对应的方法
Table 1. Different zero types defined for sequencing data with appropriate approaches for their treatment (suggested by Martín-Fernández et
al. [36]), which are available in NetCoMi. The three approaches are implemented in R via the zCompositions package [37]. Shown are for each
treatment method a few key facts as well as the argument for applying it in NetCoMi.
Description “... component which is truly Assumed to be present in the sample but Sequencing data are seen as
zero, not something recorded as their proportion is too low to be detected [36]. categorical. For a zero count it is
zero simply because the assumed that, in this category, no
experimental design or the event occurred in this experimental
measuring instrument has not run but could occur in another run
been sufficiently sensitive to (e.g. with a different sequencing
detect a trace of the part.” [38] depth) [36].
Treatment • No general approach available Multiplicative imputation [40]: Bayesian-multiplicative treatment
methods so far [39] • Non-parametric [44]:
Table 2. Data normalization techniques implemented in NetCoMi. More detailed descriptions of these methods can be found in Badri et al. [47]
and McMurdie and Holmes [48]. The corresponding NetCoMi argument is normMethod with the available options: “none”, “TSS”, “CSS”, “COM”,
“rarefy”, “clr” and “VST”. NetCoMi中实施的数据标准化技术:选择下面其一
Total sum scaling (TSS) Traditional approach for building fractions. Counts are Strongly inf luenced by highly abundant taxa [47].
divided by the total sum of counts in the corresponding
sample.
Cumulative sum scaling Within each sample, the counts are summed up to a Aims at avoiding the inf luence of highly
(CSS) [50] predefined quantile. The counts are then divided by this abundant taxa [47].
sum.
Common sum scaling Counts are scaled according to the minimum sum of Originally suggested as alternative to rarefying as
(COM) [48] counts over all samples so that all samples have equal common normalization method [48].
library size (that of the sample with minimum overall
sum).
Rarefying [51] Subsampling from the data to obtain samples with equal
library size (also called rarefaction level) [48].
Centered log-ratio (clr) For a composition x = (x1 , ..., xp ) the clr transformation is Aims to avoid compositional effects.
transformation [45] 1
x1 xp p p
defined as clr(x) = log( g(x) , ..., g(x) ), where g(x) = x
k=1 k
is the geometric mean.
Variance stabilizing The dispersion-mean relation is fitted based on the count Aims at eliminating the dependence of the
transformation (VST) matrix. The data are then transformed to receive a matrix variance on the mean. VST and the clr
[49] that is approximately homoscedastic [49]. transformation produce very similar Pearson
correlations [47].
减少成分影响的一种 A possible approach to reduce compositional effects is to Popular compositionally aware correlation estimators 常用的成分感知相关
性估计器
可能的方法:以成分 normalize or transform the data in a compositionally aware include SparCC (Sparse Correlations for Compositional data)
感知的方式对数据进
行标准化或变换,并 manner and apply standard correlation coefficients to the [27], CCLasso (Correlation inference for Compositional data
将标准相关系数应用 transformed data. A common method is the clr transformation through Lasso) [53] and CCREPE (Compositionality Corrected
于变换后的数据
(Table 2), which is a variation of the aforementioned log-ratio by REnormalization and PErmutation), also called ReBoot
approach. Traditional correlation measures provided in NetCoMi method [18]. SparCC is one of the first approaches developed
NetCoMi提供的传统 are the Pearson correlation, Spearman’s rank correlation and the for inferring correlations for compositional data and has
相关性度量 biweight midcorrelation (bicor). Bicor as part of the R package become a widely used method [53]. However, SparCC has 3种方法的介绍
WGCNA [52] is more robust to outliers than Pearson’s correlation some limitations, namely that the estimated correlation matrix
because it is based on the median instead of the mean of is not necessarily positive definite and may have values
observations. outside [-1,1] [53]. Furthermore, the basic algorithm is repeated
NetCoMi 5
NetCoMi中可用于网络构建的成分感知关联措施概述
Table 3. Overview of compositionally aware association measures that are available for network construction in NetCoMi. The corresponding这些衡量标准按关联的
argument is named measure. The measures are grouped by the three types of association: correlation, proportionality and conditional三种类型进行分组:相
关性、相称性和条件依
dependence. For each measure, the assumptions stated in the corresponding publication are listed, together with a short summary of the赖性
approach.
Measure
Data transformation
R implementation Assumptions Approach
Correlation 相关性
SparCC [27] • Large number of taxa • Idea: estimating Pearson correlations from log-ratio
• Log-ratios • Sparse underlying network variances via an approximation based on the assumption
• R code based on package r-sparcc [60] (taxa sparsely correlated) that taxa are uncorrelated on average
• Zero handling: Bayesian approach (observations
Proportionality 对称性
var(log(x/y))
ρ [54] – ρ(log(x), log(y)) = 1 − var(log(x))
+ var(log(y))
• clr transformation
• R package propr [30]
SPIEC-EASI [63] • Large number of taxa • Idea: covariance matrix for clr-transformed data is
• Log-ratios • Sparse underlying network approximately equal to the “true” covariance matrix for
• R package SpiecEasi [63] p0
• Zero entries in the inverse covariance matrix −1
correspond to conditional independent taxa
• Approach: a graphical model is used to infer the
conditional dependence structure between each two taxa
• Two methods for graphical model inference: sparse
inverse covariance selection and neighborhood selection
gCoda [56] • True absolute abundances • log(y) ∼ N(μ, ) with = −1 representing conditional
• Relative abundances y follow a multivariate dependence relationships between taxa
• R code on GitHub [64] normal distribution • is determined by minimizing the negative
• Sparse underlying network log-Likelihood for (μ, ) plus l1 penalty (to satisfy sparsity
assumption)
• Optimization problem solved via Majorization-
Minimization algorithm
SPRING [55] • Sparse underlying network • is estimated using a semi-parametric rank based
• mclr transformation1 approach relying on a truncated Gaussian copula model
• R package SPRING [65] [66], which can deal with zeros in the data
• Conditional independence relationships are inferred
from using neighborhood selection [58]
• Graphical model selection via stability-based approach
1 clr transformation where only non-zero elements are included in the geometric mean [55].
iteratively to reinforce the sparsity assumption and to account variable model to infer a positive definite correlation matrix
for uncertainties due to random sampling, leading to high directly [53]. The CCREPE approach operates directly on relative
computational complexity. CCLasso is also based on log-ratios read counts, which are permuted and renormalized in order to
but aims to avoid the disadvantages of SparCC by using a latent detect correlations induced by compositionality alone.
6 S. Peschel et al.
对称度量 Proportionality (or vertices) represent the taxa. The different options available in
Lovell et al. [54] argue that correlations cannot be inferred from NetCoMi are illustrated in Figure 2 (Supplementary Figure S1 for
the relative abundances in a compositionally aware manner a more detailed version of this chart).
without any assumptions and propose proportionality as an Since the estimated associations are generally different from
alternative association measure for compositional data. If two zero, using them directly as adjacency matrix results in a dense
关联矩阵通常被稀
relative abundances are proportional, then their correspond- network, where all nodes are connected to each other and con- 疏以选择感兴趣的
ω
ing absolute abundances are proportional as well: ωmi ∝ mj ⇒ sequently only weighted network measures are meaningful. 边
ωi ∝ ωj . Thus, proportionality is identical for the observed Instead, the association matrix is usually sparsified to select
(relative) read counts and the true unobserved counts. Lovell et edges of interest.
al. [54] suggest proportionality measures based on the log-ratio One possible sparsification strategy consists of defining a
variance var(log xxi ), which is zero when ωi and ωj are perfectly cut-off value (or threshold) so that only taxa with an absolute 一种可能的稀疏化策
j 略包括定义一个分割
proportional. This variance, however, lacks a scale that would association value above this threshold are connected [27, 67]. 值(或阈值),以便只
This filtering method is available for all types of association. 有绝对关联值高于该
建议的比例度量φ和 make the strength of association comparable. For this reason, 阈值的分类群被连接
稀疏化方法(使网络关注
我们感兴趣的连接) 差异测量 网络
not negative associations should be included in the unweighted degree, betweenness, closeness and eigenvector centrality
network (Figure S1). (Table 4). Using these measures, so-called hubs (or keystone
我们还包括通过 Furthermore, we included adjacency matrix constructions taxa) can be determined. This analysis step is of high interest
WGCNA包[46,52]中 via the soft-thresholding approach from the WGCNA package to researchers since hub nodes may correspond to taxa with a 什么被认为是“重要
的软阈值方法构建邻
接矩阵,该方法仅适 [46, 52], which is only available for correlations (see blue particularly important role in the microbial community. What 的”,取决于科学背
景
用于相关性(参见图2 path in Figure 2). Rather than using hard thresholding (e.g. is deemed “important”, depends on the scientific context. A
中的蓝色路径)
via cutoff value or statistical tests), Zhang and Horvath [46] selection of definitions for hub nodes is given in Table 4. 表4:集线器节点的定义选择。
suggested raising the estimated correlations to the power of Clustering methods are appropriate to identify functional
a predefined value greater than 1 to determine the adjacencies groups within a microbial community. A cluster (or module)
(Supplementary material 2.2 for more details). Small correlation is a group of nodes that are highly connected to one another
values are thus pushed toward zero becoming less important but have a small number of connections to nodes outside their
in the network. The resulting similarities are used as edge module [67]. The user of NetCoMi can choose between one of NetCoMi的用户可以在
提供的群集方法之一
weights for network plotting and network metrics based the clustering methods provided by the igraph package [24] 和分级群集之间进行
选择
on connection strength. Note that this approach leads to and hierarchical clustering (R package hclust() from stats
a fully connected network. Following Zhang and Horvath package). Both similarities and dissimilarities can be used for
[46], only TOM dissimilarities are available in NetCoMi for clustering (Table 4).
computing shortest paths and clusters when soft-thresholding Global network properties are defined for the whole
is used. network and offer an insight into the overall network structure. 全局网络属性
Typical measures are the average path length, edge and
Network analysis vertex connectivity, modularity, and the clustering coefficient
NetCoMi提供了分析 We analyse a constructed network by calculating network (Table 4).
和可视化单个网络的
可能性:网络汇总度 summary metrics (step 3 in Figure 1), which are amenable to
量 group comparisons. Alternatively, NetCoMi offers the possibility
Sample similarity networks
to analyse and visualize single networks.
Several network statistics require shortest path calculations. Using dissimilarity between samples rather than association
一些网络统计数据需
要计算最短路径 A path between two vertices v0 and vk is a sequence of edges measures among taxa leads to networks where nodes rep-
connecting these vertices such that v0 and vk are each at one end resent subjects or samples rather than taxa. Analogous to 定义解释
of the sequence and no vertices are repeated [78]. The length of a microbiome ordination plots, these sample networks express how
path is the sum of edge weights, where the weight of an edge similar microbial compositions between subjects are and thus
is a real non-negative number associated with this edge [78]. provide insights into the global heterogeneity of the microbiome
The shortest path between two nodes is the path with minimum sample collection.
边权重以两种方式定义length. In NetCoMi, edge weights are defined in two ways: (i) The following dissimilarity measures are available in NetCoMi中提供了以下
for properties based on shortest paths, dissimilarity values are NetCoMi: Euclidean distance, Bray-Curtis dissimilarity [90], 不同度量
used implying that the path length between two taxa is shorter Kullback-Leibler divergence (KLD) [91], Jeffrey’s divergence [92],
the higher the association between the taxa is. (ii) For properties Jensen-Shannon divergence [93], compositional KLD [94, 95], and
based on connection strength, the corresponding similarities are Aitchison distance [96]. Details are available in Table S3. Only
used as edge weights (Section 2.3 and Figure S1 for details on Aitchison’s distance and the compositional KLD are suitable 只有Aitchison距离和
distance and similarity calculation). for application on MGS data whereas the others may induce 成分KLD适合应用于
MGS数据,而其他方法
Network centrality measures offer insights into the role of compositional effects when applied to raw count data without 在应用于未经适当变
individual taxa within the microbial community. We consider appropriate transformation (Table 2). 换的原始计数数据时
可能会引起成分效应
网络中心性测量提供了对单个分类群在微生物群落中的作用的洞察
8 S. Peschel et al.
在NetCoMi中实施的本地和全局网络属性
Table 4. Local and global network properties implemented in NetCoMi. The table shows for each measure a short explanation and whether
it is based on connection strength (similarity) or dissimilarity/distance (e.g. two taxa with a high correlation have a high connection strength,
but their distance is low). 测量的简短解释,以及它是基于连接强度(相似性)还是基于不同/距离
Shortest path Sequence of edges connecting two nodes with minimum sum of edge weights [78]. Dissimilarity
Degree centrality 中心程度 Number of adjacent nodes [79]. Options: weighted/unweighted. 周围邻接节点数(越多说明越“中心”) Similarity
Betweenness centrality Fraction of times a node lies on the shortest path between all other nodes [79]. ⇒ A central node has the Dissimilarity
ability to connect sub-networks [67]. 一个节点位于所有其他节点之间的最短路径上的次数
Closeness centrality Reciprocal of the sum of shortest paths between this node and all other nodes [79]. ⇒ The node with Dissimilarity
highest closeness centrality has the minimum shortest path to all other nodes. 此节点与所有其他节点之间的最短路径之和的倒数
Eigenvector centrality Calculated via eigenvalue decomposition: Ac = λc, where λ denotes the eigenvalues and c the eigenvectors Similarity
特征向量
of the adjacency matrix A. Eigenvector centrality is then defined as the i-th entry of the eigenvector
Average path length Arithmetic mean of all shortest paths between vertices in a network [83].网络中顶点之间所有最短路径的算术平均值 Dissimilarity
Global clustering Proportion of triangles with respect to the total number of connected triples2 [83]. ⇒ Expresses how likely Similarity
coefficient 全局聚类系数 the nodes are to form clusters [83]. For weighted networks, the definition according to Barrat et al. [85] is
used in NetCoMi. 表示节点形成集群的可能性
Modularity Expresses how well the network is divided into communities (many edges within the identified clusters Depends on the
and only a few between them) [86]. 表示网络划分为社区的程度(识别的聚类内有多条边,而它们之间只有几条边) clustering algorithm.
Edge / vertex connectivity Minimum number of edges, or vertices (nodes) that need to be removed to disconnect the network, Presence/ absence of
respectively [87]. Not meaningful for a fully connected network. 分别需要删除以断开网络连接的最小边数或顶点(节点) an edge
Density Ratio of the actual number of edges in the network and the possible number of edges [87]. Not meaningful Presence/ absence of
for a fully connected network. 网络中的实际边数与可能的边数之比 an edge
Clustering
Hierarchical clustering Hierarchical clustering using the R function hclust() from stats package [69]. Different methods are Dissimilarity
层次聚类
provided (e.g. single, complete, and average linkage, Ward’s method). The cutree() function (stats) is
used for cutting the resulting tree. 进行分层聚类
Modularity clustering [88] The modularity measure is maximized over all possible clusterings.在所有可能的集群上最大化模块化度量 Similarity
模块化聚类
Fast greedy modularity Modularity clustering via a fast greedy algorithm, which is suitable for very large networks. Similarity
通过快速贪婪算法实现模块化分簇,适用于超大型网络
optimization [86]快速贪婪模块化优化
Clustering based on edge Idea: Edges with high edge betweenness (number of shortest paths leading through an edge) tend to divide Dissimilarity
betweenness [89] the network into clusters. Approach: Hierarchical clustering, where edges with high edge betweenness are
基于边缘介数的聚类方法 removed gradually from the network. 想法:具有较高边介间度的边(通过边的最短路径的数量)往往会将网络划分为多
个簇。方法:分层聚类,将具有较高边缘介数的边逐渐从网络中移除
1 Not to be confused with “centralization” [79], which is a global network measure expressing the degree to which the centrality of the most central node exceeds the
centrality of all other nodes in the network. 2 A connected triple is a group of three nodes with two or three edges. A triangle is a triple where all three nodes are
connected.
in (Section 2) are performed for both subsets separately. The Table 5. Test procedures for identifying differential associations
estimated associations, as well as network characteristics, are available in NetCoMi. Fisher’s z-test and the Discordant method
are based on correlations that are Fisher-transformed into z-values.
then compared using the methods described below.
These two methods are thus not suitable for the other association
measures included in NetCoMi.
Differential network analysis Test method Originally designed for... In NetCoMi used for...
We next detail NetCoMi’s differential network analysis capa- Fisher’s z-test Correlation Correlation
bilities. All approaches are applicable for microbial association Non-parametric “connectivity • Correlation
networks and sample similarity networks, respectively. test [105] scores”[105]: • Proportionality (due to
所有方法分别适用于微生物联合网络和样本相似性网络
• Correlation its similarity to
Permutation tests 排列测试 • Partial correlation correlations)
• Partial least squares • Conditional
使用排列测试为每个 NetCoMi uses permutation tests to assess for each centrality based scores dependence / partial
中心性度量和分类单 measure and taxon whether the calculated centrality value is
usable) NetCoMi functions together with their main arguments a method for zero handling, normalization and sparsification
is given in Figure 3. In the following, we use the SPRING could be chosen (Figure S1 and Table S1). Since these steps are
method as a measure of partial correlation for constructing already included in SPRING, they are skipped in our example.
exemplary microbial association networks and Aitchison’s If associations have already been estimated in advance, the
distance as a dissimilarity measure for constructing sample external association matrix can be passed to netConstruct()
similarity networks. The construction of a differential network is instead of a count matrix. In this mode, the user only
described in Supplementary material 5.5. We also demonstrate needs to define a sparsification method and a dissimilarity
the integration of external networks into NetCoMi’s workflow function.
using BAnOCC-derived microbial association networks [109] Besides the available options for transforming the associa-
(Supplementary material 6). In Supplementary material 7, tions into dissimilarities (Supplementary Figure S1), the function
we investigate the variability of microbial networks among also accepts a user-defined dissimilarity function. The choice of
different network construction methods. The results reveal dissimilarity function also influences the handling of negative
strong differences between the networks based on different associations. In Figure 4, we use the “signed” distance metric,
association measures as well as different normalization meth- where strong negative correlations lead to a high dissimilar-
ods whereas the zero replacement methods available in NetCoMi ity (and thus to a low edge weight in the adjacency matrix).
lead to quite similar networks, if the remaining arguments Figure S4 shows a network plot where the “unsigned” metric
are fixed. is used.
netConstruct() returns an object of class microNet, which
can be directly passed to netAnalyze() to compute network
Constructing a single microbial network
properties (Table S5). Applying the plot function to the output
In principle, the netConstruct() function allows the specifica- of netAnalyze() leads to a network visualization, as shown in
tion of any combination of methods for association estimation, Figure 4A. In this plot, network characteristics are emphasized
zero treatment and normalization. However, since NetCoMi is in different ways: hubs are highlighted, clusters are marked by
mainly designed to handle compositional data, a warning is different node colors, and node sizes are scaled according to
returned if a chosen combination is not compositionally aware. eigenvector centrality. Alternatively, node colors and shapes can
To generate the network shown in Figure 4A, we pass represent features such as taxonomic rank. Node sizes can be
过滤器参数的设置方 the combined count matrix containing dust samples from scaled according to other centrality measures or absolute/rel-
式是,在分析中只包 Ulm and Munich to netConstruct(). Filter parameters are
括最频繁的100个分 ative abundances of the corresponding taxa. Node positions
类群,从而为我们的 set in such a way that only the 100 most frequent taxa are are defined using the Fruchterman-Reingold algorithm [110].
数据生成1022×100
个读取计数矩阵 included in the analyses leading to a 1022 × 100 read count This algorithm provides a force-directed layout aimed at high
matrix for our data. Depending on the association measure, readability of the network by placing pairs of nodes with a high
NetCoMi 11
absolute edge weight close together and those with low edge group difference, and also includes betweenness and closeness
weight further apart. centrality. For none of the four centrality measures, any sig-
The plot() method implemented in NetCoMi includes sev- nificant differences are observed, confirming the descriptive
eral options for selecting nodes or edges of interest to facil- analyses. Pseudomonas, Pedobacter and Rikenellaceae RC9 gut group
itate the readability of the network plot without influencing as hub taxa in only one of the networks show high differences in
the calculated network measures. In Figure 4B, for instance, we eigenvector centrality. However, the differences are not deemed
display the 50 bacteria with highest degree, facilitating network significant (Table 6). Global network properties (upper part of
interpretability. 在图4B中,我们显示了度数最高的50个细菌,便于网络解释。 Table 6) are also not significantly different for α = 0.05.
NetCoMi identified four clusters and the following five hub Table 7 summarizes Jaccard indices expressing the similarity
nodes: Aerococcus, Atopostipes, Brachybacterium, Rikenellaceae RC9 of sets of most central nodes and the hub nodes among the
gut group and [Eubacterium] coprostanoligenes group. The strongest two centers. They do not imply any group differences, which
positive association is between Ezakiella and Peptoniphilus with a would be indicated by a small probability P(J ≤ j). Similarly,
partial correlation of 0.55 (in the green cluster in Figure 4A). The the adjusted Rand index (0.752, P-value = 0) indicates a high
strongest negative correlation is -0.09 between Alistipes (in the similarity of the two clusterings, which is also highlighted
blue cluster) and Lactobacillus (in the red cluster). in Figure 5.
Table 6. Results from testing global network metrics and centrality measures of the networks in Figure 5 for group differences (via permutation
tests using 1000 permutations). Shown are respectively the computed measure for Munich and Ulm, the absolute difference, and the P-value for
testing the null hypothesis H0 : |diff| = 0. For the centrality measures, P-values are adjusted for multiple testing using the adaptive Benjamini-
Hochberg method [73], where the proportion of true H0 is determined according to Langaas et al. [74]. For degree and eigenvector centrality,
the five genera with the highest absolute group difference are shown. The centralities are normalized to [0,1] as described in Table 4. Highly
different eigenvector centralities (even if not significant) describe bacteria with highly different node sizes in the network plots in Figure 5
such as Pseudomonas, which is a hub in Munich and much less important in Ulm.
Table 7. Jaccard index values corresponding to the networks shown in Figure 5. Index values j express the similarity of the sets of most
central nodes and also of the sets of hub taxa between the two networks. “Most central” nodes are those with a centrality value above the
empirical 75% quantile. Jaccard’s index is 0 if the sets are completely different and 1 for exactly equal sets. P(J ≤ j) is the probability that
Jaccard’s index takes a value less than or equal to the calculated index j for the present total number of taxa in both sets (P(J ≥ j) is defined
analogously).
j P(J ≤ j) P(J ≥ j)
and dissimilarity transformations based on topological overlap. penalized marginal correlations (such as SparCC) and partial
The package includes a dedicated list of methods for handling correlations (such as SPRING) are statistically difficult to com-
excess zeros in the count matrix and for data normalization in pare, thus requiring special modifications which, in turn, limits
order to account for the special characteristics of the underlying cross-study comparisons (e.g. Yoon et al. [55] where SparCC
MGS data prior to association estimation. correlations are transformed into SparCC partial correlations).
A unique feature of the NetCoMi framework is its ability to Moreover, the sole focus on edge recovery may obfuscate other
perform differential network and differential association anal- aspects of correct model recovery, including the shape of degree
NetCoMi框架的一个独 ysis in a statistically principled fashion. Differential network distributions [28] or detecting hub nodes. For instance, methods
特功能是它能够以统 analysis allows not only to uncover the global role of a taxon with a similar edge recovery precision may greatly vary regarding
计原理的方式执行差
分网络和差分关联分 in the overall network structure but also its changing influence their ability to determine hub nodes [117]. Thus, if the correct
析
under varying conditions. Differential association analysis [31], detection of hub nodes is of major interest to the user, present
on the other hand, can directly assess which associations signif- comparative microbial network studies will give little guidance.
icantly change across conditions, providing concrete hypotheses Finally, we posit that simulation studies accompanying arti- 最后,我们假设,伴
随着介绍新方法的文
the sparsified structure of a network will likely depend on the Zurich), Dick Heederik (Utrecht University) and Jon Genuneit
number of available samples. A comprehensive understand- (Ulm University and Leipzig University) who were responsi-
ing between number of samples, sparsification and network ble for design and implementation of the GABRIELA study as
因此,整合要求较低
的置换测试替代方案 structure is currently elusive. NetCoMi relies on permutation well as Rob Knight and Georg Loss (both UC San Diego) who
将是NetCoMi.ons的一 tests for several statistical tests. The lower limit of P-values
个受欢迎的功能 created the 16S rRNA data used in our examples. We would
arising from permutation tests is directly related to the num- like to thank Alethea Charlton (LMU Munich) for language
ber of available permutations (Section 4). This already required editing and her helpful comments. We also thank Anastasiia
proper adjustment of the calculation for small numbers [126, Holovchak (LMU Munich) for testing the R package and
127]. For extended simulation studies, NetCoMi’s dependence on
editing its documentation.
permutation tests may prove to be computer intensive. Thus,
integration of less demanding alternatives to permutation tests
would represent a welcome feature in NetCoMi.
最后,尽管在我们的R Finally, despite incorporating a comprehensive list of meth- References
包中包含了一个完整
18. Faust K, Sathirapongsasuti JF, Izard J, et al. Microbial co- Girona, Spain: University of Girona, 2003;http://eprints.
occurrence relationships in the human microbiome. PLoS gla.ac.uk/159351/.
Comput Biol 2012; 8:e1002606. 39. Xia Y, Sun J, Chen DG. Statistical Analysis of Microbiome Data
19. Layeghifard M, Li H, Wang PW, et al. Microbiome networks with R. Springer, 2018.
and change-point analysis reveal key community changes 40. Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V.
associated with cystic fibrosis pulmonary exacerbations. Dealing with zeros and missing values in compositional
NPJ Biofilms and Microbiomes 2019; 5. data sets using nonparametric imputation. Mathematical
20. Liu Z, Ma A, Mathé E, et al. Network analyses in microbiome Geology 2003; 35:253–78.
based on high-throughput multi-omics data. Brief Bioinform 41. Palarea-Albaladejo J, Martín-Fernández JA, Gómez-García
2020; 00:1–17. doi: 10.1093/bib/bbaa005. J. A parametric approach for dealing with composi-
21. Ma B, Wang Y, Ye S, et al. Earth microbial co-occurrence net- tional rounded zeros. Mathematical Geology 2007; 39:
work reveals interconnection pattern across microbiomes. 625–45.
Microbiome 2020; 8:1–12. 42. Palarea-Albaladejo J, Martín-Fernández J. A modified EM
59. Friedman J, Hastie T, Tibshirani R. Sparse inverse covari- 81. Ruhnau B. Eigenvector-centrality – a node-centrality? Social
ance estimation with the graphical lasso. Biostatistics 2008; Networks 2000; 22:357–65.
9:432–41. 82. Bolland JM. Sorting out centrality: an analysis of the per-
60. Filosi M. R package computes correlation for relative abundances. formance of four centrality models in real and simulated
https://github.com/MPBA/r-sparcc, 2017. networks. Social Networks 1988; 10:233–53.
61. Fang H. CCLasso: Correlation Inference for Compositional 83. Junker BH, Schreiber F. Analysis of biological networks. New
Data through Lasso. https://github.com/huayingfang/CCLa Jersey: John Wiley & Sons, 2008.
sso, 2016. 84. Agler MT, Ruhe J, Kroll S, et al. Microbial hub taxa link host
62. Schwager E, Bielski C, George W. ccrepe: ccrepe_and_nc.score, and abiotic factors to plant microbiome variation. PLoS Biol
2019, R package version 1.18.1. 2016; 14:e1002352.
63. Kurtz ZD, Müller CL, Miraldi E, et al. SpiecEasi: Sparse Inverse 85. Barrat A, Barthélemy M, Pastor-Satorras R, et al. The archi-
Covariance for Ecological Statistical Inference, 2019, R package tecture of complex weighted networks. Proc Natl Acad Sci
version 1.0.6. 2004; 101:3747–52.
104. Yu D, Zhang Z, Glass K, et al. New statistical methods 116. Hirano H, Takemoto K. Difficulty in inferring microbial
for constructing robust differential correlation networks community structure based on co-occurrence network
to characterize the interactions among microRNAs. Sci Rep approaches. BMC Bioinformatics 2019; 20:329.
2019; 9:1–12. 117. Röttjers L, Faust K. From hairballs to hypotheses–biological
105. Gill R, Datta S, Datta S. A statistical framework for differ- insights from microbial networks. FEMS Microbiol Rev 2018;
ential network analysis from microarray data. BMC Bioin- 42:761–80.
formatics 2010; 11:95. 118. Boulesteix AL, Lauer S, Eugster MJ. A plea for neutral com-
106. Siska C, Bowler R, Kechris K. The discordant method: a parison studies in computational sciences. PloS One 2013;
novel approach for differential correlation. Bioinformatics 8:e61562.
2016; 32:690–6. 119. Boulesteix AL, Binder H, Abrahamowicz M, et al. On the
107. Genuneit J, Büchele G, Waser M, et al. The GABRIEL necessity and design of studies comparing statistical meth-
advanced surveys: study design, participation and evalu- ods. Biom J 2018; 60:216–8.
ation of bias. Paediatr Perinat Epidemiol 2011; 25:436–47. 120. Rowan-Nash AD, Korry BJ, Mylonakis E, et al. Cross-domain