Mitochondria branch within Alphaproteobacteria

Lu Fan1,2,¶, Dingfeng Wu3,¶, Vadim Goremykin4,¶, Jing Xiao3, Yanbing Xu3, Sriram Garg5, Chuanlun Zhang2,6, William F. Martin5,*, Ruixin Zhu2,3,*

1 Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
2 Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
3 Department of Bioinformatics, Tongji University, Shanghai 200092, China
4 Research and Innovation Centre, Fondazione E. Mach, 38010 San Michele all’Adige (TN), Italy
5 Institute of Molecular Evolution, Heinrich-Heine-University, Universitätsstr. 1, 40225 Düsseldorf, Germany
6 Laboratory for Marine Geology, Qingdao Pilot National Laboratory for Marine Science and Technology, Qingdao, 266061, China

¶ These authors contributed equally to this work.
* Corresponding authors:
William F. Martin (bill@hhu.de)
Ruixin Zhu (rxzhu@tongji.edu.cn)

It is well accepted that mitochondria originated from an alphaproteobacterial-like ancestor. However, the phylogenetic relationship of the mitochondrial endosymbiont to extant alphaproteobacteria remains a subject of discussion. The focus of much debate is whether the affiliation between mitochondria and fast-evolving alphaproteobacterial lineages reflects true homology or artifacts. Approaches such as protein recoding and site exclusion have been claimed to mitigate compositional heterogeneity between taxa, but this comes at the cost of information loss, and the reliability of such methods has so far not been justified. Here we demonstrate that site-exclusion methods produce erratic phylogenetic estimates of mitochondrial origin. Thus, previous phylogenetic hypotheses on the origin of mitochondria based on pretreated datasets should be re-evaluated. We applied alternative strategies to reduce phylogenetic noise by taxon replacement and selective exclusion while keeping site substitution information intact. Cross-validation based on a series of trees placed mitochondria robustly within Alphaproteobacteria, sharing an ancient common ancestor with Rickettsiales and currently unclassified marine lineages.

The origin of mitochondria is one of the defining events in the history of life. Gene-network analyses1-4 and marker gene-based phylogenomic inference have generally reached a consensus that mitochondria have an alphaproteobacterial common ancestor5. However, the exact relationship of mitochondria to specific alphaproteobacterial groups remains debated. Phylogenetic placement of mitochondria in the tree of Alphaproteobacteria has been extremely difficult for several reasons, including strong phylogenetic artifacts that associate mitochondria with some fast-evolving alphaproteobacterial lineages such as Rickettsiales and Pelagibacterales, resulting in erroneous clade formations (Supplementary Discussion).

To minimize the possible influence of long-branch attraction coupled with convergent compositional signals, various strategies have been applied, such as the use of nucleus-encoded mitochondrial genes4,6,7, site or gene exclusion8-10, protein recoding10 and the use of heterogeneity-tolerant models6,11. These attempts have produced contradictory hypotheses (Supplementary Discussion). Recently, Martijn et al. revisited the topic and reported that when compositional heterogeneity of the protein sequence alignments was reduced by excluding sites from the amino acid alignment, the entire alphaproteobacterial class formed a sister group to mitochondria12. Their conclusion is at odds with the long-agreed phylogenetic consensus that mitochondria originated from within the Alphaproteobacteria13. However, while excluding possible noise in compositionally heterogeneous sites might mitigate systematic errors, it will also necessarily lead to some loss of phylogenetic information. A priori, one cannot rule out the possibility that excluded sites contain signals of a true evolutionary connection between mitochondria and Alphaproteobacteria. A similar concern was voiced by Gawryluk14. Here, we examined the phylogenetic affiliations of mitochondria by using several site-exclusion methods and demonstrated that their results should be interpreted with utmost caution. To avoid arbitrary effects and model overfitting caused by site exclusion, we then applied a straightforward approach that significantly reduces compositional signals in the dataset: using GC-rich mitochondrial sequences, which fit the model better in their natural state, while keeping the native alignment intact.

To cross-validate the effects of site-exclusion approaches on mitochondrial and alphaproteobacterial phylogeny, we implemented five metrics based on different principles: Stuart’s test, Bowker’s test, χ²-score, ɀ-score and Fast-evolving (Supplementary Table 1). Site-excluded subsets of the ‘24-alphamitoCOGs’ dataset of Martijn et al. (2018) were generated with the five methods over a series of cutoff values (Supplementary Table 2). Trees of the subsets were compared to the tree of the untreated dataset by calculating topological dissimilarity. Site-exclusion approaches led to substantial changes in tree topology (Fig. 1). In general, the more sites were removed, the more the tree topology changed. Among the methods, ɀ-score generally caused the least change in nearly all alignment subsets. These patterns were consistent whether simple or mixed phylogenetic models were applied. Of these trees, nearly half placed mitochondria as sister to the entire Alphaproteobacteria (‘mito-out’) and the other half placed mitochondria within Alphaproteobacteria (‘mito-in’) (Fig. 1). The site-exclusion method, the number of sites excluded and the tree model applied had mixed effects on the inferred relationship of mitochondria to Alphaproteobacteria. One explanation for this observation is that sites strongly supporting either the ‘mito-in’ or the ‘mito-out’ topology were randomly excluded by these metrics. The absence of certain topology-determining, ‘mito-in’-supporting sites could have shifted the tree from one topology to the other, while the further loss of ‘mito-out’-supporting sites might have shifted it back. Notably, while we reproduced the result of Martijn et al. (2018) that the tree topology shifted from ‘mito-in’ to ‘mito-out’ when 5% to 40% of sites were removed using the χ²-score metric, excluding more sites (60% in total here) changed the topology predicted by the simple model back to ‘mito-in’ (Fig. 1a), strongly suggesting that their conclusion was based on an arbitrary parameter setup.
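The ‘mito-in’ versus ‘mito-out’ distinction can be stated as a simple rule on a rooted tree. The following is a minimal sketch rather than the procedure used in this study: it assumes trees are given as nested Python tuples of leaf names, and the function names and toy taxon labels are hypothetical. It labels a tree ‘mito-out’ when the smallest clade containing all alphaproteobacteria contains no mitochondria, which is a simplification of the sister-group criterion.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    found = set()
    for child in tree:
        found |= leaves(child)
    return found


def smallest_clade(tree, taxa):
    """Return the smallest subtree that contains every name in `taxa`."""
    if isinstance(tree, str):
        return tree
    for child in tree:
        if taxa <= leaves(child):
            return smallest_clade(child, taxa)
    return tree


def classify(tree, alpha_taxa, mito_taxa):
    """Label a rooted tree as 'mito-in' or 'mito-out' (simplified rule)."""
    alpha_clade = leaves(smallest_clade(tree, set(alpha_taxa)))
    return "mito-in" if alpha_clade & set(mito_taxa) else "mito-out"


# Hypothetical toy trees: a1..a3 are alphaproteobacteria, mt is mitochondria.
tree_out = ("outgroup", ("mt", (("a1", "a2"), "a3")))
tree_in = ("outgroup", (("mt", "a1"), ("a2", "a3")))
print(classify(tree_out, ["a1", "a2", "a3"], ["mt"]))
print(classify(tree_in, ["a1", "a2", "a3"], ["mt"]))
```

A rule of this kind makes it straightforward to tabulate how the label flips as the proportion of excluded sites changes.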

Fig. 1 | Tree dissimilarity, based on the Alignment metric, between the untreated tree and trees generated after applying site-exclusion approaches. All trees are rooted. Empty dots show trees supporting the Alphaproteobacteria-sister topology and filled dots show trees supporting the within-Alphaproteobacteria topology of mitochondria. a, ML trees under simple models. b, ML trees under the mixed model (C60).

To counter compositional heterogeneity without arbitrarily compromising phylogenetic signal, we replaced the mitochondrial and Rickettsiales sequences in the ‘24-alphamitoCOGs’ dataset with GC-rich alternatives, which markedly reduced the heterogeneity in FYMINK/GARP ratio between mitochondria and slowly evolving alphaproteobacteria (Supplementary Fig. 1). A new dataset, ‘18-alphamitoCOGs’, comprising 61 non-redundant taxa and 18 marker proteins, was generated for downstream phylogenetic inference (Supplementary Table 3, 4, and Supplementary Discussion).

The tree topology of Alphaproteobacteria themselves is an issue in its own right15. When fast-evolving taxa were excluded, the alphaproteobacteria could be classified into four major clades: Alpha I, Alpha II, Alpha III and GT (Fig. 2ab, Supplementary Fig. 2, 3, Supplementary Table 3, Supplementary Discussion). We refer to these alphaproteobacteria as the ‘backbone’ taxa. Regardless of the addition or removal of fast-evolving taxa, all four backbone clades remained monophyletic (Fig. 2; tree files are deposited in Supplementary Data Files).

Fig. 2 | Schematic phylogenetic trees of subgroups of Alphaproteobacteria in the ‘18-alphamitoCOGs’ dataset. Alphaproteobacterial lineages are named according to Supplementary Table 3. Taxa and taxonomic groups in black represent the backbone taxa. Filled dots show node support values greater than 80%, while empty dots show values greater than 50% but less than 80%. Node values show posterior probability support for Bayesian trees and bootstrap support based on 1000 iterations for ML trees. Trees are rooted. Outgroup taxa and Magnetococcus marinus MC-1 are not shown. The Maxdiff values of Bayesian trees are shown beside the trees. a-l, Schematic trees of Supplementary Fig. 2-13, respectively.

Each of the six fast-evolving groups was then added in turn, and a series of phylogenetic trees was built. Our approach based on this new dataset not only provided results congruent with the most recent study15 but also yielded novel findings for the most difficult lineages (Supplementary Discussion). Specifically, Holosporales were placed either in Alpha III, forming a sister relationship with Azospirillaceae and Acetobacteraceae (Alpha IIIb), or as sister to the entire Alpha III (Fig. 2cd, Supplementary Fig. 4, 5). Pelagibacterales branched after Alpha Ia and before Alpha Ib (Fig. 2ef, Supplementary Fig. 6, 7). Alphaproteobacterium HIMB59 was placed in Alpha IIb as sister to MarineAlpha12 Bin1 (Fig. 2gh, Supplementary Fig. 8, 9). Rickettsiales branched as sister to the clade of Alpha II and Alpha III in the ML tree, with weak basal node support, but within Alpha II as the sister of MarineAlpha9 Bin5 in the Bayesian tree, suggesting a possible connection between Rickettsiales and this newly discovered, non-fast-evolving marine alphaproteobacterium (Fig. 2ij, Supplementary Fig. 10, 11). Moreover, fast-evolving MAGs belonging to FEMAG I and FEMAG II were robustly placed within Alpha IIb (Fig. 2kl, Supplementary Fig. 12, 13). Specifically, FEMAG I showed a strong connection to MarineAlpha9 Bin5, while FEMAG II was linked to MarineAlpha12 Bin1 in the Bayesian tree.

We then added mitochondria to the trees of backbone taxa, either alone or in combination with other fast-evolving clades (tree files are deposited in Supplementary Data Files). Mitochondria by themselves were placed within Alphaproteobacteria as the sister of Alpha II and Alpha III in the ML tree, with weak node support (Fig. 3a, Supplementary Fig. 14). However, the counterpart Bayesian tree could not resolve the relationship of mitochondria to taxa of the four alphaproteobacterial backbone clades (Fig. 3b, Supplementary Fig. 15). Similar results were observed in trees including mitochondria in combination with Holosporales, Pelagibacterales or alphaproteobacterium HIMB59 (Fig. 3c-h, Supplementary Fig. 16-21), suggesting that all these lineages have little evolutionary affinity to mitochondria. In contrast, an apparent phylogenetic connection of mitochondria to Rickettsiales and FEMAG II was observed in both ML and Bayesian trees (Fig. 3i-l, Supplementary Fig. 22-25). Specifically, mitochondria and Rickettsiales grouped together independently of the four backbone clades, while mitochondria and FEMAG II formed sister groups inside the Alpha IIb clade.

Fig. 3 | Schematic phylogenetic trees of mitochondria and subgroups of Alphaproteobacteria in the ‘18-alphamitoCOGs’ dataset. Alphaproteobacterial lineages and mitochondria are named according to Supplementary Table 3. Taxa and taxonomic groups in black represent the backbone taxa. Filled dots show node support values greater than 80%, while empty dots show values greater than 50% but less than 80%. Node values show posterior probability support for Bayesian trees and bootstrap support based on 1000 iterations for ML trees. Trees are rooted. Outgroup taxa and Magnetococcus marinus MC-1 are not shown. The Maxdiff values of Bayesian trees are shown beside the trees. a-l, Schematic trees of Supplementary Fig. 14-25, respectively.

Since Rickettsiales, alphaproteobacterium HIMB59, FEMAG I and FEMAG II individually showed phylogenetic connections to backbone taxa of Alpha IIb in Bayesian trees, the evolutionary relationships between these lineages were then investigated specifically by setting Alpha IIa as the outgroup. MarineAlpha11 Bin1 and MarineAlpha12 Bin2 formed a monophyletic clade in both the ML and the Bayesian trees (Fig. 4ab). MarineAlpha9 Bin5 either branched below all the fast-evolving taxa studied here (Fig. 4a) or formed a monophyletic group with FEMAG I (Fig. 4b). The nodes before MarineAlpha9 Bin5 and FEMAG I, respectively, had low support, suggesting that the clade comprising these two lineages in the ML tree was unstable. Both trees agreed that alphaproteobacterium HIMB59 branched within FEMAG II and that Rickettsiales was sister to FEMAG II.

Fig. 4 | Phylogenetic relationships of fast-evolving taxa and mitochondria to alphaproteobacteria of Alpha IIb. Node values show posterior probability support values for Bayesian trees and bootstrapping support values based on 1000 iterations for ML trees. mt, mitochondria. All trees are rooted and the outgroup is not shown. a and b, ML and Bayesian trees, respectively, of fast-evolving alphaproteobacteria and taxa of Alpha IIb. c and d, ML and Bayesian trees, respectively, of fast-evolving alphaproteobacteria, mitochondria and taxa of Alpha IIb.

When mitochondria were present, the topology of all other taxa was preserved in both the ML tree and the Bayesian tree (Fig. 4cd). Mitochondria were placed below the clade consisting of FEMAG II, alphaproteobacterium HIMB59, Rickettsiales and FEMAG I in the ML tree with node support of 71%. In comparison, the phylogenetic relationship among mitochondria, the clade of FEMAG I and MarineAlpha9 Bin5, and the clade of FEMAG II, alphaproteobacterium HIMB59 and Rickettsiales was unresolved by Bayesian inference. Despite that, the placement of mitochondria within Alpha IIb was robust. Our result suggests that mitochondria may have originated from the common ancestor of Rickettsiales and certain extant marine planktonic alphaproteobacteria. This tree topology is robust to various parameters and unlikely to be the result of phylogenetic artifacts, as indicated by several lines of evidence (Supplementary Discussion).

In summary, we have demonstrated that site-exclusion methods can cause diverse topological shifts through arbitrary cutoff selection. Specifically, the Alphaproteobacteria-sister topology reported by Martijn et al. was the result of a very particular experimental setup and set of parameters12. Even in most site-excluded datasets, mitochondria branch from within Alphaproteobacteria. Therefore, lacking objective criteria for parameter setup, site-exclusion methods still await validation. We then employed alternative approaches to mitigate biases in datasets and found that mitochondria have a strong phylogenetic connection to the common ancestor of Rickettsiales and several fast-evolving alphaproteobacteria derived from marine surface metagenomes. While this result again supports a robust evolutionary association between mitochondria and Alphaproteobacteria, it also provides important ecological insights into the origin of both mitochondria and Rickettsiales. Based on our result, the common ancestor of mitochondria and Rickettsiales was a free-living alphaproteobacterium. This is consistent with a recent report favoring independent branching of Rickettsiales and mitochondria16 and also agrees with numerous previous studies that suggested a phylogenetic connection between mitochondria and Rickettsiales5. Physiological and geological modelling has suggested that mitochondrial acquisition possibly occurred in shallow marine environments17 or in anaerobic syntrophy18. Proteome studies of Rickettsiales and the MarineAlpha bins of Alpha II may provide hints about the metabolic nature of the common ancestor of mitochondria18,19.

1. Ku, C., Nelson-Sathi, S., Roettger, M., Sousa, F. L., et al. Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524, 427-432 (2015).
2. Thiergart, T., Landan, G., Schenk, M., Dagan, T. & Martin, W. F. An evolutionary network of genes present in the eukaryote common ancestor polls genomes on eukaryotic and mitochondrial origin. Genome Biol Evol 4, 466-485 (2012).
3. Abhishek, A., Bavishi, A., Bavishi, A. & Choudhary, M. Bacterial genome chimaerism and the origin of mitochondria. Can J Microbiol 57, 49-61 (2011).
4. Atteia, A., Adrait, A., Brugière, S., Tardif, M., et al. A proteomic survey of Chlamydomonas reinhardtii mitochondria sheds new light on the metabolic plasticity of the organelle and on the nature of the alpha-proteobacterial mitochondrial ancestor. Mol Biol Evol 26, 1533-1548 (2009).
5. Roger, A. J., Muñoz-Gómez, S. A. & Kamikawa, R. The origin and diversification of mitochondria. Curr Biol 27, R1177-R1192 (2017).
6. Derelle, R. & Lang, B. F. Rooting the eukaryotic tree with mitochondrial and bacterial proteins. Mol Biol Evol 29, 1277-1289 (2012).
7. Wang, Z. & Wu, M. An integrated phylogenomic approach toward pinpointing the origin of mitochondria. Sci Rep 5, 7949 (2015).
8. Viklund, J., Ettema, T. J. & Andersson, S. G. Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol Biol Evol 29, 599-615 (2012).
9. Esser, C., Ahmadinejad, N., Wiegand, C., Rotte, C., et al. A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol Biol Evol 21, 1643-1660 (2004).
10. Fitzpatrick, D. A., Creevey, C. J. & McInerney, J. O. Genome phylogenies indicate a meaningful alpha-proteobacterial phylogeny and support a grouping of the mitochondria with the Rickettsiales. Mol Biol Evol 23, 74-85 (2006).
11. Rodríguez-Ezpeleta, N. & Embley, T. M. The SAR11 group of alpha-proteobacteria is not related to the origin of mitochondria. PLoS One 7, e30520 (2012).
12. Martijn, J., Vosseberg, J., Guy, L., Offre, P. & Ettema, T. J. G. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature 557, 101-105 (2018).
13. Gray, M. W., Burger, G. & Lang, B. F. Mitochondrial evolution. Science 283, 1476-1481 (1999).
14. Gawryluk, R. M. R. Evolutionary biology: A new home for the powerhouse? Curr Biol 28, R798-R800 (2018).
15. Muñoz-Gómez, S. A., Hess, S., Burger, G., Lang, B. F., et al. An updated phylogeny of the Alphaproteobacteria reveals that the parasitic Rickettsiales and Holosporales have independent origins. Elife 8, (2019).
16. Castelli, M., Sabaneyeva, E., Lanzoni, O., Lebedeva, N., et al. Deianiraea, an extracellular bacterium associated with the ciliate Paramecium, suggests an alternative scenario for the evolution of Rickettsiales. ISME J (2019).
17. Waldbauer, J. R., Newman, D. K. & Summons, R. E. Microaerobic steroid biosynthesis and the molecular fossil record of Archean life. Proc Natl Acad Sci U S A 108, 13409-13414 (2011).
18. Gould, S. B., Garg, S. G. & Martin, W. F. Bacterial vesicle secretion and the evolutionary origin of the eukaryotic endomembrane system. Trends Microbiol 24, 525-534 (2016).
19. Martin, W. F., Tielens, A. G. M., Mentel, M., Garg, S. G. & Gould, S. B. The physiology of phagocytosis in the context of mitochondrial origin. Microbiol Mol Biol Rev 81, (2017).

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Implementation of site-exclusion metrics. To obtain the 24-alphamitoCOGs dataset of Martijn et al. (2018), the file ‘alphaproteobacteria_mitochondria_untreated.aln’ was downloaded from https://datadryad.org//resource/doi:10.5061/dryad.068d0d0. As the names of some MarineAlpha bins in this file are not consistent with the phylogenetic trees in the original paper, we obtained the name-mapping file from Dr. Joran Martijn on 4 July 2018. On this dataset, χ²-score-based site exclusion was achieved by applying the equation introduced by Viklund et al.8. The χ²-score metric was designed to test the contribution of each site to the compositional heterogeneity of a dataset8 and was applied by Martijn et al. to the study of mitochondrial phylogeny12. The ɀ-score is a metric specifically designed to cope with strong GC-content-related amino acid compositional heterogeneity in datasets of alphaproteobacterial phylogeny15. ɀ-scores of sites were calculated according to the method introduced by Muñoz-Gómez et al.15. A method implemented in IQTREE for fast-evolving site selection was also included for comparison, since long-branch attraction caused by fast-evolving species in Alphaproteobacteria and mitochondria is a potential issue20. Fast-evolving site exclusion was based on conditional mean site rates estimated under the LG+C60+F+R6 model in IQTREE (v1.5.5) using the ‘-wsr’ flag20. Based on these three metrics, the 5%, 10%, 20%, 40% and 60% of sites with the highest scores were excluded for downstream phylogenetic analyses. Stuart’s test and Bowker’s test are two typical evaluation metrics of symmetry violation21. Site exclusion based on Stuart’s test was conducted by using the stationary-trimming function in BMGE (v1.12)22.
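The percentage-based exclusion step can be illustrated with a short sketch. It assumes that per-site scores (χ²-score, ɀ-score or site rate) have already been computed by the respective tools; the function name, the toy alignment and the rounding rule for the number of removed sites are assumptions made for illustration, not details taken from the scripts used in this study.

```python
def exclude_top_sites(alignment, scores, fraction):
    """Drop the given fraction of alignment columns with the highest scores.

    alignment: list of equal-length aligned sequences (strings)
    scores:    one precomputed score per column
    fraction:  proportion of columns to remove, e.g. 0.05 for 5%
    """
    n_sites = len(scores)
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("every sequence needs exactly one score per column")
    n_remove = round(n_sites * fraction)  # rounding rule is an assumption
    ranked = sorted(range(n_sites), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:n_remove])
    return ["".join(c for i, c in enumerate(seq) if i not in drop)
            for seq in alignment]


# Hypothetical toy alignment with five columns and made-up scores.
toy_alignment = ["ABCDE", "FGHIJ"]
toy_scores = [0.1, 5.0, 0.3, 4.0, 0.2]
for fraction in (0.2, 0.4):
    print(fraction, exclude_top_sites(toy_alignment, toy_scores, fraction))
```

Running the same function over the cutoffs 0.05, 0.1, 0.2, 0.4 and 0.6 would yield one subset per cutoff for each metric.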
Compared to Stuart’s test, Bowker’s test of symmetry was reported to be more comprehensive and sufficient to assess compliance with the assumptions of symmetry, reversibility and homogeneity made by time-reversible models21,23. We used Bowker’s test of symmetry to produce subsets of the ‘24-alphamitoCOGs’ dataset that met increasingly stringent p-value-based thresholds (>0.005, >0.01, >0.05, >0.1, >0.2, >0.3, >0.4 and >0.5, respectively). Bowker’s test has long been used as an overall test for symmetry23. The test assesses symmetry in an r × r contingency table in which the ij-th cell contains the observed frequency $n_{ij}$. The null hypothesis of symmetry is $H_0: n_{ij} = n_{ji}$ for $i \neq j$, $i, j = 1, \dots, r$, and the test value is computed as:

$$\chi^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} \qquad (1)$$

The test statistic follows a χ² distribution with the number of degrees of freedom equal to the number of comparisons ($n_{ij}$ vs $n_{ji}$) made.
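As an illustration of the test described here, the sketch below computes Bowker’s statistic, its degrees of freedom and the p-value for a pair of aligned sequences using only the Python standard library. It is not the Perl script used in this study; the function names, the treatment of gap and ambiguity characters, and the closed-form χ² tail probability for integer degrees of freedom are choices made for this example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k < df/2} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) plus a finite correction series
    total = math.erfc(math.sqrt(y))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def bowker_test(seq_a, seq_b, skip="-?X"):
    """Return (statistic, df, p) for Bowker's test on two aligned sequences."""
    counts = {}
    for a, b in zip(seq_a, seq_b):
        if a in skip or b in skip or a == b:
            continue  # diagonal cells and gaps do not enter the statistic
        counts[(a, b)] = counts.get((a, b), 0) + 1
    stat, df, seen = 0.0, 0, set()
    for (a, b), n_ab in counts.items():
        pair = frozenset((a, b))
        if pair in seen:
            continue
        seen.add(pair)
        n_ba = counts.get((b, a), 0)
        stat += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
        df += 1  # one comparison per pair with n_ab + n_ba > 0
    return stat, df, chi2_sf(stat, df)


# Hypothetical toy sequences.
print(bowker_test("AAAAKKRR", "KKKKAARK"))
```

In the toy pair, the A/K cells contribute (4 − 2)²/6 and the R/K cells contribute (1 − 0)²/1, giving two comparisons and hence two degrees of freedom.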
The scoring function (SF) utilized here for symmetry-based alignment trimming is built from a sum of absolute values of natural logarithms of the p-values of Bowker's tests, each raised to a certain power (15 as the default value). SF can be computed as a mean over the values in the upper or lower triangular part of a square matrix whose rows and columns represent taxa, populated with $|\ln p|^x$ values for Bowker’s tests among these taxa, e.g.:

$$SF = \frac{2}{h(h-1)} \sum_{a=1}^{h-1} \sum_{b=a+1}^{h} |\ln p_{ab}|^{x} \qquad (2)$$

wherein h is the number of taxa in the multiple sequence alignment (msa), and $p_{ab}$ is the p-value for the sequences a and b.

The script that performs symmetry-based trimming (Bowker_test_symmetry.pl, available in Supplementary Code Files) deletes a site from the alignment, computes the SF value and restores the original alignment. This operation is performed for every alignment site. The site whose removal results in the lowest SF value is then deleted irreversibly. The procedure is repeated on each shortened alignment until the lowest p-value of a pairwise Bowker’s test in the trimmed dataset exceeds the chosen p-value threshold.

Exponentiation in formula 2 allows trimmed subsets to be recovered sooner. It disproportionately increases the addend values in formula 2 ($|\ln p_{ab}|^x$) for smaller p-values. For instance, with the default exponent the addend for a p-value of 0.5 is 0.004, whereas the addend for a p-value of 0.005 is 72789633288. Thus, when there is a disparity among the individual p-values in the data, which is the case when the method is needed, the exponentiation increases the relative contribution of the lowest p-values to the SF value. At each trimming step the heuristic algorithm identifies the site whose removal is likely to improve the worst (lowest) p-values. The script outputs a trimmed subset when the lowest p-value exceeds the threshold value. The suggested exponentiation, by preferentially improving the worst p-values at each site-stripping step, is able to deliver a result with fewer positions removed. The default exponent value (x = 15) was determined experimentally.
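The greedy procedure described in the two paragraphs above can be sketched as follows. This is an illustrative Python outline, not a port of Bowker_test_symmetry.pl: the function names are hypothetical, the pairwise p-value function is passed in as a parameter, and `toy_p` is only a stand-in (fraction of identical columns) used to keep the example self-contained; in practice a Bowker’s test p-value would be supplied instead.

```python
import math
from itertools import combinations


def scoring_function(alignment, pairwise_p, x=15):
    """Mean of |ln p|^x over all taxon pairs (one triangle of the matrix)."""
    addends = [
        abs(math.log(max(pairwise_p(a, b), 1e-300))) ** x  # floor avoids log(0)
        for a, b in combinations(alignment, 2)
    ]
    return sum(addends) / len(addends)


def lowest_p(alignment, pairwise_p):
    """Worst (lowest) pairwise p-value in the alignment."""
    return min(pairwise_p(a, b) for a, b in combinations(alignment, 2))


def drop_site(alignment, i):
    """Return a copy of the alignment without column i."""
    return [seq[:i] + seq[i + 1:] for seq in alignment]


def greedy_trim(alignment, pairwise_p, threshold, x=15):
    """Remove one site at a time until the lowest p-value exceeds threshold."""
    current = list(alignment)
    while len(current[0]) > 1 and lowest_p(current, pairwise_p) <= threshold:
        # Try every site, keep the deletion that gives the lowest SF.
        best = min(
            range(len(current[0])),
            key=lambda i: scoring_function(drop_site(current, i), pairwise_p, x),
        )
        current = drop_site(current, best)
    return current


def toy_p(seq_a, seq_b):
    """Stand-in for a Bowker p-value: fraction of identical columns."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


print(greedy_trim(["AAAA", "AAAK", "AAKK"], toy_p, threshold=0.6))
print(abs(math.log(0.5)) ** 15, abs(math.log(0.005)) ** 15)
```

The last line evaluates the two addends quoted in the text for x = 15, which shows how strongly the exponent weights the lowest p-values.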
Phylogenetic inference and tree topology comparison. ML trees in this study were reconstructed by using IQTREE under either an auto-selected simple model (ModelFinder) or the mixed model (LG+C60+F), as specified in the text. Bayesian trees were produced by using PhyloBayes MPI (v1.8)24; four chains were run until a Maxdiff < 0.3 was reached.

For comparison of topology, ML trees of site-excluded datasets were first rooted with Beta- and Gammaproteobacteria, Magnetococcales, and MarineProteo1 Bin1 and Bin2. The dissimilarity between each tree and the untreated tree was then calculated by using the Alignment metric developed by Nye et al. Briefly, the Alignment metric considers all the ways that the branches of one tree map onto the other25. This method was found to be superior to other tree comparison metrics26. The code was adapted from Kuhner and Yamato (2015) and implemented in Python. The tree of each subset was compared to the tree of the untreated dataset. Both the simple model and the mixed model (C60) were used in maximum-likelihood (ML) tree reconstruction for comparison (tree files are deposited in Supplementary Data Files).

Genome and marker protein selection of the GC-bias-reduced dataset. The ‘18-alphamitoCOGs’ dataset of this study was derived from the ‘24-alphamitoCOGs’ dataset of Martijn et al. (2018) with several modifications. Specifically, MAGs derived from composite bins, which contain sequences from multiple naturally existing genomes, were excluded to minimize possible assembly-induced artifacts. Five less AT-rich mitochondria (GC content 45.1%-52.2%) and five less AT-rich Rickettsiales (GC content 38.2%-49.8%) were selected to replace the mitochondrial and Rickettsiales groups in the original dataset (Supplementary Table 3). The ratio of GC-poor to GC-rich amino acids (FYMINK/GARP) of the marker proteins of the reselected mitochondria and Rickettsiales ranged from 0.955 to 1.329 and from 1.013 to 2.330, respectively (Supplementary Fig. 1). All relevant genomes were downloaded from the RefSeq database of NCBI on 21 July 2018.
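The FYMINK/GARP ratio mentioned above is simple to compute. The sketch below counts the GC-poor residues (F, Y, M, I, N, K) and the GC-rich residues (G, A, R, P) across a set of protein sequences. Pooling all marker proteins of a taxon before taking the ratio is an assumption made here for illustration, and the function name and toy sequences are hypothetical.

```python
def fymink_garp_ratio(sequences):
    """Ratio of GC-poor (FYMINK) to GC-rich (GARP) residues in the sequences."""
    gc_poor = sum(seq.upper().count(aa) for seq in sequences for aa in "FYMINK")
    gc_rich = sum(seq.upper().count(aa) for seq in sequences for aa in "GARP")
    if gc_rich == 0:
        raise ValueError("no G, A, R or P residues found")
    return gc_poor / gc_rich


# Hypothetical toy sequences, not real marker proteins.
print(fymink_garp_ratio(["FYMINKGARP"]))
print(fymink_garp_ratio(["GARPGARP", "fk"]))
```

Higher values indicate proteomes shifted towards AT-rich codons, so comparing the ratio across taxa gives a quick view of compositional heterogeneity.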
For quality control of the 24 marker proteins of the original dataset, sequences of these proteins were downloaded from the MitoCOGs27 database and aligned individually by using MAFFT-L-INS-i (v7.055b)28. The alignment of each protein was trimmed by using trimAl (v1.4)29. Protein-specific e-values were determined from the e-value distributions of positive and negative sequences. For each gene, sequences classified into that protein in the MitoCOGs database were used as the positive dataset and sequences classified into other proteins were used as the negative dataset. The e-value distributions of positive and negative sequences were calculated by using Hmmer (v3.2.1)30. The protein-specific e-value was the smaller of the 95% quantile of the e-values of the positive sequences and the minimum e-value of the negative sequences. We searched for these 24 proteins individually in the genomes by using Hmmer with the protein-specific e-values of the HMM models. The proteins obtained were used for ML tree reconstruction with IQTREE under the model ‘LG+C60+F’. Copies identified in each gene tree as paralogs, possible contaminants or products of lateral gene transfer were removed. Candidatus Paracaedibacter symbiosus was excluded because multiple contaminant proteins were detected in its genome, which likely suffers from heavy contamination. MitoCOG0003 and MitoCOG0133 were excluded as they were detected in few genomes. MitoCOG00052, MitoCOG00060, MitoCOG00066 and MitoCOG00071 were excluded as they were absent from the reselected mitochondrial genomes. Consequently, 18 marker proteins were selected. Except for outgroup species (including Beta- and Gammaproteobacteria and Magnetococcales), only genomes containing at least 16 of the 18 marker proteins were kept. Furthermore, to reduce computational time, we removed redundant MarineAlpha bins of the original dataset based on pairwise similarity of marker proteins by using BLASTP (v2.6.0+, identity ⩾ 0.99 and coverage ⩾ 0.95). As a result, 61 genomes were kept for downstream analysis (individual and concatenated alignments of ‘18-alphamitoCOGs’ are deposited in Supplementary Data Files).
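Two of the selection rules in this paragraph lend themselves to a short illustration: the protein-specific e-value and the requirement that non-outgroup genomes carry at least 16 of the 18 markers. The sketch below is a hedged reading of those rules, not the code used in this study. The function names and toy inputs are hypothetical, the e-values are assumed to have been parsed from the search output beforehand, and the nearest-rank definition of the 95% quantile is an assumption, since the quantile method is not specified.

```python
def protein_specific_cutoff(positive_evalues, negative_evalues, quantile_percent=95):
    """Smaller of the positives' quantile e-value and the negatives' minimum."""
    ranked = sorted(positive_evalues)
    # Nearest-rank quantile via integer ceiling division (assumed definition).
    rank = -(-quantile_percent * len(ranked) // 100)
    positive_quantile = ranked[rank - 1]
    return min(positive_quantile, min(negative_evalues))


def keep_genomes(markers_found, outgroups, min_markers=16):
    """Keep outgroup genomes, plus genomes with at least min_markers markers."""
    return sorted(
        genome
        for genome, found in markers_found.items()
        if genome in outgroups or len(found) >= min_markers
    )


# Hypothetical toy inputs.
positives = [1e-30, 1e-25, 1e-20, 1e-12]
negatives = [1e-5, 0.3, 2.0]
print(protein_specific_cutoff(positives, negatives))

markers_found = {
    "alpha_A": set(range(18)),
    "alpha_B": set(range(15)),
    "beta_outgroup": set(range(10)),
}
print(keep_genomes(markers_found, outgroups={"beta_outgroup"}))
```

Taking the minimum of the two values keeps the cutoff stricter than any hit from the negative set while still accepting most positives.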
Before phylogenetic inference, the selected proteins were aligned individually by using MAFFT-L-INS-i. Low-quality columns were removed with BMGE (-m BLOSUM30), and the quality-controlled multiple sequence alignments were concatenated.

Data availability. The authors declare that data supporting the findings of this study are available in Supplementary Data Files.

Code availability. Scripts of site-exclusion methods are available in Supplementary Code Files.

References

20. Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32, 268-274 (2015).
21. Jermiin, L. S., Jayaswal, V., Ababneh, F. M. & Robinson, J. Identifying optimal models of evolution. Methods Mol Biol 1525, 379-420 (2017).
22. Criscuolo, A. & Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol 10, 210 (2010).
23. Bowker, A. H. A test for symmetry in contingency tables. J Am Stat Assoc 43, 572-574 (1948).
24. Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst Biol 62, 611-615 (2013).
25. Nye, T. M., Liò, P. & Gilks, W. R. A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22, 117-119 (2006).
26. Kuhner, M. K. & Yamato, J. Practical performance of tree comparison metrics. Syst Biol 64, 205-214 (2015).
27. Kannan, S., Rogozin, I. B. & Koonin, E. V. MitoCOGs: clusters of orthologous genes from mitochondria and implications for the evolution of eukaryotes. BMC Evol Biol 14, 237 (2014).
28. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772-780 (2013).
29. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972-1973 (2009).
30. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput Biol 7, e1002195 (2011).

Acknowledgements This work was financially supported by the National Natural Science Foundation of China (91851210, 41530105 and 81774152), the European Research Council (ERC 666053), the Shenzhen Key Laboratory of Marine Archaea Geo-Omics, Southern University of Science and Technology (ZDSYS201802081843490), the Shenzhen Science and Technology Innovation Commission (JCYJ20180305123458107), the VW foundation (93 046), and the Laboratory for Marine Geology, Qingdao National Laboratory for Marine Science and Technology (MGQNLM-TD201810).

Author Contributions L.F., W.F.M. and R.Z. conceived this study. L.F., D.W., V.G., J.X., Y.X. and S.G. were involved in data analysis. L.F., D.W., V.G., C.Z., W.F.M. and R.Z. interpreted the results and drafted the manuscript. All authors participated in the critical revision of the manuscript.

Competing interests The authors declare no competing interests.
