Karen E. Nelson (Eds.) - Encyclopedia of Metagenomics - Genes, Genomes and Metagenomes - Basics, Methods, Databases and Tools-Springer US (2015)
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, © Springer Science+Business Media New York 2015
A 2 A 123 of Metagenomics
and prove the general metagenomic approach and were often limited by the high cost of sequencing. Hence, desirable scientific methodology, including biological replication, could not be adopted, a situation that precluded appropriate statistical analyses and comparison (Prosser 2010).

The significant reduction, and indeed continuing fall, in sequencing costs (see below) now means that the central tenets of scientific investigation can be adhered to. Rigorous experimental design will help researchers explore the complexity of microbial interactions and will lead to improved catalogs of proteins and genetic elements. Individual ecosystems can now be studied with appropriate cross-sectional and temporal approaches designed to identify the frequency and distribution of variance in community interaction and development (Knight et al. 2012). Such studies should also pay close attention to the collection of comprehensive physical, chemical, and biological data (see below). This will enable scientists to elucidate the emergent properties of even the most complex biological system. This capability will provide the potential to identify drivers at multiple spatial, temporal, taxonomic, phylogenetic, functional, and evolutionary levels and to define the feedback mechanisms that mediate equilibrium.

The frequency and distribution of variance within a microbial ecosystem are basic factors that must be ascertained by rigorous experimental design and analysis. For example, to analyze the microbial community structure from 1 L of seawater in a coastal pelagic ecosystem, one must also ideally define how representative this will be for the ecosystem as a whole and what the bounds of that ecosystem are. Numerous studies of marine systems have shown how community structure can vary between water masses and over time (e.g., Gilbert et al. 2012; Fuhrman 2009; Fuhrman et al. 2006, 2008; Martiny et al. 2006), and metagenomics currently helps further define how community structure varies in these environments (Ottesen et al. 2011; DeLong et al. 2006; Rusch et al. 2007; Gilbert et al. 2010a). In contrast, in soil systems variance in space appears to be far larger than in time (Mackelprang et al. 2011; Barberan et al. 2012; Bergmann et al. 2011; Nemergut et al. 2011; Bates et al. 2011). Considerable work still is needed in order to determine spatial heterogeneity, for example, how representative a 0.1 mg sample of soil is with respect to the larger environment from which it was taken.

The design of a sampling strategy is implicit in the scientific questions asked and the hypotheses tested, and standard rules outside of replication and frequency of observation are hard to define. However, the question of “depth of observation” is prudent to address because researchers now can sequence microbiomes of individual environments with exceptional depth or breadth. By enabling either deep characterization of the taxonomic, phylogenetic, and functional potential of a given ecosystem or a shallow investigation of these elements across hundreds or thousands of samples, current sequencing technology (see below) is changing the way microbial surveys are being performed (Knight et al. 2012).

DNA handling and processing play a major role in exploring microbial communities through metagenomics (see also DNA extraction methods for human studies, “Extraction Methods, DNA” and “Extraction Methods, Variability Encountered in”). Specifically, it is well known that the type of DNA extraction used for a sample will affect the community profile obtained (e.g., Delmont et al. 2012). Therefore, with projects like the Earth Microbiome Project that aim to compare a large number of samples, efforts have been made to standardize DNA extraction protocols for every physical sample. Clearly, no single protocol will be suitable for every sample type (Gilbert et al. 2010b, 2011). For example, a particular extraction protocol might yield only very low DNA concentrations for a particular sample type, making it necessary to explore other protocols in order to improve efficiency. However, differences among DNA extraction protocols may limit comparability of data. Therefore, researchers need to further define in qualitative and quantitative terms how different DNA extraction methodologies affect microbial community structure.
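
The trade-off between deep and shallow sequencing raised in the discussion of “depth of observation” can be illustrated with a small calculation. The sketch below is not taken from the cited studies; it assumes that reads are drawn independently and in proportion to relative abundance (so extraction and amplification biases are ignored), and the community composition is hypothetical. Under these assumptions, a taxon with relative abundance p is seen at least once in n reads with probability 1 − (1 − p)^n.

```python
# Minimal sketch of "depth of observation" under a simple sampling model.
# Assumption: each read is an independent draw from the community, with
# probability equal to the taxon's relative abundance (no extraction bias).

def detection_probability(rel_abundance: float, n_reads: int) -> float:
    """Probability that a taxon is observed at least once in n_reads."""
    return 1.0 - (1.0 - rel_abundance) ** n_reads


def expected_taxa_detected(rel_abundances, n_reads: int) -> float:
    """Expected number of distinct taxa observed at a given sequencing depth."""
    return sum(detection_probability(p, n_reads) for p in rel_abundances)


# Hypothetical community: 10 dominant taxa at 9 % each plus 100 rare taxa
# at 0.1 % each (relative abundances sum to 1).
community = [0.09] * 10 + [0.001] * 100

for depth in (100, 10_000, 1_000_000):
    detected = expected_taxa_detected(community, depth)
    print(f"{depth:>9} reads: ~{detected:.1f} of {len(community)} taxa expected")
```

With a fixed sequencing budget, the same function can be used to compare one deeply sequenced sample against many shallow ones, which is the design decision the text describes.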
Sequencing Technology and Quality Control

The rapid development of sequencing technologies over the past few years has arguably been one of the driving forces in the field of metagenomics. While shotgun metagenomic studies initially relied on hardware-intensive and costly Sanger sequencing technology (Tyson et al. 2004; Venter et al. 2004) available only to large research institutes, the advent and continuous release of several next-generation sequencing (NGS) platforms has democratized the sequencing market and has given individual laboratories or research teams access to affordable sequencing data. Among the available NGS options, the Roche (Margulies et al. 2005), Illumina (Bentley et al. 2008), Ion Torrent (Rothberg et al. 2011), and SOLiD (Life Technologies) platforms have been applied to metagenomic samples, with the former two being more intensively used than the latter two. The features of these sequencing technologies have been extensively reviewed, for example by Metzker (2010) and Quail et al. (2012), and are therefore only briefly summarized here (Table 1).

Roche’s platform utilizes pyrosequencing (also often referred to as 454 sequencing because of the name of the company that initially developed the platform) as its underlying molecular principle. Pyrosequencing involves the binding of a primer to a template and the sequential addition of all four nucleoside triphosphates in the presence of a DNA polymerase. If the offered nucleoside triphosphate matches the next position after the primer, then its incorporation results in the release of diphosphate (pyrophosphate, or PPi). PPi production is coupled by an enzymatic reaction involving an ATP sulfurylase and a luciferase to the production of a light signal that is detected through a charge-coupled device. The Ion Torrent sequencing platform uses a related approach; however, here, protons that are released during nucleoside incorporation are detected through semiconductor technology. In both cases, the production of light or charge signals relates to the incorporation of the sequentially offered nucleoside, which can be used to deduce the sequence downstream of the primer. Homopolymer sequences create signals proportional to the number of positions; however, the linearity of this relationship is limited by enzymatic and engineering factors leading to well-investigated insertion and deletion (indel) sequencing errors (Prabakaran et al. 2011; McElroy et al. 2012).

Illumina sequencing is based on the incorporation and detection of fluorescently labeled nucleoside triphosphates to extend a primer bound to a template. The key feature of the nucleoside triphosphates is a chemically modified 3′ position that does not allow for further chain extension (“terminator”). Thus, the primer gets extended by only one position, whose identity is detected by different fluorescent colors for each of the four nucleosides. Through a chemical reaction, the fluorescent label is then removed, and the 3′ position is converted into a hydroxyl group
A 123 of Metagenomics, Table 1. Next-generation sequencing technologies and their throughput, errors, and application to metagenomics

| Machine (manufacturer) | Throughput (per machine run) | Reported errors | Error/metagenomic example references |
|---|---|---|---|
| GS FLX Titanium (454/Roche) | ~1 M reads @ ~500 nt | 0.56 % indels; up to 0.12 % substitution | McElroy et al. 2012; Fan et al. 2012 |
| HiSeq 2000 (Illumina) | ~3 G reads @ 100 nt | ~0.001 % indels; up to 0.34 % substitution | McElroy et al. 2012; Quail et al. 2012; Hess et al. 2011 |
| Ion Torrent PGM (Life Technologies) | ~0.1–5 M reads @ ~200 nt | 1.5 % indels | Loman et al. 2012; Whiteley et al. 2012 |
| SOLiD (Life Technologies) | ~120 M reads @ ~50 nt | Up to 3 % | Salmela 2010; Zhou et al. 2011; Iverson et al. 2012 |
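
As a rough reading aid for Table 1, the figures can be converted into bases per run and an expected number of indels per read. This is only a sketch: it assumes that the reported indel percentages are per-base rates and that reads have the nominal length given in the table, and it uses the upper bound of the Ion Torrent range (5 M reads) purely for illustration. The SOLiD entry carries no indel rate because the table reports only an overall figure.

```python
# Approximate values transcribed from Table 1. Assumption: "% indels" is a
# per-base rate; the Ion Torrent read count uses the upper bound of the range.
PLATFORMS = {
    "GS FLX Titanium (454/Roche)": {"reads": 1_000_000, "read_nt": 500, "indel_rate": 0.0056},
    "HiSeq 2000 (Illumina)": {"reads": 3_000_000_000, "read_nt": 100, "indel_rate": 0.00001},
    "Ion Torrent PGM (Life Technologies)": {"reads": 5_000_000, "read_nt": 200, "indel_rate": 0.015},
    "SOLiD (Life Technologies)": {"reads": 120_000_000, "read_nt": 50, "indel_rate": None},
}


def bases_per_run(reads: int, read_nt: int) -> int:
    """Total nucleotides produced by one machine run."""
    return reads * read_nt


def expected_indels_per_read(indel_rate: float, read_nt: int) -> float:
    """Expected indel count in one read, given a per-base indel rate."""
    return indel_rate * read_nt


def summarize(platforms):
    """Map each platform name to (bases per run, expected indels per read or None)."""
    summary = {}
    for name, p in platforms.items():
        indels = None
        if p["indel_rate"] is not None:
            indels = expected_indels_per_read(p["indel_rate"], p["read_nt"])
        summary[name] = (bases_per_run(p["reads"], p["read_nt"]), indels)
    return summary


for name, (bases, indels) in summarize(PLATFORMS).items():
    indel_text = "n/a" if indels is None else f"{indels:.3f}"
    print(f"{name}: {bases / 1e9:.1f} Gb per run, expected indels per read: {indel_text}")
```

Under these assumptions the long pyrosequencing and Ion Torrent reads are each expected to carry a few indels, whereas a 100 nt Illumina read rarely carries one, which is consistent with the downstream effects on ORF identification discussed in the text.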
allowing for another round of nucleoside incorporation. The use of a reversible terminator thus allows for a stepwise and detectable extension of the primer that results in the determination of the template sequence. In theory, this process could be repeated to generate very long sequences; in practice, however, misincorporation of nucleosides in the many clonal template strands results in the fluorescent signal getting out of phase, and thus reliable sequencing information is only obtained for about 200 positions (Quail et al. 2012).

SOLiD sequencing utilizes ligation, rather than polymerase-mediated chain extension, to determine the sequence of a template. Primers are extended through the ligation with fluorescently labeled oligonucleotides. The high specificity of the ligase ensures that only oligonucleotides matching the downstream sequence will be incorporated; and by encoding different oligonucleotides with different fluorophores, the sequence can be determined.

It is important to understand the features of the sequencing technology in terms of throughput, read length, and errors (see Table 1), because these will have a significant impact on downstream processing. For example, the relatively high frequency of homopolymer errors for the pyrosequencing technology can impact ORF identification (Rho et al. 2010) but might still allow for reliable gene annotation, because of its comparatively long read length (Wommack et al. 2008). Conversely, the short read length of Illumina sequencing might reduce the rate of annotation of unassembled data, but the substantial throughput and data volume generated can facilitate assembly of entire draft genomes from metagenomic data (Hess et al. 2011). These considerations are also particularly relevant with new sequencing technologies coming online. These include single-molecule sequencing using zero-mode waveguide nanostructure arrays (Eid et al. 2009), which promises read lengths beyond 1,000 bp and has been shown to improve the hybrid assemblies of genomes (Koren et al. 2012), as well as nanopore sequencing (Schneider and Dekker 2012), which also promises long read lengths.

One important practical aspect to consider when analyzing raw sequencing data is the quality value assigned to reads. For a long time, the quality assessment provided by the technology vendor was the only available option for data consumers. Recently, however, a vendor-independent method for error detection and characterization has been described that bases its error estimates on reads that are accidentally duplicated during the PCR stages (a phenomenon described for Ion Torrent, 454, and Illumina sequencing technologies) (Trimble et al. 2012). Moreover, a significant number of publicly available metagenomic datasets contain sequence adapters (apparently because quality control is often performed on the level of assembled sequences, not raw reads). Simple statistical analyses with tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) will rapidly detect most of these adapter contaminations. An important aspect of quality control is therefore that each individual dataset requires error profiling and that relying on general properties of the platform used is not sufficient.

Assembly

Assembly of shotgun sequencing data can in general follow two strategies: the overlap-layout-consensus (OLC) and the de Bruijn graph approach (see also “▶ A De Novo Metagenomic Assembly Program for Shotgun DNA Reads”). These two strategies are employed by a number of different genome assemblers, and this topic has been reviewed recently (Miller et al. 2010). Basically, the OLC assembly involves the pairwise comparison of sequence reads and the ordering of matching pairs into an overlap graph. These overlapping sequences are then merged into a consensus sequence. Assembly with the de Bruijn strategy involves representing the sequence reads in a graph of all the k-mers they contain. Two k-mers are connected when the sequence reads have them in sequential, overlapping positions. Thus, all reads of a dataset are represented by the connections within
the de Bruijn graph, and assembled contigs are generated by traversing these connections to yield a sequence of k-mers.

The OLC assembly has the advantage that pairwise comparison can be performed to allow for a defined degree of dissimilarity between reads. This can compensate for sequencing errors and allows for the assembly of reads from heterogeneous populations (Tyson et al. 2004). However, the memory required for pairwise comparisons grows steeply with the number of reads in the dataset (the number of read pairs grows quadratically); hence, OLC assemblers often cannot deal with large datasets (e.g., Illumina data). Nevertheless, several OLC assemblers, including the Celera Assembler (Miller et al. 2008), Phrap (de la Bastide and McCombie 2007), and Newbler (Roche), have been used to assemble partial or complete draft genomes from metagenomic data; see, for example, Tyson et al. (2004), Liu et al. (2011), and Brown et al. (2012).

In contrast, memory requirements of de Bruijn assemblers are largely determined by the k-mer size chosen to define the graph. Thus, these assemblers have been used successfully with large numbers of short reads. Initially, de Bruijn assemblers designed for clonal genomes, such as Velvet (Zerbino and Birney 2008), SOAP (Li et al. 2008), and ABySS (Simpson et al. 2009), were used to assemble metagenomic data. Because of the heterogeneous nature of microbial populations, however, assemblies often ended up fragmented. One reason is that every positional difference between two reads from the same region of two closely related genomes will create a “bubble” in the graph. Another reason is that sequence errors in low-abundance reads cause terminating branches. Traversing such a highly branched graph leads to a large number of contigs. These problems have been partially overcome by modification of existing de Bruijn assemblers such as MetaVelvet (Namiki et al. 2012) or by newly designed de Bruijn-based algorithms such as Meta-IDBA (Peng et al. 2011; see also “Meta-IDBA, overview”). Conceptually, these solutions often include the identification of subgraphs that correspond to individual genomes or the abundance information of k-mers to find an optimal solution path through the graph.

These subdividing approaches are analogous to binning metagenomic reads or contigs, in order to identify groups of sequences that define a specific genome. These bins or even individual sequence reads can also be taxonomically classified by comparison with known reference sequences. Binning and classifying of sequences can be based on phylogeny, similarity, or composition (or combinations thereof), and a large number of algorithms and software packages are available. For recent comparisons and benchmarking of binning and classification software, please see Bazinet and Cummings (2012) and Droge and McHardy (2012). Obviously, care has to be taken with any automated process, since nonrelated sequences can be combined to produce genomic chimera bins or classes. It is thus advisable that any binning or classification strategy is thoroughly tested through appropriate in vitro and in silico simulations (Mavromatis et al. 2007; Morgan et al. 2010; McElroy et al. 2012). Also, manual curation of contigs and iterative assembly and mapping can produce improved genomes from metagenomic data (Dutilh et al. 2009). Through such carefully designed strategies and refined processes, nearly complete genomes can be assembled, even for low-abundance organisms from large numbers of short reads (Iverson et al. 2012).

Annotation

Initially, techniques developed for annotating clonal genomes were applied to metagenomic data, and several tools for metagenomic analysis, such as MG-RAST (Meyer et al. 2008) and IMG/M (Markowitz et al. 2008), were derived from existing software suites. For metagenomic projects, the principal challenges lie in the size of the dataset, the heterogeneity of the data, and the fact that sequences are frequently short, even if assembled prior to analysis.

The first step of the analysis (after extensive quality control; see above) involves identification
of genes from a DNA sequence. Fundamentally, two approaches exist: the extrinsic approach, which relies on similarity comparison of an unknown sequence to existing databases, and the intrinsic (or de novo) approach, which applies statistical analysis of sequence properties, such as codon usage frequencies, to define likely open reading frames (ORFs). For metagenomic data, the extrinsic approach (e.g., running a similarity search with BLASTX) comes at a significant computational cost (Wilkening et al. 2009), rendering it less attractive. De novo approaches based on codon or nucleotide k-mer usage are thus more promising for large datasets. De novo gene-calling software for microbial genomes is trained on long contigs and assumes clonal genomes. For metagenomic datasets, however, this approach is often unsuitable, because training data is lacking and multiple different codon usage (or k-mer) profiles are present due to the multiple, different genomes present.

However, several software packages have been designed to predict genes for short fragments or even reads (see Trimble et al. 2012 for a review). The most important finding of that review is the effect of errors on gene prediction performance, reducing the reading frame accuracy of most tools to well below 20 % at 3 % sequencing error. Only the software FragGeneScan (Rho et al. 2010; see also FragGeneScan, overview) accounted for the possibility that metagenomic sequences may contain errors, thus allowing it to clearly outperform its competitors.

Once identified, protein-coding genes require functional assignment. Here again, numerous tools and databases exist. Many researchers have found that performing BLAST analysis against the NCBI nonredundant database adds little value to their metagenomic datasets. Preferable are databases that contain high-level groupings of functions, for example, into metabolic pathways as in KEGG (Kanehisa 2002) or into subsystems as in SEED (Overbeek et al. 2005). Using such higher-level groupings allows for the generation of overviews and comparison between samples after statistical normalization.

The time and resources required to perform functional annotations are substantial, but approaches that project multiple results derived from a single sequence analysis into multiple namespaces can minimize these computational costs (Wilke et al. 2012). Numerous tools are also available to predict, for example, short RNAs and/or other genomic features, but these tools are frequently less useful for large metagenomic datasets that exhibit both low sequence quality and short reads.

Several integrated services package annotation functionality into a single website. The CAMERA (Seshadri et al. 2007) website, for example, provides users with the ability to run a number of pipelines on metagenomic data. The Joint Genome Institute’s IMG/M web service also provides an analysis for assembled metagenomic data, which has been used so far for over 300 metagenomic datasets. The European Bioinformatics Institute provides a service aimed at smaller, typically 454/pyrosequencing-derived metagenomes. The most popular service is the MG-RAST system (Meyer et al. 2008), used for over 50,000 metagenomes with over 140 billion base pairs of data. The system offers comprehensive quality control, tools for comparison of datasets, and tools for data import and export to, for example, QIIME (Caporaso et al. 2010) using standard formats such as BIOM (McDonald et al. 2012).

Metadata, Standards, Sharing, and Storage

With over 50,000 metagenomes available, the scientific community has realized that standardized metadata (“data about data”) and higher-level classification (e.g., a controlled vocabulary) will increase the usefulness of datasets for novel discoveries (see also ▶ Metagenomics, Metadata, and Meta-analysis). Through the efforts of the Genomic Standards Consortium (GSC) (Field et al. 2011), a set of minimal questionnaires has
been developed and accepted by the community (Yilmaz et al. 2010) that allows effective communication of metadata for metagenomic samples of diverse types. While the “required” GSC metadata is purposefully minimal and thus provides only a rough description, several domain-specific environmental packages exist that contain more detailed information.

As the standards evolve to match the needs of the scientific community, the groups developing software and analysis services have begun to rely on the presence of GSC-compliant metadata, effectively turning them into essential data for any metagenome project. Furthermore, comparative analysis of metagenomic datasets is becoming a routine practice, and acquiring metadata for these comparisons has become a requirement for publication in several scientific journals. Since reanalysis of raw sequence reads is often computationally too costly, the sharing of analysis results is also advisable. Currently only the IMG/M and MG-RAST platforms are designed to provide cross-sample comparisons without the need to recompute analysis results. In the MG-RAST system, moreover, users can share data (after providing metadata) with other users or make data publicly available.

Metagenomic datasets continue to grow in size. Indeed, the first metagenomes of several hundred gigabase pairs already exist. Therefore, storage and curation of metagenomic data have become a central theme. The on-disk representation of raw data and analyses has led to massive storage issues for groups attempting meta-analyses. Currently there is no solution for accessing relevant subsets of data (e.g., only reads and analyses pertaining to a specific phylum or a specific species) without downloading the entire dataset. Cloud technologies may in the future provide attractive computational solutions for storage and computing problems. However, specific and metadata-enabled solutions are required for cloud systems to power the community-wide (re-)analysis efforts of the first 50,000 metagenomes.

Conclusion

Metagenomics has truly proven a valuable tool for analyzing microbial communities. Technological advances will continue to drive down the sequencing cost for metagenomic projects and, in fact, the flood of current datasets indicates that funding to obtain sequences is not a major limitation. Major bottlenecks are encountered, however, in terms of storage and computational processing of sequencing data. With community-wide efforts and standardized tools, the impact of these current limitations might be managed in the short term. In the long term, however, large standardized databases will be required (e.g., a MetaGeneBank) to give information access to the entire scientific community. Every metagenomic dataset contains many new and unexpected discoveries, and the efforts of microbiologists worldwide will be needed to ensure that nothing is being missed. As for the data, whether raw or processed, it is just data. Only its biological and ecological interpretation will further our understanding of the complex and wonderful diversity of the microbial world around us.

Government License

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a US Department of Energy Office of Science Laboratory, is operated under Contract No. DE-AC02-06CH11357. The US Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

References

Barberan A, Bates ST, et al. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012;6(2):343–51.
Barns SM, Fundyga RE, et al. Remarkable archaeal diversity detected in a Yellowstone National Park hot
spring environment. Proc Natl Acad Sci U S A. 1994;91(5):1609–13.
Bates ST, Berg-Lyons D, et al. Examining the global distribution of dominant archaeal populations in soil. ISME J. 2011;5(5):908–17.
Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinforma. 2012;13(1):92.
Bentley DR, Balasubramanian S, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
Bergmann GT, Bates ST, et al. The under-recognized dominance of Verrucomicrobia in soil bacterial communities. Soil Biol Biochem. 2011;43(7):1450–5.
Brown MV, Lauro FM, et al. Global biogeography of SAR11 marine bacteria. Mol Syst Biol. 2012;8:595.
Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
de la Bastide M, McCombie WR. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinforma. 2007. Chapter 11: Unit11 14.
Delmont TO, Malandain C, et al. Metagenomic mining for microbiologists. ISME J. 2011;5(12):1837–43.
Delmont TO, Prestat E, et al. Structure, fluctuation and magnitude of a natural grassland soil metagenome. ISME J. 2012;6(9):1677–87.
DeLong EF, Preston CM, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
Dinsdale EA, Edwards RA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Droge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinform. 2012;13(6):646–55.
Dutilh BE, Huynen MA, et al. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics. 2009;25(21):2878–81.
Eid J, Fehr A, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–8.
Fan L, Reynolds D, et al. Functional equivalence and evolutionary convergence in complex communities of microbial sponge symbionts. Proc Natl Acad Sci U S A. 2012;109(27):E1878–87.
Field D, Amaral-Zettler L, et al. The genomic standards consortium. PLoS Biol. 2011;9(6):e1001088.
Fuhrman JA. Microbial community structure and its functional implications. Nature. 2009;459(7244):193–9.
Fuhrman JA, Hewson I, et al. Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci U S A. 2006;103(35):13104–9.
Fuhrman JA, Steele JA, et al. A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A. 2008;105(22):7774–8.
Gilbert JA, Field D, et al. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010a;5(11):e15545.
Gilbert JA, Meyer F, et al. The earth microbiome project: meeting report of the “1st EMP meeting on sample selection and acquisition at Argonne National Laboratory October 6 2010”. Stand Genomic Sci. 2010b;3(3):249–53.
Gilbert JA, Bailey M, et al. The earth microbiome project: the Meeting Report for the 1st International Earth Microbiome Project Conference, Shenzhen, China, June 13th-15th 2010. Stand Genomic Sci. 2011;5(2):243–7.
Gilbert JA, Steele JA, et al. Defining seasonal marine microbial community dynamics. ISME J. 2012;6:298–308.
Gill SR, Pop M, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
Hess M, Sczyrba A, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Iverson V, Morris RM, et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–101; discussion 101–103, 119–128, 244–252.
Knight R, Jansson J, et al. Designing better metagenomic surveys: the role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nat Biotechnol. 2012;30(6):513–20.
Koren S, Schatz MC, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.
Li R, Li Y, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24(5):713–4.
Liu MY, Kjelleberg S, et al. Functional genomic analysis of an uncultured delta-proteobacterium in the sponge Cymbastela concentrica. ISME J. 2011;5(3):427–35.
Loman NJ, Misra RV, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–9.
Mackelprang R, Waldrop MP, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480(7377):368–71.
Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.
Markowitz VM, Ivanova NN, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36(Database issue):D534–8.
Martiny JB, Bohannan BJ, et al. Microbial biogeography: putting microorganisms on the map. Nat Rev Microbiol. 2006;4(2):102–12.
Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
McDonald D, Clemente JC, et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience. 2012;1(1):7.
McElroy KE, Luciani F, et al. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Meyer F, Paarmann D, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24(24):2818–24.
Miller JR, Koren S, et al. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–27.
Morgan JL, Darling AE, et al. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One. 2010;5(4):e10209.
Namiki T, Hachiya T, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Nemergut DR, Costello EK, et al. Global patterns in the biogeography of bacterial taxa. Environ Microbiol. 2011;13(1):135–44.
Ottesen EA, Marin R, et al. Metatranscriptomic analysis of autonomously collected and preserved marine bacterioplankton. ISME J. 2011;5(12):1881–95.
Overbeek R, Begley T, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702.
Peng Y, Leung HC, et al. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;27(13):i94–101.
Prabakaran P, Streaker E, et al. 454 antibody sequencing – error characterization and correction. BMC Res Notes. 2011;4:404.
Prosser JI. Replicate or lie. Environ Microbiol. 2010;12(7):1806–10.
Quail M, Smith ME, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13(1):341.
Rho M, Tang H, et al. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Riesenfeld CS, Schloss PD, et al. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
Rothberg JM, Hinz W, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):348–52.
Rusch DB, Halpern AL, et al. The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77.
Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol. 2012;30(4):326–8. doi: 10.1038/nbt.2181.
Salmela L. Correction of sequencing errors in a mixed set of reads. Bioinformatics. 2010;26(10):1284–90.
Seshadri R, Kravitz SA, et al. CAMERA: a community resource for metagenomics. PLoS Biol. 2007;5(3):e75.
Simpson JT, Wong K, et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
Trimble WL, Keegan KP, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinforma. 2012;13(1):183.
Tringe SG, von Mering C, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.
Warnecke F, Luginbuhl P, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Whiteley AS, Jenkins S, et al. Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) platform. J Microbiol Methods. 2012;91(1):80–8.
Wilke A, Harrison T, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinforma. 2012;13:141.
Wilkening J, Wilke A, et al. Using clouds for metagenomics: a case study. IEEE Cluster 2009. 2009.
Wommack KE, Bhavsar J, et al. Metagenomics: read length matters. Appl Environ Microbiol. 2008;74(5):1453–63.
Yilmaz P, Kottmann R, et al. The “Minimum Information about an ENvironmental Sequence” (MIENS) specification. Nat Biotechnol. 2010. in print.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
Zhou R, Ling S, et al. Population genetics in nonmodel organisms: II. Natural selection in marginal habitats revealed by deep sequencing on dual platforms. Mol Biol Evol. 2011;28(10):2833–42.
A 10 A De Novo Metagenomic Assembly Program for Shotgun DNA Reads
A De Novo Metagenomic Assembly Program for Shotgun DNA Reads, Fig. 1 The flowchart of MAP algorithm
and 454 sequencing (usually 200–500 bp), are still the overwhelming recommendation and thus remain the major source of metagenomic sequence data. Therefore, it remains important to emphasize the value of longer reads for metagenomic analyses, including assembly tools designed specifically for such reads.

Algorithm of MAP

MAP implements an improved version of the classical overlap/layout/consensus (OLC) strategy, in which several special algorithms are incorporated into the stages in order to compute correct contigs by connecting the fragments linked by mate pairs and to prevent the false merging of unrelated reads. For the improved OLC strategy, MAP deploys a series of algorithms in three stages, as shown in Fig. 1. In the overlap stage, a filter algorithm based on q-grams (Mullikin et al. 2003) is used to obtain the read pairs that are likely to overlap, and a seed-and-extend alignment approach, similar to that used by BLAST (Altschul et al. 1990), is employed in the pairwise alignment calculation. In the consensus stage, a consistency-based consensus algorithm is used (Rausch et al. 2009), which is based on a multi-read alignment algorithm that aligns the reads with a consistency-enhanced alignment graph of shared sequence segments identified in advance. The most important innovation of MAP is the layout stage, which applies mate-paired information to deal with the repeat problem and is described below.

In the OLC approach of MAP, the overlap graph is used to facilitate the assembly process. Conceptually, reads and overlaps are represented in the graph G by nodes and bidirected edges, respectively. The arrows at both ends of an edge are determined by the way in which the two reads overlap. Herein, a dovetail path is defined as an acyclic path in which each node has only one outward arrow and one inward arrow. Thus, a dovetail path determines a contig by threading together the reads corresponding to the nodes in the path, and the goal of the layout stage is to separate the graph into disconnected dovetail paths. However, since there may be many misleading edges in the graph that represent false overlaps, mainly originating from two repetitive DNA regions or similar
fragments of different genomes, this goal seems to be a formidable task. To this end, MAP is designed to determine the optimal dovetail paths with the aid of the clues given by mate pairs (Lai et al. 2012).

Compared with other assemblers, several distinct features of the MAP algorithm should be pointed out. First, MAP does not rely on other information such as genome length or sequencing coverage, which is often used by assemblers targeting isolated genomes, because such information is clearly not applicable to metagenomic assembly. More importantly, MAP uses mate-paired information differently from other assemblers. For example, the Celera Assembler (Myers et al. 2000) used mate-paired information in the scaffold construction. The Celera Assembler later introduced a new pipeline, CABOG, which finds the best overlap graph in the unitigger module (Miller et al. 2008). In this algorithm, mate pairs are used to correct misassemblies by breaking the unitigs that are found to violate the mate-pair constraints. PCAP (Huang et al. 2003) used mate-paired information to correct contigs and to link contigs into scaffolds. Unlike these assemblers, MAP uses mate pairs as a core measure to construct contigs when repeats hamper the assembly. Based on mate-paired information, MAP designs a series of procedures to implement the layout stage.

Performance of MAP

MAP is designed for metagenomic assembly of long-read data with mate pairs, such as Sanger reads (700–1,000 bp) and 454 reads (200–500 bp). MAP was assessed on simulated data against widely used assemblers for long-read data. Specifically, the results on a simulated dataset with 800 bp reads demonstrate that the overall assembly performance of MAP can be superior to both Celera and Phrap for typical longer reads from Sanger sequencing, and the results on a simulated dataset with 200 bp reads show that MAP has an evident advantage over Celera, Newbler (Margulies et al. 2005), and Genovo (Laserson et al. 2011) for typical shorter reads from 454 sequencing (Lai et al. 2012).

Availability

MAP is written in C++, and the source code is freely available under the GNU GPL license at http://bioinfo.ctb.pku.edu.cn/MAP/.

References

Altschul SF, Gish W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Huang X, Wang J, et al. PCAP: a whole-genome assembly program. Genome Res. 2003;13:2164–70.
Kunin V, Copeland A, et al. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–78.
Lai B, Ding R, et al. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics. 2012;28(11):1455–62.
Laserson J, Jojic V, et al. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18:429–43.
Li R, Zhu H, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–72.
Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24:2818–24.
Mullikin JC, Ning Z, et al. The phusion assembler. Genome Res. 2003;13:81–90.
Myers EW, Sutton GG, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.
Rausch T, Koren S, et al. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics. 2009;25:1118–24.
Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
Ab Initio Gene Identification in Metagenomic Sequences

Shiyuyun Tang1 and Mark Borodovsky2
1 School of Biology, Biodiversity Research Center, Georgia Institute of Technology, Atlanta, GA, USA
2 Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Center for Bioinformatics and Computational Genomics, Atlanta, GA, USA

Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions based on presence or absence of characteristic oligonucleotide frequency patterns.

Introduction

annotation of the first completely sequenced archaeal genome, Methanococcus jannaschii (Bult et al. 1996). All the M. jannaschii genes were predicted by the ab initio statistical method (Borodovsky and McIninch 1993), while the function of 2/3 of them was a mystery since the translated protein sequences did not show sequence similarity to proteins in databases.

History repeats itself in metagenomes, since the majority of protein-coding regions in a new metagenome may code for proteins that do not show similarity to already known proteins. “Evidence-based” or “similarity-based” methods of gene finding (Kunin et al. 2008) provide gene prediction along with valuable information about the function of encoded proteins. Similarity-based gene finders possess high specificity, close to 100 % (Altschul et al. 1997; Badger and Olsen 1999; Frishman et al. 1998; Gish and States 1993). Still, the drawback of similarity-based methods is low sensitivity; they cannot predict novel genes. The similarity-based methods are less useful for gene prediction in metagenomes that carry many novel genes, while the ab initio gene prediction methods, not depending on the presence of homologs in protein databases, are both effective and efficient for annotating genes in metagenomic sequences (Kunin et al. 2008).
Ab Initio Gene Identification in Metagenomic Sequences, Table 1 Gene prediction accuracy for five ab initio gene finders. Sn stands for sensitivity and Sp stands for specificity

Programs           Test set                                            Sequence length (bp)  Sn (%)  Sp (%)  (Sn + Sp)/2 (%)  Publication
Orphelia           Fragments from 12 test species                      300                   82.1    91.7    86.9             Hoff et al. (2009)
FragGeneScan       Simulated short reads of 9 genomes                  400                   91.3    86.1    88.7             Rho et al. (2010)
MetaGeneMark       Fragments from 50 microbial chromosomes             400                   97.0    94.6    95.8             Zhu et al. (2010)
Glimmer-MG         Simulated 454 sequences                             535                   98.4    71.8    85.1             Kelley et al. (2012)
MetaGeneAnnotator  Subsequences of 13 genomes                          700                   95.1    91.0    93.1             Noguchi et al. (2008)
FragGeneScan       Simulated reads with 1 % sequencing error rate      400                   85.4    79.5    82.5             Rho et al. (2010)
Glimmer-MG         Simulated 454 reads with 1 % sequencing error rate  535                   83.6    62.5    73.1             Kelley et al. (2012)
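The accuracy measures reported in Table 1 can be computed from counts of correct and incorrect predictions. The sketch below assumes the convention that is common in gene finding, Sn = TP/(TP + FN) and Sp = TP/(TP + FP); the text does not spell out these formulas, and the function name and the counts are hypothetical.

```python
def gene_prediction_accuracy(tp, fn, fp):
    """Return (Sn, Sp, (Sn + Sp) / 2) in percent.

    Assumed definitions (not stated in the text):
      Sn = TP / (TP + FN): share of annotated genes that were predicted
      Sp = TP / (TP + FP): share of predicted genes that are annotated
    """
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tp / (tp + fp)
    return sn, sp, (sn + sp) / 2.0


# Hypothetical counts, chosen only for illustration.
sn, sp, avg = gene_prediction_accuracy(tp=900, fn=100, fp=50)
print(f"Sn = {sn:.1f} %, Sp = {sp:.1f} %, (Sn + Sp)/2 = {avg:.1f} %")
```

Averaging Sn and Sp gives a single figure that penalizes a tool that gains sensitivity only by overpredicting, which is how the (Sn + Sp)/2 column in the table can be read.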
frequency matrix can be derived by algorithms such as the MCMC (Markov chain Monte Carlo)-based Gibbs sampler (Lawrence et al. 1993) or EM (Expectation Maximization)-based MEME (Bailey and Elkan 1994); detection of the RBS motif is done by finding the most conserved set of ungapped sequence fragments within the multiple alignment window. The structure of the two-component RBS model is convenient for incorporation into the HMM-based framework of several algorithms such as MetaGeneMark and FragGeneScan.

Another feature, the prokaryotic gene length distribution, is approximated for complete or draft genomes by the gamma distribution with a mean value of about 900 nt; yet another one, the distribution of the length of noncoding regions, is approximated by an exponential distribution. These two distributions, as well as the RBS spacer length distribution, are used in the HSMM-based algorithms (Besemer et al. 2001). Since short metagenomic sequences are more likely to contain partial genes than complete genes, length distributions of partial genes are used in HSMM-based metagenomic gene finders (Rho et al. 2010; Zhu et al. 2010).

About 70 % of neighboring genes in prokaryotic genomes have the same orientation (Noguchi et al. 2006), and many of them make co-transcribed “chains” or operons. Genes in an operon are located at a close distance or even overlap. The four base-pair overlap ATGA is very common in adjacent genes as an overlap of the start and stop codons ATG and TGA. The average distance between adjacent genes having the same orientation is shorter than that between neighboring genes residing in complementary strands, especially in the gene start-to-gene start configuration, where additional space has to be available for promoters.

All these features are incorporated in metagenomic gene finders, e.g., MetaGeneMark. Tests of ab initio gene finders on simulated metagenomic sequences have shown that these algorithms are quite accurate, with average values of sensitivity and specificity above 90 %; see Table 1. However, the sensitivity drops if the sequence length goes below 200 nt (Yok and Rosen 2011; Zhu et al. 2010).

Ab Initio Gene Finding in Metagenomic Sequences with Errors

Real metagenomic sequences contain errors: substitutions, insertions, and deletions (indels), as well
Ab Initio Gene Identification in Metagenomic Sequences, Table 2 Frameshift prediction accuracy. Test set for all rows: fragments from 18 prokaryotic genomes with 20 % containing frameshifts. Publication for all rows: Tang et al. 2013

Programs       Sequence length (bp)  Sn (%)  Sp (%)  (Sn + Sp)/2
FragGeneScan   400                   81.0    43.2    62.1
FragGeneScan   600                   81.9    35.1    58.5
FragGeneScan   800                   82.8    29.4    56.1
MetaGeneTack   400                   75.8    70.2    73.0
MetaGeneTack   600                   80.1    61.7    70.9
MetaGeneTack   800                   81.5    51.9    66.7
as chimerisms, when two reads from different species are joined due to assembly error. Indels can cause frameshifts in coding regions; thus gene prediction accuracy is affected by sequencing errors. The overall effect on accuracy depends on error rates specific to sequencing and finishing technologies; for example, the error rates reported for Sanger sequencing may be as low as 0.001 %, while sequencing errors in NGS technologies can go above 1 %. In both simulated Sanger reads and simulated 454 reads, a significant decrease of gene prediction sensitivity is observed when the error rate exceeds 1 % (Hoff 2009). Still, in assembled sequences, the per-nucleotide error rate of 0.5 % in raw reads can be reduced to as low as 0.005 %. This error rate is still large enough to affect 3–4.5 % of genes in assembled sequences (Luo et al. 2012).

To identify frameshift errors in metagenomic sequences, gene-finding algorithms have to model frame transitions that occur due to sequencing errors. In HSMM-based gene finders, e.g., FragGeneScan, new hidden states designating transitions between coding frames in the same strand were incorporated into the HSMM architecture. Another recent tool able to detect frameshifts in metagenomic coding regions is MetaGeneTack (Tang et al. 2013). It combines the original HSMM-based MetaGeneMark with an ab initio frameshift finding program, GeneTack (Antonov and Borodovsky 2010). Several filters of false-positive predictions were employed in MetaGeneTack to achieve higher accuracy. MetaGeneTack is reported to have higher frameshift prediction specificity than FragGeneScan (Table 2) in reads with an error rate typical for metagenomic projects (Tang et al. 2013).

Yet another approach was used in Glimmer-MG, which, to trace possible indel errors, splits an ORF into three branches (frames), starting from the position of a nucleotide called with low confidence (Kelley et al. 2012). This approach was reported to have higher gene prediction accuracy on error-containing reads than FragGeneScan. Methods that account for sequencing errors generally perform better in real error-prone metagenomic sequences than “idealistic” approaches. The accuracy of sequencing error detection, however, depends on how accurate the modeling of sequencing errors is.

Summary

Accurate ab initio gene prediction in metagenomic sequences is necessary for reliable functional annotation. Ab initio algorithms identify genes in metagenomic sequences by detecting intrinsic statistical patterns of coding and noncoding regions. Being independent of data stored in databases, these methods are especially useful for discovering novel genes. Special techniques have been developed for derivation of parameters of the ab initio algorithms working with short anonymous metagenomic sequences. We have reviewed several ab initio gene finders developed for metagenomic sequences, including the latest tools that take into account possible sequencing errors (frameshifts).
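The statement that a residual error rate of 0.005 % still affects 3–4.5 % of genes is consistent with a simple back-of-the-envelope calculation. The sketch below assumes that errors occur independently at each nucleotide and uses illustrative gene lengths of 600 and 900 nt (the latter being the mean prokaryotic gene length mentioned earlier in this entry); neither assumption is taken from Luo et al. (2012), and the function name is hypothetical.

```python
def fraction_of_genes_with_error(error_rate, gene_length):
    """Probability that a gene of `gene_length` nt carries at least one error,
    assuming independent per-nucleotide errors at rate `error_rate`."""
    return 1.0 - (1.0 - error_rate) ** gene_length


per_nt_error = 0.005 / 100  # 0.005 % expressed as a probability

for length in (600, 900):
    p = fraction_of_genes_with_error(per_nt_error, length)
    print(f"{length} nt gene: {100 * p:.1f} % chance of at least one error")
```

Under these assumptions the two lengths give roughly 3 % and 4.4 %, which brackets the range quoted in the text.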
results by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. Note that some tools in this category can only classify a subset of the metagenomic sequences instead of all. MLTreeMap (Stark et al. 2010) uses phylogenetic analysis of 31 marker genes for taxonomic distribution estimation. CARMA (Krause et al. 2008) searches for conserved Pfam domains and protein families in raw metagenomic sequences and classifies them into a higher-order taxonomy. The RDP classifier was designed for classification of 16S rRNA genes and later extended to classification of 18S rRNA genes using a naïve Bayes classifier (Cole et al. 2009).

further splitting bins. The recursive procedure continues if (1) the predicted abundance values of two bins differ significantly; (2) the predicted genome sizes are larger than a certain threshold; and (3) the number of reads associated with each bin is larger than a certain threshold proportion of the total number of reads classified in the parent bin.

AbundanceBin achieves accurate classification of even very short sequences sampled from species with different abundance levels, as tested on simulated and real metagenomic datasets. The software is available for download at http://omics.informatics.indiana.edu/AbundanceBin.
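The three conditions for continuing the recursive splitting can be read as a single predicate. The sketch below is only an illustration: the function name, the data layout, and all threshold values are hypothetical, and "differ significantly" is approximated by a simple abundance ratio, which is an assumption rather than the criterion actually used by AbundanceBin.

```python
def should_continue_splitting(bin_a, bin_b, parent_reads,
                              min_abundance_ratio=2.0,   # hypothetical
                              min_genome_size=400_000,   # hypothetical, in bp
                              min_read_fraction=0.03):   # hypothetical
    """Return True only if all three continuation conditions hold.

    bin_a and bin_b are dicts with keys 'abundance', 'genome_size', and
    'reads' (predicted abundance, predicted genome size, read count).
    """
    low, high = sorted((bin_a["abundance"], bin_b["abundance"]))
    # (1) Abundances differ "significantly"; the ratio test is an assumption.
    abundances_differ = low > 0 and high / low >= min_abundance_ratio
    # (2) Both predicted genome sizes exceed a threshold.
    sizes_ok = min(bin_a["genome_size"], bin_b["genome_size"]) > min_genome_size
    # (3) Each bin holds more than a threshold share of the parent's reads.
    reads_ok = min(bin_a["reads"], bin_b["reads"]) > min_read_fraction * parent_reads
    return abundances_differ and sizes_ok and reads_ok


# Invented example values.
a = {"abundance": 20.0, "genome_size": 3_000_000, "reads": 6000}
b = {"abundance": 4.0, "genome_size": 2_500_000, "reads": 4000}
keep = should_continue_splitting(a, b, parent_reads=10_000)
```

Requiring all three conditions at once stops the recursion as soon as a split would produce bins that are too similar, too small, or supported by too few reads.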
These ambiguities are directly associated with the read length reduction in NGS technologies. Second, communities usually consist of many microbes with similar genomes, different only in some parts, making it indeed impossible to determine the origin of a particular short read based solely on its sequence.

Despite these difficulties, NGS read sets have brought in richer abundance information of microbial communities than traditional datasets because of the significant increase in the number of reads. Along with the increase of read set size, efforts to assemble more reference genomes are ongoing. In addition, new experimental techniques, such as single-cell sequencing approaches, are being developed to sequence reference genomes directly from environmental samples. In the face of the challenges from short reads and the opportunities from fast-expanding reference genome databases, GRAMMy is a statistical framework developed to accurately and efficiently estimate the relative abundance of microbial organisms within the community (Xia et al. 2011).

Description

The GRAMMy Framework
The GRAMMy framework is based on a mixture model for short metagenomic sequencing reads and an expectation-maximization (EM) algorithm, as outlined in the model schema and the analysis flowchart in Figs. 1 and 2. GRAMMy accepts a set of shotgun reads as well as external references (e.g., genomes, scaffolds, or contigs) as inputs and subsequently performs the maximum-likelihood estimation (MLE) of the genome relative abundance (GRA) levels.
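The EM iteration for a finite mixture with fixed component likelihoods can be sketched as follows. This is a generic textbook illustration, not the GRAMMy implementation: the matrix `read_probs` (the probability of each read under each reference genome) and all names are assumptions, and in practice such probabilities would have to come from read mapping or a comparable assignment step.

```python
def em_mixing_proportions(read_probs, n_iter=200, tol=1e-9):
    """Estimate the mixing proportions of a finite mixture by EM.

    read_probs[i][j] is the (assumed known) probability of read i
    given reference genome j.
    """
    m = len(read_probs[0])
    pi = [1.0 / m] * m  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: accumulate the responsibility of each genome for each read.
        counts = [0.0] * m
        for row in read_probs:
            weighted = [p * w for p, w in zip(row, pi)]
            total = sum(weighted)
            if total == 0:
                continue  # read matches no reference; skipped in this sketch
            for j, w in enumerate(weighted):
                counts[j] += w / total
        # M-step: new proportions are the normalized expected counts.
        n = sum(counts)
        new_pi = [c / n for c in counts]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy data: three reads unique to genome 0, one unique to genome 1,
# and one read that is equally likely under both genomes.
toy_reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pi_hat = em_mixing_proportions(toy_reads)
```

The ambiguous read is shared between the two genomes in proportion to the current estimates, which is the mechanism that lets a mixture model use reads that cannot be assigned to a single reference.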
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads, Fig. 1 The GRAMMy model. A schematic diagram of the finite mixture model underlies the GRAMMy framework for shotgun metagenomics. In the figure, “iid” stands for “independent identically distributed”
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads 23 A
estimates. If the taxonomy information for the
input reference genomes is available, strain
(genome) level GRA estimates can be combined
to calculate high taxonomic level abundance, such
as species- and genus-level estimates.
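Combining strain-level estimates into higher-level ones amounts to summing the GRA values of all genomes assigned to the same taxon. A minimal sketch, in which the genome identifiers, the abundance values, and the taxonomy mapping are all made up for illustration:

```python
from collections import defaultdict


def aggregate_gra(strain_gra, genome_to_taxon):
    """Sum strain-level GRA estimates over genomes sharing a taxon."""
    totals = defaultdict(float)
    for genome, gra in strain_gra.items():
        totals[genome_to_taxon[genome]] += gra
    return dict(totals)


# Hypothetical strain-level estimates and species assignments.
strain_gra = {"genome_A1": 0.30, "genome_A2": 0.20, "genome_B1": 0.50}
genome_to_species = {
    "genome_A1": "species_A",
    "genome_A2": "species_A",
    "genome_B1": "species_B",
}
species_gra = aggregate_gra(strain_gra, genome_to_species)
```

The same function yields genus-level estimates if the mapping points to genera instead of species.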
\[
a_j = \frac{\#\, j\text{-th genome}}{\#\, \text{known genomes}}
\]

\[
a_j = \frac{\pi_j / l_j}{\sum_{k=1}^{m-1} \pi_k / l_k}
\]
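The length normalization in this expression can be illustrated numerically. The sketch assumes that π_j is the estimated mixing proportion (share of reads) of known genome j and that l_j is its length; the function name and the values are invented for illustration.

```python
def genome_relative_abundance(pi, lengths):
    """Length-normalize mixing proportions: a_j = (pi_j / l_j) / sum_k (pi_k / l_k)."""
    per_length = [p / l for p, l in zip(pi, lengths)]
    total = sum(per_length)
    return [x / total for x in per_length]


# Invented values: genome 0 contributes three times as many reads as
# genome 1 but is also three times as long.
pi = [0.75, 0.25]
lengths = [6_000_000, 2_000_000]
gra = genome_relative_abundance(pi, lengths)
```

In this example both genomes end up with the same relative abundance, because a longer genome yields proportionally more reads at the same number of genome copies.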
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads, Fig. 3 Frequent species of human gut microbiome. The 99 species occurring in at least 50 % of the 33 human gut samples with a minimum relative abundance of 0.05 % were selected
Gill SR, Pop M, Deboy RT, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
He PA, Xia L. Oligonucleotide profiling for discriminating bacteria in bacterial communities. Comb Chem High Throughput Screen. 2007;10(4):247–55.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14(4):169–81.
Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Xia LC, Cram JA, Chen T, et al. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One. 2011;6(12):e27992.

All-Species Living Tree Project

Pablo Yarza1, Raul Munoz2, Jean Euzéby3, Wolfgang Ludwig4, Karl-Heinz Schleifer4, Rudolf Amann5, Frank Oliver Glöckner6,7 and Ramon Rosselló-Móra2
1 Ribocon GmbH., Bremen, Germany
2 Marine Microbiology Group, Department of Ecology and Marine Resources, Institut Mediterrani d’Estudis Avançats (CSIC-UIB), Illes Balears, Spain
3 Society of Systematic Bacteriology and Veterinary (SBSV) & National Veterinary School de Toulouse (ENVT), Toulouse, France
4 Lehrstuhl für Mikrobiologie, Technische Universität München, Freising, Germany
5 Molecular Ecology Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
6 Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
7 Jacobs University Bremen gGmbH, Bremen, Germany

Synonyms

16S rRNA (SSU) and 23S rRNA (LSU) gene sequence databases; Alignments; LTP project; Manual curation; “Orphan” species; Taxa boundaries; Taxonomy/classification/phylogeny of Bacteria and Archaea; Type strains

The All-Species Living Tree Project (LTP) is an international initiative for the creation and maintenance of highly curated 16S rRNA and 23S rRNA gene sequence databases, alignments, and phylogenetic trees for all the type strains of Bacteria and Archaea.

Introduction

Classification and identification of Bacteria and Archaea reached a turning point around 35 years ago. It was the time when Carl Woese and co-workers demonstrated that ribosomal markers were appropriate to infer genealogical relationships by means of phylogenetic reconstructions (Fox et al. 1977). Rapidly, comparative analysis of rRNA gene sequences became a standard procedure with mature implications in microbial ecology and taxonomy: culture-independent exploration of ecosystems’ diversity (Amann et al. 1995) and settlement of the phylogenetic backbone (i.e., our current accepted classification of Bacteria and Archaea; Garrity 2001). As a result, the total amount of ribosomal RNA entries in the public DNA databases has grown exponentially since the early 1990s, currently comprising at least 3,500,000 small (SSU) and 300,000 large (LSU) ribosomal subunit gene sequence entries. On the other hand, the number of bacterial and archaeal species with validly published names has followed arithmetic trends with a rate of around 500–700 annual descriptions during the last 7 years (Fig. 1), currently (December 2012) exceeding the total number of 10,300 species and subspecies. A comparative overview of these trends until December 2011 is shown in Fig. 1.

Because the 16S rRNA gene has been, by orders of magnitude, the most often sequenced gene since the early 1990s, there is no alternative phylogenetic marker with such a high coverage in public repositories. However, abundance is not the only requisite for a proper phylogenetic inference, and other single molecules (e.g., 23S rRNA) or combinations of them might perform better at reflecting
All-Species Living Tree Project, Fig. 1 Annual growth of ribosomal 16S rRNA (a) and 23S rRNA (b) gene sequence databases and species and subspecies names with standing in nomenclature until December 2011. SILVA SSU-Parc111 and LSU-Parc111 databases (http://www.arb-silva.de/documentation/release-111/) were filtered by submission date until December 2011 and their cumulative annual growth was plotted in red (SSU, 1A) and yellow bars (LSU, 1B). The cumulative growth of published species and subspecies names (according to LPSN; http://www.bacterio.cict.fr/number.html) since 1980 until December 2011 is plotted in blue. Note that the total number of names is around 2,000 above the total number of distinct type strains due to homotypic synonyms, new combinations, nomina nova, later heterotypic synonyms, or illegitimate names
genealogies of certain groups given the higher information content (Ludwig and Klenk 2001). Although far from reaching 16S rRNA levels, submission of alternative markers is growing fast, mostly because (i) the number of metagenomes and complete genomes is growing exponentially due to the reduction in sequencing and analysis costs and (ii) a recent initiative aims to complete the genome sequence of all type strains (GEBA initiative). Undoubtedly, comparative genomics will bring a new breakthrough for microbial taxonomy, and the current phylogenetic backbone based on ribosomal sequences will be carefully reviewed (Coenye et al. 2005). Nevertheless, at this point, the number of sequenced genomes of type strains is still low and therefore
the current possibilities for an in-depth taxonomic study are sparse.

The responsible teams of the ARB, SILVA, and LPSN projects (www.arb-home.de, www.arb-silva.de, and www.bacterio.net) together with the journal Systematic and Applied Microbiology (SAM) started the “All-Species Living Tree Project” (LTP; http://www.arb-silva.de/projects/living-tree), a project conceived to provide a tool especially designed for the community of microbial taxonomists (Yarza et al. 2008). The main objectives considered so far are (1) provide a curated 16S and 23S rRNA gene database for the type strains of all species with validly published names; (2) set up an optimized and universally usable alignment; (3) reconstruct reliable phylogenetic trees with all the type strains; (4) maintain the database, alignments, and trees through regular updates including the new validly published taxa and their respective 16S and 23S rRNA gene sequences; and (5) investigate, with the use of the database, fundamental aspects about taxonomy of Bacteria and Archaea such as phylogenetic thresholds in new taxa circumscriptions, coherence of current taxonomy by means of phylogenetic schemes, and relevance of the ribosomal RNA genes in taxonomic studies.

Creation and Maintenance of LTP Releases

LTP Datasets
First LTP datasets (release LTPs93 for SSU (Yarza et al. 2008), release LTPs102 for LSU (Yarza et al. 2010)) were prepared following six main steps:
1. Set up a list of candidate sequences. An initial sequence dataset consisted of a subsample of the SILVA database, filtering by “type” (T) or “cultured” (C) strains; this information mainly came from StrainInfo.
2. Set up a list of species names. In parallel we built a comprehensive, updated, and nonredundant (i.e., free of synonyms and according to latest valid nomenclature) list of validly published species and subspecies names from LPSN. When a species is divided into subspecies, we substituted the original species name by that of the subspecies (e.g., Staphylococcus sciuri subsp. sciuri instead of Staphylococcus sciuri). We avoided the “Candidatus” names (e.g., “Candidatus Aciduliprofundum boonei”), Cyanobacteria not validly published under the Bacteriological Code (e.g., Anabaena oscillatorioides), and later heterotypic synonyms (e.g., Pseudomonas chloritidismutans).
3. Manual cross-check. Then, each entry from our initial list of sequences was assigned to a species name by manually examining the companion contextual metadata. This process had to be done manually given the often outdated, mistaken, or absent taxonomic information such as the organism name or the strain numbers.
4. Quest for missing type strains. We realized that not all species names were represented in the list of sequences. Then, we inverted the process by searching in resources like EMBL, Bergey’s Outlines, issues of the International Journal of Systematic and Evolutionary Microbiology (IJSEM), etc. with the aim to find a good-quality sequence entry for each missing type strain.
5. “Orphan” species recognition. Finally, we were left with a group of type strains whose 16S/23S rRNA genes had never been sequenced or whose existing sequences were of too low quality to be considered for the project (i.e., in terms of sequence length, number of ambiguities, etc.). We called them “orphan” species. The LTP project together with eleven international culture collections has driven the sequencing of these “orphan” species through the SOS initiative (Yarza et al. 2013).
6. Keep one sequence per species. On the other hand, the list of type-strain sequences was redundant in the sense that one single type strain could be represented by multiple sequence entries. This is the case of multiple independent sequencings and submissions, or the existence of several sequences due to multiple copies of the ribosomal operon. The aim of the LTP is, whenever possible, to keep one
All-Species Living Tree Project, Table 1 Summary of LTP releases. “Sync” fields correspond to IJSEM and EMBL release dates. “Net increase” of a release is the number of new entries minus the number of deleted entries. “% incorrect” refers to the percentage of new entries whose INSDC records carried incorrect information in the organism name field. Averages include standard deviation

Release  Type  IJSEM sync  EMBL sync  Total entries  New entries  Deleted entries  Net increase  % incorrect (a)  Average length (a)  Average ambig. (b)
LTPs93   SSU   Dec. 2007   Dec. 2007  6,728          6,728        0                6,728         22               1,465.0 ± 51.2      0.10 ± 0.26
LTPs95   SSU   Jun. 2008   Jun. 2008  7,006          299          21               278           45               1,446.0 ± 46.3      0.04 ± 0.11
LTPs100  SSU   Aug. 2009   Jun. 2009  7,710          750          46               704           50               1,448.0 ± 54.2      0.03 ± 0.11
LTPs102  SSU   Feb. 2010   Nov. 2009  8,029          363          44               319           58               1,453.6 ± 52        0.33 ± 0.12
LTPs102  LSU   Feb. 2010   Nov. 2009  792            792          0                792           6                2,866.1 ± 177.6     0.02 ± 0.11
LTPs104  SSU   Dec. 2010   May 2010   8,545          545          29               516           74               1,444.6 ± 62        0.27 ± 0.11
LTPs106  SSU   May 2011    Dec. 2010  8,815          279          9                270           77               1,445.9 ± 51.1      0.03 ± 0.12
LTPs108  SSU   Dec. 2011   Jun. 2011  9,279          490          26               464           60               1,455.4 ± 51.9      0.02 ± 0.09

(a) Average length for the “new entries”
(b) Average percentage of ambiguities for the “total entries”
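
The bookkeeping behind Table 1 is simple arithmetic and can be checked mechanically. The following is a minimal sketch, with illustrative variable and function names that are not part of any LTP tool, which recomputes the net increase of each SSU release as new minus deleted entries and verifies that the running sum reproduces the "Total entries" column.

```python
# SSU rows transcribed from Table 1:
# (release, total entries, new entries, deleted entries, net increase)
SSU_RELEASES = [
    ("LTPs93", 6728, 6728, 0, 6728),
    ("LTPs95", 7006, 299, 21, 278),
    ("LTPs100", 7710, 750, 46, 704),
    ("LTPs102", 8029, 363, 44, 319),
    ("LTPs104", 8545, 545, 29, 516),
    ("LTPs106", 8815, 279, 9, 270),
    ("LTPs108", 9279, 490, 26, 464),
]


def check_bookkeeping(rows):
    """Return (release, column) pairs whose figures are inconsistent."""
    problems = []
    running_total = 0
    for release, total, new, deleted, net in rows:
        # Net increase is defined as new entries minus deleted entries.
        if new - deleted != net:
            problems.append((release, "net increase"))
        # Each release total should equal the accumulated net increases.
        running_total += new - deleted
        if running_total != total:
            problems.append((release, "total entries"))
    return problems


INCONSISTENT = check_bookkeeping(SSU_RELEASES)
print(INCONSISTENT)  # an empty list means the table is internally consistent
```

For the SSU rows as transcribed here, the check reports no inconsistencies.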
sequence per type strain in order to maintain simplicity, avoid confusion, and improve tree navigation and database usability. In general, the best quality available (including manual inspection of the alignment) was selected for the project and, in case of doubt, the earliest submission to an INSDC partner (www.insdc.org). From release LTPs102 (Yarza et al. 2010), when multiple paralogues exist due to rRNA operon copy number, several copies are kept if they show less than 98 % sequence identity (see below for further details).

LTP is maintained by a scrutiny of the newly described species, nomenclatural changes, taxonomic notes, and opinions that are published monthly in the IJSEM journal. Their respective 16S and 23S rRNA gene sequence entries are acquired from the latest SILVA release and appended to the existing LTP database. Therefore, SILVA’s Reference (Ref) ARB databases (http://www.springerreference.com/docs/html/chapterdbid/304116.html) serve as template for the new LTP-ARB databases. Until now (December 2012) one LSU-based and seven SSU-based LTP releases have been produced (Table 1). New species are incorporated into the database only if they have a good-quality sequence in the respective SILVA release. Certain entries can be deleted if their corresponding species names are later found to be heterotypic synonyms, if they become rejected, or as a matter of taxonomic opinions. Sequence entries existing in an LTP database can also change by means of their metadata. Thus, for example, new combinations (i.e., a type strain which changes its name due to reclassification) or subdivision of a species into subspecies leads to an entry modification at its taxonomic information fields.

Inaccurate or Mistaken Metadata
Inaccurate sequence-associated metadata occur in more than 50 % of the newly added 16S rRNA entries (Table 1). Often, these “mistakes” arise because entries are not updated at the time of their first appearance in a scientific publication. This mainly occurs in taxonomy-associated information fields. Proving the uniqueness of a new species and naming it take time and, in the meanwhile, sequences are quickly produced and easily submitted to nucleotide databases. Most often, these submissions only show genus specifications, for example, sequence entry GU808562 appears as “Hymenobacter sp. HMD1010” but its real name is Hymenobacter yonginensis. Indeed, a Bacteriological Code-compliant (Lapage et al. 1992) nomenclature can be tricky, and it is common to
consider several Latin terms and derivations until one species name is finally accepted by authors and reviewers. Unavoidably, this bad-quality information is propagated from INSDC databases (primary sources) to other technological services like dedicated ribosomal databases (e.g., SILVA). Although extensive data curation is not a task of primary sources of information, it would be very beneficial if authors strengthened their commitment to the correctness of the metadata provided (e.g., the species name) or if authors were required to update their INSDC entries prior to manuscript acceptance (a recommended action for scientific journals). Subsequently, this rough data finally arrives at resources like LTP, which have no choice but to check it carefully and provide new informational fields with corrected information; curated information can then be returned to other resources of information.

Multiple Copies of the Ribosomal Operon
In 2010, a comprehensive study was conducted to evaluate the intra-genomic variability of the 16S rRNA gene on complete type-strain genomes (Yarza et al. 2010). We observed that in very unusual exceptions, the intra-genus (94.5 %; Yarza et al. 2008) or intraspecies (98.7 %; Stackebrandt and Ebers 2006) boundaries could be exceeded within a single genome. In such cases, the selection of one or another sequence might seriously affect the interpretation of a phylogenetic inference. However, despite the fact that the vast majority of strains contain multiple copies of the rrn operon, only 2 % of them show divergences beyond 2 % (30 nucleotides) in sequence identity. Thus, most likely, the selection of one or another copy should not affect the phylogenetic reconstructions. Consequently, starting from release s104 (Munoz et al. 2011), the LTP database includes all paralogues with divergences higher than 2 %. So far, this is the case for three species: Haloarcula marismortui ATCC 43049T, accession number AY596297, with 5.7 % of maximum inter-operonic divergence; Thermoanaerobacter pseudethanolicus ATCC 33223T, accession number CP000924, with 3.66 % of maximum inter-operonic divergence; and Desulfitobacterium hafniense DCB-2T, accession number CP001336, with 4.34 % of maximum inter-operonic divergence.

Sequence Quality in LTP Datasets
It has been suggested that sequences produced for taxonomic purposes should be at least 1,450 bases long with less than 0.5 % ambiguities (Stackebrandt et al. 2002). The reason is that the informative content of a molecular clock is linked to the total number of its variable positions (Ludwig and Klenk 2001). Statistics derived from LTP datasets indicate that in general, sequence quality is acceptable for in-depth phylogenetic studies (~1,455 bases and 0.02 % ambiguities for LTPs108; Table 1). Figure 2 shows annual variation of gene sequence length and percentage of ambiguities. Quality increase is mainly observed in terms of a reduction in ambiguities, probably related to the improvement of sequencing techniques. In any case, the completion of more full genome sequences of type strains will substantially increase the sequence quality (indicated by these two parameters) in the LTP database. Researchers should be encouraged to complete the 5′ ends of 16S rRNA gene sequences, as the first 250 bases contain hypervariable regions V1 and V2 which play an important role in comparisons between highly related organisms (Chakravorty et al. 2007).

Curated Metadata Introduced by the LTP
In addition to regular fields provided by the ARB-SILVA databases, sequence entries now include the following LTP-specific metadata fields:
1. fullname_ltp: corrected species name according to LPSN (http://www.bacterio.net).
2. rel_ltp: name of the LTP release where a sequence entry appeared for the first time.
3. hi_tax_ltp: name of the family where the taxon is classified. For unclassified genera, the name of the next available higher taxon above genus (e.g., “Acidobacteria” for Bryobacter aggregatus).
4. type_ltp: type species receive the label “type sp.” in this field.
5. riskgroup_ltp: risk-group classification of microorganisms obtained from the DSMZ
All-Species Living Tree Project, Fig. 2 Annual distribution of the 16S rRNA gene sequence length and % of ambiguities in the 9,279 type-strain sequences corresponding to LTP release s108. Gene sequence length is given by the SILVA parameter “nuc_gene_slv” which cuts off the bases at the extremes when beyond the E. coli 16S rRNA gene limits. Percentage of ambiguities is given by the SILVA descriptor “ambig_slv”
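
The two quality measures plotted in Fig. 2 can be approximated directly from a sequence string. The sketch below computes sequence length and percentage of ambiguities and applies the thresholds quoted in the text from Stackebrandt et al. (2002): at least 1,450 bases and less than 0.5 % ambiguities. It rests on assumptions: any symbol other than A, C, G, T, or U counts as ambiguous, gap characters are ignored, and the function names are hypothetical. SILVA's own "nuc_gene_slv" and "ambig_slv" values may be computed differently.

```python
UNAMBIGUOUS_BASES = set("ACGTU")
GAP_CHARACTERS = "-."


def quality_stats(sequence):
    """Return (length, percent_ambiguous) for a nucleotide sequence.

    Gap characters are dropped first, so aligned sequences can be passed in.
    """
    bases = [b for b in sequence.upper() if b not in GAP_CHARACTERS]
    if not bases:
        return 0, 0.0
    ambiguous = sum(1 for b in bases if b not in UNAMBIGUOUS_BASES)
    return len(bases), 100.0 * ambiguous / len(bases)


def meets_taxonomic_standard(sequence, min_length=1450, max_ambiguity_pct=0.5):
    """Apply the length and ambiguity thresholds quoted in the text."""
    length, pct = quality_stats(sequence)
    return length >= min_length and pct < max_ambiguity_pct
```

A call such as `meets_taxonomic_standard(seq)` could be used to flag entries that would fall short of the recommended quality before they are considered for a reference dataset.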
(Deutsche Sammlung von Mikroorganismen und Zellkulturen), according to the Federal Institute for Occupational Safety and Health (BAuA) in Germany.
6. tax_ltp: taxonomic classification into higher taxonomic ranks according to LPSN (http://www.bacterio.cict.fr/classifphyla.html).
7. url_lpsn_ltp: it contains the variable part of the URL leading to the LPSN’s species file (e.g., http://www.bacterio.net/bryobacter.html).

Alignments and Phylogenetic Trees
Setting up universal alignments is a key step in order to achieve optimal and comparable phylogenetic reconstructions. It has been one of the constant motivations of Wolfgang Ludwig and co-workers who dealt with the huge task of preparing a common and reliable alignment of ribosomal SSU and LSU sequences of Bacteria, Archaea, and Eukarya (Ludwig and Schleifer 1994). They found out that secondary structure formations such as loops and helices occurred at the same relative positions along the molecule. This helped to refine the alignments because variable stretches, with low sequence similarities, could be optimally positioned by recognizing functional homology (due to evolutionary pressure) and functional stability of helices (due to chemical stability of base-pair bonds). A core dataset of sequences with highly curated alignments was incorporated into the SILVA system so newly added sequences can be automatically aligned using this “seed alignment” as a reference (Ludwig et al. 2004; Pruesse et al. 2007). Periodically more and more manually curated sequences are added into the seed, which improves its quality over time.

Although all new sequences incorporated into the LTP come from an ARB-SILVA database, they are again manually revised to further correct misplaced bases and to check highly variable regions. Before tree calculation, the complete alignment is sifted using maximum frequency filters (Table 2) that remove dubious orthologous positions caused by sequencing errors and hypervariability. Typically, LTP phylogenetic trees are calculated using the 40 % maximum frequency filter.
All-Species Living Tree Project, Table 2 Maximum frequency filters implemented into the LTPs108 ARB database

Filter name      Start position  Stop position  %min (a)  %max (a)  No. of positions (b)
LTPs108_ssu_10   0               50,000         10        100       1,433
LTPs108_ssu_20   0               50,000         20        100       1,433
LTPs108_ssu_30   0               50,000         30        100       1,432
LTPs108_ssu_40   0               50,000         40        100       1,390
LTPs108_ssu_50   0               50,000         50        100       1,288

(a) Minimum and maximum sequence identity. For tree reconstructions, only columns are taken into account if they have a positional conservation above the respective minimum values
(b) Number of homologous positions (columns) taken into account for tree reconstructions
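
The column-selection rule stated in footnote (a) of Table 2 can be illustrated with a small sketch. It is not the ARB implementation: it assumes that positional conservation is the percentage of sequences sharing the most frequent base in a column, that gap characters are ignored when counting, and that all-gap columns are dropped. ARB's own filters may treat gaps and ambiguity codes differently, and the function names are hypothetical.

```python
from collections import Counter

GAP_CHARACTERS = "-."


def max_frequency_filter(alignment, min_pct, max_pct=100.0):
    """Return indices of alignment columns whose most frequent base
    accounts for between min_pct and max_pct of the non-gap characters."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for col in range(width):
        bases = [row[col].upper() for row in alignment
                 if row[col] not in GAP_CHARACTERS]
        if not bases:
            continue  # column contains only gaps
        top_count = Counter(bases).most_common(1)[0][1]
        conservation = 100.0 * top_count / len(bases)
        if min_pct <= conservation <= max_pct:
            kept.append(col)
    return kept


def apply_filter(alignment, columns):
    """Keep only the selected columns of every aligned sequence."""
    return ["".join(row[c] for c in columns) for row in alignment]


toy_alignment = ["ACGT", "ACGA", "ATCC", "A-TG"]
columns_40 = max_frequency_filter(toy_alignment, 40)
```

In the toy alignment the four columns have conservations of 100 %, about 67 %, 50 %, and 25 %, so a 40 % filter keeps the first three columns. This mirrors how raising the minimum value in Table 2 reduces the number of positions used for tree reconstruction.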
The first 16S rRNA-based phylogenetic tree was calculated for the release LTPs93 (Yarza et al. 2008). The sequence dataset consisted of 6,728 type-strain sequences plus 3,247 supporting sequences belonging to non-type strains used to reinforce underrepresented groups and to stabilize the topology. The multiple alignment of 9,975 16S rRNA gene sequences was submitted to different treeing methodologies including neighbor-joining, maximum likelihood, and maximum parsimony, all tested with several filters (30 %, 40 %, and 50 % maximum frequency filters) and all implemented in the ARB software package (Ludwig et al. 2004). A high degree of congruence was observed among them. The tree considered as optimal was a 40 %-filtered maximum likelihood reconstruction calculated using the RAxML algorithm (Stamatakis 2006), with the GTRGAMMA correction, with 100 bootstrap replicates, in a 5-node and 20-processor parallel environment. The last de novo phylogenetic reconstruction appears in the release LTPs108 and was similarly calculated; tree calculation was run with a dataset of 12,166 16S rRNA gene sequences.

The phylogenetic tree calculated using the 23S rRNA gene was particularly challenging due to data shortage in many groups. In order to set up a reliable phylogeny based on 23S rRNA data, we defined a core dataset made of high-quality sequences (type and non-type strains). The stringent quality filtering approach ended with around 2,000 high-quality and nonredundant LSU sequences. This dataset was submitted to a maximum likelihood reconstruction in combination with a 50 % maximum frequency filter allowing 2,463 positions of the entire alignment. The missing partial or lower-quality type-strain sequences were added to the tree using the ARB parsimony tool with the option for keeping the initial topology while inserting additional data.

The groups shown in the trees are defined by recognizing the type members and according to the taxonomic classification. The trees are carefully compared against previously reported topologies and current taxonomic classifications (Yarza et al. 2010). All the additional supporting sequences used to reconstruct the phylogeny are removed from the final tree by keeping its topology intact. Within the ARB database, the type species are labeled with a distinct color for easy recognition and tree handling.

Files Provided by the LTP

As a taxonomic tool, the LTP must be understood as a collection of reference materials, all publicly available at the project’s Web page (http://www.arb-silva.de/projects/living-tree), including:
1. Release documentation: (I) readme file with a release description and (II) PDF document describing the metadata fields introduced by the LTP
2. Tables: (I) new entries with outdated submission names and (II) list of changes in the dataset: added/deleted/modified entries
3. Export filter: ARB-export filter (.eft format) to extract data from LTP-ARB databases
4. Databases: (I) complete ARB databases including sequences, alignments, metadata, filters, and trees and (II) datasets in CSV format including LTP metadata
5. Alignments: (I) gapped exports in multi-FASTA format and (II) compressed exports in multi-FASTA format
6. Phylogenetic trees: (I) collapsed overviews in PDF format showing the distinct phyla, (II) full SSU (more than 80 pages long) and LSU trees in PDF format, and (III) full trees in NEWICK format, including group names and branch lengths

Side Research

Sequencing the Orphan Species Initiative (SOS)
The understanding that around 6 % of all classified species were missing from the ribosomal SSU sequence catalogues motivated us to start the “Sequencing the Orphan Species” (SOS) initiative (Yarza et al. 2013). During 3 years of work, the LTP team coordinated a network of 12 partner researchers and culture collections (ATCC, BZF, CECT, CIP, CCUG, DSMZ, JCM, ICMP, BCCM/LMG, MMG, NBRC, NCCB) in order to improve this situation by (re)sequencing the 16S rRNA gene of the “orphan” species. As a result, 351 type strains are now represented by a good-quality SSU gene sequence in the databases. They comprise representatives of 14 bacterial and archaeal phyla, 76 type species, and 78 pathogenic species. However, 201 type strains could not be accessed as cultivable strains were not available at recognized culture collections. They represent 10 phyla and 17 type species.

Taxonomic Boundaries
In order to understand how the higher taxonomic categories could be circumscribed by means of a sequence identity threshold, we performed a statistical procedure to get the lowest similarity found within the members of a certain taxon (Yarza et al. 2008, 2010). By taking into account all the taxa at a particular taxonomic rank, we obtained general lower cutoff values of sequence identity for genus, family, and phylum based on 16S rRNA and 23S rRNA. In general, minimum 16S rRNA gene sequence identities of 94.9 % ± 0.4, 87.5 % ± 1.3, and 78.4 % ± 2.0 lead to the circumscription of a new genus, family, and phylum, respectively. For 23S rRNA genes, these values are slightly different: 93.2 % ± 1.3 (genus), 87.7 % ± 2.5 (family), and 75.3 % (phylum). As shown by the low errors, historically used criteria for genera, families, and phyla are quite homogeneous and do not lead to unambiguous circumscriptions. These cutoffs should be used with caution and always as a complementary approach. They are especially recommended for prospective studies in clone libraries and as additional support for the circumscription of new taxa or emendation of existing ones.

Summary

SSU and LSU databases made by the All-Species Living Tree Project (LTP; http://www.arb-silva.de/projects/living-tree) provide high-quality nearly full-length sequences of the type strains of all Archaea and Bacteria with validly published names. Setting up a type-strain sequences database included the sieving of the public DNA databases whose sequence entries often appeared outdated or mistaken in their taxonomic metadata. It involved the initial manual cross-check of nearly 14,000 SSU and 6,000 LSU sequence entries against the catalogue of distinct species with validly published names retrieved from LPSN. Databases are complemented with manually curated metadata, manually curated alignments, and state-of-the-art phylogenetic reconstructions (in contrast to other similar resources like the EzTaxon (Santamaria et al. 2012)). The LTP team wants to remark that the aim of the project is not to reconstruct the currently described species genealogy with total fidelity but to provide a curated taxonomic tool for the scientific community. Our small but very representative SSU and LSU datasets may be used as a reference for identification and classification purposes in several fields of application, for example, facilitating the collection of sequences for the reconstruction of taxa genealogies (Cousin et al. 2012), enabling fast and
reliable taxonomic affiliations in rRNA surveys (Santamaria et al. 2012), or serving as reference datasets for testing bioinformatic procedures (Mizrahi-Man et al. 2013).

Cross-References

▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Phylogenetics, Overview
▶ SILVA Databases

References

Amann R, Ludwig W, Schleifer KH. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 1995;59:143–69.
Chakravorty S, Helb D, Burday M, et al. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods. 2007;69:330–9.
Coenye T, Gevers D, Van de Peer Y, et al. Towards a prokaryotic genomic taxonomy. FEMS Microbiol Rev. 2005;29:147–67.
Cousin S, Gulat-Okalla ML, Motreff L, et al. Lactobacillus gigeriorum sp. nov., isolated from chicken crop. Int J Syst Evol Microbiol. 2012;62:330–4.
Fox GE, Pechman KR, Woese CR. Comparative cataloguing of 16S ribosomal ribonucleic acid: molecular approach to prokaryotic systematics. Int J Bacteriol. 1977;27:44–57.
Garrity GM. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer; 2001.
Lapage SP, Sneath PHA, Lessel EF, et al. International code of nomenclature of bacteria (1990 revision). Washington, DC: American Society for Microbiology; 1992. p. 295.
Ludwig W, Klenk HP. Overview: a phylogenetic backbone and taxonomic framework for prokaryotic systematics. In: Boone DR, Castenholz RW, Garrity GM, editors. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer; 2001. p. 49–65.
Ludwig W, Schleifer KH. Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS Microbiol Rev. 1994;15:155–73.
Ludwig W, Strunk O, Westram R, et al. ARB: a software environment for sequence data. Nucleic Acids Res. 2004;32:1363–71.
Mizrahi-Man O, Davenport ER, Gilad Y. Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs. PLoS One. 2013;8:e53608.
Munoz R, Yarza P, Ludwig W, et al. Release LTPs104 of the all-species living tree. Syst Appl Microbiol. 2011;34:169–70.
Pruesse E, Quast C, Knittel K, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.
Santamaria M, Fosso B, Consiglio A, et al. Reference databases for taxonomic assignment in metagenomics. Brief Bioinform. 2012;13:682–95.
Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiol Today. 2006;33:152–5.
Stackebrandt E, Frederiksen W, Garrity GM, et al. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol. 2002;52:1043–7.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Yarza P, Richter M, Peplies J, et al. The all-species living tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst Appl Microbiol. 2008;31:241–50.
Yarza P, Ludwig W, Euzéby J, et al. Update of the all-species living tree project based on 16S and 23S rRNA sequence analyses. Syst Appl Microbiol. 2010;33:291–9.
Yarza P, Spröer C, Swiderski J, et al. Sequencing Orphan Species initiative (SOS): filling the gaps in the 16S rRNA gene sequence database for all species with validly published names. Syst Appl Microbiol. 2013;36:69–73.

antiSMASH

Eriko Takano (1), Rainer Breitling (1) and Marnix H. Medema (2)
(1) Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
(2) Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany

Definition

antiSMASH (Medema et al. 2011) is a web server and a stand-alone software to identify, annotate, and compare gene clusters that encode the biosynthesis of secondary metabolites in bacterial and fungal genomes. antiSMASH offers a wide
range of options to identify and analyze biosynthetic gene clusters, including protein domain analysis of the large multi-domain enzymatic assembly lines involved, prediction of core chemical structures of their end compounds, and multiple cluster alignments to a database of all currently sequenced gene clusters.
The antiSMASH web server can be found at http://antismash.secondarymetabolites.org.

Introduction

Microbial secondary metabolites are of great interest to society because of their diverse biological activities that are interesting starting points for drug development. Many of them are already used as antibiotics, antitumor agents, or cholesterol-lowering drugs (Hutchinson and McDaniel 2001; Fischbach and Walsh 2009). Automated computational identification of gene clusters in newly sequenced genomes is becoming a cornerstone of genome-based drug discovery, due to the affordability of sequencing large numbers of genomes from microorganisms that potentially produce novel secondary metabolites (Walsh and Fischbach 2010).

Functionalities

Gene Cluster Detection
antiSMASH detects a wide range of different types of biosynthetic gene clusters, including those encoding the pathways toward polyketides (PKs), nonribosomal peptides (NRPs), terpenoids, ribosomal peptides, aminoglycosides, and non-NRP siderophores. The detection is performed by screening the gene sequences from the input against a library of profile Hidden Markov Models (pHMMs) (Eddy 2011), each of which is specific for genes characteristic for a certain gene cluster type, and passing the results through a hierarchical logic filter. A second detection algorithm is also run, which detects genomic regions that are enriched in Pfam domains (Finn et al. 2010) linked to secondary metabolism.

Protein Domain Analysis of Polyketide Synthases and Nonribosomal Peptide Synthetases
PKs and NRPs are synthesized by large megasynthase enzymes containing a multitude of protein domains, such as condensation (C), adenylation (A), and PCP-binding domains in nonribosomal peptide synthetases (NRPSs) and ketosynthase (KS), acyltransferase (AT), and ACP-binding domains in polyketide synthases (PKSs) (Fischbach and Walsh 2006). antiSMASH contains a library of pHMMs that can recognize all these protein domains as well as distinguish between various subtypes of these domains. In the antiSMASH output, the domain structures of any NRPSs or PKSs encoded in a gene cluster are visualized, and several downstream analysis options are provided for each domain (Fig. 1).

Core Chemical Structure Prediction
When a secondary metabolite biosynthesis gene cluster is detected, one of the key questions is what kind of chemical structure it produces. For NRPs and PKs, antiSMASH can give a first approximation of the core chemical structure of the end compound (Fig. 2). To do so, it uses several substrate specificity prediction methods (Yadav et al. 2003; Minowa et al. 2007; Röttig et al. 2011) that are based on the amino acid sequences of the A domains of NRPSs and the AT domains of PKSs. To infer the sequential arrangement of the predicted substrates of the A/AT domains in the resulting polyketide or peptide, the order of the PKS enzymes in a multimodular assembly line is predicted using their estimated docking domain binding affinities (Yadav et al. 2009) or, alternatively, colinearity of the PKS or NRPS genes with their enzymes is assumed.

Comparative Analysis of Gene Clusters
In order to understand the architecture and function of a secondary metabolite biosynthesis gene cluster, much is gained by examining it within its evolutionary context through the
[Figure: domain diagram of SCO6273 (type I modular pks) with the domains KS, AT, DH, KR, TD]
antiSMASH, Fig. 1 Domain structure of multi-domain enzymes such as PKSs and NRPSs as visualized by antiSMASH, offering several options for analysis when the mouse is positioned over a domain: one can, for example, run a BlastP search specifically with the sequence of this domain
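
The domain list shown in Fig. 1 (KS, AT, DH, KR, TD) is the kind of per-gene hit list that the detection step described under Gene Cluster Detection works from. As a minimal sketch of rule-based filtering of pHMM hits, the example below assigns candidate cluster types to a gene when all domains required by a rule are present. The rule table, the domain labels used as keys, and the function name are hypothetical illustrations; they are not antiSMASH's actual rules or API, and the sketch uses a flat rule set where the text describes a hierarchical logic filter.

```python
# Hypothetical example rules: cluster type -> domains that must all be
# detected in the same gene for the rule to fire.
EXAMPLE_RULES = {
    "type I PKS": {"KS", "AT"},
    "NRPS": {"C", "A", "PCP"},
}


def classify_gene(domain_hits, rules=EXAMPLE_RULES):
    """Return the sorted names of all rules satisfied by the gene's domain hits."""
    hits = set(domain_hits)
    return sorted(name for name, required in rules.items() if required <= hits)


# Domains drawn for SCO6273 in Fig. 1.
sco6273_types = classify_gene(["KS", "AT", "DH", "KR", "TD"])
```

Under these illustrative rules, the SCO6273 domain list satisfies only the "type I PKS" rule, which agrees with the label given in the figure.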
antiSMASH, Fig. 2 Prediction of the core chemical structure of an NRP by antiSMASH. The residues are based on a consensus between three prediction methods for the substrate specificities of the NRPS adenylation domains in the gene cluster
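
The consensus mentioned in the caption of Fig. 2 can be pictured as a vote among the prediction methods for each adenylation domain. The sketch below is an assumption about how such a consensus could be formed, not antiSMASH's documented procedure: a substrate is accepted when at least two of the methods agree, and `None` is returned otherwise. The substrate labels and function names are hypothetical.

```python
from collections import Counter


def consensus_substrate(predictions, min_votes=2):
    """Return the substrate named by at least min_votes methods, else None.

    `predictions` holds one entry per method; None means no prediction.
    """
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    substrate, count = votes.most_common(1)[0]
    return substrate if count >= min_votes else None


def consensus_for_modules(per_module_predictions):
    """Apply the vote to every module of an assembly line, in order."""
    return [consensus_substrate(p) for p in per_module_predictions]


example_modules = [
    ["val", "val", "leu"],   # two of three methods agree
    ["gly", "ala", "ser"],   # no agreement
    ["thr", "thr", None],    # one method gave no prediction
]
example_consensus = consensus_for_modules(example_modules)
```

Returning `None` for modules without agreement keeps an unresolved position visible rather than silently choosing one method's answer.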
comparison with related gene clusters from species across the tree of life. To facilitate this, antiSMASH hosts a regularly updated database of gene clusters it has detected in all nucleotide sequences present in GenBank. antiSMASH then combines multiple BlastP runs into a comparative search of every identified gene cluster against all other known gene clusters. This is used to generate a multiple gene cluster alignment (Fig. 3), which can aid the biologist in assessment of the novelty of the gene cluster, detecting the borders of the gene cluster and identifying the conserved multigene modules that constitute its building blocks.

Secondary Metabolism-Specific Gene Family Analysis
Most genes involved in the biosynthesis of secondary metabolites have (close) homologues with similar functions in other secondary metabolite biosynthesis gene clusters. This can be used to infer the functions of the genes
antiSMASH, Fig. 3 Example of a multiple gene cluster alignment by antiSMASH, showing identified homologue
clusters of the query gene cluster
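
One simple way to picture how multiple BlastP results could be combined into a ranking of homologous clusters, as shown in Fig. 3, is to count how many distinct genes of the query cluster have a hit in each reference cluster. The sketch below does exactly that. It is an illustration under stated assumptions, not antiSMASH's scoring scheme, which may also weigh gene order or hit quality; the function name and the input structures are hypothetical.

```python
from collections import defaultdict


def rank_clusters(hits, gene_to_cluster):
    """Rank reference clusters by the number of distinct query genes hitting them.

    hits: iterable of (query_gene, subject_gene) pairs, e.g. filtered BlastP hits.
    gene_to_cluster: dict mapping subject genes to a reference cluster name.
    Returns a list of (cluster, n_query_genes) sorted by count, then name.
    """
    matched = defaultdict(set)
    for query_gene, subject_gene in hits:
        cluster = gene_to_cluster.get(subject_gene)
        if cluster is not None:  # ignore hits outside any known cluster
            matched[cluster].add(query_gene)
    return sorted(
        ((cluster, len(genes)) for cluster, genes in matched.items()),
        key=lambda item: (-item[1], item[0]),
    )


example_hits = [
    ("q1", "a1"), ("q2", "a2"), ("q3", "a3"),
    ("q1", "a4"), ("q1", "b1"), ("q2", "unclustered"),
]
example_membership = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B"}
example_ranking = rank_clusters(example_hits, example_membership)
```

Counting distinct query genes rather than raw hits prevents a single query gene with many paralogous hits from inflating a cluster's rank.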
residing in the biosynthetic gene cluster based on sequence homology. antiSMASH simplifies this process by categorizing the genes of every identified gene cluster into secondary metabolism-specific gene families and automatically generating approximate phylogenetic trees of each gene in the context of its gene family.

Genome-Wide Pfam and Blast Analysis
Finally, antiSMASH also offers the possibility (transferred from CLUSEAN; Weber et al. 2009) to do a comprehensive analysis of all genes within a submitted genome, identifying Pfam matches and running Blast for each gene against a database of all bacterial and fungal protein sequences.

Stand-Alone Version

Stand-alone versions of antiSMASH are available for download for Windows, Mac OS X, and Ubuntu Linux. Additionally, several related scripts are available from the antiSMASH website. An EMBL formatting script can be downloaded to format raw FASTA sequences together with a text file containing gene
annotations into an EMBL file that can be submitted to antiSMASH. Also, a script is available which allows running antiSMASH on multiple files, in batch mode.

Development

Related Tools

Several other software tools for the study of secondary metabolism have been published. For example, ClustScan (Starcevic et al. 2008) and NP.searcher (Li et al. 2009) can both be used to detect bacterial polyketide and NRP biosynthesis gene clusters. The same is the case for CLUSEAN (Weber et al. 2009), the pipeline which has now been integrated entirely into antiSMASH. For the analysis of fungal sequences, SMURF (Khaldi et al. 2010) offers a gene cluster detection potential similar to that of antiSMASH. Structural analysis of polyketide synthases can be performed with the SBSPKS suite (Anand et al. 2010). Finally, draft genomes with many small contigs and metagenomes with fragments too small for gene cluster detection can be scrutinized with NaPDoS (Ziemert et al. 2012) in order to find protein domains related to secondary metabolite biosynthesis and analyze these phylogenetically.

Summary

antiSMASH is an easy-to-use web server for the detection of secondary metabolite biosynthesis gene clusters. Various functionalities (comparative, phylogenomic, enzymatic, etc.) are integrated in one single pipeline, making it straightforward for genomicists and natural product researchers to study the biosynthetic potential of any organism.

References

Anand S, Prasad MV, Yadav G, Kumar N, Shehara J, Ansari MZ, Mohanty D. SBSPKS: structure based sequence analysis of polyketide synthases. Nucleic Acids Res. 2010;38:W487–96.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–22.
Fischbach MA, Walsh CT. Assembly-line enzymology for polyketide and nonribosomal peptide antibiotics: logic, machinery, and mechanisms. Chem Rev. 2006;106:3468–96.
Fischbach MA, Walsh CT. Antibiotics for emerging pathogens. Science. 2009;325:1089–93.
Hutchinson CR, McDaniel R. Combinatorial biosynthesis in microorganisms as a route to new antimicrobial, antitumor and neuroregenerative drugs. Curr Opin Investig Drugs. 2001;2:1681–90.
Khaldi N, Seifuddin FT, Turner G, Haft D, Nierman WC, Wolfe KH, Fedorova ND. SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet Biol. 2010;47:736–41.
Li MH, Ung PM, Zajkowski J, Garneau-Tsodikova S, Sherman DH. Automated genome mining for natural products. BMC Bioinformatics. 2009;10:185.
Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, Weber T, Takano E, Breitling R. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39:W339–46.
A 38 Approaches in Metagenome Research: Progress and Challenges
frequencies, GC content, and synonymous codon usage, whereas similarity-based binning makes use of sequence homology. Among others, PhyloPythiaS, introduced by Patil et al. (2011), represents an appropriate application to perform composition-based binning. With respect to similarity-based binning, searches against reference databases (e.g., National Center for Biotechnology Information databases) are typically performed using alignment tools such as BLAST+ (Camacho et al. 2009). Subsequently, BLAST results can be interpreted by applying software such as MEGAN (Huson et al. 2011).

Due to the often very high diversity of microbial communities, assembly of metagenome-derived sequences is challenging. In a recent metagenomic survey of honey bee gut microbiota, de novo assembly of 81,343,096 Illumina paired-end reads resulted in 54,700 scaffolds of contigs (total length, 76.6 Mb) (Engel et al. 2012). Similar to the approach conducted by Engel et al. (2012), single-genome assemblers were used for metagenome assembly with modified settings. Recently, a single-genome assembler (Velvet) has been extended to enable the assembly of short metagenomic reads (Namiki et al. 2012). This new de novo assembler (MetaVelvet) generated significantly higher N50 scores, a parameter that evaluates assembly quality, than the single-genome assemblers analyzed for simulated datasets.

Based on assemblies or individual metagenomic sequence reads, gene prediction, annotation, and reconstruction of pathways can be carried out to assess the functional potential encoded by metagenomes. Consecutive processing of these steps is provided by a number of web-based tools like MG-RAST (Meyer et al. 2008). These tools utilize resources of reference databases such as SEED (Overbeek et al. 2005) and KEGG (Kanehisa et al. 2008) to link biological information to predicted genes. In a recent survey including metagenomic methods, the functional potential of Arctic Thaumarchaeota was investigated (Alonso Sáez et al. 2012). By analyzing a metagenome derived from a Southeast Beaufort Sea sample collected during Arctic winter, Alonso Sáez et al. (2012) identified thaumarchaeal pathways for ammonia oxidation. A number of other Thaumarchaeota are also capable of ammonia oxidation, but unexpectedly these Arctic thaumarchaeal organisms harbored a high abundance of genes involved in urea transport and degradation.

Metagenomic Biomolecule Discovery

To access the large pool of unexplored biomolecules, microbial community DNA has been extracted and metagenomic libraries have been constructed. Small-insert and large-insert metagenomic libraries can be screened to identify novel biomolecules. For the construction of small-insert libraries containing metagenomic DNA <15 kb, plasmids are appropriate vectors, whereas cosmids, fosmids, and bacterial artificial chromosomes (BACs) can be used for cloning of large metagenomic DNA molecules (cosmids and fosmids, ≤40 kb; BACs, 100–200 kb). Metagenomic libraries from different microbial habitats such as glacier ice, digestive tracts of animals, soil, hot springs, or seawater have already been constructed and successfully screened for novel biomolecules (see, e.g., Nacke et al. 2012). Some of these biomolecules exhibit valuable characteristics for industrial applications such as thermal stability, halotolerance, and activity under acidic or alkaline conditions. In a recent metagenomic approach, Sulaiman et al. (2012) isolated a gene encoding a novel cutinase homolog designated LC-cutinase with polyethylene terephthalate-degrading activity from a leaf-branch compost fosmid library. The enzyme showed higher specific polyethylene terephthalate-degrading activity than previously reported bacterial and fungal cutinases. Thus, LC-cutinase is a potent candidate for industrial applications, for example in the textile industry. In general, two different metagenomic screening approaches for the identification of novel biomolecules can be distinguished: function-based screening and sequence-based screening.
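The N50 score mentioned above in connection with MetaVelvet can be illustrated with a short calculation. The sketch below assumes the common convention that N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length; the function name `n50` and the example contig lengths are hypothetical, and the evaluation in Namiki et al. (2012) may differ in detail.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Assumed convention: sort contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches at least half of the summed assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (bp), for illustration only.
    example = [100, 50, 25, 25]
    print("N50:", n50(example))
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct, so it is usually reported alongside other quality measures.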
Principle and Variations of Function-Driven Screens

To perform function-driven screening, the construction of small-insert or large-insert metagenomic libraries is required. A broad array of different function-based screening approaches can be applied using these libraries. Phenotypic insert detection (PID) is the most frequently applied screening strategy. Metagenomic library-containing clones expressing target genes are identified based on phenotypic characteristics. This method has been applied to identify novel lipolytic genes and gene families from German forest and grassland soil samples using tributyrin as a screening substrate (Nacke et al. 2011). A total of 37 lipolytic clones, encoding novel lipases and esterases, which could be assigned to five different known families and two putatively new families of lipolytic enzymes, were identified by halo formation on indicator agar plates. The potential to identify entirely novel target genes is an important advantage of function-driven screening approaches. Modulated detection (MD) represents another commonly applied strategy to perform function-based screening. A host strain carrying the metagenomic library can grow under selective conditions only if it expresses a certain gene product. Recently, novel acid resistance genes were derived from planktonic and rhizosphere microbial communities of the Tinto River (Spain) using this strategy (Guazzaroni et al. 2013). Fifteen genes, mainly encoding putative proteins of unknown function, conferred acid resistance to the host strain Escherichia coli. Moreover, substrate-induced gene expression (SIGEX), product-induced gene expression (PIGEX), and metabolite-regulated expression (METREX) screening strategies allow the identification of target genes from metagenomic libraries (Simon and Daniel 2009). Recently, Wang et al. (2012) suggested biosensor-based genetic transducer (BGT) systems as an alternative and sensitive approach to screen for gene clusters whose expression produces small molecules that activate the employed biosensors. Nevertheless, all of these function-based screening approaches share one significant disadvantage: the dependence of target gene production on the expression machinery of the metagenomic library host.

Principle and Variants of Sequence-Based Screening

Conserved regions of genes or proteins enable sequence-driven screening approaches. Based on these regions, degenerate primers can be designed and fragments of target genes amplified. For example, novel biphenyl dioxygenase DNA segments encoding active site residues were obtained from polychlorobiphenyl-contaminated soils using this strategy (Standfuß-Gabisch et al. 2012). After sequencing of an amplified partial target gene, it can be decoded completely using primer walking and extracted environmental DNA or a metagenomic library as a template. In this way, an entire xylose isomerase gene (xym1) has been derived from a soil metagenomic library (Parachin and Gorwa-Grauslund 2011). The gene product of xym1 consisted of 443 amino acids and was most similar (83 % identity) to a xylose isomerase from Sorangium cellulosum. Additionally, novel complex polyketide and nonribosomal peptide biosynthesis gene clusters that often exceed average insert sizes of large-insert metagenomic libraries can be discovered by using degenerate primers and subsequent chromosome walking (Piel 2011). The potential to identify genes of interest even if they are not expressed in a metagenomic library host represents a major advantage of sequence-based screening, but only novel variants of already-known gene or protein families can be detected by this method.

Future Challenges in Metagenomic Research and Related Meta-omic Approaches

One of the major requirements to combine and compare metagenomic studies conducted by
research groups worldwide is the definition and acceptance of minimum standards in experimental design. The same applies to metatranscriptomics, metaproteomics, and metabolomics. In this way, comparison and combination of results obtained from the different meta-omic approaches are feasible. Metatranscriptomics, metaproteomics, and metabolomics comprise the study of the collective gene transcripts, expressed proteins, and metabolites, respectively, generated by the microorganisms within an ecosystem (Nacke et al. 2014; Hettich et al. 2012; Patti et al. 2012). The consistent application and combination of appropriate meta-omic approaches will lead to an enormous extension of knowledge on the gene structure, diversity, activity, and responses of microbial communities on an ecosystem level. Furthermore, the rapid growth of meta-omic technologies will continuously demand progress in the field of bioinformatics. Thus, further development and linkage of meta-omic analysis tools will be important in the future. In addition, the application and improvement of culture-based methods will still be valuable in the future to extend the number of available reference genomes allowing mapping of metagenomic data. In this context, the young discipline of single cell genomics has potential to play a complementary role by continuously contributing novel reference genomes.

Summary

The introduction of metagenomics allowed culture-independent analysis of microbial populations in complex ecosystems. Subsequently, other culture-independent meta-omic disciplines including metatranscriptomics, metaproteomics, and metabolomics were established. Metagenomics provided insights into the enormous phylogenetic and functional diversity of microbial communities within various environments on Earth. The increasing number of next-generation sequencing technologies led to a more comprehensive and cost-effective assessment of the information encoded by metagenomic DNA. Metagenomic approaches comprising the construction and screening of metagenomic libraries resulted in identification of previously unknown biomolecules, including biomolecules with industrially relevant characteristics.

Cross-References

▶ A 123 of Metagenomics
▶ Extraction Methods, Variability Encountered in
▶ Fosmid System
▶ Genome Portal, Joint Genome Institute
▶ Microbial Diversity, Bar-Coding Approaches
▶ Microbial Ecosystems, Protection of
▶ Phylogenetics, Overview

References

Alonso Sáez L, Waller AS, Mende DR, et al. Role for urea in nitrification by polar marine Archaea. Proc Natl Acad Sci USA. 2012;109:17989–94.
Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinforma. 2009;10:421.
Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
Engel P, Martinson VG, Moran NA. Functional diversity within the simple gut microbiota of the honey bee. Proc Natl Acad Sci USA. 2012;109:11002–7.
Gonzalez A, Knight R. Advancing analytical algorithms and pipelines for billions of microbial sequences. Curr Opin Biotechnol. 2012;23:64–71.
Guazzaroni ME, Morgante V, Mirete S, et al. Novel acid resistance genes from the metagenome of the Tinto River, an extremely acidic environment. Environ Microbiol. 2013;15:1088–1102.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Hettich RL, Sharma R, Chourey K, et al. Microbial metaproteomics: identifying the repertoire of proteins that microorganisms use to compete and cooperate in complex environmental communities. Curr Opin Microbiol. 2012;15:373–80.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Kanehisa M, Araki M, Goto S, et al. KEGG for linking genomes to life and environment. Nucleic Acids Res. 2008;36:D480–4.
Kembel SW, Wu M, Eisen JA, et al. Incorporating 16S gene copy number information improves estimates of
Arbuscular Mycorrhizal Fungi Assemblages in Chernozems 43 A
microbial diversity and abundance. PLoS Comput Biol. 2012;8:e1002743.
Li K, Bihan M, Yooseph S, Methé BA. Analyses of the microbial diversity across the human microbiome. PLoS ONE. 2012;7:e32118.
Ludwig W, Klenk HP. Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics. In: Garrity GM, Boone DR, Castenholz RW, editors. Bergey’s manual of systematic bacteriology, Vol. 1. 2nd ed. New York: Springer; 2001. p. 49–65.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Murray AE, Kenig F, Fritsen CH, et al. Microbial life at −13 °C in the brine of an ice-sealed Antarctic lake. Proc Natl Acad Sci USA. 2012;109:20626–31.
Nacke H, Will C, Herzog S, et al. Identification of novel lipolytic genes and gene families by screening of metagenomic libraries derived from soil samples of the German biodiversity exploratories. FEMS Microbiol Ecol. 2011;78:188–201.
Nacke H, Engelhaupt M, Brady S, et al. Identification and characterization of novel cellulolytic and hemicellulolytic genes and enzymes derived from German grassland soil metagenomes. Biotechnol Lett. 2012;34:663–75.
Nacke H, Fischer C, Thürmer A, et al. Land use type significantly affects microbial gene transcription in soil. Microb Ecol. 2014;67:919–30.
Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Parachin NS, Gorwa-Grauslund MF. Isolation of xylose isomerases by sequence- and function-based screening from a soil metagenomic library. Biotechnol Biofuels. 2011;4:9.
Patil KR, Haider P, Pope PB, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8:191–2.
Patti GJ, Yanes O, Siuzdak G. Innovation: Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol. 2012;13:263–69.
Piel J. Approaches to capturing and designing biologically active small molecules produced by uncultured microbes. Annu Rev Microbiol. 2011;65:431–53.
Simon C, Daniel R. Achievements and new knowledge unraveled by metagenomic approaches. Appl Microbiol Biotechnol. 2009;85:265–76.
Simon C, Daniel R. Metagenomic analyses: past and future trends. Appl Environ Microbiol. 2011;77:1153–61.
Song ZQ, Wang FP, Zhi XY, et al. Bacterial and archaeal diversities in Yunnan and Tibetan hot springs, China. Environ Microbiol. 2013;15:1160–75.
Standfuß-Gabisch C, Al-Halbouni D, Hofer B. Characterization of biphenyl dioxygenase sequences and activities encoded by the metagenomes of highly polychlorobiphenyl-contaminated soils. Appl Environ Microbiol. 2012;78:2706–15.
Sulaiman S, Yamato S, Kanaya E, et al. Isolation of a novel cutinase homolog with polyethylene terephthalate-degrading activity from leaf-branch compost by using a metagenomic approach. Appl Environ Microbiol. 2012;78:1556–62.
Wang Y, Chen Y, Zhou Q, et al. A culture-independent approach to unravel uncultured bacteria and functional genes in a complex microbial community. PLoS ONE. 2012;7:e47530.
Weinstock GM. Genomic approaches to studying the human microbiota. Nature. 2012;489:250–6.

Arbuscular Mycorrhizal Fungi Assemblages in Chernozems

Chantal Hamel, Luke D. Bainard and Mulan Dai
Semiarid Prairie Agricultural Research Centre, Agriculture and Agri-Food Canada, Swift Current, SK, Canada

Synonyms

Diversity, arbuscular mycorrhizal fungi, Canadian Prairie, Chernozem, land use.

Definition

AM fungi are obligate plant symbionts that form the phylum Glomeromycota. These fungi contribute to plant nutrient uptake, influence soil biotic and abiotic environments, and provide important ecosystem services. 454-pyrosequencing of amplicons from metagenomic DNA revealed the distribution of AM fungi in major Canadian Chernozem great groups as influenced by land use and crop management.

Introduction

AM fungi form a mycorrhizal symbiosis with the roots of the majority of land plants. They have
coevolved with plants over 450 Ma to produce today’s mycorrhiza, which is an organ specialized in the extraction of soil nutrients. As such, AM fungi are seen as a keystone of agricultural sustainability (Garg and Chandel 2010).

World grain, pulse, and biofuel crop production mainly occurs on deep (typically >18–25 cm) warm-colored soils rich in humus (>0.6 % organic carbon) and weatherable minerals, with high levels of base saturation (>50 %) and calcium as the main exchangeable cation (Durán et al. 2011). These soils have similar properties but have different names in other soil classification systems. They are Chernozems in Canada, Ukraine, and Russia; Mollisols in the USA and South America; Isohumosols or Black Soils in China; and Chernozems, Kastanozems, and Phaeozems according to the FAO (Liu et al. 2012). These soils have typically developed under conditions of moisture deficit and grassland vegetation in temperate regions around the globe. They mainly occur in a band across Eastern Europe and Central Asia, in northeast China, from south-central Canada down to the Gulf of Mexico, and over most of Uruguay and part of Argentina.

Tackling the Complexity of Soil Biodiversity

Soil hosts an extremely high level of microbial diversity (Young and Crawford 2004). However, high-throughput next-generation sequencing now allows generation of the massive sequence data required to characterize soil microbial diversity. Amplicon sequencing is preferred over whole genome sequencing for the study of the taxonomic diversity of targeted microbial groups. The 454 FLX and 454 FLX+ technologies allow the sequencing of DNA amplicons up to 400 and 800 bp in length, respectively. Such long sequences contain sufficient taxonomic information for the characterization of microbial communities, and their use conveniently eliminates the need for sequence assembly.

Pyrosequencing of amplicons and bioinformatic analysis of sequence data yield the profile of operational taxonomic units (OTU) of the target microbial group in a soil sample. The concept of an OTU is useful in soil microbiology as the majority of microbial species are still undescribed. OTUs serve as a proxy for species, making it possible to measure and describe soil microbial diversity. In addition, OTUs can be identified by comparison with known sequences in public databases such as GenBank and MaarjAM. AM fungi have been difficult to study due to their obligate biotrophy and inability to grow in pure culture. However, polymerase chain reaction (PCR) made possible the amplification of DNA from their spores and enabled the molecular characterization and classification of taxa within the Glomeromycota (Schuessler 2013).

Fungal diversity is commonly assessed based on the internal transcribed spacer (ITS) of the ribosomal RNA gene. However, abundant SSU rRNA gene sequences of AM fungi are found in databases due to the traditional use of this region for the Glomeromycota. Several primer sets producing taxonomically informative amplicons short enough for use with first- and next-generation molecular techniques have been used in ecological studies of AM fungi.

The AM fungi have a patchy distribution in soil (Hart and Klironomos 2003). Thus, in order to capture their diversity, multiple samples must be taken at a study site. A composite sample is usually made by pooling and homogenizing all the samples from a sampling site. The distribution of organisms varies with soil depth, thus sampling depth also matters. The AM fungi are normally found within the rooting depth.

Arbuscular Mycorrhizal Fungi in the Canadian Chernozems

AM fungal communities in the Canadian Prairie Chernozem soils are composed of a few dominant and a large number of subordinate taxa. Less than 6 % of the AM fungal OTUs accounted for half of all AM fungal reads (Dai et al. 2013). Across the Canadian prairie landscape, the Glomeraceae were the most abundant family, accounting for
65 % of all AM fungal OTUs and 54 % of the AM fungal reads. The Claroideoglomeraceae is second in abundance with 25 % of all AM fungal OTUs and 39 % of the AM fungal reads. Diversisporaceae accounted for 8 % of the OTUs and 7 % of the AM fungal reads. Paraglomeraceae, Gigasporaceae, and Archaeosporaceae are poorly distributed across the prairie landscape, and Gigasporaceae and Archaeosporaceae are rare.

In other regions, spore counts in grazed Kastanozems of Inner Mongolia revealed that the AM fungal communities resembled those observed in Canadian Chernozems (Tian et al. 2009). The Gigasporaceae are susceptible to disturbance and largely absent in croplands, which explains their greater abundance in the Kastanozems than in the Canadian Prairie Chernozems (Dai et al. 2012, 2013). Poorer AM fungal diversity is reported from American spore-based surveys of Mollisols under tallgrass prairie cover where Paraglomeraceae and Archaeosporaceae were undetected (Eom et al. 2001; Bentivenga and Hetrick 1992). Tallgrass prairies managed with fire were found to be strongly dominated by the Glomeraceae (Bentivenga and Hetrick 1992), underlining the importance of land use in the structuring of AM fungal communities.

AM fungi share root occupation with fungal endophytes belonging to different taxonomic groups. Non-AM fungal endophytes are particularly abundant in temperate grasslands (Porras-Alfaro et al. 2011). This observation triggered the question as to whether AM fungi are at the end of their range in dry areas.

This hypothesis was explored in the Canadian Prairie using primers Glo1/NS31, which produced 18S rDNA amplicons of about 230 bp (Yang et al. 2010). A succession of AM fungi was detected as the soil dried from early to late summer, suggesting that the adaptation of AM fungi to soil moisture availability varies with species. Glomus viscosum, Funneliformis mosseae, and Glomus hoi were dominant in early summer, under conditions of moisture sufficiency, whereas the dominant AM fungal OTUs in late season conditions (i.e., dry soil) belonged to Glomus iranicum and Glomus macrocarpum. This concurs with the previous observation of differences in the seasonal pattern of sporulation of different AM fungal species (Dhillion and Anderson 1993). Seasonal variation of AM fungi in the North American Great Plains was also described as the replacement of fungi of the order Helotiales by AM fungi as the season unfolds (Jumpponen 2011).

The Chernozem great groups are distributed along a gradient of precipitation radiating outward from the US border in eastern Alberta, i.e., from the Brown soil zone through Dark Brown and Black soils up to the Gray soil zone at the fringe of the boreal forest. The lowest abundance, richness, and diversity of AM fungi were observed in the driest soil zone (Brown Chernozem), which points to a negative impact of moisture deficit on these fungi.

Soil moisture appears to be just one of several factors that influence the composition of AM fungal communities in Chernozem soils. Despite the highest levels of precipitation in the Gray soil zone, the highly productive Black soils harbor the most abundant and diverse AM fungal communities (Dai et al. 2012). Black, Gray, Dark Brown, and Brown soils had an average of 10.2, 7.1, 7.0, and 6.2 AM fungal OTUs, respectively, and the Shannon diversity index of these soil groups follows a similar trend. AM fungal communities in Brown soils are characterized by a reduced relative abundance of Claroideoglomeraceae compared to Black and Dark Brown soils. Other important factors that influenced the abundance of AM fungal OTUs were A horizon thickness and physicochemical properties of the soils, such as bulk density, Zn level, pH, electrical conductivity, and sulfur level.

Soils are classified based on their physical and chemical properties. A soil type represents a living environment inhabited by different AM fungal communities. American Mollisols and Alfisols contain distinct AM fungal spore assemblages (Ji et al. 2012). Similarly, Canadian Chernozems and Podzols and even different great groups of Chernozems contained distinct assemblages of AM fungal rRNA gene sequences (Dai et al. 2013).
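The community measures used above (OTU richness, the Shannon diversity index, and the share of OTUs that accounts for half of all reads) can be sketched as follows. This is a minimal illustration assuming the natural-log form of the Shannon index, H' = -Σ p_i ln p_i, where p_i is the fraction of reads assigned to OTU i; the function names and the read counts are hypothetical and are not data from Dai et al. (2012, 2013), whose exact computation may differ.

```python
import math


def otu_richness(read_counts):
    """Number of OTUs with at least one read in the sample."""
    return sum(1 for count in read_counts if count > 0)


def shannon_index(read_counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over OTUs with reads."""
    total = sum(read_counts)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log(count / total)
        for count in read_counts
        if count > 0
    )


def fraction_of_otus_for_half_of_reads(read_counts):
    """Smallest share of OTUs (most abundant first) holding >= 50 % of reads."""
    observed = sorted((c for c in read_counts if c > 0), reverse=True)
    if not observed:
        return 0.0
    half_total = sum(observed) / 2
    running = 0
    for rank, count in enumerate(observed, start=1):
        running += count
        if running >= half_total:
            return rank / len(observed)


if __name__ == "__main__":
    # Hypothetical read counts per OTU for one composite soil sample.
    sample = [50, 30, 10, 5, 5]
    print("richness:", otu_richness(sample))
    print("Shannon H':", round(shannon_index(sample), 3))
    print("OTU share for half of reads:",
          fraction_of_otus_for_half_of_reads(sample))
```

Richness counts OTUs regardless of abundance, whereas the Shannon index also reflects evenness, which is why a community dominated by a few OTUs can score low on Shannon diversity even when many rare OTUs are present.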
Land use modifies the conditions of the soil environment and the impact of land use on the structure of AM fungal communities exceeds that of soil type. In the Canadian Prairie, roadsides host a higher level of AM fungal diversity than cropland or natural areas (Dai et al. 2013). Roadsides have higher soil moisture levels than cropland and most natural areas, further indicating that water availability is an important determinant of the abundance and structure of AM fungal communities. Seven percent of the AM fungal OTUs found across the prairie soil zones are unique to croplands, whereas 14 % of the AM fungal OTUs are specific to roadsides. Roadsides and natural areas are dominated by an OTU closely related to Claroideoglomus lamellosum, C. etunicatum, and C. claroideum, which accounts for 14 % and 19 % of all AM fungal reads, respectively. In cropland, an OTU closely related to Funneliformis mosseae accounted for as much as 17 % of all AM fungal reads. The dominance of F. mosseae in croplands of the Canadian prairie is supported by studies based on metagenomic methods (Ma et al. 2005; Sheng et al. 2012; Dai et al. 2012, 2013) and on spore counts (Talukdar and Germida 1993).

Crop management systems also have a strong influence on the composition of AM fungal communities in Chernozem soils. Organic systems have been shown to support more abundant and diverse AM fungal communities compared to conventional systems (Dai et al. 2014). Organic systems also promote greater proliferation of Claroideoglomus and of incertae sedis taxa of the Glomeraceae, currently referred to as Glomus iranicum and Glomus indicum. However, these Glomeraceae incertae sedis are seemingly parasitic as they were associated with reduced crop growth and N and P uptake efficiency.

Summary

Metagenomic studies on the distribution of AM fungi in Chernozems are extremely useful to understand how the living soil provides ecological services and supports the production of food and bioproducts. Brown Chernozems are relatively poor in symbiotic AM fungi and are less hospitable to Claroideoglomus than other Chernozems, whereas Black Chernozems are rich in AM fungal resources. The influence of soil type on the composition of AM fungal communities is relatively small compared to that of land use type. Funneliformis have a competitive edge and proliferate in conventional crop production systems, whereas Claroideoglomus and Glomeraceae incertae sedis are favored in organic production systems. These Glomeraceae incertae sedis, currently known as the G. iranicum/G. indicum group, are associated with reduced crop productivity and nutrient uptake.

References

Bentivenga SP, Hetrick BAD. The effect of prairie management practices on mycorrhizal symbiosis. Mycologia. 1992;84:522–7.
Dai M, Bainard LD, Hamel C, Gan Y, Lynch D. Impact of land use on arbuscular mycorrhizal fungal communities in rural Canada. Appl Environ Microbiol. 2013;79:6719–29. doi:10.1128/aem.01333-13.
Dai M, Hamel C, Bainard LD, St. Arnaud M, Grant CA, Lupwayi NZ, Malhi SS, Lemke R. Negative and positive contributions of arbuscular mycorrhizal fungal taxa to wheat production and nutrient uptake efficiency in organic and conventional system in the Canadian prairie. Soil Biol Biochem. 2014;74:156–166.
Dai M, Hamel C, St. Arnaud M, He Y, Grant C, Lupwayi N, Janzen H, Malhi SS, Yang X, Zhou Z. Arbuscular mycorrhizal fungi assemblages in chernozem great groups revealed by massively parallel pyrosequencing. Can J Microbiol. 2012;58:81–92.
Dhillion SS, Anderson RC. Seasonal dynamics of dominant species of arbuscular mycorrhizae in burned and unburned sand prairies. Can J Bot. 1993;71:1625–30.
Durán A, Morrás H, Studdert G, Xiaobing L. Distribution, properties, land use and management of Mollisols in South America. Chin Geogr Sci. 2011;21:511–30.
Eom AH, Wilson GWT, Hartnett DC. Effects of ungulate grazers on arbuscular mycorrhizal symbiosis and fungal community structure in tallgrass prairie. Mycologia. 2001;93:233–42.
Garg N, Chandel S. Arbuscular mycorrhizal networks: process and functions. A review. Agron Sustain Dev. 2010;30:581–99.
Hart MM, Klironomos JN. Diversity of arbuscular mycorrhizal fungi and ecosystem functioning. In: van der Heijden MGA, editor. Mycorrhizal ecology, Ecological studies, vol. 157. Berlin: Springer; 2003. p. 225–42.
Ji B, Bentivenga SP, Casper BB. Comparisons of AM fungal spore communities with the same hosts but different soil chemistries over local and geographic scales. Oecologia. 2012;168:187–97.
Jumpponen A. Analysis of ribosomal RNA indicates seasonal fungal community dynamics in Andropogon gerardii roots. Mycorrhiza. 2011;21:453–64.
Liu X, Lee Burras C, Kravchenko YS, Duran A, Huffman T, Morras H, Studdert G, Zhang X, Cruse RM, Yuan X. Overview of Mollisols in the world: distribution, land use and management. Can J Soil Sci. 2012;92:383–402.
Ma WK, Siciliano SD, Germida JJ. A PCR-DGGE method for detecting arbuscular mycorrhizal fungi in cultivated soils. Soil Biol Biochem. 2005;37:1589–97.
Porras-Alfaro A, Herrera J, Natvig DO, Lipinski K, Sinsabaugh RL. Diversity and distribution of soil fungal communities in a semiarid grassland. Mycologia. 2011;103:10–21.
Schuessler A. Glomeromycota. Taxonomy. 2013. Accessed 6 Nov 2013. http://schussler.userweb.mwn.de/amphylo/
Sheng M, Hamel C, Fernandez MR. Cropping practices modulate the impact of glyphosate on arbuscular mycorrhizal fungi and rhizosphere bacteria in agroecosystems of the semiarid prairie. Can J Microbiol. 2012;58:990–1001.
Talukdar NC, Germida JJ. Occurrence and isolation of vesicular-arbuscular mycorrhizae in cropped field soils of Saskatchewan, Canada. Can J Microbiol. 1993;39:567–75.
Tian H, Gai JP, Zhang JL, Christie P, Li L. Arbuscular mycorrhizal fungi in degraded typical steppe of Inner Mongolia. Land Degrad Dev. 2009;20:41–54.
Yang C, Hamel C, Schellenberg MP, Perez JC, Berbara RL. Diversity and functionality of arbuscular mycorrhizal fungi in three plant communities in semiarid Grasslands National Park, Canada. Microb Ecol. 2010;59:724–33.
Young IM, Crawford JW. Interactions and self-organization in the soil-microbe complex. Science. 2004;304:1634–7.
B
Bacterial Diversity in Tree Canopies of the Atlantic Forest

Marcio R. Lambais¹ and David E. Crowley²
¹Luiz de Queiroz College of Agriculture (ESALQ), University of São Paulo (USP), Piracicaba, SP, Brazil
²Environmental Sciences, University of California, Riverside, Riverside, CA, USA

Synonyms

Bacterial communities in the phyllosphere of the Atlantic forest

Definition

16S rRNA gene profiling is one of the main approaches used for the study of microbial communities that are associated with plants and animals, which are mostly comprised of species unable to grow under laboratory conditions. Even though plants harbor an enormous microbial diversity on their various surfaces, the functions of these microorganisms, except for a few that are pathogens or symbionts, are largely unknown, but are speculated to modify plant chemical signals, alter root exudation patterns, and provide protection against pathogens. Understanding of the factors that shape the structure of microbial communities, and the functions of microorganisms that are associated with plants, will likely be essential for establishing conservation strategies for protecting endangered plant species. The large reservoir of microbial diversity on plant surfaces also represents a largely untapped bank of microbial products that may be of interest for pharmaceutical, agricultural, and environmental applications.

Introduction

Plant surfaces in natural and agricultural ecosystems are colonized by a variety of epiphytic microorganisms that have been examined in relation to their diversity, ecology, and genetics using culture-dependent and culture-independent approaches. Among the various surfaces that are presented by plants, the leaf surface, also known as the phyllosphere (Ruinen 1956), is one of the most common habitats for terrestrial microorganisms. The phyllosphere may be colonized by bacterial cells at an average density of 10⁶–10⁷ cells cm⁻² on plants from temperate regions (Lindow and Brandl 2003), and the density may be even higher on tropical plants, where dense canopies and a moist shaded environment are conducive to bacterial growth. Considering that the estimated total leaf area of terrestrial plants is approximately 6.4 × 10⁸ km² (Morris and Kinkel 2002), the number of bacterial cells on leaf surfaces globally has been estimated to be as high as 10²⁶ cells. Despite the importance of
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, # Springer Science+Business Media New York 2015
B 50 Bacterial Diversity in Tree Canopies of the Atlantic Forest
Bacterial Diversity in Tree Canopies of the Atlantic microbial species, based on morphology of cells.
Forest, Fig. 1 Microbial biofilm on the leaf surface of (b) Diatom cells embedded in the microbial biofilm on
trees of the Atlantic forest. (a) Biofilm with multiple the leaf surface
plant-microbe interactions in plant disease, metabolic and signaling networks, leading to the
almost nothing is known about the indigenous, self-organization of highly complex communities
nonpathogenic bacteria that colonize plant leaf that have been selected by long-term coevolution
surfaces and their functions in terrestrial with their plant host. In general, the bacterial
ecosystems. populations in the phyllosphere occur as multi-
species biofilms (Fig. 1) mostly located at the
base of trichomes and nutrient-rich locations
The Phyllosphere Habitat along the veins and junctions of epidermal cells
(Morris et al. 1998; Monier and Lindow 2004).
Due to the harsh conditions and the highly com- Communication between microbial cells, and
petitive environment on plant leaves, microor- between microbial and plant cells, may be an
ganisms that live in the phyllosphere almost important factor controlling the dynamics of
certainly have evolved specific traits that enable leaf colonization and biofilm growth and
them to grow in such environments. Diurnal var- development.
iations in UV light incidence, temperature, water One of the major selection factors for micro-
availability, osmotic conditions, the concentra- bial colonization of leaf surfaces is the ability to
tion of reactive oxygen species, as well as the tolerate or grow on the myriad chemical sub-
low availability of nutrients make the stances that are released from plant leaf tissues
phyllosphere an extreme environment for micro- and/or produced by other microorganisms. This
bial growth (Vorholt 2012). All of these factors, includes many thousands of secondary metabo-
together with the specific morphological traits of lites, such as monoterpenes that serve as signal
the leaves, may contribute to the selection of factors and defense compounds, as well as chem-
specific microbial populations of bacteria, fungi, ical attractants and deterrents for insects, herbi-
archaea, and protozoa that will colonize the vores, and pathogens. However, the specific
phyllosphere and interact at different levels with secondary metabolites driving the structure of
the plant host. In addition, the microbial bacterial communities in the phyllosphere are
populations will interact with each other through unknown.
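The global cell-count estimate quoted in the Introduction can be checked with a short back-of-envelope calculation. The sketch below, in Python, simply multiplies the total leaf area by the temperate-region cell densities given in the text; the variable and function names are illustrative and are not taken from the original entry.

```python
# Back-of-envelope check of the global leaf-surface bacterial cell count.
KM2_TO_CM2 = 1e10  # 1 km = 1e5 cm, so 1 km^2 = 1e10 cm^2

leaf_area_km2 = 6.4e8                  # estimated total terrestrial leaf area
density_low, density_high = 1e6, 1e7   # cells per cm^2 on temperate plants


def global_cells(area_km2, cells_per_cm2):
    """Total cells = leaf area (converted to cm^2) x cell density."""
    return area_km2 * KM2_TO_CM2 * cells_per_cm2


low = global_cells(leaf_area_km2, density_low)
high = global_cells(leaf_area_km2, density_high)
print(f"{low:.1e} to {high:.1e} cells")
```

With the temperate densities the product ranges from about 6.4 × 10²⁴ to 6.4 × 10²⁵ cells, which is within an order of magnitude of the 10²⁶ figure; the upper value would be approached if densities on tropical plants are higher, as the text suggests.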
Bacterial Communities in the Phyllosphere

Many early surveys of phyllosphere communities have relied on descriptions of bacteria that can be cultivated on agar media and isolated as individual colonies. Using various types of growth media, 85 species of culturable microorganisms from 37 genera have been reported in the phyllospheres of rye, olive, sugar beet, and wheat (Ercolani 1991; Legard et al. 1994; Thompson et al. 1993). While this is an impressive number of species, studies using molecular methods have revealed that the actual microbial species richness in the phyllosphere of agricultural plants is much greater than this and suggest that different plant species harbor unique communities that are similar for individuals of the same plant species (Yang et al. 2001). The discovery of high levels of bacterial species richness associated with different agronomic plants has prompted many questions about the true extent of microbial diversity that may be associated with the phyllosphere of different plants in natural ecosystems around the world. It has been speculated that since bacteria can be transported across the globe in dust (Griffin et al. 2002), only a small number of bacterial species may be adapted to grow on leaf surfaces. On the other hand, if each plant species selects for its own microbial community, the microbial species diversity that is associated with all of the different plant species on earth could be enormous. This question can only be answered by systematic surveys of phyllosphere microbial diversity in different ecosystems. Considering the current rate of extinction of plant species, it is especially urgent to begin surveys of phyllosphere microorganisms that are associated with endangered biomes.

Bacterial Community in the Phyllosphere of the Atlantic Forest

Many tropical forests and biodiversity hotspots contain endemic plant species that are preserved only in a few remnant areas. The Atlantic forest of Brazil is an example of a forest with high levels of biodiversity that is struggling to survive. The Atlantic forest used to be the second largest tropical forest in South America and covered 1.3 million km² in the 1500s, when the Portuguese first arrived in Brazil. Today, approximately 7 % of the original Atlantic forest remains, since most of it has been converted to agricultural or urban areas, leaving a patchwork of fragmented remnants. The remnants of the Atlantic forest are considered to be some of the oldest undisturbed forests on the planet, containing approximately 20,000 plant species, of which nearly half are endemic (Tabarelli et al. 2003). Several research projects have been developed in the Atlantic forest as part of the ongoing BIOTA-FAPESP (São Paulo Research Foundation) program, which has been successfully established to examine the biodiversity of São Paulo State (Brazil).

Different approaches can be used to survey the microbial diversity in the phyllosphere. The first approach uses DNA fingerprinting methods. A low-resolution DNA fingerprinting method referred to as PCR-DGGE (polymerase chain reaction-denaturing gradient gel electrophoresis), through which amplified fragments of highly variable regions of the bacterial 16S rRNA gene are separated by electrophoresis in a denaturing gradient polyacrylamide gel, has been used for studying the bacterial community structures in the phyllosphere of tree species of the Atlantic forest. This methodology generates a distinctive fingerprint that can be used to compare the relative similarities of communities, but does not provide information on the identities of the bacterial species within the communities. To compare the phylogenetic diversity in the phyllosphere and generate diversity indices for different phyllosphere communities, sequencing of specific regions of the bacterial 16S rRNA gene is normally used.

With these combined approaches, it has been shown that the 16S rRNA gene band patterns for the bacterial communities from different tree species of the Atlantic forest are distinct from each other (Lambais et al. 2006). Communities from replicates for different individuals of the same tree species showed some expected variation, but overall were highly similar to each other.
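The two kinds of comparison described above, similarity between fingerprints and diversity indices from sequence data, can be illustrated with a minimal Python sketch. The entry does not state which similarity coefficient or diversity index was used, so the Jaccard coefficient and the Shannon index are assumptions chosen for illustration, and the band positions and OTU counts are hypothetical.

```python
import math


def jaccard_similarity(bands_a, bands_b):
    """Shared bands divided by all bands seen in either fingerprint."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # two empty profiles are treated as identical
    return len(a & b) / len(a | b)


def shannon_index(otu_counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(otu_counts)
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


# Hypothetical DGGE band positions (arbitrary gel units) for three leaf samples.
tree1_leaf = {12, 18, 25, 31, 40}
tree2_leaf = {12, 18, 25, 33, 40}   # same tree species, another individual
other_species = {5, 18, 27, 44}     # a different tree species

same_species = jaccard_similarity(tree1_leaf, tree2_leaf)        # 4 shared of 6
between_species = jaccard_similarity(tree1_leaf, other_species)  # 1 shared of 8
print(round(same_species, 3), round(between_species, 3))

# Hypothetical OTU counts from 16S rRNA gene sequencing of one community.
print(round(shannon_index([50, 30, 15, 5]), 3))
```

A presence/absence coefficient such as Jaccard matches the limits of DGGE noted in the text: it compares band patterns without identifying the underlying species, whereas an index such as Shannon requires per-OTU counts from sequencing.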
The similarities between the leaf bacterial communities within and between species were further measured statistically and showed that the trees could be segregated into groups according to tree species, family, and order, suggesting a coevolution between trees and microbial populations associated with the phyllosphere (Lambais et al., unpublished data). Evidence of coevolution of microbial populations associated with the bark (dermosphere) and rhizosphere of trees of the Atlantic forest also has been observed, suggesting that plants coevolved with specific microbiomes (Lambais et al., unpublished data). An estimate of the bacterial species richness associated with the phyllosphere of trees in the Atlantic forest suggests the existence of 2–13 million undescribed bacterial species that colonize the collective phyllosphere of the Atlantic forest (Lambais et al. 2006). Interestingly, studies of the phyllosphere of different individuals of the same tree species in the Atlantic forest over a range of distances and at different times show that the similarities between bacterial community structures in the phyllosphere of the same plant species decrease with increasing distance between individual trees, even though they still share high levels of similarity (Lambais et al., unpublished data). Over larger scales, such as when individuals of the same plant species are separated by hundreds of kilometers, significant differences in community structure are observed. These data suggest that the bacterial diversity in the phyllosphere of plants of the Atlantic forest may be even higher than the predicted 2–13 million species, an estimate that does not take beta diversity into account. While still in an early phase, research aimed at measurements of beta diversity includes a survey of Tamarix trees in Mediterranean and Dead Sea regions in Israel and two locations in the USA (Finkel et al. 2011). These studies suggest that besides the plant genetic component driving the bacterial community structure in the phyllosphere, environmental conditions associated with particular geographical locations are also important. On the other hand, the high levels of similarity of the bacterial communities in the phyllosphere of Pinus ponderosa over transcontinental distances (Redford et al. 2010) suggest a strong genetic component in the regulation of the phyllosphere-associated microbiome.

The majority of bacterial OTUs in the phyllosphere of the trees of the Atlantic forest have been assigned to the phylum Proteobacteria. Based on a survey of several tree species in the Atlantic forest, including Ocotea dispersa, Ocotea teleiandra, Mollinedia schottiana, Mollinedia uleana, Eugenia cuprea, Eugenia melanogyna, and Tabebuia serratifolia, it has been shown that, in general, approximately half of the bacteria in the phyllosphere are phylogenetically related to Gammaproteobacteria, whereas 20 % are related to Alphaproteobacteria and 5 % to Flavobacteria, even though interspecific variation may occur (Lambais et al., unpublished data). For instance, in the phyllosphere of Ocotea teleiandra, a high frequency of Alphaproteobacteria and a low frequency of Gammaproteobacteria have been detected, in contrast to other tree species.

Altogether, these results show that every tree species that has been examined in the Atlantic forest contains its own unique bacterial community and that spatially separated individuals of the same tree species have similar bacterial communities within the same environment (forest physiognomy). The variations in bacterial community structures in the phyllosphere that were observed using the PCR-DGGE and sequencing approaches to compare similarities among individuals indicate that the community compositions may vary on different leaves. This may correspond with different leaf ages, location in the canopy, light incidence, and microclimate conditions that influence the leaf environment and the types of chemical substances that are secreted by the plant leaves. The bacteria may also interact with various fungi and algae that colonize the leaf surfaces and change the chemical and physical environment of the leaf habitat. In future studies, it will be necessary to examine the microbial communities on leaf surfaces at the microsite scale to determine changes in species composition and the ecology of different habitats on the leaf surface, for example, on the adaxial
and abaxial leaf surfaces or within biofilms and microcolonies at distinct physical locations on the leaf surface.

Drivers of Community Structure in the Phyllosphere

The development of different bacterial communities in the phyllosphere of different tree species demonstrates the strong effect of the leaf surface environment as a selection factor. The initial inoculation of leaves of different trees very likely begins with the growth of opportunistic microorganisms that are transported in dust, by insects, or that are splashed from adjacent trees by rain. Inheriting a minimal microbiome through the seeds may also be a possibility. Further selection then occurs depending on differences in the types of carbon substrates that are available for growth, as well as various physical and environmental factors and interactions within the microbial community. The primary carbon substrates that are used for microbial growth include carbohydrates, amino acids, and organic acids. The composition and amounts of these substances may vary for different plant species, but may also vary over time depending on, for instance, leaf age, insect damage, and rainfall. Another potentially important selective factor is the production of different types and quantities of monoterpenes and other volatile substances that are released from the leaf surfaces. These substances may be both toxic to some microorganisms and used as growth substrates by others. Phytochemistry research has shown that tree species have species-specific differences in their biochemical signatures for volatile molecules (Arey et al. 1995). If terpenes act as selective substances, certain types of bacteria may be predicted to occur in relation to the biochemical signatures of volatile organic compounds released by the leaves. Very little work has been conducted on this research topic, but bacteria are known to contain enzymes that convert terpenes to derivative substances. In this manner, the phyllosphere bacteria may influence chemical signaling to insects and other microorganisms or between plants. Terpenes and other plant secondary metabolites produced in plant leaves are also important feedstocks for various biochemicals that are used in industry and pharmacology. Future studies should investigate the genomes and genes encoding enzymes in the phyllosphere that may have broad application for industrial biotechnology, as in the work described by Delmotte et al. (2009), which used proteogenomics to study the microbial community associated with the phyllosphere of soybean, clover, and Arabidopsis.

Conclusion

Recent studies have provided only a glimpse into the microbial diversity that is associated with the tree canopies in the Atlantic forest, and there are many new questions that arise from this research. For example, to what degree do soil, nutritional, and other environmental factors affect the composition and structure of microbial communities in the phyllosphere? What is the diversity of fungi and Archaea on the plant leaf surfaces, and how do these microorganisms interact? Future research should also examine the functional aspects of phyllosphere microbial communities and the interactions that occur between phyllosphere bacteria and their host plants using metagenomics, metaproteomics, and metabolomics. As we begin to survey these bacterial communities through systematic study of different plant species, there will be exciting opportunities for studies of the metabolic capabilities and ecological functions of phyllosphere microorganisms in terrestrial ecosystems.

Summary

Each plant species is able to select its own bacterial community, and probably its own microbiome, which may be affected by plant genomic components and the environment. Altogether, the phyllosphere of plant species of the Atlantic forest may harbor several million species of bacteria that remain to be described. The roles
of the microbial communities of the phyllosphere in forest ecology are not yet known, but are likely to include chemical signaling, nitrogen fixation, and plant protection, among other functions. This immense microbial diversity may also provide new biomolecules of interest for pharmaceutical, agricultural, and environmental applications.

Cross-References

▶ New Computational Methodologies to Understand Microbial Diversity

References

Arey J, Crowley DE, Crowley M, Resketo M, Lester J. Hydrocarbon emissions from natural vegetation in California South-coast-air-basin. Atmos Environ. 1995;29:2977–88.
Delmotte N, Knief C, Chaffron S, et al. Community proteogenomics reveals insights into the physiology of phyllosphere bacteria. Proc Natl Acad Sci USA. 2009;106:16428–33.
Ercolani GL. Distribution of epiphytic bacteria on olive leaves and the influence of leaf age and sampling time. Microb Ecol. 1991;21:35–48.
Finkel OM, Burch AY, Lindow SE, Post AF, Belkin S. Geographic allocation determines the population structure in phyllosphere microbial communities of a salt-excreting desert tree. Appl Environ Microbiol. 2011;77:7647–55.
Griffin DW, Kellogg CA, Garrison VH, Shinn EA. The global transport of dust – an intercontinental river of dust, microorganisms and toxic chemicals flows through the Earth’s atmosphere. Amer Sci. 2002;90:228–35.
Lambais MR, Crowley DE, Cury JC, Büll RC, Rodrigues RR. Bacterial diversity in tree canopies of the Atlantic forest. Science. 2006;312:1917.
Legard DE, McQuilken MP, Whipps JM, Fenlon JS, Fermor TR, Thompson IP, Bailey MJ, Lynch JM. Studies of seasonal changes in the microbial populations on the phyllosphere of spring wheat as a prelude to the release of a genetically modified microorganism. Agric Ecosyst Environ. 1994;50:87–101.
Lindow SE, Brandl MT. Microbiology of the phyllosphere. Appl Environ Microbiol. 2003;69:1875–83.
Monier JM, Lindow SE. Frequency, size, and localization of bacterial aggregates on bean leaf surfaces. Appl Environ Microbiol. 2004;70:346–55.
Morris CE, Kinkel LL. Fifty years of phyllosphere microbiology: significant contributions to research in related fields. In: Lindow SE, Hecht-Poinar EI, Elliott V, editors. Phyllosphere microbiology. St Paul: APS Press; 2002. p. 365–75.
Morris CE, Monier JM, Jacques MA. A technique to quantify the population size and composition of the biofilm component in communities of bacteria in the phyllosphere. Appl Environ Microbiol. 1998;64:4789–95.
Redford AJ, Bowers RM, Knight R, Linhart Y, Fierer N. The ecology of the phyllosphere: geographic and phylogenetic variability in the distribution of bacteria on tree leaves. Environ Microbiol. 2010;12:2885–93.
Ruinen J. Occurrence of Beijerinckia species in the phyllosphere. Nature. 1956;177:220–1.
Tabarelli M, Pinto LP, Silva JMC, Costa CMR. Endangered species and conservation planning. In: Galindo-Leal C, Câmara IG, editors. The Atlantic forest of South America: biodiversity, status, threats and outlooks. Washington, DC: Island Press; 2003. p. 86–94.
Thompson IP, Bailey MJ, Fenlon JS, Fermor TR, Lilley AK, Lynch JM, McCormack PJ, McQuilken MP, Purdy KJ. Quantitative and qualitative seasonal changes in the microbial community from the phyllosphere of sugar beet (Beta vulgaris). Plant Soil. 1993;150:177–91.
Vorholt JA. Microbial life in the phyllosphere. Nat Rev Microbiol. 2012;10:828–40.
Yang CH, Crowley DE, Borneman J, Keen NT. Microbial phyllosphere populations are more complex than previously realized. Proc Natl Acad Sci USA. 2001;98:3889–94.

Bacteriocin Mining in Metagenomes

Orla O’Sullivan¹,², Colin Hill³, Paul Ross¹,² and Paul Cotter¹,²
¹Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
²Alimentary Pharmabiotic Centre, University College, Cork, Ireland
³Alimentary Pharmabiotic Centre, Department of Microbiology, University College, Cork, Ireland

Definition

Bacteriocins are heat-stable, ribosomally synthesized peptides produced by one bacterium which are active against other bacteria and against which the producer has a specific immunity mechanism.
Introduction

Bacteriocins are ribosomally synthesized antimicrobial peptides that are produced by many bacteria and which kill or inhibit the growth of other bacteria. Bacteriocin producers are protected as a consequence of dedicated immunity (self-protective) systems (Cotter et al. 2005). Bacteriocins are of both academic and commercial interest, with several in use as food preservatives or as the active agent in clinical or veterinary antimicrobials. It is not surprising that there is significant interest in the identification and characterization of new bacteriocin gene clusters. The growing volume of metagenomic sequence data is an important resource which can be mined for the in silico discovery of novel bacteriocins.

A Background to Bacteriocins

Bacteriocins were first described in 1925 and since then bacteriocin producers have been identified in a myriad of different environments, bearing out a prediction by Klaenhammer that bacteriocin production may be almost ubiquitous (Klaenhammer 1988). The spectrum of activity of these peptides can be narrow (lethal to bacteria in the same or closely related species) or broad (lethal to bacteria in other genera). Many bacteriocins function by depolarizing the cell membrane or through the inhibition of cell wall synthesis (Cotter et al. 2005). There are a number of different classification schemes. One approach, originally employed to classify bacteriocins produced by Gram-positive bacteria, has been to divide bacteriocins into two major classes: those which are modified (Class I) and those which are unmodified (Class II) (Cotter et al. 2005; Rea et al. 2011) (Table 1). This approach to classification excludes larger proteins, such as the bacteriolysins and the colicin-type antimicrobials, which as a consequence of their larger size may be regarded as representing different classes of antimicrobials.

Further classification of the Class I and II peptides is possible; for example, Class I bacteriocins from Gram-positive bacteria can be divided into Class Ia, Class Ib, and Class Ic. Class Ia, the lantibiotics, harbor the unusual posttranslationally modified residues lanthionine (Lan) and/or β-methyllanthionine (meLan); these are products of the interaction of cysteines with enzymatically dehydrated serines (dehydroalanine; Dha) and threonines (dehydrobutyrine; Dhb). Lantibiotics can be subdivided according to the enzyme responsible for lanthionine formation: subclass I peptides are modified by LanBC, subclass II by LanM, subclass III by RamC-like enzymes, and subclass IV by LanL enzymes. It should be noted, however, that subclass III and IV peptides identified to date have not been shown to possess antimicrobial activity and thus are referred to as lantipeptides. Class Ib, the labyrinthopeptins, have a labyrinthine structure and contain the posttranslationally modified amino acid labionin, formed through a series of serine phosphorylations, dehydrations of phosphoserines to didehydroalanines, and cyclizations. Class Ic, the sactibiotics, are cyclic peptides, generated from the posttranslational formation of intramolecular cross-linkages between the α-carbon and sulfur of amino
Bacteriocin Mining in Metagenomes, Table 1 Classification scheme for bacteriocins (Modified from Rea et al. 2011)

Class     Divisions                       Further subclasses                    Examples
Class I   Ia: Lantibiotics                Subclasses I–IV                       Lacticin 3147, nisin A, subtilin
          Ib: Labyrinthopeptins           -                                     Labyrinthopeptins A1 and A2
          Ic: Sactibiotics                Single- and two-peptide bacteriocins  Thuricin CD, subtilosin A
Class II  IIa: Pediocin-like              Subclasses I–IV                       Pediocin PA-1, mundticin
          IIb: Two-peptide bacteriocins   Subclasses A and B                    Salivaricin P, lactococcin G
          IIc: Circular bacteriocins      Subclasses 1 and 2                    Acidocin B, gassericin A
          IId: Linear non-pediocin-like   -                                     Lactococcin A
               single-peptide bacteriocins
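The enzyme-based subdivision of the lantibiotics described in the text amounts to a simple lookup. The Python sketch below encodes it as given there; the dictionary and function names are illustrative, and the enzyme labels are used as plain strings rather than as identifiers from any database or tool.

```python
# Lantibiotic (Class Ia) subclass by the enzyme responsible for
# lanthionine formation, as described in the text.
SUBCLASS_BY_ENZYME = {
    "LanBC": "I",
    "LanM": "II",
    "RamC-like": "III",
    "LanL": "IV",
}

# Subclass III and IV peptides identified to date have not been shown to be
# antimicrobial, so the text refers to them as lantipeptides.
NO_KNOWN_ACTIVITY = {"III", "IV"}


def classify_class_ia(enzyme):
    """Return (subclass, label) for a modification enzyme, or None if unknown."""
    subclass = SUBCLASS_BY_ENZYME.get(enzyme)
    if subclass is None:
        return None
    label = "lantipeptide" if subclass in NO_KNOWN_ACTIVITY else "lantibiotic"
    return subclass, label


for enzyme in ("LanBC", "LanM", "RamC-like", "LanL"):
    print(enzyme, classify_class_ia(enzyme))
```

In a mining context, a lookup of this kind could be applied to the annotated function of a modification-enzyme homolog to suggest which subclass a candidate cluster belongs to, although homology alone does not establish function.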
Bacteriocin Mining in Metagenomes, Fig. 1 Structure of nisin A, the prototypical modified bacteriocin from Gram-positive bacteria (modified residues in gray)
Bacteriocin Mining in Metagenomes, Fig. 2 Representative agar plate depicting the outcome of a culture-based screen for bacteriocin activity

microbiology to screen large collections of strains, using a culture-based assessment of their ability to produce novel antimicrobials (Fig. 2). This is then followed by the subsequent identification of the responsible genes through subcloning, mutagenesis, reverse genetics, or, more recently, sequencing of the corresponding genome. However, in spite of constant improvements in culturing techniques, it is still estimated that just 10–50 % of bacteria are culturable. Fortunately, metagenomic DNA sequencing provides an alternative with respect to identifying novel bacteriocin gene clusters by facilitating an unbiased characterization of entire microbial communities. In particular, recent improvements in sequencing technologies have resulted in a massive increase in sequence data, leading to the development of valuable public databases and annotation pipelines (http://camera.calit2.net/, http://img.jgi.doe.gov/, http://metagenomics.anl.gov/). The generation of vast quantities of DNA sequence data from metagenomics-based projects from varying environments across the globe represents a considerable resource from which new bacteriocin gene clusters can be identified. There are a number of ways in which this information can be harnessed. One example is BACTIBASE, a bacteriocin database and suite of analysis tools established to archive known bacteriocin sequences and enhance the discovery of bacteriocins in genomic data (Hammami et al. 2010). The current release of BACTIBASE contains 177 bacteriocin sequences against which one can test the homology of a query bacteriocin sequence, perform sequence alignments, and predict peptide structure (Hammami et al. 2010). Searches are limited to the known sequences already in the database, and the usefulness of the tool is also affected by the fact that bacteriocin peptides themselves often share little or no homology. A specific bacteriocin mining tool, BAGEL2 (BActeriocin GEnome Location), was established to search for novel bacteriocin sequences in genomic data (de Jong et al. 2010). BAGEL2 has a built-in database of bacteriocin and bacteriocin-related sequences and, in addition to genes encoding the structural bacteriocin peptide, uses genes involved in bacteriocin biosynthesis, regulation, export, and immunity to reveal related genes in novel clusters. Additionally, searches can be implemented against finished genome sequences or against novel genomes uploaded by the user. The fact that genes involved in the modification of Class I bacteriocins, such as those generically named lanM, lanB, and lanC or those encoding radical SAM proteins associated with sactibiotic production, are frequently more highly conserved than the structural genes themselves has also been utilized in recent years to identify Class I gene clusters in genomic and metagenomic databases. During this period, targeted searches for bacteriocins in genomic data have resulted in the discovery of several novel active bacteriocins, such as lichenicidin (Begley et al. 2009) and a Streptococcus-associated lantibiotic (Majchrzykiewicz et al. 2010), among others. This strategy parallels similar genome-based approaches which have identified gene clusters encoding other ribosomally synthesized natural products (Velásquez and van der Donk 2011). In addition to the identification of novel bacteriocins, the screening of genomes using the LtnM1 protein of lacticin 3147 (Begley et al. 2009; O’Sullivan et al. 2011) or the radical SAM proteins of thuricin CD (Murphy et al. 2011) as drivers has also revealed several
Bacteriocin Mining in Metagenomes, Table 2 LanM homologs in metagenomic databases (from O’Sullivan et al. 2011)

Protein function              Metagenome              Location                      % identity  E-value
Lantibiotic-modifying enzyme  Sea water               Indian Ocean                  29          1.07E-16
Hypothetical protein          Soil sample             Waseca County, USA            25          2.85E-16
Lantibiotic-modifying enzyme  Whale fall rib carcass  Santa Cruz Basin, USA         27          4.65E-12
Lantibiotic-modifying enzyme  Hypersaline lagoons     Galapagos Islands             30          1.83E-08
Hypothetical protein          Coastal sea water       Gulf of Mexico                25          2.39E-08
Hypothetical protein          Hypersaline lagoons     Galapagos Islands             24          1.55E-07
Hypothetical protein          Open ocean              Indian Ocean                  36          4.51E-07
Hypothetical protein          Coral reef              Cook’s Bay, French Polynesia  24          5.89E-07
Hypothetical protein          Open ocean              Indian Ocean                  29          1.71E-06
Mersacidin-modifying enzyme   Open ocean              Galapagos Islands             25          3.81E-06
Hypothetical protein          Hypersaline lagoons     Galapagos Islands             24          4.94E-06
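A hit list such as Table 2 is typically ranked and filtered before any follow-up. The Python sketch below transcribes the rows of Table 2 and filters them by E-value; the `Hit` record, the `filter_hits` helper, and the 1e-10 cutoff are illustrative assumptions, not thresholds or code from the cited study.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    function: str
    metagenome: str
    location: str
    identity: int   # % identity to the LanM query
    evalue: float


# Rows transcribed from Table 2.
HITS = [
    Hit("Lantibiotic-modifying enzyme", "Sea water", "Indian Ocean", 29, 1.07e-16),
    Hit("Hypothetical protein", "Soil sample", "Waseca County, USA", 25, 2.85e-16),
    Hit("Lantibiotic-modifying enzyme", "Whale fall rib carcass", "Santa Cruz Basin, USA", 27, 4.65e-12),
    Hit("Lantibiotic-modifying enzyme", "Hypersaline lagoons", "Galapagos Islands", 30, 1.83e-08),
    Hit("Hypothetical protein", "Coastal sea water", "Gulf of Mexico", 25, 2.39e-08),
    Hit("Hypothetical protein", "Hypersaline lagoons", "Galapagos Islands", 24, 1.55e-07),
    Hit("Hypothetical protein", "Open ocean", "Indian Ocean", 36, 4.51e-07),
    Hit("Hypothetical protein", "Coral reef", "Cook’s Bay, French Polynesia", 24, 5.89e-07),
    Hit("Hypothetical protein", "Open ocean", "Indian Ocean", 29, 1.71e-06),
    Hit("Mersacidin-modifying enzyme", "Open ocean", "Galapagos Islands", 25, 3.81e-06),
    Hit("Hypothetical protein", "Hypersaline lagoons", "Galapagos Islands", 24, 4.94e-06),
]


def filter_hits(hits, max_evalue=1e-10, annotated_only=False):
    """Keep hits at or below max_evalue, smallest E-value first."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    if annotated_only:
        kept = [h for h in kept if h.function != "Hypothetical protein"]
    return sorted(kept, key=lambda h: h.evalue)


strong = filter_hits(HITS)  # illustrative cutoff of 1e-10
for h in strong:
    print(f"{h.evalue:.2e}  {h.identity}%  {h.function}  {h.metagenome} ({h.location})")

# Share of homologs annotated only as hypothetical proteins.
n_hypothetical = sum(h.function == "Hypothetical protein" for h in HITS)
print(f"{n_hypothetical} of {len(HITS)} hits are hypothetical proteins")
```

The count of hypothetical proteins illustrates why keyword-based functional searches can miss such homologs: most of the hits in Table 2 carry no informative annotation and would only be found by homology.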
bacteriocin cluster. Existing tools for metagenome analysis are in two formats: functional keyword search engines, such as those available through the MG-RAST (Glass et al. 2010) and IMG/M (Markowitz et al. 2008) platforms, and homology search engines, such as JCoast (Richter et al. 2008), MetaMine (Bohnebeck et al. 2008), and CAMERA (Sun et al. 2011). Functional searches rely heavily on searching among annotated genes. This is inherently reliant on accurate annotation, and due to the small size and heterogeneous nature of bacteriocin peptides, the corresponding genes are often overlooked or mis-annotated. Homology search tools such as CAMERA and JCoast are single gene search-driven, although JCoast does have a graphical user interface that allows visualization of the surrounding gene neighborhood, which would prove particularly useful for screening for the presence of other genes in the bacteriocin operons (Richter et al. 2008). MetaMine allows homology searches with “gene neighborhoods”; again this would prove particularly useful for bacteriocin clusters. MetaMine searches are, however, restricted to marine metagenomic databases (Bohnebeck et al. 2008). It should also be noted that, as a consequence of the evolution of DNA sequencing technologies, longer stretches of contiguous metagenomic DNA will become available, which will further enhance our ability to identify complete bacteriocin gene clusters. Despite this, it must also be noted that the presence of bacteriocin homologs alone is not an indicator of function. Clearly, in silico analysis is not sufficient to determine the functional presence of a bacteriocin. However, the likelihood that even a proportion of bacteriocin homologs will be deemed functional is an intriguing prospect.

Harnessing Bacteriocin Gene Clusters

While the in silico analysis of newly identified bacteriocin gene clusters within metagenomic DNA can be of great value from a fundamental perspective, the harnessing of the antimicrobial potential of these clusters will undoubtedly become a priority in the future. In the majority of instances, the specific strain from which the fragment of metagenomic DNA has originated will not be available, or may not be culturable, and other strategies will be required. The genetics-based options available can be divided into in vivo and in vitro approaches. Regardless of the approach, specific genes within the cluster will need to be regenerated through DNA synthesis technology. In the case of in vivo harnessing, the DNA fragment(s) will be cloned and expressed heterologously, using approaches such as those employed to facilitate the production of a Streptococcus-associated lantibiotic cluster by Lactococcus lactis (Majchrzykiewicz et al. 2010) and by Escherichia coli. Alternatively, when dealing with modified bacteriocins, one can clone and express individual genes heterologously but then purify their products to facilitate the in vitro reconstitution of biosynthesis using the corresponding modification proteins or related enzymes originating from other sources (Knerr and van der Donk 2012). Finally, an alternative non-genetics-based approach, which is available when gene clusters predicted to encode unmodified residues are identified, is to employ peptide synthesis with a view to generating a synthetic equivalent of the natural antimicrobial. It is anticipated that these various options will be widely used in the years to come.

Summary

In order to effectively mine metagenomes for bacteriocins, accurate annotation of the datasets is essential. As the volume of data grows, it is anticipated that the precision of annotation tools will improve in tandem. The number of bacteriocin-associated gene homologs present in diverse metagenomic environments suggests the presence of multiple corresponding gene clusters. The further expansion of metagenomic DNA databases will undoubtedly increase our appreciation of just how widespread, and diverse, these clusters are. As the commercial application of bacteriocins becomes more common (for a review, see Cotter et al. 2005), we can anticipate that we will reap the benefits of in silico screening and harnessing of this untapped reservoir of novel bacteriocins.
B 60 Binning Sequences Using Very Sparse Labels Within a Metagenome
parameter of growth, which is called Growth Threshold (GT) and defined as
$$GT = -D \times \ln(SF),$$
where D is the data dimension and SF is the user-defined Spread Factor that takes values in (0, 1], with 0 representing minimum and 1 representing maximum growth.
There are four phases in GSOM training: initialization, growing, and two smoothing phases. In the initialization phase, weight vectors of initial nodes in the minimum single lattice grid are initialized by random values and the GT is calculated according to data dimension and user-defined SF. During the growing phase, every node keeps an accumulated error counter and the counter of the winning node ($E_{winner}$) is updated by
$$E_{winner}(t+1) = E_{winner}(t) + \left| x(k) - w_{winner}(t) \right|.$$
When $E_{winner}$ exceeds GT, the winning node that is at the boundary of the current map will grow new nodes into its neighboring vacant lattice positions and initialize their weight vectors by interpolating or extrapolating the weight vectors of existing neighboring nodes around the winning node. If the winning node is not a boundary node, the accumulated error ($E_{winner}$) is evenly distributed outwards to its neighbors. The two smoothing phases are for fine-tuning the weights of nodes. The hexagonal lattice was used for GSOM in this study as the hexagonal lattice yields better data topology preservation (Hsu et al. 2003).
From the NCBI Archaea/Bacteria genome database, we randomly selected 10, 20, and 40 species to generate metagenomes of different complexity. Three sets were drawn for the 10 and 20 species datasets, and only one set for the 40 species dataset due to the limitation imposed by the available computing resources. Simulated metagenomes were denoted by “XSp_SetY” where X is the number of species included in the metagenome and Y represents the serial number of the metagenome. For example, “10Sp_Set1” is the first metagenome containing 10 random species.
In each genome, the seeding sequences were firstly identified as the flanking region of 16S rRNA genes of the length ranging from 8 to 13 kilobase (kb). The seeding sequences that overlapped with other rRNA and tRNA genes were excluded to avoid possible interferences caused by highly similar sequence compositions. After removing the tRNA, rRNA, and seeding sequences, the remaining genomic regions were randomly chopped into simulated metagenomic fragments of the length from 8 to 13 kb. The length restriction of 8–13 kb is used to provide a standardized rule for either seeding or metagenomic sequences (Mavromatis et al. 2007), but with the outlook for single-molecule sequencing techniques on the horizon (Clarke et al. 2009), these are definitely achievable lengths for metagenomes in the near future.
The tetranucleotide frequency of metagenomic sequences is the training feature we used in our implementation for binning because it has a better resolution in species separation (Abe et al. 2003) and is highly similar between intragenomic fragments compared to intergenomic fragments. The tetranucleotide frequencies were computed using a four-base sliding window and normalized by the length of the corresponding sequence (frequency per base). A total of 256 ($4^4$) combinations of nucleotide usages, i.e., AAAA, AAAT, AAAG, AAAC ... CCCG, and CCCC, are represented in the feature vector of 256 dimensions.
Metagenomic sequences that belong to closely related species are likely to have homologous sequences between the clusters (bins), and this fact makes the identification of clustering boundaries much more difficult. Therefore, a modified strategy is needed to identify clusters so that GSOM can be improved as a more practical solution for binning.
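The growth threshold and the tetranucleotide feature vector described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it takes the growth threshold as GT = -D × ln(SF), assumes the sliding window advances one base at a time, and skips windows containing ambiguous bases. The function and constant names are hypothetical.

```python
import math
from itertools import product

# The 256 tetranucleotides, ordered A, T, G, C at each position so that the
# list runs AAAA, AAAT, AAAG, AAAC, ..., CCCG, CCCC as in the text.
ALPHABET = "ATGC"
TETRAMERS = ["".join(p) for p in product(ALPHABET, repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def growth_threshold(dimension, spread_factor):
    """GT = -D * ln(SF) for a spread factor in (0, 1]."""
    if not 0.0 < spread_factor <= 1.0:
        raise ValueError("spread_factor must lie in (0, 1]")
    return -dimension * math.log(spread_factor)


def tetranucleotide_frequencies(sequence):
    """Return a 256-dimensional vector of tetranucleotide counts per base.

    Assumptions: the four-base window slides one base at a time, and windows
    containing anything other than A, T, G, C (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = [0] * len(TETRAMERS)
    for start in range(len(seq) - 3):
        idx = INDEX.get(seq[start:start + 4])
        if idx is not None:
            counts[idx] += 1
    if not seq:
        return [0.0] * len(TETRAMERS)
    return [count / len(seq) for count in counts]  # frequency per base


# Feature vector for a toy fragment, and the threshold for D = 256, SF = 0.5.
features = tetranucleotide_frequencies("ATGCATGCATGC")
threshold = growth_threshold(len(features), 0.5)
```

A smaller SF gives a larger threshold, so a node must accumulate more error before the map grows, which matches the description of SF close to 0 as minimum growth.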
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 1 The S-GSOM algorithm. (a) Schematic diagram of the clustering process of S-GSOM. (b) The pseudo-code for node assigning process in S-GSOM
The seeded GSOM (S-GSOM), which allows identifying clusters automatically in the feature map using seeds (labeled data), is our proposed modification of GSOM. There are three core steps in S-GSOM. Firstly, the very small amounts of labeled seeds (labeled feature vectors) are combined with unlabeled data (unlabeled feature vectors). Secondly, the combined input vectors are fed into GSOM training, which treats the seeds as the unlabeled data. Finally, after the normal phases of GSOM training, S-GSOM identifies clusters based on the location of seeds in the final map and the specified amount of nodes in the cluster (Fig. 1a).
In the last step of S-GSOM training, the cluster identification phase, the nodes that have seeds are identified and labeled as clustered nodes. Then the S-GSOM is going to assign other un-clustered nodes, one by one, to clusters iteratively until the specified clustering percentage (more details in Clustering Percentage (CP) Determination section) is reached. In each iteration, a set of un-clustered nodes that are adjacent to the clustered nodes is identified. The node in the set of the shortest Euclidean distance to the adjacent clustered node will be assigned to the same cluster with the clustered node. However, nodes not containing any sample are most likely representing a cluster boundary. So a penalty factor greater than one is multiplied to the actual distance when calculating the distance between empty nodes and clustered nodes. This will lead the S-GSOM not to label empty nodes to any cluster (Fig. 1b). According to the empirical observation that the clustering results are not very sensitive to the penalty factor value between 2 and 5, the penalty factor value of 2 was used in all our experiments.
Before the initiation of the taxonomy-assigning process, the seeded nodes must be assigned to a specific taxon. When all seeds in one node are coming from the same taxon or there is only a single seed, it is trivial for S-GSOM to assign the seeded node to the same taxon as contained seeds. If the seeds in one node belong to multiple taxa, the seeded node will be assigned to the major taxon. However, when seeds are of multiple taxa and have equal amounts, e.g., two seeds are in one node and belonging to taxon A and B, respectively, all seeds are discarded.
To illustrate the role of S-GSOM in binning, Fig. 2 depicts the schematic diagram that explains how S-GSOM fits into the whole binning process.
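The seeded-node rule and the penalized, iterative node assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published pseudo-code of Fig. 1b: the map is given as plain dictionaries (`weights`, `neighbors`), CP is interpreted as the percentage of all map nodes that end up labeled, and all function names are hypothetical. Only the penalty factor of 2 is taken from the text.

```python
import math
from collections import Counter

PENALTY_FACTOR = 2.0  # value the text reports using


def assign_seeded_node(seed_taxa):
    """Return the majority taxon of the seeds in one node.

    Returns None when the most frequent taxa are tied, in which case the
    seeds of that node are discarded.
    """
    if not seed_taxa:
        return None
    ranked = Counter(seed_taxa).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def node_distance(candidate, clustered, candidate_is_empty, penalty=PENALTY_FACTOR):
    """Euclidean distance between weight vectors, penalized for empty nodes."""
    distance = math.dist(candidate, clustered)
    return distance * penalty if candidate_is_empty else distance


def grow_clusters(weights, neighbors, seeded_labels, empty_nodes, clustering_percentage):
    """Label the closest adjacent un-clustered node, one per iteration.

    Assumption: the loop stops once clustering_percentage percent of all map
    nodes carry a label, or when no adjacent un-clustered node remains.
    """
    labels = dict(seeded_labels)
    target = math.ceil(len(weights) * clustering_percentage / 100.0)
    while len(labels) < target:
        best = None  # (distance, candidate node, cluster label)
        for node, cluster in labels.items():
            for candidate in neighbors[node]:
                if candidate in labels:
                    continue
                d = node_distance(weights[candidate], weights[node],
                                  candidate in empty_nodes)
                if best is None or d < best[0]:
                    best = (d, candidate, cluster)
        if best is None:
            break
        labels[best[1]] = best[2]
    return labels
```

Because the penalty only inflates distances, empty nodes are not forbidden outright in this sketch; they are simply chosen last, so at moderate CP values they tend to remain unlabeled and act as cluster boundaries.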
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 2 An overview of binning process
using S-GSOM
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 3 Identification of an appropriate clustering percentage (CP). Five datasets for each of 5, 10, and 20 species are randomly sampled. The average of S-GSOM’s performance for the datasets is plotted against CP. A trend of decreasing performance with increasing CP can be noted. A compromise value of CP = 55 % is marked where both the number of assigned nodes and performance are high
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 1 Clustering performance of semi-supervised algorithms. Performance is measured by the adjusted Rand index (ARI) and weighted F-measure (WF)
Dataset    COP K        Constrained K  Seeded K     TSVM         S-GSOM-55
           ARI   WF     ARI   WF       ARI   WF     ARI   WF     ARI   WF
10Sp_Set1  0.84  0.94   0.84  0.94     0.84  0.93   0.25  0.59   0.85  0.95
10Sp_Set2  0.89  0.96   0.79  0.90     0.78  0.90   0.41  0.69   0.93  0.97
10Sp_Set3  0.58  0.83   0.85  0.93     0.84  0.93   0.27  0.62   0.83  0.93
20Sp_Set1  0.91  0.90   0.77  0.82     0.76  0.82   0.45  0.65   0.97  0.96
20Sp_Set2  0.76  0.82   0.70  0.79     0.67  0.79   0.43  0.62   0.83  0.89
20Sp_Set3  0.81  0.89   0.75  0.86     0.75  0.86   0.46  0.67   0.97  0.98
40Sp       0.58  0.76   0.71  0.85     0.68  0.84   0.24  0.56   0.83  0.91
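For readers who want to reproduce the ARI column on their own binning results, the adjusted Rand index can be computed directly from two label lists. The sketch below uses the standard pair-counting definition of the index; it is not taken from the original study, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0

    # Pairs of items placed together in both, in the true, and in the
    # predicted labeling, respectively.
    together_both = sum(comb(c, 2) for c in Counter(zip(true_labels, predicted_labels)).values())
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, and close to 0 (possibly negative) for assignments no better than chance.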
helps S-GSOM to achieve better clustering by not assigning those ambiguous sequences. The S-GSOM visualization of binning sequences of 10Sp_Set1, 20Sp_Set1, and 40Sp is provided in Fig. 4.
We considered the 20-species metagenomes as examples to analyze the resolution of binning with S-GSOM. At CP = 55 %, an average of 82 % sequences were assigned with 92 % accuracy to their source species. The distribution of binning result is shown in Fig. 4b. Nodes containing seeds from multiple species were colored in grey with the label of species number. A significantly higher abundance of grey nodes around “C6” and “C7,” respectively representing Haemophilus influenzae 86-028NP and Haemophilus somnus 129PT, indicates that metagenomic sequences with similar tetranucleotide frequencies, resulted from closely related species, tend to be clustered without
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 4 Resulted growing self-organizing maps (GSOM) of (a) 10Sp_Set1, (b) 20Sp_Set1, and (c) 40Sp metagenomes. Each hexagon represents a single node. If the node contains a single species, it is displayed in a color that uniquely identifies the species. The node without a letter indicates that there is no data (sequences) located in it. The grey nodes represent multiple species in the node, and the exact number is as labeled
a clear boundary. This further highlights the importance of obtaining seeds in non-boundary regions.
In addition to the distinguished clustering performance, S-GSOM possesses a prominent advantage brought by the seeding method to cluster sequences of unseeded species, i.e., the unknown species. To demonstrate this advantage, an iso-CP (constant CP) contour is delineated in Fig. 5a, generated with a five-species metagenome with only four seeds. By applying different CP values, a group of nodes were found
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 5 Illustration of exploring an unseeded cluster. (a) The five-species S-GSOM map. The seeded nodes are shown with unique colors and labels. Nodes in charcoal color represent nodes that will be assigned when CP = 27 % and dark grey nodes at CP = 55 %, light grey at CP = 77 %, and white at CP = 100 %. (b) Internode distance map with nodes assigned at CP = 55 %
only clustered at CP = 77 % (on the top-right region). This situation is most likely when a species is relatively abundant, but does not have a seed. Figure 5b shows the allocation of nodes to seeds at CP = 55 %. However, a protrusion of species “1” into the unassigned region, which belongs to species “5,” is an incorrect assignment that sometimes happens to nodes without a correct seed.
Comparison of Binning Fidelity Using S-GSOM
In this section, we compared the binning performance of S-GSOM with three other methods: BLAST, kmer, and PhyloPythia, reported on the metagenomes of different complexities (Mavromatis et al. 2007) after assembly by Arachne (Batzoglou et al. 2002), Phrap (Green, 1996), and JAZZ (Aparicio et al. 2002). However, JAZZ produced a small number of contigs compared to Arachne and Phrap (Mavromatis et al. 2007), so contigs assembled by JAZZ were excluded. In addition, because the simHC, a community without any dominant species, has sparse long contigs required by the composition-based analysis (Mavromatis et al. 2007; Teeling et al. 2004a), we also excluded the simHC dataset from our analysis.
For the purpose of fair comparison, all methods need to be compared at the same taxonomic level of binning. Binning at a very high level, e.g., kingdom, clearly has no significance; therefore, the results are compared at the order level here and results for comparing at other taxonomic levels are included in the supplementary materials of original publication (Chan et al. 2008a). At the order level, the results for simLC (low complexity) and simMC (medium complexity) metagenomes are shown in two separate tables, one for binning contigs larger than 8 kb and the other for contigs composed of at least 10 reads. To evaluate the performance, rather than using simple averages of all bins (Mavromatis et al. 2007), we used a weighted average that gives higher weights to larger bins to better reflect the amount of correctly binned contigs.
In both simLC and simMC, S-GSOM performed reasonably for binning contigs larger than 8 kb, where it is more accurate than all
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 2 Binning summary for low complexity metagenome for contigs larger than 8 kb
Assembler Method                     Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne   kmer (7 mer)               0     0               202            0              85               –      0.000
Arachne   kmer (8 mer)               0     0               202            0              149              –      0.000
Arachne   BLAST distr 1              0     0               202            0              0                –      0.000
Arachne   BLAST distr 2              0     0               202            0              0                –      0.000
Arachne   S-GSOM (CP = 55 %)         1     141             202            69.8           0                1.000  0.698
Arachne   gen PhyloPythia (p:0.85)   1     168             202            83.17          0                1.000  0.832
Arachne   ssp. PhyloPythia (p:0.85)  1     186             202            92.08          0                1.000  0.921
Arachne   S-GSOM (CP = 75 %)         1     180             202            89.11          0                1.000  0.891
Arachne   gen PhyloPythia (p:0.5)    1     201             202            99.5           0                1.000  0.995
Arachne   ssp. PhyloPythia (p:0.5)   1     201             202            99.5           0                1.000  0.995
Phrap     kmer (7 mer)               0     0               229            0              129              –      0.000
Phrap     kmer (8 mer)               0     0               229            0              154              –      0.000
Phrap     BLAST distr 1              0     0               229            0              0                –      0.000
Phrap     BLAST distr 2              0     0               229            0              0                –      0.000
Phrap     S-GSOM (CP = 55 %)         1     157             229            68.56          0                1.000  0.686
Phrap     gen PhyloPythia (p:0.85)   1     185             229            80.79          0                1.000  0.808
Phrap     ssp. PhyloPythia (p:0.85)  1     205             229            89.52          0                1.000  0.895
Phrap     S-GSOM (CP = 75 %)         1     204             229            89.08          0                1.000  0.891
Phrap     gen PhyloPythia (p:0.5)    1     227             229            99.13          0                1.000  0.991
Phrap     ssp. PhyloPythia (p:0.5)   1     227             229            99.13          0                1.000  0.991
Total#Contigs: total number of contigs in the dataset; %ofBinContigs: the percentage of contigs binned; #ofPredNotInAct: the number of contigs predicted as a taxon that is not present in the dataset, which are treated as the un-binned contigs; wSp: weighted specificity; wSn: weighted sensitivity
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 3 Binning summary for medium complexity metagenome for contigs larger than 8 kb
Assembler Method                     Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne   kmer (7 mer)               0     0               301            0              47               –      0.000
Arachne   kmer (8 mer)               0     0               301            0              191              –      0.000
Arachne   BLAST distr 1              0     0               301            0              0                –      0.000
Arachne   BLAST distr 2              0     0               301            0              0                –      0.000
Arachne   S-GSOM (CP = 55 %)         2     220             301            73.09          0                1.000  0.731
Arachne   gen PhyloPythia (p:0.85)   2     242             301            80.4           0                1.000  0.804
Arachne   ssp. PhyloPythia (p:0.85)  2     242             301            80.4           0                1.000  0.804
Arachne   S-GSOM (CP = 75 %)         2     279             301            92.69          0                1.000  0.927
Arachne   gen PhyloPythia (p:0.5)    2     301             301            100            0                1.000  1.000
Arachne   ssp. PhyloPythia (p:0.5)   2     301             301            100            0                1.000  1.000
Phrap     kmer (7 mer)               0     0               401            0              84               –      0.000
Phrap     kmer (8 mer)               0     0               401            0              271              –      0.000
Phrap     BLAST distr 1              0     0               401            0              0                –      0.000
Phrap     BLAST distr 2              0     0               401            0              0                –      0.000
Phrap     S-GSOM (CP = 55 %)         2     318             401            79.3           0                1.000  0.793
Phrap     gen PhyloPythia (p:0.85)   2     301             401            75.06          0                1.000  0.751
Phrap     ssp. PhyloPythia (p:0.85)  2     295             401            73.57          0                1.000  0.736
Phrap     S-GSOM (CP = 75 %)         2     367             401            91.52          0                1.000  0.915
Phrap     gen PhyloPythia (p:0.5)    2     399             401            99.5           1                1.000  0.995
Phrap     ssp. PhyloPythia (p:0.5)   2     399             401            99.5           1                1.000  0.995
settings of kmer and BLAST methods, but was outperformed by PhyloPythia in both confidence settings (CP = 75 % vs. p-value = 0.5 and CP = 55 % vs. p-value = 0.85) regardless of the assembler used (Tables 2 and 3). Nevertheless, S-GSOM still outperformed PhyloPythia for the simMC, particularly in terms of sensitivity, i.e., having a higher true positive rate, at the family level (refer to the supplementary materials of original publication).
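The effect of weighting per-bin scores by bin size, as opposed to taking a simple average over bins, can be illustrated with a short sketch. The exact definitions of wSp and wSn are given in the original publication; the version below only assumes that each bin's score is weighted in proportion to the number of contigs in that bin, and the function name and example numbers are hypothetical.

```python
def weighted_bin_average(per_bin_scores, bin_sizes):
    """Average per-bin scores with weights proportional to bin size.

    Assumption: the weight of a bin is its number of contigs.
    """
    if len(per_bin_scores) != len(bin_sizes):
        raise ValueError("one score is needed per bin")
    total = sum(bin_sizes)
    if total <= 0:
        raise ValueError("bin sizes must sum to a positive number")
    return sum(score * size for score, size in zip(per_bin_scores, bin_sizes)) / total


# Hypothetical example: a large, well-recovered bin and a small, poorly
# recovered one. The simple average treats both bins alike, whereas the
# weighted average is dominated by the bin holding most of the contigs.
scores = [0.9, 0.5]
sizes = [180, 20]
simple = sum(scores) / len(scores)
weighted = weighted_bin_average(scores, sizes)
print(f"simple average = {simple:.2f}, weighted average = {weighted:.2f}")
```

When the per-bin score is a sensitivity, this weighting reduces to the fraction of all contigs that were binned correctly, which is why a weighted figure reflects the amount of correctly binned contigs better than an unweighted mean.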
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 4 Binning summary for low complexity metagenome for contigs with at least 10 reads
Assembler Method                     Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne   kmer (7 mer)               0     0               367            0              168              –      0.000
Arachne   kmer (8 mer)               0     0               367            0              312              –      0.000
Arachne   BLAST distr 1              0     0               367            0              0                –      0.000
Arachne   BLAST distr 2              0     0               367            0              0                –      0.000
Arachne   S-GSOM (CP = 55 %)         3     295             367            80.38          0                1.000  0.798
Arachne   gen PhyloPythia (p:0.85)   2     214             367            58.31          0                1.000  0.583
Arachne   ssp. PhyloPythia (p:0.85)  2     236             367            64.31          0                1.000  0.638
Arachne   S-GSOM (CP = 75 %)         3     343             367            93.46          0                0.950  0.926
Arachne   gen PhyloPythia (p:0.5)    2     292             367            79.56          0                1.000  0.796
Arachne   ssp. PhyloPythia (p:0.5)   2     296             367            80.65          0                1.000  0.798
Phrap     kmer (7 mer)               2     3               482            0.62           159              1.000  0.000
Phrap     kmer (8 mer)               3     17              482            3.53           281              1.000  0.000
Phrap     BLAST distr 1              0     0               482            0              0                –      0.000
Phrap     BLAST distr 2              0     0               482            0              1                –      0.000
Phrap     S-GSOM (CP = 55 %)         8     381             482            79.05          9                1.000  0.728
Phrap     gen PhyloPythia (p:0.85)   3     236             482            48.96          0                1.000  0.488
Phrap     ssp. PhyloPythia (p:0.85)  3     272             482            56.43          0                1.000  0.560
Phrap     S-GSOM (CP = 75 %)         8     443             482            91.91          9                1.000  0.840
Phrap     gen PhyloPythia (p:0.5)    4     368             482            76.35          1                1.000  0.759
Phrap     ssp. PhyloPythia (p:0.5)   5     387             482            80.29          1                1.000  0.797
property of S-GSOM further allows the identification of unseeded clusters. However, the sequence number of unseeded species should be at least as many as in the seeded clusters; otherwise, S-GSOM may wrongly assign the unseeded species to an unrelated species at low CP value or be considered as part of the boundary of neighboring clusters and thus become hardly detectable.
It is very likely that the 16S rRNA fragments of some species were not sampled or were difficult to sample. In such circumstances, we can still obtain
elucidated functions in ecosystems are only those that can be cultured, which are estimated to be less than 1 % of microbes existing in nature; thus, most microbes are recognized as unculturable species (Handelsman 2004). Accordingly, strategies to overcome the limitation of pure culture and thus to explore the entire microbial resource have been attempted, which has induced a new paradigm shift.
Metagenomics is a research area that studies the metagenome, which is the total genomes of all organisms existing in a certain habitat. It extracts DNA from a complex microbial community and analyzes the information of genomes using molecular biological tools mainly based on direct sequencing. Therefore, metagenomics is a microbial community analysis method to access all contents of microbial genomes, which goes beyond the limited scope of cultivated cells. Metagenomic research has been revolutionized by the development of genome-manipulating technologies, and despite its short history, new functional genes, proteins, and biomaterials have been mined successfully (Xu 2006). Comprehensive understanding has also been attained on the ecology and physiology of microbes. In addition, a huge amount of sequence information derived from metagenome is integrated using bioinformatic tools. Accordingly, the scope of application in the entire range of biotechnology has altered based on the potential value of the metagenome (Fig. 1).
The Value of Metagenome Resources Exploited
The results of gene prediction, annotation on the sequence, and metabolic assembly through (individual) genome reconstruction of metagenome give not only the understanding of microbial ecology and physiology but also the expectation that useful genetic resources and the whole synthetic pathway of specific compounds in vivo are readily explored. With the development of the amplification tools for rare DNAs and technology related to high-throughput sequencing of DNA, it is now possible to analyze and understand the function of individual species in the whole community of natural strains. As an example, the elucidation of broad distribution of non-extreme ammonia-oxidizing archaea, AOA, as dominant species in a wide range of ecological niches clarified a major provider of energy flow and nitrogen cycling in ecosystem (Erguder et al. 2009). Thus, a fundamental reconsideration of the geochemical cycle of nitrogen is demanded. In line with this, environmental shotgun sequencing of specific samples from ocean, soil, plant, and animal stimulated interest in the diversity of microorganisms and indwelling metabolic gene clusters, enabling the elucidation of species and community functions in specific niches. With the introduction of high-throughput screening that can detect extremely weak activity and signal, new methods have been developed for rapid detection of target libraries with a small
Biological Treasure Metagenome, Fig. 1 A value of metagenome resource. Current metagenomic studies result in various fields of applications that include, but are not limited to, environmental, agricultural, medical, and industrial needs
Biological Treasure Metagenome 75 B
amount of sample, thus facilitating screening with more positive hits (genes and proteins). Candidates captured through this process are compared with other known resources in shared pattern of functional or sequence signature, which predicts the functional roles of the candidates in silico. With the derived functional roles, metabolic pathway and/or capacity of the whole microbiome is constructed. Currently, the microbiome in a specific environment is reliably established through reconstitution of the genome data of all organisms by bioinformatic tools (Kunin et al. 2008). Metagenome data provide the understanding of complex biological systems through online public databases and data-integration tools. Currently, assembled genomes analyzed with integrated logic are providing a window for forming more complete genomes, and this is expected to reduce time and cost in finding new resources considerably.
The Value of Metagenome Resources Remain to be Elucidated
Besides the elucidated physiological and ecological values by using metagenomic approaches, there is an obvious reason for obtaining useful genes or physiologically active substances from the metagenome. This is because the access to screening the library of pure-cultured cells has been limited, and it is very difficult to sustain the novelty of resources originating from culturable strains. According to what is known, enzymatic degradation or synthesis is possible for almost every organic compound that can be found naturally or synthesized chemically. However, regardless of the existence of related enzymes with promising activities, the functional and sequence spaces of metagenome resources from the whole living organisms in ecosystem are still left mostly unexplored. Therefore, if hurdles in the screening process, due to the approach based on the homology of protein sequence, can be overcome, it will be possible to find resources in new areas. Of course, it is generally known that the approaches of screening from the metagenome compete with protein engineering technologies that mutate or fine-tune existing genetic resources or induce forced evolution in vitro. However, the limitation of engineering processes in the exploration of sequence space and the innate weak point of the stepwise screening process that cannot gather effectively the effect of beneficial mutation in the alternative landscape may reasonably explain the strength of the exploration of the metagenome from microbial community that has already possessed various functional space (including biologically permitted sequence space) by evolutionary experience. Therefore, the metagenome can play a significant role as a resource to provide new alternatives and get desired products from the highly precise and specific enzyme reactions of thousands of substrates used in industry (Fig. 2).
One of the major trends of research on biologically mediated processes is white biotechnology, which is to find alternatives to petrochemical compounds using renewable resources. Therefore, attention is paid to the production of substitutes for fossil fuels by bioconversion or fermentation using biomass. To this end, the acquisition of regulatory genes, potential enzymes, and gene clusters related to the production of organic acids, alcohols, solvents, and diesels is also obtainable from metagenome. What is more, organic compounds, which have been out of people’s attention for economical reasons, are again spotlighted along with their application to improved price competitiveness, low risk of environmental pollution, and innovative tools of systems biology. We also expect critical roles of the metagenome in increasing agricultural productivity and the utilization of biomass. Besides, research on the human metagenome can provide clues to causes of diseases, acquired immune system, and new methods of treating pathogens through the analyses of microbiome database from microbial communities (Gill et al. 2006). Also, in response to the serious side effects of synthetic drugs and increasing drug-resistant pathogens, finding new natural inhibitors or suppressors, including quorum-sensing blockers, as antibiotics in the metagenome could be possible. In this respect, there are many attempts to approach the new potential of metagenome resources through analyzing the resistome formed naturally by biological species existing around the ecological
Biological Treasure Metagenome, Fig. 2 Exploring value creation from integrative research activities of metagenome. Information concerning the application fields of metagenome resources is gathered and processed by systematically integrative systems. This information results in various fields of further applications that solve global problems such as fine chemical, environmental, medical services, and future energies. Basically, metagenome information also provides a clue for the origin and minimal genome of living organisms
C 80 Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource
down very specific complex carbohydrates in a highly specific manner. Exquisite details of complex carbohydrates create immense functional differences. For instance, cellulose and amylose, two simple polymers of glucose residues linked between their position 1 and their position 4, only differ by the equatorial vs. axial orientation of the glycosidic bond (β for cellulose and α for amylose). This minute difference gives rise to two massively different polysaccharides: cellulose, whose mechanical properties rival those of steel, is synthesized by plants as a structural polysaccharide notoriously recalcitrant to hydrolysis while amylose is a component of starch and, as a reserve carbohydrate, is readily converted to glucose.
Carbohydrate-Active Enzymes and Their Classification
Carbohydrate-active enzymes (CAZymes) catalyze selective reactions to assemble and break down complex carbohydrates and glycoconjugates for a large array of biological functions globally underpinning glycobiology. These enzymes, which comprise glycoside hydrolases (GH), polysaccharide lyases (PL), carbohydrate esterases (CE), and glycosyltransferases (GT), have gradually evolved from a limited number of primordial carbohydrate-active enzyme-coding genes by acquiring novel specificities at substrate and product level. In addition, these enzymes often display a modular structure with a catalytic module appended to one or several other domains, such as carbohydrate-binding modules, allowing for increased specificity and/or specific targeting to a particular substrate/region (Boraston et al. 2004).
The sequence-based classification of CAZymes was initiated in 1991 (Henrissat 1991; Henrissat and Bairoch 1993, 1996) as a complement to the long-standing Enzyme Commission (EC) number system (http://www.chem.qmul.ac.uk/iubmb/enzyme/), which is based solely on enzyme activities. Given the prevalence of convergent evolution of enzymes that cleave glycosidic bonds, as well as the demonstrable catalytic promiscuity of individual enzymes, sequence-based classification has proven to be a robust way to unify information on enzyme structure, specificity, and mechanism, which provides significant predictive power. Initially motivated by a need to delineate cellulases (EC 3.2.1.4) into distinct structural families, the first incarnation of the GH family classification, as such, comprised 35 GH families (Henrissat 1991). Five years later, the number of GH families grew to 57 families (Henrissat and Bairoch 1996), and has continuously expanded to reach 113 in 2009 (Cantarel et al. 2009). As of March 2012, 130 sequence-based families of GHs have been defined and are presented in the continuously updated CAZy database (http://www.cazy.org/). In parallel with the development of the classification of GH families, sequence-based classifications of the glycosyltransferases (GTs) (Campbell et al. 1997), polysaccharide lyases (PLs) (Lombard et al. 2010), carbohydrate esterases (CEs) (Cantarel et al. 2009), and carbohydrate-binding modules (CBMs) (Boraston et al. 2004) have similarly been developed and added to the CAZy database.
Functional Prediction of Carbohydrate-Active Enzymes
The immense variety of carbohydrate structures and their involvement in extremely different biological functions mean that functional annotations such as “putative carbohydrate-active enzyme” or “putative glycosidase” have very limited information value. Instead, a useful functional prediction for a CAZyme should indicate the likely nature of sugar being cleaved or transferred, with a description of the exact connectivity between the sugar undergoing catalysis and the molecule it is attached to or detached from.
A feature that was recognized very early on was that the sequence-based families of carbohydrate-active enzymes group together enzymes of differing substrate specificity and hence group together enzymes with different EC numbers (Henrissat 1991; Campbell et al. 1997). Because of the multifunctional nature of these enzymes, it is believed that a limited number of catalytic and binding progenitors (protein domain families), which can be found in different combinations, gave rise to the vast number of enzymes and of carbohydrate structures that exist in
Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource 81 C
modern organisms, resulting in the gradual and simultaneous acquisition of exquisite substrate specificity for both carbohydrate biosynthesis and carbohydrate degradation. Since most CAZyme protein domain families are multifunctional, prediction of functional roles for uncharacterized carbohydrate-active enzyme encoding genes simply by family assignment can lead to erroneous annotations, especially at high sequence divergence. Additionally, the universe of known carbohydrate structures with the same types of linkage bonds is smaller than the universe of possibility; therefore, even when functions are known, there are potentially more possible substrates. As a result the number of sequences that can be assigned to CAZy families increases very rapidly, but the number of CAZymes whose substrate specificity has been established (even roughly) grows at a much lower pace. As sequencing data grows with increasing genomic and metagenomic characterization, this proportion of characterized enzymes continues to decrease. In spite of limitations due to the presence of different substrate specificities in many CAZyme families, it is often possible to assign a broad substrate category (for instance, pectin, cellulose, xylan) to a number of CAZyme families (Cantarel et al. 2012) even if the precise substrate or product specificity (for instance, to distinguish between endo- and exo-acting enzymes or to distinguish between β-D-xylosidase and α-L-arabinofuranosidase) cannot be predicted accurately based on simple family assignment. In order to improve functional prediction, the partition of CAZyme families into subfamilies based on phylogenetic analysis has been explored. Significantly, subfamily classification of several families of GHs and PLs has shown that the majority of the defined subfamilies were monospecific, thus indicating a better correlation of substrate specificity between sequences at the subfamily level than the family level (Lombard et al. 2010; St. John et al. 2010; Stam et al. 2006).
The advent of low-cost DNA sequencing has revolutionized biology, and the central question is no longer how to obtain nucleotide sequence, but how to make sense of it. In the vast majority of cases, inferences are made by detecting the similarity of sequence between the newly generated DNA sequence (or putatively encoded protein) and sequences already in databases. This approach does not perform equally with different classes of proteins in terms of the biological inference that can be derived. For instance, the assignment to families of protease/peptidases often has limited predictive power: the predictions are often based only on the fold, the most informative information being essentially that of the catalytic machinery (for instance, “serine protease”), with little predictive power in terms of what is the specific peptide substrate targeted by the enzyme. Thus, the very difficulty with CAZymes (huge structural and functional variety of substrates) is also at the origin of their intrinsic advantage: these enzymes had to evolve to achieve the exquisite specificity necessary to carry out their function in a selective manner. The high information content of complex carbohydrates has therefore translated into the proteins that assemble and deconstruct them by leaving evolutionary signals/traces that can be recognized in the sequence. While experimental developments in the field of glycomics are still slow in comparison to the boom in sequencing technologies, carbohydrate-active enzymes are perhaps the most adapted to functional inference from genomic and metagenomic data.
The direct genetic sequencing of microbial communities (metagenomics) is beginning to explore the great gene diversity in the microbial world. Environmental samples from diverse environments are being studied to better understand the role of microbes in various habitats from the human body to the ocean floor. This technology has allowed scientists to begin to answer questions not possible with studying only cultivable species. Here we review the burgeoning exploration of carbohydrate-active enzymes in metagenomic samples.

Glycobiology in Microbial Communities

Microbial communities isolated from human fecal material are the most well studied in the usage of CAZymes. CAZyme diversity in human gut microbiota studies (Gill et al. 2006;
Duan CJ, Xian L, Zhao GC, Feng Y, Pang H, Bai XL, et al. Isolation and partial characterization of novel genes encoding acidic cellulases from metagenomes of buffalo rumens. J Appl Microbiol. 2009;107(1):245–56.
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
Hehemann JH, Correc G, Barbeyron T, Helbert W, Czjzek M, Michel G. Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature. 2010;464(7290):908–12.
Henrissat B. A classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem J. 1991;280(Pt 2):309–16.
Henrissat B, Bairoch A. New families in the classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem J. 1993;293(Pt 3):781–8.
Henrissat B, Bairoch A. Updating the sequence-based classification of glycosyl hydrolases. Biochem J. 1996;316(Pt 2):695–6.
Kabeerdoss J, Shobana Devi R, Regina Mary R, Ramakrishna BS. Faecal microbiota composition in vegetarians: comparison with omnivores in a cohort of young women in southern India. Br J Nutr. 2011;20:1–5.
Laine RA. A calculation of all possible oligosaccharide isomers both branched and linear yields 1.05 × 10(12) structures for a reducing hexasaccharide: the isomer barrier to development of single-method saccharide sequencing or synthesis systems. Glycobiology. 1994;4(6):759–67.
Lombard V, Bernard T, Rancurel C, Brumer H, Coutinho PM, Henrissat B. A hierarchical classification of polysaccharide lyases for glycogenomics. Biochem J. 2010;432(3):437–44.
Mahowald MA, Rey FE, Seedorf H, Turnbaugh PJ, Fulton RS, Wollam A, et al. Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A. 2009;106(14):5859–64.
Matteotti C, Haubruge E, Thonart P, Francis F, De Pauw E, Portetelle D, et al. Characterization of a new beta-glucosidase/beta-xylosidase from the gut microbiota of the termite (Reticulitermes santonensis). FEMS Microbiol Lett. 2011;314(2):147–57.
Muegge BD, Kuczynski J, Knights D, Clemente JC, Gonzalez A, Fontana L, et al. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science. 2011;332(6032):970–4.
Pope PB, Denman SE, Jones M, Tringe SG, Barry K, Malfatti SA, et al. Adaptation to herbivory by the tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proc Natl Acad Sci U S A. 2010;107(33):14793–8.
St John FJ, Gonzalez JM, Pozharski E. Consolidation of glycosyl hydrolase family 30: a dual domain 4/7 hydrolase family consisting of two structurally distinct groups. FEBS Lett. 2010;584(21):4435–41.
Stam MR, Danchin EG, Rancurel C, Coutinho PM, Henrissat B. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng Des Sel. 2006;19(12):555–62.
Suen G, Scott JJ, Aylward FO, Adams SM, Tringe SG, Pinto-Tomas AA, et al. An insect herbivore microbiome with high plant biomass-degrading capacity. PLoS Genet. 2010;6(9):e1001129.
Tasse L, Bercovici J, Pizzut-Serin S, Robe P, Tap J, Klopp C, et al. Functional metagenomics to mine the human gut microbiome for dietary fiber catabolic enzymes. Genome Res. 2010;20(11):1605–12.
Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, Niazi F, et al. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci U S A. 2010;107(16):7503–8.
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Zhu L, Wu Q, Dai J, Zhang S, Wei F. Evidence of cellulose metabolism by the giant panda gut microbiome. Proc Natl Acad Sci U S A. 2011;108(43):17714–9.

Challenge of Metagenome Assembly and Possible Standards

Matthew B. Scholz¹, Chien-Chi Lo¹ and Patrick Chain²
¹Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM, USA
²Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA

Introduction

As technology and methodology have allowed for more advanced assemblies of metagenomes, the need for commensurate assignment of quality to these assemblies has become evident. There are currently no set standards for describing the quality of sequencing, assembly, or analysis of metagenomic assemblies. Uncorrected, this may
lead to faulty conclusions based on assumptions that the assembly is more or less accurate, or representative of the sample, than it truly is. This need is similar to, but far more complex than, the dilemma that faced the microbial sequencing and assembly community as more and more genomes were sequenced with new technologies and assembled with novel algorithms.
For bacterial genomes, the quality of assembly and finishing efforts has been standardized for several years, resulting in a much better understanding of the types of analyses that can be performed on each level of finished genome and the resulting value. While the need for standards in metagenomics has been very clear in terms of metadata (Yilmaz et al. 2011), less attention has been focused on the genomic data itself. As the field continues to advance and mature, it is clear that efforts in standardizing assemblies as well as functional and phylogenetic classification are sorely needed. Given the flux in application of various sequencing technologies (with different and sometimes variable sequence qualities) to genome reconstruction, the most recent version of this standard for microbial genomes is to divide sequences into broad levels of completeness and quality, from draft to completely finished (Chain et al. 2009). While these standards are valuable, it is difficult to apply similar standards to metagenomic assemblies, where the effort is to reconstruct the genomes (or parts of genomes) of many organisms present within a sample. The numbers of different species in a community selected for sequencing and metagenomic assembly can vary from two to millions of individual genomes from many species, with varying frequency of each genome. Additionally, each genome may be different in size, G+C content, and repetitiveness as well as have other genome-specific issues that make it impossible to assemble all genomes equally well.
Additionally, community genomics are complicated by the potential existence of many strains of the same species (species of the same genus, etc.), of recombination, of horizontal gene transfer events among members of the community, and of other factors that further complicate analysis and assembly. All of these factors underlie the highly variable nature of metagenomes, making it difficult to generate accurate assemblies and also difficult to define standards or otherwise grade the effectiveness of an assembly. It can be reasonably stated that metagenomic assembly is still in its infancy and generally produces what can only be described as draft assemblies of metagenomic data, though it has certainly been possible in some rare cases to recover full and near-complete genomes from some environments (Huttenhower et al. 2012).
The utility of sequencing a community sample is based solely on the ability of the researcher to garner useful information from the data (assembly and annotation). This ability is, more often than not, reflective of the “quality” of a metagenome assembly. Additionally, the goals of a given project can also affect this question, by altering the types of analysis needed or the depth of sequencing required, among many factors. The gross differences and determinations of sequence diversity between two or more metagenomes can be typically analyzed using simple comparative tests, whereas a more in-depth analysis for gene content or for application to proteomic (peptide mass prediction) or metabolic pathway (function and operon prediction) analysis requires larger assembled regions of contiguous sequence (contigs), with low error rates.
While it is not possible to set de facto standards for metagenomic analysis or assembly, this entry is an attempt to discuss a number of the potential impediments to adequate or good metagenome assembly. Additionally, several possible methods for improvement and validation of assembled contig sets from metagenomic assemblies, as well as potential methods for generating higher quality draft metagenomes from an individual sample, will be discussed.

Barriers to Metagenomic Assembly

As has been addressed previously, metagenome assembly is difficult, requiring ever-increasing computational resources at a rate fast outpacing
“Moore’s law” (Miller et al. 2010; Scholz et al. 2012). This is a product of the limitations of current sequencing technologies coupled with the available assembly algorithms that can falter when running into the massive scale of data produced by current next-generation sequencers (NGS). For metagenomic sequencing and assembly, there is a paradoxical problem with data. While the relatively low-cost, highest-throughput sequencers produce hundreds to thousands of gigabytes of data per run, the short read lengths of these technologies limit the types of assembly procedures that can be applied (Scholz et al. 2012). Due to the diversity of genomes and the variation between members of a community, a good assembly of a metagenome requires significantly more sequencing (potentially terabases of data per sample for some environments). In direct opposition to this requirement, the current state-of-the-art assemblers for NGS data are limited by available computational memory, meaning that, currently, computers are only capable of assembling as little as 1% of the data required for the most complex of samples. The computational time, processing power, and required system memory for assembling any genome using state-of-the-art assembly algorithms (Miller et al. 2010) are directly proportional to the size and complexity of the genome(s) to be assembled (and partly coupled with errors introduced during the sequencing process). In the case of metagenomes, then, the first approximation would be that the requirements for assembly are a function of the number of unique bases (or unique “words” or Kmers) in all genomes contained within the community to be sequenced. This easily overshadows even the largest and most complex eukaryotic genomes, making assembly of all microbial genomes within a single metagenomic sample, given today’s infrastructure and algorithms, infeasible. This variation and computational limitation lead to a variable amount of data that can be incorporated into any given assembly. It can be expected that read incorporation into metagenome assemblies will follow a logarithmic curve, with the amount of available sequence data covering more of the diversity and complexity of the community being sequenced and increasing the percentage of reads incorporated, affecting the slope of the curve. In short, the more complex communities, such as those found in soils and sediments, will require much greater sequencing inputs to allow for assembly of a significant proportion of the data. Conversely, simpler communities, such as bioreactors, enrichment cultures, and naturally simple environments, can achieve nearly 100% incorporation of data even with relatively few reads (<200 million Illumina reads).

Assembling Subsets of Data

To allow current assemblers to better process the mountains of data, it is generally believed that dividing reads into smaller, categorical bins may enable improved, or “targeted,” assembly. While this partitions the data into manageable parcels for assembly, it has also been used as a filtering method, to remove extraneous reads from the dataset pre-assembly (Godoy-Vitorino et al. 2012). There have been several very thorough methods developed for binning of reads or contigs. Binning can be performed as a function of nucleotide frequencies, or abstractions of that (Kmer-based filtering, etc.), on statistical analysis of read relationships (read topology), or on similarity to known genomes or genome signatures (homology, etc.). Additionally, HMMs or other learning algorithms may eventually be developed to allow rapid binning of reads. However, once binning is performed, many of the issues surrounding assembly of many sequences again become relevant and require in-depth analysis and work. As each binning method will invariably introduce both false positives and false negatives, it is not clear what effect these algorithms may have on a “final” assembly or if the effect will be consistent among different samples.

Whole Sample Assembly

Full metagenome sequence runs, or bins of metagenomic data, are run through an assembly methodology or program; however most current
algorithms are designed for isolate genome assembly (Miller et al. 2010; Scholz et al. 2012). While isolate genome assembly assumes that there are a limited number of solutions to the assembly, as the complexity of the genome increases and a concomitantly larger amount of sequence data is required, these decisions become more difficult for algorithms to make. For metagenomes, there are additional complications, such as strain-level variation within a species, varying levels of similarity among the multiple species within the population, including horizontal gene transfer, and the ever-present problem of variability of organism frequency/abundance within the community. Several recent attempts have been made (e.g., MetaVelvet (Namiki et al. 2012), RAY (Boisvert et al. 2010), or Meta-IDBA (Peng et al. 2011)) to solve one or more of these metagenome-specific issues. However, there is not yet a perfect algorithm, and all can benefit from improved understanding of the inherent complexities within metagenomes as well as from improved algorithms for determining which data are to be examined and how. Given the varied nature of the complexities that exist in communities, it is likely that the perfect assembly algorithm will have to evaluate the data and make decisions during metagenome assembly.

Assembly Validation and Metrics

Validation of metagenomic assemblies is not currently a standard process. Some efforts have focused on validation using tools adapted from single-genome assemblies which, due to the differences in complexity, can vary from being simply an inefficient method for validation at best to being misleading and based on incorrect assumptions at worst. Validation of assembly completeness (good assemblies provide large contigs with more of the raw data) and accuracy (good assemblies harbor few errors such that they are a close representation of the target organism) is a nuanced and nebulous process even with isolate genomes. A typical series of statistical properties of contigs is often used to describe the goodness of single-genome assembly (N50, total assembly size, etc.). To improve accuracy, this can be combined with manual inspection of the data underlying the contigs, using tools such as Consed (Gordon 2003), Hawkeye (Schatz et al. 2011), or other alignment viewers. It bears noting here that these tools require sequential attention to each individual contig, making validation of metagenomic assemblies of many thousands to millions of contigs prohibitively time-consuming.
Additional validation can be gained by read mapping input sequence data to contigs to identify errors or areas with unexplained variances in coverage. None of the tools or processes available for validation of single-genome assemblies is directly applicable to metagenomes; each requires either significant alterations in method or a completely new approach. This is due in part to the much larger amount of data required for a metagenome assembly as well as the intrinsic complexities associated with metagenomes, mentioned above.
How, then, does one assess whether a metagenomic assembly is good or valid? It is possible to examine statistics of an assembly to determine if it has value and assess the quality of assembled contigs to give a measure of what analysis can be performed on the assembled data (e.g., longer contigs allow for more annotation analysis). Additionally, it is easy to calculate the total number of bases assembled, allowing for a rough estimate of how many genomes may be captured in an assembly. However, it is also important to validate that the assembly is an accurate representation of the input data by use of read mapping or other comparative tools. For metagenomes, with the stated issues of lack of uniformity, it is likely that a valuable tool for obtaining improved assemblies will be to perform several assemblies in parallel and compare inter-sample assemblies to each other. This will also be an important method to compare the results of binning and of different assembly methods to combined, or iterative, assemblies of the entire dataset. For environments that have been amply studied and for which there are a number of pertinent reference sequences, such as for
human microbiome samples (Lampe 2008; Huttenhower et al. 2012; Methe et al. 2012), it is possible to use these to validate assemblies. Recent work with isolation and sequencing of single cells from within environmental samples raises the possibility of using reference-based validation tools on metagenome samples as well (Kant et al. 2011; Leung et al. 2012).

Statistical Comparisons

As mentioned above, the first approximation of the quality of any assembly is an examination of the metrics associated with the assembly. For metagenomes, these metrics should be different from those used for isolate genome work. Statistics that are linked to the total assembly size (e.g., N50) have little value, as the size of the metagenome, the assembly (and the assembled number of contigs), and the choices made for assembly (binning, filtering, assembler algorithm, Kmer size, etc.), which can affect the number and types of bases included in a metagenomic assembly, can all result in drastically different interpretations. The evaluation of metagenome assemblies is often conducted in a holistic manner, utilizing a number of important statistics and validation metrics. This can be used to assess the completeness of various assembly methods. Table 1 shows a selection of assembly statistics for a single sample (MH0001) from the MetaHIT project using the assembly program SOAPdenovo with different Kmers as an input parameter. What is important to note is that it is difficult to select the best assembly based on any single metric, even given the same assembler with a single parameter change. In fact, it is the rule rather than the exception that no single assembly of the data can provide the best statistics for every metric.

Read Mapping as Contig Validation

It is important that any assembly be verified by methods beyond those utilizing simple statistical methods. It is also important that validation algorithms be independent from those utilized to perform the assembly. Currently, Burrows-Wheeler (Langmead et al. 2009; Li and Durbin 2010) read mapping can serve as an independent approach for validating the contigs assembled based on the raw sequencing data. This approach has the ability to validate assembled contigs on the basis of the coverage of every base contained within the contig (Fig. 1) as well as based on the variation of coverage within the contig (Fig. 2).
It may not always be the case that coverage along a contig will appear as even as with isolate genomes, due to the issues of strain (allele) variations, of gene duplication, of ribosomal gene similarities between species, and of horizontal gene transfer. Additionally, because read
Challenge of Metagenome Assembly and Possible Standards, Table 1 Statistical metrics of metagenome assembly
Assembly type | Number of contigs | Maximum contig size | Total bases | Bases in largest 100 contigs | Bases in contigs >10 kb | % read incorporation
SOAPdenovo-Kmer 21 | 378,624 | 18,148 | 63,350,050 | 1,025,623 | 438,734 | 60.8
SOAPdenovo-Kmer 23 | 303,536 | 18,150 | 55,682,346 | 1,155,330 | 839,420 | 61.1
SOAPdenovo-Kmer 25 | 244,200 | 25,192 | 47,972,706 | 1,220,072 | 949,421 | 60.6
SOAPdenovo-Kmer 27 | 188,074 | 23,935 | 40,311,428 | 1,162,160 | 843,499 | 59.9
SOAPdenovo-Kmer 29 | 140,502 | 28,068 | 33,228,335 | 1,177,230 | 804,055 | 58.9
SOAPdenovo-Kmer 31 | 109,722 | 28,068 | 27,463,402 | 1,245,286 | 918,627 | 57.8
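The values in Table 1 can be used to check the statement that no single assembly is best on every metric. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, the numbers are transcribed from Table 1, and treating a larger value as better is an assumption made for this comparison (number of contigs is left out because its preferred direction is not obvious).

```python
# Values transcribed from Table 1 (sample MH0001, SOAPdenovo, Kmer 21 to 31).
# Key names are hypothetical labels chosen for this sketch.
TABLE1 = {
    21: {"max_contig": 18_148, "total_bases": 63_350_050,
         "bases_largest_100": 1_025_623, "bases_over_10kb": 438_734,
         "pct_read_incorporation": 60.8},
    23: {"max_contig": 18_150, "total_bases": 55_682_346,
         "bases_largest_100": 1_155_330, "bases_over_10kb": 839_420,
         "pct_read_incorporation": 61.1},
    25: {"max_contig": 25_192, "total_bases": 47_972_706,
         "bases_largest_100": 1_220_072, "bases_over_10kb": 949_421,
         "pct_read_incorporation": 60.6},
    27: {"max_contig": 23_935, "total_bases": 40_311_428,
         "bases_largest_100": 1_162_160, "bases_over_10kb": 843_499,
         "pct_read_incorporation": 59.9},
    29: {"max_contig": 28_068, "total_bases": 33_228_335,
         "bases_largest_100": 1_177_230, "bases_over_10kb": 804_055,
         "pct_read_incorporation": 58.9},
    31: {"max_contig": 28_068, "total_bases": 27_463_402,
         "bases_largest_100": 1_245_286, "bases_over_10kb": 918_627,
         "pct_read_incorporation": 57.8},
}


def best_kmers_per_metric(table):
    """Map each metric to the Kmer size(s) with the highest value (ties kept).

    Assumption: a larger value is treated as better for every metric here.
    """
    metrics = list(next(iter(table.values())))
    winners = {}
    for metric in metrics:
        top = max(row[metric] for row in table.values())
        winners[metric] = sorted(k for k, row in table.items() if row[metric] == top)
    return winners


WINNERS = best_kmers_per_metric(TABLE1)
for metric, kmers in WINNERS.items():
    print(f"{metric}: best at Kmer {kmers}")
```

Under these assumptions the leading Kmer differs by metric: 21 for total bases, 23 for read incorporation, 25 for bases in contigs >10 kb, and 31 for bases in the largest 100 contigs, with 29 and 31 tied for maximum contig size.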
Challenge of Metagenome Assembly and Possible Standards, Fig. 1 Coverage histogram of metagenome assembly. Displays percentage coverage of every contig as a function of the contig length [figure not reproduced; plot title: Contigs Coverage vs. Contigs Length; y-axis: Contigs coverage (%)]
Challenge of Metagenome Assembly and Possible Standards, Fig. 2 Base-by-base coverage histogram of a single contig generated within a metagenome assembly. Areas where coverage varies from the mean may be identified as regions of low quality or confidence
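The idea behind Fig. 2, that positions whose coverage departs from the contig mean may mark low-confidence regions, can be sketched in a few lines. This is a minimal illustration and not the authors' procedure: the function name is hypothetical, the cutoff of two standard deviations is an arbitrary assumption (the text gives no threshold), and the input is assumed to be a list of per-base read depths for one contig, for example as reported by a tool such as `samtools depth`.

```python
from statistics import mean, pstdev


def flag_coverage_outliers(depths, z_cutoff=2.0):
    """Return half-open (start, end) intervals of unusual per-base coverage.

    A position is flagged when its depth differs from the contig mean by
    more than z_cutoff population standard deviations (assumed threshold).
    """
    mu = mean(depths)
    sigma = pstdev(depths)
    if sigma == 0:
        return []  # perfectly even coverage: nothing to flag

    regions = []
    start = None
    for i, depth in enumerate(depths):
        is_outlier = abs(depth - mu) > z_cutoff * sigma
        if is_outlier and start is None:
            start = i                      # open a new flagged region
        elif not is_outlier and start is not None:
            regions.append((start, i))     # close the current region
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


# Toy contig: even 30x coverage with a five-base dip to 2x.
EXAMPLE_REGIONS = flag_coverage_outliers([30] * 50 + [2] * 5 + [30] * 45)
print(EXAMPLE_REGIONS)
```

A fixed z-score cutoff is only one possible choice; as the text notes, strain variation, gene duplication, and horizontal gene transfer all produce uneven coverage in metagenomes, so flagged regions are candidates for inspection rather than confirmed errors.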
mapping is fundamentally different from Kmer-based assemblies, short contigs will generally have poorer coverage, when considering the percentage of total bases in the contig. This is due to a so-called edge-effect that prevents a read from mapping to a contig if the read-to-contig alignment ends in the middle of the read yet at the end of the constructed contig. However, due to the
speed and accuracy of Burrows-Wheeler style aligners, this method of validation is both rapid and sufficiently accurate to allow reasonable certainty that an assembly is valid and that the contigs represent the genomes present within the sample. Finally, read mapping can be combined with a number of other tools such as SAMtools (Li et al. 2009) to locate possible population differences such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), and other assembly errors within the contigs. This allows assemblies to be validated and potentially improved in an unsupervised manner based on the alignment of reads as well as to make empirical judgments of assembly quality.

Comparisons of Multiple Assemblies

Beyond statistical comparisons of multiple assemblies and evaluation using the raw input data, it is also possible to determine how similar two assemblies are using the same initial data. For example, for the entries listed in Table 1, there is no guarantee that the largest contigs from each assembly are the same or that the contigs have been recapitulated in the various other assemblies. The mechanisms for comparing two contig sets to each other are evolving and can range from full assembly alignments using BLAST- (McGinnis and Madden 2004) or NUCmer-based comparisons (Delcher et al. 2002) to protein coding content-based analyses to more sophisticated methods. In the future, training of better assembly pipelines may involve evaluating differences among several results in terms of possible rearrangements, SNPs, indels, and errors in joining repetitive regions to determine if one methodology can be considered consistently better than another.

Generalized References and Site-Specific References for Validation

The recent explosion in sequencing capacity has resulted in an ever-increasing number of draft and finished reference bacterial genomes. These genomes are useful both for phylogenetic and functional classification and for validation of assembly. When it is known or suspected that a particular organism is present within a sample (e.g., Rhizobium spp. are expected in rhizosphere samples, while Escherichia coli are generally found in fecal samples), alignments against such references can be used to validate contigs that are generated from the metagenomic sample in question.
In the future, reference-based approaches may be best utilized in a sample-specific manner to both contribute to and help validate metagenomic assemblies by using draft reference genomes generated via single-cell (or microcolony) isolation from the same site, followed by amplification and sequencing. The advent of multiple displacement amplification to allow for the sequencing of minute quantities of DNA, including single cells or clusters of cells, shows great promise for metagenomic projects by allowing the inclusion of sample-specific genomes to be used in reference-based assembly methods.

Metagenome Assembly Standards: A Proposed Tiered System

In this nascent field, the methodology for metagenome assembly is still in great flux. Currently available tools are able to produce valid, useful assemblies of some fraction of any metagenomic sample. However, these assemblies must be considered as a set of draft contigs only, particularly if no form of validation has been performed. Read-based validation can be used to inform and improve on assemblies; however this is a time-consuming process and should not be expected to be a long-term, high-throughput solution for metagenome assemblies. However, this does not obviate the need for validation protocols; it merely highlights the lack of algorithmic approaches to the technique.
There are several promising areas of assembly investigation that could produce assemblies distinguishable from draft or validated draft metagenomic assemblies. These include the use
Challenge of Metagenome Assembly and Possible Standards, Table 2 Proposed statistical reporting metrics for metagenome assembly

Proposed metric | Description
Percent of read incorporation | Percentage of reads mapped to or incorporated into the assembly. This serves as a metric as to how much additional sequencing may be required for better assembly
Size of metagenome assembly | Number of base pairs included in the final assembly. This is a measure of how many genomes may have been assembled and can be utilized to determine what additional sequencing will be allowed in terms of additional sequence data incorporation
Largest contig size | This is typically a measurement of how well the most abundant organism assembled
Number of bases in large contigs | This measurement is similar to largest contig size but also allows depth of analysis to potentially include less well-assembled species
Fold coverage histogram | A histogram describing the number of bases covered at a given fold coverage. This will indicate the variation between abundant and non-abundant organisms

Challenge of Metagenome Assembly and Possible Standards, Table 3 Classification of assembly methods for metagenomes. Reporting would ideally describe both classification and statistics described in Table 2

Quality | Description
Draft | One assembler, one parameter
Quality draft (QD) | Multiple assemblers, multiple parameters, merging-based final assembly
Binning assisted (HQD) | Multiple parameter assemblies performed by binning of reads into subsets, followed by merging-based final assembly
Reference-guided (RHQD) | Binning based on reference sequences, followed by HQD assembly
Location-specific reference-guided assembly | Reference-guided assembly including sequencing and assembly of individual isolates, single cells, or microcolony-based organisms isolated from the same environment as the metagenome sample in question
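The metrics proposed in Table 2 are simple to compute once contig lengths and read-mapping counts are known. The following Python sketch is illustrative only: the function names, the `top_n` parameter, and the example numbers are assumptions made here, not part of any proposed standard or existing tool.

```python
from typing import Dict, List


def assembly_report(contig_lengths: List[int], reads_total: int,
                    reads_mapped: int, top_n: int = 100) -> Dict[str, float]:
    """Summarize an assembly with the metrics listed in Table 2 (hypothetical helper)."""
    if reads_total <= 0:
        raise ValueError("reads_total must be positive")
    ordered = sorted(contig_lengths, reverse=True)
    return {
        # Share of all reads that map to or were used in the assembly.
        "percent_reads_incorporated": 100.0 * reads_mapped / reads_total,
        # Total number of base pairs in the final assembly.
        "assembly_size_bp": sum(ordered),
        # Length of the single largest contig.
        "largest_contig_bp": ordered[0] if ordered else 0,
        # Bases contained in the top_n largest contigs.
        "bases_in_largest_contigs": sum(ordered[:top_n]),
    }


def fold_coverage_histogram(per_base_coverage: List[int]) -> Dict[int, int]:
    """Count how many bases are covered at each fold-coverage value."""
    histogram: Dict[int, int] = {}
    for depth in per_base_coverage:
        histogram[depth] = histogram.get(depth, 0) + 1
    return histogram


# Made-up example: four contigs, 640 of 1,000 reads incorporated.
report = assembly_report([5000, 1200, 800, 300], reads_total=1000,
                         reads_mapped=640, top_n=2)
print(report)
print(fold_coverage_histogram([1, 1, 2, 5]))
```

Calling `assembly_report` with `top_n` set to 100, 1,000, and 100,000 would give the three "bases in the largest contigs" values suggested in the text.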
of reference genome datasets to improve assemblies, the inclusion of long read technologies to help generate longer contigs and scaffolds as well as to allow linkage of genetic differences among haplotypes, and the use of iterative and combined assembly methods to correct “invalid” contig and scaffold regions and to find previously unreported overlaps among contigs and reads.

In order to provide a complete assembly overview, the standardized reporting of two important pieces of information for any assembly of a metagenomic sample is proposed. Tables 2 and 3 describe a first approximation of reporting that would help disseminate information regarding the quality (Table 2) and assembly levels (Table 3) of metagenome assemblies to a broader audience. The first and most important level of reporting is an accurate and consistent description of the assembly metrics as discussed above and in Table 2. These metrics should include, at a bare minimum, the percentage of reads incorporated in a sample, the total number of bases in the resulting assembly, the size of the largest contig, and the number of bases in the largest 100, 1,000 and 100,000 contigs. Additional options can include a histogram of fold coverage of assembled contigs and alternative measures of assembly.

The second level of reporting requires a community acceptance of assembly types, similar to isolate genome assemblies. The current default methodology for metagenome assembly (use of a single assembler, with a single or best parameter selected) is proposed to be called a Draft Metagenome Assembly. Iterative and multiple assemblies coupled with the merging of contigs and validation/correction of contigs, such as that utilized at the DOE Joint Genome Institute and Los Alamos National Laboratory, could be considered high-quality draft. Additional levels of quality require technologies that are not currently adopted, including the use of general reference genomes to perform reference-guided assemblies. Finally, the best assembly possible will require sequencing and assembly of genomes gathered by use of
Definition

CLUSEAN, the CLUster SEquence ANalyzer, is a BioPerl-based software pipeline for the annotation of secondary metabolite biosynthetic gene clusters encoding the biosynthesis of molecules with, e.g., antibiotic or anticancer activities. CLUSEAN contains modules for automated homology search, protein domain identification, and, in case of modular polyketide synthases and non-ribosomal peptide synthetases-containing pathways, substrate prediction for the biosynthetic enzymes.

Introduction

A majority of antimicrobials used in human medicine to combat infectious diseases, e.g., tetracycline, penicillin, vancomycin, or erythromycin, many anticancer drugs, and other bioactive molecules, e.g., the immunosuppressant rapamycin, are derived from microbial secondary metabolites, also denoted as natural products. These compounds are mainly synthesized by bacteria

Aim and Scope of CLUSEAN

CLUSEAN, the CLUster SEquence ANalyzer (Weber et al. 2009), is a BioPerl (Stajich et al. 2002)-based tool collection that allows a semiautomatic annotation and analysis of secondary metabolite gene clusters. A typical CLUSEAN analysis run is carried out in two stages: in the first stage, the gene products of whole genomes or biosynthetic gene clusters are compared against standard databases. In the second stage, secondary metabolite-specific analyses are carried out (Fig. 1).

During the first analysis stage, similar proteins of all annotated gene products are identified using BLAST (Altschul et al. 1990) against the non-redundant protein database, and conserved protein domains are identified with the HMMER (Eddy 2001) software searching against the Pfam protein family database (Bateman et al. 2002).

In the second stage, protein domains commonly observed in the context of secondary metabolism are identified using HMMER on
C 94 CLUSEAN, Overview
CLUSEAN, Overview, Fig. 1 Data processing within the CLUSEAN annotation pipeline (Reprinted from Weber
et al. 2009 with permission from Elsevier)
a custom HMM profile database. This analysis leads to the identification of the conserved functional domains in modular polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS). Amino acid specificities of NRPS adenylation domains are predicted with an integrated NRPS predictor (Rausch et al. 2005; Röttig et al. 2011).

All annotation is provided as annotation tags in EMBL-formatted sequence flat files which can be imported in standard sequence analysis tools, e.g., the Artemis sequence editing software (Rutherford et al. 2000) or the ACT sequence comparison tool (Carver et al. 2005). The CLUSEAN annotation can be exported as tab- or comma-separated text files, or as MS Excel tables.

In addition to the prediction modules integrated into the automated pipeline script, additional tools exist to define KS types of trans-AT PKS according to Nguyen et al. (2008) and to check the presence of conserved amino acids in the catalytic domains of modular PKS and NRPS, which can indicate functionality of the enzymatic domain and thus has an influence on the synthesized product.

CLUSEAN has been included as an integral part of antiSMASH, the antibiotics and secondary metabolites analysis shell, http://antismash.secondarymetabolites.org (Medema et al. 2011), where most analysis results can be accessed interactively or downloaded on a user-friendly web page.

Availability and System Requirements

CLUSEAN is freely distributed under a GNU GPL and can be downloaded from https://bitbucket.org/tilmweber/clusean.

CLUSEAN has the following software requirements: BLAST+ 2.2.24 (or later), HMMer 2, HMMer 3, BioPerl 1.6.9 (or later),
and Perl libraries Sort::ArrayOfArrays and Spreadsheet::WriteExcel::Simple.

Summary

Mining microbial and fungal genome data is a successful novel strategy to identify producers of novel drug candidates. CLUSEAN is a widely used tool to provide automated annotation of secondary metabolite gene clusters and to extract information from the sequence data which can be basis for the deduction of the putative biosynthetic products.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, et al. The Pfam protein families database. Nucleic Acids Res. 2002;30(1):276–80.
Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J. ACT: the artemis comparison tool. Bioinformatics. 2005;21(16):3422–3.
Eddy SR. HMMER: profile hidden Markov models for biological sequence analysis. 2001. Available from: http://hmmer.janelia.org/.
Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, et al. AntiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39(Web Server issue):W339–46.
Nguyen T, Ishida K, Jenke-Kodama H, Dittmann E, Gurgui C, Hochmuth T, et al. Exploiting the mosaic structure of trans-acyltransferase polyketide synthases for natural product discovery and pathway dissection. Nat Biotechnol. 2008;26(2):225–33.
Rausch C, Weber T, Kohlbacher O, Wohlleben W, Huson DH. Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res. 2005;33(18):5799–808.
Röttig M, Medema MH, Blin K, Weber T, Rausch C, Kohlbacher O. NRPSpredictor2–a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 2011;39(Web Server issue):W362–7.
Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, et al. Artemis: sequence visualization and annotation. Bioinformatics. 2000;16(10):944–5.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et al. The Bioperl toolkit: perl modules for the life sciences. Genome Res. 2002;12(10):1611–8.
Weber T, Rausch C, Lopez P, Hoof I, Gaykova V, Huson DH, et al. CLUSEAN: a computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters. J Biotechnol. 2009;140(1–2):13–7.

Computational Approaches for Metagenomic Datasets

Synonyms

Bioinformatic analysis; Metagenome data analysis

Definition

The process of gaining information about a metagenomic community from sequence data using a variety of interdisciplinary techniques and approaches.

Introduction

The history of computational approaches to metagenomic data analysis is brief given the rapid development of the field. In 1998, a visionary paper described techniques for investigation of the molecular diversity of environmental communities and coined the term metagenome (Handelsman et al. 1998). Focus was placed on screening clone libraries for interesting biological activities, a mainly laboratory-based endeavor
which has been continually successful at identifying relevant novel genes with novel functionality. Other researchers took a more technology-driven approach by randomly sequencing metagenomic DNA from an acid mine biofilm and the well-known Sargasso Sea projects. These sequence-based approaches required considerable computational capacity for assembly and similarity searches. The Sargasso Sea project in particular provided researchers with considerable headaches with data analysis due to its sheer size. Microbiomes of humans and mice quickly followed and have remained a major source of metagenomic data to date, particularly with respect to diet, health, and disease. As sequencing has become cheaper, demand has grown for multiple groups of samples and detailed comparative analyses of time courses. Study design with control groups has in turn become more complex and critical. Some groups have even flirted with the next stage in community analysis and investigated metatranscriptomes and metaproteomes of environmental communities.

While targeted sequencing of single genes has been the norm for most projects to date (2012), many groups are becoming interested again in true metagenomics sensu stricto, i.e., the investigation of microbial community structure and function using whole genome shotgun metagenome datasets. Large studies such as Metahit (Qin et al. 2010) and a comprehensive cow rumen analysis (Hess et al. 2011) have driven the acceptance of this approach. Storage requirements and computational resources can quickly become limiting in these types of analyses, though many of the state-of-the-art algorithms described below do a good job at mitigating these factors.

In the following sections we concentrate on the principles, advantages, and problems of the main approaches to computational metagenome analysis. We highlight existing approaches and mention some of the most widely applied software in the field, which is then listed with web links in Table 1. We first deal with 16S rDNA profiling, before describing the state of the art in metagenome assembly and taxonomic assignment algorithms. Subsequently, we discuss programs intended for annotation of metagenomic sequences before making some comments on relevant statistical analyses. Lastly, we briefly review the current state of affairs in metadata collection and standards.

16S rDNA Profiling

Targeted sequencing of the ubiquitously present 16S SSU bacterial and archaeal ribosomal gene has become a common technique in deriving estimates of microbial diversity in a community. Despite its popularity, with approximately 90 % of all datasets having been produced according to this method (Davenport and Tümmler 2012), this approach is not metagenomics in the strict sense. 16S rDNA profiling completely ignores functional diversity such as gene content and accessory genome elements while also overlooking potentially important viral and eukaryotic taxa. However, this approach provides consistent qualitative estimates of bacterial and archaeal members of the community, although care must be taken with quantitative aspects. In addition, several capable software packages are available for analysis. Errors can occur for a number of reasons, including copy number variations of ribosomal RNA operons in prokaryotic genomes, the lack of coverage of “universal” primers, and multi-template PCR biases. A recent effort has incorporated copy number information of the 16S gene and reported improvements in microbial diversity estimates (Kembel et al. 2012).

In the past, 16S genes were sequenced using long Sanger reads, and only fully covered genes were used for analysis. Later, the long-read 454 sequencing technology made targeting of one or more of the shorter so-called hypervariable regions of 16S genes possible at a much reduced cost. In turn, others have investigated the use of overlapping paired end Illumina short-read technologies to sequence hypervariable regions. There is still debate about whether only targeting regions of the 16S gene leads to similar results as using the full length gene and if this leads to biases for some phylogenetic groups (Pinto and Raskin 2012).
Computational Approaches for Metagenomic Datasets, Table 1 A non-exhaustive list of software used directly or indirectly in metagenomics and mentioned in the article

Program | Availability (online tool or standalone) | Purpose | URL
Allpaths LG | Standalone | Read assembly | http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=12
PE-Assembler | Standalone | Read assembly | http://www.comp.nus.edu.sg/~bioinfo/peasm/PE_manual.htm
SSPACE | Standalone | Contig scaffolding | http://www.baseclear.com/landingpages/sspacev12/
AMOS (AMOScmp) | Standalone | Assisted read assembly | http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp
Velvet (Columbus) | Standalone | Assisted read assembly | http://www.ebi.ac.uk/~zerbino/velvet/
Newbler (runMapping) | Standalone | Assisted read assembly | http://454.com/products/analysis-software/index.asp
VAAL | Standalone | Assisted read assembly, polymorphism discovery | ftp://ftp.broadinstitute.org/pub/crd/VAAL/VAAL_manual.doc
MetaVelvet | Standalone | Metagenome assembly | http://metavelvet.dna.bio.keio.ac.jp/
Meta-IDBA | Standalone | Metagenome assembly | http://i.cs.hku.hk/~alse/hkubrg/projects/metaidba/
Cross_match | Standalone | Masking of vector sequences | http://www.phrap.org/phredphrapconsed.html
Phrap | Standalone | Long-read assembly | http://www.phrap.org/phredphrapconsed.html
CAP3 | Standalone | Long-read assembly | http://seq.cs.iastate.edu/cap3.html
Glimmer-MG | Standalone | Ab initio gene finding in metagenomic samples | http://www.cbcb.umd.edu/software/glimmer-mg/
MetaGeneMark | Online and standalone | Ab initio gene finding in metagenomic samples | http://exon.gatech.edu/metagenome/Prediction/ and http://exon.gatech.edu/license_download.cgi
FragGeneScan | Standalone | Ab initio gene finding in metagenomic samples | http://omics.informatics.indiana.edu/FragGeneScan/
MetaGeneAnnotator | Standalone | Ab initio gene finding in metagenomic samples | http://metagene.cb.k.u-tokyo.ac.jp
Orphelia | Standalone | Ab initio gene finding in metagenomic samples | http://orphelia.gobics.de/
Prodigal | Standalone | Ab initio gene finding in metagenomic samples | http://prodigal.ornl.gov/
BLAST | Standalone and online | Homology search | http://blast.ncbi.nlm.nih.gov/
BLAT | Standalone and online | Homology search | http://genome.ucsc.edu/FAQ/FAQblat.html
HMMer | Standalone and online | Homology search | http://hmmer.janelia.org/
MG-RAST | Online | Metagenomic analysis pipeline | http://metagenomics.anl.gov/
IMG/M | Online | Metagenomic analysis pipeline | http://img.jgi.doe.gov/cgi-bin/m/main.cgi
CAMERA | Online | Metagenomic analysis pipeline | http://camera.calit2.net/
WebMGA | Online | Metagenomic analysis pipeline | http://weizhong-lab.ucsd.edu/metagenomic-analysis/
(continued)
Read length plays an important role in achieving accuracy and high genomic coverage of an assembly. There is an ongoing debate whether it is sufficient to use short reads for metagenomic analysis (Luo et al. 2012) or alternatively only use reads that are as long as possible for proper annotation of genes, which ideally should include their promoters, riboswitches, co-operonic genes, and signature protein domains (Temperton and Giovannoni 2012). Regardless, using longer paired end reads will always result in more accurate assemblies better covering the length of the assembled genomes.

Due to sequencing costs, next-generation sequencing technologies offering relatively long, paired end reads and yet providing high coverage (enough to assemble underrepresented bacterial genomes) are ideal for metagenomics projects. For example, the Illumina MiSeq instrument can produce 8 Gb of 2 × 250 bp reads in one run at a lower per base cost than Sanger or 454 sequencing technologies, while still offering substantial sequence length. Another reason to avoid short single-end reads in the assembly step is that aside from problems with assembly of repetitive regions these reads are also likely to produce misassembled chimeric contigs. Generally, repetitive regions of any length can be assembled by using multiple libraries of paired end reads with varying insert sizes. Libraries with shorter insert sizes can be used to build initial contigs avoiding misassembly of repetitive reads into pseudo contigs. Longer insert libraries can be used for scaffolding and gap filling of the initial contigs. For these reasons programs like Allpaths LG and PE-Assembler (Table 1) require paired end libraries with different insert sizes. Alternatively, standalone scaffolding tools such as SSPACE (Table 1) can be utilized to scaffold already existing contigs using long insert libraries. It is recommended to filter out poor-quality reads and analyze average base quality of the remaining reads. Based on this analysis a minimal required read length can be determined for uniform or adaptive (quality-based) trimming. When references of closely related organisms are available, it is possible to perform assisted assemblies using the reference genome as a guide. Programs such as AMOScmp, Velvet (Columbus), Newbler (runMapping), and VAAL (Table 1) can be used for assisted assembly. Most short-read assemblers are designed to assemble a single genome and thus are not optimal for assembly of metagenomic samples, where reads from homologous regions of less represented genomes can be treated as error reads. Development of metagenomic assemblers, such as MetaVelvet and Meta-IDBA (Table 1), should address this problem. In these assemblers the de Bruijn graph (Flicek and Birney 2009) for the entire assembly is analyzed for presence of subgraphs corresponding to multiple bacterial genomes in the sample.

Due to high costs Sanger shotgun sequencing of metagenomic samples is less attractive. However, it can be considered for low-diversity samples. Assembly of long Sanger reads can generate nearly complete bacterial genome sequences, ideal for subsequent annotation efforts. Cloning vectors offer large insert sizes, e.g., bacterial artificial chromosomes (BACs) (up to 200 Kb), yeast artificial chromosomes (YACs) (up to 1.5 Mb), and fosmids (up to 90 Kb). Therefore, it is possible to amplify, sequence, and assemble manageably large stretches of DNA sequence randomly positioned within a genome and overlapping each other, thus leading to assembly of nearly complete genomes. Vector sequences should be excluded from the assembly using vector-masking software such as Cross_match (Table 1). The two most commonly used long-read assemblers are Phrap and CAP3 (Table 1). Trimming of poor-quality 5′- and 3′-ends should be implemented to improve the assembly.

Taxonomic Assignment (Binning)

Assignment of derived sequence reads to their taxon of origin is a key goal of most metagenomic studies. This process is also referred to as binning, as sequences are placed into “bins” representing the various taxa. Two types of assignment have been largely utilized to date, compositional and sequence similarity based.
Compositional signals depend on the concept of the genome signature. This relies on the simple idea that the composition of oligomers such as tetramers from closely related genomes is more similar than those from distantly related genomes (Mrázek 2009). There is a significant body of research on this topic involving research into identification of genomic islands, genes of aberrant composition, genome evolution, and classification of metagenome sequences. The main advantage of compositional classifiers is that they can determine associations in the absence of alignment by assessment of normalized oligomer counts. Furthermore, unsupervised machine learning techniques such as self organizing maps are not biased by the availability of fully sequenced reference sequences. The main drawback of these classifiers is the long sequences needed to derive robust oligomer statistics. For example, the program PhyloPythiaS (Table 1) and more recent frameworks typically require more than 1,000 bp of input sequence. As such, they are not able to assign the numerous short reads from modern Illumina and SOLiD sequencers to various taxa, which is certainly possible with other techniques (see next section), but they do work well on assembled contigs. This leads to problems, as contigs do not reflect the distributions of raw reads initially observed in the metagenome. Also, some distantly related organisms may not have sufficiently divergent genome signatures for assignment.

Compositional data has been used in a number of metagenomic studies. Willner et al. (2009) analyzed the compositions of 86 microbial and viral metagenomes sequenced with 100 bp 454 reads. They found that dinucleotides explained more of the variance observed than higher order nucleotides such as tetramers, although this is probably due to the short length of the read sequences used, which leads to non-robust statistics for higher order oligomers. Another advantage of oligomers is their ability to detect contamination in contigs due to the divergent oligomer profile and their relatively modest computational burden.

A more widely-used method of binning sequences is to find sequences by similarity, given that a reference sequence is available. The key advantages of these methods are that they are accurate, robust, and widely accepted and that they can also give direct knowledge of gene content following alignment. The main disadvantage is the lack of available reference sequence for some taxa, which can lead to false overrepresentations of somewhat related taxa in the estimates. Also, computation tends to be more demanding than the compositional approach. This is especially so in the case of the BLAST algorithm used in the popular software MEGAN (Table 1). MEGAN uses a lowest common ancestor approach to assign reads with two or more database hits to a taxon. If the reads hit unrelated bacteria from different phyla, the lowest common ancestor will be that prior to phylum, such as Bacteria. However, if the reads hit different species of, say, Burkholderia, the algorithm will assign the read to the genus Burkholderia. BLAST is effective since it allows alignments against the well characterized metagenomic protein space, as well as the less well-known nucleotide space.

Another popular solution is the web-based analysis toolbox MG-RAST (Table 1). MG-RAST allows taxonomic binning, but is more focused on functional investigation and comparison of metagenomes. It is further detailed in the Annotation section below. WebMGA (Table 1) is an alternative very capable metagenomics web server which uses efficient algorithms such as FR-HIT and CD-HIT for flexible read alignment and highly efficient clustering, respectively. MetaPhlAn (Table 1) attempts to optimize unique clade-specific marker genes as a reduced reference sequence of about 400 thousand genes most representative of each taxonomic unit and map reads to it. This kind of mapping potentially allows assignment of reads at higher taxonomic resolution, such as the species level, and has the advantage of being extremely rapid. A further solution which seeks to use curated reference sequences is Genometa (Table 1). This GUI program puts emphasis on finding the mapping coordinates of even very short reads in a genome to check if a taxon is actually present, or if it is more likely to be either contamination or just a related ORF or genomic island.
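The two assignment strategies discussed above can each be sketched in a few lines of Python. The first part computes a normalized tetranucleotide profile and a simple distance between two profiles, as a stand-in for the genome signature. The second part returns the lowest common ancestor of a set of database hits. Both are minimal illustrations under stated assumptions: the function names are hypothetical, lineages are supplied as plain lists ordered from the root downward, reverse complements are not merged, and nothing here reproduces the actual implementation of PhyloPythiaS or MEGAN.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def tetramer_profile(sequence: str) -> Dict[str, float]:
    """Return normalized tetranucleotide frequencies of a DNA sequence."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Only windows made of A, C, G, T are counted; windows with N are ignored.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


def profile_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Manhattan distance between two profiles; smaller means more similar."""
    return sum(abs(a[k] - b[k]) for k in a)


def lowest_common_ancestor(lineages: List[List[str]]) -> str:
    """Return the deepest taxon shared by all lineages (each listed root first)."""
    if not lineages:
        return "unassigned"
    shared = "root"
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits diverge at this rank
        shared = ranks[0]
    return shared


# Hits to two species of one genus resolve to that genus.
same_genus = [
    ["Bacteria", "Proteobacteria", "Burkholderia", "Burkholderia cepacia"],
    ["Bacteria", "Proteobacteria", "Burkholderia", "Burkholderia mallei"],
]
print(lowest_common_ancestor(same_genus))  # Burkholderia

# Hits to different phyla only agree at the domain level.
different_phyla = [
    ["Bacteria", "Firmicutes", "Bacillus"],
    ["Bacteria", "Proteobacteria", "Escherichia"],
]
print(lowest_common_ancestor(different_phyla))  # Bacteria
```

The profile functions also show why short reads are problematic for compositional methods: a 100 bp read yields only 97 tetramer windows spread over 256 possible tetramers, so most frequencies are zero or based on a single observation.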
Other programs aim to combine compositional and similarity based tools. A well-known approach is PhymmBL (Table 1). This program uses both BLAST and compositional attributes to assign even reads as short as 100 bp. The authors found this technique to be more accurate than either of the methods alone and have continued to improve their software. In general, binning is still a difficult task, and algorithms which work very well on one dataset may be extremely limited on the next. As such, we recommend using at least two binning approaches on the sample to gain the maximum possible information.

Annotation

Annotation of metagenomics samples requires identification of features of interest in the assembled fragments or reads binned into their Operational Taxonomic Units (OTUs). For ab initio identification of potential gene sequences entire ORFs should be located. Various software exists to perform this task in metagenomic projects, e.g., Glimmer-MG, MetaGeneMark, FragGeneScan, MetaGeneAnnotator, Orphelia (Table 1). These programs utilize various types of Markov models for analysis of codon usage or frequency of other genome composition elements in the binned genomes. However, instead of a single model the analysis is based on multiple Markov models trained with data from a large variety of bacterial species. The trained model providing the best fit is then selected for gene prediction. Given the complexity of metagenomic assemblies, especially when only short reads are utilized, it is expected that a large proportion of assembled contigs may only have partial ORFs. These sequences can still be included in homology analysis using BLAST, BLAT, or HMM (Table 1) searches against gene or protein nonredundant databases. There are a number of online pipelines, e.g., MG-RAST, IMG/M, and CAMERA (Table 1), available for ab initio- and homology-based DNA structure annotation as well as functional analysis of identified genes using a battery of publicly available databases that can be used for functional annotation, such as PFAM, TIGRFAM, KEGG, EggNOG, COG, SEED, GenBank, RefSeq, UniProt, GO, and PATRIC (Table 2).

MG-RAST allows the users to upload their sequence data (in FASTA, FASTQ, and SFF format) and metadata. The uploaded data are preferred to be shared, but this is not mandatory. The data are quality controlled with the QC pipeline based on the settings provided by the user. The QC pipeline features include read quality filtration and trimming, dereplication, model organism screening, demultiplexing and merging mate pairs. Currently, the minimal accepted read length is 75 bp. Assemblies can also be submitted. Starting from version 4.0 the pipeline will also support read assembly. Submissions to this pipeline are queued and submitted for feature prediction using FragGeneScan, which identifies the most likely reading frame and performs homology search on translated features. A program called Uclust is then used to cluster 90 % identical protein fragments. The number of reads in each cluster is identified to estimate abundances. The pipeline also provides various visualization tools to view the results and to perform comparative analysis with over 590 public metagenomes.

IMG/M concentrates on comparative analysis of microbial genomes. The pipeline accepts assembled or unassembled reads. Unassembled reads are quality controlled, trimmed, and dereplicated; their low-complexity regions are masked. Aside from protein coding genes, ab initio gene finding also includes detection of CRISPRs and noncoding RNA. RNA detection is performed using tRNAscan-SE for tRNAs and HMM models for rRNAs. Coding sequences are predicted using a combination of Prodigal, Metagene, MetaGeneMark, and FragGeneScan (Table 1). Longer sequences are also searched against a local nonredundant protein database using BLASTX. IMG/M provides functional annotation of the entire metagenome and supports functional comparisons to other stored annotated metagenomes. Various visualization tools facilitate this kind of comparison,
e.g., Phylogenetic Distribution of Genes or obtained by multiple methods. An important step
Radial Phylogenetic Tree. to ensure comparability is normalization. Nor-
CAMERA provides a collection of online malization of these metrics can be undertaken
tools for metagenomics analysis. The provided using relative abundances, GC content, genome
tools allow the following analysis steps: sequence size, or prevalence of single-copy genes. How-
QC, sequence assembly, ORF prediction, RNA ever, particularly normalization of true
prediction, BLAST, clustering, functional annotation, and viral diversity estimation.

InterProScan (Table 1) is one of the most advanced programs for protein functional analysis. It incorporates BLAST and HMM searches against an array of protein domain and functional site databases (PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D, PANTHER, and HAMAP (Table 2)). Online and locally installed versions are available. Due to the highly CPU-intensive nature of the BLAST and HMM searches, it is recommended to run this program on a computer cluster.

MEGAN is a standalone tool for visualization of BLAST search results as taxonomic dendrograms, functional dendrograms using the SEED classification, pathways using KEGG orthology, comparative visualization, etc. A good collection of general information about other available metagenomics software and resources can be found on http://seqanswers.com/wiki/Metagenomics.

Statistical Analysis

Early metagenomic datasets, such as the Sargasso Sea, were relatively simple surveying projects by design. Attempts were made to quantitate species abundances using relative abundance of reads and presence of 16S rRNA and single-copy genes. Later studies then focused more on comparative spatial or temporal variation of the microbial community. Due to this increasing sophistication, multiple metrics for characterizing the complexity of the community have been developed. As many projects in metagenome analysis are not based on strict hypothesis testing, exploratory data analysis techniques such as multivariate statistics are often employed (see below). Generally, the metric under study is the estimate of abundance of a taxon, which can be [...]

[...] metagenomic data is in a state of flux with little current consensus. Care must be taken with the GC content delivered by sequencers as a result of the different sample preparation and amplification schemes. It must be assumed that all sequencing runs have some form of quantitative bias against either or both low GC and high GC organisms, meaning that they will be underrepresented in the samples. This problem has not been widely considered in metagenomics to date. GC bias assessment programs, such as Picard’s CollectGcBiasMetrics (Table 1), are particularly useful in observing and quantifying relative bias of read coverage at different GC values using just a reference sequence and a BAM alignment file. Larger genomes are more likely to be sampled in a randomly sheared metagenomic DNA sample. This can be compensated for by normalizing for genome length, if applicable for the taxonomic attribution method used.

Many metrics have been taken directly from the field of ecology. Alpha, beta, and gamma diversity summarize the species diversity in one habitat, species diversity across multiple habitats, and total species diversity across a larger-scale landscape, respectively. Species richness is simply the number of species found, while species diversity includes a measure of the abundance of members of each species. Other measures such as Shannon and Simpson indices are also available. One use case is from Dinsdale and coworkers (2008), where functional metagenomic diversity was characterized separately across a range of bacterial and viral genomes in many different habitats. Interestingly, functional metagenomics was reported by Dinsdale and coworkers to explain a larger proportion of the variance in each dataset (about 75 %), and thus to be more predictive of metabolic capacity within the taxa of an ecosystem, than analysis of taxa by 16S rRNA genes only (about 10 %).
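The richness and diversity measures named above can be illustrated with a short sketch. The Python below is not taken from any of the cited tools; it assumes that abundances are available as read counts per taxon, and the function names and toy counts are hypothetical.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over observed taxa."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2): probability that two reads
    drawn with replacement belong to different taxa."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical read counts per taxon for two samples (illustrative only).
even_sample = {"taxon_a": 25, "taxon_b": 25, "taxon_c": 25, "taxon_d": 25}
skewed_sample = {"taxon_a": 97, "taxon_b": 1, "taxon_c": 1, "taxon_d": 1}

for name, sample in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, richness(sample),
          round(shannon_index(sample), 3), round(simpson_index(sample), 3))
```

Both toy samples have the same richness, but the skewed sample yields lower Shannon and Simpson values, which is the distinction between richness and diversity drawn in the text.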
The aforementioned indices attempt to characterize a highly multidimensional dataset into a single number, which can be useful as a summary but obscures the underlying data. Therefore, advanced ordination methods for multidimensional datasets such as principal components analysis (PCA) and multidimensional scaling (MDS) have been applied to differentiate communities and reveal associations with abiotic parameters. Whichever of the many ordination methods is chosen, it is of great importance to check the variance explained by the observed components or functions. Where the principal components or alternative statistics explain little of the variance in the data, this indicates that the variation in the data cannot be explained by the variables measured, and caution must be taken in interpreting the results. Various clustering algorithms have also been demonstrably useful in grouping similar datasets based on measures from normalized read counts to oligomer content and differentiating them from controls in a manner identical to the clustering schemes popular in microarray group expression analysis (Mrázek 2009). Clustering can also be important for quality control and identification of outlier microbial communities, which may also be attributed to technical artifacts.

Further types of statistical community comparison metrics have been developed especially for metagenomics. One example is the UniFrac distance metrics used to calculate a distance measure between microbial communities using information from a supplied phylogenetic tree (Hamady et al. 2010). UniFrac uses a beta diversity measure detailing community membership over space and time, which has distinct advantages, and the phylogenetic tree method shows improvements over comparing simple lists of taxa.

Experimental design is of paramount importance in obtaining robust statistical results. Since estimates of microbial communities tend to be noisy, replicates are necessary to gain a reliable assessment of variance. As finance is usually the limiting factor, either samples can be sequenced at a lesser depth or cheaper sequencing technologies can be used.

As with any statistical analysis, care must be taken when performing multiple tests due to frequent generation of false positives. While Bonferroni corrections are extremely good at removing false positive test results, the extreme stringency of this method will certainly mask a number of biologically true associations (false negatives). As such, we advocate the use of less stringent tests such as the Benjamini-Hochberg false discovery rate method (FDR; van den Oord and Sullivan 2003). Lastly, it should be noted that extensive and high quality metadata is crucial to observing and quantitating trends in microbial community structure.

Metadata

Collection of metadata about metagenomes is essential for making the sequence data and analysis results meaningful and reusable by the scientific community. Moreover, properly collected and complete metadata can also help the scientists originally analyzing a metagenomic sample to draw conclusions about their findings that otherwise may be overlooked. A first step in this direction is development of the minimum information about a genome sequence (MIGS) specification and its extension to the minimum information about a metagenome sequence (MIMS) specification by the Genomic Standards Consortium (GSC). MIGS provides general information about a genomic sequence, similar to what is collected by the NCBI Trace Archive or NCBI Short Read Archive, extended to more detailed metadata about environment, nucleic acid sequence source, and assay preparation. MIMS extends this specification to also include metadata about the habitat, e.g., temperature, pH, salinity, pressure, chlorophyll, conductivity, light intensity, dissolved organic carbon (DOC), current, atmospheric data, density, alkalinity, dissolved oxygen, particulate organic carbon (POC), phosphate, nitrate, sulfates, sulfides, and primary production (Field et al. 2008). An XML schema is used to implement the MIGS/MIMS checklist. This schema is the basis for ongoing development of the Genomic Contextual Data Markup Language (GCDML). This language should support polymorphic validation of various taxa (requiring different checklists) and development of ontologies.

Another interesting resource that addresses the need for sharing standardized metagenomics data is the Genomes OnLine Database (GOLD, http://www.genomesonline.org/). This database contains a collection of completed and ongoing projects with the associated metadata, which are based on a controlled vocabulary coordinated with the GSC.

Another online resource that collects GSC-compliant metadata is CAMERA, already mentioned in the Annotation section of this review. CAMERA is involved in GSC activities and provides input for development of metagenomic metadata standards that are also used for submission of metagenomic data to CAMERA.

Summary

[...] advanced analysis software also increases. In conclusion, computational analysis of metagenomic samples is becoming more affordable and available to the research community and provides exciting research and software development opportunities.

Advances in metagenomic analysis of microbial communities also provide opportunities for metatranscriptomic and metaproteomic research.

Cross-References

▶ Lessons Learned from Simulated Metagenomic Datasets
▶ Metagenomics, Metadata, and Meta-analysis
▶ Nucleotide Composition Analysis: Use in Metagenome Analysis
▶ Phylogenetics, Overview
▶ Silva Databases

References
[...] microbial diversity and abundance. PLoS Comput Biol. 2012. doi:10.1371/journal.pcbi.1002743.
Luo C, Tsementzi D, Kyrpides NC, et al. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012;6:898–901.
Mrázek J. Phylogenetic signals in DNA composition: limitations and prospects. Mol Biol Evol. 2009;26:1163–9.
Pinto AJ, Raskin L. PCR biases distort bacterial and archaeal community structure in pyrosequencing datasets. PLoS One. 2012;7:e43093.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Temperton B, Giovannoni SJ. Metagenomics: microbial diversity through a scratched lens. Curr Opin Microbiol. 2012;15:605–12.
van den Oord EJCG, Sullivan PF. False discoveries and models for gene discovery. Trends Genet. 2003;19:537–42.
Wendl MC, Kota K, Weinstock GM, et al. Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens theorem. J Math Biol. 2012. doi:10.1007/s00285-012-0586-x.
Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes. Environ Microbiol. 2009;11:1752–66.

Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes

Yong Wang¹ and Pei-Yuan Qian²
¹Division of Deep Sea Science, Sanya Institute of Deep Sea Science and Engineering, San Ya, Hainan, China
²KAUST Global Collaborative Program, Division of Life Science, Hong Kong University of Science and Technology, Hong Kong, China

Definition

The taxonomic classification of prokaryotic organisms based on morphological differences is difficult. A ribosomal RNA (rRNA) sequence has many polymorphic sites that can act as a genetic earmark to uncover the genetic background of prokaryotes (Fox 2010). Bacterial and archaeal genomes contain one to several copies of 16S rRNA genes (rDNA), depending on the physiological condition of the microbes (Klappenbach et al. 2000; Liao 2000). Because rRNA sequences are passed vertically to the next generation, they cannot be inherited by a different species. Hence, 16S rRNA sequences are considered to be a stable marker of morphological difference and have been applied in the taxonomic classification of prokaryotes (Woese 1987). Since the 1980s, partial and full-length sequences have been obtained using polymerase chain reaction (PCR) technology (Lane et al. 1985). These sequences have been deposited in the Ribosomal Database Project (RDP), Greengenes (http://greengenes.lbl.gov), and SILVA rRNA public databases (Cole et al. 2009; Pruesse et al. 2007). The closest relative of a microbial organism of interest can be identified by comparing the organism’s rRNA sequence with the collected sequences of known species (DeLong 1992; Fuhrman et al. 1992). Moreover, with the development of next-generation sequencing techniques (Quail et al. 2012), rRNA sequences of microbial communities in environmental samples can be obtained in large numbers in a short period (the efficiency is platform dependent). These advances in detection have dramatically improved our understanding of the communities of environmental microbes in different sites on Earth (Qian et al. 2011; Roussel et al. 2008).

Primer design is critical regardless of which sequencing method is employed, the Sanger method or next-generation methods. With the rapid advances in metagenomics, environmental samples can now consist of thousands of microbial species (Tremaroli and Backhed 2012). Therefore, it is important to use primers that are suitable for most of the species to fully investigate the microbial community. If the primers fail to land on the matching parts of the rDNA of certain dominant microbes, then these species will be excluded from the PCR amplicons, resulting in a poor survey of the community. For example, in a study of the microbial communities in the Red Sea, the selected primers failed to capture almost the entire SAR11 group belonging to alpha-Proteobacteria (Qian et al. 2011). Primer specification is also a major concern in other studies (Huse et al. 2008; Huws et al. 2007; Klindworth et al. 2013).

The strategy for primer design is based on the conservation of the target sequences. Primers are designed to obtain the variant sequences between two conserved regions. The degree of conservation of the regions directly contributes to the coverage rate of the primers targeting a community in an environmental sample. The conserved regions in 16S rRNA sequences are involved in essential translational functions and interact with ribosomal proteins. For instance, universally conserved sites G530, A1492, and A1493 in 16S rRNA sequences are crucial for tRNA binding in the A site (Brimacombe and Stiege 1985; Demeshkina et al. 2012). Along with the neighboring conserved sites, these sites have been recognized to be ideal regions for primer design, as exemplified by the frequently used universal primers U519 and U1492 (Baker et al. 2003). Apart from the conserved regions, there are a total of nine variant regions that correspond to the species-specific structural sequences of ribosomal RNA (Huws et al. 2007; Wang and Qian 2009). The variant regions can be obtained through PCR, followed by sequencing and comparison for taxonomic assignment.

[...] regions were searched again using nonredundant core sequences from the SILVA database. A total of 11 bacterial and seven archaeal segments with degeneration sites were obtained. Because the nonredundant sequences were used and many of them were incomplete, the identified conserved sequences had more polymorphisms and the conservation degree at both ends of the 16S rDNA sequences could not be evaluated (Table 1). However, three new conserved regions, located at the bacterial 252–275 and 547–575 regions and the archaeal 560–578 region, were found in these sequences. In the overlapping segment between 565 and 575, a universally conserved region was recognized for bacterial and archaeal sequences: 5′-TGGG[C/T][C/G/T]TAAAG-3′. This region has been used to design primers for the identification of clinical bacteria (Nikkari et al. 2002). The positions of the conserved regions are standardized to the approximate positions on Escherichia coli 16S rDNA. It is interesting that all the archaeal conserved segments have corresponding bacterial segments at the same standardized positions and share some conserved sites with the bacterial counterparts (Table 1).
Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental
Microbes, Table 1 Conserved regions in archaeal and bacterial 16S rRNA core sequences
Start End Conserved sequence
Bacteria
252 275 TTGGYRRGGTAAHRGCYYACCAAG
311 365 CCACAHKGGVACTGAGAYACKGBCCACCTACGGGWGGCWGCAGTVRRGAAT
507 536 CTAACTHYGTGCCAGCAGCCGCGGTAAKAC
547 575 AGCGTTRYYCGGAWTYAYTGGGYKTAAAG
683 707 GTGTAGVRGTGAAATBCGTWGAKAT
765 806 GAAAGCKWGGGKAGCRAACRGGATTAGATACCCBGGTAGTCC
883 932 CTGGGRAGTACGVYCGCAAGRBTRAAACTCAAAGGAATTGACGGGGRCYC
935 986 ACAAGCRGYGGAGYRTGTGGYTTAATTCGAHRMWAMGCGMRRAACCTTACC
1,045 1,062 CAGGTGBTGCATGGYTGT
1,067 1,085 AGCTCGTGYCGTGAGRTGT
1,090 1,113 TTAAGTSCBRYAACGAGCGCAACC
Archaea
325 359 CWRGYCCTACGGGRYGCAGCAGKCGCGAAAMCTYY
514 539 GGTGYCAGCCGCCGCGGTAAHACCGC
560 578 WTTAYTGGGYYTAAAGCRT
679 701 GACRGTGAGGRAYGAARSCYDGG
781 806 CRAWCSGGATTAGACCCSRGTAGTCC
883 931 CTGGGRAGTAYGRYCGCAAGRYTGAAACTTAARGGAATTGGCGGGGGAG
953 972 GGTTYAATYGRABTCAACGC
The conserved regions were obtained by searching consecutive conserved sites in alignment file of 339 archaeal and
1,845 bacterial nonredundant core 16S rDNA sequences in SILVA database (release 108). Cutoff percentage of
occurrence of a nucleotide at a conserved site is 90 %. The positions are according to Escherichia coli 16S rDNA
positions. Abbreviations for degeneration sites are Y for C or T, R for A or G, W for A or T, K for G or T, M for C or A,
S for C or G, V for not T, H for not G, B for not A, and D for not C
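The degenerate codes in the footnote above, and the notion of a primer's coverage rate, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a primer is counted as covering a sequence when it matches exactly on the given strand with no mismatches allowed, which may differ from the criterion used in the study; the function names and the toy sequences are hypothetical and are not real 16S rDNA data.

```python
import re

# Degenerate-base codes as listed in the Table 1 footnote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "K": "GT", "M": "AC", "S": "CG",
    "V": "ACG", "H": "ACT", "B": "CGT", "D": "AGT",
}


def primer_to_regex(primer):
    """Translate a degenerate primer into a compiled regular expression."""
    parts = []
    for base in primer.upper():
        allowed = IUPAC[base]  # raises KeyError for symbols outside the table
        parts.append(allowed if len(allowed) == 1 else "[" + allowed + "]")
    return re.compile("".join(parts))


def coverage_rate(primer, sequences):
    """Percentage of sequences that contain an exact match to the primer."""
    if not sequences:
        raise ValueError("no sequences supplied")
    pattern = primer_to_regex(primer)
    hits = sum(1 for seq in sequences
               if pattern.search(seq.upper().replace("U", "T")))
    return 100.0 * hits / len(sequences)


# Hypothetical toy sequences, for illustration only.
toy_sequences = [
    "ACGTTGGGCGTAAAGCGT",  # contains a match
    "ACGTTGGGTTTAAAGCGT",  # contains a match
    "ACGTTGGGAATAAAGCGT",  # A at the Y position, so no match
    "ACGUUGGGCCUAAAGCGU",  # RNA alphabet, matched after U is read as T
]
universal = "TGGGYBTAAAG"  # TGGG[C/T][C/G/T]TAAAG written with the codes above
print(primer_to_regex(universal).pattern)
print(coverage_rate(universal, toy_sequences))
```

Allowing mismatches, or searching the reverse complement for reverse primers, would change the computed rate; both are left out here to keep the sketch short.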
[...] 88.5 % and 81.1 %, respectively (Table 2). Two candidates, 683–700 and 691–707, were selected from the bacterial conserved region of 683–707 for the test; both were associated with low coverage rates. Obviously, more degeneration sites have been introduced in these bacterial primers compared with the same sets described previously (Wang and Qian 2009). This means that more polymorphisms emerged in the nonredundant dataset, which results in the low rates for the two primers. Thus, the primers from this rRNA region are not recommended due to their generally low coverage rate, although a previous study obtained a 90.5 % coverage rate for a similar primer (Wang and Qian 2009). The overall quality of the other candidate primers is high, with the average coverage rate being 92.7 % (Table 2). The best bacterial primers in this study were located at the E. coli positions of 547–568, 556–575, 907–928, and 1,046–1,062. These primers are able to recover more than 95 % of the bacterial 16S rDNA sequences. For the archaeal candidates, the two candidate primers in the region of 514–539 have the highest coverage rates, 93.5 % and 93.8 %, respectively. Thus, the archaeal rDNA sequences appear to be more difficult to cover fully than the bacterial sequences, considering the average coverage rate of 90.8 % with the low rates for the four primers at both ends ignored. In regard to universal primers, this study recommends the primers at the 515–533, 785–806, and 907–928 positions. High coverage rates for these primers were confirmed using the bacterial and archaeal datasets (Table 2).

Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes, Table 2 Evaluation of candidate primers
Position Sequence % coverage
Bacteria
259–275 GGTAAHRGCYYACCAAG 93.6 %
321–338 ACTGAGAYACKGBCCACC 86.6 %
334–353 CCACCTACGGGWGGCWGCAG 94.1 %
515–533 GTGCCAGCAGCCGCGGTAA 93.6 %
547–568 AGCGTTRYYCGGAWTYAYTGGG 95.1 %
556–575 CGGAWTYAYTGGGYKTAAAG 96.9 %
683–700 GTGTAGVRGTGAAATBCG 88.5 %
691–707 GTGAAATBCGTWGAKAT 75.9 %
765–782 GAAAGCKWGGGKAGCRAA 82.1 %
785–806 GGATTAGATACCCBGGTAGTCC 94.7 %
907–928 AAACTCAAAGGAATTGACGGGG 96.6 %
946–964 AGYRTGTGGYTTAATTCGA 92.5 %
1,046–1,062 AGGTGBTGCATGGYTGT 95.8 %
1,067–1,085 AGCTCGTGYCGTGAGRTGT 93.0 %
1,090–1,113 TTAAGTSCBRYAACGAGC 90.8 %
Archaea
328–346 GYCCTACGGGRYGCAGCAG 83.8 % (a)
340–357 GCAGCAGKCGCGAAAMCT 80.2 % (a)
514–533 GGTGYCAGCCGCCGCGGTAA 93.8 %
519–539 CAGCCGCCGCGGTAAHACCGC 93.5 %
560–578 ATTAYTGGGYYTAAAGCRT 90.3 %
683–701 GTGAGGRAYGAARSCYDGG 81.1 %
785–806 GGATTAGATACCCSRGTAGTCC 90.9 %
883–902 CTGGGRAGTAYGRYCGCAAG 92.6 %
897–914 CGCAAGRYTGAAACTTAA 91.4 %
907–928 AAACTTAARGGAATTGGCGGGG 92.9 %
916–931 GGAATTGGCGGGGGAG 82.9 % (a)
953–972 GGTTYAATYGRABTCAACGC 79.9 % (a)
Abbreviations for the degenerated nucleotides are given in Table 1
(a) Coverage percentages that need an adjustment due to incompleteness of some short 16S rDNA sequences at the region

Summary

A short list of 16S rDNA primers has been compiled using simplified nonredundant rDNA datasets. These primers will be useful for identifying environmental microbes, as they are capable of detecting more than 90 % of the known bacteria and archaea. However, the number of prokaryotic organisms that resist being captured by these 16S rRNA primers cannot be estimated. It has been estimated that only about 1 % of the microbes on Earth are culturable. Moreover, the microbes colonizing extreme and geologically isolated environments are far from being completely explored (Pace 1997; Sogin et al. 2006). The conserved sites in the 16S rDNA sequences will be degenerated if more sequences and polymorphisms are revealed. Alternatively, the conserved sites will be more evident if full-length 16S rRNA sequences from variant rare biospheres are found to exhibit the same conservation patterns. In return, these conservation patterns may help to understand the role of ribosomal RNAs in protein translation.

The method proposed here is also useful for generating specific primers for taxa of interest at lower taxonomic levels in an environment. In a previous report, the CHECK_PROBE program in the RDP database and the BLAST program were employed to predict cyanobacteria-specific
primers (Nubel et al. 1997). However, there are problems with these methods as demonstrated previously (Wang and Qian 2009). Hence, the method here is recommended since it may enable more specific primers to be generated for different taxonomic levels.

Cross-References

▶ Binning Sequences Using Very Sparse Labels Within a Metagenome
▶ Challenge of Metagenome Assembly and Possible Standards
▶ I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments
▶ RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

References

Baker GC, Smith JJ, Cowan DA. Review and re-analysis of domain-specific 16S primers. J Microbiol Methods. 2003;55:541–55.
Brimacombe R, Stiege W. Structure and function of ribosomal RNA. Biochem J. 1985;229:1–17.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:141–5.
DeLong EF. Archaea in coastal marine environments. Proc Natl Acad Sci U S A. 1992;89:5685–9.
Demeshkina N, Jenner L, Westhof E, Yusupov M, Yusupova G. A new understanding of the decoding principle on the ribosome. Nature. 2012;484:256–9.
Fox GE. Origin and evolution of the ribosome. Cold Spring Harb Perspect Biol. 2010;2:1–18.
Fuhrman JA, McCallum K, Davis AA. Novel major archaebacterial group from marine plankton. Nature. 1992;356:148–9.
Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255.
Huws SA, Edwards JE, Kim EJ, Scollan ND. Specificity and sensitivity of eubacterial primers utilized for molecular profiling of bacteria within complex microbial ecosystems. J Microbiol Methods. 2007;70:565–9.
Klappenbach JA, Dunbar JM, Schmidt TM. rRNA operon copy number reflects ecological strategies of Bacteria. Appl Environ Microbiol. 2000;66:1328–33.
Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, Horn M, Glockner FO. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 2013;41:e1.
Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A. 1985;82:6955–9.
Liao D. Gene conversion drives within genic sequences: concerted evolution of ribosomal RNA genes in bacteria and archaea. J Mol Evol. 2000;51:305–17.
Nikkari S, Lopez FA, Lepp PW, Cieslak PR, Ladd-Wilson S, Passaro D, Danila R, Relman DA. Broad-range bacterial detection and the analysis of unexplained death and critical illness. Emerg Infect Dis. 2002;8:188–94.
Nubel U, Garcia-Pichel F, Muyzer G. PCR primers to amplify 16S rRNA genes from cyanobacteria. Appl Environ Microbiol. 1997;63:3327–32.
Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–40.
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.
Qian P-Y, Wang Y, Lee OO, Lau SCK, Yang J, Lafi FF, Al-Suwailem A, Wong TYH. Vertical stratification of microbial communities in the Red Sea revealed by 16S rDNA pyrosequencing. ISME J. 2011;5:507–18.
Quail M, Smith M, Coupland P, Otto T, Harris S, Connor T, Bertoni A, Swerdlow H, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341.
Roussel EG, Bonavita M-AC, Querellou J, Cragg BA, Webster G, Prieur D, Parkes RJ. Extending the sub-sea-floor biosphere. Science. 2008;320:1046.
Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored rare biosphere. Proc Natl Acad Sci U S A. 2006;103:12115–20.
Tremaroli V, Backhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Wang Y, Qian P-Y. Conservative fragments in bacterial 16S rRNA genes and primer design for 16S ribosomal DNA amplicons in metagenomic studies. PLoS ONE. 2009;4:e7401.
Woese CR. Bacterial evolution. Microbiol Rev. 1987;51:221–71.
Culture Collections in the Study of Microbial Diversity, Importance

Martin Sievers
Zurich University of Applied Sciences, Institute of Biotechnology, Waedenswil, Switzerland

Introduction

Prokaryotes, which comprise the bacterial and archaeal domains, show very high biodiversity. Over 50 different phyla including candidate phyla with cultivable species and uncultivable representatives, which are only characterized via metagenomics, have been detected. Microbial strains are ubiquitous and are able to grow in extreme environments, and determining their functions and activities in the environment is essential for our understanding of life. Culture collections can help in the cataloging and preservation of microbial strains and their genomic DNA.

Role of Culture Collections

Culture collections are important in the preservation of biodiversity and thus contribute to the objectives of the Convention on Biological Diversity (CBD; www.cbd.int) through the preservation of important genetic resources.

The primary function of microbial culture collections is to gather, maintain, and distribute strains which have unique properties and are of practical value in various applications like research, teaching, quality control assays, and biotechnology (Uruburu 2003; Emerson and Wilson 2009). Culture collections supply their users with well-characterized strains and replicable parts (plasmid, DNA) as well as with the associated documentation relevant to these biological materials. Cultures, strains, and DNA from culture collections are distributed with a material transfer agreement (MTA) which provides users with all relevant handling information and regulations for commercial use of the supplied biological material.

Type strains, which constitute the name-bearing reference strain of a species and are often used in the study of bacterial systematics, are available from culture collections worldwide. Type strains must be deposited in two public collections in two different countries in order to have the name and thus the species validated (Stackebrandt 2010).

Culture collections are a valuable resource for the exploitation of biological diversity and can help countries rich in biodiversity to understand and utilize their microbial diversity more effectively (Arora et al. 2005). Culture collections also act as an interface between their providers and users of genetic resources to support fair and equitable sharing of the benefits based on documents like Prior Informed Consent (PIC) and mutually agreed terms (Sievers et al. 2010). In fulfilling these roles, culture collections have several responsibilities regarding biosafety requirements. These include compliance with international agreements and conventions on biodiversity, the support of researchers seeking intellectual property rights, the implementation of new technologies, and the search for additional funding for their vital work.

In addition, microbial culture collections which are recognized as international depositary authorities (IDA) offer deposition of microorganisms involved in inventions for patent purposes according to the Budapest Treaty (http://www.wipo.int/treaties/en/registration/budapest/trtdocs_wo002.html).

Importance of Microorganisms

Microbial strains are used for a wide range of scientific, industrial, and health-care applications, for example, as sources of enzymes, proteins, vitamins, organic acids, bioactive compounds, antimicrobial peptides, and biopolymers. Microorganisms are used in agriculture as bio-fertilizer, in wastewater treatment as agents for degradation of compounds with complex structures, for metal recovery, to catalyze specific chemical reactions, for bioenergy production, as starter cultures in the
production of fermented food, as probiotics, and as reference material in diagnostics and development of new therapeutics.

In contrast to their beneficial relatives, pathogenic microbes cause severe diseases in humans, animals, and plants, resulting in significant economic loss and risk to global health. The useful products and processes provided by microorganisms can be grouped into four broad categories: fine chemicals, processes, commodities, and emerging technologies (Kuo and Garrity 2002). Thus, the use and the study of microorganisms contribute to further economic growth and health promotion and are of immense social and ecological value (Komagata 1999; Prakash et al. 2012; Smith 2003).

Microbial Diversity

Due to their myriad environmental roles and functions, microorganisms are important components of the world’s biodiversity. Microbial diversity refers to the richness and degree of variability among species and strains within an ecosystem. Microbial communities of an investigated sample are composed of species which could be isolated as well as the “silent majority” species which are considered non-culturable under standard laboratory conditions and only their DNA is accessible for genetic characterization. The richness of bacterial species is highly variable in different environmental communities. Some environments like the upper atmosphere, glacial ice, and highly acidic stream waters have low numbers of bacterial species in comparison to soil, microbial mats, and marine water, which harbor vast numbers of bacterial species (Fierer and Lennon 2011). The estimation of the number of bacterial species per gram of soil is not a trivial task. Metagenomic approaches based on analysis of environmental DNA sequence data help to study microbial communities and to estimate their species richness. Based on high-throughput 16S rDNA pyrosequencing and phylogenetic analysis, the most abundant species of bacteria in different soil samples were assigned to the phyla Proteobacteria and Bacteroidetes (Roesch et al. 2007). A large percentage of soil bacteria could not be isolated by cultivation (between 90 % and 99 % in a given sample). Soil bacteria of the phyla Acidobacteria and Verrucomicrobia are poorly represented in pure cultures, and members of the Actinobacteria, Firmicutes, and Proteobacteria, in contrast, are well represented in culture collections. For example, the most dominant genus of the American Type Culture Collection (ATCC) soil accessions is Streptomyces, belonging to the phylum Actinobacteria, reflecting their importance as producers of bioactive compounds and in soil ecology (Floyd et al. 2005).

Currently, culture collections cover only a fraction of the diversity of microorganisms and will benefit from the deposition of new strains which are suitable for industrial use since they represent a rich and abundant source of novel molecules with various biological activities.

Identification of Strains at Species Level

Isolated strains are checked for purity microscopically, for morphological homogeneity by uniformity of colony form on agar plate, by distinct color formation on chromogenic agar, and by confirmation with denaturing gradient gel electrophoresis (DGGE; single band obtained). Identification of pure strains at species level is usually performed using ribosomal RNA gene sequence analysis. Housekeeping genes encoding RNA polymerase beta subunit (rpoB), RNA polymerase sigma factor (rpoD), gyrase beta subunit (gyrB), recombinase A (recA), or heat shock protein (hsp60) provide in some cases better genetic resolution on the species level than the 16S rDNA sequence used in taxonomic studies. Combinations of housekeeping genes in multi-locus analyses provide a taxonomic tool for identification of prokaryotes at species and strain level (Moore et al. 2010). DNA sequences in combination with protein spectra obtained by MALDI-TOF-MS are very efficient for identifying strains at species level. MALDI-TOF-MS used for species identification generates protein spectra in the size range between 2 and 20 kDa which is dominated by ribosomal
proteins. By use of this technology, the generated spectra of an unknown strain are compared with a reference data bank (Wieser et al. 2012). DNA-DNA hybridization (DDH) values are used to determine relatedness between strains, and strains belong to the same species when DDH values are approximately 70 % or greater (Wayne et al. 1987). Average nucleotide identity (ANI) of common genes has been proposed as an alternative to DDH. The cut-off value of 70 % DDH for species delineation corresponds to an ANI value of 95 % (Goris et al. 2007). ANI can be calculated by partial sequencing of the genomes (at least 20 %) of the query strains (Richter and Rosselló-Móra 2009). Unique strains of one species can be identified by metabolic activities (sugar utilization, acid production), resistance to antibiotics, and genetic fingerprints obtained by rep-PCR.

Strains in a collection should undergo minimal passages before distribution to reduce genetic variations within these strains. This can be achieved by establishment of a two-tiered system composed of a master and working (distribution) bank for each organism (Day and Stacey 2008).

Future Tasks of Culture Collections

in an ecosystem. These data sets can be applied to develop cultivation methods for ecologically important microorganisms which are not yet cultivable (Prakash et al. 2012). Information based on DNA sequences is increasingly used in ecological research and in investigating microbial communities. Extracted DNA for use in DNA barcoding technology should be deposited in a repository for further taxonomic and biotechnological studies (Vernooy et al. 2010). Handling of biological sequence data derived from “omics” (genomics, transcriptomics, proteomics, metabolomics), including storage and accessibility, should be standardized. Culture collections developing into biological resource centers (BRC) meet the high standard of quality management and accreditation processes and are able to participate in networking initiatives to strengthen the collaboration between collections and their users (Janssens et al. 2010; Stackebrandt 2010).

Sequencing of complete bacterial genomes leads to the discovery and characterization of new gene families (Wu et al. 2009). The ongoing characterization of microbes will lead to new strains, microbial metabolites, and novel protein-coding genes suitable for use in many industrial and health applications.
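The species cut-offs quoted in this entry (about 70 % DDH, corresponding to about 95 % ANI) can be illustrated with a small worked calculation. The sketch below is only a toy: it assumes that the shared genome fragments have already been aligned to equal length, which real ANI workflows obtain from a separate alignment step, and all function names are hypothetical.

```python
# Toy ANI calculation on pre-aligned fragment pairs (hypothetical helper names).
# Assumption: each pair holds two equal-length, already aligned sequences.

ANI_SPECIES_CUTOFF = 95.0  # percent; reported to correspond to about 70 % DDH


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which the two fragments agree."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def average_nucleotide_identity(pairs) -> float:
    """Mean percent identity over all aligned fragment pairs."""
    identities = [percent_identity(a, b) for a, b in pairs]
    return sum(identities) / len(identities)


def same_species(pairs, cutoff: float = ANI_SPECIES_CUTOFF) -> bool:
    """Apply the ANI cut-off as a rough species-level criterion."""
    return average_nucleotide_identity(pairs) >= cutoff


pairs = [
    ("ACGTACGTAC", "ACGTACGTAC"),  # identical fragments
    ("ACGTACGTAC", "ACGTACGTTC"),  # one mismatch in ten positions
]
ani = average_nucleotide_identity(pairs)
```

With these two example fragments the mean identity sits exactly on the cut-off, so `same_species(pairs)` accepts the pair; a single additional mismatch would push it below the threshold.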
with hydrolysates such as beef extract or yeast extract. To create a solid medium, agar, a polysaccharide product of seaweed, is added to the broth formulation prior to autoclaving. To support the growth of fastidious organisms, fresh defibrinated blood (usually sheep’s blood) is often added to cooled agar media prior to pouring into Petri dishes. Haemophilus and Neisseria require blood that has been lysed prior to use (chocolate agar). Strict anaerobes require agar media that can be pre-reduced (i.e., incubated in an anoxic environment to remove all O2 prior to use), and some species require the addition of hemin and vitamin K. A good general agar for this use is Anaerobic Reducible Blood Agar (Remel, Lenexa, KS), which contains cysteine HCl, palladium chloride, and dithiothreitol to maintain a low redox potential of the agar. It also contains hemin and vitamin K and can be purchased with colistin and nalidixic acid as a selective medium for isolation of gram-positive organisms or with kanamycin, vancomycin, and neomycin as a selective medium for gram-negative organisms, particularly the Bacteroides.

Numerous preformulated specialty powdered and premade bacteriological media and plates are available for selection, identification, and cultivation of a wide variety of human bacterial species, with a focus on pathogens.

Methods to Enhance Growth of Uncultivable Organisms

The concept of isolation of pure cultures has been challenged by groups that have shown that cocultivation of organisms can sometimes lead to the successful isolation of previously uncultivable organisms. New technologies have been applied to isolate and capture cells and then incubate them in an environment that simulates (or is) the natural one.

The groups of Slava Epstein and Kim Lewis have made key contributions to methods and discoveries leading to the cultivation of such isolates. In 2002 they reported on the use of diffusion chambers to grow microbial colonies from an intertidal sandy flat in an aquarium containing seawater as the growth medium (Kaeberlein et al. 2002). They estimated that up to 40 % of the cells inoculated into the chamber could be cultivated, but attempts to grow these microcolonies in pure culture were very inefficient. One isolate, which grew poorly on the agar plates, grew well in coculture with any of three other isolates obtained from the chambers. They expanded this to create a high-throughput Ichip diffusion array that contains 192 chambers per array (Nichols et al. 2010). A clever application of the technology was the creation of an upper palate dental appliance that carried a 72-chamber Ichip diffusion array (Sizova et al. 2012). The appliance was worn by a subject for 48 h, then recovered and placed in an anaerobic chamber. Bacterial cells from the chambers were plated on a “basic anaerobic medium” that was low in sugar concentration to prevent selection for fast-growing species. This method contributed 39 isolates, several of which represented taxa that had not been previously cultivated. The take-home lessons that these authors stressed were that “domestication” of uncultivated organisms from the human microbiome is more likely if the bacteria are first grown in vivo and that cell growth should be allowed to occur “unimpeded by neighbors,” for example, by growth in diffusion chambers, by dilution to extinction (Rappe et al. 2002), or by growth encapsulated in microdroplets (Zengler et al. 2002). They also stressed the requirements of strict anaerobic conditions (for oral samples) and the utilization of media low in readily utilizable carbohydrates (Sizova et al. 2012).

The commensalism and mutualism of some bacterial species have been exploited to stimulate growth of previously uncultivated organisms. Examples of commensalism in dental plaque are well established, such as the catabolism of sugars by streptococci to lactic acid, which is fermented by the veillonellae, which cannot utilize sugars. Vartoukian et al. were able to cultivate Cluster
A Synergistetes, which had not been previously accomplished, by growing human plaque samples in a complex cooked meat medium (Vartoukian et al. 2010). Using fluorescent in situ hybridization (FISH) directed against the Synergistetes 16S rRNA, they followed the presence of an isolate, Synergistetes SGP1, and observed that the Synergistetes cells formed aggregates with other bacteria. Ultimately, they showed that growth of SGP1 was stimulated by cross-streaks of Staphylococcus aureus, Fusobacterium nucleatum, Parvimonas micra, and Tannerella forsythia, which were members of the cell aggregates. The mechanism of the effect has not yet been ascertained. Siderophore sharing, or “stealing,” is a theme common in bacterial pathogenesis and is the one that the Epstein and Lewis group observed as a mechanism that permitted coculture of some strains of bacteria isolated from sand biofilms (D’Onofrio et al. 2010). They observed that samples plated in high density yielded much higher numbers of colonies than expected compared to plates with diluted biofilm samples and hypothesized that adjacent pairs of species might have growth dependencies. One strain, Micrococcus luteus KLE1011, was shown to secrete five distinct but related siderophores, any one of which was able to induce growth of the uncultivated strain Maribacter polysiphoniae KLE1104. The M. luteus strain was then used as “bait” to capture additional uncultivated bacteria from the samples (D’Onofrio et al. 2010). It would be surprising if this phenomenon was not observed between members of the human microbiome.

Summary

The cultivation of prokaryotes continues to follow mostly traditional methods, although some groups are beginning to recognize that the cultivation of the uncultivable requires a better appreciation of the in vivo environment of the species that is sought. The application of diffusion chambers and microdroplet technologies to human microbiome samples should accelerate cultivation of some species, and metabolic predictions from whole genome shotgun sequencing may, in future, permit rational cultivation of new species of bacteria.

References

D’Onofrio A, Crawford JM, Stewart EJ, Witt K, Gavrish E, Epstein S, et al. Siderophores from neighboring organisms promote the growth of uncultured bacteria. Chem Biol. 2010;17:254–64.
Dewhirst FE, Chen T, Izard J, Paster BJ, Tanner AC, Yu WH, et al. The human oral microbiome. J Bacteriol. 2010;192:5002–17.
Gao Z, Tseng CH, Pei Z, Blaser MJ. Molecular analysis of human forearm superficial skin bacterial biota. Proc Natl Acad Sci U S A. 2007;104:2927–32.
Goodman AL, Kallstrom G, Faith JJ, Reyes A, Moore A, Dantas G, et al. Extensive personal human gut microbiota culture collections characterized and manipulated in gnotobiotic mice. Proc Natl Acad Sci U S A. 2011;108:6252–7.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Kaeberlein T, Lewis K, Epstein SS. Isolating “uncultivable” microorganisms in pure culture in a simulated natural environment. Science. 2002;296:1127–9.
Nichols D, Cahoon N, Trakhtenberg EM, Pham L, Mehta A, Belanger A, et al. Use of ichip for high-throughput in situ cultivation of “uncultivable” microbial species. Appl Environ Microbiol. 2010;76:2445–50.
Rappe MS, Connon SA, Vergin KL, Giovannoni SJ. Cultivation of the ubiquitous SAR11 marine bacterioplankton clade. Nature. 2002;418:630–3.
Sizova MV, Hohmann T, Hazen A, Paster BJ, Halem SR, Murphy CM, et al. New approaches for isolation of previously uncultivated oral bacteria. Appl Environ Microbiol. 2012;78:194–203.
Vartoukian SR, Palmer RM, Wade WG. Cultivation of a Synergistetes strain representing a previously uncultivated lineage. Environ Microbiol. 2010;12:916–28.
Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, et al. Cultivating the uncultured. Proc Natl Acad Sci U S A. 2002;99:15681–6.
C 118 Customizable Web Server for Fast Metagenomic Sequence Analysis
Customizable Web Server for Fast Metagenomic Sequence Analysis, Fig. 1 The web server page for DNA
clustering
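Fig. 1 shows the DNA clustering page. As a rough illustration of what such clustering does, the sketch below implements a greedy incremental scheme of the kind used by the CD-HIT family: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative is similar enough or becomes the representative of a new cluster. The similarity measure used here (Jaccard overlap of k-mer sets), the parameter values, and the function names are stand-in assumptions chosen for brevity; they are not the identity calculation that CD-HIT-EST performs.

```python
# Greedy incremental clustering sketch (hypothetical names, simplified similarity).

def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs: dict, k: int = 4, threshold: float = 0.8) -> dict:
    """Map each representative name to the list of names in its cluster."""
    # Longest sequences first, so representatives are the longest members.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    rep_profiles = {}
    clusters = {}
    for name in order:
        profile = kmers(seqs[name], k)
        for rep, rep_profile in rep_profiles.items():
            if jaccard(profile, rep_profile) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            rep_profiles[name] = profile
            clusters[name] = [name]
    return clusters


reads = {
    "r1": "ACGTACGTTGCAACGT",
    "r2": "ACGTACGTTGCAACG",   # r1 with the last base removed
    "r3": "TTTTGGGGCCCCAAAA",  # unrelated sequence
}
clusters = greedy_cluster(reads)
```

Keeping only the keys of `clusters` gives a non-redundant set of representatives, which mirrors the "removes redundant sequences" use described for the clustering tools in this entry.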
• Meta-RNA (Huang et al. 2009) identifies rRNAs from fragmented sequences using a hidden Markov model-based algorithm.
• A BLAST-based program identifies rRNAs by comparing the query against several rRNA reference databases.
• For metagenomic data from human subjects, WebMGA offers a tool that identifies human DNAs and RNAs and removes them from input metagenomic sequences. A fast mapping program FR-HIT (Niu et al. 2011) is used to align the input sequences against human reference sequences.
• CD-HIT-EST, an ultrafast sequence-clustering program, clusters the DNAs into groups or removes redundant sequences.
Customizable Web Server for Fast Metagenomic Sequence Analysis, Fig. 2 A simple workflow using tools in
WebMGA
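The ORF calling step of such a workflow can be illustrated with six-reading-frame translation, the approach this entry attributes to ORF_finder. The sketch below is not ORF_finder's code: it assumes the standard genetic code, treats any stop-free stretch of at least `min_len` amino acids as an ORF, and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 3) -> list:
    """Return (strand, frame offset, peptide) for every stop-free stretch."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Stop codons ('*') delimit the candidate ORFs in this frame.
            for piece in protein.split("*"):
                if len(piece) >= min_len:
                    orfs.append((strand, offset, piece))
    return orfs


orfs = six_frame_orfs("ATGAAATTTTAGCCC")
```

Because fragmented metagenomic reads often lack start or stop codons, this sketch does not require an initiating methionine; that is a design choice of the example, not a statement about how any WebMGA tool behaves.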
• WebMGA has a taxonomy-binning tool that maps the reads to reference genomes using FR-HIT and then assigns taxonomy annotations.
• ORF_finder (Li 2009) calls ORFs from input sequences by six-reading-frame translation.
• MetaGene (Noguchi et al. 2006) identifies ORFs from fragmented sequences.
• FragGeneScan (Rho et al. 2010) identifies ORFs and also tries to correct frameshift errors.

Users can input protein or peptide sequences to run the following analyses:
• CD-HIT (Li et al. 2001, 2002; Li and Godzik 2006; Huang et al. 2010) clusters the input sequences into protein clusters or removes redundant sequences.
• A multistep clustering pipeline groups protein sequences into protein families.
• WebMGA uses the HMMER3 program (Eddy 2009) to compare input peptides against the Pfam and Tigrfam databases and assign the domain or protein families.
• WebMGA uses RPS-BLAST to compare input peptides against NCBI’s COG, KOG, and PRK databases and provide function annotations.
• WebMGA provides Gene Ontology (GO) annotations.
• WebMGA searches the KEGG database and provides pathway annotations.

16S rRNA tags can also be analyzed through WebMGA:
• RDP Classifier (Wang et al. 2007) analyzes rRNA tags and assigns taxonomy annotations.
• CD-HIT-OTU (Li et al. 2012) is a pipeline that filters and processes the raw rRNA tags and clusters them into operational taxonomic units (OTUs). CD-HIT-OTU is available at http://weizhongli-lab.org/cd-hit-otu.

Each of the above tools has a Web interface where users can run it individually. Users with programming skills can even compose a script to run a customized multistep analysis workflow through WebMGA’s Web services. As illustrated in Fig. 1, a user can upload a DNA dataset to run several analysis processes in parallel. The user can use an HMM-based or BLAST-based method to find rRNAs and to produce a FASTA file with rRNA masked. The masked file is then processed by an ORF calling program, and the ORFs are used for function and pathway annotation. This workflow is illustrated in Fig. 2.

Summary

WebMGA provides researchers the tools for rapid metagenomic sequence analysis through Web server and Web services. The tools and functions in WebMGA cover a large scope of metagenomic data analysis such as raw sequence quality control, human DNA filtering, OTU estimation, taxonomy binning, sequence clustering, and function and pathway annotation. By directly accessing the Web services with client-side scripts, users can customize and run their own workflows. The tools and data in WebMGA are
constantly being updated, and new tools for fast metagenomic data analysis will be continuously added.

Cross-References

▶ Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences
▶ FR-HIT Overview

References

Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
Cox MP, Peterson DA, et al. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinforma. 2010;11:485.
Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23(1):205–11.
Huang Y, Gilna P, et al. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25(10):1338–40.
Huang Y, Niu B, et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
Li W. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinforma. 2009;10:359.
Li WZ, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Li WZ, Jaroszewski L, et al. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282–3.
Li WZ, Jaroszewski L, et al. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18(1):77–82.
Li W, Fu L, et al. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13(6):656–68.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25(5):955–64.
Niu B, Fu L, et al. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinforma. 2010;11:187.
Niu B, Zhu Z, et al. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics. 2011;27(12):1704–5.
Noguchi H, Park J, et al. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30.
Rho M, Tang H, et al. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Schloss PD, Westcott SL, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41.
Wang Q, Garrity GM, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
D
D 124 DACTAL
DACTAL, Fig. 1 DACTAL algorithmic design. DACTAL can begin with an initial tree (bottom triangle), or through a technique that divides the unaligned sequence dataset into overlapping subsets. Each subsequent DACTAL iteration uses a novel decomposition strategy called “PRD” (padded recursive decomposition) to divide the dataset into small, overlapping subsets, estimates trees on each subset, and merges the small trees into a tree on the entire dataset (figures included from a previous publication (Nelesen et al. 2012), with permission from the publisher).
datasets of about 50,000 sequences. Thus, the estimation of a large phylogenetic tree is a very challenging problem, and one of the biggest issues is the estimation of the multiple sequence alignment for the dataset.

Methods

DACTAL (Nelesen et al. 2012) is a method for estimating a very large phylogeny without needing to estimate a multiple sequence alignment on the entire dataset. The basic approach is a combination of divide-and-conquer plus iteration (see Fig. 1).

The input is a set of unaligned but homologous sequences, and each iteration produces a tree (but no alignment) on the full dataset. With the exception of the first iteration, each iteration begins with the tree from the previous iteration. In the first iteration, the method begins by dividing the dataset into overlapping subsets, each with at most some user-specified number of sequences; the default for this is 200. This division into subsets can be accomplished through the use of a technique that uses BLAST to form small sets of sequences around each sequence with some overlap between the sequence subsets. Alternatively, the division can be performed by computing an alignment and tree on the dataset (using some fast and approximate methods) and then using the tree to produce a recursive decomposition of the sequence dataset. In either case, the resulting decomposition yields subsets that overlap at least one other subset by some specified minimum amount (default 50) and that are themselves small (by default each subset has at most 200 sequences).

Once the decomposition is performed, trees are estimated on each subset, using some favored method; the default is a maximum likelihood analysis (default RAxML) on a good multiple sequence alignment, with the default being MAFFT (Katoh et al. 2005). These subsets are small (by default, they have at most 200 sequences in them), and as the experimental results show, this is sufficient even for datasets with about 28,000 sequences.

After the trees are computed, they can be merged together into a tree on the full set of taxa using a supertree method; the default is SuperFine+MRP (Swenson et al. 2012),
a supertree method that has excellent accuracy and which “boosts” the accuracy of MRP (another supertree method; see Bininda-Emonds 2004). Subsequent iterations begin with the tree estimated during the previous iteration and then decompose the dataset into overlapping subsets, compute trees on subsets, and merge the trees into a tree on the full dataset. The number of iterations is a parameter that is set by the user. Thus, DACTAL is a method that can be modified to enable different techniques for estimating trees on subsets and for combining subset trees into a tree on the full dataset, and the target subset size and overlap between subsets are parameters that can be set by the user. The default settings were selected for accuracy and speed and provide good results, as the results section demonstrates.

Results

The performance of DACTAL was evaluated in comparison to maximum likelihood trees computed on SATe-I (Liu et al. 2009) and other alignment methods on simulated datasets with 1,000 sequences and on biological datasets with 6,000–28,000 sequences (Nelesen et al. 2012). The results of these experiments are shown in the figures below and demonstrate that DACTAL had accuracy comparable to that of SATe-I and could analyze larger datasets than SATe-I. These experiments also show that DACTAL was substantially more accurate than two-phase methods (i.e., methods that align sequences and then estimate trees on these alignments).

Figure 2 compares running time and tree accuracy on the 20 replicate datasets for 15 model conditions with 1,000 taxa, originally used to evaluate SATe-I (Liu et al. 2009). These model conditions vary in terms of rates of evolution, indel lengths (short, medium, or long), and relative rates of substitutions and indels (insertions and deletions).

The error in tree estimation is computed using the missing branch rate, which is the fraction of the nontrivial bipartitions in the true (model) tree that are missing in the estimated tree. In this experiment, DACTAL is run for ten iterations, while SATe-I runs for 24 h after it computes the RAxML(MAFFT) starting tree. The running time comparison shows that DACTAL is much faster than SATe-I on every model condition. The comparison with respect to accuracy shows that DACTAL has approximately the same accuracy as SATe-I and that both DACTAL and SATe-I are much more accurate than the two-phase methods on the difficult 1,000-taxon model conditions. Finally, this figure also shows that DACTAL is faster than SATe, although it is slower than the two-phase methods.

DACTAL, Fig. 2 Comparisons of ten iterations of DACTAL to SATe and RAxML trees estimated on different alignments on “moderate-to-difficult” simulated 1,000-taxon datasets. We show missing branch rates (top) and runtimes in hours (bottom); n = 20 for each model condition, and standard error bars are shown. DACTAL and SATe runtimes include the time to compute RAxML(MAFFT) starting trees. Asterisks (*) denote model conditions for which DACTAL’s missing branch rate is a statistically significant improvement over the next best method, according to Benjamini-Hochberg-corrected pairwise t-tests (n = 40; alpha = 0.05) (figures included from a previous publication (Nelesen et al. 2012), with permission from the publisher).

DACTAL, Fig. 3 Comparisons of DACTAL and SATe iterations with two-phase methods on the 16S.T dataset with 7,350 sequences. The starting trees were RAxML on the MAFFT-PartTree alignment (RAxML(Part)) for SATe and FastTree-2 on the MAFFT-PartTree alignment (FT(Part)) for DACTAL. We show missing branch rates (top) and cumulative runtimes in hours (bottom); n = 1 for each reported value. Iteration 0 is used to compute the starting tree for DACTAL and SATe (figures included from a previous publication (Nelesen et al. 2012), with permission from the publisher).

Figure 3 shows performance on a single biological dataset, 16S.T, from the Comparative RNA Web (CRW) Site (Cannone et al. 2002). This dataset has 7,350 sequences and a high rate of evolution and so represents a challenging phylogenetic dataset. The reference tree for this dataset is based on a curated structural multiple sequence alignment (Cannone et al. 2002). This figure gives four different two-phase methods (maximum likelihood computed using FastTree-2 or RAxML on either Clustal-Quicktree or MAFFT-PartTree alignments), but also shows trees obtained for each of ten iterations produced by SATe-I and DACTAL. Note how SATe-I and DACTAL both improve with each iteration, with the initial iterations producing the biggest reductions in tree error, and that they track each other iteration by iteration. However, note that each DACTAL
iteration is much faster than each SATe-I iteration, so that ten iterations of DACTAL finish in about 1/8 the time of ten iterations of SATe-I.

Discussion

DACTAL is a method for estimating trees from unaligned sequences. While it does not require the estimation of an alignment on the full dataset, it is not entirely alignment-free, since it estimates alignments on subsets. However, these subsets are small, containing at most 200 sequences by default, which reduces the computational and analytical challenges to running DACTAL. These experiments show that DACTAL can produce highly accurate phylogenetic estimates on very large datasets, improving on the accuracy of both two-phase methods (that first align the sequences and then estimate the tree) and SATe-I.

Alignment-free methods (i.e., methods that do not use any multiple sequence alignment technique at all to compute trees) have also been designed; these are surveyed in Vinga and Almeida 2003 and Chan and Ragan 2013. Alignment-free methods typically compute trees in three steps: first, each sequence is characterized by some distribution (e.g., its k-mer distribution for some appropriately chosen k), then distances between sequences are computed, and finally a tree is computed on the distance matrix. Unlike DACTAL, these truly alignment-free methods have not, to our knowledge, been shown to produce trees of comparable accuracy to methods that estimate multiple sequence alignments and then compute maximum likelihood trees on these alignments. Furthermore, the alignment-free methods surveyed in these papers do not have any theoretical guarantees under Markov models of evolution. An interesting contrast to these methods is the recent result given in Daskalakis and Roch 2010. This technique is guaranteed to be statistically consistent under the TKF1 model (Thorne et al. 1991) and so represents an important advance in theory. However, this method has not yet been implemented, so it remains a theoretical contribution rather than a usable technique.

Unlike these truly alignment-free methods, DACTAL is not completely alignment-free, since it does compute alignments on subsets. However, the results shown here suggest that highly accurate trees are indeed possible without requiring a multiple sequence alignment on the full dataset.

Future Work

The phylogenetics research community has been developing improved methods for alignment and phylogeny estimation. These methods may well lead to improved estimations of larger trees and could reduce the need for methods like DACTAL. However, DACTAL may continue to be a useful tool for improving scalability of these methods to very large datasets, containing many tens of thousands of sequences, since these improved techniques could be used to estimate trees on subsets of taxa. This may be particularly relevant to the recent effort to develop methods that co-estimate sequence alignments and trees under complex models of sequence evolution (see Bouchard-Cote and Jordan 2013 for a recent paper and other methods surveyed in Warnow 2013). Most of these methods are limited to at most 200 sequences and are computationally intensive even at that size, and DACTAL could potentially be used to improve their scalability to larger datasets. More generally, the phylogenetics research community has been developing sophisticated techniques for highly accurate estimations of alignments and trees, but these statistically based methods often use techniques (such as MCMC) that are computationally intensive and do not run on large datasets. DACTAL provides a basic tool for improving the scalability of these techniques and so complements these efforts. Thus, large-scale phylogeny estimation may well improve through a combination of efforts: some aimed at improving the estimation of trees and alignments on small datasets, using statistically informed but computationally intensive methods, and others aimed at using divide-and-conquer to combine smaller trees into larger trees.
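The missing branch rate used in the Results of this entry (the fraction of nontrivial bipartitions in the true tree that are missing from the estimated tree) can be computed directly once each tree is reduced to its set of bipartitions. The sketch below is an illustration under assumptions, not the evaluation code used in the cited study: trees are written as nested Python tuples of leaf names, both trees are assumed to share the same leaf set, and the function names are hypothetical.

```python
# Missing branch rate on trees given as nested tuples of leaf-name strings.

def leaf_set(tree) -> frozenset:
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree) -> set:
    """Nontrivial bipartitions, each stored as the side without the anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side consistently
    splits = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Nontrivial: at least two leaves on each side of the edge.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def missing_branch_rate(true_tree, estimated_tree) -> float:
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
estimated = ((("A", "C"), "B"), ("D", ("E", "F")))
rate = missing_branch_rate(true_tree, estimated)
```

In this example the true tree has three nontrivial bipartitions and the estimate recovers two of them, so one third of the true branches are missing.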
D 128 Diversity and Distribution of Marine Microbial Eukaryotes
marine alveolate lineages. Aquat Microb Ecol. Skovgaard A, Massana R, Balague V, Saiz
2006;42:277–91. E. Phylogenetic position of the copepod-infesting par-
Ichinomiya M, Yoshikawa S, Kamiya M, Ohki K, asite Syndinium turbo (Dinoflagellata, Syndinea). Pro-
Takaichi S, and Kuwata A. Isolation and characteriza- tist. 2005;156:413–23.
tion of Parmales (Heterokonta/Heterokontophyta/ Taylor FJR, Hoppenrath M, Saldarriaga JF. Dinoflagellate
stramenopiles) from the Oyashio region, western diversity and distribution. Biodivers Conserv.
North Pacific. J Phycol. 2011;47:144–151. 2008;17:407–18.
Keeling PJ. Chromalveolates and the evolution of plastids Thaler M, Lovejoy C. Distribution and diversity of
by secondary endosymbiosis. J Eukaryot Microbiol. a protist predator Cryothecomonas (Cercozoa) in Arc-
2009;56:1–8. tic marine waters. J Eukaryot Microbiol.
Kosman CA, Thomsen HA, Ostergaard JB. Parmales 2012;59:291–9.
(Chrysophyceae) from Mexican, Californian, Baltic, Vaulot D, Eikrem W, Viprey M, Moreau H. The diversity
Arctic and Antarctic waters with the description of of small eukaryotic phytoplankton (<¼ 3 mu m) in
a new subspecies and several new forms. Phycologia. marine ecosystems. FEMS Microbiol Rev.
1993;32:116–28. 2008;32:795–820.
Lovejoy C, Legendre L, Martineau MJ, Bacle J, von Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki
Quillfeldt CH. Distribution of phytoplankton and other protists in the North Water. Deep-Sea Res Part II Top Stud Oceanogr. 2002;49:5027–47.
Massana R, Castresana J, Balague V, Guillou L, Romari K, Groisillier A, Valentin K, Pedros-Alio C. Phylogenetic and ecological analysis of novel marine stramenopiles. Appl Environ Microbiol. 2004;70:3528–34.
Massana R, Unrein F, Rodriguez-Martinez R, Forn I, Lefort T, Pinhassi J, Not F. Grazing rates and functional diversity of uncultured heterotrophic flagellates. ISME J. 2009;3:588–96.
Moon-van der Staay SY, De Wachter R, Vaulot D. Oceanic 18S rDNA sequences from picoplankton reveal unsuspected eukaryotic diversity. Nature. 2001;409:607–10.
Not F, Valentin K, Romari K, Lovejoy C, Massana R, Tobe K, Vaulot D, Medlin LK. Picobiliphytes: a marine picoplanktonic algal group with unknown affinities to other eukaryotes. Science. 2007;315:253–5.
Not F, Latasa M, Scharek R, Viprey M, Karleskind P, Balague V, Ontoria-Oviedo I, Cumino A, Goetze E, Vaulot D, Massana R. Protistan assemblages across the Indian Ocean, with a specific emphasis on the picoeukaryotes. Deep-Sea Res Part I Oceanogr Res Pap. 2008;55:1456–73.
Raven JA, Finkel ZV, Irwin AJ. Picophytoplankton: bottom-up and top-down controls on ecology and evolution. Vie Et Milieu-Life Environ. 2005;55:209–15.
Reyes-Prieto A, Yoon HS, Moustafa A, Yang EC, Andersen RA, Boo SM, Nakayama T, Ishida K, Bhattacharya D. Differential gene retention in plastids of common recent origin. Mol Biol Evol. 2010;27:1530–7.
Rozanska M, Poulin M, Gosselin M. Protist entrapment in newly formed sea ice in the Coastal Arctic Ocean. J Mar Syst. 2008;74:887–901.
Seenivasan R, Sausen N, Medlin LK, Melkonian M. Picomonas judraskeda gen. et sp. nov.: The first identified member of the Picozoa Phylum nov., a widespread group of picoeukaryotes, formerly known as ‘picobiliphytes’. PLoS ONE 2013;8(3):e59565. doi:10.1371/journal.pone.0059565.
ME, Wilson WH, Yang EC, Duffy S, Bhattacharya D. Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science. 2011;332:714–7.

DNA Methylation Analysis by Pyrosequencing

Florence Busato and Jörg Tost
Laboratory for Epigenetics and Environment, Centre National de Génotypage, CEA-Institut de Génomique, Evry, France

Synonyms

Quantitative sequencing by synthesis

Definition

Pyrosequencing is a sequencing-by-synthesis method that quantitatively monitors the real-time incorporation of nucleotides using an enzymatic conversion of pyrophosphate into a proportional light signal. Quantitative measures are crucial for applications such as the analysis of DNA methylation patterns, which are intensively studied in various developmental and pathological contexts, as well as for bacterial identification and determination of allelic imbalance.
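
The proportionality between nucleotide incorporation and light signal can be illustrated with a toy simulation. The sketch below is not part of the original entry: the function name `simulate_pyrogram` and the example sequences are hypothetical, and it assumes an idealized reaction without background signal or incomplete extension. For each dispensed nucleotide it counts how many consecutive bases of the strand to be synthesized are incorporated, which corresponds to the expected relative peak height.

```python
def simulate_pyrogram(sequence_to_synthesize, dispensation_order):
    """Return (nucleotide, relative peak height) for each dispensation.

    sequence_to_synthesize: bases added to the primer, in synthesis order
                            (hypothetical input for illustration).
    dispensation_order:     nucleotides in the order they are dispensed.
    """
    peaks = []
    position = 0
    for nucleotide in dispensation_order:
        incorporated = 0
        # The polymerase extends as long as the next base matches the
        # dispensed nucleotide; a homopolymer run is incorporated at once.
        while (position < len(sequence_to_synthesize)
               and sequence_to_synthesize[position] == nucleotide):
            incorporated += 1
            position += 1
        # Idealized assumption: signal is proportional to incorporations.
        peaks.append((nucleotide, incorporated))
    return peaks


example_peaks = simulate_pyrogram("TTCAGG", "TCAGT")
for nucleotide, height in example_peaks:
    print(nucleotide, height)
```

In this idealized picture, a nucleotide that is not complementary at the current position gives a peak height of zero, and a run of two identical bases gives twice the height of a single incorporation.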
Introduction

While Sanger sequencing has long been the “gold standard” for the identification of sequence variants, pyrosequencing, with its improved ability for quantification, lower limit of detection, and accelerated workflow leading to a shorter time to results, has become a valuable alternative, notably for many clinical and diagnostic applications. Pyrosequencing is a sequencing-by-synthesis method in which nucleotides are incorporated complementary to a template strand, leading to the release of pyrophosphate (PPi) that, after several enzymatic reactions, produces a light signal proportional to the amount of incorporated nucleotide (Fig. 1).

The experimental procedure of the pyrosequencing assay is simple and relatively robust, and results are highly reproducible. Therefore, pyrosequencing has become a widely used analysis platform for various biological and/or diagnostic applications such as routine (multiplex) genotyping of single-nucleotide polymorphisms (SNPs), methylation analysis of bisulfite-treated samples, bacterial typing, mutation detection, and allele quantification (Marsh 2007).

DNA Methylation Analysis by Pyrosequencing, Fig. 1 Nucleotides added into the pyrosequencing reaction (here exemplified by a thymine) are incorporated by the DNA polymerase extending the pyrosequencing primer when they are complementary to the DNA template sequence. This incorporation releases PPi, which is used together with APS by an ATP sulfurylase to produce ATP. ATP is subsequently used by luciferase to oxidize luciferin to oxyluciferin, generating a proportional light signal. Unincorporated nucleotides are degraded by apyrase to avoid unspecific background signals. The reactions are detailed in the text

DNA Methylation

DNA methylation is a post-replication modification that occurs in mammals almost exclusively on the 5 position of the pyrimidine ring of cytosines in the context of a CpG dinucleotide (Tost 2009). CpGs represent less than 1 % of all bases and are mostly methylated in the mammalian genome. CpGs are relatively rare because methylated cytosines in CpGs are easily transformed into thymines by deamination, yielding TpGs, and as thymine is a naturally occurring building block of DNA, these mutations are less well recognized and repaired by the cellular machinery. This elevated mutation rate has led to CpG depletion during evolution.

However, relatively CpG-rich clusters, called CpG islands, are found in the promoter and first exon of approximately two-thirds of all genes. Mostly unmethylated, these CpG islands are distributed throughout the human genome and maintain the chromatin in an open configuration to allow transcription. The absence of DNA methylation is not directly correlated to the transcriptional activity of the corresponding gene, but rather to its transcriptional potential. However, a certain number of promoter CpG islands are methylated in a tissue-specific manner, and this DNA methylation helps to maintain transcriptional silence in non-expressed or noncoding regions of the genome. Methylated regions also maintain transcriptional inactivation, as exemplified by the methylation and repression of repetitive and transposable elements. Furthermore, some genes, called imprinted genes, express only one allele depending on their parent of origin (maternal or paternal allele), and the non-expressed allele is associated with a repressed imprinting control region, which is in many cases marked by DNA methylation. Inactivation of one X chromosome in female mammals is another example in which DNA methylation plays an important role in gene dosage and regulation.

During aging and in the context of pathologies, particularly cancer, regions that are normally unmethylated become methylated, and this hypermethylation can induce, or is at least associated with, aberrant gene expression patterns. For example, methylation of the DNA repair genes MLH1 and MGMT can lead to their inactivation, resulting in microsatellite instability and increased mutation frequency, respectively. Methylation can also promote spontaneous deamination, enhance DNA binding of carcinogens, or increase ultraviolet absorption by DNA and, as a result, increase the rate of mutations, DNA adduct formation, and subsequent gene inactivation. As DNA methylation has been shown to be influenced by diet and environmental exposure, it has been postulated that DNA methylation might constitute a measurable molecular memory of our lifestyle and environment (Cortessis et al. 2012).

Methylation of cytosines in other sequence contexts (CpNpG, CpA, etc.) has been identified in cultured cells such as mouse embryonic stem cells. In plants, methylation of cytosines is more prevalent and more diverse than in mammals, and their DNA is highly methylated. The methylcytosines are mainly located in CpG and CpNpG sequences, but they may also occur in other contexts. DNA methylation controls plant growth and development, with a particular involvement in the regulation of gene expression and DNA replication, similar to its function in mammalian cells.

Compared to mammals, bacteria have at least two methylated bases in addition to 5-methylcytosine: N6-methyladenine (in the sequence contexts GpApTpC and GpApNpTpC) and N4-methylcytosine (Casadesus and Low 2006). These methylated bases are involved in the protection of bacterial DNA, where they act as a defense mechanism against bacteriophage infection. They also play crucial roles in the control of DNA repair, replication, transposition, and, similar to eukaryotes, gene expression. In particular, adenine methylation plays an important role in the regulation of gene expression in bacteria, with its absence allowing the binding of specific proteins to the bacterial DNA. Methylation patterns have also been correlated to the virulence of several pathogens.

However, due to their greater diversity, the presence of many “orphan” methyltransferases, i.e., enzymes not part of a restriction enzyme system that methylate bacterial genomes at specific sites, and the only recent emergence of appropriate tools to study these DNA modifications, DNA methylation in bacteria has not been
a topic of intensive research. The advent of single-molecule sequencing technologies such as the single-molecule real-time sequencer from Pacific Biosciences, which performs sequencing with an immobilized polymerase at the bottom of zero-mode waveguide wells in zeptoliter volumes, has revolutionized the possibilities for DNA methylation analysis in bacteria and allowed the direct readout of CpG and other methylation modifications in bacteria (Davis et al. 2013).

Principles of the DNA Methylation Analysis

As DNA methylation is involved in many biological processes, it is of great importance to analyze DNA methylation patterns and their variability. As DNA methylation is not retained during PCR amplification, it is necessary to make use of procedures that are able to differentiate the epigenetic state. Methods for DNA methylation analysis are based on four main principles. (1) Methylation-sensitive restriction endonucleases, i.e., enzymes that are blocked by methylated cytosines in their recognition sequence, are widely used for the analysis of methylation patterns in combination with their methylation-insensitive isoschizomers. Although methods based on methylation-sensitive restriction enzymes are simple and cost-effective, as they do not require any special instrumentation, they are limited to specific restriction sites, as only CpG sites found within these sequences can be analyzed. (2) The methylated fraction of a genome can be enriched by precipitation with a bead-immobilized antibody specific for 5-methylcytosine or (3) by affinity purification of methylated DNA with MBD proteins, but these methods do not permit the analysis of DNA methylation patterns at single-nucleotide resolution. (4) The most widely used approach consists of the chemical modification of genomic DNA with sodium bisulfite. This chemical reaction induces the hydrolytic deamination of non-methylated cytosines to uracils, while methylated cytosines are resistant to conversion under the chosen reaction conditions. This method thus translates the methylation signal into a sequence difference. After PCR amplification, the methylation status at a given position is manifested in the ratio of C (formerly methylated cytosine) to T (formerly non-methylated cytosine) and can be analyzed as a virtual C/T polymorphism spanning the entire allele frequency spectrum from 0 % to 100 % in the bisulfite-treated DNA. The latter principle is commonly used for DNA methylation analysis by pyrosequencing. It should be noted that the reduced complexity of the bisulfite-treated DNA (which essentially consists of a three-letter genome), creating homopolymeric and highly AT-rich sequences, poses a challenge for the design of PCR amplification-based assays and frequently induces a preferential amplification of either unmethylated or methylated alleles. This bias has to be monitored and corrected for to ensure accurate quantification of DNA methylation levels of the analyzed CpGs.

Principle of the Pyrosequencing Reaction

Pyrosequencing is a polymerase-based quantitative real-time sequencing method used to analyze multiple sequence variations in a region of interest. In contrast to conventional Sanger sequencing, which uses a mixture of the four fluorescently labeled chain-terminating ddNTPs and strand-elongating dNTPs, only one nucleotide is dispensed at a time by an inkjet-type cartridge in pyrosequencing reactions, using either a user-defined sequence-specific dispensation order or a repetitive cyclic dispensation order of the four nucleotides for unknown sequences.

This iterative incorporation of unmodified nucleotides by the exonuclease-deficient Klenow fragment of DNA polymerase I results in the release of inorganic pyrophosphate (PPi), while all unincorporated nucleotides are degraded by an apyrase prior to addition of the next nucleotide. When the polymerase encounters a noncomplementary nucleotide, it pauses while nucleotide degradation takes place. The pyrophosphate is, in the presence of adenosine phosphosulfate (APS), transformed by an ATP
sulfurylase into several products including ATP. The latter is used in the subsequent step by a luciferase to oxidize luciferin to oxyluciferin, resulting in the creation of a proportional amount of photons, which can be monitored by a CCD camera (Fig. 1). The four enzymes are present in a well-balanced mixture allowing the DNA polymerase to extend the newly synthesized DNA strand until it encounters a noncomplementary nucleotide while at the same time avoiding unspecific nucleotide incorporation and out-of-phase sequencing. A key step in the development of applications for pyrosequencing was the addition of a single-stranded DNA binding protein to the reaction mixture (now also included in the commercial kits), which led to a substantial increase in read length and overall greater accuracy through the reduction of the formation of secondary structures and mispriming (Dupont et al. 2004).

Samples of interest are amplified by PCR performed with one of the two amplification primers being biotinylated. This allows the isolation of a single-stranded sequencing template through the capture of the biotinylated amplification product on streptavidin-coated Sepharose beads. After washing steps, a sodium hydroxide solution is used to denature the double-stranded DNA and isolate the biotinylated single strand used as template in the pyrosequencing procedure. A (pyro)sequencing primer is subsequently annealed to this template, and the sequence is synthesized one nucleotide at a time. The light signals are then generated by the enzymatic cascade described above as the 3′ end of the nascent strand is extended. It should be noted that the nucleotide dATP acts as a natural substrate for luciferase (although less efficiently than ATP). Therefore, the α-S-dATP analogue is used as nucleotide for primer extension, as it is equally well incorporated by the polymerase.

Pyrosequencing can analyze almost any polymorphism in the amplified sequence. As the expected sequence is in most cases known a priori, the sequence to analyze is simply entered into the software, which automatically creates a dispensation order, and once the sequencing reaches this polymorphism, both nucleotides of the variable position will be added successively and their proportional luminometric signal quantified by the software.

Since all the enzymatic reactions are quantitative, the intensity of the bioluminometric response is directly proportional to the amount of incorporated nucleotides: the incorporation of two identical consecutive nucleotides gives double the intensity (and therefore peak height in the resulting pyrogram) compared to the signal of a single-nucleotide incorporation. This quantitative nature of the results is the most important characteristic of the pyrosequencing technology because it enables quantitative applications such as DNA methylation analysis. Furthermore, as pyrosequencing proceeds at a rate of one dispensation per minute, results on the presence and abundance of variable nucleotides will be available between 10 and 60 min after launching a pyrosequencing reaction. The total time to results starting from the PCR amplification is commonly below 3–4 h and therefore much faster than conventional Sanger sequencing.

Inconveniences

However, there are some inconveniences associated with this technology, mainly concerning the analysis of variation in the close proximity of homopolymers, the size of the amplification product, and the sequencing read length. Pyrosequencing, as well as the closely related 454 sequencing and semiconductor sequencing (Ion Torrent), suffers from a lack of precision in the analysis of homopolymers. The bioluminometric response is only linear (R² > 0.99) for the sequential addition of up to five identical nucleotides (C, G, T) or three α-S-dATPs. Sequence variation in close proximity to homopolymer reads might therefore not be easily resolved, and the quantitative accuracy might be limited. Due to the thermal instability of the enzymes, pyrosequencing has to be carried out at 28 °C, which limits the size of the amplification product to 350 base pairs, as the formation of secondary structures can complicate annealing
of the sequencing primer or increase background signals. The limitation in the read length (less than 100 dispensed nucleotides) is mainly due to dilution effects and increasing background due to frameshifts of subpopulations of sequenced molecules. This drawback can be partly overcome using the serial pyrosequencing approach described below. Lastly, the setup and optimization of robust pyrosequencing assays, including the assay design and the entry of an optimal dispensation order, requires a certain degree of experience and expertise, and only few tools are available in the public domain for the assay design.

Serial Pyrosequencing

To overcome the restriction in read length, a solution was found in the “recycling” of the single-stranded template after the pyrosequencing run. As this template is not altered during the pyrosequencing reaction, it can be recovered after the run by the same template preparation protocol used after PCR amplification. Several pyrosequencing primers can therefore be used on the same DNA template to cover the entire amplified sequence with sufficient intensity and good quantitative resolution. This improvement enables the analysis of an entire region amplified in a single PCR. While the approach was initially devised for DNA methylation analysis (Tost et al. 2006), it could also be used for the analysis of several sequence variations within the same amplification product.

Application: Genotyping and Mutation Detection

Pyrosequencing can be used to genotype single-nucleotide polymorphisms (SNPs) and detect mutations involved in various diseases (cancer, Alzheimer’s disease, heart diseases, diabetes) or in biological traits such as eye color or lactose intolerance.

Once the sequencing reaches the SNP (entered in the sequence to analyze in the software using the IUPAC single letter code), all possible nucleotides will be added one after another. Each allele combination will result in a specific pyrosequencing pattern that can easily be read either by the software or by the user. Besides simple qualitative genotyping, pyrosequencing can be used for quantitative applications such as determining the level of mutation or the potential loss of one allele (loss of heterozygosity (LOH)). LOH can result in a neutral phenotype but can also be involved in cancer, as exemplified by the LOH of BRCA1 or BRCA2 in breast cancer.

Due to its relatively short read length, pyrosequencing is best suited for the detection and quantification of mutation hotspots such as codons 12 and 13 of KRAS (Ogino et al. 2005), a gene commonly mutated in many cancers, including lung, pancreatic, and colorectal cancer, where it is the most commonly mutated gene with a prevalence of ~40 % of patients. Similar applications concern the analysis of BRAF (V600E) or JAK2 (V617F) mutations and polymorphisms such as C677T MTHFR. Compared to conventional Sanger sequencing, the limit of detection is significantly improved (i.e., 2–7 % for pyrosequencing compared to 10–20 % for conventional Sanger sequencing), which enables the user to call low-level mutations with greater confidence and to resolve, e.g., ambiguous Sanger sequencing results. This property of pyrosequencing is also of special importance in situations where, for example, few tumor cells are present among normal cells and/or a subclone of the tumor carries the mutation of interest, which might expand upon a given therapy and induce drug resistance. Pyrosequencing has also been applied to more complex genetic analyses requiring accurate sequencing such as HLA (sub)typing (Ugolotti et al. 2011). A quantitative readout is also of interest for the genotyping of SNPs in polyploid organisms such as plants, where pyrosequencing has proven to be an effective tool.

Application: Transcript Quantification

Just as it can quantify the ratio of mutations in a heterogeneous mixture of DNA, pyrosequencing
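
Quantifying the ratio of two alleles from their peak heights, whether for a mutation in a heterogeneous sample or for the C/T readout after bisulfite conversion, reduces to a simple proportion. The sketch below is only an illustration under stated assumptions: the function names, the example peak heights, and the default threshold of 5 % (chosen arbitrarily within the 2–7 % limit of detection quoted above) are hypothetical and are not taken from any instrument software.

```python
def allele_percentage(variant_peak, reference_peak):
    """Percentage of the variant allele at a variable position.

    Both arguments are peak heights (arbitrary light units) measured for
    the two nucleotides dispensed at the variable position.
    """
    total = variant_peak + reference_peak
    if total <= 0:
        raise ValueError("no signal at the variable position")
    return 100.0 * variant_peak / total


def above_detection_limit(percentage, limit_of_detection=5.0):
    """True if the variant percentage reaches the assumed detection limit.

    The default of 5.0 is an illustrative assumption, not a validated value.
    """
    return percentage >= limit_of_detection


# Hypothetical peak heights for a variant and a reference nucleotide.
clear_variant = allele_percentage(12.0, 88.0)
faint_variant = allele_percentage(3.0, 97.0)
print(clear_variant, above_detection_limit(clear_variant))
print(faint_variant, above_detection_limit(faint_variant))
```

For methylation analysis the same proportion applies with the C peak as the variant and the T peak as the reference, so the result reads directly as percent methylation at that CpG. In practice the threshold would have to be established per assay.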
pyrosequencing-based analysis of methylation patterns in repetitive elements such as ALU or LINE1 elements (Yang et al. 2004). While LINE1 elements have a relatively conserved sequence, thus allowing the design of a sequence-specific pyrosequencing assay for DNA methylation analysis, methylation of ALU elements is assessed by a cyclic dispensation. These assays have been widely used for the measurement of global DNA methylation changes in response to environmental stimuli (Cortessis et al. 2012).

Application: Allele-Specific DNA Methylation Analysis

Some genes display different methylation patterns on the two alleles, either randomly or in a parent of origin-specific manner. Imprinting control regions regulating the expression of imprinted genes are commonly methylated on only one allele (inherited from the mother or the father) so that only one “parental allele” is expressed. Using a heterozygous SNP to differentiate the two alleles, the methylation status of each allele can be interrogated after enrichment of the methylated molecules using the above-described MSP with primers complementary to a specific DNA methylation pattern. The resulting amplification products are subsequently pyrosequenced, and the ratio of the two alleles after methylation enrichment is quantified by genotyping the two alleles of the SNP (Kristensen et al. 2013).

To analyze methylation on both alleles separately, it is possible to design two pyrosequencing primers, each specific for one allele, using the two alleles of a heterozygous single-nucleotide polymorphism to differentiate them (Wong et al. 2006). The specificity of the allele-specific enrichment can be further improved by modifying the base complementary to the SNP with an LNA (locked nucleic acid). Locked nucleic acids are RNA monomers with a modified backbone. The sugar phosphate backbone has a 2′-O-4′-C methylene bridge. The bridge increases the monomer’s thermal stability, reduces its flexibility, and increases the hybridization interactions of the base with the template, ensuring an amplification of only the exact complementary allele at the chosen temperature.

Summary

Pyrosequencing is a sequencing-by-synthesis, easy-to-use method that can precisely analyze genetic and epigenetic variation in an amplified sequence of up to 350 base pairs. Its applications are wide and various: genotyping, methylation analysis, transcript quantification, bacterial typing, etc. The broad range of applications combined with the above-described advantages has made pyrosequencing a widespread analysis method.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Extraction Methods, Variability Encountered in
▶ Metagenomic Research: Methods and Ecological Applications
▶ NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data

References

Casadesus J, Low D. Epigenetic gene regulation in the bacterial world. Microbiol Mol Biol Rev. 2006;70:830–56.
Cortessis VK, Thomas DC, Levine AJ, Breton CV, Mack TM, Siegmund KD, et al. Environmental epigenetics: prospects for studying epigenetic mediation of exposure-response relationships. Hum Genet. 2012;131:1565–89.
Davis BM, Chao MC, Waldor MK. Entering the era of bacterial epigenomics with single molecule real time DNA sequencing. Curr Opin Microbiol. 2013;16:192–8.
Dejeux E, Audard V, Cavard C, Gut IG, Terris B, Tost J. Rapid identification of promoter hypermethylation in hepatocellular carcinoma by pyrosequencing of etiologically homogeneous sample pools. J Mol Diagn. 2007;9:510–20.
Dupont JM, Tost J, Jammes H, Gut IG. De novo quantitative bisulfite sequencing using the pyrosequencing technology. Anal Biochem. 2004;333:119–27.
How Kit A, Nielsen HM, Tost J. DNA methylation based biomarkers: practical considerations and applications. Biochimie. 2012;94:2314–37.
Karimi M, Johansson S, Ekström TJ. Using LUMA: a Luminometric-based assay for global DNA-methylation. Epigenetics. 2006;1:45–8.
Kristensen LS, Treppendahl MB, Asmar F, Girkov MS, Nielsen HM, Kjeldsen TE, et al. Investigation of MGMT and DAPK1 methylation patterns in diffuse large B-cell lymphoma using allelic MSP-pyrosequencing. Sci Rep. 2013;3.
Madi T, Balamurugan K, Bombardi R, Duncan G, McCord B. The determination of tissue-specific DNA methylation patterns in forensic biofluids using bisulfite modification and pyrosequencing. Electrophoresis. 2012;33:1736–45.
Marsh S, editor. Pyrosequencing protocols, methods in molecular biology vol 373. Totowa: Humana Press; 2007.
Ogino S, Kawasaki T, Brahmandam M, Yan L, Cantor M, Namgyal C, Mino-Kenudson M, Lauwers GY, Loda M, Fuchs CS. Sensitive sequencing method for KRAS mutation detection by pyrosequencing. J Mol Diagn. 2005;7:413–21.
Paliwal A, Vaissière T, Herceg Z. Quantitative detection of DNA methylation states in minute amounts of DNA from body fluids. Methods. 2010;52:242–47.
Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J. Metagenomic pyrosequencing and microbial identification. Clin Chem. 2009;55:856–66.
Shaw RJ, Akufo-Tetteh EK, Risk JM, Field JK, Liloglou T. Methylation enrichment pyrosequencing: combining the specificity of MSP with validation by pyrosequencing. Nucleic Acids Res. 2006;34:e78.
Tost J. DNA methylation: an introduction to the biology and the disease-associated changes of a promising biomarker. Mol Biotechnol. 2009;44:71–81.
Tost J, Gut IG. DNA methylation analysis by pyrosequencing. Nat Protoc. 2007;2:2265–75.
Tost J, Elabdalaoui H, Gut IG. Serial pyrosequencing for quantitative DNA methylation analysis. Biotechniques. 2006;40:721–6.
Ugolotti E, Vanni I, Raso A, Benzi F, Malnati M, Biassoni R. Human leukocyte antigen–B (-Bw6/-Bw4 I80, T80) and human leukocyte antigen–C (-C1/-C2) subgrouping using pyrosequence analysis. Hum Immunol. 2011;72:859–68.
Wong H-L, Byun H-M, Kwan J, Campan M, Ingles S, Laird P, et al. Rapid and quantitative method of allele-specific DNA methylation analysis. Biotechniques. 2006;41:734–9.
Yang AS, Estécio MRH, Doshi K, Kondo Y, Tajara EH, Issa J-PJ. A simple method for estimating global DNA methylation using bisulfite PCR of repetitive DNA elements. Nucleic Acids Res. 2004;32:e38.
Yang B, Wagner J, Yao T, Damaschke N, Jarrard DF. Pyrosequencing for the rapid and efficient quantification of allele-specific expression. Epigenetics. 2013;8:1039–42.
E
E 144 Environmental Shaping of Codon Usage
represented in the environment, based on similarity searches against known microbial species’ sequences (Huson et al. 2007).

For a thorough understanding of microbial communities at the systems level, it is necessary to capture the interplay of community constituents and organizational complexity in the community metabolism. Microbes in the same environment live within the same physical and chemical constraints, such as temperature, pH, or ion concentration, probably causing the GC content to be metagenome specific (Foerstner et al. 2005). Furthermore, communities of microbes have been shown to share tRNA pools to facilitate horizontal gene transfer (Tuller et al. 2011), which also implies a limited choice of preferred cognate codons within the shared tRNA pool. It has also been shown that fast growth rates introduce stronger bias in synonymous codon usage at the level of whole metagenomes (Vieira-Silva and Rocha 2010), much like the effect observed in single microbial species (Rocha 2004; Sharp et al. 2005).

Microbial communities living under the same environmental constraints can, at the level of genes, effectively be considered and studied as metagenomes, thereby using approaches and methodology valid for single microbial genome studies. One such approach is the functional characterization by translational optimization through synonymous codon usage bias.

The codon usage (CU) bias within a genome reflects the selection pressure for translational optimization of highly expressed genes, primarily the protein synthesis machinery such as ribosomal genes and elongation factors, but also genes with environmental adaptation functions (Supek et al. 2010). At the level of a single microbial genome, the effect of CU bias is routinely used to predict functionally relevant and highly expressed genes (Sharp and Li 1987; Karlin and Mrazek 2000; Plotkin and Kudla 2011). The choice of preferred codons in a single genome is most closely correlated with abundance of the cognate tRNA molecules (Ikemura 1985; Kanaya et al. 2001; Tuller et al. 2010) and further influenced by the genome’s GC content (Chen et al. 2004).

Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Table 1 Metagenomes used to demonstrate the concept of environmental shaping of codon usage

Metagenome                                                                NCBI Project ID   Reference
Global Ocean Sampling Expedition Metagenome, the Sargasso Sea version 1   13694             (Venter et al. 2004)
Waseca County farm soil metagenome                                        13699             (Tringe et al. 2005b)
Whale fall metagenomes                                                    13700
5-way (CG) acid mine drainage biofilm metagenome                          13696             (Tyson et al. 2004)
Human distal gut biome                                                    16729             (Gill et al. 2006)
Lean mouse 1 gut metagenome                                               17391             (Turnbaugh et al. 2006)
Obese mouse 1 gut metagenome                                              17397
US EBPR sludge metagenome                                                 17657             (Martin et al. 2006)
OZ EBPR sludge metagenome                                                 17659

Eleven different microbial community sequencing samples (Table 1) were used to demonstrate that microbes living in the same ecological niche, regardless of their phyletic diversity, share a common preference for codon usage. CU bias is present at the community level and is also different between distinct communities. CU bias also varies within the community, with distributions resembling that of single microbial species, i.e., the intercommunity CU bias can be observed. The effects of intercommunity CU bias and translational optimization concepts are utilized to identify genes with CU close to that of the meta-ribosomal sample. These genes have high predicted expression across the entire microbial community and define its “functional fingerprint.” This approach establishes a functional metagenomic platform that enables functional studies at the level of the entire microbial community samples.

Description

Microbes living in the same ecological niche share a bias in CU. When comparing the distance
Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Fig. 1 Codon usage is metagenome specific. Soil versus human gut metagenome codon usage (CU) frequencies. (a) The distance (MILC) of each gene’s CU frequency to overall CU frequencies of two microbial communities. Genes (red in human gut (N = 33,422) and blue in Waseca soil (N = 88,696) metagenome) are predominantly closer to their respective metagenome of origin, therefore forming two distinct groups (the distribution of log2 ratio of the two distances for each gene is shown in the inset). If the amino acid composition of metagenomes is kept constant and the codons are randomly chosen, CU bias of each metagenome would be eliminated, resulting in uniform distribution of CU distances and overlap of two samples, as shown in (b)

Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Fig. 2 Codon usage variability between same species in different metagenomes is larger than within a metagenome. ORFs from each identified species (using MEGAN) were compared against their originating metagenome (orange, total comparisons N = 2,058) and against same-species ORFs in a different metagenome (green, total comparisons N = 1,029). ICC measures were calculated, representing how “close” the CU profiles match, with ICC = 1 denoting the perfect match. The orange distribution shows less variability and is shifted toward higher ICC values, denoting the closer overall match of species’ CU to their metagenome of origin
the constrained environmental conditions. R. palustris samples show an overall higher variability in CU, suggesting plasticity of codon usage that reflects translational optimization and adapts to each specific environment. Even though the R. palustris strains generally show more variation in CU (Fig. 3), both species, regardless of environmental constraints, show the least relative variation of CU within the COG categories (i.e., orthologous genes) for housekeeping, including ribosomal protein genes.

The Variability of Codon Usage in Metagenomes upon Removal of Dominant Phyla
Community-level codon usage bias is not an effect caused by the most abundant species. CU frequencies of the Sargasso Sea metagenome, the largest dataset in this study, were compared to other investigated metagenomes and to itself but with dominant phyla removed. The comparisons between Sargasso Sea CU frequencies and other metagenomes all show ICC < 0.75, while the same Sargasso sample with dominant phyla of the Alphaproteobacteria class removed (36% of the whole set) and the Alphaproteobacteria class itself show virtually no deviation (ICC > 0.98 and 0.95, respectively) from the original metagenome CU.

Codon Usage in Metagenomes Follows Similar Patterns as in Single Microbial Genomes
As has been established at the level of single microbial genomes (Ikemura 1985; Kanaya et al. 2001), the distance of each gene’s CU frequency to the overall CU of the whole genome and to that of a “reference set” of highly expressed genes (ribosomal protein genes) gives a characteristic crescent-shaped plot (Fig. 4a, introduced by Karlin and Mrazek 2000). Metagenomes exhibit similar CU distance distributions to those observed in single bacterial genomes, despite the fact that they comprise genes that originate from diverse phylogenies (e.g., Santa Cruz whale carcass bone in Fig. 4b). If the amino acid composition of genes in a metagenome is kept constant but the codons are randomly chosen, the crescent plot shape analogous to single bacterial genomes and CU bias is lost.
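
The randomization control described above (amino acid composition kept constant, codons chosen at random) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the standard genetic code is assumed, synonymous codons are drawn uniformly, and input ORFs are assumed to be uppercase A/C/G/T in frame.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard table is adequate for the ORFs being analyzed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(orf):
    """Split an in-frame ORF into codons, ignoring a trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def translate(orf):
    """Translate an ORF into its amino acid string ('*' marks a stop)."""
    return "".join(CODON_TO_AA[c] for c in split_codons(orf))


def codon_frequencies(orfs):
    """Relative frequency of each codon over a collection of ORFs."""
    counts = Counter(c for orf in orfs for c in split_codons(orf))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def randomize_codons(orf, rng):
    """Replace every codon with a uniformly chosen synonymous codon.

    The encoded protein is unchanged, so amino acid composition stays
    constant while any codon usage bias is erased.
    """
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in split_codons(orf))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    gene = "ATGGCTGCTGCTCTGCTGCTGAAAGAATAA"
    shuffled = randomize_codons(gene, rng)
    print(translate(gene) == translate(shuffled))  # protein is preserved
    print(codon_frequencies([gene]))
    print(codon_frequencies([shuffled]))
```

Applying `randomize_codons` to every ORF of a (meta)genome and recomputing the codon frequencies gives a null set against which the observed CU distances can be compared.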
Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Fig. 3 Environmental variability of codon usage. Variability of codon usage per COG category in 6 strains of Rhodopseudomonas palustris and in 12 strains of Propionibacterium acnes. The codon usage variability (calculated as median CU distance from the ribosomal set within an orthologous group to its centroid CU) for the strains of P. acnes (N = 15,436), living in consistent environmental conditions, is shifted to the left, i.e., it shows smaller variation and higher bias than for the R. palustris strains (N = 24,071) living in diverse environmental conditions

Predicting Metagenomic Expression and Functional Profiles Through Synonymous Codon Usage
Under different environmental constraints, CU varies in single bacterial species, and metagenomes share synchronized CU as do single bacterial species. CU bias in metagenomes can be used to predict the expression levels of genes in the same manner as is routinely used to predict genes optimized for high levels of expression in single microbial genomes (Sharp and Li 1987; Karlin and Mrazek 2000; Supek and Vlahovicek 2005). Figure 5 depicts the resulting predictions at the level of whole metagenomes using the meta-ribosomal protein reference set. The most significantly enriched functions in the high expression level sets are (i) amino acid transport and metabolism (COG supercategory E) for Sargasso Sea, (ii) energy production and conversion (COG supercategory C) for the Whale fall metagenomes, and (iii) inorganic ion transport and metabolism (COG supercategory P) for the acid mine biofilm metagenome. The most striking difference between metagenomes was the lack of enrichment in energy production and carbohydrate metabolism (COG supercategories C and G) in the obese mice microbiota sample, in contrast to both lean human and mouse microbiota samples, indicating high metabolic activity of lean gut bacteria.
Artificial metagenomes, constructed from randomly selected genes of whole genome bacterial sequences from the NCBI with the same COG composition as their corresponding microbial samples, show loss of environment-specific enrichment of optimization in their expression profiles.

Validation with Metaproteomic Data
Predictions of gene expression for the Sargasso Sea metagenome were compared to the Sargasso Sea metaproteomic study (Sowell et al. 2008) and a functionally (COG) classified subset of the human gut metaproteomic study (Verberkmoes et al. 2009). Predicted expression values based on CU optimization positively correlate with abundance in metaproteomic studies, both for the comparison of each gene with the protein most similar in sequence (Sargasso Sea rho = 0.34) and when median values per gene and protein COG are compared (human gut rho = 0.34). This opens the way for in silico prediction of overall metagenomic proteome status.

Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Fig. 4 Metagenomes show codon usage distribution similar to single genomes. The distance of each gene’s codon usage (CU) frequency from the overall CU of the (meta)genome and ribosomal reference set, displayed as a Karlin B-plot for (a) a single microbial genome (Escherichia coli, N = 4,358) and (b) a metagenome (whale carcass, N = 33,422). The metagenome shows the same characteristic distribution as the genome with ribosomal genes closer to the CU of the ribosomal set than the overall CU of the whole (meta)genome

Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities, Fig. 5 Enrichment of functions within highly expressed genes in metagenomes. Enrichment or depletion of functional annotations in the 3% genes with highest predicted expression (highest MELP measure) relative to the abundance of each COG supercategory in the whole metagenome for the OZ EBPR sludge (N = 29,754), Waseca farm soil (N = 88,696), acid mine biofilm (N = 79,257), Sargasso Sea (N = 688,539), US EBPR sludge (N = 20,175), Whale fall Santa Cruz microbial mat (N = 40,916), Whale fall Antarctic bone (N = 30,503), Whale fall Santa Cruz bone (N = 33,422), obese mouse gut (N = 4,058), lean mouse gut (N = 4,955), human gut (N = 47,765), Santa Cruz whale fall bone (N = 33,422), and acid mine (N = 79,257). Metagenomes show different functional enrichment patterns that are consistent with environmental requirements (e.g., metabolite transport functions [E] in the Sargasso Sea or energy conversion [C] in the whale carcass metagenome). Letters at the bottom represent COG supercategories

Summary

Analysis of eleven distinct metagenomes shows that microbial communities exhibit codon usage bias similar to that already described for single microbial species. Microbial communities sharing an environment are likely to have similar synonymous codon usage-based translational optimization for expression of environment-specific genes. This effect can be used to identify genes with unknown function and “optimal” codon encoding, indicating their potential for high expression and therefore high relative importance in the community metabolism and lifestyle.

References

Bruggemann H, Henne A, Hoster F, Liesegang H, Wiezer A, Strittmatter A, et al. The complete genome sequence of Propionibacterium acnes, a commensal of human skin. Science. 2004;305:671–3.
Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. Codon usage between genomes is constrained by genome-wide mutational processes. Proceedings of the National Academy of Sciences of the United States of America. 2004;101:3480–5.
Foerstner KU, von Mering C, Hooper SD, Bork P. Environments shape the nucleotide composition of genomes. EMBO reports. 2005;6:1208–13.
Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Hunyadkurti J, Feltoti Z, Horvath B, Nagymihaly M, Voros A, McDowell A, et al. Complete Genome Sequence of Propionibacterium acnes Type IB Strain 6609. J Bacteriol. 2011;193:4561–2.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Research. 2007;17:377–86.
Ikemura T. Codon Usage and Transfer-RNA Content in Unicellular and Multicellular Organisms. Molecular Biology and Evolution. 1985;2:13–34.
Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T. Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. Journal of Molecular Evolution. 2001;53:290–8.
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucl Acids Res. 2006;34:D354–7.
Karlin S, Mrazek J. Predicted highly expressed genes of diverse prokaryotic genomes. Journal of Bacteriology. 2000;182:5238–50.
Larimer FW, Chain P, Hauser L, Lamerdin J, Malfatti S, Do L, et al. Complete genome sequence of the metabolically versatile photosynthetic bacterium Rhodopseudomonas palustris. Nature Biotechnology. 2004;22:55–61.
Martin HG, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nature Biotechnology. 2006;24:1263–9.
Oda Y, Larimer FW, Chain PSG, Malfatti S, Shin MV, Vergez LM, et al. Multiple genome sequences reveal adaptations of a phototrophic bacterium to sediment microenvironments. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:18543–8.
Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet. 2011;12:32–42.
Rocha EPC. Codon usage bias from tRNA’s point of view: Redundancy, specialization, and efficient decoding for translation optimization. Genome Research. 2004;14:2279–86.
Sharp P, Li W. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15(3):1281–95.
Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research. 2005;33:1141–53.
Sowell SM, Wilhelm LJ, Norbeck AD, Lipton MS, Nicora CD, Barofsky DF, et al. Transport functions dominate the SAR11 metaproteome at low-nutrient extremes in the Sargasso Sea. ISME J. 2008;3:93–105.
Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annual Review of Microbiology. 1985;39:321–46.
Supek F, Škunca N, Repar J, Vlahoviček K, Šmuc T. Translational Selection Is Ubiquitous in Prokaryotes. PLoS Genet. 2010;6:e1001004.
Supek F, Vlahovicek K. Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity. BMC Bioinformatics. 2005;6:15.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:14.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative metagenomics of microbial communities. Science. 2005a;308:554–7.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative Metagenomics of Microbial Communities. Science. 2005b;308:554–7.
Tuller T, Carmi A, Vestsigian K, Navon S, Dorfan Y, Zaborske J, et al. An Evolutionarily Conserved Mechanism for Controlling the Efficiency of Protein Translation. Cell. 2010;141:344–54.
Tuller T, Girshovich Y, Sella Y, Kreimer A, Freilich S, Kupiec M, et al. Association between translation efficiency and horizontal gene transfer within microbial communities. Nucleic Acids Research. 2011;39:4743–55.
Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–31.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.
Verberkmoes NC, Russell AL, Shah M, Godzik A, Rosenquist M, Halfvarson J, et al. Shotgun metaproteomics of the human distal gut microbiota. ISME Journal. 2009;3:179–89.
Vieira-Silva S, Rocha EPC. The Systemic Imprint of Growth and Its Uses in Ecological (Meta)Genomics. PLoS Genet. 2010;6:e1000808.

Evaluating Putative Chimeric Sequences from PCR-Amplified Products

Juan M. Gonzalez
Instituto de Recursos Naturales y Agrobiologia, IRNAS-CSIC, Seville, Spain

Introduction

The term chimera has its origins in Greek mythology, where it denotes a creature composed of body parts from different living beings. In molecular biology, a chimeric sequence or chimera is a DNA sequence composed of DNA fragments originating from two or more genes or genomes.
Chimeric sequences can be generated naturally during DNA recombination, which occurs within a genome or when an organism takes up foreign DNA. These processes of crossover recombination are of interest in phylogenetic and evolutionary studies and need to be identified (Posada and Crandall 2002). Nevertheless, chimeras represent a serious problem when they are generated as artifacts during DNA manipulation and/or analysis.
Chimeric artifacts can be produced at different stages of experimental DNA studies. Examples relate to cloning procedures, DNA amplification, and DNA assembly during computational analysis (Fig. 1).
During DNA library preparation, genomic DNA is generally broken down into small fragments, which are introduced into cloning vectors or sequenced independently (Sambrook and Russell 2001). These fragments are generated by physical or enzymatic means. The generation of overlapping strand endings can lead to the random fusion of DNA fragments, resulting in chimeras that can be detected upon sequencing (Fig. 1a).
DNA amplification procedures are by far the most frequently reported processes generating chimeric sequences. Most amplification procedures are prone to generate chimeras. The most studied case is the polymerase chain reaction (PCR), in which multiple copies of a target DNA region are produced through a cycling amplification reaction. The amplification is exponential, and errors arising during the reaction can be greatly amplified by the end of the PCR (Fig. 1b). For a variety of reasons, an incompletely amplified target fragment can act as a primer in the next cycle, potentially giving rise to a DNA fragment derived from two or more different DNA templates. Chimeric sequences can be generated during PCR amplification of any gene, although the most studied case is that of ribosomal RNA genes (rRNAs), which present both highly conserved and variable regions within their sequences. The rRNAs are present in every organism because cells require them for protein synthesis. In microbiology, rRNAs are widely used to detect and classify microorganisms; because most of these microorganisms are unculturable and cannot be detected otherwise, the rRNAs are, at present, the only means to survey for these microbes. A chimera would represent a nonexistent microorganism, and so treating chimeras as real sequences can lead to serious overestimations of microbial diversity in environmental studies (Hugenholtz and Huber 2003; Gonzalez et al. 2005). Thus, it is of utmost importance to detect and filter out chimeric DNA sequences.

Evaluating Putative Chimeric Sequences from PCR-Amplified Products, Fig. 1 Scheme of different possibilities of potential chimera formation during DNA cloning and library preparation (a), PCR amplification (b), and computer processing of assembling DNA fragments (c). Examples are presented of chimeras formed during libraries aimed at both vector cloning (left in a) and direct sequencing (right in a). During PCR amplification (b), incomplete synthesis of the target DNA fragment can lead in the next cycle to annealing to a different target with conserved regions and result in its extension using a different DNA target. The consequence is the generation of a chimera resulting from two different DNAs. During computation of the assembly of small DNA fragments obtained through sequencing (c), different possibilities could be similarly valid and some of them can be chimeras (in the example, fragments A B C D and A E C F can be assembled into the chimeras A B C F and A E C D)

In addition to the potential to generate chimeras during DNA manipulation, the possibility of producing chimeras during computer processing of DNA sequences should be considered. Small DNA fragments forming DNA libraries are sequenced on a variety of sequencing platforms. These sequences are assembled into larger fragments of gene or genomic DNA. During this assembly, a potential exists to produce a chimeric final sequence (Fig. 1c). Above all, this can occur at the ends of assembled DNA fragments, generally induced by the presence of repetitive sequences (which often cause trouble during the assembly process) or by chimeras formed during early DNA manipulation steps or library preparation. These assembly errors can also truncate the generation of larger contigs or fragments of genomic DNA during the assembly. The assembly of DNA fragments from different organisms into a single DNA sequence is a risk when working with DNA surveys of complex communities, for instance, metagenomes, that is, genomic studies of complex microbial communities (Mende et al. 2012).
Independently of the step at which chimeric sequences have been generated, they need to be detected and filtered out before further analysis. Numerous strategies and pieces of software have been proposed. Herein, the case of rRNAs is used as an example, as most studies on chimera evaluation have been carried out on these genes.

Chimeras and Microbial Diversity

Most surveys of the composition of microbial communities in natural environments are performed through a PCR amplification step (Gonzalez et al. 2012). Generating a high number of fragments from the rRNAs (rRNA amplicons) represented in a community is a step prior to library preparation and sequencing (Wintzingerode et al. 1997; Roesch et al. 2007). At present, microbial communities are understood to be composed of a highly diverse set of microorganisms, most of which remain unculturable (Curtis et al. 2002). If microorganisms cannot be cultured in the laboratory, the only means to analyze their potential features is through their nucleic acids. Due to the complexity of genomes, accurate taxonomic classification of microorganisms can only be performed with a small number of genes; the most frequently used are the rRNAs. Extensive databases have been built with rRNAs, and today these genes represent the primary way to classify microorganisms that are difficult to differentiate otherwise, either by morphology or physiological traits.
The rRNAs combine highly conserved and variable regions. Thus, partial synthesis of these genes during PCR amplification can lead to a DNA fragment able to anneal to a different rRNA sequence in a complex mixture of DNAs. Annealing of that incomplete DNA fragment to a target DNA from a different organism and its extension in the same PCR cycle will result in the formation of a hybrid rRNA sequence. This rRNA sequence originates from portions of sequences from different microorganisms (Fig. 1b). Subsequent PCR cycles will generate multiple copies of that artifact. The result is the generation of chimeras, undesired artifacts that need to be detected and eliminated prior to further analysis.
The presence of chimeras in DNA databases has been previously reported (Hugenholtz and Huber 2003; Ashelford et al. 2005; Gonzalez et al. 2005), which has a negative effect when users attempt to classify microorganisms by their rRNAs. About 5 % of rRNA gene sequences can represent suspicious or potential chimeras (Ashelford et al. 2005; Haas et al. 2011). The use of curated rRNA-specific databases is recommended. Databases such as RDP (Ribosomal Database Project; Cole et al. 2009), Greengenes (DeSantis et al. 2006), and SILVA (Quast et al. 2013) (Table 1) have curated entries. These repositories ensure the lack of chimeras and so a realistic approximation to the identification of microorganisms through amplicon sequencing.
In spite of the potential for chimeras in environmental microbial surveys, current understanding of these communities suggests a huge microbial diversity (Curtis et al. 2002). This enormous diversity suggests that chimera detection is more complex than expected. However, the existence of a large set of sequences from microbial rRNAs can be an ally for increasing accuracy in detecting chimeras. Only by knowing what is real can one be in a position to discard what is unreal or chimeric (Gonzalez et al. 2005).
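
The PCR mechanism described above (an incompletely extended product annealing to a different template through a conserved region and being extended on it) can be mimicked with a toy string model. The sketch below is only an illustration under simplifying assumptions: annealing is reduced to an exact match of the last `overlap` bases, and the function name, parameters, and sequences are hypothetical.

```python
def form_chimera(template_a, template_b, stop, overlap=8):
    """Join an aborted extension of template_a onto template_b.

    template_a[:stop] models a product whose synthesis stopped early. In the
    next cycle it acts as a primer: it can anneal to template_b only where its
    last `overlap` bases match template_b exactly (a conserved region).
    Returns the chimeric sequence, or None if there is no annealing site.
    """
    partial = template_a[:stop]
    if len(partial) < overlap:
        return None
    anchor = partial[-overlap:]
    site = template_b.find(anchor)
    if site == -1:
        return None
    # Extension continues along template_b downstream of the annealed region.
    return partial + template_b[site + overlap:]


if __name__ == "__main__":
    conserved = "GTGCCAGCAGCCGCGGTAA"           # region shared by both templates
    organism_a = "AAAACCCC" + conserved + "TTTTGGGG"
    organism_b = "CCTTAAGG" + conserved + "GACTGACT"

    # Synthesis on organism_a aborts inside the conserved region.
    chimera = form_chimera(organism_a, organism_b, stop=20)
    print(chimera)

    # An abort inside a variable region finds no annealing site on organism_b.
    print(form_chimera(organism_a, organism_b, stop=8))
```

In this toy model the product carries the variable region of one template upstream of the conserved region and the variable region of the other downstream, which is the hybrid structure that chimera detection methods look for.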
Evaluating Putative Chimeric Sequences from PCR-Amplified Products, Table 1 Some resources focused on rRNAs including database and software suites incorporating options and tools for chimera detection

Name                             | Chimera check procedure | Database/software  | Link                                    | Reference
Ribosomal Database Project (RDP) | Pintail                 | Database and tools | http://rdp.cme.msu.edu                  | Cole et al. 2003, 2009
SILVA                            | Pintail                 | Database and tools | http://www.arb-silva.de                 | Quast et al. 2013
Greengenes                       | Bellerophon             | Database and tools | http://greengenes.lbl.gov               | DeSantis et al. 2006
Mothur                           | Various (a)             | Software suite     | http://www.mothur.org                   | Schloss et al. 2009
QIIME                            | ChimeraSlayer           | Software suite     | http://qiime.org                        | Caporaso et al. 2010
AmpliconNoise                    | Perseus                 | Software suite     | http://code.google.com/p/ampliconnoise/ | Quince et al. 2011

(a) Various options are available: Bellerophon, Ccode, Pintail, ChimeraSlayer, Uchime, Perseus

Evaluating Putative Chimeric Sequences from PCR-Amplified Products, Table 2 Some of the latest software alternatives for chimera detection in sequence data

Program       | Link                                                  | Reference
Bellerophon   | http://comp-bio.anu.edu.au/bellerophon/bellerophon.pl | Hugenholtz and Huber 2003
Ccode         | http://www.microextreme.net/downloads.html            | Gonzalez et al. 2005
Pintail       | http://www.mybiosoftware.com/rna-analysis/1262        | Ashelford et al. 2005
WigeoN        | http://microbiomeutil.sourceforge.net/#A_WigeoN       | Haas et al. 2011
Decipher      | http://decipher.cee.wisc.edu/FindChimeras.html        | Wright et al. 2011
ChimeraSlayer | http://microbiomeutil.sourceforge.net/#A_CS           | Haas et al. 2011
Uchime        | http://drive5.com/uchime/uchime_download.html         | Edgar et al. 2011
Perseus       | http://code.google.com/p/ampliconnoise/               | Quince et al. 2011
In fact, the large diversity of microorganisms known so far can provide a range of variability within specific microbial taxa.
As microbial taxonomy and the sequences of rRNAs become increasingly defined and curated, the detection of chimeric rRNAs is gaining accuracy. Thus, curated and extensive rRNA databases will definitely help both to avoid flagging real sequences as chimeras and to improve the accurate detection of unreal sequences as chimeras.

Chimera Evaluation

Different procedures have been published to check for chimeras in newly generated DNA sequences. A long list of programs has been proposed to check or detect chimeras. Table 2 presents some of those alternatives, with the original publication and a link to the homepage of each. As mentioned above, most of these studies have been carried out to detect chimeras in DNA fragments obtained from PCR amplification, specifically on rRNA genes.
Originally, a simple method to intuitively and approximately detect a potential chimera was to search independently for homologues to the initial and final portions of the DNA fragment. This search was usually performed with BLAST (Altschul et al. 1990). If the searches returned different organisms for the initial and final portions of the DNA fragment, the fragment was suspected of being a chimera (Cole et al. 2003). More sophisticated approaches have been designed through the years. A fruitful method was to analyze potential chimeras by comparison to the sequences obtained from the rRNA gene library being sequenced and analyzed (Hugenholtz and Huber 2003). Similar analyses can be carried out against full DNA databases or repositories (Ashelford et al. 2005; Quast et al. 2013). Further improvements included the analysis of the query sequence in relation to the known sequences showing highest homology, for instance, within a taxonomic group. These known sequences defined the variability for small portions of the DNA fragment under analysis, so sequences showing higher dispersion than the limit set by known sequences were identified as potential chimeras, and these assessments included statistical results of the computational analysis (Ashelford et al. 2005; Gonzalez et al. 2005). New procedures to screen for chimeras are periodically proposed, mainly analyzing portions of the DNA fragment (Wright et al. 2011) and checking whether DNA database searches return different results. A DNA fragment is proposed as a chimera if it presents different homology results for different portions throughout its length.
As a result of next-generation sequencing (NGS) platforms, large numbers of sequences are being generated through whole-library sequencing. Screening such amounts of data would not be possible without the latest developments and the recent design of pipelines for the analysis of large data sets of DNA amplicon sequences (Schloss et al. 2009; Caporaso et al. 2010; Quince et al. 2011). The inclusion of chimera-checking procedures within these pipelines (Table 1) has greatly facilitated the analysis of massive sequencing data. Nevertheless, the newly introduced algorithms are masked by the advantages offered by the whole pipelines and the easy handling of large sequencing data (Quince et al. 2011). One should confirm that the computational pipeline used to process sequencing data includes a chimera filtering procedure. In addition, some of these pipelines offer the possibility of using different databases. The inclusion of curated databases in these analyses is an important point to consider.
Amplicon sequencing is still the most used procedure for microbial surveys through rRNAs. The detection of potential chimeras during these studies is required to avoid the false consideration of nonexistent microorganisms and an overestimation of microbial diversity. Current pipelines for the processing of amplicon sequencing data incorporate chimera screening and filtering procedures. Databases must continue their current effort to evaluate newly deposited sequences for potential chimeras. Curated rRNA databases are a required reference for the taxonomic classification of microorganisms through sequencing data. These efforts will result in more accurate detection of chimeras, a significant decrease in misclassifications due to erroneous sequences included in databases, and improved knowledge of microbial species, gene, and genomic diversity.

Future Perspectives

As NGS is attracting most research on genomics, metagenomics, transcriptomics, and amplicon sequencing surveys, the massive data generated and the work needed to process these results are increasing exponentially. High-throughput procedures are required to cope with this demand. The use of current pipelines, or future improvements, should build a standard for amplicon sequencing. The detection of sequencing errors through bioinformatic algorithms should also be introduced into these high-throughput pipelines, all aiming to obtain clean and accurate data prior to further analysis. The screening and curation performed by public repositories must continue in spite of the developments in pipelines and algorithms to ensure that databases remain as clean as possible of chimeric and erroneous sequences. At a time when sequencing analyses are no longer manually edited, algorithms to automatically filter out chimeras and the required curation at the scientist and database ends will become increasingly relevant.
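
The original split-and-compare check described under Chimera Evaluation (searching independently for homologues of the initial and final portions of a fragment) can be sketched as follows. This is a simplified illustration, not any published tool: shared k-mer counts stand in for a BLAST search, and the function names, the reference dictionary, and the toy sequences are all hypothetical.

```python
def kmers(seq, k=4):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(fragment, references, k=4):
    """Name of the reference sharing the most k-mers with the fragment.

    Assumption: shared k-mer count is a crude substitute for the homology
    search (e.g., BLAST) that a real analysis would use.
    """
    frag_kmers = kmers(fragment, k)
    return max(references, key=lambda name: len(frag_kmers & kmers(references[name], k)))


def looks_chimeric(query, references, k=4):
    """Compare the best matches of the first and second halves of query.

    Returns (flag, head_match, tail_match); flag is True when the two halves
    point to different reference organisms.
    """
    half = len(query) // 2
    head = best_match(query[:half], references, k)
    tail = best_match(query[half:], references, k)
    return head != tail, head, tail


if __name__ == "__main__":
    references = {
        "organism_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACGA",
        "organism_B": "TTGACCGGTTAACCGGATATCGCGCTAGCTAG",
    }
    # Artificial chimera: first half from organism_A, second half from organism_B.
    chimera = references["organism_A"][:16] + references["organism_B"][16:]
    print(looks_chimeric(chimera, references))
    print(looks_chimeric(references["organism_A"], references))
```

A fixed midpoint is the simplest possible choice; the later methods cited above examine several portions along the fragment and add statistical assessment, which this sketch does not attempt.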
Extended Local Similarity Analysis (eLSA) of Biological Data 155 E
Acknowledgments The author acknowledges funding Quast C, Pruesse E, Yilmaz P, et al. The SILVA ribosomal
from the Spanish Ministry of Economy and Competitive- RNA gene database project: improved data processing
ness, project CONSOLIDER CSD2009-00006, which and web-based tools. Nucl Acids Res. 2013;41:
includes participation of Feder funds. D590–6.
Quince C, Lanzen A, Davenport RJ, Turnbaugh
PJ. Removing noise from pyrosequenced amplicons.
References BMC Bioinforma. 2011;12:38.
Roesch LFW, Fulthorpe RR, Riva A, et al.
Pyrosequencing enumerates and contrasts soil micro-
Altschul SF, Gish W, Miller W, Myers EW, Lipman bial diversity. ISME J. 2007;1:283–90.
DJ. Basic local alignment search tool. J Mol Biol. Sambrook JJ, Russell DDW. Molecular cloning.
1990;215:403–10. A laboratory manual. Cold Spring Harbor: Cold Spring
Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Harbor Laboratory Press; 2001. E
Extended Local Similarity Analysis (eLSA) of Biological Data

Fengzhu Sun and Li Charlie Xia
Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Dana and David Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, USA

Synonyms

Local association analysis; Local similarity analysis

Introduction

The advances in high-throughput low-cost experimental technologies have made possible time series studies of hundreds or thousands of biological factors simultaneously. The availability of such
datasets leads to an increased interest in profile similarity analysis techniques that can identify significant association patterns that may carry biological insight. In the context of metagenomics, factors of particular interest are operational taxonomic units (OTUs), microbial genomes, and environmental genes. Their association patterns may suggest microbe-environment interactions, symbiotic relationships, and other types of interactions.

Many computational or statistical approaches exist to study profile similarity at the global scale, such as Pearson's correlation coefficients (PCC), Spearman's correlation coefficients (SCC), principal component analysis (PCA), multidimensional scaling (MDS), discriminant function analysis (DFA), and canonical correlation analysis (CCA). However, in many biological settings, the interaction may be active within only certain subintervals, or the response to regulation may be time lagged. Methods based on the global relationships of profiles may fail to detect these interactions. The extended local similarity analysis (eLSA) method was developed specifically to capture local and potentially time-delayed co-occurrence and association patterns in time series data that cannot otherwise be identified by ordinary correlation analysis.

Description

Local Association with Possible Time Delays
Local association refers to an association that occurs only in a subinterval of the time of interest. Time-delayed association indicates that there is a time lag in the response of one factor to the change in another factor. As an example of local association, in Fig. 1, the top-left panel shows two series X and Y with nonsignificant correlation (r = 0.26, P = 0.273); however, they are in fact significantly correlated in the time interval from 7 to 16, as shown in the bottom-left panel (eLS = 0.43, P = 0.028). As an example of time-delayed local association, in Fig. 1, the top-right panel shows two series X and Y with nonsignificant correlation (r = 0.26, P = 0.272); however, they are in fact significantly correlated in the time interval from 4 to 17 if X is shifted three units toward the origin, as shown in the bottom-right panel (eLS = 0.51, P = 0.006).

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 1 Examples of local and time-delayed associations. Top left, two series X and Y with nonsignificant correlation (r = 0.26, P = 0.273); bottom left, they are in fact significantly correlated in the time interval from 7 to 16 (eLS = 0.43, P = 0.028); top right, two series X and Y with nonsignificant correlation (r = 0.26, P = 0.272); bottom right, they are significantly correlated in the time interval from 4 to 17 if X is shifted three units toward the origin (eLS = 0.51, P = 0.006)

Extended Local Similarity Analysis
Extended local similarity analysis (eLSA) is an analysis technique designed to capture local associations, possibly with time delays. eLSA extends the original local similarity analysis technique (Qian et al. 2001; Ruan et al. 2006) and the local shape analysis technique to time series data with replicates (Xia et al. 2011). The computational efficiency of the p-value calculation has also been improved (Xia et al. 2013). Time series data of a pair of factors X and Y with replicates can be expressed as data matrices X[1:m][1:n] and Y[1:m][1:n], where each column holds the samples from one time point and n is the number of time points; each row is a replicate and m is the number of replicates. Given time series data of two factors and a user-constrained delay limit, eLSA uses a dynamic programming algorithm to find the configuration of the data that yields the highest extended local similarity (eLS) score, a similarity metric defined as

\[
\mathrm{eLS}\left(X_{[1:m][1:n]},\, Y_{[1:m][1:n]}\right) = \max_{i,\,j,\,l\ \mathrm{s.t.}\ |i-j| \le D} \left| \frac{1}{n} \sum_{k=0}^{l-1} F\left(X_{[1:m],\,i+k}\right) F\left(Y_{[1:m],\,j+k}\right) \right|
\]

where D is the delay limit and F is the summarizing function for repeated measures (mean, median, etc.). For example, within a delay limit of two units, the first time spot of one series might be aligned to the third time spot of the other series, thus maximizing their eLS.

For a dataset of many factors, eLSA is applied to each pairwise combination of factors in the dataset. Candidate associations are then evaluated statistically by a permutation test, which calculates the p-value as the proportion of scores exceeding the original eLS score after shuffling the first series and reevaluating the eLS score many times, or more efficiently by theoretical approximation. Researchers can use eLSA to detect undirected associations, i.e., association patterns without time delays, and directed associations, where the change of one factor may temporally lead or follow another factor. Figure 2 shows the analysis pipeline of the eLSA technique.

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 2 The eLSA pipeline. Users start with raw data (matrices of time series) as input and specify their requirements as parameters. The LSA tools subsequently F-transform and normalize the raw data and calculate extended local similarity (eLS) scores and Pearson's correlation coefficients. The tools then assess the statistical significance (p-values) of these correlation statistics using the permutation test and filter out insignificant results. Finally, the tools construct a partially directed association network from the significant associations

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 3 An eLSA subnetwork built around γ-proteobacteria OTUs as central nodes (abbreviated Alt Alteromonas, CHB CHABI-7, Gam γ-proteobacterium, S86 SAR86, S92 SAR92)

[...] permutation testing, and network construction. More information about the software is available from eLSA's homepage at http://meta.usc.edu/softs/lsa.
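The eLS score and permutation test described in this entry can be illustrated with a minimal Python sketch. It is not the eLSA software and does not reproduce its interface: the function names `els_score` and `permutation_p_value` are hypothetical, the input series are assumed to be normalized already, and the search over alignments uses a simple scan over each allowed delay rather than the authors' own dynamic programming implementation. The permutation count and the random seed are arbitrary illustrative choices.

```python
import numpy as np


def els_score(X, Y, delay_limit, summarize=np.mean):
    """Extended local similarity score for two replicated time series.

    X, Y: arrays of shape (m replicates, n time points), assumed to be
    normalized already. `summarize` plays the role of F in the definition.
    """
    x = summarize(np.asarray(X, dtype=float), axis=0)
    y = summarize(np.asarray(Y, dtype=float), axis=0)
    n = x.size
    max_delay = min(delay_limit, n - 1)  # a shift of n or more leaves no overlap
    best = 0.0
    for d in range(-max_delay, max_delay + 1):
        # Pair x[i] with y[i + d]; |d| <= D enforces the delay limit.
        if d >= 0:
            products = x[: n - d] * y[d:]
        else:
            products = x[-d:] * y[: n + d]
        # Scan for the contiguous run with the largest accumulated
        # positive (pos) or negative (neg) association.
        pos = neg = 0.0
        for p in products:
            pos = max(0.0, pos + p)
            neg = max(0.0, neg - p)
            best = max(best, pos, neg)
    return best / n


def permutation_p_value(X, Y, delay_limit, n_perm=1000, seed=0):
    """Proportion of shuffled-series scores that reach the observed score."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    observed = els_score(X, Y, delay_limit)
    hits = 0
    for _ in range(n_perm):
        order = rng.permutation(X.shape[1])  # shuffle time points of the first series
        if els_score(X[:, order], Y, delay_limit) >= observed:
            hits += 1
    return hits / n_perm


# Toy data (one replicate each): Y repeats the pattern of X two time points later.
x = np.array([[0.0, 1.0, 2.0, 0.0, 0.0]])
y = np.array([[0.0, 0.0, 0.0, 1.0, 2.0]])
print(els_score(x, y, delay_limit=0))
print(els_score(x, y, delay_limit=2))
print(permutation_p_value(x, y, delay_limit=2, n_perm=200))
```

In this toy example the association only becomes visible once a delay of two units is allowed, which mirrors the time-delayed case in Fig. 1. With real data the series would first be transformed and normalized as outlined in Fig. 2.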
Extraction Methods, Variability Encountered in

[Figure 1, recoverable panel text: Extraction using organic hydrophobic solvents (phenol, chloroform, isoamyl alcohol) causing precipitation of organic cell components. The hydrophilic DNA remains in the aqueous phase, which can be collected after phase separation following centrifugation. DNA is recovered from the aqueous phase by precipitation with ethanol or isopropanol. Collection of the aqueous phase containing DNA after centrifugation and subsequent removal of environmental matrix, including proteins, humic acids and polysaccharides.]

Extraction Methods, Variability Encountered in, Fig. 1 Schematic presentation of the steps and procedures to extract and purify DNA from environmental samples. Step B is the step in all protocols where most biases are introduced
The first step in every DNA-based study is the collection and storage of environmental samples before the DNA is extracted (step A in Fig. 1). Whether the samples are fresh, have been stored cold or frozen, or have been freeze-dried can already give rise to variations in the quality and quantity of the extracted DNA, depending on the environmental matrix and the community composition. However, it has recently been shown using a pyrosequencing approach that the variation introduced by sample storage of soil and human-associated samples was insignificant (Lauber et al. 2010).

After sample storage, two routes can be followed to step B, the liberation of DNA from cells (step B in Fig. 1). Either cells are released from the environmental matrix by shaking or sonication, followed by harvesting by, e.g., density centrifugation with subsequent lysis (indirect extraction), or the cells are lysed in the environmental matrix directly (direct extraction). Generally, direct lysis is preferred because the DNA yield is higher, as no cells are lost during cell extraction and purification. However, especially in metagenomic studies where large intact DNA fragments (>20 kb in size) are required to obtain complete genes, operons, and genomes, it has been shown that the indirect method is preferred and does not lead to a significant difference in overall diversity (Delmont et al. 2011b).

The subsequent liberation of DNA from cells is the step in all extraction protocols where most bias is introduced. Cell walls have to be broken. The efficiency depends on the cell wall structure (gram-positive vs. gram-negative) and the presence of extracellular slime layers composed of polysaccharides and proteins. The lysis methods commonly used, physical, enzymatic, and chemical (Fig. 1), also differ in their efficiency of lysis, giving rise to variability that depends strongly on the community composition in terms of the presence of difficult-to-lyse cells. At this step, too, a choice of method can be made on the basis of the downstream application. The physical disruption techniques (e.g., bead beating) yield low molecular weight DNA (<20 kb) not suitable for metagenome studies. In this case enzymatic lysis methods in combination with detergents are preferred. The use of agarose plugs to perform enzymatic lysis has been shown to be very effective in obtaining high molecular weight DNA (Williamson et al. 2011).

Next to the method of lysis, the environmental matrix is also a source of variation. The extraction and liberation of DNA is always executed in a "lysis buffer." The buffer normally has an alkaline pH (8–9), which reduces electrostatic interactions between DNA and proteins, inhibits enzymes degrading DNA (nucleases), and facilitates denaturing of other proteins. Often a chelating agent (e.g., EDTA) is added to the buffer, which destabilizes cell walls and membranes as well as proteins by binding cations (Ca2+, Mg2+). Besides protecting DNA from degradation once it is liberated, compounds that bind the DNA should be removed before non-DNA components are removed by centrifugation. Humic acids are derived from plant and animal remains by decomposition and are highly diverse in chemical structure. Because their functional groups vary and can adhere more or less strongly to DNA, and because their amount and structure depend on the biota and chemical conditions of the environment, the impact of humic acids on DNA extraction is highly variable. Hence, a large number of compounds (step B, Fig. 1) have been tested and used to bind and remove humic acids already at the stage of liberation of DNA. The latest addition was the use of vitamins (Techer et al. 2010).

Centrifugation removes cell debris and precipitated components, while the supernatant containing the DNA is taken to step C (Fig. 1), which is the extraction of DNA from the remaining organic cell and environmental components. This is done by phase separation using hydrophobic solvents (step C, Fig. 1), keeping the DNA in the aqueous phase, which lies on top of the denser hydrophobic phase containing the remaining cell components. Variability in this step can only come from the quality of the chemicals and the pipetting skills of the researcher. Care has to be taken not to collect any of the hydrophobic phase, which leads to differences in the amounts of aqueous phase collected. The DNA is recovered by precipitation
using ethanol or isopropanol, which destroys the helical structure, leading to precipitation. After resuspension in water or buffer, the DNA can be ready for use in various analyses of abundance, diversity, or genomic procedures, or it has to be additionally cleaned up to remove any remaining impurities as indicated in step D (Fig. 1). The potential additional variation introduced here is that loss of DNA can occur, leading to changes in relative abundances, with some species no longer reaching the detection limits of the respective downstream method. Hence, when DNA yield from samples is low, additional cleanup is often not an option. At this step, too, some procedures are more applicable when high molecular weight (HMW) DNA is preferred. A procedure in which direct current and pulsating nonlinear currents are alternated in gel electrophoresis has been shown to be very effective in purifying HMW DNA from soil (Engel et al. 2012).

The last step before downstream analyses is the quality control and quantification of the DNA concentration. UV spectrophotometry is most often used as an indicator of purity, where the ratio of absorbance at wavelengths 260/280 nm should be about 1.8 to 2 when DNA is free from proteins or humic acids. The NanoDrop device is mostly used for this purpose because it only requires a few μl of the precious extracted DNA. However, the spectrophotometric methods suffer from the fact that co-extracted RNA is also measured and that humic acids also lead to absorbance, eventually overestimating the amount of DNA in the extract. Alternative methods based on fluorescent dyes binding to double-stranded DNA can be used, which only detect DNA but which are also sensitive to interference by humic acids. Bias-free quantification methods are the ones where gel electrophoresis is combined with densitometry, which is even available in a lab-on-chip format.

All the procedures described in Fig. 1 have also been combined and offered as commercial ready-to-go DNA extraction kits for various environmental matrices, often accompanied by machinery for cell lysis. Table 1 gives an overview of some commercially available kits and equipment.

Variability and Community Composition Assessment

The central question in microbial ecological research is why microbial communities are composed the way they are and what factors influence community composition. To this end it is essential, when comparing one sample with another, that the differences observed are due to biotic or abiotic factors and not to biases introduced by the methods used. It is obvious from the previous section that a bias-free extraction of DNA from all environments is not possible. The matrix shown in Fig. 1 is a collection of methods developed with the goal of obtaining PCR-amplifiable DNA. Hence, the protocols were not designed for bias-free extraction but for obtaining an extract that enables downstream applications. Considering the inherent problems specific to various environmental matrices, no single protocol will suffice for all environments. The protocols developed were designed and tested to yield the highest quality and quantity of DNA and the highest diversity in fingerprinting methods (denaturing gradient gel electrophoresis (DGGE), terminal restriction fragment length polymorphism (T-RFLP), microarray), the highest abundance assessed with quantitative polymerase chain reaction (qPCR), or the highest molecular weight DNA in metagenomic studies. Hence, community composition was the criterion for testing the performance of protocols, and the number of protocols available is a good indicator of the biases introduced. However, it was demonstrated that even when applying one protocol to exactly the same soil sample, community composition analyses following DNA extraction are not bias-free (Pan et al. 2010). When a single well-homogenized soil sample was extracted in different laboratories using the same protocol, biases were already introduced at the initial extraction. The DNA quantity (Fig. 2a) as well as quality varied significantly between laboratories, leading to significant differences in community composition of methane-oxidizing bacteria (Fig. 3) as assessed by PCR-based microarray analyses. Moreover, the same extractions performed by
Extraction Methods, Variability Encountered in, Table 1 Overview of a number of commercially available DNA extraction kits, lysis equipment, additional cleanup kits, and DNA quantitation methods

Product / supplier | URL

Soil DNA extraction kits
PowerSoil and PowerMax / Mobio | http://www.mobio.com/soil-dna-isolation/powermax-soil-dna-isolation-kit.html
SoilMaster / Epicentre Technologies | http://www.epibio.com/item.asp?id=388
E.Z.N.A.® Soil DNA Kit / Omega BioTek | http://www.omegabiotek.com/product_detail.php?ID=95
ZR Soil Microbe DNA Kit / Zymo Research | http://www.zymoresearch.com/media/downloads/212/D6001d.pdf
FastDNA® SPIN kit for Soil / MP Biomedicals | http://www.biocompare.com/11793-DNA-Purification-Kits-Soil/2691724-FastDNA96-Soil-Microbe-DNA-Kit/

Cell disruption equipment
BioSpec Mini Bead Beater | http://www.biospec.com/product/28/mini_beadbeater/
MP Biomedicals FastPrep®-24 or MP Biomedicals FastPrep®-96 | http://www.mpbio.com/product_info.php?family_key=116004500
Geno/Grinder® | http://www.spexsampleprep.com/equipment-and-accessories/equipment_product.aspx?typeid=1
Free/Mill® | http://www.spexsampleprep.com/equipment-and-accessories/equipment_product.aspx?typeid=2

Additional cleanup kits
Wizard® SV Gel and PCR Clean-Up System | http://www.promega.com/products/dna-and-rna-purification/dna-fragment-purification/wizard-sv-gel-and-pcr-clean_up-system/
Sepharose 4B® columns | http://www.gelifesciences.com/webapp/wcs/stores/servlet/catalog/nl/GELifeSciences-nl/products/AlternativeProductStructure_17546/17075701
Nonlinear electrophoresis (SCODA) | http://www.borealgenomics.com/products/aurora/

DNA quality/quantity
NanoDrop | http://www.nanodrop.com/
PicoGreen (Quant-iT™) | http://www.invitrogen.com/site/us/en/home/brands/Product-Brand/Quant-iT.html
Microfluidics, Agilent Bioanalyzer | http://www.genomics.agilent.com/GenericB.aspx?PageType=Family&SubPageType=FamilyOverview&PageID=183
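As a hedged illustration of the spectrophotometric quantitation listed in Table 1, the following Python sketch converts an A260 reading into a double-stranded DNA concentration, computes the 260/280 purity ratio, and shows how an overestimated concentration reduces the amount of DNA actually added to a PCR. The function names are hypothetical and all readings are invented for illustration. The conversion factor of 50 ng/μl per A260 unit is the conventional value for double-stranded DNA at a 1 cm path length.

```python
# Conventional factor: an A260 of 1.0 corresponds to about 50 ng/ul dsDNA (1 cm path).
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dsdna_concentration(a260, dilution_factor=1.0):
    """Estimate dsDNA concentration (ng/ul) from absorbance at 260 nm."""
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio used as a purity indicator."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def volume_for_input(target_ng, measured_ng_per_ul):
    """Volume of extract (ul) needed for a target DNA input, given a measured concentration."""
    return target_ng / measured_ng_per_ul


# Invented example: absorbance reports 200 ng/ul, a dsDNA-specific dye reports 100 ng/ul.
absorbance_based = dsdna_concentration(a260=4.0)   # humic acids and RNA also absorb
dye_based = 100.0                                  # hypothetical fluorescent-dye reading
volume = volume_for_input(target_ng=10.0, measured_ng_per_ul=absorbance_based)
delivered_ng = volume * dye_based                  # DNA actually added if the dye value is right

print(f"A260-based concentration: {absorbance_based:.1f} ng/ul")
print(f"A260/A280 ratio: {purity_ratio(4.0, 2.2):.2f}")
print(f"Volume pipetted for a nominal 10 ng: {volume:.3f} ul")
print(f"DNA actually delivered: {delivered_ng:.1f} ng")
```

The sketch only makes the arithmetic explicit: when absorbance overestimates the concentration, a smaller volume is used, so less template and fewer co-extracted contaminants enter the reaction than intended.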
two investigators simultaneously in the same laboratory using exactly the same chemicals and equipment also yielded significant differences in DNA quantity (Fig. 2b) and quality, proving that the investigator can also introduce biases, probably due to pipet handling in step C (Fig. 1) of the protocol. Another source of bias appeared to come from the DNA quantitation method (Fig. 2), leading to significantly different community profiles (Fig. 4) as well as abundance of methane-consuming bacteria. In this case overestimation of DNA quantity by NanoDrop leads to a higher dilution of the DNA to reach the same input amount of target DNA as in the PicoGreen-based PCR reaction. This dilution reduced the remaining inhibition of the PCR by contaminants still present in the DNA, with consequences for the subsequent outcome of the downstream analyses.

Important improvements were made to reduce extraction bias by extracting the same sample matrix, remaining in the pellet of step B (Fig. 1), multiple times (Feinstein et al. 2009). After three extractions, DNA quantity as well as bacterial abundance reached a plateau which was similar for a number of different lysis protocols. This demonstrates that a single extraction always gives a biased picture of the community composition. Combining multiple extraction protocols has been shown to enhance the detected diversity of recovered species by more than 80 % (Delmont et al. 2011a) in soil samples. However, the relative abundances obtained with the various approaches differed, making this approach very important
[Figure 2 plot: bar charts of DNA concentration (ng/uL, axis from 0 to 300) measured with NanoDrop and PicoGreen; panel a by laboratory (A to E), panel b by investigator (A1, A2 to E1, E2); letters and asterisks mark significant differences.]

Extraction Methods, Variability Encountered in, Fig. 2 DNA concentrations (means ± 1 standard deviation) as analyzed with NanoDrop or PicoGreen, showing the comparisons between laboratories (a) and between investigators in the various laboratories (b). Different letters in panel A indicate significant differences between countries (P < 0.05, unequal honestly significant difference test). In panel B, the asterisk indicates a significant difference between investigators within one laboratory (as assessed using Student's t test; P < 0.01) (From Pan et al. 2010 with permission)

Extraction Methods, Variability Encountered in, Fig. 3 Nonmetric multidimensional scaling plot using log-transformed Bray-Curtis dissimilarity matrices based on signal intensity values of the pmoA microarray analyses on DNA extracted in five different laboratories. Distances between symbols represent relative dissimilarity between MOB communities. Analyses of similarity (ANOSIM) resulted in a significant difference between MOB community structures analyzed in the different laboratories. Only samples from laboratory A and B did not differ from each other (n = 8, except for laboratory E [n = 6]) (From Pan et al. 2010 with permission)
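The dissimilarity measure named in the caption of Fig. 3 can be sketched in a few lines of Python. This is an illustration only, not the analysis of Pan et al. (2010): the function name `bray_curtis_matrix` is hypothetical, the log transformation is assumed to be log(x + 1) because the caption does not specify it, and the intensity values are invented. The ordination step (nonmetric multidimensional scaling) and ANOSIM are not shown.

```python
import numpy as np


def bray_curtis_matrix(intensities):
    """Pairwise Bray-Curtis dissimilarities between samples (rows).

    Values are log(x + 1) transformed first; this particular transform
    is an assumption made for the example.
    """
    data = np.log1p(np.asarray(intensities, dtype=float))
    n_samples = data.shape[0]
    dist = np.zeros((n_samples, n_samples))
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            numerator = np.abs(data[a] - data[b]).sum()
            denominator = (data[a] + data[b]).sum()
            value = numerator / denominator if denominator > 0 else 0.0
            dist[a, b] = dist[b, a] = value
    return dist


# Invented probe signal intensities: three samples (rows), four probes (columns).
signals = [
    [120.0, 30.0, 0.0, 5.0],
    [110.0, 35.0, 0.0, 8.0],
    [10.0, 0.0, 90.0, 60.0],
]
print(np.round(bray_curtis_matrix(signals), 3))
```

A matrix of this kind is the input to an ordination such as the one plotted in Fig. 3, where samples with similar probe profiles end up close together.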
for complete diversity assessment but not for comparisons between different samples or environments. The first attempt at standardization between samples and environments has been made for soils, where an ISO-certified extraction protocol was tested on various soils and by a number of different laboratories (Petric et al. 2011). The protocol was only standardized up to what is believed to be the step (step B, Fig. 1) causing most variation. Thirteen different laboratories tested a number of soil types. There was variation in DNA quantity and quality and
Extraction Methods, Variability Encountered in, Fig. 4 Nonmetric multidimensional scaling plot using log-transformed Bray-Curtis dissimilarity matrices based on signal intensity values of pmoA microarray analyses, performed on the basis of the NanoDrop or PicoGreen DNA quantitation method. Distances between symbols represent relative dissimilarity between MOB communities. Analyses of similarity (ANOSIM) resulted in a significant difference between MOB community structures when based on different DNA concentration measurements (n = 8) (From Pan et al. 2010 with permission)
also in community fingerprinting, but it was acceptable compared to commonly observed variation. Although the soils did not differ much in their complexity and only one fingerprinting method was used, this standard protocol is a very important step toward comparability of samples. At least for the intensively studied soil habitat, comparisons may be possible, and similar standardizations for related habitats may be a way to go.

Conclusions

It is obvious that no single protocol of DNA extraction will be bias-free and that applying a single protocol to a sample will never yield a "true" picture of microbial community composition. The inherent differences in the properties of environmental matrices prevent this. However, important improvements have been made, leading to the recommendation to perform multiple extractions on the same matrix and to use multiple protocols with varying stringency of lysis to maximize diversity assessments of single samples. When different samples have to be compared in time or between treatments or habitats, it is best when extractions are performed in the same laboratory by the same person using identical chemicals and machinery, especially the bead-beating apparatus. Of course the latter may not always be feasible, and an extraction robot may be very useful in order to reduce variation caused by pipet handling (e.g., Maxwell-16 system from Promega). However, in order to come to real ecological comparisons of microbial communities, new methods of standardization have to be developed. Internal standardization by spiking samples with a known amount of cells may be an option. Most important, however, will be to assess for every sample matrix what the extent of the bias is and to take that into account in the interpretation.

Summary

Microbial communities are the drivers of all ecosystems on Earth but are also the least understood branch on the tree of life. The advent of molecular biological techniques assessing environmental nucleic acids has greatly increased the amount of information on environmental microbial communities. However, in the era of metagenomics and high-throughput sequencing, the critical step in microbial community analyses is still the
extraction of DNA from environmental samples. DNA is extracted by liberation from cells followed by extraction from the matrix using organic solvents and recovered by precipitation with alcohols. The lysis of cells and the removal of contaminants that degrade or adhere to the DNA call for many different approaches varying in effectiveness and leading to substantial bias in downstream genomic or metagenomic applications. Next to this, variation can also be introduced by investigator skills. Improvements have been made for increasing the observed diversity in one single sample, and for soils, an ISO-certified extraction protocol has been established, facilitating ecological comparisons for this habitat. For true ecological comparisons, new ways of standardization have to be developed.

References

[...] community structure in soil and human-associated samples. FEMS Microbiol Lett. 2010;307(1):80–6.
Liu W-T, Jansson JK, editors. Environmental molecular microbiology. Norfolk: Caister Academic Press; 2010.
Lombard N, Prestat E, van Elsas JD, Simonet P. Soil-specific limitations for access and analysis of soil microbial communities by metagenomics. FEMS Microbiol Ecol. 2011;78(1):31–49.
Pan Y, Bodrossy L, Frenzel P, Hestnes AG, Krause S, Luke C, et al. Impacts of inter- and intralaboratory variations on the reproducibility of microbial community analyses. Appl Environ Microbiol. 2010;76(22):7451–8.
Petric I, Philippot L, Abbate C, Bispo A, Chesnot T, Hallin S, et al. Inter-laboratory evaluation of the ISO standard 11063 "Soil quality – method to directly extract DNA from soil samples". J Microbiol Methods. 2011;84(3):454–60.
Techer D, Martinez-Chois C, D'Innocenzo M, Laval-Gilly P, Bennasroune A, Foucaud L, et al. Novel perspectives to purify genomic DNA from high humic acid content and contaminated soils. Sep Purif Technol. 2010;75(1):81–6.
Williamson KE, Kan J, Polson SW, Williamson SJ. Optimizing the indirect extraction of prokaryotic DNA from soils. Soil Biol Biochem. 2011;43(4):736–48.
[Figure residue: reaction scheme labeled "Catecholic compounds (R: -H, -Cl, -CH3, -Ph etc)" and "meta-cleavage product (yellow-colored)"; phylogenetic tree of EDOs with subfamily labels I.1.A to I.6 and a 0.1 scale bar.]
Extradiol Dioxygenases Retrieved from the Metagenome, Fig. 2 A phylogenetic tree showing both metagenomic EDOs and previously identified type I EDOs. The metagenomic clones identified from the activated sludge of wastewater from a Coke plant (Suenaga et al. 2007) are shown in red. The accession numbers of the previously identified EDOs are as follows: BPHC BACPJ, Q8GR45; Q59770 RHORH, Q59770; PHEB BACST, P31003; Q89NL9 BRAJA, Q89NL9; Q59693 PSEPU, Q59693; Q9ZAN5 9BURK, Q9ZAN5; Q52264 PSEPU, Q52264; Q52444 9SPHN, Q52444; Q45459 SPHYA, Q45459; XYLE2 PSEPU, Q04285; Q59708 PSEPU, Q59708; DMPB PSEUF, P17262; Q59709 PSEPU, Q59709; NAHH PSEPU, P08127; Q59720 PSESP, Q59720; Q7M0R7 ALCXX, Q7M0R7; XYLE1 PSEPU, P06622; XYLE PSEAE, P27887; Q83U22 9PSED, Q83U22; Q6N3D3 RHOPA, CGA009; Q44048 ARTGO, Q44048; Q50912 9SPHN, Q50912; MPC2 RALEU, P17296; Q6W1M5 RHISN, Q6W1M5; Q9KWQ8 RHOSR, Q9KWQ8; Q8L185 9NOCA, Q8L185; BPHC2 RHOGO, P47232; DBFB PSEPA, P47243; CATA RHORH, Q53034; BPHC PSEPA, P11122; P72325 RHOSO, P72325; Q52533 PSESP, Q52533; Q8VV92 9MICO, Q8VV92; O69355 RHOER, O69355; Q762H4 RHORH, Q762H4; O69362 RHOER, O69362; Q762I0 RHORH, Q762I0; O69359 RHOER, O69359; Q9KWQ5 RHOSR, Q9KWQ5; Q9LC87 9ACTO, Q9LC87; Q762F5 RHORH, Q762F5; O69358 RHOER, O69358; TODE PSEPU, P13453; BPHC1 RHOGO, P47231; Q53126 RHOSR, Q53126; Q51749 PSEFL, Q51749; BPHC PSES1, P17297; Q84EP0 9BURK, Q84EP0; Q52032 PSEPU, Q52032; BPHC PSEPS, P08695; BPHC BURCE, P47228 (This figure was drawn using the FigTree software (http://tree.bio.ed.ac.uk/software/figtree/))
[...] active site and are coordinated by the so-called 2-His-1-carboxylate facial triad motif (Lipscomb 2008).

EDOs Retrieved from the Metagenome

At the time of writing of this report (March 2013), 42,295 "extradiol dioxygenase" sequences have been deposited in the Protein database of NCBI (www.ncbi.nlm.nih.gov/protein), 1,076 of which are derived from "uncultured bacteria." Of the 1,076 sequences, however, only a few contain complete EDO sequences (Vilchez-Vargas et al. 2010; Suenaga 2012).

Based on the yellow coloration of catechol ring-cleavage products, 235 positive clones were identified from the fosmid library constructed from environmental DNA extracted from petrol-contaminated soil (Brennerova et al. 2009). PCR-based classification of the internal sequences of the metagenomic EDO genes showed that only one-fourth of the observed EDOs belong to subfamily I.3.A or I.3.B, which would be expected to predominate on the basis of the knowledge obtained from isolated bacteria. Genes of subfamily I.2.A, which have frequently been used as DNA markers for assessing the catabolic potential of polluted sites, were also absent (Vilchez-Vargas et al. 2010). Functional analysis of representative proteins indicated that one clone, s45, has exceptionally high affinity for different catecholic substrates.

Coke plant wastewater contains various aromatic compounds, and activated sludge that is used for decontamination may serve as a rich resource for EDO discovery. Suenaga et al. (2007) created a metagenomic fosmid library using the activated sludge, and by functional screening, 91 EDO-positive clones were identified. Based on their substrate specificity for various catecholic compounds, 38 clones were subjected to shotgun DNA sequencing. Some clones contained two EDO genes, and as a result, a total of 43 EDO genes were identified. Approximately half of these were classified into known subfamilies, but surprisingly, 23 genes could not be classified into existing subfamilies, and therefore, four new subfamilies, namely, I.1.C, I.2.G, I.3.M, and I.3.N (Fig. 2), were proposed. Among these novel EDOs, the I.2.G subfamily genes were overrepresented among the retrieved metagenomic EDOs and branched at a deep point in the lineage. Enzymatic characterization demonstrated that the I.2.G EDOs have unique properties, including Mn(II) dependence instead of the more common Fe(II) dependence, as well as the highest affinity for catechol reported thus far, and tolerance for thermal and chemical inhibitors (NaCN and H2O2) (Suenaga et al. 2009).

EDO Application for Bioremediation

Each polluted site harbors contaminants that carry environment-specific EDO genes. Monitoring these "marker" EDO genes using the metagenomic approach may be a good method for evaluating the bioremediation process (Widada et al. 2002). Furthermore, retrieving novel EDOs, as well as engineering these for higher activity and stability, can enhance the development of bioremediation processes.

Summary

Metagenomic approaches are an effective means of discovering novel enzymes including EDOs, which present specific sequences and enzymatic properties based on their substrate preference, metal dependence, inhibitor tolerance, and various physicochemical properties. Research targeting different environments may help in furthering the knowledge about the diversity of EDOs.

Cross-References

▶ Metagenomics Potential for Bioremediation
References

Abraham WR, Nogales B, Golyshin PN, et al. Polychlorinated biphenyl-degrading microbial communities in soils and sediments. Curr Opin Microbiol. 2002;5:246–53.
Brennerova MV, Josefiova J, Brenner V, et al. Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ Microbiol. 2009;11:2216–27.
Chakraborty R, Coates JD. Anaerobic degradation of monoaromatic hydrocarbons. Appl Microbiol Biotechnol. 2004;64:437–46.
Eltis LD, Bolin JT. Evolutionary relationships among extradiol dioxygenases. J Bacteriol. 1996;178:5930–7.
Fritsche W, Hofrichter M. Aerobic degradation by microorganisms. In: Rehm H-J, Reed G, editors. Biotechnology: environmental processes II, vol. 11b. 2nd ed. Weinheim: Wiley-VCH Verlag GmbH; 2008.
Furukawa K, Suenaga H, Goto M. Biphenyl dioxygenases: functional versatilities and directed evolution. J Bacteriol. 2004;186:5189–96.
Janssen DB, Dinkla IJT, Poelarends GJ, et al. Bacterial degradation of xenobiotic compounds: evolution and distribution of novel enzyme activities. Environ Microbiol. 2005;7:1868–82.
Lipscomb JD. Mechanism of extradiol aromatic ring-cleaving dioxygenases. Curr Opin Struct Biol. 2008;18:644–9.
De Lorenzo V. Systems biology approaches to bioremediation. Curr Opin Biotechnol. 2008;19:579–89.
Pieper DH, Seeger M. Bacterial metabolism of polychlorinated biphenyls. J Mol Microbiol Biotechnol. 2008;15:121–38.
Suenaga H, Ohnuki T, Miyazaki K. Functional screening of a metagenomic library for genes involved in microbial degradation of aromatic compounds. Environ Microbiol. 2007;9:2289–97.
Suenaga H, Mizuta S, Miyazaki K. The molecular basis for adaptive evolution in novel extradiol dioxygenases retrieved from the metagenome. FEMS Microbiol Ecol. 2009;69:472–80.
Suenaga H. Targeted metagenomics: a high-resolution metagenomics approach for specific gene clusters in complex microbial communities. Environ Microbiol. 2012;14:13–22.
Top EM, Springael D. The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Curr Opin Biotechnol. 2003;14:262–9.
Vilchez-Vargas R, Junca H, Pieper DH. Metabolic networks, microbial ecology and “omics” technologies: towards understanding in situ biodegradation processes. Environ Microbiol. 2010;12:3089–104.
Widada J, Nojiri H, Omori T. Recent developments in molecular techniques for identification and monitoring of xenobiotic-degrading bacteria and their catabolic genes in bioremediation. Appl Microbiol Biotechnol. 2002;60:45–59.
F
Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences

Weizhong Li
J. Craig Venter Institute, La Jolla, CA, USA

Synonyms

CD-HIT is a fast program for clustering large amounts of protein and nucleotide sequences

Definition

Sequence clustering is a process to group sequences into groups (clusters) such that similar sequences are clustered together and can be potentially represented by a single representative sequence. CD-HIT uses a greedy incremental clustering algorithm enhanced by an efficient word-filtering heuristic and an effective parallelization technique to cluster big sequence datasets efficiently.

Introduction

Since the development of high-throughput sequencing technologies, the amount of available biological sequences has increased dramatically and continues to increase rapidly. Efficient handling and effective analysis of such massive amounts of data have become one of the major issues and challenges in much sequencing-based research. Such challenges are typically dominated by two factors: huge data size and high sequence redundancy. Sequence clustering is a key technique that can address these two issues at once, by clustering the sequences and reducing them to a smaller subset of representative sequences.

Sequence clustering is a technique to group sequences into groups (clusters), such that similar sequences are clustered together and can be potentially represented by a single representative sequence. The similarity between two sequences is normally defined based on an optimal alignment between them. Such an optimal alignment is usually found by dynamic programming techniques, which are computationally expensive. Traditional clustering algorithms that require many pairwise sequence comparisons are impractical for clustering very large sequence datasets. Reducing the number of sequence comparisons is the key to efficient sequence clustering that can cope with the massive amount of sequencing data.

Greedy incremental clustering has been employed in sequence clustering to reduce the number of sequence comparisons since the implementation of a tool by Holm and Sander (1998) to create nrdb90 for protein sequences with a decapeptide filter to further reduce the number of comparisons. To overcome some limitations of that tool and further improve the clustering efficiency, CD-HIT was developed to use
the same greedy incremental algorithm, but with a much more efficient filtering heuristic (Li et al. 2001, 2002). CD-HIT was then extended to support clustering of nucleotide sequences (Li and Godzik 2006) and became one of the most widely used programs for sequence clustering due to its efficiency in handling large datasets.

The rapidly increasing amount of sequence data demands even more efficient clustering programs and has led to the development of an enhanced version of CD-HIT (Fu et al. 2012), which has been reengineered to support clustering of very large sequence datasets. In this new CD-HIT, a parallelization technique was developed to safely and efficiently parallelize the greedy incremental clustering algorithm. This parallel CD-HIT can achieve very good speedup (quasilinear speedup for up to eight cores) on multicore computers for sequence clustering.

CD-HIT and its derived programs such as CD-HIT-454, CD-HIT-DUP, CD-HIT-LAP, and CD-HIT-OTU have extensive applications in the metagenomics field. A summary of these applications is available from a recent review paper (Li et al. 2012).

Methods

CD-HIT uses a greedy incremental clustering algorithm with filtering heuristics based on shared word counting for efficient clustering. It is further enhanced by an effective parallelization technique that can achieve very good speedup on multicore computers.

Filtering Based on Shared Words

Checking a query sequence against each of the representative sequences is very inefficient, because such checking involves sequence comparison based on sequence alignment using dynamic programming, which is computationally expensive. To reduce such comparisons, a word (k-mer or q-gram) indexing table can be used to filter out unnecessary comparisons based on the number of words shared between the query sequence and each of the representative sequences.

The idea is that, for two sequences to have identity above an identity cutoff, they must share a minimum number of common words given the sequence lengths. It is easy to see that, given two sequences with an alignment length L and an identity cutoff C, the maximum number of mismatches and gaps that are allowed between the two aligned sequences is $E = L(1 - C)$. Because each mismatch or gap can destroy at most W of the $L - W + 1$ words of length W in the alignment, the minimum number of shared words of length W should be $L + 1 - (E + 1)W$. This is also the minimum number of shared words between a query sequence of length L and any longer reference sequence. In CD-HIT, this threshold is adjusted according to the presence of unknown letters such as “N” and “X,” etc., and to the command line options.

To speed up the counting of shared words, an indexing table is built for the representative sequences to record, for each word, the indices of the representative sequences that contain it and the number of times the word occurs in each. This allows efficient counting of shared words between a query and each of the representative sequences.
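The shared-word filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CD-HIT's actual implementation: the function names (`min_shared_words`, `build_index`, `candidate_representatives`) are hypothetical, the identity cutoff is passed as an integer percentage to keep the arithmetic exact, L is taken to be the query length, and the adjustments CD-HIT makes for unknown letters and command line options are omitted.

```python
from collections import Counter, defaultdict

def min_shared_words(length, identity_pct, word_len):
    """Lower bound L + 1 - (E + 1) * W on shared words, with E = L * (1 - C)."""
    # Integer arithmetic avoids floating-point surprises in L * (1 - C).
    max_errors = length * (100 - identity_pct) // 100
    return max(0, length + 1 - (max_errors + 1) * word_len)

def words(seq, word_len):
    """Count every overlapping word (k-mer) of the given length in seq."""
    return Counter(seq[i:i + word_len] for i in range(len(seq) - word_len + 1))

def build_index(representatives, word_len):
    """Map each word to {representative index: number of occurrences}."""
    index = defaultdict(dict)
    for rep_id, rep in enumerate(representatives):
        for word, count in words(rep, word_len).items():
            index[word][rep_id] = count
    return index

def candidate_representatives(query, index, identity_pct, word_len):
    """Return ids of representatives that share enough words to merit alignment."""
    needed = min_shared_words(len(query), identity_pct, word_len)
    if needed == 0:
        # The bound is vacuous, so no indexed representative can be ruled out.
        return sorted({rep_id for hits in index.values() for rep_id in hits})
    shared = Counter()
    for word, q_count in words(query, word_len).items():
        for rep_id, r_count in index.get(word, {}).items():
            # Assumption: a word occurring several times counts min(query, rep) times.
            shared[rep_id] += min(q_count, r_count)
    return sorted(rep_id for rep_id, n in shared.items() if n >= needed)

reps = ["ACGTACGTAC", "TTTTTTTTTT"]   # toy representative sequences
index = build_index(reps, word_len=3)
print(min_shared_words(100, 90, 5))                            # 46
print(candidate_representatives("ACGTACGTAC", index, 90, 3))   # [0]
```

Only the representatives returned by the filter would then be passed on to the expensive alignment step; with L = 100, C = 90 %, and W = 5, for instance, a representative sharing fewer than 46 words can be skipped without alignment.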
Sanger using primer vectors is smaller than the size of most inserts of this size (Venter et al. 2004).

However, large-insert libraries, and particularly fosmids, have been very popular among metagenomics researchers (DeLong et al. 2006; Martin-Cuadrado et al. 2007). The main reason is that the insert in a fosmid is a sizeable natural contig that typically contains 30–40 genes. This size is very appropriate for annotation since bacterial and archaeal gene clusters are arranged functionally, i.e., genes with related function, such as different enzymes of a metabolic pathway, are located next to each other, often organized in operons. Therefore, function can be inferred with much more reliability from a large contig. A common approach taken for analysis of fosmid libraries is fosmid-end sequencing using the vector primers. This generates datasets that are similar to the short insert (also known sometimes as shotgun) libraries but with the important difference that the ends sequenced are separated by a much larger distance. Also, the fosmids that show promise of revealing some interesting activity, or corresponding to an interesting microbe, can be fully sequenced (Fig. 1), traditionally by Sanger dideoxy sequencing but now also by high-throughput approaches (Martin-Cuadrado et al. 2009).

[Figure 1: screening targets. Bacteria: 16S–ITS–23S; Archaea: 16S rDNA; Eukaryotes: 18S rDNA]
Fosmid System, Fig. 1 Methods for selecting fosmids for full sequencing. End sequences can provide clues as to the kind of genes present in the fosmid and allow for selecting those involved in interesting processes or microbes (Martin-Cuadrado et al. 2009). Alternatively, fosmid clones can be screened by PCR or hybridization to select those that contain taxonomically informative genes such as rRNAs. In the case of bacteria, a strategy to select those containing other rRNAs different from E. coli that is present in all the clones is shown. The amplicon includes the internal transcribed spacer (ITS), and the size of this hypervariable region shows the clones containing rRNA genes different from those of E. coli

Fosmids can also be screened by PCR to select those belonging to selected groups of microbes, largely by using 16S rRNA primers (Martin-Cuadrado et al. 2008). This way, the fosmids containing ribosomal operons can be identified and those containing the target rRNA gene fully sequenced. This approach is a bit tricky when the target group is bacteria because fosmid preparations are always contaminated with E. coli DNA, and PCR of the 16S rRNA gene always yields the E. coli amplicon. As an alternative methodology to select bacterial fosmids containing ribosomal operons, primers for the 16S–23S gene spacer or ITS
were used. The amplicons were run in an agarose gel, and only those with a significantly different size from that of E. coli were selected (Quaiser et al. 2008).

With the advent of high-throughput sequencing (HTS), the applications of fosmids are still significant. First of all, they provide a way to assemble much larger contigs, the Achilles’ heel of HTS. Ghai et al. (2010) sequenced 1,000 pooled fosmids by 454 pyrosequencing and compared the results with the direct 454 pyrosequencing of the same DNA before cloning. The results indicated a strong bias in the fosmid clones against some specific groups of microbes such as Candidatus Pelagibacter ubique and Prochlorococcus that happen to be the most abundant microbes in this environment. Besides, the GC distribution plot indicated that reads with a GC content of ca. 50 % were enriched relative to the reads of the directly sequenced DNA (Fig. 2). The reasons for these biases are obscure, and a similar bias was found for environmental BAC libraries (Feingersch and Beja 2009). However, fosmid cloning complemented direct pyrosequencing, providing a way to access microbes that were relatively less abundant in the sample such as marine Euryarchaea or cyanophages in the case of marine samples from the photic zone. It also provided much larger contigs (up to 44 Kbp and close to 200 contigs over 10 Kbp). The importance of long contigs for interpreting metagenomic datasets cannot be stressed enough since annotation of large clusters of genes is much more reliable (see above). For example, Ghai et al. assembled large fragments of the genomes of marine Euryarchaea of group II that later on were instrumental in assembling the complete genome of one of their members from a natural environment (Iverson et al. 2012).

[Figure 2: panels (a) and (b); x-axis GC% (0–100), y-axis frequency (0–35,000)]
Fosmid System, Fig. 2 Frequency distribution of GC% for the two metagenomic sequence datasets from the Mediterranean water column at the deep chlorophyll maximum (50 m deep). (a) All reads obtained in the DCM direct 454 pyrosequencing dataset. (b) All reads of the DCM fosmids dataset after removing the vector pCC1fos sequences. GC% of vector pCC1fos = 48 %. For details see Ghai et al. (2010)

A recent application described for fosmid vectors has been their use for metavirome studies. Metaviromes have a major problem when sequenced by HTS. Viral genes are even more difficult to annotate, and inferring information from their sequence is close to impossible unless large fragments of the viral genome are available. This problem has been solved by fosmid cloning in a pilot study carried out by Garcia-Heredia et al. (2012). These authors retrieved viral DNA from a natural extreme environment and could reconstruct complete to near-complete genomes of viruses that prey on microbes whose pure culture is very fastidious and hence not suitable for classical phage isolation in pure culture.
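The GC% comparison summarized in Fig. 2 amounts to computing the GC percentage of every read and tallying a histogram for each dataset. The following is a minimal sketch, assuming reads are available as plain strings; the function names, the 5 % bin width, and the toy reads are illustrative assumptions and are not taken from Ghai et al. (2010).

```python
from collections import Counter

def gc_percent(read):
    """Return the GC percentage of one read, ignoring ambiguous bases such as N."""
    read = read.upper()
    gc = read.count("G") + read.count("C")
    at = read.count("A") + read.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0

def gc_histogram(reads, bin_width=5):
    """Count reads per GC% bin; each key is the lower edge of a bin."""
    hist = Counter()
    for read in reads:
        pct = gc_percent(read)
        # Put exactly 100 % into the last bin instead of opening a new one.
        lower = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))

direct = ["ATATATGC", "ATTTAAGC", "GGCCATAT"]   # toy stand-in for directly sequenced reads
fosmid = ["GGCCATAT", "GCGCATAT", "GGCCGGAT"]   # toy stand-in for fosmid-derived reads
print(gc_histogram(direct))   # {25: 2, 50: 1}
print(gc_histogram(fosmid))   # {50: 2, 75: 1}
```

Comparing the two histograms bin by bin is one simple way to see a shift such as the enrichment of ca. 50 % GC reads reported for the fosmid dataset.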
Besides, the chances of screening for biological activity are better when using larger inserts, among other things because the complete metabolic pathway might be present, in case more than one gene is needed, and also the genomic context facilitates expression (e.g., better chances of the required promoters and control machinery being present). Many recent examples have used fosmid clones for expression of activities such as enzymes (Selvin et al. 2012) or bioactive compounds (Riaz et al. 2008; Huang et al. 2009; Parsley et al. 2011).

The third generation of high-throughput single-molecule nucleic acid sequencing such as Nanopore or Helicos might generate long reads that, provided they have enough reliability, might make fosmid cloning and sequencing obsolete (Munroe and Harris 2010; Manrao et al. 2012).

Summary

Many authors have used fosmid vectors to describe metagenomes. They allow large libraries to be generated with a relatively small investment of time and money, and they can be used for multiple purposes. For example, fosmid-end sequencing provides data similar to shotgun libraries (in small insert vectors), and the clones can be screened for sequences of interest for full fosmid sequencing. There are many examples of studies carried out that way. They can be screened by PCR for genes of interest such as 16S rRNA or others. Fosmids are also better vectors for expression screening by biological activity. The advent of high-throughput sequencing technologies provides new opportunities for sequencing and screening fosmids. However, long-read single-molecule sequencing might replace the need for fosmid cloning and render this metagenomic approach obsolete.

References

Beja O, Suzuki MT, et al. Construction and analysis of bacterial artificial chromosome libraries from a marine microbial assemblage. Environ Microbiol. 2000;2(5):516–29.
DeLong EF, Preston CM, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
Feingersch R, Beja O. Bias in assessments of marine SAR11 biodiversity in environmental fosmid and BAC libraries? ISME J. 2009;3(10):1117–9.
Garcia-Heredia I, Martin-Cuadrado AB, et al. Reconstructing viral genomes from the environment using fosmid clones: the case of haloviruses. PLoS One. 2012;7(3):30.
Ghai R, Martin-Cuadrado A, et al. Metagenome of the Mediterranean deep chlorophyll maximum studied by direct and fosmid library 454 pyrosequencing. ISME J. 2010;9:1154–1166.
Huang Y, Lai X, et al. Characterization of a deep-sea sediment metagenomic clone that produces water-soluble melanin in Escherichia coli. Mar Biotechnol. 2009;11(1):124–31.
Iverson V, Morris RM, et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Kim UJ, Shizuya H, et al. Stable propagation of cosmid sized human DNA inserts in an F factor based vector. Nucleic Acids Res. 1992;20(5):1083–5.
Manrao EA, Derrington IM, et al. Reading DNA at single-nucleotide resolution with a mutant MspA nanopore and phi29 DNA polymerase. Nat Biotechnol. 2012;30(4):349–53.
Martin-Cuadrado AB, Lopez-Garcia P, et al. Metagenomics of the deep Mediterranean, a warm bathypelagic habitat. PLoS One. 2007;2(9):e914.
Martin-Cuadrado AB, Rodriguez-Valera F, et al. Hindsight in the relative abundance, metabolic potential and genome dynamics of uncultivated marine archaea from comparative metagenomic analyses of bathypelagic plankton of different oceanic regions. ISME J. 2008;2(8):865–86.
Martin-Cuadrado AB, Ghai R, et al. CO dehydrogenase genes found in metagenomic fosmid clones from the deep Mediterranean sea. Appl Environ Microbiol. 2009;75(23):7436–44.
Munroe DJ, Harris TJ. Third-generation sequencing fireworks at Marco Island. Nat Biotechnol. 2010;28(5):426–8.
Parsley LC, Linneman J, et al. Polyketide synthase pathways identified from a metagenomic library are derived from soil Acidobacteria. FEMS Microbiol Ecol. 2011;78(1):176–87.
Quaiser A, Lopez-Garcia P, et al. Comparative analysis of genome fragments of Acidobacteria from deep Mediterranean plankton. Environ Microbiol. 2008;10(10):2704–17.
Riaz K, Elmerich C, et al. A metagenomic analysis of soil bacteria extends the diversity of quorum-quenching lactonases. Environ Microbiol. 2008;10(3):560–70.
Selvin J, Kennedy J, et al. Isolation identification and biochemical characterization of a novel halo-tolerant lipase from the metagenome of the marine sponge Haliclona simulans. Microb Cell Fact. 2012;11(1):72.
Shizuya H, Birren B, et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA. 1992;89(18):8794–7.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.

FragGeneScan: Predicting Genes in Short and Error-Prone Reads

Yuzhen Ye
Indiana University, School of Informatics and Computing, Bloomington, IN, USA

Definition

Protein-coding genes are functional units in genomes that encode proteins. FragGeneScan is a hidden Markov model (HMM)-based predictor of incomplete and complete genes from short reads or complete genomes of prokaryotes.

Introduction

Identification of genes is one of the most important and challenging problems in whole microbial genome sequencing projects (Davidsen et al. 2001; Aziz et al. 2008; Stewart et al. 2009). In metagenomics, gene finding can provide the opportunity to elucidate the activities and interactions of genes within an environmental sample, from which the metabolic and signaling pathways specific to the environment can be reconstructed and identified (Turnbaugh et al. 2009; HMP consortium 2012). Most commonly, genes encoded by metagenomes have been identified by using homology-based methods such as BLASTX (Altschul et al. 1990; Meyer et al. 2008), an approach that is challenged by the large amount of sequencing data even with recent developments of faster tools including RAPSearch (Ye et al. 2011; Zhao et al. 2012). Homology searches against known protein databases, however, cannot be used to predict novel genes, although discovering new genes is one of the most important aspects in metagenomics research.

Alternatively, sequence conservation information can be utilized for prediction of novel protein-coding genes (Krause et al. 2006; Yooseph et al. 2008); for example, a Ka/Ks value of ~1 for a group of similar sequences indicates that these sequences are under no selective pressure and hence unlikely to code for proteins. This way, novel families that have multiple members in a metagenomic dataset can be identified (Yooseph et al. 2008). The other straightforward solution to novel gene prediction in metagenomics is to use feature-based approaches such as probabilistic models to evaluate the probabilities of open reading frames (ORFs) being protein-coding regions (Noguchi et al. 2006, 2008; Hoff et al. 2009), in a manner similar to conventional gene-finding methods such as Glimmer and GeneMark (Lukashin and Borodovsky 1998; Salzberg et al. 1998; Delcher et al. 1999).

Short read length and sequencing errors are two major issues that pose significant challenges to gene prediction: incomplete genes (gene fragments) are difficult to predict, and sequencing errors may cause frameshifts that further complicate gene prediction. The average length of genes in microorganisms is about 950 bps (Noguchi et al. 2006), which is much longer than the sequencing reads generated by most NGS platforms (Morozova et al. 2009; Metzker 2010; Quail et al. 2012). Different NGS methods now produce sequencing reads of various lengths ranging from 100 bps (from Illumina sequencers) to thousands of base pairs (PacBio sequencing) and have different error profiles (Morozova et al. 2009). Sanger sequencers produce reads with an error rate of up to 1 %, whereas 454 sequencers produce reads with an error rate of up to 3 % (Richter et al. 2008; Hoff 2009). Illumina sequencing technology may produce reads that have high mismatch rates, especially when relatively long reads are acquired (e.g., G is mistaken as T, and in later cycles A, C, and G are mistaken as T) (Kircher et al. 2009). In 454 sequencing reads, sequencing errors tend to occur in the homopolymer regions, resulting in frequent insertions and deletions. Most of the sequencing errors in
PacBio reads are also indels (Carneiro et al. 2012). It has been shown that ORF-based gene prediction methods are more substantially affected by sequencing errors (indels) that cause frameshifts (Hoff 2009; Tang et al. 2013). As a consequence, programs that are currently available for gene prediction from short reads show a significant decrease in their performance as the sequencing error rate increases. For example, a low sensitivity of 26–43 % was observed with a sequencing error rate of 2.8 % (Hoff 2009).

FragGeneScan Algorithm

The core of FragGeneScan (Rho et al. 2010) is a hidden Markov model (HMM) (Rabiner 1989), which incorporates codon usage bias, sequencing error models, and start/stop codon patterns in a unified model. FragGeneScan HMM consists of two-level representations based on data abstraction. FragGeneScan considers separate states representing the gene regions in the forward strand and the reverse strand of a nucleotide sequence, such that it can predict genes simultaneously from both strands. The model has seven superstates, representing gene regions, start codons and stop codons in both the forward (three states) and backward strands (three states), and noncoding regions (one state), respectively. The states for gene regions consist of six consecutive sets of a match state, an insertion state, and a deletion state, which collectively correspond to a six-periodic inhomogeneous HMM. Each match state in the gene regions uses a second-order Markov chain to model the codon usage. The state for noncoding regions is based on a first-order Markov chain. FragGeneScan also incorporates the sequence patterns for each start codon (ATG, GTG, and TTG) and stop codon (TAA, TAG, and TGA) in the start and stop state, respectively.

FragGeneScan HMM has a unique feature. By allowing transitions between the insertion/deletion states and the match states, this model effectively detects frameshifts that are caused by indel errors in sequencing. Considering that complete genomic sequences are unlikely to contain indel errors, the transition probabilities to insertion and deletion states are set to 0 when applying FragGeneScan to gene prediction in complete genomic sequences.

Given a short read (or a complete genome), the gene prediction problem is to find the best path of hidden states (see below) that is most likely to generate the observed nucleotide sequence, which can be solved by the Viterbi algorithm. FragGeneScan reports genes if they meet the following three conditions: (1) the length of the genes is longer than 60 bps, (2) the genes start in a start state (start codon) or in a match state (internal region of genes), and (3) the genes end in a stop state (stop codon) or in a match state (internal region of genes). Therefore, FragGeneScan can predict complete genes as well as partial (fragmented) genes without start and/or stop codons. Since the probability of gene regions and noncoding regions is calculated solely based on the composition of sequences (which is consistent regardless of the read length and gene length), FragGeneScan is more robust when input sequences are of different lengths.

Applications of FragGeneScan

FragGeneScan software is available as open source on http://omics.informatics.indiana.edu/FragGeneScan. It has been incorporated into several metagenomic analysis pipelines, including MG-RAST (http://press.igsb.anl.gov/mgrdev/under-the-hood/mg-rast-tools/fraggenescan/), IMG/M (Markowitz et al. 2012), WebMGA (Wu et al. 2011), and EBI metagenomics service (Wu et al. 2011).

Summary

Gene prediction in short reads (and assemblies) will remain a challenging problem, even with recent advances in the field (Tang et al. 2013). Proteins predicted from environmental sequences have already greatly expanded the universe of protein sequences. Not surprisingly, an increasingly large number of these proteins we are getting are hypothetical proteins. Functional
prediction of these hypothetical proteins will play a key role in elucidating their functions, which, however, will be an even more daunting task.

References

Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Aziz R, Bartels D, Best A, et al. The RAST server: rapid annotations using subsystems technology. BMC Genomics. 2008;9(1):75.
Carneiro MO, Russ C, Ross MG, et al. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012;13:375.
Davidsen T, Beck E, Ganapathy A, et al. The comprehensive microbial resource. Nucleic Acids Res. 2001;38 Suppl 1:D340–5.
Delcher AL, Harmon D, Kasif S, et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–41.
HMP consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–14.
Hoff K. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10(1):520.
Hoff KJ, Lingner T, Meinicke P, et al. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009;37:W101–5.
Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009;10(8):R83.
Krause L, Diaz NN, Bartels D, et al. Finding novel genes in bacterial communities isolated from the environment. Bioinformatics. 2006;22:e281–9.
Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998;26:1107–15.
Markowitz VM, Chen IM, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9(1):386.
Morozova O, Hirst M, Marra M. Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet. 2009;10:135–51.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008;15:387–96.
Quail MA, Smith M, Coupland P, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–86.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Richter DC, Ott F, Auch AF, et al. MetaSim – a sequencing simulator for genomics and metagenomics. PLoS ONE. 2008;3:e3373.
Salzberg SL, Delcher AL, Kasif S, et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26:544–8.
Stewart AC, Osborne B, Read TD. DIYA: a bacterial annotation pipeline for any genomics lab. Bioinformatics. 2009;25(7):962–3.
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics. 2013;29(1):114–6.
Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Wu S, Zhu Z, Fu L, et al. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics. 2011;12:444.
Ye Y, Choi JH, Tang H. RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinforma. 2011;12:159.
Yooseph S, Li W, Sutton G. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinforma. 2008;9:182.
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–6.

FR-HIT Overview

Beifang Niu, Zhengwei Zhu, Limin Fu and Sitao Wu
Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, CA, USA

Definition

A crucial step in metagenomic data analysis is fragment recruitment, a process of aligning
FR-HIT Overview, Fig. 1 Recruitment rate and speed of FR-HIT and other programs for four datasets. The x-axis is the
ratio of CPU time relative to BLASTN; y-axis is the ratio of number of recruited reads relative to BLASTN
that do not have enough common q-grams. In this step, the length of q-gram is 4. After filtering, banded alignments between the query and the candidate blocks that passed the filter are performed.

The fragment recruitment performance of FR-HIT was compared to some widely used short-read mapping and sequence alignment tools including BLASTN, MegaBLAST, SOAP2, BWA, BWA-SW, SSAHA2, BLAT, and LAST using four metagenomic datasets of up to one million reads covering 454 GS20, 454 GSFLX, 454 Titanium, and Illumina platforms. Reads are aligned to available microbial reference genomes and considered recruited if the alignments are at least 30 bp long and have at least 80 % identity.

The overall comparison of CPU time and the number of recruited reads is shown in Fig. 1. On average, FR-HIT is ~2 orders of magnitude faster than BLASTN with a similar recruitment rate. FR-HIT is slower than the mapping programs SOAP2, BWA, and BWA-SW, but it recruits several times more reads.

The results of alignments from FR-HIT can be interactively visualized using Fragment Recruitment Viewer, a tool that plots the alignments on a 2D map where the x-axis is the genome coordinate and the y-axis is the alignment identity (Fig. 2). The map can be operated like a Google Map so that users can explore the recruitment alignments from one or multiple samples to many reference genomes. Fragment Recruitment Viewer is available from http://weizhongli-lab.org/mgaviewer. Some pre-calculated recruitment results using FR-HIT are available from the CAMERA project (http://camera.calit2.net).
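
The two cutoffs described in this section, a minimum number of shared q-grams before banded alignment and the recruitment rule of at least 30 bp at 80 % identity, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FR-HIT's implementation: the function names and the `min_shared` cutoff are hypothetical, and only the q-gram length of 4 and the 30 bp / 80 % values come from the text.

```python
from collections import Counter

Q = 4  # q-gram length stated in the text for this filtering step


def qgram_counts(seq, q=Q):
    """Count every overlapping q-gram in a sequence."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def shared_qgrams(query, block, q=Q):
    """Number of q-grams common to both sequences (multiset intersection)."""
    return sum((qgram_counts(query, q) & qgram_counts(block, q)).values())


def passes_filter(query, block, min_shared, q=Q):
    """Keep a candidate block only if it shares enough q-grams with the query.

    min_shared is a hypothetical cutoff supplied by the caller; the text
    does not state the value FR-HIT uses.
    """
    return shared_qgrams(query, block, q) >= min_shared


def is_recruited(aligned_length, percent_identity,
                 min_length=30, min_identity=80.0):
    """Recruitment rule from the benchmark: >= 30 bp and >= 80 % identity."""
    return aligned_length >= min_length and percent_identity >= min_identity
```

A block that fails `passes_filter` would be discarded before any alignment is attempted, which is where a filter of this kind saves time; `is_recruited` is then applied to the alignments that remain.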
FR-HIT Overview, Fig. 2 Screenshots of the Fragment Recruitment Viewer. The initial view of plot shows all hits to the full reference genome. X-axis is the genome coordinate, and y-axis is the alignment identity. Hits are colored by samples. The bottom of the plot shows genes of the reference genome colored by gene type (protein, rRNA, and tRNA). At right bottom corner, there are a few icons to zoom in, zoom out, increase and decrease plot size, and reset to the default view. Mouse wheel can be used to zoom the plot. Plot can be panned using mouse. Information of an alignment or a gene is displayed when the pointer is over it
Summary

FR-HIT is an important tool to perform fragment recruitment analysis for metagenomic sequences. The recruitment results can be visualized using the Fragment Recruitment Viewer. They can also be analyzed to provide taxonomy and function annotations. As a fast alignment tool, FR-HIT can also be used for many applications such as filtering out human contamination from human microbiome samples.

References

Burkhardt S, Cramer A, Ferragina P. q-gram based database searching using a suffix array (QUASAR). RECOMB ’99; 1999 Apr 11–14; Lyon; 1999, pp. 77–83.
Jokinen P, Ukkonen E. 2 algorithms for approximate string matching in static texts. In: Tarlecki A, editor. Mathematical foundations of computer science. Lecture notes in computer science, vol 520. Berlin: Springer; 1991, pp. 240–248.
Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24:713–4.
Niu B, Zhu Z, Fu L, et al. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics. 2011;27:1704–5.
Owolabi O, Mcgregor DR. Fast approximate string matching. Softw Pract Exp. 1988;18:387–93.
Rusch DB, Halpern AL, Sutton G, et al. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5:e77.

Functional Metagenomics of Bacterial-Cell Crosstalk

Tomas de Wouters1,3, Nicolas Lapaque1, Emmanuelle Maguin1, Joël Doré1,2,3 and Hervé M. Blottière1,2
1 INRA, AgroParisTech, Jouy en Josas, France
2 US 1367 MetaGenoPolis, INRA, Jouy en Josas, France
3 UMR Micalis, AgroParisTech, Jouy en Josas, France

Synonyms

Host-microbiota interactions

Definition

Cultivability and metabolic interdependence of microbes in their ecosystems have confronted microbial ecologists with “the great plate-count anomaly” (Staley and Konopka 1985) since the beginning of their studies. The term summarizes the great discrepancy between the loads of microscopically observed bacteria in an environmental sample and the lower numbers obtained using culture-dependent counting techniques, indicating the lack of representativeness of culture-dependent techniques in the study of most complex bacterial ecosystems.

The development of molecular cloning approaches led microbial ecologists to explore the enzymatic potential of their ecosystems by heterologous expression. They developed techniques to extract total genomic DNA of bacterial origin from complex environmental samples. These metagenomes can subsequently be expressed in a well-known and cultivable host using fosmids, cosmids, or bacterial artificial chromosomes (BACs). The first application of this technique allowed the identification of formerly unknown fibrolytic enzymes from, among others, anaerobic and Gram-positive bacteria (Healy et al. 1995) using E. coli (a Gram-negative bacterium) as a host. The use of heterologous expression of the metagenome of an ecosystem to identify functionalities of uncultivable bacteria was later coined “functional metagenomics” as opposed to the use of molecular techniques for phylogenetic characterization and in silico functional predictions of microbial ecosystems called metagenomics.
been explored mainly through metagenomic studies of their phylogenetic composition and their metabolic repertoire as far as in silico prediction is possible.

Most attention has been focused on the intestinal microbiota, not only because of its unique bacterial density but also because of the large mucosal interface that exposes the human body to this bacterial load. The study of germ-free animals and large human cohorts revealed correlations between the composition of the intestinal microbiota and physiological conditions of the host, such as the proper development of immunity, a balanced metabolism, and the systemic inflammatory status (Cerf-Bensussan and Gaboriau-Routhiau 2010). This systemic impact indicates an interaction between the intestinal microbiota and the host that has since been subject to intensive scientific research.

Functional Studies of the Intestinal Microbiota

The human intestinal microbiota harbors a genetic repertoire more than 25 times larger than that of each human host (Qin et al. 2010), encoding a multitude of functions that contribute directly or indirectly to the host’s physiology. Cultivation efforts as compared to molecular techniques revealed that 70–80 % of the dominant bacteria are not yet cultured. Therefore up to 80 % of the intestinal microbes have no representative in any bacterial strain collection for potential functional studies (Suau et al. 1999; Hayashi et al. 2002). Functional studies of intestinal bacteria have therefore long been limited to the study of cultivable bacteria or the study of monoxenic and gnotobiotic animal models. In order to circumvent this limitation, culture-independent methods such as functional metagenomics have been adapted and used to study functions of the human intestinal microbiota (Table 1). Initially the approach was used to search for enzymatic activities specific for intestinal metabolic functions.

Using a BAC library prepared in an E. coli host, Walter and colleagues screened a mouse intestinal metagenome for β-glucanase activity, identifying 3 out of a total of 5,760 clones (containing a total of 320 Mb of genomic DNA, each clone bearing on average 55 kb) encoding enzymes of interest (Walter et al. 2005). Similarly, by screening a small-fragment metagenomic library (14,000 clones, representing 77 Mb of genomic DNA, with cloned DNA fragments of up to 8 kb) derived from cow rumen content, Ferrer and colleagues identified and characterized 22 clones with distinct hydrolytic activities (Ferrer et al. 2005). In these two studies, the screening process only allowed a very limited coverage of the actual metagenome due to the size of the library. Although several studies have identified hydrolytic enzymes using plasmid libraries, one of the key issues of the functional approach is to obtain libraries bearing large fragments of DNA to have access to full operons and operational gene clusters, i.e., from 10 to 50 kb.

Indeed, Jones and colleagues developed a more promising approach by screening about 90,000 metagenomic fosmid clones derived from a human fecal sample (representing a total of about 3.6 Gb bacterial DNA which is about one
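
The library figures quoted in this section for the three screens can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the clone counts, insert sizes, and library sizes given in the text.

```python
def library_size_bp(n_clones, insert_bp):
    """Total cloned DNA in a library, in base pairs."""
    return n_clones * insert_bp


def implied_insert_bp(total_bp, n_clones):
    """Average insert size implied by a library's total size."""
    return total_bp / n_clones


def clones_per_hit(n_positive, n_screened):
    """How many clones were screened for each positive clone."""
    return n_screened / n_positive


# BAC library (Walter et al. 2005): 5,760 clones of about 55 kb each
walter_total_mb = library_size_bp(5_760, 55_000) / 1e6
walter_clones_per_hit = clones_per_hit(3, 5_760)

# Small-insert library (Ferrer et al. 2005): 14,000 clones, 77 Mb in total
ferrer_mean_insert_kb = implied_insert_bp(77e6, 14_000) / 1e3
ferrer_clones_per_hit = clones_per_hit(22, 14_000)

# Fosmid library (Jones and colleagues): about 90,000 clones, about 3.6 Gb
jones_mean_insert_kb = implied_insert_bp(3.6e9, 90_000) / 1e3

print(f"Walter: {walter_total_mb:.1f} Mb, 1 hit per {walter_clones_per_hit:.0f} clones")
print(f"Ferrer: {ferrer_mean_insert_kb:.1f} kb mean insert, "
      f"1 hit per {ferrer_clones_per_hit:.0f} clones")
print(f"Jones: {jones_mean_insert_kb:.0f} kb mean insert")
```

The BAC library works out to roughly 317 Mb, consistent with the rounded 320 Mb in the text, and the fosmid library implies inserts of about 40 kb, inside the 10 to 50 kb range named as the target for capturing full operons.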
strain. Since Gram-positive bacteria represent a large part of the intestinal microbiota and most of the probiotic bacteria described to have beneficial effects on human health are Gram-positive, great efforts have been made to develop easy cloning tools for such studies in Gram-positive hosts. Since the expression of heterologous genes in E. coli gave access to around 40 % of the genes for both Gram-positive and Gram-negative bacteria (Gabor et al. 2004), E. coli is a suitable but not universal host. The utility of a Gram-positive bacterial host rests on a possible preference for its ribosome-binding sites (RBSs), and hence increased translation, but also on secretion of proteins through Gram-positive-specific signal peptides or possible surface exposure of bioactive proteins through cell wall anchoring motifs. Screenings of metagenomic libraries in Streptomyces spp. (Wang et al. 2000) and even Archaea (Albers et al. 2006) have successfully been performed for other ecosystems. Efforts for targeted expression of candidate proteins of the human intestinal microbiota have been made by developing prediction tools for surface-exposed and secreted proteins in Gram-positive hosts in order to mine the abundantly available metagenomic data (Barinov et al. 2009). The expression of the identified candidate genes in a Gram-positive host such as Bacillus subtilis or Lactococcus lactis will allow functional screening in cell-based assays. Though such tools have been used for functional screens of pathogen-cell interaction, a functional metagenomic study of interactions between commensal bacteria and host cells using a Gram-positive host has not been published yet.

Summary

Metagenomic studies are applied to complex systems. Functional metagenomics is no exception. If we study a complex system, simplification can bring clarity. This is the case if we search for specific enzymatic activities in a complex ecosystem. Simultaneously, simplification harbors the danger of oversimplification and therefore error or deception.

The authors consider functional metagenomics a very useful and powerful tool to screen complex ecosystems for specific functions and believe it can be extended to the study of host-microbiota interactions as performed in the studies mentioned above. For a full understanding of the complex interaction of a microbiome with its cellular counterpart, this is however only an exploratory tool that will always require validation in a more holistic and thus more complex model (Fig. 1).
Functional Metagenomics of Bacterial-Cell Crosstalk, Fig. 1 Possible models to study host-microbiota interactions ordered by complexity of the microbial (ordinate) and cellular model (abscissa) toward the understanding of human intestinal physiology
Cross-References

▶ Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity
▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview

References

Albers S-V, Jonuscheit M, Dinkelaker S, Urich T, Kletzin A, Tampé R, et al. Production of recombinant and tagged proteins in the hyperthermophilic archaeon Sulfolobus solfataricus. Appl Environ Microbiol [Internet]. 2006 [cited 2011 Aug 21];72(1):102–11. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1352248&tool=pmcentrez&rendertype=abstract
Barinov A, Loux V, Hammani A, Nicolas P, Langella P, Ehrlich D, et al. Prediction of surface exposed proteins in Streptococcus pyogenes, with a potential application to other Gram-positive bacteria. Proteomics [Internet]. 2009 [cited 2012 Sep 5];9(1):61–73. Available from http://www.ncbi.nlm.nih.gov/pubmed/19053137
Cerf-Bensussan N, Gaboriau-Routhiau V. The immune system and the gut microbiota: friends or foes? Nature Rev Immunol [Internet]. Nature Publishing Group; 2010 [cited 2011 Jul 20];10(10):735–44. Available from http://www.ncbi.nlm.nih.gov/pubmed/20865020
Ferrer M, Golyshina OV, Chernikova TN, Khachane AN, Reyes-Duarte D, Santos V a PM Dos, et al. Novel hydrolase diversity retrieved from a metagenome library of bovine rumen microflora. Environ Microbiol [Internet]. 2005 [cited 2013 Jan 28];7(12):1996–2010. Available from http://www.ncbi.nlm.nih.gov/pubmed/16309396
Gabor EM, Alkema WBL, Janssen DB. Quantifying the accessibility of the metagenome by random expression cloning techniques. Environ Microbiol [Internet]. 2004 [cited 2011 Jun 22];6(9):879–86. Available from http://www.ncbi.nlm.nih.gov/pubmed/15305913
Gloux K, Berteau O, El Oumami H, Béguet F, Leclerc M, Doré J. A metagenomic β-glucuronidase uncovers a core adaptive function of the human intestinal microbiome. Proc Natl Acad Sci U S A [Internet]. 2011 [cited 2011 Jul 29];108(Suppl):4539–46. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3063586&tool=pmcentrez&rendertype=abstract
Gloux K, Leclerc M, Iliozer H, L’Haridon R, Manichanh C, Corthier G, et al. Development of high-throughput phenotyping of metagenomic clones from the human gut microbiome for modulation of eukaryotic cell growth. Appl Environ Microbiol [Internet]. 2007 [cited 2011 Jun 22];73(11):3734–7. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1932692&tool=pmcentrez&rendertype=abstract
Hayashi H, Sakamoto M, Benno Y. Phylogenetic analysis of the human gut microbiota using 16S rDNA clone libraries and strictly anaerobic culture-based methods. Microbiol Immunol [Internet]. 2002 [cited 2011 Apr 21];46(8):535–48. Available from http://www.ncbi.nlm.nih.gov/pubmed/12363017
Healy FG, Ray RM, Aldrich HC, Wilkie AC, Ingram LO, Shanmugam KT. Direct isolation of functional genes encoding cellulases from the microbial consortia in a thermophilic, anaerobic digester maintained on lignocellulose. Appl Microbiol Biotechnol [Internet]. 1995 [cited 2011 Aug 17];43(4):667–74. Available from http://www.ncbi.nlm.nih.gov/pubmed/7546604
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, et al. Structure, function and diversity of the healthy human microbiome. Nature [Internet]. Nature Publishing Group; 2012 [cited 2012 Jun 13];486(7402):207–14. Available from http://www.nature.com/doifinder/10.1038/nature11234
Jones BV, Begley M, Hill C, Gahan CGM, Marchesi JR. Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome. Proc Natl Acad Sci U S A [Internet]. 2008 [cited 2011 Aug 20];105(36):13580–5. Available from http://www.pnas.org/cgi/content/abstract/105/36/13580
Lakhdari O, Cultrone A, Tap J, Gloux K, Bernard F, Ehrlich SD, et al. Functional metagenomics: a high throughput screening method to decipher microbiota-driven NF-κB modulation in the human gut. Sturtevant J, editor. PLoS ONE [Internet]. 2010 [cited 2010 Oct 1];5(9):e13092. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2948039&tool=pmcentrez&rendertype=abstract
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature [Internet]. 2010;464(7285):59–65. Available from http://www.ncbi.nlm.nih.gov/pubmed/20203603
Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Ann Rev Microbiol [Internet]. 1985 [cited 2011 Aug 13];39:321–46. Available from http://www.ncbi.nlm.nih.gov/pubmed/3904603
Suau A, Bonnet R, Sutren M, Godon JJ, Gibson GR, Collins MD, et al. Direct analysis of genes encoding 16S rRNA from complex communities reveals many novel molecular species within the human gut. Appl Environ Microbiol [Internet]. 1999;65(11):4799–807. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=91647&tool=pmcentrez&rendertype=abstract
Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity
ecosystems and offers the potential to identify new genes from the microbiota, including its uncultured fraction. It is expected that about 40 % of enzymatic activities should be recoverable in E. coli (Gabor et al. 2004) and this host can express a significant number of genes (Handelsman et al. 1998; Rondon et al. 2000). The metagenomic approach has revealed new enzymes (Hayashi et al. 2005; Humblot et al. 2007; Yun et al. 2004; Kim et al. 2006; Tasse et al. 2010; Cecchini et al. 2013), anticancer products (Piel et al. 2005), and compounds important for industrial, biotechnological, or therapeutic applications (Streit and Schmitz 2004), all having no homolog in the host bacterium (E. coli). β-Glucuronidase represents an important function of interaction between the intestinal microbiota and the host and a relevant intestinal activity for human health.

Metagenomic libraries from microbiota obtained from human ileum or feces were constructed in E. coli and their phylogenetic diversity analyzed (Manichanh et al. 2006). The first functional approach using these libraries argued in favor of an efficiency of functional expression from the four dominant phyla of the digestive tract (Gloux et al. 2007). Despite the presence of β-glucuronidase genes in the host bacterium (E. coli), we designed a screening strategy that allowed the identification of numerous bioactive clones. Following primary screening for metagenomic clones overexpressing β-glucuronidase activity, we subcloned the inserts in a uidA- E. coli strain (Gloux et al. 2011). Overall, 19 out of 6,144 metagenomic clones tested had fosmids able to express a β-glucuronidase activity based on para-nitro-phenyl-β-D-glucuronide bioconversion (Bardonnet and Blanco 1992), with levels ranging from 0.02 to 0.88 units. Phylogenetic, genetic, and functional characteristics of β-glucuronidase-positive inserts were investigated. A novel BG gene encoding a β-glucuronidase was identified in both Firmicutes and Bacteroidetes genetic backgrounds. The protein encoded by the gene has two conserved glutamate residues required for catalysis (Salleh et al. 2006) and the conserved predicted TIM barrel domain structure of glycosyl hydrolase family 2 enzymes (Marchler-Bauer and Bryant 2004). The BG protein also had unique features, including an additional C-terminal domain compared to known β-glucuronidases and primary sequence specificities that led to the proposal of novel consensus motifs for the Firmicutes-borne BG and for glycosyl hydrolase family 2 (Gloux et al. 2011).

On the basis of sequence specificities, the frequency of the novel Firmicutes or Bacteroidetes BGs within the human gut metagenomes could be assessed. It was such that at least one homolog could be found within approximately 104 bacterial genes, making it by far the most dominant BG gene in human gut metagenomes. It was absent from other environmental metagenomes, including animal guts, making it specific to the human gut metagenome (Fig. 2). It was present in the genomes of numerous human intestinal commensals belonging to the phylogenetic and metagenomic cores described for the human microbiome (Tap et al. 2009; Qin et al. 2010). Finally, gene duplications and its spread across diverse phylogenetic lineages suggested an ecological drive to ensure the presence of the activity, via functional redundancy, in spite of population variability between individuals.

Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity, Fig. 2 Abundance of Firmicutes BG (blue), Bacteroidetes BG (orange), and uidA homologs (green) in different environments. Abundance was assessed as hits per billion base pairs to correct for size difference of metagenomic datasets. The hit threshold was set as at least 50 % identity with 50 % sequence coverage. For full details of the different genomic hits, see Gloux et al. (2011)

In conclusion, a novel class of BG was revealed by our functional metagenomic

[Figure labels (caption not recovered): gus; Firmicutes: Lachnospiraceae, Ruminococcaceae, Clostridiaceae, Peptostreptococcaceae, Streptococcaceae; Actinobacteria: Bifidobacteriaceae; Proteobacteria: Enterobacteriaceae; labeled taxa: Roseburia intestinalis, Rum. gnavus, Eubacterium eligens, OTU3, OTU24]
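
The abundance measure described in the Fig. 2 legend (hits per billion base pairs, with a hit requiring at least 50 % identity over 50 % sequence coverage) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names and the (identity, coverage) tuple layout are assumptions, and only the two thresholds and the per-billion-bp normalization come from the legend.

```python
def is_hit(percent_identity, coverage_fraction,
           min_identity=50.0, min_coverage=0.5):
    """Hit threshold from the legend: >= 50 % identity, >= 50 % coverage."""
    return percent_identity >= min_identity and coverage_fraction >= min_coverage


def hits_per_gbp(alignments, dataset_size_bp):
    """Hits per billion base pairs for one metagenomic dataset.

    alignments is an iterable of (percent_identity, coverage_fraction)
    pairs; dataset_size_bp is the total size of the dataset in base pairs.
    """
    n_hits = sum(1 for identity, coverage in alignments
                 if is_hit(identity, coverage))
    return n_hits * 1e9 / dataset_size_bp
```

Dividing by dataset size is what makes datasets of very different sequencing depth comparable, which is the stated purpose of the correction in the legend.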
majority of sequences had indeed been captured by the degenerate PCR approach (McIntosh et al. 2012). There were slight differences in relative abundance, which is not surprising considering the difference in technical approach as well as volunteer numbers, but overall both approaches correlated significantly in terms of relative abundance as well as prevalence of different OTUs. Thus, a targeted approach based on degenerate primers appears to provide a good coverage of β-glucuronidase genes. It currently also allows for a more in-depth analysis per volunteer, as the actual metagenomic sequence coverage per volunteer in the pioneer metagenomic sequencing studies is relatively low and many genes are only partially covered. With the vast advances in sequencing technology, however, direct metagenomic mining for specific functional genes will become increasingly attractive.

Sequence-based analysis of functional genes poses the risk of assigning functions to genes that may in fact carry out a different activity, and the actual enzyme activity will ultimately have to be established for representatives of gene variants less closely related to biochemically characterized ones. Especially for glycoside hydrolases, it is often difficult to infer function from sequence alone (▶ Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource). The two β-glucuronidase genes are only remotely related to each other based on protein sequence identity and belong to glycoside hydrolase family 2, which also includes enzymes with other specificities, including β-galactosidases and β-mannosidases (http://www.cazy.org/GH2.html). The gus gene has been characterized biochemically in bacteria from different phylogenetic backgrounds (Beaud et al. 2005; Russell and Klaenhammer 2001), and the presence of the gene in a panel of human gut isolates correlated relatively well with the detection of β-glucuronidase activity (Dabek et al. 2008). On the other hand, it was shown that different strains of the same species can show differences in enzyme activity levels when grown under the same conditions and that the level to which β-glucuronidase activity is induced depends on the growth substrate and the presence of β-glucuronides (Dabek et al. 2008; McIntosh et al. 2012). The BG gene was identified by screening for β-glucuronidase activity (Gloux et al. 2011), thus confirming its function. An investigation of several bacteria that harbor a BG gene but no gus gene revealed only low levels of β-glucuronidase activity, in both the absence and presence of β-glucuronide as inducer (McIntosh et al. 2012). Thus, BG genes may only be expressed under specific conditions that are yet to be identified. Alternatively, some variants of this diverse gene family may actually encode enzymes with different substrate specificities.

In conclusion, sequence-based analysis of genes encoding β-glucuronidases can be used to reveal the diversity of the β-glucuronidase-positive community and forms a solid basis for further functional investigation of this activity in representative organisms.

Summary

The metabolic activities of the microbial community present in the human gut are closely linked to the physiological status and overall health of its host. Bacterial β-glucuronidase activity directly interferes with one of the major host detoxification systems for a wide range of lipophilic compounds that enter the body via the diet, drugs, or exposure to environmental pollutants, as well as endogenous molecules. Glucuronidation of those compounds renders them more hydrophilic and facilitates their excretion, but β-glucuronidase activities within the gut microbiota convert them back to their respective aglycones, which leads to an extended retention time in the body. Many of those compounds are toxic or carcinogenic, but potentially health-promoting compounds, such as plant phenolics ingested with the diet, may also be glucuronidated. Metagenomics can be utilized to enhance our understanding of which bacteria in the human gut carry β-glucuronidase activity. A functional metagenomic approach, whereby genes from environmental communities are expressed in a heterologous host, has led to the identification of a novel type of β-glucuronidase gene, which was found to be prevalent within the
human gut microbiota but not commonly found in other environments. Metagenomic sequence mining for this novel gene, as well as a previously known β-glucuronidase gene, revealed the distribution of these genes in different phylogenetic lineages. These results provide a valuable foundation for further functional characterization of this important microbial activity.

Cross-References

▶ Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource
▶ Fosmid System

References

Bardonnet N, Blanco C. uidA-antibiotic-resistance cassettes for insertion mutagenesis, gene fusions and genetic constructions. FEMS Microbiol Lett. 1992;72:243–7.
Beaud D, Tailliez P, Anba-Mondoloni J. Genetic characterization of the beta-glucuronidase enzyme from a human intestinal bacterium, Ruminococcus gnavus. Microbiology. 2005;151:2323–30.
Cecchini DA, Laville E, Laguerre S, Patrick Robe P, Leclerc M, Doré J, Henrissat B, Remaud-Siméon M, Pierre Monsan P, Potocki-Véronèse G. Functional metagenomics reveals novel pathways of prebiotic metabolization by human gut bacteria. PLoS ONE. 2013;8:1–9.
Dabek M, McCrae SI, Stevens VJ, Duncan SH, Louis P. Distribution of β-glucosidase and β-glucuronidase activity and of β-glucuronidase gene gus in human colonic bacteria. FEMS Microbiol Ecol. 2008;66:487–95.
Flores R, Shi J, Gail MH, Gajer P, Ravel J, Goedert JJ. Association of fecal microbial diversity and taxonomy with selected enzymatic functions. PLoS ONE. 2012;7:e39745.
Gabor EM, Alkema WB, Janssen DB. Quantifying the accessibility of the metagenome by random expression cloning techniques. Environ Microbiol. 2004;6:879–86.
Gloux K, Leclerc M, Iliozer H, L’haridon R, Manichanh C, Corthier G, Nalin R, Blottière HM, Doré J. Development of high-throughput phenotyping of metagenomic clones from the human gut microbiome for modulation of eukaryotic cell growth. Appl Environ Microbiol. 2007;73:3734–7.
Gloux K, Berteau O, El Oumami H, Béguet F, Leclerc M, Doré J. A metagenomic β-glucuronidase uncovers a core adaptive function of the human intestinal microbiome. Proc Natl Acad Sci U S A. 2011;108:4539–46.
Haiser HJ, Turnbaugh PJ. Developing a metagenomic view of xenobiotic metabolism. Pharmacol Res. 2013;69:21–31.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Hayashi H, Abe T, Sakamoto M, Ohara H, Ikemura T, Sakka K, Benno Y. Direct cloning of genes encoding novel xylanases from the human gut. Can J Microbiol. 2005;51:251–9.
Henrissat B, Cantarel B, Coutinho P. Carbohydrate-active enzymes database, metagenomic expert resource. http://www.springerreference.com/index.chapterbid/303280
Humblot C, Murkovic M, Rigottier-Gois L, Bensaada M, Bouclet A, Andrieux C, Anba J, Rabot S. Beta-glucuronidase in human intestinal microbiota is necessary for the colonic genotoxicity of the food-borne carcinogen 2-amino-3-methylimidazo[4,5-f]quinoline in rats. Carcinogenesis. 2007;28:2419–25.
Kim DH, Jung EA, Sohng IS, Han JA, Kim TH, Han MJ. Intestinal bacterial metabolism of flavonoids and its relation to some biological activities. Arch Pharm Res. 1998;21:17–23.
Kim DH, Hong SW, Kim BT, Bae EA, Park HY, Han MJ. Biotransformation of glycyrrhizin by human intestinal bacteria and its relation to biological activities. Arch Pharm Res. 2000;23:172–7.
Kim YJ, Choi GS, Kim SB, Yoon GS, Kim YS, Ryu YW. Screening and characterization of a novel esterase from a metagenomic library. Protein Expr Purif. 2006;45:315–23.
Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J. Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut. 2006;55:205–11.
Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004;32:327–31.
McBain AJ, Macfarlane GT. Ecological and physiological studies on large intestinal bacteria in relation to production of hydrolytic and reductive enzymes involved in formation of genotoxic metabolites. J Med Microbiol. 1998;47:407–16.
McIntosh FM, Maison N, Holtrop G, Young P, Stevens VJ, Ince J, Johnstone A, Lobley G, Flint HJ, Louis P. Phylogenetic distribution of genes encoding β-glucuronidase activity in human colonic bacteria and the impact of diet on faecal glycosidase activities. Environ Microbiol. 2012;14:1876–87.
Morotomi M, Nanno M, Watanabe T, Sakurai T, Mutai M. Mutagenic activation of biliary metabolites of
F 198 Functional Viral Metagenomics and the Development of New Enzymes
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
Thomas W. Schoenfeld, Michael J. Moser and David Mead
Lucigen Corporation, Middleton, WI, USA
Introduction
The enzymes of phages and other viruses were vital to the early development of molecular biology and are still essential tools. However, the available viral enzymes represent a tiny sample of the potential diversity found in the global virosphere. Viral metagenomics has revealed a vast diversity of novel genes and its virtually limitless potential to provide new enzymes for use in molecular analysis. An important challenge to both the understanding of viral ecology and development of new viral enzymes is functional characterization of metagenomic sequences, which has lagged far behind the ability to collect sequence data. Described is
a program to identify and characterize replication operons of viral metapopulations isolated from natural thermal environments and develop the gene products as thermostable enzymes for nucleic acid amplification and sequencing. Approaches to functionally characterize viral replicases include (1) expression and biochemical analysis of gene products identified by sequence similarity, (2) functional screens to discover new families of genes, and (3) assembly of operons to predict function based on gene position. These approaches have uncovered at least two diverse families of replication operons including dozens of genes for thermostable DNA polymerases and reverse transcriptases, as well as likely replicase subunits. In addition, functional screens have uncovered one viral Pol unrelated to any known protein. These enzymes are being engineered as improved PCR, RT PCR, and DNA sequencing reagents. Diversity in the viral metagenomes is also being explored to optimize the activity of the genes discovered in the libraries and make them more suitable for the targeted applications.
Gene products of phages and other viruses (collectively referred to here as viruses) have historically provided many of the enzymatic tools for molecular biology. However, most of the commonly used viral enzymes are derived from a very limited number of cultivated viruses, primarily phages T4, T7, lambda, SP6, and phi29, and retroviruses Moloney murine leukemia virus (Mo-MLV) and avian myeloblastosis virus (AMV). The program to study hot spring virology in Yellowstone National Park (YNP), California, and Nevada has provided insight into viral ecology (Otto et al. 1998; Breitbart et al. 2004; Schoenfeld et al. 2008) and has revealed a nearly unlimited source of diversity for the search for new enzymes (Beechem et al. 1998; Moser et al. 2012; Perez et al. 2012). However, current approaches to functional analysis of viral metagenomes, while informative, are limited by their reliance on sequence similarity to infer gene function. Improvements in the ability to functionally characterize viral metagenomes are necessary to advance the field.
Thermostable DNA polymerases (Pols) have been a major research focus due mainly to their wide use in molecular detection and analysis. DNA polymerases are essential for PCR (Staley and Konopka 1985) and other target-specific (Petruska et al. 1998; Notomi et al. 2000) and whole genome amplification methods (Goodman and Fygenson 1998) and are also essential components of all the major DNA sequencing platforms. Sanger (dideoxy chain termination) DNA sequencing was the first major sequencing method to use DNA polymerases and was advanced by thermostable Pols (Tang et al. 2008). All of the leading next-generation sequencing-by-synthesis platforms (e.g., Roche/454 FLX, Illumina Genome Analyzer, Helicos Heliscope, Pacific Biosciences SMRT, ABI SOLiD) (Mardis 2008b; Shendure and Ji 2008) use at least one DNA polymerase for base discrimination and/or template preparation. DNA polymerase-based methods are driving discovery in research labs and, increasingly, in the clinic (Bhui-Kaur et al. 1998) as methods for nucleic-acid-based detection of infectious agents, cancer, and genetic variation advance next-generation diagnostics and personalized medicine. Progress in improving all these methods depends in part on more suitable DNA polymerases.
Viruses are rich sources of diverse new DNA polymerases. Compared to their cellular hosts, viruses use a wide array of strategies to replicate their genomes, and their genomes adopt nearly every conceivable form, including double-stranded and both positive and negative single-stranded RNA and DNA forms, with linear, circular, and multipartite topologies ranging in size from 1.2 Mb (mimivirus) down to 3.2 kb (hepatitis B virus) (Blanco et al. 1989; Detter et al. 2002). While many of these replicative strategies rely on host enzymes, a substantial subset of viral families supplies its own replication proteins. There is speculation that viruses may have played a key role in the evolution of replication strategies used by cellular life (Koonin 2006).
As replicases, viral polymerases are functionally distinct from the bacterial and archaeal
enzymes currently used in molecular biology. During prokaryotic cellular replication, processive leading-strand synthesis depends on a multisubunit complex including Pol III holoenzyme, helicases, and primases. E. coli Pol III holoenzyme is a 791 kD protein comprised of nine subunits (reviewed in Xiang et al. (2008)). Due to their complexity, no Pol III derivative has been developed as a molecular biology reagent. Cell-derived reagent Pols, e.g., Taq, Pfu, or E. coli DNA polymerases, are all bacterial Pol I or archaeal Pol II derivatives that are mainly responsible in vivo for lagging strand and repair synthesis, neither of which requires strand separation or processive synthesis of long sequences. Viral Pols are functionally more like the leading-strand replicases and, accordingly, exhibit higher fidelity, rates of synthesis, and processivity (Ley et al. 2008). Phage T7 Pol, for example, incorporates 300 nt per second, six times faster than Escherichia coli Pol I; T4 phage replicates DNA ten times faster than its E. coli host (Heckler et al. 1984). Phi29 Pol has a processivity of >70,000 nucleotides (Blanco et al. 1989) (i.e., it incorporates over 70,000 nucleotides before dissociating), far greater than that of Taq Pol I, which has a processivity of between 50 and 80 (Merkens et al. 1995). Phi29 also has a strong strand displacement capability that, together with its processivity, makes it the polymerase of choice for whole genome amplification by multiple displacement amplification (MDA) (Dean et al. 2001). T7 phage Pol holoenzyme has a processivity of 1,000 nucleotides (Tabor et al. 1987) and efficiently incorporates chain-terminating nucleotide analogs, which facilitated Sanger sequencing until it was displaced by Thermo Sequenase, a Taq Pol derivative that was engineered based on the nucleotide variation in T7 DNA Pol that conferred efficient incorporation of dideoxynucleotides (Tabor and Richardson 1995). T5 Pol has both high processivity and a potent strand displacement activity that are independent of additional host or viral proteins (Andraos et al. 2004). T4 DNA Pol has a high proofreading activity that is commonly exploited for generating blunt ends, especially in physically sheared DNA (Karam and Konigsberg 2000). Retroviral replicases (i.e., reverse transcriptases), especially Mo-MLV and AMV, are indispensable for detection, analysis, and cloning of transcripts and RNA viruses (Morin et al. 2008; Wang et al. 2008). Together, these qualities make viral Pols attractive targets for development as reagents.
While the emphasis has been DNA polymerases, viruses encode other useful enzymes. RNA polymerases, for example, are key components of a number of in vitro and in vivo transcription and translation systems, as well as several transcription-mediated amplification methods (Guatelli et al. 1990; Compton 1991). Virtually all ligation methods used for cloning and linker attachment depend on T4 DNA ligase due to its high activity on 5′- and 3′-extended and blunt DNA. The integrases and recombinases of various phages (e.g., lambda red and P1 cre/lox) have been used to integrate genes into bacterial and eukaryotic genomes. Resolvases (e.g., T4 endonuclease VII and T7 endonuclease I) have been used to detect single nucleotide polymorphisms (SNPs) (Babon et al. 2003). It is likely that these and many other methods that rely on viral enzymes can be further improved by novel enzyme activities. Functional metagenomic-based enzyme discovery and development should benefit a wide range of applications.
The enzymes that have been isolated by cultivation over the years demonstrate the potential of viruses as a source of new enzymes, but greatly underrepresent the richness of this resource. The extreme global abundance and diversity of viruses is well documented (Breitbart et al. 2002; Angly et al. 2006; Dinsdale et al. 2008; McDaniel et al. 2008; Schoenfeld et al. 2008). A liter of ocean water contains as many viruses as there are humans on the planet and much more genetic diversity (Wang et al. 2007). In fact, the bulk of the world’s genetic diversity is probably encoded in viral genomes. Despite the richness of the global virosphere as a source of diverse replicative proteins, standard approaches to discovering new enzymes by cultivating the viruses have proven extremely inefficient and few new viral enzymes have been commercialized in the past decades.
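The rate and processivity figures quoted earlier in this entry for phage and bacterial polymerases can be set side by side with a little arithmetic. The following Python sketch is illustrative only: the 10,000 nt template length and the function names are assumptions introduced here, the E. coli Pol I rate is merely implied by the statement that T7 Pol is six times faster, and treating processivity as a fixed run length per binding event is a simplification.

```python
import math

# Values quoted in the text above.
T7_RATE_NT_PER_S = 300        # phage T7 Pol incorporation rate
T7_FOLD_OVER_POL_I = 6        # T7 Pol is "six times faster" than E. coli Pol I
PROCESSIVITY_NT = {
    "phi29 Pol": 70_000,          # quoted as a lower bound (>70,000)
    "T7 Pol holoenzyme": 1_000,
    "Taq Pol I (low end)": 50,
    "Taq Pol I (high end)": 80,
}

TEMPLATE_NT = 10_000          # hypothetical template length, not from the text


def implied_pol_i_rate() -> float:
    """E. coli Pol I rate implied by the six-fold comparison (nt per second)."""
    return T7_RATE_NT_PER_S / T7_FOLD_OVER_POL_I


def seconds_to_copy(template_nt: int, rate_nt_per_s: float) -> float:
    """Time to copy a template at a constant rate, ignoring rebinding delays."""
    return template_nt / rate_nt_per_s


def binding_events(template_nt: int, processivity_nt: int) -> int:
    """Minimum binding events if each event adds `processivity_nt` nucleotides."""
    return math.ceil(template_nt / processivity_nt)


if __name__ == "__main__":
    pol_i_rate = implied_pol_i_rate()
    print(f"Implied Pol I rate: {pol_i_rate:.0f} nt/s")
    print(f"T7 Pol:  {seconds_to_copy(TEMPLATE_NT, T7_RATE_NT_PER_S):.1f} s")
    print(f"Pol I:   {seconds_to_copy(TEMPLATE_NT, pol_i_rate):.1f} s")
    for name, processivity in PROCESSIVITY_NT.items():
        events = binding_events(TEMPLATE_NT, processivity)
        print(f"{name}: at least {events} binding event(s)")
```

Under these assumptions a single phi29 Pol binding event would span the whole hypothetical template, whereas Taq Pol I would need on the order of a hundred or more, which is the practical sense in which processivity matters for amplifying long templates.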
Notably, despite their widespread potential applications and notwithstanding substantial effort, thermostable viral Pols have completely eluded discovery by cultivation. There are now 34 fully sequenced genomes from thermophilic viruses in the NCBI database (February 2010): 27 archaeal viruses and 7 bacteriophages. None of these genomes or broad screens of hundreds of cultivated Thermus phage (Lopatto et al. 2008) has produced a thermostable DNA polymerase. Extensive analysis of cultivated crenarchaeal viral genomes from high-temperature environments reveals few recognizable features other than a small number of methylases, helicases, glycosyltransferases, and several unknown but shared genes (Rehrauer et al. 1998). At least one presumptive DNA polymerase has been identified in an archaeal viral genome (Baklanov et al. 1984), but not expressed in the lab. At least five Pols have been expressed from thermophilic bacteriophage genomes (Wang et al. 2006; Schmidt et al. 2008; T. Schoenfeld, unpublished); however, for unknown reasons, these enzymes are only moderately thermostable and incapable of surviving thermocycling in PCR or sequencing, despite the thermostability of their host Pols. In order to identify useful thermostable Pols, more efficient approaches are needed.
One of the main barriers to discovery of new viral enzymes is the set of technical challenges associated with cultivation. It is widely noted that cultivation in the lab selects against the great majority of Bacteria and Archaea. Cultivation of new viruses introduces another extreme level of selection against the vast majority of natural populations because cultivation requires the investigator to choose a host that can be grown in the lab, which severely limits the comprehensiveness of the screens. When examining extreme environments like thermal springs, which are dominated by autotrophic microbes, this host selection is even more limiting. Most of these cultivation efforts have focused on viruses that infect heterotrophic Bacteria, especially Thermus (Reha-Krantz et al. 1998; Karam and Konigsberg 2000; Pavlov and Karam 2000; Bebenek et al. 2001; Blondal et al. 2003; Ding et al. 2008; Lopatto et al. 2008; Schmidt et al. 2008) or a small number of thermoacidophilic Archaea, particularly Sulfolobus and Acidianus (reviewed in Rehrauer et al. (1998)), due to the relative ease of cultivating these hosts. Metagenomics promises to overcome these barriers and provide a largely unbiased sampling of viral populations.
In some respects viral metagenomes are especially well suited for discovery of enzymes for use in molecular analysis. Viral genomes are highly diverse and dense with genes associated with nucleic acid metabolism (Paulsen and Wintermeyer 1984). For example, a typical bacterial genome of 2 Mb contains three to five DNA polymerase genes, only one of which, polA, encodes enzymes that have been used as reagents. In contrast, a comparable 2 Mb of viral metagenome can yield up to 40 pol genes (Schoenfeld et al. 2008). However, the promise of using this diversity to advance the understanding of global ecology and to develop useful enzymes from viral metagenomes is tempered by the challenge in assigning function to the genes. The gigabases of viral metagenomic sequence data that have been generated over the past decade have provided only inferential insight into function or biochemistry of the viral genes and, consequently, few new molecular tools. Efforts to glean insight from metagenomes are hampered by the nearly complete reliance on sequence similarity coupled with the extreme viral genomic diversity and the dearth of annotated sequences. Depending on the environment, 40–90 % of viral metagenomic sequences are unknown, novel sequences (Angly et al. 2006; Dinsdale et al. 2008; Bench 2007; Srinivasiah 2008; Schoenfeld 2008). All the next-generation platforms generate shorter reads that are even more difficult to assemble or align to sequences in GenBank, resulting in artificially low BLASTx homologies or, conversely, artificially high numbers of “unique” sequence (Wommack 2008). The VIROME database (virome.dbi.udel.edu) has cataloged 201 Mb of predicted open reading frames (ORFs) from long read sequence data (Feb 2010), the vast majority of which are novel and functionally uncharacterized.
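The polymerase gene counts quoted above (three to five pol genes in a typical 2 Mb bacterial genome versus up to 40 in a comparable 2 Mb of viral metagenome) and the 40–90 % range of unknown sequences can be turned into densities and ranges directly. The short Python sketch below is only a worked calculation on those quoted figures; the function and constant names are introduced here for illustration and are not part of any published analysis.

```python
from typing import Tuple

SPAN_MB = 2.0                       # sequence span used in the comparison
BACTERIAL_POL_GENES = (3, 5)        # "three to five" pol genes per genome
VIRAL_POL_GENES_MAX = 40            # "up to 40" pol genes per 2 Mb
UNKNOWN_PERCENT = (40, 90)          # share of reads with no known similarity


def genes_per_mb(gene_count: int, span_mb: float) -> float:
    """Gene density in genes per megabase."""
    return gene_count / span_mb


def fold_enrichment_range() -> Tuple[float, float]:
    """Lowest and highest viral-to-bacterial ratio of pol gene density."""
    viral = genes_per_mb(VIRAL_POL_GENES_MAX, SPAN_MB)
    bact_low, bact_high = (genes_per_mb(n, SPAN_MB) for n in BACTERIAL_POL_GENES)
    return viral / bact_high, viral / bact_low


def similar_percent_range() -> Tuple[int, int]:
    """Share of reads left for similarity-based annotation, in percent."""
    low_unknown, high_unknown = UNKNOWN_PERCENT
    return 100 - high_unknown, 100 - low_unknown


if __name__ == "__main__":
    low, high = fold_enrichment_range()
    print(f"Viral pol gene density: {genes_per_mb(VIRAL_POL_GENES_MAX, SPAN_MB):.0f} per Mb")
    print(f"Enrichment over a bacterial genome: {low:.0f}- to {high:.1f}-fold")
    print("Reads with a database match: %d-%d %%" % similar_percent_range())
```

Because the viral figure is stated as an upper bound, the resulting roughly 8- to 13-fold enrichment should be read as a best case rather than a typical value.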
Functional characterization of viral metagenomes has lagged far behind the ability to collect sequence data. Essentially none of the millions of gene functions inferred by sequence similarity has been proven biochemically by expression and analysis of the gene products. More importantly, the mere description of sequence similarity does little to further the understanding of viral biology or to identify useful new enzymes. Furthermore, sequence-similarity screens only identify genes with an annotated counterpart in a database. The relative scarcity of functionally annotated viral genes in GenBank has likely prevented discovery of truly novel enzyme families, which should be the strength of viral metagenomics.
Finally, a conceptual barrier associated with the definition of related viral types has prevented assembly of viral genomes, and, consequently, inferences into function that are based on gene position. Phage genes of related function, especially replication-related genes, often occur in proximity within operons (El Omari et al. 2006). Assembly of sequence reads should allow reconstruction of operons; however, standard approaches relying on nucleotide identities of greater than 95 % are ineffective in assembly of viral metagenomes and only a few very small, abundant phage genomes have been reconstructed from metagenomic data (Angly et al. 2006). Because even the relatively long Sanger reads are almost always too short to include more than one complete gene, these associations are generally missed. Because traditional shotgun sequencing, used in some of the work described below, involved the construction of clone libraries, adjacent genes could be identified by sequencing entire inserts from archived clones, but even this approach is limited by the sizes of inserts in the libraries, generally less than 5 kb. Since none of the next-generation sequencing methods uses clone libraries, this approach is impossible for most of the ongoing viral metagenomic projects. The fundamental problem is that viral populations are too molecularly diverse to accommodate this criterion. Among cultivated viruses, closely related phages are up to 50 % divergent at the nucleotide level (Truncaite et al. 2006; Wang and Silverman 2006). When assembly criteria are reduced to as low as 50 %, much larger assembled contigs are generated (Schoenfeld et al. 2008). This approach has proven effective in generating contigs that contain identifiable operons that not only allow isolation of genes of related function but allow mapping of diversity onto the protein structure. These sequence variations correspond to biochemical differences in the gene products and provide a guide to enzyme engineering. In the work described below, a tripartite approach was used for functional analysis of viral metagenomes including (1) expression and biochemical characterization of the “BLASTx hits,” (2) functional screens to identify enzymes too dissimilar to known genes to be detected by sequence similarity, and (3) assembly of operons to infer gene function based on position in the genome.
Methods
Sampling, Library Construction, and Sequencing
Sampling, library construction, and sequencing of the YNP samples have been described (Schoenfeld et al. 2008). The Great Boiling Spring samples were collected as described and amplified using the Repli-g kit (GE Healthcare). DNA was sheared and inserted into pETite vector (Lucigen) and the library used to transform E. coli HI-Control BL21(DE3) cells (Lucigen). Individual clones from both libraries were sequenced in their entirety using standard chemistry (Life Technologies).
Bioinformatics
Sequence assemblies were performed using Sequencher (Gene Codes) or SeqMan (DNASTAR). ClustalW analysis was performed as described (Nandakumar and Shuman 2005).
Functional Screens
The clones from the Great Boiling Spring samples were grown on Luria broth, pelleted, and resuspended in buffer containing lysozyme. Lysates were incubated for 10 min at 70 °C and
centrifuged, and the supernatants were tested for DNA polymerase activity using the standard assay. Positive clones were cultivated at 50 ml in LB and retested. The inserts of clones with activity were sequenced in their entirety.
Cloning, Expression, Purification, and Mutagenesis
DNA polymerase genes that were further characterized were expressed at higher levels by insertion into pET28 vector and expression in E. cloni EXPRESS BL21(DE3) (Lucigen). DNA polymerase was purified by heat treatment and standard chromatography methods. Mutagenesis was performed using the QuikChange II Site-Directed Mutagenesis Kits (Agilent).
Biochemical Analysis and Applications Development
Biochemical assays were performed using standard methods (Mardis 2008; Marks et al. 2008).
Results and Discussion
Sequence-Based and Functional Discovery of New DNA Polymerases
In a recent study of viral metagenomes from Yellowstone hot springs, more than 28,000 Sanger-based long sequence reads (nearly 30 Mb of sequence) were determined (Schoenfeld et al. 2008). BLASTx alignment to the nonredundant protein database indicated that 156 ORFs had similarity to known pol genes. Fifty-nine appeared to be complete genes and were tested for DNA polymerase activity. Ten showed activity and seven of these were sequenced in their entirety. Although highly divergent from known viral and cellular genes, four were loosely grouped with family A polymerases and three grouped with family B polymerases. These pol genes are referred to as “PyroPhage” followed by an identifying number. The family A pols detected by this screen were too divergent to be grouped, but the family B Pols are referred to below as “PyroPhage 4110-like Pols” in reference to the first one discovered.
The degree of sequence conservation among pol genes in these libraries, while relatively low, was higher than most sequences found in viral metagenomes. The discovery of 156 partial genes among roughly 600 viral genome equivalents suggests that sequence-based screens were relatively efficient in identifying pol genes. Nonetheless, there are important disadvantages to this approach. One is that the diversity of viral pol genes is likely to be high enough that interesting new enzymes are missed. Another problem is that a gene must be situated in the random clone so that an identifiable portion of it is within the read length of the sequencing method (>1,000 nucleotides by Sanger, much less by newer sequencing approaches) and the gene must not extend beyond the boundaries of the random insert so that it is incomplete. It is unknown how many genes failed to fulfill the first criterion and were within the insert, but not within the sequence range. Of the 156 identified candidate pol genes, only 38 % fulfilled the second criterion and appeared complete. Finally, the identification of a gene does not mean that the gene will express efficiently in E. coli. For unknown reasons, among the 59 likely complete genes, 83 % failed to express at detectable levels.
Functional screens address many limitations of sequence-similarity screens and can often detect completely novel activities regardless of divergence from known genes or position in the insert, as long as the complete gene is present. By their nature, functional screens only detect complete, expression-competent genes. Viral metagenomic DNA from the Great Boiling Spring, Gerlach, NV, kindly provided by Brian Hedlund and Jeremy Dodsworth (University of Nevada-Las Vegas), was used to construct a library that was screened for expression of thermostable pol activity. Screening of 2,800 clones resulted in the discovery of 12 that were positive for primer extension activity. Eleven of these were more than 97 % identical to each other and are referred to as the “PyroPhage 74-like polymerases” in reference to the first member discovered. These pol genes share up to 45 % identity with the other polA-type genes from Yellowstone (PyroPhage 3173 and 967) and 56 %
identity to PyroPhage 488, a pol gene isolated 8 years earlier in a sequence-based screen of a metagenome from Little Hot Creek, Long Valley, CA, which is 400 km from Gerlach, NV, but still in the Great Basin. The final clone identified in the functional screen, PyroPhage 347, had no significant similarity to any known pol gene. In fact, the best E value to any known gene was a barely significant 0.750, for an open reading frame of unknown function in a crenarchaeal virus. Due to this lack of similarity to genes of known function, this gene would never have been identified by sequence similarity.
The pol genes discovered by both screens were aligned by ClustalW to each other and to representative cellular and viral pol genes to construct a neighbor-joining tree (Fig. 1). Viral genes from these screens, as well as those retrieved from GenBank, were noticeably more diverse than cellular genes. Most PyroPhage pol genes are highly divergent from known cellular or viral pol genes. The exception is PyroPhage 3063, which is related to several polA genes of the Aquificales, which are known to be quite divergent from other bacterial polA genes (Griffiths and Gupta 2004).
Fig. 1 Polymerase phylogenetic tree. Full-length viral metagenomic DNA polymerase amino acid sequences were compared by ClustalW to representative viral, microbial, and eukaryotic Pols and displayed in a neighbor-joining tree
Since the libraries were constructed from different hot spring populations, direct comparisons are difficult. However, while the overall rate of discovery of apparent DNA polymerase genes was comparable for the sequence-based and functional screens (156 pol genes from 28,000 clones compared to 12 from 2,800 clones, respectively), the rate of discovery of functional thermostable
enzymes was much lower for the sequence screens than the functional screens (10 of 28,000 vs. 12 of 2,800). The diversity of the enzymes in the GBS library was much lower than those from Yellowstone springs, presumably reflecting a lower overall population diversity.
Biochemical Characteristics and Directed Engineering Improve Use of PyroPhage Pols in PCR and Sanger Sequencing
PyroPhage 3173 and 347 Pols proved to be the most thermostable of the newly discovered polymerases. In fact, these are the first viral Pols with adequate thermostability for PCR. PyroPhage 3173 Pol, which has been studied in greatest detail (Table 1), has adequate thermostability for thermocycling, inherent reverse transcriptase activity, and high fidelity that enable a number of applications for this enzyme. The proofreading activity proved highly beneficial for high-fidelity PCR amplification (Fig. 2). However, many applications benefit from the absence of proofreading activity. Alignment of the PyroPhage 3173 pol gene to E. coli polA (Beese and Steitz 1991) identified codons for two acidic residues, either of which could be mutated to eliminate exonuclease activity. This reduced fidelity to very close to that of Taq Pol but simplified its use in PCR and other amplification methods. Like most family A Pols, 3173 has a strong discrimination against dideoxynucleotides that made it less effective in Sanger sequencing. Based on alignment to known proteins (Tabor and Richardson 1995), mutation F418Y (Fig. 3a) reduced discrimination against chain terminators to nearly zero, making the enzyme very effective for dye terminator cycle sequencing (Fig. 3b).
Table 1 Biochemical characteristics of PyroPhage 3173 DNAP
3′–5′ exonuclease        Strong
5′–3′ exonuclease        None
Strand displacement      Strong
Extension from nicks     Strong
T½ at 95 °C              10 min
Km dNTPs                 40 µM
Km DNA                   5.3 nM
Processivity             42
Fidelity                 8 × 10⁴
3′ ends of amplicons     Blunt
Template                 DNA or RNA
Single-Enzyme RT PCR with 3173 DNA Polymerase
The thermostability and reverse transcriptase activities seen in PyroPhage 3173 Pol allow efficient RT PCR amplification of mRNA and viral RNA genomic targets with improved performance compared to alternative single-enzyme solutions (Fig. 4). Quantitative detection of viral targets is linear over at least seven logs of dilution (Fig. 5). These benefits have significantly improved detection of transcripts and RNA viruses (Moser et al. 2012).
Currently almost all RT PCR depends on retroviral RTs, i.e., M-MLV and AMV RTs, which, despite wide use, have well-documented deficiencies that compromise RT PCR. Side activities in retroviral reverse transcriptases, including RNAse H and terminal transferase, lead to mismatch extension artifacts (Blumenthal 1980; Blumenthal and Hill 1980; Harrison and Zimmerman 1984; Pulsinelli and Temin 1991; Shah et al. 1995; Vratskikh et al. 1995; Ho and Shuman 2002; van Dijk et al. 2004). Primer-dependent bias in extension efficiency (Yin et al. 2003) and fidelity (Cheng et al. 2005) likely account for documented inaccuracy of RT PCR quantification (Loeffler et al. 2003), poor correlation between tests (Nelson et al. 2001), and/or complete amplification failure depending on the RT and the abundance of transcript (Damasko et al. 2005). Inherently low synthesis fidelity (up to one error per 500 nt, 20X higher than Taq Pol) results in misincorporations, frameshifts, and deletions (Kerr and Sadowski 1972; Little 1981; Heaphy et al. 1987). Strand-switching (Strauch et al. 2003) probably causes the inter- and intramolecular rearrangement artifacts (Cherepanov and de Vries 2001) that can be preferentially extended (Sharp et al. 1994) and result in recombination or insertion/deletion (indel) artifacts in cDNA synthesis (Evans et al. 1989; Snyder et al. 1992). A consequence of
Functional Viral
Metagenomics and the
Development of New
Enzymes for DNA and
RNA Amplification and
Sequencing,
Fig. 2 Fidelity of
PyroPhage 3173 Pol and
its exo- derivative.
Fidelities of PCR
amplification of PyroPhage
3173 wt and exonuclease
minus Pols were compared
to commercial sources of
thermostable Pols in the
lacI forward mutation assay
(Lundberg et al. 1991)
Functional Viral
Metagenomics and the
Development of New
Enzymes for DNA and
RNA Amplification and
Sequencing,
Fig. 3 Directed
engineering of 3173 Pol to
improve Sanger
sequencing. (a) Shown is
the increased incorporation
of dideoxy- and acyclo-
nucleotides by the F418Y
mutant of PyroPhage 3173
Pol, as indicated by
increased inhibition of Pol
activity by chain-
terminating nucleotides. (b)
The F418Y mutant was
used as a direct substitute
for Thermo Sequenase in
a BigDye ® (ABI)
sequencing reaction
two-enzyme RT PCR is that the RT step can interfere with subsequent PCR (Harnett et al. 1985; McLaughlin et al. 1985; Evans et al. 1989; Petric et al. 1991; Snyder et al. 1992; Sharp et al. 1994), which compromises quantification of low abundance targets. Efforts to ameliorate these deficiencies include mutagenesis to disable or remove the RNase H domain (Downie et al. 2004). These mutations reduce rearrangements, but lead to increased substitution errors and bias (Blumenthal and Hill 1980; Middleton et al. 1985; Vratskikh et al. 1995).
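To put the quoted error rate in perspective, the short sketch below estimates the fraction of cDNA copies that carry no error at all. It assumes that errors are independent and uniformly distributed along the template, which is a simplification, and it takes the Taq Pol rate to be one twentieth of the retroviral RT rate, as stated above. The function and constant names are illustrative only.

```python
def error_free_fraction(length_nt: int, errors_per_nt: float) -> float:
    """Probability that a copy of the given length contains no errors,
    assuming independent errors at a constant per-nucleotide rate."""
    return (1.0 - errors_per_nt) ** length_nt


# Upper bound quoted in the text for retroviral RTs: one error per 500 nt.
RT_ERROR_RATE = 1 / 500
# The text states that the RT rate is 20X higher than that of Taq Pol.
TAQ_ERROR_RATE = RT_ERROR_RATE / 20

for length in (100, 500, 1000):
    rt = error_free_fraction(length, RT_ERROR_RATE)
    taq = error_free_fraction(length, TAQ_ERROR_RATE)
    print(f"{length:>5} nt: error-free with RT {rt:.1%}, with Taq Pol {taq:.1%}")
```

Under these assumptions, well under half of 500 nt products would be error-free at the upper-bound RT rate, which illustrates why fidelity matters for downstream sequencing.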
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 4 Reverse transcription PCR using PyroPhage 3173 Pol. (a) Total human liver RNA (1 µg, Promega) was reverse transcribed by M-MLV RT or PyroPhage 3173 Pol and then PCR amplified using Lucigen EconoTaq® PLUS Master Mix. Shown are targets of 144, 246, and 298 bp. (b) Single-enzyme RT PCR amplifications by PyroPhage 3173 Pol and Tth (Epicentre) were compared using a 160 bp MS2 phage RNA target over a 10^2- to 10^8-fold dilution series. Shown are real-time and post reaction melt data (top) and corresponding end point RT PCR agarose gel (bottom). Tth polymerase was used with Mn2+ as directed. Arrows show correct melt Tm (top) and amplicon (bottom)
Other enzymes have been explored as alternatives to retroviral RTs (e.g., Tth Pol (Rand and Gait 1984)), but none has proven a satisfactory replacement for most methods that rely on reverse transcription of RNA. PyroPhage 3173 is the most efficient Pol for single-enzyme RT PCR and, as such, an alternative to the retroviral RT-dependent methods.

Assembly of Composite Contigs from Viral Metagenomes
One anticipated drawback of using metagenomics as an enzyme discovery tool was the fragmentary nature of the reads, which was expected to hamper efforts to associate subunits of multisubunit enzymes. Many proteins, replicases in particular, function as multiple
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 5 Single-enzyme, one-step RT PCR amplification of MS2 phage RNA using 3173 Pol. MS2 RNA was amplified by 40 cycles of RT PCR using the primers shown in Table 1 and 3173 Pol. (a) Products from 89 to 362 bp in length were amplified using one-step single-enzyme RT PCR cycling conditions: 15 s at 94 °C, then (10 s at 94 °C, 30 s at 72 °C) × 40. Products were resolved by 2 % agarose gel electrophoresis. (b) The MS2 RNA was diluted from 10^1- to 10^7-fold and amplified using a primer pair corresponding to the 160 bp fragment in Panel A. Real-time PCR fluorescence in RFU (relative fluorescence units) vs. PCR cycles. (c) Post-amplification thermal melt in -dRFU/dTemperature vs. Temperature (°C). Light blue region indicates melt curves for specific products. (d) Standard curve PCR cycle threshold vs. log10 RNA copy number in triplicate with linear least squares best fit line
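The standard curve described for panel (d) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the Ct readings are hypothetical values invented for the example, and the efficiency formula E = 10^(-1/slope) - 1 is the conventional relationship for qPCR standard curves rather than something stated in the caption.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    if n != len(ct_values) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    if sxx == 0:
        raise ValueError("log10 copy numbers must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope; 1.0 means perfect doubling."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical Ct readings (not data from Fig. 5): three replicates per dilution.
log10_copies = [3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]
ct_values = [30.1, 30.3, 29.9, 26.8, 26.6, 26.9,
             23.4, 23.5, 23.2, 20.0, 20.2, 19.9]

slope, intercept = fit_standard_curve(log10_copies, ct_values)
print(f"slope = {slope:.3f} cycles per log10 copies")
print(f"intercept = {intercept:.2f} cycles")
print(f"efficiency = {amplification_efficiency(slope):.1%}")
```

A slope near -3.32 cycles per tenfold dilution corresponds to a doubling of product in every cycle.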
subunits. Indeed, the replicases of phages T4, T7, and Phi29 and viruses Mo-MLV, vaccinia, and herpes all function in vivo as multigene replication complexes encoding a number of subunits, e.g., helicases, primases, processivity factors, and clamp loaders (Blanco et al. 1994; Bertram et al. 1998; Goodman 1998; Reha-Krantz et al. 1998; Tang et al. 1998). While, in most cases, the polymerase subunits function independently in vitro, the utility may be improved by additional subunits. For example, T7 Pol apoenzyme, by itself, has low processivity and was not very effective in Sanger sequencing without its host-derived processivity factor, thioredoxin (Tabor et al. 1987; Tabor and Richardson 1987). Because proteins in replication complexes often have highly specific contacts with one another (Goodman 1998), it is important
that subunits are derived from the same viral genome and not from unrelated viruses.
Because these functionally related genes are often adjacent in operons, it is theoretically possible to identify them given long enough contiguous sequence. Experience shows that operons are almost always too large to be found in the relatively small insert clones seen in typical metagenomic libraries and, without modified assembly rules, are missed. With deep sequencing, these fragments could theoretically be assembled to recover complete viral genomes. In practice, the high degree of sequence polymorphism that characterizes viral metapopulations confounds assembly of related genes and only very limited assembly has been possible by standard protocols.
To accommodate this natural population diversity, assembly stringency was lowered experimentally from the standard 95 % identity to as low as 50 %. Assembly of the YNP Bear Paw (74 °C) and Octopus (93 °C) metagenomes at 50 % identity allowed recovery of composite contigs as large as 35 kb. Fully 7.04 Mb (33 %) of the Octopus reads assembled at this identity into 17 contigs of greater than 10 kb (Schoenfeld et al. 2008). These assemblies appear very reliable in associating orthologous sequences. Particularly in the Octopus library, the sequence reads are evenly distributed throughout the contigs with minimal stacking or other anomalies that would suggest amplification or cloning artifacts. The high numbers of reads on both strands, evenly distributed throughout the contigs, suggest these contigs represent independent clones of closely related genomes. Using the lower stringency assemblies, SNPs can be identified and mapped to the coding sequences. As additional biochemical and structural data become available, molecular diversity may be correlated with variations in function and structure.

Assembly of a Replication Operon from a Viral Metagenome
One of these contigs provided a unique opportunity to identify potential replicase subunits and associate population diversity of an assembled metagenome with the biochemistry of the gene products (Fig. 6).
This 16.5 kb contig, assembled at 50 % identity, includes 187 reads (average coverage of 11 reads per nucleotide position). GeneMark (Besemer and Borodovsky 2005) predicted 26 ORFs of greater than 100 nucleotides, which, when translated and annotated by BLASTp, appear to include at least a partial replication operon. The genes with the strongest similarity to four of these ORFs encode two primase subunits, uracil DNA glycosylase, a family B DNA polymerase, and nucleotide excision repair nuclease (dna G, udg, pol B, and ERCC4 genes, respectively). Homologs of these ORFs belong to crenarchaeal DNA replication/repair complexes (Roberts et al. 2003; Dionne and Bell 2005; Barry and Bell 2006). The predicted pol B gene showed 28 % identity to Pyrobaculum islandicum polB2 (Kahler and Antranikian 2000). Three of the discrete clones that include the pol B gene in this contig (PyroPhage 4110, 2783, and 2323 Pols; Fig. 1) have been expressed in E. coli to produce a functional thermostable DNA polymerase (data not shown). This contig also contains apparent homologs to a zinc finger-like protein and a transposon-like integrase/resolvase (tnp). Another ORF with highest similarity to the CRISPR-associated sequence cas4 (Haft et al. 2005) is more likely a separate member of the cas4 COG, presumably a recB-like exonuclease gene.
To correlate the level of sequence divergence with predicted gene function, SNP frequency was calculated and overlaid onto the 50 % assembly consensus sequence of the contig (Fig. 6). Overall distribution of SNPs in the contig was 0.705 per 10 bp. Replication-associated genes showed noticeably lower molecular diversity than the other ORFs. SNP distribution in the dna G, udg, pol B, and ERCC4 homologs was 0.565, 0.617, 0.569, and 0.548 per 10 bp, respectively, while the distribution in the Zn finger, cas4, and thy A homologs was 0.979, 1.31, and 0.728, respectively. Finer mapping of this diversity is being used to understand the functional differences in the enzymes encoded by the constituent clones of this contig.
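The figures quoted for this contig can be cross-checked with a little arithmetic. The sketch below uses only numbers given in the text and in Fig. 6. The mean read length is not reported, so the value computed here is merely what the stated coverage implies, and the grouping of genes follows the replication-associated versus other split described above.

```python
CONTIG_LENGTH_BP = 16_542   # contig length shown in Fig. 6
READ_COUNT = 187            # reads in the contig
MEAN_COVERAGE = 11          # average reads per nucleotide position


def implied_mean_read_length(contig_bp, reads, coverage):
    """Mean read length implied by coverage = reads * read_length / contig_bp."""
    return contig_bp * coverage / reads


# SNPs per 10 bp for each homolog, as reported in the text.
SNPS_PER_10BP = {
    "dna G": 0.565, "udg": 0.617, "pol B": 0.569, "ERCC4": 0.548,
    "Zn finger": 0.979, "cas4": 1.31, "thy A": 0.728,
}
REPLICATION_GENES = ("dna G", "udg", "pol B", "ERCC4")


def mean(values):
    values = list(values)
    return sum(values) / len(values)


replication_mean = mean(SNPS_PER_10BP[g] for g in REPLICATION_GENES)
other_mean = mean(v for g, v in SNPS_PER_10BP.items()
                  if g not in REPLICATION_GENES)

read_len = implied_mean_read_length(CONTIG_LENGTH_BP, READ_COUNT, MEAN_COVERAGE)
print(f"implied mean read length: {read_len:.0f} bp")
print(f"replication-associated genes: {replication_mean:.3f} SNPs per 10 bp")
print(f"other ORFs: {other_mean:.3f} SNPs per 10 bp")
print(f"ratio: {other_mean / replication_mean:.2f}")
```

On these numbers the non-replication ORFs carry roughly 1.7 times as many SNPs per 10 bp as the replication-associated genes.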
[Fig. 6 plot: contig of 16,542 bp, 187 reads, 87 % two reads per strand. Y-axis: SNPs per 10 bp (0 to 4). X-axis: position in bp (0 to 16,000). Predicted ORFs are drawn beneath the plot.]
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 6 Assembly of a 16.5 kb viral metagenome consensus contig from Octopus hot spring showing single nucleotide polymorphism heterogeneity. (a) A 16.5 kb contig was assembled at 50 % identity from the YNP Octopus hot spring library. Sequence coverage is shown on the top, with each line representing a separate read. Single nucleotide polymorphisms per 10 base pairs were normalized to the number of reads covering the respective nucleotide and are aligned with predicted open reading frames from the consensus sequence in the contig and the gene name of the strongest BLASTx similarity. Direction of transcription is shown by the arrows. Similarities to known genes were identified by BLASTp (Reprinted with permission (Schoenfeld et al. 2008))
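One way to read the normalization described in this caption is sketched below. The exact procedure is not given, so this is an assumption: at each consensus position the number of mismatching reads is divided by the number of reads covering that position, and these fractions are summed over consecutive 10 bp windows. Reads are represented as hypothetical (start, sequence) pairs already aligned to the consensus without gaps other than "-" placeholders.

```python
from typing import List, Tuple


def snp_density(consensus: str,
                reads: List[Tuple[int, str]],
                window: int = 10) -> List[float]:
    """Coverage-normalized mismatch total for each window of the consensus.

    Each read is (start_position, aligned_sequence); '-' marks a gap and is
    ignored. This normalization is an illustrative assumption, not the
    published method.
    """
    n = len(consensus)
    coverage = [0] * n
    mismatches = [0] * n
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < 0 or pos >= n or base == "-":
                continue
            coverage[pos] += 1
            if base != consensus[pos]:
                mismatches[pos] += 1

    densities = []
    for w_start in range(0, n, window):
        total = 0.0
        for pos in range(w_start, min(w_start + window, n)):
            if coverage[pos]:
                total += mismatches[pos] / coverage[pos]
        densities.append(total)
    return densities


# Toy example: a 20 bp consensus covered by four short reads.
consensus = "ACGTACGTACGTACGTACGT"
reads = [
    (0, "ACGTACGTAC"),
    (0, "ACGAACGTAC"),   # differs from the consensus at position 3
    (10, "GTACGTACGT"),
    (10, "GTACGTACGT"),
]
print(snp_density(consensus, reads))
```

Counting polymorphic positions per window without weighting by read fraction would be an equally plausible reading of the caption.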
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 7 Putative polyprotein from Great Boiling Spring viral metagenome. The PyroPhage 74-like pol genes are aligned to the consensus sequence (Panel A). All of the clones contain the C-terminal half of a 100 kD ORF, but vary in the amount of N-terminal sequence. Despite differences in the sizes of open reading frames of the inserts, all PyroPhage 74-like clones express a thermostable protein of about 55 kD (Panel B). The 347 clone, in contrast, produces a 35 kD thermostable protein
Molecular Biology of the PyroPhage 3173 Replicase Operon
Expression of PyroPhage 3173 Pol, described above, illustrates another challenge in metagenomic-based enzyme discovery. Since, as with all metagenomes, the intact virus has never been cultivated and the sequence data are fragmentary, the boundaries of the open reading frame of the pol gene were unclear. For production and study of the 3173 Pol, expression was initiated at an ATG codon that appeared to be the most probable start site based on alignment to bacterial pol genes. Despite the success in using this 55 kD expression product in RT PCR and other applications (see above), anomalies were apparent in the open reading frame that was used for expression of this enzyme. First, there was no obvious adjacent ribosome binding site or transcriptional promoter. Second, there was no homologous ATG codon in the related 488 and 967 clones (Fig. 1), despite overall alignment with the 3173 gene. Finally, an open reading frame extended upstream from the putative start codon to the insertion site of the viral sequence in the cloning vector.
Low identity assembly of the 3173 clone proved useful in dissecting the molecular biology of this gene and allowed production of the complete enzyme corresponding to the likely in vivo product. In contrast to the 4110-like and 74-like polymerase families, the 3173 clone was derived from a highly divergent, less abundant virus, since reads from this clone failed to assemble at 95 % identity with any other read in the library. Assembly at 75 % identity resulted in a 7,299 nt contig (Fig. 8a), composed of four reads. This assembly was confirmed by PCR amplification of nearly the entire contig from viral DNA isolated from the same hot spring 4 years later to produce a product of the predicted size (Fig. 8b). This amplification also suggests the 3173-encoding virus is more persistent in the environment than other viral families, none of which was detectable in the later samples. This contig encodes four open reading frames of greater than 100 nt. The largest of these encodes a protein of 1608 amino acids (170 kD), the carboxy terminal portion of which includes the 55 kD PyroPhage 3173 DNA polymerase. The amino terminal portion contains a coding sequence with only weak similarity to known genes. The other open reading frames encode putative helicases and a cas4/recB endonuclease protein.
The amplification product of the entire 1608 amino acid ORF expressed in E. coli produced an 80 kD protein (Fig. 8c) that co-purified with thermostable DNA polymerase activity. The simplest explanation is that the 1608 amino acid protein (expected MW of 170 kD) is processed in vivo or in vitro to generate the 80 kD product and that the original 55 kD PyroPhage 3173 Pol was a cloning anomaly. Supporting this
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 8 Analysis and PCR amplification of a 7.3 kb contig from 75 % NIAID assembly. A 7.3 kb contig was assembled from four clones in the hot springs viral metagenome. GeneMark identified four open reading frames of greater than 100 amino acids, the sizes of which (144, 229, 202, and 1,608 amino acids) are indicated (Panel A). These genes had BLASTx similarity to helicases, cas4 (recB), and DNA polymerases, with the indicated E values. Primers derived from the assembly are indicated by arrows and their positions on the contig are indicated by the associated numbers. These primers were used to amplify viral DNA isolated 4 years after the original collection (Panel B). An amplicon covering the 1608 amino acid ORF (Panel B, lane 2) was inserted into an expression system and used to produce an apparent truncation product of ~80 kD, indicated by the arrow (Panel C), that co-purified with the Pol activity
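The effect of the identity threshold on whether two reads are joined can be illustrated with a toy overlap check. This is a deliberately simplified sketch, assuming ungapped suffix-to-prefix overlaps of a fixed length; it is not the assembler used for this work, and the sequences and function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def can_join(left_read: str, right_read: str,
             overlap_len: int, min_identity: float) -> bool:
    """True if the end of left_read matches the start of right_read
    at or above min_identity (in percent) over overlap_len bases."""
    if overlap_len <= 0 or overlap_len > min(len(left_read), len(right_read)):
        raise ValueError("overlap length must fit within both reads")
    identity = percent_identity(left_read[-overlap_len:],
                                right_read[:overlap_len])
    return identity >= min_identity


# Two hypothetical reads whose 10 bp overlap is 80 % identical.
left = "TTTTTTTTTTACGTACGTAC"
right = "ACGAACGAACGGGGGGGGGG"

for threshold in (95, 75, 50):
    print(f"{threshold} % identity: join = {can_join(left, right, 10, threshold)}")
```

Reads from divergent members of a viral population behave like this pair: they are rejected at a 95 % threshold but merge into a composite contig once the threshold is relaxed.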
interpretation, amino acids 884 to 894 form the motif AYIYLGSIFVE, which was predicted by cleavage site analysis to be both labile to autolytic cleavage and accessible on the surface (Cosstick et al. 1984). Cleavage between G and S would result in a 704 amino acid (80 kD) protein. The amino terminal amino acids from the 80 kD protein align with the 5'–3' exonuclease domains of T. aquaticus and E. coli. The amino acids involved in nucleotide binding are conserved, but not the amino acids required for hydrolysis. Although the 55 kD protein has shown great utility, it is possible that addition of this 25 kD amino terminal sequence, or a portion thereof, would improve its function for certain applications. In addition to the 80 kD Pol protein, the other ORFs are being expressed to reconstitute the presumptive replicase holoenzyme.
This work highlights an important caveat of enzyme discovery by metagenomics. The fragmentary sequences can result in the recovery of partial genes. Assembly of sequences can be the only means of verifying ORFs. In this case, the partial gene proved highly useful, but in many cases, a functional protein could easily be missed by recovery of partial sequences.

Sequence Variants of PyroPhage 3173 DNA Polymerase Isolated from the Viral Metagenome
Metagenomics has proven quite useful for new enzyme discovery. The utility of viral metagenomes is greatly expanded when they are used to guide engineering. One approach to improving DNA polymerases is directed evolution (Ghadessy et al. 2001) based on random mutagenesis. While this approach is effective, the sheer number of mutants that must be screened to approach saturation is daunting. For an enzyme of the size of Taq Pol (832 amino acids), this would require 20^832 clones (on the order of 10^1082) to completely saturate the entire gene with mutagenized codons and test all the possible amino acids at each position. Even a fraction of this number overwhelms any current or conceivable screening capability. To limit the search, algorithms have been developed to target mutagenesis to specific domains (Voigt et al. 2001).
Metagenomic libraries are an alternative to random degenerate libraries as a source of molecular diversity. Since, in native populations, nature selects for active proteins, activities of variants in the libraries may differ, but they should all retain function. To study sequence variants, the 55 kD
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 9 Thermostability of PyroPhage 3173 Pol variants. The amplification product from Fig. 7b, lane 2, was cloned and expressed to produce thermostable protein. The clones grouped into at least four families that were 97 % identical to one another and 93 % identical to the original clone. The expressed Pol activity was purified and tested for thermostability by incubating for 10 min at the indicated temperature and assaying using the standard DNA polymerase assay (Panel A). Shown are amino acid alignments of a portion of the Q-helix from the prototype PyroPhage 3173 and the two least thermostable sequence variants (variants 1 and 11) (Panel B). These thermolabile variants had one or two unique amino acids, respectively, that mapped to this region
version of PyroPhage 3173 amplified from viral DNA collected at Octopus hot spring (Fig. 8b) was cloned in an expression vector. Eleven clones were used to express DNA polymerase activity and the inserts were sequenced. The variants were 93 % identical to the original 3173 isolate and at least 97 % identical to one another. When the polymerases were partially purified and tested, they had a significant range of thermostability (Fig. 9). The two most labile enzymes had only one or two unique nucleotide polymorphisms each. Two of these independent sequence polymorphisms map within four codons of each other. No three-dimensional structure is available for PyroPhage 3173 Pol, but, based on sequence alignment to Taq DNA polymerase and its known protein structure (Kim et al. 1995), the polymorphisms associated with reduced thermostability likely map to the same alpha helix (the Q-helix) within one of the "fingers" of the Pol structure. If so, the two affected amino acids are at the proper spacing to be adjacent on the alpha helix (four amino acids apart; with about 3.6 residues per helical turn, residues four positions apart lie on nearly the same face) and likely interact to stabilize or destabilize the alpha helix and thereby alter thermostability.
While a goal of screening hot spring viromes was to find the most thermostable enzymes possible, the lower thermostability variants have value. Isothermal amplification methods such as LAMP (Notomi et al. 2000) use intermediate temperatures (i.e., 50–70 °C) and do not require extreme thermostability. Less thermostable enzymes will likely have higher activity at these intermediate temperatures (Giver et al. 1998). Equally important, amino acids that reduce thermostability map to regions that can be targeted to increase thermostability (Bae and Phillips 2004) and are attractive targets for mutagenesis.

Prospects

The focus of these efforts has been discovering and improving thermostable DNA polymerases.
Metagenomics is playing a role in both the discovery and development phases of this project. Viral metagenomics has revealed new replicase operons, thermophilic polyproteins, and entirely new classes of Pols with novel and useful activities for a number of methods of DNA and RNA detection and analysis. In the near future, it may be possible to assemble complete genomes from uncultivated viruses from thermal environments and recover intact replicase operons using the appropriate combination of sequencing strategy, assembly paradigm, and genome walking techniques. The information encoded in the viral metagenomes is being used to direct an enzyme improvement program. Additional applications can likely be improved by the discovery of enzymes other than Pols. In many cases, viral metagenomes are excellent sources of diversity for these discovery programs and presumably any biochemical characteristic that can be measured can be further improved by application of the knowledge gained through metagenomics.

References

Andraos N, Tabor S, Richardson CC. The highly processive DNA polymerase of bacteriophage T5. Role of the unique N and C termini. J Biol Chem. 2004;279(48):50609–18.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F. The marine viromes of four oceanic regions. PLoS Biol. 2006;4(11):e368.
Babon JJ, McKenzie M, Cotton RG. The use of resolvases T4 endonuclease VII and T7 endonuclease I in mutation detection. Mol Biotechnol. 2003;23(1):73–81.
Bae E, Phillips Jr GN. Structures and analysis of highly homologous psychrophilic, mesophilic, and thermophilic adenylate kinases. J Biol Chem. 2004;279(27):28202–8.
Baklanov MM, Riazankin IA, Butorin AS, Nechaev Iu S, Iamkovoi VI. Purification and characteristics of an RNA-ligase preparation from bacteriophage T4. Prikl Biokhim Mikrobiol. 1984;20(2):191–9.
Barry ER, Bell SD. DNA replication in the archaea. Microbiol Mol Biol Rev. 2006;70(4):876–87.
Bebenek A, Dressman HK, Carver GT, Ng S, Petrov V, Yang G, Konigsberg WH, Karam JD, Drake JW. Interacting fidelity defects in the replicative DNA polymerase of bacteriophage RB69. J Biol Chem. 2001;276(13):10387–97.
Beechem JM, Otto MR, Bloom LB, Eritja R, Reha-Krantz LJ, Goodman MF. Exonuclease-polymerase active site partitioning of primer-template DNA strands and equilibrium Mg2+ binding properties of bacteriophage T4 DNA polymerase. Biochemistry. 1998;37(28):10144–55.
Beese LS, Steitz TA. Structural basis for the 3'-5' exonuclease activity of Escherichia coli DNA polymerase I: a two metal ion mechanism. EMBO J. 1991;10(1):25–33.
Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, Wang K, Wommack KE. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73(23):7629–41.
Bertram JG, Bloom LB, Turner J, O'Donnell M, Beechem JM, Goodman MF. Pre-steady state analysis of the assembly of wild type and mutant circular clamps of Escherichia coli DNA polymerase III onto DNA. J Biol Chem. 1998;273(38):24564–74.
Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 2005;33:W451–4. Web Server issue.
Bhui-Kaur A, Goodman MF, Tower J. DNA mismatch repair catalyzed by extracts of mitotic, postmitotic, and senescent Drosophila tissues and involvement of mei-9 gene function for full activity. Mol Cell Biol. 1998;18(3):1436–43.
Blanco L, Bernad A, Lazaro JM, Martin G, Garmendia C, Salas M. Highly efficient DNA synthesis by the phage phi 29 DNA polymerase. Symmetrical mode of DNA replication. J Biol Chem. 1989;264(15):8935–40.
Blanco L, Lazaro JM, de Vega M, Bonnin A, Salas M. Terminal protein-primed DNA amplification. Proc Natl Acad Sci U S A. 1994;91(25):12198–202.
Blondal T, Hjorleifsdottir SH, Fridjonsson OF, Aevarsson A, Skirnisdottir S, Hermannsdottir AG, Hreggvidsson GO, Smith AV, Kristjansson JK. Discovery and characterization of a thermostable bacteriophage RNA ligase homologous to T4 RNA ligase 1. Nucleic Acids Res. 2003;31(24):7247–54.
Blumenthal T. Interaction of host-coded and virus-coded polypeptides in RNA phage replication. Proc R Soc Lond B Biol Sci. 1980;210(1180):321–35.
Blumenthal T, Hill D. Roles of the host polypeptides in Q beta RNA replication. Host factor and ribosomal protein S1 allow initiation at reduced GTP concentration. J Biol Chem. 1980;255(24):11713–6.
Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A. 2002;99(22):14250–5.
Breitbart M, Wegley L, Leeds S, Schoenfeld T, Rohwer F. Phage community dynamics in hot springs. Appl Environ Microbiol. 2004;70(3):1633–40.
Cheng Q, Nelson D, Zhu S, Fischetti VA. Removal of group B streptococci colonizing the vagina and oropharynx of mice with a bacteriophage lytic enzyme. Antimicrob Agents Chemother. 2005;49(1):111–7.
Cherepanov AV, de Vries S. Binding of nucleotides by T4 DNA ligase and T4 RNA ligase: optical absorbance and fluorescence studies. Biophys J. 2001;81(6):3545–59.
Clepet C, Le Clainche I, Caboche M. Improved full-length cDNA production based on RNA tagging by T4 DNA ligase. Nucleic Acids Res. 2004;32(1):e6.
Compton J. Nucleic acid sequence-based amplification. Nature. 1991;350(6313):91–2.
Cosstick R, McLaughlin LW, Eckstein F. Fluorescent labelling of tRNA and oligodeoxynucleotides using T4 RNA ligase. Nucleic Acids Res. 1984;12(4):1791–810.
Damasko C, Konietzny A, Kaspar H, Appel B, Dersch P, Strauch E. Studies of the efficacy of Enterocoliticin, a phage-tail like bacteriocin, as antimicrobial agent against Yersinia enterocolitica serotype O3 in a cell culture system and in mice. J Vet Med B Infect Dis Vet Public Health. 2005;52(4):171–9.
Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 2001;11(6):1095–9.
Detter JC, Jett JM, Lucas SM, Dalin E, Arellano AR, Wang M, Nelson JR, Chapman J, Lou Y, Rokhsar D, Hawkins TL, Richardson PM. Isothermal strand-displacement amplification applications for high-throughput genomics. Genomics. 2002;80(6):691–8.
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, Gonzalez-Garay ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, Dunford-Shore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–75.
Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Dionne I, Bell SD. Characterization of an archaeal family 4 uracil DNA glycosylase and its interaction with PCNA and chromatin proteins. Biochem J. 2005;387(Pt 3):859–63.
Downie AB, Dirk LM, Xu Q, Drake J, Zhang D, Dutt M, Butterfield A, Geneve RR, Corum 3rd JW, Lindstrom KG, Snyder JC. A physical, enzymatic, and genetic characterization of perturbations in the seeds of the brownseed tomato mutants. J Exp Bot. 2004;55(399):961–73.
El Omari K, Ren J, Bird LE, Bona MK, Klarmann G, LeGrice SF, Stammers DK. Molecular architecture and ligand recognition determinants for T4 RNA ligase. J Biol Chem. 2006;281(3):1573–9.
Evans GF, Snyder YM, Butler LD, Zuckerman SH. Differential expression of interleukin-1 and tumor necrosis factor in murine septic shock models. Circ Shock. 1989;29(4):279–90.
Ghadessy FJ, Ong JL, Holliger P. Directed evolution of polymerase function by compartmentalized self-replication. Proc Natl Acad Sci U S A. 2001;98(8):4552–7.
Giver L, Gershenson A, Freskgard PO, Arnold FH. Directed evolution of a thermostable esterase. Proc Natl Acad Sci U S A. 1998;95(22):12809–13.
Goodman MF. Purposeful mutations. Nature. 1998;395(6699):221–3.
Goodman MF, Fygenson KD. DNA polymerase fidelity: from genetics toward a biochemical understanding. Genetics. 1998;148(4):1475–82.
Griffiths E, Gupta RS. Signature sequences in diverse proteins provide evidence for the late divergence of the Order Aquificales. Int Microbiol. 2004;7(1):41–52.
Guatelli JC, Whitfield KM, Kwoh DY, Barringer KJ, Richman DD, Gingeras TR. Isothermal, in vitro amplification of nucleic acids by a multienzyme reaction modeled after retroviral replication. Proc Natl Acad Sci U S A. 1990;87(19):7797.
Haft DH, Selengut J, Mongodin EF, Nelson KE. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol. 2005;1(6):e60.
Harnett SP, Lowe G, Tansley G. A stereochemical study of the mechanism of activation of donor oligonucleotides by RNA ligase from bacteriophage T4 infected Escherichia coli. Biochemistry. 1985;24(25):7446–9.
Harrison B, Zimmerman SB. Polymer-stimulated ligation: enhanced ligation of oligo- and polynucleotides by T4 RNA ligase in polymer solutions. Nucleic Acids Res. 1984;12(21):8235–51.
Heaphy S, Singh M, Gait MJ. Effect of single amino acid changes in the region of the adenylylation site of T4 RNA ligase. Biochemistry. 1987;26(6):1688–96.
Heckler TG, Chang LH, Zama Y, Naka T, Chorghade MS, Hecht SM. T4 RNA ligase mediated preparation of
novel “chemically misacylated” tRNAPheS. Biochemistry. 1984;23(7):1468–73.
Ho CK, Shuman S. Bacteriophage T4 RNA ligase 2 (gp24.1) exemplifies a family of RNA ligases found in all phylogenetic domains. Proc Natl Acad Sci U S A. 2002;99(20):12709–14.
Kahler M, Antranikian G. Cloning and characterization of a family B DNA polymerase from the hyperthermophilic crenarchaeon Pyrobaculum islandicum. J Bacteriol. 2000;182(3):655–63.
Karam JD, Konigsberg WH. DNA polymerase of the T4-related bacteriophages. Prog Nucleic Acid Res Mol Biol. 2000;64:65–96.
Kerr C, Sadowski PD. Gene 6 exonuclease of bacteriophage T7. I. Purification and properties of the enzyme. J Biol Chem. 1972;247(1):305–10.
Kim Y, Eom SH, Wang J, Lee DS, Suh SW, Steitz TA. Crystal structure of Thermus aquaticus DNA polymerase. Nature. 1995;376(6541):612–6.
Koonin EV. Temporal order of evolution of DNA replication systems inferred by comparison of cellular and viral DNA polymerases. Biol Direct. 2006;1:39.
Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456(7218):66–72.
Little JW. Lambda exonuclease. Gene Amplif Anal. 1981;2:135–45.
Loeffler JM, Djurkovic S, Fischetti VA. Phage lytic enzyme Cpl-1 as a novel antimicrobial for pneumococcal bacteremia. Infect Immun. 2003;71(11):6199–204.
Lopatto D, Alvarez C, Barnard D, Chandrasekaran C, Chung HM, Du C, Eckdahl T, Goodman AL, Hauser C, Jones CJ, Kopp OR, Kuleck GA, McNeil G, Morris R, Myka JL, Nagengast A, Overvoorde PJ, Poet JL, Reed K, Regisford G, Revie D, Rosenwald A, Saville K, Shaw M, Skuse GR, Smith C, Smith M, Spratt M, Stamm J, Thompson JS, Wilson BA, Witkowski C, Youngblom J, Leung W, Shaffer CD, Buhler J, Mardis E, Elgin SC. Undergraduate research. Genomics education partnership. Science. 2008;322(5902):684–5.
Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008b;9:387–402.
Marks JL, Gong Y, Chitale D, Golas B, McLellan MD, Kasai Y, Ding L, Mardis ER, Wilson RK, Solit D, Levine R, Michel K, Thomas RK, Rusch VW, Ladanyi M, Pao W. Novel MEK1 mutation identified by mutational analysis of epidermal growth factor receptor signaling pathway genes in lung adenocarcinoma. Cancer Res. 2008;68(14):5524–8.
McDaniel L, Breitbart M, Mobberley J, Long A, Haynes M, Rohwer F, Paul JH. Metagenomic analysis of lysogeny in Tampa Bay: implications for prophage gene expression. PLoS One. 2008;3(9):e3263.
McLaughlin LW, Piel N, Graeser E. Donor activation in the T4 RNA ligase reaction. Biochemistry. 1985;24(2):267–73.
Merkens LS, Bryan SK, Moses RE. Inactivation of the 5'-3' exonuclease of Thermus aquaticus DNA polymerase. Biochim Biophys Acta. 1995;1264(2):243–8.
Middleton T, Herlihy WC, Schimmel PR, Munro HN. Synthesis and purification of oligoribonucleotides using T4 RNA ligase and reverse-phase chromatography. Anal Biochem. 1985;144(1):110–7.
Morin RD, Aksay G, Dolgosheina E, Ebhardt HA, Magrini V, Mardis ER, Sahinalp SC, Unrau PJ. Comparative analysis of the small RNA transcriptomes of Pinus contorta and Oryza sativa. Genome Res. 2008;18(4):571–84.
Moser MJ, Difrancesco RA, Gowda K, Klingele AJ, Sugar DR, Stocki S, Mead DA, Schoenfeld TW. Thermostable DNA polymerase from a viral metagenome is a potent rt-PCR enzyme. PLoS One. 2012;7(6):e38371.
Nandakumar J, Shuman S. Dual mechanisms whereby a broken RNA end assists the catalysis of its repair by T4 RNA ligase 2. J Biol Chem. 2005;280(25):23484–9.
Nandakumar J, Ho CK, Lima CD, Shuman S. RNA substrate specificity and structure-guided mutational analysis of bacteriophage T4 RNA ligase 2. J Biol Chem. 2004;279(30):31337–47.
Nelson D, Loomis L, Fischetti VA. Prevention and elimination of upper respiratory colonization of mice by group A streptococci by using a bacteriophage lytic enzyme. Proc Natl Acad Sci U S A. 2001;98(7):4107–12.
Notomi T, Okayama H, Masubuchi H, Yonekawa T, Watanabe K, Amino N, Hase T. Loop-mediated isothermal amplification of DNA. Nucleic Acids Res. 2000;28(12):E63.
Otto MR, Bloom LB, Goodman MF, Beechem JM.
Lundberg KS, Shoemaker DD, Adams MW, Short JM, Stopped-flow fluorescence study of precatalytic primer
Sorge JA, Mathur EJ. High-fidelity amplification strand base-unstacking transitions in the exonuclease
using a thermostable DNA polymerase isolated from cleft of bacteriophage T4 DNA polymerase. Biochem-
Pyrococcus furiosus. Gene. 1991;108(1):1–6. istry. 1998;37(28):10156–63.
Mardis ER. The impact of next-generation sequencing Paulsen H, Wintermeyer W. Incorporation of 1,
technology on genetics. Trends Genet. 2008a;24(3): N6-ethenoadenosine into the 30 terminus of tRNA
133–41. using T4 RNA ligase. 2. Preparation and ribosome
Functional Viral Metagenomics and the Development of New Enzymes 217 F
interaction of fluorescent Escherichia coli tRNAMetf. Snyder YM, Guthrie L, Evans GF, Zuckerman
Eur J Biochem. 1984;138(1):125–30. SH. Transcriptional inhibition of endotoxin-induced
Pavlov AR, Karam JD. Nucleotide-sequence-specific and monokine synthesis following heat shock in murine
non-specific interactions of T4 DNA polymerase with peritoneal macrophages. J Leukoc Biol. 1992;51(2):
its own mRNA. Nucleic Acids Res. 2000;28(23): 181–7.
4657–64. Srinivasiah S, Bhavsar J, Thapar K, Liles M, Schoenfeld
Perez LE, Merrill GA, Delorenzo RA, Schoenfeld TW, T, Wommack KE. Phages across the biosphere: con-
Vats A, Moser MJ. Evaluation of the specificity trasts of viruses in soil and aquatic environments. Res
and sensitivity of a potential rapid influenza screening Microbiol. 2008 Jun;159(5):349–57.
system. Diagn Microbiol Infect Dis. 2012;75(1): Staley JT, Konopka A. Measurement of in situ activities of
77–80. nonphotosynthetic microorganisms in aquatic and
Petric A, Bhat B, Leonard NJ, Gumport RI. Ligation with terrestrial habitats. Annu Rev Microbiol. 1985;39:
T4 RNA ligase of an oligodeoxyribonucleotide to 321–46.
covalently-linked cross-sectional base-pair analogues Strauch E, Kaspar H, Schaudinn C, Damasko C,
of short, normal, and long dimensions. Nucleic Acids Konietzny A, Dersch P, Skurnik M, Appel B. Analysis F
Res. 1991;19(3):585–90. of enterocoliticin, a phage tail-like bacteriocin. Adv
Petruska J, Hartenstine MJ, Goodman MF. Analysis of Exp Med Biol. 2003;529:249–51.
strand slippage in DNA polymerase expansions of Tabor S, Richardson CC. DNA sequence analysis
CAG/CTG triplet repeats associated with neurodegen- with a modified bacteriophage T7 DNA
erative disease. J Biol Chem. 1998;273(9):5204–10. polymerase. Proc Natl Acad Sci U S A. 1987;84(14):
Pulsinelli GA, Temin HM. Characterization of large dele- 4767–71.
tions occurring during a single round of retrovirus Tabor S, Richardson CC. A single residue in DNA poly-
vector replication: novel deletion mechanism involv- merases of the Escherichia coli DNA polymerase
ing errors in strand transfer. J Virol. 1991;65(9): I family is critical for distinguishing between deoxy-
4786–97. and dideoxyribonucleotides. Proc Natl Acad Sci U S A.
Rand KN, Gait MJ. Sequence and cloning of bacterio- 1995;92(14):6339–43.
phage T4 gene 63 encoding RNA ligase and tail fibre Tabor S, Huber HE, Richardson CC. Escherichia coli
attachment activities. EMBO J. 1984;3(2):397–402. thioredoxin confers processivity on the DNA polymer-
Reha-Krantz LJ, Marquez LA, Elisseeva E, Baker RP, ase activity of the gene 5 protein of bacteriophage T7.
Bloom LB, Dunford HB, Goodman MF. The proof- J Biol Chem. 1987;262(33):16212–23.
reading pathway of bacteriophage T4 DNA polymer- Tang M, Bruck I, Eritja R, Turner J, Frank EG,
ase. J Biol Chem. 1998;273(36):22969–76. Woodgate R, O’Donnell M, Goodman MF. Biochem-
Rehrauer WM, Bruck I, Woodgate R, Goodman MF, ical basis of SOS-induced mutagenesis in Escherichia
Kowalczykowski SC. Modulation of RecA nucleopro- coli: reconstitution of in vitro lesion bypass
tein function by the mutagenic UmuD’C protein com- dependent on the UmuD’2C mutagenic complex and
plex. J Biol Chem. 1998;273(49):32384–7. RecA protein. Proc Natl Acad Sci U S A. 1998;95(17):
Roberts JA, Bell SD, White MF. An archaeal XPF repair 9755–60.
endonuclease dependent on a heterotrimeric PCNA. Tang H, Yang X, Wang K, Tan W, Li H, He L, Liu B.
Mol Microbiol. 2003;48(2):361–71. RNA-templated single-base mutation detection based
Schmidt CJ, Romanov M, Ryder O, Magrini V, on T4 DNA ligase and reverse molecular beacon.
Hickenbotham M, Glasscock J, McGrath S, Talanta. 2008;75(5):1388–93.
Mardis E, Stein LD. Gallus GBrowse: a unified geno- Truncaite L, Zajanckauskaite A, Arlauskas A, Nivinskas
mic database for the chicken. Nucleic Acids Res. R. Transcription and RNA processing during expres-
2008;36(Database issue):D719–23. sion of genes preceding DNA ligase gene 30 in
Schoenfeld T, Patterson M, Richardson PM, Wommack T4-related bacteriophages. Virology. 2006;344(2):
KE, Young M, Mead D. Assembly of viral 378–90.
metagenomes from Yellowstone hot springs. Appl van Dijk AA, Makeyev EV, Bamford DH. Initiation of
Environ Microbiol. 2008;74(13):4164–74. viral RNA-dependent RNA polymerization. J Gen
Shah JS, Liu J, Buxton D, Hendricks A, Robinson L, Virol. 2004;85(Pt 5):1077–93.
Radcliffe G, King W, Lane D, Olive DM, Klinger Voigt CA, Mayo SL, Arnold FH, Wang ZG. Computa-
JD. Q-beta replicase-amplified assay for detection of tionally focusing the directed evolution of proteins.
Mycobacterium tuberculosis directly from clinical J Cell Biochem Suppl. 2001;37:58–63.
specimens. J Clin Microbiol. 1995;33(6):1435–41. Vratskikh LV, Komarova NI, Yamkovoy VI. Solid-phase
Sharp RL, May PC, Mayne NG, Snyder YM, Burnett synthesis of oligoribonucleotides using T4 RNA ligase
JP. Cyclothiazide potentiates agonist responses at and T4 polynucleotide kinase. Biochimie. 1995;77(4):
human AMPA/kainate receptors expressed in oocytes. 227–32.
Eur J Pharmacol. 1994;266(1):R1–2. Wang Y, Silverman SK. Efficient RNA 50 -adenylation by
Shendure J, Ji H. Next-generation DNA sequencing. Nat T4 DNA ligase to facilitate practical applications.
Biotechnol. 2008;26(10):1135–45. RNA. 2006;12(6):1142–6.
F 218 Functional Viral Metagenomics and the Development of New Enzymes
Wang LK, Schwer B, Shuman S. Structure-guided muta- Xiang Z, Zhao Y, Mitaksov V, Fremont DH, Kasai Y,
tional analysis of T4 RNA ligase 1. RNA. Molitoris A, Ries RE, Miner TL, McLellan MD,
2006;12(12):2126–34. DiPersio JF, Link DC, Payton JE, Graubert TA,
Wang LK, Nandakumar J, Schwer B, Shuman S. The Watson M, Shannon W, Heath SE, Nagarajan R,
C-terminal domain of T4 RNA ligase 1 confers Mardis ER, Wilson RK, Ley TJ, Tomasson
specificity for tRNA repair. RNA. 2007;13(8): MH. Identification of somatic JAK1 mutations in
1235–44. patients with acute myeloid leukemia. Blood.
Wang X, Sun Q, McGrath SD, Mardis ER, Soloway PD, 2008;111(9):4809–12.
Clark AG. Transcriptome-wide identification of novel Yin S, Ho CK, Shuman S. Structure-function analysis of
imprinted genes in neonatal mouse brain. PLoS One. T4 RNA ligase 2. J Biol Chem. 2003;278(20):
2008;3(12):e3839. 17601–8.
Wommack KE, Bhavsar J, Ravel J. Metagenomics: read Yin S, Kiong Ho C, Miller ES, Shuman S. Characteriza-
length matters. Appl Environ Microbiol. 2008 tion of bacteriophage KVP40 and T4 RNA ligase 2.
Mar;74(5):1453–63. Virology. 2004;319(1):141–51.
[Figure 1: circular BLAST Atlas of P. ubique HTCC1062 (1,308,759 bp), with position ticks every 250 kb; labeled features include dnaA, SAR11_0932, and the "Surface" region.]

Genome Atlases, Potential Applications in Study of Metagenomes, Fig. 1 A BLAST Atlas representing the comparison of the marine bacterium Pelagibacter ubique to the other four Pelagibacter genomes and seven metagenome samples. The six innermost lanes show the DNA properties of the reference genome P. ubique HTCC1062, followed by the genome’s annotation lane. Then come the BLAST lanes, where the BLAST result for the query genome against the reference is shown. The BLAST hit significance is indicated by the color intensity, where higher intensity corresponds to a more significant hit
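The DNA-property lanes of an atlas such as Fig. 1 summarize base composition in fixed windows along the chromosome (the atlas described in this entry averages AT percentage and GC skew over 10,000 bp). A minimal Python sketch of such a calculation is given here. The function name `window_stats`, the non-overlapping window layout, and the GC-skew formula (G − C)/(G + C) are illustrative assumptions, not the implementation used by the BLAST Atlas software.

```python
from typing import List, Tuple


def window_stats(seq: str, window: int = 10_000) -> List[Tuple[int, float, float]]:
    """Return (start, AT percent, GC skew) for consecutive non-overlapping windows.

    Hypothetical helper for illustration only.
    GC skew is taken here as (G - C) / (G + C); windows without G or C get 0.0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    stats = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        a, t = chunk.count("A"), chunk.count("T")
        g, c = chunk.count("G"), chunk.count("C")
        called = a + t + g + c  # ignore N and other ambiguity codes
        at_percent = 100.0 * (a + t) / called if called else 0.0
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        stats.append((start, at_percent, gc_skew))
    return stats


if __name__ == "__main__":
    toy = "AATTGGGC" + "ATATATAT"  # toy sequence: two windows of 8 bp
    for start, at, skew in window_stats(toy, window=8):
        print(f"{start}\t{at:.1f}\t{skew:+.2f}")
```

The toy call uses an 8 bp window only to keep the example short; for a real chromosome the default of 10,000 bp matches the window size quoted for the atlas lanes.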
genome of 1.3 Mbp, first isolated from the Sargasso Sea (Giovannoni et al. 1990), and requires added reduced sulfur for growth (Tripp et al. 2008). The genome comparisons in this study include other Pelagibacter species and Pelagibacterium halotolerans B2 (Huo et al. 2012). Note the darker green colors for the P. ubique lane and for other closely related Pelagibacter species. However, apart from the reference strain, some regions of missing genes (gaps) can be seen.

The metagenome projects chosen are Moore Marine Microbial Sequencing (Sun et al. 2011), Global Ocean Sampling (GOS) (Yooseph et al. 2007), Whale Fall (Tringe et al. 2005), Acid Mine Drainage (Tyson et al. 2004), Microbial Community Genomics at the HOT/ALOHA (DeLong et al. 2006), Waseca County Farm Soil (Tringe et al. 2005), and Lake Washington (Kalyuzhnaya et al. 2008). In all comparisons, the P. ubique proteins were compared against the metagenomes using the BLAST tool of the database itself with default parameters, and the results were then visualized with BLAST Atlas. The Moore Marine Microbes, GOS, and HOT/ALOHA samples have protein annotations; therefore, a BLASTP search was used. The other metagenomes are assembled but not annotated, so a TBLASTN comparison was made. Metagenomes that are not assembled were not used in this study, because protein comparison against metagenome reads was not very reliable.

In the BLAST Atlas, the six innermost lanes show some of the DNA properties (Jensen et al. 1999; Pedersen et al. 2000) of the reference chromosome, P. ubique HTCC1062; these are, from innermost to outermost: the AT percentage (averaged over 10,000 bp), GC Skew (averaged over 10,000 bp), Global Direct Repeats, Nucleosome Position Preference (green regions represent chromatin-free areas; Satchwell et al. 1986; Baldi et al. 1996), DNA helix stacking energy (on this scale, red regions will melt more readily, and green regions are more stable; Ornstein et al. 1978), and intrinsic curvature (blue means highly curved areas, and yellow indicates low levels of curvature; Bolshoy et al. 1991; Shpigelman et al. 1993). The next outer lane holds the annotations: coding sequences on the plus and minus strands. After the annotations the BLAST lanes start, which show the BLAST hits at each position. The color intensity indicates how good a BLAST hit is, with darker colors representing regions of conserved proteins and grey areas indicating poor or no matches. The first BLAST lane is P. ubique itself as a control. The next few lanes are other Pelagibacter sp., and they show high resemblance to the reference P. ubique. The 5th lane is a Pelagibacterium, which should not be confused with Pelagibacter because it is classified in a completely different clade of Alphaproteobacteria, as can be seen from the low protein similarity. However, its BLAST hit profile still resembles that of the other Pelagibacter sp.

According to this figure, almost all the coding genes of P. ubique are found in the CAMERA Marine Microbes samples, and most are also found in the GOS data, which means that the bacterium is present in these environments, as expected. One of the gap regions, around 510–564 kb, contains genes related to amino sugar metabolism (rfaD, rfaE), the pentose phosphate pathway (tktC), lipopolysaccharide synthesis (gmhA and gmhB), streptomycin biosynthesis (rpbB), and transferase activity (spsA, rfaG, rfaK). This gap region and a few bases downstream are marked as “surface” because this area contains proteins related to surface features (ompS, LPS biosynthesis, etc.). Another gap includes a “giant protein” (Strom et al. 2012), annotated as “hypothetical protein SAR11_0932,” which is 7,317 amino acid residues long. This protein appears to be partially found in the Marine Microbes and GOS metagenomes (dark blue lane) because of its many repeat regions, which might resemble regions in the proteins of the other genomes. The whole protein itself is not found, however, because it varies even within the same species; these “giant proteins” are known to be variable and are thought to be involved in protection against viral attacks, as well as predation by protists (Strom et al. 2012). Some of the other gaps are due to tRNAs or rRNAs, because the BLAST lanes only compare protein sequences. In the other metagenome BLAST lanes, the BLAST hits are very weak, meaning that the P. ubique genes compared here are not present in those metagenome samples.

In summary, BLAST Atlas is a way to visualize the mapping of bacterial genomes against metagenomes, and this can be used to compare many different environments. If a certain protein, a set of proteins, or a genomic region is being investigated, this tool helps establish the presence or absence of those proteins. It is also possible to zoom in to desired ranges of the genome to see local differences (Hallin et al. 2008).

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Baldi P, Brunak S, Chauvin Y, Krogh A. Naturally occurring nucleosome positioning signals in human exons and introns. J Mol Biol. 1996;263(4):503–10.
Bolshoy A, McNamara P, Harrington RE, Trifonov EN. Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles. Proc Natl Acad Sci USA. 1991;88:2312–6.
Brown MV, Lauro FM, DeMaere MZ, et al. Global biogeography of SAR11 marine bacteria. Mol Syst Biol. 2012;8:595.
DeLong EF, Preston CM, Mincer T, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
García-Martínez J, Rodríguez-Valera F. Microdiversity of uncultured marine prokaryotes: the SAR11 cluster and the marine Archaea of group I. Mol Ecol. 2000;9(7):935–48.
Giovannoni SJ, Britschgi TB, Moyer CL, Field KG. Genetic diversity in Sargasso Sea bacterioplankton. Nature. 1990;345(6270):60–3.
Giovannoni SJ, Tripp HJ, Givan S, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309(5738):1242–5.
Hallin PF, Binnewies TT, Ussery DW. The genome BLAST atlas – a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst. 2008;4(5):363–71.
Huo Y-Y, Cheng H, Han X-F, et al. Complete genome sequence of Pelagibacterium halotolerans B2(T). J Bacteriol. 2012;194(1):197–8.
Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol. 1999;150(9–10):773–7.
Kalyuzhnaya MG, Lapidus A, Ivanova N, et al. High-resolution metagenomics targets specific functional types in complex microbial communities. Nat Biotechnol. 2008;26(9):1029–34.
Markowitz VM, Chen I-MA, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
Ornstein RL, Rein R, Breen DL, MacElroy R. An optimised potential function for the calculation of nucleic acid interaction energies. I. Base stacking. Biopolymers. 1978;17:2341–60.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol. 2000;299(4):907–30.
Satchwell SC, Drew HR, Travers AA. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol. 1986;191(4):659–75.
Shpigelman ES, Trifonov EN, Bolshoy A. Curvature: software for the analysis of curved DNA. Comput Appl Biosci. 1993;9:435–40.
Strom SL, Brahamsha B, Fredrickson KA, Apple JK, Rodríguez AG. A giant cell surface protein in Synechococcus WH8102 inhibits feeding by a dinoflagellate predator. Environ Microbiol. 2012;14(3):807–16.
Sun S, Chen J, Li W, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51.
Tringe SG, von Mering C, Kobayashi A, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Tripp HJ, Kitner JB, Schwalbach MS, et al. SAR11 marine bacteria require exogenous reduced sulphur for growth. Nature. 2008;452(7188):741–4.
Tyson GW, Chapman J, Hugenholtz P, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
Yooseph S, Sutton G, Rusch DB, et al. The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biol. 2007;5(3):e16.

Genome Portal, Joint Genome Institute

Igor V. Grigoriev, Susannah Tringe and Inna Dubchak
US Department of Energy Joint Genome Institute, Walnut Creek, CA, USA

Synonyms

Comparative genomics; Data integration; Genome analysis; Genome projects; Metagenomics

Definition

The US Department of Energy (DOE) Joint Genome Institute (JGI) is a national user facility with massive-scale DNA sequencing and analysis capabilities dedicated to advancing genomics for bioenergy and environmental applications. The JGI Genome Portal is an integrated genomic resource that gives the research community around the world access to the large collection of genomic data for plants, fungi, microbes, and metagenomes and to web-based interactive tools for their analysis.

Introduction

The Department of Energy (DOE) Joint Genome Institute (JGI) was established for the Human
Genome Project (Lander et al. 2001) and later was transformed into a national user facility for genome research in the DOE mission areas of bioenergy, carbon cycling, and biogeochemistry. JGI provides expertise and resources in DNA sequencing, technology development, and bioinformatics to the broader scientific community. Scientists around the world can make proposals to the JGI Community Sequencing Program (CSP; e.g., Martin et al. 2011) to sequence genomes, transcriptomes, and metagenomes and address important scientific questions of DOE mission relevance. Massive amounts of genomic data are assembled, annotated, and delivered to users by means of integrated databases and interactive analytical tools interconnected within the JGI Genome Portal (http://genome.jgi.doe.gov; Grigoriev et al. 2012).

Leading the world in the number of sequenced plants, fungi, microbes, and metagenomes (according to the Genomes Online Database (GOLD; Pagani et al. 2012)), JGI has dramatically increased its sequencing capabilities using new sequencing technologies. JGI projects evolved from sequencing three of the human chromosomes (Lander et al. 2001) to the large-scale “Grand Challenge” projects such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA; Wu et al. 2009), the 1,000 Fungal Genome Project (Grigoriev et al. 2011), and the metagenomic projects targeting soil and rhizosphere. Because tracking individual organisms and samples at such a scale becomes critical, genomes and metagenomes sequenced or selected for sequencing are carefully catalogued and made available to the public along with their status and links to the produced data and available tools.

The sequenced data are assembled, annotated, and analyzed using various computational pipelines developed for each of the products delivered by JGI to its users. The resulting annotations are available for download and can also be viewed interactively using the JGI Genome Portal, which offers a wide array of databases and analytical systems to interpret the data. Some systems work across multiple JGI databases, while others allow users to specifically manage datasets on plants (Phytozome; Goodstein et al. 2012), fungi (MycoCosm; Grigoriev et al. 2012), microbes (Integrated Microbial Genomes or IMG; Markowitz et al. 2012b), and metagenomes (IMG/M; Markowitz et al. 2012a).

The JGI Genome Portal provides a unified access point to all JGI genomic databases and analytical tools, as well as worldwide statistics on the usage of the JGI resources and information about the latest genome releases and new tool development. A user can find all DOE JGI sequencing projects and their status, search for and download raw data, assemblies, and annotations of sequenced genomes and metagenomes, and interactively explore those datasets and compare them with other sequenced microbes, fungi, plants, or metagenomes using specialized systems tailored to each particular class of organisms. All these can serve as building blocks in comprehensive analyses of individual organisms or systems of interacting organisms.

A Catalogue of Genome Sequencing Projects

Metagenomic analysis requires reference genomes for better interpretation of sequence data derived from complex microbial communities. The democratization of sequencing allows many scientists to sequence appropriate genome references in their own labs prior to approaching metagenomes. Consolidation of genomic data sequenced in different places around the world is an important step in both genomics and metagenomics.

JGI’s collection of genomic projects includes thousands of projects of different types and is publicly available and searchable. Product types include standard or improved genome drafts, finished genomes, gene expression profiling, resequencing, metagenome projects, and others. The Project List (http://genome.jgi.doe.gov/genome-projects) is available from most of the Portal pages as a menu item and includes a detailed description of each project including its scope and current status, taxon, the JGI program, and the project lead. The Resources
column lists tools available for this project. Some of these tools, e.g., download, are available for all genomes, while others are taxon, project type, or stage dependent. For example, a plant or fungal genome will be linked to Phytozome or MycoCosm, respectively.

All JGI projects are also registered in the GOLD database, which includes a larger collection of projects sequenced around the world (Pagani et al. 2012). Currently it contains a list of about 16,000 genomes including over 3,000 that are complete and over 2,000 metagenomes. Besides utility for metagenomics, having a comprehensive list of sequencing projects from all laboratories around the world also helps to avoid redundancy when sequencing targets are selected for the large-scale projects like GEBA or 1,000 Fungal Genomes.

Organism home pages. Clicking on a branch name produces a menu displaying available genomes in this kingdom, phylum, class, or order (Fig. 1), each connected to pages in different analytical resources. The same pages can be reached in a step-by-step genome selection from a hierarchical selection menu on the top of the page or searching for genomes by keyword (e.g., plants, Eukaryota), name, taxonID, or projectID. Each of the genomic datasets can be analyzed with a collection of tools linked directly to their genome databases. Each organism’s home page contains a description of the project, BLAST, download, and links to specialized resources as described in the next section.

Comparative Databases and Tools

Genome Portal, Joint Genome Institute, Fig. 1 The JGI Genome Portal. A pull-down menu for the “Marine” category of Metagenomes is shown. BLAST and Download functions are available for the entire selected group. Each genome is linked to the associated resources. “Project list” on the top leads users to the list of all sequencing projects at the DOE JGI. The bottom portion of the page connects to the specialized databases in microbes (IMG) and metagenomes (IMG/M), fungi (MycoCosm), and plants (Phytozome)
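The taxon-dependent linking described in this entry (plant genomes to Phytozome, fungal genomes to MycoCosm, microbes to IMG, metagenomes to IMG/M) and the catalogue search by keyword, name, taxonID, or projectID can be sketched as a small lookup. The `Project` record, its field names, the group labels, and the example records with their IDs are all hypothetical and exist only for illustration; this is not the Portal's data model or API.

```python
from dataclasses import dataclass
from typing import List

# Mapping taken from the text: each class of organism has a specialized system.
RESOURCE_BY_GROUP = {
    "plant": "Phytozome",
    "fungus": "MycoCosm",
    "microbe": "IMG",
    "metagenome": "IMG/M",
}


@dataclass(frozen=True)
class Project:
    """Hypothetical catalogue record; field names are illustrative."""
    project_id: int
    taxon_id: int
    name: str
    group: str   # one of the keys of RESOURCE_BY_GROUP
    status: str


def resources_for(project: Project) -> List[str]:
    """Download is offered for every project; other links depend on the group."""
    tools = ["Download"]
    specialized = RESOURCE_BY_GROUP.get(project.group)
    if specialized:
        tools.append(specialized)
    return tools


def search(catalogue: List[Project], query: str) -> List[Project]:
    """Match a query against name (substring), group, taxon ID, or project ID."""
    q = query.strip().lower()
    if not q:
        return []
    return [
        p for p in catalogue
        if q in p.name.lower()
        or q == p.group
        or q == str(p.taxon_id)
        or q == str(p.project_id)
    ]


if __name__ == "__main__":
    catalogue = [  # made-up records
        Project(1, 101, "Example plant genome", "plant", "finished"),
        Project(2, 202, "Example fungal genome", "fungus", "draft"),
        Project(3, 303, "Example marine metagenome", "metagenome", "in progress"),
    ]
    for p in search(catalogue, "marine"):
        print(p.name, resources_for(p))
```

Keeping the group-to-resource mapping in one table mirrors the idea that some tools are available for every project while others depend on taxon or project type.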
genomes and to conduct comparative and genome-centric analyses and community annotation; and the IMG family of tools for large-scale comparative analysis of microbial genomes and metagenomes.

Phytozome (http://phytozome.net; Goodstein et al. 2012) gives access to the sequences and functional annotations of a growing number of complete plant genomes (31 in release v8.0), including land plants and selected algae. Phytozome provides both organism-centric and gene family-centric views as well as access to the BLAST, BLAT, and Search capabilities. Phytozome provides a view of the evolutionary history of every plant and every plant gene at the level of sequence, gene structure, gene family, and genome organization. The Phytozome project organizes the proteomes of green plants into gene families defined at the nodes on the green plant evolutionary tree. Genes have been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable. The gene family view gives access to the information on each family and its members, organized to highlight shared attributes.

GBrowse provides genome-centric views for all genomes included in Phytozome. Each organism browser displays a number of tracks including a gene prediction track, a track of homologous sequences from related species aligned against the genome, supporting EST and VISTA tracks identifying regions of this genome that are syntenic with other plant genomes.

MycoCosm (http://jgi.doe.gov/fungi; Grigoriev et al. 2012) brings together genomic data and analytical tools for diverse fungi that are important for energy and environment. Genomic data from the JGI and its users are integrated and curated via user community participation in data submission, curation, annotation, and analysis. Over 150 newly sequenced and annotated fungal genomes are available to the public through MycoCosm for genome-centric and comparative analyses. Visual navigation across the MycoCosm tree (Fig. 2b), where each node represents a group of phylogenetically related fungi and is linked to analysis tools, allows users to redefine the search and analysis space from a single organism to the entire list of fungal genomes.

The Genome browser with configurable selection of tracks displays predicted gene models and annotations along with different lines of evidence in support of these predictions, such as gene and protein expression profiles. Gene models and annotations are linked to community annotation tools to revise them if needed. Functional profiles of each genome summarize gene annotations according to the GO, KEGG, and KOG classifications and can be compared with each other to study gene family expansions or contractions at different levels of granularity. Clustering using BLAST alignments of all proteins and MCL can expand these analyses to gene families even without annotation and enable side-by-side comparison of each of the cluster members for pattern of protein domains, intron-exon structure, and synteny.

MycoCosm comparative views combine the abovementioned tools to study entire groups of genomes corresponding to MycoCosm nodes. Unlike the genome-centric view, there is no reference genome in this analysis, and, for example, a keyword or BLAST search for protein kinases in Basidiomycota or Ascomycota will show differences in the number of found genes or BLAST hits across different members of these phyla.

IMG, the Integrated Microbial Genomes database (http://img.jgi.doe.gov; Markowitz et al. 2012a, b), is a system designed for flexible comparative analyses of microbial genomic data, which incorporates all complete public microbial genomes as well as those sequenced at JGI. IMG with microbiome samples (IMG/M) is an expanded database that includes metagenome data from diverse environments, both sequenced at JGI and submitted by external users.

In addition to importing all public genomes and their annotations from NCBI’s RefSeq, IMG curates the data by adding features missed by many annotation pipelines, such as small RNAs; assigning proteins and domains to all major protein family databases (e.g., COG, TIGRfam); and linking to organism metadata stored in GOLD, such as oxygen requirements or environment of origin. Annotations can be viewed in detailed gene pages or summarized in genome pages that
include organism metadata in addition to statistics on genome size and gene counts within various categories.

Genome Portal, Joint Genome Institute, Fig. 2 Comparative genomic resources at JGI: (a) Phytozome for plants, (b) MycoCosm for fungi, and (c) IMG family of tools

The tools available in IMG allow for analyses at the gene, function, or genome level, using customizable “carts” for each of these data types. Thus, any given analysis can readily be performed on a single (meta)genome or several and can be extended to many individual genes, functions, or pathways. IMG/M includes a number of metagenome-specific functions, including the option to account for different organism abundances by weighting comparative analyses according to estimated gene copies, based on the contig read coverage reported in the assembly rather than simple gene counts. It also includes a “scaffold cart” for exploring genes within a given set of contigs or scaffolds as well as the option to categorize contigs/scaffolds into population “bins” based on oligonucleotide composition or other features.

Recent developments in IMG and IMG/M include the capacity to add and view (meta)transcriptome and (meta)proteome data in the context of a reference and compare expression profiles across experiments.

Metagenome Analysis

Analysis of metagenome data presents a number of challenges beyond those faced in isolate
genome analysis, in particular the wide variation in individual organism abundances and the shallow coverage of low-abundance, but nonetheless biologically important, taxa. Both of these tend to result in highly fragmented assemblies, which are most readily interpreted when high-quality reference genome data are available.

Genome Portal, Joint Genome Institute, Fig. 3 Metagenomic analysis. A protein recruitment plot showing alignment of genes from a hot spring sample to genomes from the family Hydrogenothermaceae

Most metagenome analyses approach the data from either a phylogenetic perspective (i.e., who is there?) or a functional one (i.e., what are they doing?). Each of these uses a specific suite of tools, though nearly all rely on a well-curated database of genes with known phylogenies and functions. For phylogenetic analysis, genes or gene fragments are assigned to phylogenetic lineages based on homology to genes of known phylogenetic origin. This can be done for all genes from a metagenome dataset, for example, using
MEGAN (Huson and Mitra 2012), or for a set of conserved phylogenetic markers which can be placed onto a tree of known sequences from isolate genomes and/or amplified from uncultivated organisms, for example, using pplacer (Matsen et al. 2012). IMG/M allows for both approaches: an overall perspective of all the genes in a dataset or on a specific set of contigs is provided through the “Phylogenetic Distribution of Genes” option on the main metagenome page or in the scaffold cart, and genes with homology to particular phyla, families, genera, or species can be retrieved. When there are good reference genomes available, alignments of protein-coding genes to those genomes can be viewed in a recruitment plot (Fig. 3). Phylogenetic marker genes can also be extracted and incorporated into trees using the “Phylogenetic Marker COGs” option under the “Find Functions” tab.
Functional or “gene-centric” approaches enable the comparison of metagenome datasets at the functional level to both assess their relative similarity and identify genes or functions that are over- or underrepresented in a given dataset. This type of approach is utilized by metagenome analysis systems like MG-RAST (Meyer et al. 2008). IMG/M provides several options for whole metagenome comparisons. Metagenomes can be clustered (under the “Compare Genomes” tab) according to gene content, using either functional (e.g., COG, Pfam) or phylogenetic criteria, and the results visualized via hierarchical clustering, principal components analysis (PCA), or a correlation matrix. Relative abundances of specific gene families can be viewed via the abundance profile function also under the “Compare Genomes” tab. As mentioned above, these comparisons can be made between partly assembled genomes by taking contig read depth into account when calculating gene abundance.
Summary
Technological innovations leading to the democratization of genome sequencing have resulted in large amounts of genomic data being produced in different parts of the world. Effective analysis of genomic and metagenomic data depends on the availability of comprehensive catalogues of reference genome data for annotation and comparative genomics as well as computational tools able to process the large amounts of sequence data. The JGI Genome Portal (http://genome.jgi.doe.gov) provides a unified access point to all JGI genomic databases and analytical tools, including a list of sequencing projects at JGI and around the world, a comprehensive collection of annotated genomes in all domains of life, and specialized databases for comparative analysis of plant, fungal, and microbial genomes and metagenomes. The latter is still in early stages of development, and data generated at unprecedented scale and complexity for metagenomes will require new approaches to data processing, analysis, and visualization.
References
Berka RM, Grigoriev IV, Otillar R, et al. Comparative genomic analysis of the thermophilic biomass-degrading fungi Myceliophthora thermophila and Thielavia terrestris. Nat Biotechnol. 2011;29:922–7.
Bowler C, Allen AE, Badger JH, et al. The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature. 2008;456:239–44.
Colbourne JK, Pfrender ME, Gilbert D, et al. The ecoresponsive genome of Daphnia pulex. Science. 2011;331:555–61.
Eastwood DC, Floudas D, Binder M, et al. The plant cell wall-decomposing machinery underlies the functional diversity of forest fungi. Science. 2011;333:762–5.
Frazer KA, Pachter L, Poliakov A, et al. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004;32:W273–9.
Fritz-Laylin LK, Prochnik SE, Ginger ML, et al. The genome of Naegleria gruberi illuminates early eukaryotic versatility. Cell. 2010;140:631–42.
Goodstein DM, Shu S, Howson R, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:D1178–86.
Grigoriev IV, Cullen D, Goodwin SB, et al. Fueling the future with fungal genomics. Mycology. 2011;2:192–209.
Grigoriev IV, Nordberg H, Shabalov I, et al. The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res. 2012;40:D26–32.
Hess M, Sczyrba A, Egan R, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331:463–7.
Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol. 2012;856:415–29.
King N, Westbrook MJ, Young SL, et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature. 2008;451:783–8.
Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
Markowitz VM, Chen IM, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012a;40:D123–9.
Markowitz VM, Chen IM, Palaniappan K, et al. IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res. 2012b;40:D115–22.
Martin F, Aerts A, Ahren D, et al. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature. 2008;452:88–92.
Martin F, Cullen D, Hibbett D, et al. Sequencing the fungal tree of life. New Phytol. 2011;190:818–21.
Matsen FA, Hoffman NG, Gallagher A, et al. A format for phylogenetic placements. PLoS One. 2012;7:e31009.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Pagani I, Liolios K, Jansson J, et al. The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012;40:D571–9.
Tringe SG, von Mering C, Kobayashi A, et al. Comparative metagenomics of microbial communities. Science. 2005;308:554–7.
Tuskan GA, Difazio S, Jansson S, et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006;313:1596–604.
Tyler BM, Tripathy S, Zhang X, et al. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science. 2006;313:1261–6.
Walsh DA, Zaikova E, Howes CG, et al. Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science. 2009;326:578–82.
Wu D, Hugenholtz P, Mavromatis K, et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature. 2009;462:1056–60.

Genome-Based Studies of Marine Microorganisms
Xinqing Zhao1, Chao Chen2, Liangyu Chen2, Yumei Wang2 and Xiang Geng2
1School of Life Science and Biotechnology, Dalian University of Technology, Dalian, People’s Republic of China
2Dalian University of Technology, Dalian, China
Synonyms
Genome mining of marine microorganisms
Definition
Genome-based studies of marine microorganisms use genetic information retrieved from the genomic sequences of marine microorganisms to guide the discovery of useful enzymes and natural products from these organisms. Chemical structures of natural products potentially synthesized by marine microorganisms can be predicted by aligning the biosynthetic genes with known gene sequences that are responsible for the biosynthesis of natural products, and the physicochemical properties (UV spectrum, molecular weight, polarity, etc.) obtained from the prediction can be used to guide further purification and structure elucidation of the compounds. When the genes or gene clusters of interest are not expressed, or are expressed only at a low level, various methods can be employed to activate the expression of biosynthetic genes. Identification of target natural products can be achieved by comparative metabolic profiling, heterologous expression, and other genome-mining strategies. For unculturable or yet-uncultured marine microbes in given environments, metagenomic, metatranscriptomic, and metaproteomic sequences can be employed. Function-based or
[Figure: chemical structures of coelichelin, salinilactam, and thailandamide A]
Genome-Based Studies of Marine Microorganisms, Fig. 1 Structures of compounds discovered by genome mining
bioinformatic tools such as antiSMASH and NP.searcher (Nikolouli and Mossialos 2012). Because knowledge of enzymatic functions and metabolic cross talk is limited, the prediction of chemical structures is not always correct, and accurate annotation of gene functions and prediction of chemical structures require more advanced bioinformatic tools.
When the biosynthetic genes are actively expressed under lab conditions, information on the physicochemical properties of the target molecules, such as UV spectrum, molecular weight, and polarity, obtained from the bioinformatic prediction can be used to guide the further purification of the compounds. Thailandamide A was discovered by genome mining of Burkholderia thailandensis (Nguyen et al. 2008). Because it is temperature and light sensitive and is produced in the early growth stage, thailandamide A might not have been identified using classical methods without the genome-guided isolation (Nguyen et al. 2008). The structure of thailandamide A is shown in Fig. 1.
The genomisotopic approach was first described with the discovery of orfamides from Pseudomonas fluorescens (Gross et al. 2007). In this approach, stable-isotope-labeled amino acid precursors are fed into the culture broth, and the labeled molecule is subsequently detected to identify NRPS or mixed PKS/NRPS compounds.
Although low-level production of target molecules can be identified by the genomisotopic method, some metabolites are only produced under special circumstances; activation of production of these molecules requires mimicking specific nutritional, environmental, and biological conditions, such as special carbon and nitrogen sources, high temperature, UV irradiation, osmotic stress treatments, and coculture with another microbial strain (Scherlach and Hertweck 2009). In addition, genetic methods can also be employed to activate production of certain metabolites identified by genome mining, including overexpression of activation regulators and deletion of repressive regulators (Scheffler et al. 2013). Heterologous expression of the entire gene cluster in well-defined host strains, including E. coli, Streptomyces, Bacillus, and Saccharomyces cerevisiae, has also been employed in genome mining (Zhang et al. 2011). Selection of suitable host strains and expression vectors is critical to achieve
Genome-Based Studies of Marine Microorganisms, Fig. 2 Genome mining for identification of natural products
heterologous production of target active molecules. The genome mining scheme is depicted in Fig. 2.
Metatranscriptomic and Metaproteomic Studies for Discovery of Novel Enzymes and Small Molecules
In addition to culture-dependent genome-mining studies, genome-based discovery of novel enzymes and natural products from environmental samples can also be achieved using culture-independent tools. It has been estimated that less than 1 % of the bacteria in most environmental samples are culturable (reviewed by Brady et al. 2009), and it is thus important to study the yet-uncultured microorganisms in the marine environment. A metagenome is the collection of genetic material (genomic DNA) of a mixed community of organisms recovered directly from given environmental samples. Environmental DNA (eDNA) extracted from marine sediments, seawater, or marine sponges, plants, or animals can serve as the starting point for metagenomic studies. Metagenomic DNA is cloned into various host cells, the most popular host being E. coli. Phenotype-based screening and DNA sequencing-based screening of metagenomic libraries yield positive clones with the targeted sequences (reviewed by Brady et al. 2009). Novel enzymes such as laccase, aromatic hydrocarbon dioxygenase, and halogenase have been isolated from marine metagenomic studies (Fang et al. 2011; Marcos et al. 2012; Bayer et al. 2013, reviewed by Kennedy et al. 2011), which have great potential for industrial applications and environmental bioremediation. In addition, novel natural products were also identified in metagenomic libraries (reviewed by Brady et al. 2009), and Streptomyces and Ralstonia metallidurans were used as hosts for heterologous expression of metagenomic libraries. Metagenome mining of symbiotic bacteria of the marine sponge Theonella swinhoei resulted in the discovery of polytheonamides, which are extensively posttranslationally modified ribosomal peptides (Freeman et al. 2012). The metagenomic workflow is illustrated in Fig. 3.
Metatranscriptomic and metaproteomic studies focus on the expression of certain genes in a given environment at a given time (Schweder et al. 2008; Stewart et al. 2012) and have been used to characterize the metabolic behavior of microbial communities. Such techniques have not been employed to study the isolation of novel enzymes
Genome-Based Studies of Marine Microorganisms, Fig. 3 Metagenomic method to discover novel natural products or enzymes
and small molecules from the marine environment. In comparison to metagenomic studies, metatranscriptomics and metaproteomics overlook genes that are not expressed at a given time and are thus limited in their ability to explore fully the biosynthetic potential of marine microorganisms. However, the same problem of silent gene expression can also be encountered when the metagenomic libraries are propagated in certain host cells; therefore, choosing diverse host cells and testing various conditions for expression of metagenomic libraries are important for identifying novel enzymes and small molecules in the marine environment.
Summary
Genome mining has sped up the discovery of natural products and novel enzymes from microorganisms by exploring their full biosynthetic potential. Metagenomic studies combined with genome mining promote the advancement of studies of yet-uncultured marine microorganisms. The discovery of marine natural products and novel enzymes using genome-based methods is still in its early stage; however, development of genome mining and metagenomic approaches will facilitate discovery of more novel marine enzymes and natural products for biotechnological applications.
Acknowledgments The authors regret not being able to cite more references due to space limitations.
References
Bayer K, Scheuermayer M, Fieseler L, Hentschel U. Genomic mining for novel FADH(2)-dependent halogenases in marine sponge-associated microbial consortia. Mar Biotechnol (NY). 2013;15(1):63–72.
Brady SF, Simmons L, Kim JH, Schmidt EW. Metagenomic approaches to natural products from free-living and symbiotic organisms. Nat Prod Rep. 2009;26(11):1488–503.
Challis GL, Ravel J. Coelichelin, a new peptide siderophore encoded by the Streptomyces coelicolor genome: structure prediction from the sequence of its non-ribosomal peptide synthetase. FEMS Microbiol Lett. 2000;187(2):111–4.
Fang Z, Li T, Wang Q, Zhang X, Peng H, Fang W, Hong Y, Ge H, Xiao Y. A bacterial laccase from marine microbial metagenome exhibiting chloride tolerance and dye decolorization ability. Appl Microbiol Biotechnol. 2011;89:1103–10.
Freeman MF, Gurgui C, Helf MJ, Morinaka BI, Uria AR, Oldham NJ, Sahl HG, Matsunaga S, Piel J. Metagenome mining reveals polytheonamides as posttranslationally modified ribosomal peptides. Science. 2012;338(6105):387–90.
G 236 GeoChip-Based Metagenomic Technologies
763 gene variants involved in nitrogen cycling (nirS, nirK, nifH, amoA), methane oxidation (pmoA), and sulfite reduction (dsrAB). Then, an expanded array was developed with 2,402 genes involved in organic contaminant biodegradation and metal resistance to monitor microbial populations and functional genes involved in biodegradation and biotransformation (Rhee et al. 2004). Specificity evaluation with representative pure cultures indicated that the designed probes appeared to be specific to their corresponding target genes. The detection limit was 5–10 ng of genomic DNA in the absence of background DNA and 50–100 ng of pure-culture genomic DNA in the presence of background DNA. Real-time PCR analysis was very consistent with the microarray-based quantification (He et al. 2011).
Although the prototype and GeoChip 1.0 arrays were used to probe specific functional groups or activities, they lacked a truly comprehensive probe set covering key microbial functional processes occurring in different environments. Therefore, more comprehensive GeoChips have been developed and evaluated. For example, GeoChip 2.0, containing 24,243 (50-mer) oligonucleotide probes, targeting ~10,000 functional gene variants from 150 gene families involved in the geochemical cycling of C, N, and P, sulfate reduction, metal reduction and resistance, and organic contaminant degradation, was developed as the first comprehensive FGA (He et al. 2007). After 2 years, GeoChip 3.0 was developed, which contained about 28,000 probes and targeted ~57,000 sequences from 292 gene families (He et al. 2010a). GeoChip 3.0 is more comprehensive and has several other distinct features compared to GeoChip 2.0, such as a common oligo reference standard (CORS) for data normalization and comparison, a software package for data management and future updating, the gyrB gene for phylogenetic analysis, and additional functional groups including those involved in antibiotic resistance and energy processing (He et al. 2010a). Based on GeoChip 3.0, GeoChip 4.0 was developed, which contains ~84,000 probes and targets >152,000 genes from 410 functional families. GeoChip 4.0 not only contains all functional categories from GeoChip 3.0 but also includes additional functional categories, such as genes from bacterial phages and those involved in stress response and virulence (Hazen et al. 2010; He et al. 2012a). All evaluations and studies demonstrate that GeoChip is a powerful tool for specific, sensitive, and quantitative analysis of microbial communities from a variety of habitats (He et al. 2011, 2012a, b).
GeoChip Development
GeoChip development involves several major steps, including selection of target genes, sequence retrieval and verification, oligonucleotide probe design, probe validation, and array construction as well as future automatic update, which are generally implemented by a GeoChip development and data analysis pipeline (http://ieg./ou.edu/) (He et al. 2010a).
Selection of Target Genes and Sequence Retrieval
A variety of functional genes can be used as functional markers targeting different processes, such as biogeochemical cycling of C, N, S, P, and metals, contaminant bioremediation, antibiotic resistance, and stress response. For example, 292 functional gene families were selected for GeoChip 3.0 with 41 for C cycling, 16 for N cycling, 3 for P utilization, 4 for S cycling, 173 for biodegradation of a variety of organic contaminants, 41 for metal reduction and resistance, 11 for antibiotic resistance, and 2 for energy processing. In addition, a phylogenetic marker (gyrB) was also chosen (He et al. 2010a). More importantly, when sequences for a known functional gene are available, they can be added in an updated GeoChip. For example, when GeoChip was updated to GeoChip 4.0, functional gene families involved in stress responses, bacterial phages, and virulence were added, resulting in 410 functional gene families on GeoChip 4.0 (Hazen et al. 2010; He et al. 2012a).
Generally, genes are chosen for key enzymes or proteins with the corresponding
function(s) of interest. If a process involves multiple steps or a protein complex, those genes responsible for catalytic subunits or with the active site(s) will be selected (He et al. 2011).
Sequence retrieval is generally performed with a pipeline that integrates a database for managing all retrieved sequences and subsequently designed probes. For each functional gene, the first step is to submit a query to the GenBank protein database and fetch all candidate amino acid sequences. The key words may include the name of the target gene/enzyme, its abbreviation and enzyme commission number (EC), and affiliated domains of bacteria, archaea, and fungi. Second, retrieved sequences are validated by seed sequences (those sequences that have been experimentally confirmed to produce the protein of interest and that the protein functions as expected) with the HMMER program. Finally, all confirmed protein sequences are searched against GenBank again to obtain their corresponding nucleic acid sequences for probe design (He et al. 2010a).
Oligonucleotide Probe Design
A new version of CommOligo (He et al. 2012a) with group-specific probe design features can be used to design both gene- and group-specific oligonucleotide probes with different degrees of specificity based on the following criteria: (i) a gene-specific probe must have ≤90 % sequence identity, ≤20-base continuous stretch, and ≥–35 kcal/mol free energy; (ii) a group-specific probe has to meet the above requirements for nontarget groups, and it also must have ≥96 % sequence identity, ≥35-base continuous stretch, and ≤–60 kcal/mol free energy within the group. Computational and experimental evaluation indicates that these designed probes are highly specific to their targets (He et al. 2007, 2010a).
Probe Validation and GeoChip Construction
All designed probes are subsequently verified against the GenBank (NR) nucleic acid database for specificity. Normally, multiple (e.g., 20) probes for each sequence or each group of sequences are designed, but only the best probe set for each sequence or each group of closely related sequences will be chosen for array construction. GeoChip can be constructed in-house, such as GeoChips 2.0 and 3.0 (He et al. 2007, 2010a), or commercially, like GeoChip 4.0 (Hazen et al. 2010; He et al. 2012a).
GeoChip Operation and Data Analysis
Generally, GeoChip operation and data analysis include target preparation, GeoChip hybridization, image and data preprocessing, and data analysis (Fig. 1).
Target Preparation
Target preparation involves a few steps, including nucleic acid extraction and purification, labeling, and hybridization (Fig. 1a). The most important step for successful GeoChip analysis is nucleic acid extraction and purification from environmental samples, generally using a well-established method that is able to produce large fragments of DNA. High-quality DNA should have ratios of A260/A280 ~ 1.8 and A260/A230 > 1.7. Low A260/A230 ratios indicate impurities in the DNA sample and can negatively influence subsequent labeling and hybridization. Generally, since 1–5 µg of DNA or 5–20 µg of RNA is required for GeoChip hybridization, whole-community genome amplification (WCGA) for DNA and whole-community RNA amplification (WCRA) for RNA are necessary (He et al. 2012b). Non-amplified or amplified nucleic acids are then labeled with fluorescent dye (e.g., Cy3, Cy5) using random priming with the Klenow fragment of DNA polymerase for DNA and SuperScript™ II/III RNase H-reverse transcriptase for RNA. The labeled nucleic acids are then purified and dried for hybridization (Fig. 1a).
Hybridization, Imaging, and Data Preprocessing
Labeled nucleic acid target is suspended in a hybridization buffer containing 40–50 % formamide and hybridized on GeoChip at 42–50 °C (He et al. 2007, 2010a, 2012b). The hybridization stringency can be adjusted by changing the temperature and/or formamide concentration.
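The probe design criteria listed under Oligonucleotide Probe Design can be read as simple threshold checks. The following is a minimal sketch, assuming that sequence identity, longest continuous matching stretch, and binding free energy have already been computed for each probe-sequence pair by some external tool. The function names are hypothetical and are not CommOligo's interface, and the direction of each comparison (upper bounds against nontargets, lower bounds within the group, more negative free energy meaning stronger binding) is an assumption made for illustration.

```python
def passes_nontarget_check(identity_pct, stretch_bases, free_energy_kcal):
    """Criteria (i): the probe must be sufficiently dissimilar from a nontarget sequence."""
    return (identity_pct <= 90.0
            and stretch_bases <= 20
            and free_energy_kcal >= -35.0)

def passes_group_check(identity_pct, stretch_bases, free_energy_kcal):
    """Criteria (ii): the probe must match a member of its own target group closely."""
    return (identity_pct >= 96.0
            and stretch_bases >= 35
            and free_energy_kcal <= -60.0)

def is_gene_specific(nontarget_hits):
    """A gene-specific probe passes the nontarget check against every nontarget hit."""
    return all(passes_nontarget_check(*hit) for hit in nontarget_hits)

def is_group_specific(group_hits, nontarget_hits):
    """A group-specific probe matches every group member and no nontarget group."""
    return (all(passes_group_check(*hit) for hit in group_hits)
            and is_gene_specific(nontarget_hits))

# Hypothetical hits: (identity %, continuous stretch in bases, free energy in kcal/mol).
gene_probe_ok = is_gene_specific([(85.0, 15, -20.0), (90.0, 20, -35.0)])
group_probe_ok = is_group_specific([(98.0, 40, -70.0)], [(80.0, 10, -10.0)])
```

Keeping the two checks separate mirrors the text: a group-specific probe reuses the gene-specific test for nontarget groups and adds the within-group test on top.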
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities, Fig. 1 A schematic presentation of target preparation, GeoChip operation, and data analysis of microbial communities from a variety of habitats. (a) Target preparation, (b) GeoChip hybridization and data processing, (c) GeoChip data analysis (This figure is adapted from Fig. 1 by He et al. (2012b))
For every 1 % increase in formamide, the effective temperature increases by 0.6 °C (He et al. 2011).
Hybridized arrays are imaged with a microarray scanner having a resolution of at least 10 µm for homemade arrays and 2 µm for commercially manufactured arrays. Microarray analysis software is then used to quantify the signal intensity (pixel density) of each spot. Spot quality is also evaluated at this point using predetermined criteria, and positive spots are generally called based on signal-to-noise ratio [SNR; SNR = (signal mean – background mean)/background standard deviation] or signal-to-both-standard-deviations ratio [SSDR; SSDR = (signal mean – background mean)/(signal standard deviation – background standard deviation)] (He et al. 2012b).
Raw GeoChip data are further evaluated via the GeoChip data analysis pipeline (He et al. 2010a). The quality of individual spots, evenness of control spot hybridization signals across the slide surface, and background levels are assessed to determine overall array quality. Spots flagged as poor or low quality are removed along with outliers: positive spots with (signal – mean signal intensity of all replicate spots) greater than three times the replicate spots signal standard deviation (He et al. 2011). The signal intensities are then normalized for further statistical analysis (Fig. 1b).
GeoChip Data Analysis
Data analysis is the most challenging part in the use of GeoChip for microbial community analysis, and a variety of methods have been used to address fundamental microbial ecology questions (Fig. 1c). First, various diversity indices (e.g., richness, evenness, diversity) based on the number of functional genes detected and their abundances are used to examine the functional diversity of microbial communities. The relative abundance of specific genes or gene groups can be determined based on the total signal intensity of the relevant genes or the number of genes detected. The percentage of genes shared by different samples can also be calculated to compare the microbial communities examined. Second, for statistical analysis of the overall microbial community composition and structure with FGA data, ordination techniques can be used such as principal component analysis (PCA), detrended correspondence analysis (DCA), cluster analysis (CA), and nonmetric multidimensional scaling (NMDS). PCA and DCA are multivariate statistical methods, which reduce the number of variables needed to explain the data and highlight the variability between samples. CA groups samples based on the overall similarity of gene patterns. NMDS finds both a nonparametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items and the location of each item in the low-dimensional space. Also, the response ratio can be used to determine changes of specific functional genes between the control and the treatment. In addition, analysis of variance (ANOVA), analysis of similarities (ANOSIM), nonparametric multivariate analysis of variance (Adonis), and multi-response permutation procedure (MRPP) can be used to discern dissimilarities of microbial communities over time and space (He et al. 2011, 2012b). Third, if environmental data or other metadata are available, GeoChip data can be used to correlate environmental variables with the functional microbial community structure. These methods include Pearson’s correlation coefficient (PCC), canonical correspondence analysis (CCA), and the Mantel test. PCC measures the strength of linear dependence between two variables, such as functional gene abundances detected by GeoChip, and environmental variables. CCA has been used in many cases in GeoChip-based studies to better understand how environmental factors affect the community structure (He et al. 2011, 2012b). Also, based on the results of the CCA, the relative influence of environmental variables on the microbial community structure can be determined using variance partitioning analysis (VPA). In addition, further correlations of GeoChip data with environmental parameters can be performed with the Mantel test (He et al. 2007, 2010a, b, 2011, 2012b). Finally, GeoChip data can be used to infer functional molecular ecological networks for revealing interactions of functional genes and their associated populations. A recent study indicated that elevated CO2 substantially altered the network interaction of soil microbial communities and that the shift in network structures is significantly correlated with soil properties (He et al. 2012b; Zhou et al. 2010) (Fig. 1c).
GeoChip Applications
Different versions of GeoChip have been used to analyze microbial communities from different habitats, such as aquatic systems, soils, extreme environments, human microbiomes, and bioreactors for addressing fundamental scientific questions related to global change, bioenergy, bioremediation, agricultural management, land use, and human health and disease as well as ecological theories (He et al. 2011, 2012b). Several recent studies are highlighted, especially with a focus on soil and water microbial communities. A list of representative studies with different GeoChip versions is shown in Table 1.
Soils
Soil may harbor the most complex microbial communities among known habitats, and recently GeoChips have been used to investigate soil microbial communities to address fundamental ecological questions related to global change (e.g., elevated CO2, elevated O3, warming), bioremediation of oil-contaminated fields, land use, agricultural management, and livestock grazing.
Three recent studies focused on the response of soil microbial communities to global change, including elevated CO2, temperature, and O3. First, GeoChip 3.0 was used to analyze soil microbial communities under elevated CO2 at a multifactor grassland experiment site, BioCON (biodiversity, CO2, and nitrogen deposition), in
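The spot-calling ratios and the replicate outlier rule given under Hybridization, Imaging, and Data Preprocessing can be sketched in a few lines. This is an illustration only, not the GeoChip data analysis pipeline: the function names are hypothetical, the positive-call threshold of 2.0 is an assumed default because the text does not state one, and the outlier rule is implemented one-sided exactly as it is worded in the text.

```python
from statistics import mean, stdev

def snr(signal_mean, background_mean, background_sd):
    """Signal-to-noise ratio as defined in the text."""
    return (signal_mean - background_mean) / background_sd

def ssdr(signal_mean, background_mean, signal_sd, background_sd):
    """Signal-to-both-standard-deviations ratio, with the denominator as printed in the text."""
    denominator = signal_sd - background_sd
    if denominator == 0:
        raise ValueError("signal and background standard deviations are equal")
    return (signal_mean - background_mean) / denominator

def is_positive(ratio, threshold=2.0):
    """Call a spot positive; the threshold value is an assumption, not taken from the text."""
    return ratio >= threshold

def drop_outliers(replicate_signals, k=3.0):
    """Remove replicates whose (signal - replicate mean) exceeds k replicate standard deviations."""
    if len(replicate_signals) < 2:
        return list(replicate_signals)
    m = mean(replicate_signals)
    sd = stdev(replicate_signals)
    return [s for s in replicate_signals if (s - m) <= k * sd]

# Hypothetical spot measurements.
example_snr = snr(500, 100, 50)
example_ssdr = ssdr(500, 100, 90, 50)
cleaned = drop_outliers([100] * 11 + [10000])
```

One practical point the sketch makes visible: with the sample standard deviation, a single replicate can only exceed three standard deviations from the mean when there are at least eleven replicates, so the rule has no effect on small replicate sets.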
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities, Table 1 Summary of representative GeoChip applications. If no references are cited, those studies are described in a previous review (He et al. 2012b)
Habitat or ecosystem | Ecosystem/sample type | GeoChip | Objectives of study/biological questions
Aquatic systems | Marine sediment | GeoChip 1.0 | Functional microbial community structure of marine sediments in the Gulf of Mexico
Aquatic systems | Ebro and Elbe river sediment | GeoChip 2.0 | Pesticide impacts on European rivers
Aquatic systems | Coral-associated marine water | GeoChip 2.0 | Microbial communities in healthy and yellow-band diseased coral (Montastraea faveolata)
Soils | Antarctic latitudinal transect soil | GeoChip 2.0 | Microbial C and N cycling across an Antarctic latitudinal transect
Soils | Deciduous forest soil | GeoChip 2.0 | Gene-area relation in microorganisms
Soils | Native grassland soil | GeoChip 2.0 | Afforestation impacts soil microbial communities and their functional potential
Soils | Strawberry farmland soil | GeoChip 2.0 | Microbial responses to farm management
Soils | Grassland soil | GeoChip 2.0 | Microbial responses to plant invasion
Soils | Agricultural soil | GeoChip 2.0 | Agricultural practices/land use (Xue et al. 2013)
Soils | Grassland soil | GeoChip 3.0 | Global change (elevated CO2) (He et al. 2010b)
Soils | Grassland soil | GeoChip 3.0 | Global change (warming) (Zhou et al. 2012)
Soils | Wheat rhizosphere soil | GeoChip 3.0 | Global change (elevated O3) (Li et al. 2013)
Soils | Citrus rhizosphere soil | GeoChip 3.0 | Rhizosphere microbial community responses to Candidatus Liberibacter asiaticus-infected citrus trees
Soils | Grassland soil | GeoChip 4.0 | The effect of grazing on microbial communities (Yang et al. 2013)
Contaminated sites | U-contaminated underground water (Oak Ridge, TN) | GeoChip 1.0 | Bioremediation of U-contaminated groundwater
Contaminated sites | U-contaminated underground water (Oak Ridge, TN) | GeoChip 2.0 | Bioremediation of U-contaminated groundwater (Van Nostrand et al. 2011)
Contaminated sites | U-contaminated sediment (Oak Ridge, TN) | GeoChip 2.0 | Bioremediation of U-contaminated sediments
Contaminated sites | U-contaminated underground water (Rifle, CO) | GeoChip 2.0 | Bioremediation of U-contaminated groundwater (Liang et al. 2012)
Contaminated sites | PCB-contaminated soil | GeoChip 2.0 | Microbial bioremediation of PCB-contaminated soil
Contaminated sites | Oil-contaminated soil | GeoChip 2.0 | Bioremediation of oil-contaminated soil
Contaminated sites | Arsenic-contaminated soil | GeoChip 3.0 | Rhizosphere microbial community responses to arsenic contamination and phytoremediation
Contaminated sites | Landfill groundwater | GeoChip 3.0 | Microbial responses to landfill-derived contaminants in groundwater (Lu et al. 2012)
Contaminated sites | Oil-spill seawater | GeoChip 4.0 | Microbial bioremediation of oil-spill sites (Hazen et al. 2010)
(continued)
GeoChip-Based Metagenomic Technologies 243 G
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and
Activities, Table 1 (continued)
Habitat or
ecosystem Ecosystem/sample type GeoChip Objectives of study/biological questions
Extreme Deep-sea hydrothermal vent GeoChip Functional gene diversity of deep-sea hydrothermal vent
environments (chimney) 2.0 microbial communities
Deep-sea basalt samples GeoChip Functional gene diversity and structure of deep-sea basalt
2.0 microbial communities
GSL hypersaline water GeoChip Functional gene diversity and structure of hypersaline
2.0 water microbial communities
Acid mine drainage (water) GeoChip Functional gene diversity of microbial communities in acid
2.0 mine drainage (AMD) systems
Bioreactors Fluidized bed reactor for GeoChip Microbial bioremediation of hydrocarbon-contaminated
bioremediation 2.0 water
Microbial electrolysis cell for GeoChip Microbial hydrogen production using wastewater
hydrogen production 3.0 G
the Cedar Creek Ecosystem Science Reserve, MN (He et al. 2010b). The results showed that the functional microbial community structure was markedly different between ambient CO2 and elevated CO2, as indicated by DCA of GeoChip 3.0 data and 16S rRNA gene-based pyrosequencing data. Also, genes involved in labile C degradation and C and N fixation were significantly increased under elevated CO2, although the abundance of recalcitrant C degradation genes remained unchanged. In addition, changes in the microbial community structure were significantly correlated with soil C and N contents and plant productivity (He et al. 2010b). Second, GeoChip 3.0 was used to understand the effect of increased temperature on soil microbial communities and their roles in regulating soil carbon dynamics at a tallgrass prairie ecosystem in the US Great Plains of Central Oklahoma. The results suggest that soil microorganisms may regulate soil carbon dynamics through three primary feedback mechanisms: (i) shifting microbial community composition, leading to reduced temperature sensitivity of heterotrophic soil respiration; (ii) differentially stimulating labile C but not recalcitrant C degradation genes to maintain long-term soil carbon stability and storage; and (iii) enhancing nutrient-cycling processes to promote plant growth (Zhou et al. 2012). Third, GeoChip 3.0 was used to investigate the functional composition and structure of rhizosphere microbial communities from O3-sensitive and O3-relatively sensitive wheat (Triticum aestivum L.) cultivars under elevated O3 (eO3). Based on GeoChip hybridization signal intensities, the overall functional structure of rhizosphere microbial communities was not significantly changed by eO3 or cultivar, but the abundance of specific functional genes involved in C fixation and degradation, N fixation, and sulfite reduction was significantly altered in response to eO3 and/or wheat cultivar. Also, the O3-sensitive cultivar appeared to harbor rhizosphere microbial functional communities that were more sensitive to eO3 than those of the O3-relatively sensitive cultivar. In addition, CCA suggested that the functional structure of microbial communities involved in C cycling was largely shaped by soil and plant properties including pH, dissolved organic carbon (DOC), microbial biomass C, C/N ratio, and grain weight (Li et al. 2013). Those studies indicate that global change significantly impacts soil microbial communities, which may in turn regulate ecosystem functioning through different feedback mechanisms.

Various agriculture management practices may have significant influences on soil microbial communities and their ecological functions.
GeoChip 2.0 was used to evaluate the potential functions of soil microbial communities under conventional (CT), low-input (LI), and organic (ORG) management systems at an agricultural research site in Michigan. Compared to CT, a higher diversity of functional genes was observed in LI. The functional gene diversity in ORG did not differ significantly from that of either CT or LI. The abundance of genes encoding enzymes involved in C, N, P, and S cycling was generally lower in CT than in LI or ORG, but functional genes involved in lignin degradation, methane generation/oxidation, and assimilatory N reduction remained unchanged. Also, significant correlations were observed between NO3− concentration and denitrification gene abundance, NH4+ concentration and ammonification gene abundance, and N2O flux and denitrification gene abundance, indicating a close linkage between soil N availability or utilization and the associated functional potential of soil microbial communities (Xue et al. 2013).

Livestock grazing is a globally widespread land-use activity. However, the effect of free livestock grazing on soil microbial communities at the functional gene level remains unclear. GeoChip 4.0 was used to examine the effects of free livestock grazing on the microbial community at an experimental site in Tibet, a region known to be very sensitive to anthropogenic perturbation and global warming. The results showed that grazing changed the microbial community functional structure, in addition to aboveground vegetation and soil geochemical properties. Further statistical analysis showed that microbial community functional structures were closely correlated with environmental variables and that variations in microbial community functional structures were mainly controlled by aboveground vegetation, soil C/N ratio, and NH4+-N. Therefore, these results indicated that soil microbial community functional structure was very sensitive to livestock grazing and revealed the role of soil microbial communities in the regulation of soil N and C cycling, supporting the necessity to include microbial components in evaluating the consequences of land use and/or climate change (Yang et al. 2013).

Groundwater and Aquatic Ecosystems

Due to human activities, groundwater and aquatic ecosystems are often contaminated from various sources (e.g., mining, oil spill, landfill) and with a variety of toxic compounds (e.g., heavy metals, herbicides, antibiotics, pesticides) and conditions (e.g., low pH, high salinity). To understand how such contamination impacts groundwater and aquatic ecosystems, GeoChips were used to investigate those microbial communities and to explore the potential for in situ bioremediation of contaminated sites by indigenous microbial communities.

A pilot-scale system was established to examine the feasibility of in situ U(VI) immobilization at a highly contaminated aquifer in Oak Ridge, TN. Ethanol was injected intermittently as an electron donor to stimulate microbial U(VI) reduction, leading to a decrease of U(VI) concentrations below the Environmental Protection Agency drinking water standard. GeoChip 2.0 was used to monitor microbial communities in three wells during active U(VI) reduction and maintenance phases. The results showed that the overall microbial community structure exhibited a considerable shift over the remediation phases examined and that functional populations of Fe(III)-reducing bacteria (FeRB), nitrate-reducing bacteria (NRB), and sulfate-reducing bacteria (SRB) reached their highest levels during the active U(VI) reduction phase (days 137–370), in which denitrification, Fe(III) reduction, and sulfate reduction occurred sequentially, suggesting that these functional populations could play an important role in both active U(VI) reduction and maintaining the stability of reduced U(IV) (Van Nostrand et al. 2011).

To better understand how microbial functional diversity changes with subsurface redox conditions during in situ U(VI) bioremediation, GeoChip 2.0 was applied to examine groundwater microbial communities at a uranium mill tailings remedial action (UMTRA) site (Rifle, CO). The results indicated that functional microbial communities altered with a shift in the dominant metabolic process and that the abundance of dsrAB and mcr genes increased when redox conditions shifted from Fe-reducing to sulfate-reducing conditions, while cytochrome genes were primarily
detected from Geobacter species and decreased with lower subsurface redox conditions. Statistical analysis of environmental parameters and functional genes indicated that acetate, U(VI), and redox potential were the most significant geochemical variables linked to the microbial functional gene structures. This study indicates that microbial functional genes could be very useful for tracking microbial community structure and dynamics during bioremediation (Liang et al. 2012).

In another study, GeoChip 3.0 was used to study the functional gene diversity and structure of groundwater microbial communities in a shallow landfill leachate-contaminated aquifer in Norman, OK. Samples were taken from eight wells at the same aquifer depth immediately below a municipal landfill or along the predominant downgradient groundwater flowpath. The results showed that functional gene richness and diversity immediately below the landfill and in the closest well were considerably lower than those in downgradient wells and that landfill leachate impacted the diversity, composition, structure, and functional potential of groundwater microbial communities as a function of groundwater pH and concentrations of sulfate, ammonia, and dissolved organic carbon (Lu et al. 2012).

In 2010, the Deepwater Horizon oil spill occurred in the Gulf of Mexico. GeoChip 4.0 was used to examine the functional composition and structure of water microbial communities from the oil plume and control sites. The results indicated that the water microbial community composition and structure were dramatically altered in deep-sea oil plume samples. A variety of functional genes involved in both aerobic and anaerobic hydrocarbon degradation were highly enriched in the plume compared with outside the plume, indicating a great potential for intrinsic bioremediation or natural attenuation in the deep sea. Various other microbial functional genes that are relevant to C, N, P, S, and iron cycling, metal resistance, and bacteriophage replication were also enriched in the plume. Overall, this study suggests that indigenous microbial communities could have a significant role in biodegradation of oil spills in deep-sea environments (Hazen et al. 2010).

Other Environments

GeoChips were also used to analyze microbial communities from other habitats/ecosystems, including various contaminated sites (e.g., chromate-contaminated water, U-contaminated sediments, polychlorinated biphenyl- and arsenic-contaminated soils), extreme environments (e.g., acid mine drainage, hypersaline lakes, deep-sea basalts, deep-sea hydrothermal vents), bioleaching systems, and bioreactors as well as the human microbiome (He et al. 2011, 2012b).

Summary

Although GeoChip technology has been demonstrated to be specific, sensitive, and quantitative and has been applied to analyze microbial communities from different habitats, some key issues and challenges remain, including probe coverage, specificity, sensitivity, quantitative capability, nucleic acid quality, the detection of microbial community activity, and competition from high-throughput sequencing technologies. It should be noted that probe coverage on GeoChip is relatively low compared to the availability of functional gene sequences in databases, especially for earlier versions of FGAs. One reason is that specific probes cannot be designed for some sequences given the available sequence databases and software. Also, GeoChip probe sets need continuous updates to reflect the current status of functional gene sequence information.

Critical issues with GeoChip design and detection are specificity, sensitivity, and quantitative capability, which are especially important since many gene variants within each environmental sample are unknown. Array specificity is controlled by probe design and hybridization conditions. A novel microarray probe design software tool, CommOligo (He et al. 2012a), and its improved versions were used to design probes for GeoChip 2.0, GeoChip 3.0, and GeoChip 4.0. Experimental evaluations of GeoChip 2.0 and GeoChip 3.0 indicated that low percentages of false positives (0.002–0.025 %) were observed (He et al. 2007; He et al. 2010a). GeoChip hybridizations are generally performed at 42–50 °C
with 50 % formamide. Sensitivity is another major concern since many gene variants are expected to be present in low abundance in environmental samples. The current level of sensitivity for oligonucleotide arrays using environmental samples is approximately 50–100 ng or 10^7 cells, or approximately 5 % of the microbial community, providing coverage of only the most dominant community members. Several strategies have been utilized to increase sensitivity. For example, with WCGA and WCRA approaches, the sensitivity of GeoChip hybridization could increase to 10 fg. Also, array surface modifications, a decrease in the volume of hybridization solution, and the use of new labeling techniques could increase GeoChip detection sensitivity (He et al. 2011, 2012a). An important goal in microarray analysis is to provide quantitative information. GeoChip has been shown to have a linear relationship between target DNA or RNA concentrations and hybridization signal intensities. However, this relationship can be affected by sequence divergence (i.e., the more divergent the sequence, the lower the signal intensity). Therefore, two strategies are used to improve quantitative ability: using mismatch probes and using relative comparisons across samples rather than absolute comparisons (He et al. 2012a).

The quality and quantification of environmental nucleic acids are among the most important factors for successful GeoChip hybridization and reliable data generation. DNA with large fragments and minimal amounts of contaminants is especially important when samples need to be amplified using WCGA. Accurate measurement of DNA yields is also important, so quantification should be based on double-strand DNA (dsDNA) measurement (e.g., PicoGreen) rather than on absorbance. While DNA detection provides information on the presence of functional genes in the environment, it does not provide unconditional evidence for microbial activity. Population changes can be used to infer microbial activity, but this may not be accurate. To monitor microbial activity, mRNA should be used. However, since mRNA is easily degraded with rapid turnover, usually has a low abundance, and makes up a small proportion of the total RNA, improved RNA extraction methods are necessary to use environmental RNA for GeoChip analysis. Alternatively, other techniques, such as stable isotope probing (SIP), enzyme activity, metaproteomic analysis, and metabolite assays, may be used to study the functional activity and ecosystem functions of microbial communities.

High-throughput sequencing technologies (e.g., 454, Illumina) are available for microbial community analysis and challenge GeoChip technologies. However, although these sequencing-based technologies can discover novel sequences, in-depth shotgun sequencing of a community can be expensive. In addition, these approaches suffer from a lack of appropriate conserved primers for many target genes. Also, sequencing-based technologies have the disadvantage of random sampling and/or undersampling, making it difficult to compare different samples, while microarray-based technologies have a defined probe set, which is good for community comparisons (He et al. 2012b). Therefore, given the distinct advantages and disadvantages of both microarray-based and sequencing-based technologies, it is preferable that they be used complementarily for microbial community analysis in order to address fundamental questions in microbial ecology and environmental biology.

Acknowledgments This work conducted by ENIGMA (Ecosystems and Networks Integrated with Genes and Molecular Assemblies) (http://enigma.lbl.gov), a Scientific Focus Area Program at Lawrence Berkeley National Laboratory, was supported by the Office of Science, Office of Biological and Environmental Research, of the US Department of Energy under Contract No. DE-AC02-05CH11231 and by the Oklahoma Applied Research Support (OARS), Oklahoma Center for the Advancement of Science and Technology (OCAST), State of Oklahoma, through AR11-035 and AR062-034.

References

Gans J, Wolinsky M, Dunbar J. Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science. 2005;309:1387–90.
Hazen TC, Dubinsky EA, DeSantis TZ, Andersen GL, Piceno YM, Singh N, et al. Deep-sea oil plume enriches indigenous oil-degrading bacteria. Science. 2010;330:204–8.
He Z, Gentry TJ, Schadt CW, Wu L, Liebich J, Chong SC, et al. GeoChip: a comprehensive microarray for investigating biogeochemical, ecological and environmental processes. ISME J. 2007;1:67–77.
He Z, Deng Y, Van Nostrand JD, Tu Q, Xu M, Hemme CL, et al. GeoChip 3.0 as a high-throughput tool for analyzing microbial community composition, structure and functional activity. ISME J. 2010a;4:1167–79.
He Z, Xu M, Deng Y, Kang S, Kellogg L, Wu L, et al. Metagenomic analysis reveals a marked divergence in the structure of belowground microbial communities at elevated CO2. Ecol Lett. 2010b;13:564–75.
He Z, Van Nostrand JD, Deng Y, Zhou J. Development and applications of functional gene microarrays in the analysis of the functional diversity, composition, and structure of microbial communities. Front Environ Sci Engin China. 2011;5:1–20.
He Z, Deng Y, Zhou J. Development of functional gene microarrays for microbial community analysis. Curr Opin Biotechnol. 2012a;23:49–55.
He Z, Van Nostrand JD, Zhou J. Applications of functional gene microarrays for profiling microbial communities. Curr Opin Biotechnol. 2012b;23:460–6.
Li X, Deng Y, Li Q, Lu C, Wang J, Zhang H, et al. Shifts of functional gene representation in wheat rhizosphere microbial communities under elevated ozone. ISME J. 2013;7(3):660–71.
Liang Y, Van Nostrand JD, N’Guessan LA, Peacock AD, Deng Y, Long PE, et al. Microbial functional gene diversity with a shift of subsurface redox conditions during in situ uranium reduction. Appl Environ Microbiol. 2012;78:2966–72.
Lu Z, He Z, Parisi VA, Kang S, Deng Y, Van Nostrand JD, et al. GeoChip-based analysis of microbial functional gene diversity in a landfill leachate-contaminated aquifer. Environ Sci Technol. 2012;46:5824–33.
Rhee S-K, Liu X, Wu L, Chong SC, Wan X, Zhou J. Detection of genes involved in biodegradation and biotransformation in microbial communities by using 50-mer oligonucleotide microarrays. Appl Environ Microbiol. 2004;70:4303–17.
Torsvik V, Ovreas L, Thingstad TF. Prokaryotic diversity – magnitude, dynamics, and controlling factors. Science. 2002;296:1064–6.
Van Nostrand JD, Wu L, Wu W-M, Huang Z, Gentry TJ, Deng Y, et al. Dynamics of microbial community composition and function during in situ bioremediation of a uranium-contaminated aquifer. Appl Environ Microbiol. 2011;77:3860–9.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci USA. 1998;95:6578–83.
Wu L, Thompson DK, Li G, Hurt RA, Tiedje JM, Zhou J. Development and evaluation of functional gene arrays for detection of selected genes in the environment. Appl Environ Microbiol. 2001;67:5780–90.
Xue K, Wu L, Deng Y, He Z, Van Nostrand J, Robertson PG, et al. Functional gene differences in soil microbial communities from conventional, low-input, and organic farmlands. Appl Environ Microbiol. 2013;79:1284–92.
Yang Y, Wu L, Lin Q, Yuan M, Xu D, Yu H, et al. Responses of the functional structure of soil microbial community to livestock grazing in the Tibetan alpine grassland. Glob Chang Biol. 2013;19:637–48.
Zhou J, Deng Y, Luo F, He Z, Tu Q, Zhi X. Functional molecular ecological networks. mBio. 2010;1(4):e00169.
Zhou J, Xue K, Xie J, Deng Y, Wu L, Cheng X, et al. Microbial mediation of carbon-cycle feedbacks to climate warming. Nat Clim Chang. 2012;2:106–10.

GHOSTM

Yutaka Akiyama
Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan

Definition

GHOSTM is a homology search tool developed for metagenomics and accelerated by GPU computing. GHOSTM can be used as an alternative to the BLASTX program, which searches protein databases using a translated nucleotide query. The GHOSTM system achieved calculation speeds that were 130 times faster than BLAST with 1 GPU. It also had a calculation speed that was 3.4 times faster than BLAT with higher search sensitivity. GHOSTM is distributed under the MIT license and its source code is available for download at http://code.google.com/p/ghostm/.

Introduction

In metagenomic analysis, the DNA sequence fragments obtained from environmental samples frequently include DNA sequences from many different species, and closely related reference
exception of the low-score hits, GHOSTM successfully identified more than 90 % of the hits identified by SSEARCH. This result suggests that GHOSTM is sufficiently accurate for general usage.

The computational times of BLAST, BLAT, and GHOSTM for 100 thousand reads are shown in Table 1. Each query read is 60 to 75 bp long, and the search target is the KEGG Genes (“genes.pep”) database (Kanehisa et al. 2010), which is approximately 2.5 GB in size. The GHOSTM program achieved a calculation speed approximately 130 and 400 times faster than single-threaded BLASTX when using 1 GPU and 4 GPUs, respectively. Moreover, GHOSTM was approximately 3.4 times faster than BLAT despite its higher search sensitivity. GHOSTM achieves both high search speed and high search sensitivity compared with previous homology search tools.

GHOSTM, Table 1 Comparison of search speed

Program | #GPUs | Time (s) | Acceleration ratio
GHOSTM (K = 4) | 1 | 2,855 | 129.5
GHOSTM (K = 4) | 4 | 909 | 406.7
BLAT | | 9,898 | 37.3
BLASTX (1 thread) | | 369,678 | 1
BLASTX (4 threads) | | 102,255 | 3.6
Horizontal Gene Transfer and Bacterial Diversity

Chitra Dutta (1) and Munmun Sarkar (2)
(1) Structural Biology & Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, West Bengal, India
(2) CSIR-Indian Institute of Chemical Biology, Kolkata, India

Synonyms

Lateral gene transfer

Definition

Horizontal gene transfer (HGT) is the process in which genetic material is transmitted between two organisms that are not parent and offspring. HGT is pervasive among bacteria, even among very distantly related ones. Through transmission of distinct physiological traits from one organism to another, it may cause drastic changes in the ecological and pathogenic character of bacterial species and thereby may catalyze the diversification of bacterial lineages.

are tiny, unicellular organisms with relatively small genomes, variations observed in their cellular architectures, metabolic properties, and ecological preferences are remarkable. Such enormous diversity may be attributed to the extremely dynamic genomes of bacteria that evolve rapidly through alteration, acquisition, deletion, and rearrangements of relevant genetic information through various molecular mechanisms. These mechanisms include not only the processes of internal modification of genetic materials like mutation or homologous recombination but also exchange of specific sets of genes with other species through the process of horizontal transfer (Ochman et al. 2000). Mutations usually lead to slow, subtle, but continuous refinement and alteration of existing genes that may foster diversification and speciation of microorganisms on an evolutionary time scale. HGT, on the contrary, is capable of introducing abrupt large-scale changes in the gene repertoire of an organism that may confer novel physiological traits to the recipient and enable an organism to explore new ecological niches and can even generate new variants of bacterial strains by “genetic quantum leaps” (Groisman and Ochman 1996).

Mechanisms of HGT
There are three principal mechanisms for interspecies transmission of DNA elements in HGT (Ochman et al. 2000):
(i) Transformation: uptake of a naked DNA element from the environment
(ii) Transduction: the bacteriophage-mediated transmission of genetic materials between organisms recognized by the phage
(iii) Conjugation: transfer of DNA from the donor to the recipient through cell-to-cell contact via a sex pilus

However, mere insertion of the donor DNA element into a recipient cytoplasm does not ensure a successful HGT, unless this foreign DNA sequence becomes stable in the host chromosome. Though the transfer or uptake of a short DNA sequence is usually indiscriminate with respect to the functional or compositional features of the transmitted sequence, stabilization of this foreign DNA element in the host organism depends critically on the compatibility of the transferred genes with the transcriptional and translational machinery of the host (Dutta and Pan 2002). Stable incorporation of the newly acquired DNA into the host genome can be mediated by any of the following processes: (i) homologous recombination, which normally limits the process to closely related organisms; (ii) persistence as an episome, if favored by natural selection; (iii) integration mediated by mobile genetic elements; and (iv) illegitimate incorporation through chance events of double-strand break repair.

Factors Regulating the Events of HGT and Their Outcomes

Depending on the organisms involved and the gene transfer mechanisms that are operational, there are a number of factors that can foster or limit the transfer, uptake, stabilization, and expression of foreign DNA molecules in bacteria. Factors that may foster an event of HGT (Wiedenbeck and Cohan 2011) include both mechanistic and functional aspects. The phylogenetic closeness of the donor and the recipient often facilitates HGT, since close relatives are likely to have greater sequence identity and hence a higher probability of homologous recombination as well as homology-facilitated illegitimate recombination (HFIR). Bacteria with the same restriction–modification system can more easily share a phage or a plasmid and exchange their DNA elements. Short DNA segments (carrying one to several genes) usually have a greater probability of undergoing a successful adaptive HGT, even across deeply divergent bacteria, as a short segment may allow an organism to selectively pick up a niche-transcending gene or set of genes without acquiring the niche-specifying genes of the donor. Furthermore, a short DNA segment may also survive in a host with a distinct restriction–modification system, as it is less likely to contain a given recognition sequence and may thereby be better protected from cleavage by the restriction system of the host. And, needless to say, a niche-transcending HGT that provides an important adaptation to a recipient will always have a selective advantage.

Among the mechanistic barriers limiting unregulated uptake of foreign DNA in bacteria are the lack of similarity between the donor and the recipient, which may prohibit the integration of a new sequence into a replicating genetic unit; surface exclusion, which may create an effective barrier against conjugative transfer into cells; and the presence of distinct restriction/modification systems in the host (Thomas and Nielsen 2005).

A protein’s connectivity may be another important factor for the transferability of genes across organisms. The complexity hypothesis (Jain et al. 1999) predicts a low rate of transfer for genes whose products are involved in many complex interactions. Transfer of only one part of a complex set of coadapted structures is likely to bring about an incompatibility and loss of function. It is thought that bacterial genes may be broadly classified into two categories according to their transferability (Nakamura et al. 2004): (i) less transferable “informational” genes involved in replication, translation, and transcription and (ii) frequently transferable “operational” genes involved in metabolism. It has also been reported that among operational genes, those involved in cell surface, DNA binding, and
pathogenicity-related functions have a higher probability of HGT than genes related to amino acid biosynthesis, biosynthesis of cofactors, energy metabolism, intermediary metabolism, fatty acid and phospholipid metabolism, and nucleotide biosynthesis.

Any recipient organism would also try to resist an event of HGT that might incur harmful pleiotropic effects. The deleterious side effects of a new acquisition often drive natural selection toward “domesticating” the acquired DNA, i.e., toward ameliorating its negative fitness effects (Wiedenbeck and Cohan 2011). Newly acquired genes may have higher rates of evolution than other genes in the genome. Another mechanism for domesticating a horizontally acquired adaptation involves initial repression of the acquired gene(s) in the host genome by histone-like nucleoid-structuring proteins (H-NS) (Dorman 2004). The compositional differences between a donor segment and the recipient are diminished over time as incorporated genes are subjected to the mutational bias of the host (Lawrence and Ochman 1997).

Bacterial Diversity Incurred by HGT

HGT is thought to be a prime contributor to bacterial evolution. As more and more genome sequences are determined, it is becoming clear that cross-species transmission of genetic information through HGT is pervasive among bacteria, that it may occur across vast phylogenetic distances, and that it may confer novel phenotypes and functions on the host organism by introducing fully functional genes and gene clusters. Unlike point mutations, which can only adjust preexisting phenotypes, HGT may result in drastic changes in the metabolic, pathological, or ecological character of a microbial species, thereby allowing effective and competitive exploitation of new niches (Lawrence 1999; Hacker and Kaper 2000). In cases where habitat differences suggest ecological differentiation between close relatives, a genome-based analysis often identifies one or more events of HGT as the primary cause of the ecological divergence. Some of the niche-transcending traits that are commonly introduced in bacterial species through HGT are as follows.

Novel Metabolic Traits and Niche Adaptation

In bacteria, a substantial portion of species-specific functions can be attributed to HGT. Through HGT, divergent bacterial populations may share an adaptation that transcends their differences in cellular architectures, physiological capabilities, and ecological niches. For instance, enterotoxigenic Escherichia coli that attacks the epithelial cells of the small intestine shares the class 5 fimbriae with Burkholderia cepacia that resides in the lungs of cystic fibrosis patients and attacks the respiratory epithelium. On the other hand, closely related bacteria or even strains of the same species may exhibit radically different metabolic, physiological, or pathogenic traits – thanks to HGT. Bacillus anthracis (strain Ames ancestor), Bacillus cereus (ATCC1098), and Bacillus thuringiensis (serovar konkukian str. 97–27) are all considered a single species, as they show more than 94 % ANI and have highly syntenic gene repertoires. And yet they are drastically different in their phenotypes – a highly virulent pathogen and potentially lethal bioterror agent, a source of food poisoning, and an eco-friendly organic biopesticide, respectively (Doolittle and Papke 2006).

HGT, in many cases, endows the recipient with novel metabolic capabilities that enable it either to invade a new niche or to improve its performance in its current niche (Cohan and Koeppel 2008). For example, acquisition of the lac operon has enabled Escherichia coli to utilize the milk sugar lactose as a carbon source and thereby to explore a new niche, the mammalian colon, where it has established a commensal relationship (Ochman et al. 2000). An event of HGT may even allow for conversion of the recipient into a radically different organism that may inhabit niches completely unexplorable by organisms relying on mutational processes alone. Examples include the aerobic methanotrophs that have gained the ability to synthesize critical cofactors for
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics

Introduction

It is widely appreciated today that viruses are a dominant and critical part of Earth’s biosphere. Yet despite the major advances in the study of environmental viruses, in most cases our knowledge of which viruses go with which hosts is meager. In the classic phage isolation technique, known as the plaque assay, a confluent layer of host cells is infected with a low density of viral particles. When a virus infects a cell within this “lawn” of host cells, the cell lyses, and new viral particles infect adjacent cells, thereby creating a clearing, or plaque, in the lawn. This technique for isolating viruses requires that the host of the virus be culturable. However, given that >99% of microbes on Earth cannot be cultured at this time, the vast majority of phage-host systems cannot be investigated in the laboratory using these classical phage enrichment techniques. Consequently, little is known about the biology of most viruses and their host specificity in the wild.

Metagenomic studies of environmental viruses circumvent the cultivation limitation and therefore have offered a unique glimpse into the genetic composition of environmental viruses (Kristensen et al. 2010; Mokili et al. 2012). In low complexity environments such as natural acidophilic biofilms, metagenomic analysis can utilize antiviral defense systems known as

Single-Cell Genomics

Microfluidic devices are now routinely used to control and manipulate small volumes of liquid, including trapping and analyzing single cells (Kalisky et al. 2011). Once trapped, individual cells are lysed, and their genetic content can be probed. In the case of host-virus interaction, the genome of the virus forms a unique association with the bacterial cell (Fig. 1). Thus, in an ideal scenario, both the genome of the host and its virus would be sequenced. Although single-cell sequencing has been demonstrated (Kalisky et al. 2011; Kalisky and Quake 2011), for practical reasons, the number of cells that may be interrogated using this method remains at present quite low. As an alternative approach, it is possible to analyze single bacterial cells by PCR using microfluidic digital polymerase chain reaction (dPCR) arrays. This method, which is relatively high throughput, can currently interrogate several thousands of single cells within days.

Microfluidic Digital PCR

In microfluidic digital PCR a sample consisting of either DNA or cells is partitioned uniformly onto an “array” of nanoliter or subnanoliter PCR chambers, with each chamber ideally containing a single DNA molecule or a single cell
[Figure 1: image not reproduced. Recoverable labels: panel (a); scale bars of 150 µm; x-axis “Cycle” (1 to 45); “SSU rRNA gene”; “bacterial cell”.]

Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 1 End point fluorescence measured in a panel of a microfluidic digital PCR array. (a) The measured end point fluorescence from the rRNA channel (right half of each chamber) and the terminase channel (left half of each chamber) in a microfluidic array panel. Each panel in this array (one of twelve) consists of 765 reaction chambers of 150 × 150 × 270 µm³ (6 nL) each. Retrieved colocalizations are outlined in orange, and positive rRNA chambers randomly selected for retrieval are outlined in gray. FA indicates false alarm (a probable terminase primer-dimer). (b) Normalized amplification curves of all chambers in (a) after linear derivative baseline correction (red/viral, green/rRNA). (c) Specific physical associations between a bacterial cell and the viral marker gene resulting in colocalization include, for example, an attached or assembling virion, an injected DNA, an integrated prophage, or a plasmid containing the viral marker gene (Tadmor et al. 2011)
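
As a rough check on the volumes involved, the sketch below computes the chamber volume from the 150 × 150 × 270 µm chamber dimensions and the expected Ct shift relative to a 15 µL benchtop qPCR reaction. The function names are illustrative and do not come from the source, and the Ct estimate assumes ideal doubling of product in every cycle.

```python
import math


def chamber_volume_nl(x_um: float, y_um: float, z_um: float) -> float:
    """Volume of a rectangular chamber in nanoliters (1 nL = 1e6 cubic micrometers)."""
    return x_um * y_um * z_um / 1e6


def ct_shift(bulk_volume_nl: float, chamber_nl: float) -> float:
    """Cycles saved when one template sits in the smaller volume (assumes perfect doubling)."""
    return math.log2(bulk_volume_nl / chamber_nl)


volume = chamber_volume_nl(150, 150, 270)  # slightly above the nominal 6 nL
shift = ct_shift(15_000, 6)                # 15 µL = 15,000 nL versus the nominal 6 nL
print(f"{volume:.3f} nL per chamber, Ct reduced by about {shift:.1f} cycles")
```

The 2,500-fold volume ratio is what drives the shift: the same single template reaches the detection threshold roughly eleven cycles earlier because its starting concentration is 2,500 times higher.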
(Kalisky et al. 2011; Kalisky and Quake 2011). Each chamber is loaded with a mixture of primers and fluorescent probes that target the loci of interest. The advantage of performing quantitative PCR (qPCR) reactions in such tiny volumes is that the likelihood of contamination is reduced, and the fluorescent signal per PCR chamber is greatly intensified. In a standard benchtop qPCR reaction, for example, the reaction volume is 15 µL compared to a dPCR reaction volume of 6 nL. Therefore dPCR Ct values are reduced by about log2(2,500) ≈ 11.3 cycles. In addition, the large dilution factor ensures that the vast majority of dPCR chambers are free from spurious reactions and contaminating molecules such as residual genomic DNA that is intrinsic to some reagents. These factors together provide the sensitivity required to PCR amplify and detect single molecules. Once thermocycling is completed, chambers containing the targets of interest are identified via the fluorescent signal, sampled and post-amplified in the laboratory for sequencing using conventional benchtop PCR machines. An appealing aspect of this technology is that cells may be harvested directly from the environment and loaded onto a microfluidic dPCR array. This method therefore does not require that cells be cultured beforehand and does not depend on
gene expression, the position of the targets in the genome, or the physiological state of the cell at the time of harvest (Ottesen et al. 2006).

The first application of microfluidic digital PCR technology to study environmental bacteria involved colocalization of two genes present in the same individual bacterium (Ottesen et al. 2006). In this study, one marker targeted an important functional gene expressed by certain members of the microbial community resident in the hindgut of termites, and the second marker targeted the small subunit ribosomal RNA (SSU rRNA) gene that was used to phylogenetically identify the bacterium (also known as the 16S marker). By colocalizing and subsequently sequencing both markers from many individual cells, the identity of cells carrying the functional gene was ascertained in cases of repeated colocalizations.

To study host-virus interaction, the dPCR approach described above was extended to colocalize the SSU rRNA gene with a marker targeting a certain viral gene prevalent in the environment of interest, demonstrating proof-of-concept on the termite system (Tadmor et al. 2011). Targeting viruses, however, which are fundamentally different biological entities from bacteria, raised certain questions that needed to be addressed. First and foremost, unlike prokaryotes that have universal markers such as the SSU rRNA gene, viruses do not have a single shared gene that can be used as a universal marker (Rohwer and Edwards 2002). In fact, viral metagenomic studies have shown that viruses are likely the largest reservoirs of unknown genetic diversity, with the majority of putative viral sequences exhibiting no significant similarity to currently known genes (Edwards and Rohwer 2005; Kristensen et al. 2010; Mokili et al. 2012). To make matters worse, viruses are notorious for replicating their genetic material with borderline fidelity. Consequently, the definition of a viral gene in the environment is relatively fluid. Finally, many genes present in the genome of the virus may be of prokaryote origin, making them poor signature markers for the virus. Thus, to utilize digital PCR to study host-virus interaction, an unequivocal viral marker that is ubiquitous in the environment of interest should be identified.

Requirements from a Viral Marker Gene

Not all viral genes are suitable to be unequivocal markers of a viral entity. As an example, the integrase gene, which codes for an enzyme that is used by the virus to integrate into the genome, is prevalent not only among phages, but also among certain nonviral entities such as plasmids, pathogenicity islands, and integrons (Casjens 2003). Similar arguments apply to viral genes involving lysis, regulation of gene expression, and DNA replication in viruses (Casjens 2003). Casjens therefore argues that ideal “cornerstone” phage genes (or at least prophage genes) are genes involved in the assembly of the virion. Of these, genes that appear to be not only virus-specific but also particularly conserved are the large terminase subunit (TerL) and portal protein genes (Casjens 2003).

TerL genes have certain additional features that make them particularly attractive as viral markers. The TerL gene is a component of the DNA packaging and cleaving mechanism present in numerous double-stranded DNA phages (Rao and Feiss 2008). It contains an N-terminal ATPase domain, which is the “engine” of the DNA packaging motor, and a C-terminal nuclease domain (Rao and Feiss 2008). The ATPase domain of the TerL gene is conserved in a wide variety of dsDNA phages, including the eukaryotic herpes virus (Przech et al. 2003), suggesting it is an ancient viral domain (Rao and Feiss 2008). Indeed, Koonin et al., who define “hallmark viral genes” as “genes shared by many diverse groups of viruses with only distant homologs in cellular organisms and with strong indications of monophyly of all viral members of the respective gene families” and thus “can be viewed as distinguishing characters of the virus state” (Koonin et al. 2006), identified the ATPase subunit of the terminase gene as such a hallmark viral gene. Since TerL genes have particularly well-conserved functional residues and motifs (Rao and Feiss 2008), they are well suited for
degenerate primer design. At the same time, across biology, TerL genes do not share overall significant sequence similarity (Rao and Feiss 2008), thereby making them sensitive viral markers.

Targeting a “cornerstone” or “hallmark” gene of a virus may, however, be of questionable use if the selected marker tags a defective prophage. Since a necessary condition for the virus to be active is that its cornerstone gene be functional, it is important to verify that the cornerstone gene is under negative selection pressure (Nei and Kumar 2000). Nonfunctional genes may contain errant stop codons, frameshift mutations, or mutations in certain highly conserved residues essential for the proper function of the protein. Yet demonstrating that a particular family of TerL alleles from the environment of interest is under negative selection pressure does not guarantee that the virus is active in this environment, since a viral gene may remain functional while the prophage itself is defective. Such a situation can occur if there was insufficient time for point mutations to have accumulated in the gene of interest after the prophage was inactivated (Casjens 2003). Therefore, viruses carrying the viral marker may have been active only in recent evolutionary history. In an alternative scenario, the putative marker indeed degraded over time upon prophage inactivation, but it was subsequently repaired by a recombination event with another phage that was likely functional (Casjens 2003). In such a case the marker can serve as an indicator for indirect infection.

Another possibility is that the putative marker was recruited by the bacterium because it confers on the bacterium a competitive advantage, as is the case with lysogenic conversion genes. In this case the gene would remain under negative selection pressure, while the rest of the prophage degenerated (Boyd and Brüssow 2002; Casjens 2003). It is unlikely, however, that the host will recruit TerL genes since these are highly specialized motors required for virion synthesis. Alternatively, the putative marker could be part of a functional non-phage entity such as a gene transfer agent (GTA) or a bacteriocin (Casjens 2003). In the case of TerL genes, bacteriocins can be ruled out since these tail-like structures do not contain head-related proteins (Daw and Falkiner 1996). GTAs can only be ruled out if the entire genome of the putative viral entity is obtained. Full-length viral genomes may be obtained by means of single-cell sequencing techniques.

Identifying Ubiquitous Viral Markers

Although universally shared viral genes do not exist, it is beneficial to select a viral marker that is ubiquitous in the environment of interest. Ubiquitous markers not only have the potential to recover greater genetic diversity from the environment, but can possibly also be found in similar or related environments. Identification of a ubiquitous viral marker in the environment of interest, assuming one exists, is not straightforward and requires sophisticated metagenome data mining approaches.

To address this problem the authors developed a bioinformatic program called MetaCAT (metagenome cluster analysis tool), which employs a heuristic clustering and ranking approach that aims to identify the most abundant genes of a given class (e.g., viral genes) in a metagenome, without relying on superficial features such as gene annotation (Tadmor et al. 2011). The input to MetaCAT is a metagenome (either assembled translated contigs or raw nucleotide reads) and a reference library of known genes (e.g., all known viral genes). The output of MetaCAT is a list of known reference genes (derived from an input reference library) that were found to be present in the metagenome, ranked by their abundance in the metagenome. Abundance of a reported gene is defined as the number of metagenome gene objects or reads that yield significant alignments with respect to this gene. A key feature of MetaCAT is that it uses an iterative dynamic clustering algorithm to group putative homologous reference genes from the input reference library. The clustering is dynamic in the sense that it is performed on the fly based on the matches found in the given metagenome, thereby avoiding the loss of information that would occur if the reference library were clustered a priori. The clustering is performed iteratively until all
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 2 Schematic illustration of the MetaCAT algorithm. MetaCAT maps clusters of similar known reference genes to groups of metagenome gene objects or reads. MetaCAT defines two known reference genes as being similar or “related” if their footprint in the metagenome has a significant overlap. The abundance of a given cluster of related known reference genes in the metagenome is defined as the number of metagenome gene objects (or reads) with an E value below a given threshold found when BLASTing a representative from the gene cluster against the metagenome. The key feature of MetaCAT lies in its ability to cluster the list of known reference genes per metagenome and report a minimally redundant list of known genes that have putative homologs in the metagenome, ranked by their abundance in the metagenome. This list can then be used to generate hypotheses about the given metagenome. In this figure the left oval depicts a reference database of genes (black dots), and the right oval depicts a metagenome, with black dots representing metagenome gene objects. Hexagons in the reference database represent clusters of related reference genes identified by MetaCAT. Each hexagon is linked to a corresponding cluster of metagenome gene objects depicted by a circle of matching color
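
The footprint idea in this caption can be illustrated with a short sketch. This is not MetaCAT's implementation, which is not given in the source. It is a single greedy pass, not the iterative procedure the entry describes, and the overlap criterion (shared hits divided by the smaller footprint) and the 0.5 threshold are assumptions chosen only for illustration. The input is assumed to be a precomputed mapping from each reference gene to the set of metagenome gene objects it hit below the E value threshold.

```python
from typing import Dict, List, Set, Tuple


def rank_reference_genes(
    footprints: Dict[str, Set[str]],
    min_overlap: float = 0.5,  # assumed threshold, not taken from the source
) -> List[Tuple[str, List[str], int]]:
    """Greedy footprint-based clustering of reference genes.

    Returns (representative, members, abundance) tuples, most abundant first,
    where abundance is the size of the representative's footprint.
    """
    # Visit genes with the largest footprint first so they become representatives.
    order = sorted(footprints, key=lambda g: (-len(footprints[g]), g))
    clusters: List[Tuple[str, List[str]]] = []
    for gene in order:
        hits = footprints[gene]
        if not hits:
            continue  # gene has no significant hit in the metagenome
        for rep, members in clusters:
            shared = len(hits & footprints[rep])
            # Overlap relative to the smaller of the two footprints (assumed criterion).
            if shared / min(len(hits), len(footprints[rep])) >= min_overlap:
                members.append(gene)
                break
        else:
            clusters.append((gene, [gene]))
    return [(rep, members, len(footprints[rep])) for rep, members in clusters]


# Hypothetical reference genes and metagenome gene objects.
example = {
    "terL_A": {"c1", "c2", "c3", "c4"},
    "terL_B": {"c2", "c3", "c4"},
    "terL_C": {"c9", "c10"},
    "terL_D": set(),
}
ranked = rank_reference_genes(example)
for representative, members, abundance in ranked:
    print(representative, members, abundance)
```

Because representatives are created in order of decreasing footprint size, the returned list is already ranked by abundance and contains one entry per group of related reference genes.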
identified redundancy is removed. In this way the final reported list of ranked genes (or clusters of genes) contains orders of magnitude fewer elements than the reference library and is amenable to manual inspection (Fig. 2).

If gene annotation information is included in the reference database, this information will be provided by MetaCAT in the ranked list of genes, making it a straightforward task to identify genes of interest. As an example, Table 1 lists all the TerL genes identified by MetaCAT in the metagenome of the hindgut of a higher termite collected from Costa Rica (Tadmor et al. 2011).

Each known reference gene found by MetaCAT to be present in the metagenome can be paired with a metagenome gene object that yielded the lowest E value. This metagenome gene object is referred to as the “representative contig” of the known reference gene. By BLASTing the representative contigs corresponding to the top ranking candidate markers against other metagenomes from similar environments, or against genomes of organisms isolated from similar environments, it is possible to identify ubiquitous viral genes, if present (Fig. 3). Closely related genes found in multiple datasets are favorable candidates for putative ubiquitous markers.

In the case of the termite hindgut, the list of representative contigs corresponding to reported genes in Table 1 was BLASTed against the genome of Treponema primitia, a spirochete isolated from a lower termite collected from Northern California. Performing this analysis revealed that the representative contig of the top ranking gene found by MetaCAT indeed had significant hits (E value of ~10^-5) in the genome of T. primitia and mapped to two prophage-like elements. In this case, BLASTing the TerL gene from the prophage-like element back against the metagenome revealed close homologs with a similarity of 70 to 78% identity (Tadmor et al. 2011). Such a bootstrapping approach enabled the identification of a ubiquitous viral marker in the termite hindgut environment. Indeed, degenerate primers designed against this marker were able to amplify closely related homologs of this marker in other species of termites (as well as a wood-feeding roach) collected from various locations in the United States (Tadmor et al. 2011).

In this context, it is worthwhile to mention that MetaCAT is not restricted to ranking only viral
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Table 1 TerL genes identified by MetaCAT in the metagenome of a hindgut of a higher termite collected from Costa Rica. The following table lists TerL genes with minimal E values ≤ 10^-7 and abundances ≥ 5 that were identified by MetaCAT to have putative homologs in the metagenome of the hindgut of a Nasutitermes sp. termite. TerL genes are ranked by the number of metagenome gene objects yielding alignments with E value scores below 0.001. Also shown are the E value scores obtained when BLASTing the representative contig of each RefSeq TerL gene cluster against the genome of Treponema primitia (ZAS-2), using a cutoff value of 0.01, with values above this threshold marked as not significant (N.S.)

Organism name | Virus classification | No. of hits in metagenome | BLAST RefSeq gene against metagenome (E value) | BLAST representative contig against ZAS-2 (E value)
Clostridium phage phiC2 | dsDNA viruses Caudovirales; Myoviridae | 19 | 4.0E-40 | 2.0E-05
Streptococcus phage SMP | dsDNA viruses Caudovirales | 11 | 3.0E-34 | N.S.
Pseudomonas phage PaP3 | dsDNA viruses Caudovirales; Podoviridae | 7 | 2.0E-09 | N.S.
Enterobacteria phage lambda | dsDNA viruses Caudovirales; Siphoviridae | 6 | 2.0E-180 | N.S.
Enterobacteria phage HK022 | dsDNA viruses Caudovirales; Siphoviridae | 6 | 8.0E-69 | N.S.
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 3 Bioinformatic approach to identify ubiquitous viral markers in a given environment. In the proposed approach to identify putative ubiquitous viral markers, a metagenome from a given environment is first analyzed by MetaCAT to produce a list of candidate viral genes abundant in the metagenome. The corresponding representative contig of each candidate viral gene (defined as the contig yielding the lowest E value) is then BLASTed against a second dataset, such as another metagenome from a similar environment or a genome of an isolate from a similar environment. If the percent identity is sufficiently high allowing for degenerate primer design, this candidate can be considered a putative viral marker and can be further evaluated by experiment. If the percent identity is not high, but the E value is significant, a bootstrap-like approach may be employed where the contig/gene from the new dataset is BLASTed back against the original metagenome, thereby potentially revealing more conserved markers.
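
The decision step in this caption can be written as a small function. The sketch below is only an illustration: the caption gives no numeric cutoffs, so the 70% identity and 0.001 E value defaults are assumptions, and the function name and return strings are hypothetical.

```python
def classify_candidate(
    percent_identity: float,
    e_value: float,
    min_identity: float = 70.0,  # assumed cutoff for degenerate primer design
    max_e_value: float = 1e-3,   # assumed significance cutoff
) -> str:
    """Decide what to do with a candidate after BLASTing its representative
    contig against a second dataset from a similar environment."""
    if e_value > max_e_value:
        return "not significant: drop candidate"
    if percent_identity >= min_identity:
        return "putative marker: design degenerate primers"
    return "bootstrap: BLAST the new hit back against the original metagenome"


print(classify_candidate(percent_identity=75.0, e_value=1e-20))
print(classify_candidate(percent_identity=40.0, e_value=1e-5))
print(classify_candidate(percent_identity=40.0, e_value=0.5))
```

Checking significance first means that a hit with a poor E value is never promoted to a primer-design candidate, whatever its percent identity over a short alignment.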
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 4 Workflow using the microfluidic digital PCR array for host-virus colocalization in a novel environmental sample (Tadmor et al. 2011)
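
One step in a workflow of this kind is deciding whether a host ribotype that recurs among the retrieved colocalizations could be explained by chance, given how common that ribotype is on the array. The entry does not state the statistical model used, so the sketch below assumes a simple binomial model: each retrieved colocalization independently carries the ribotype with probability equal to its estimated frequency on the array. The function name and the example numbers are hypothetical.

```python
from math import comb


def repeat_p_value(n_retrieved: int, n_repeats: int, ribotype_freq: float) -> float:
    """P(X >= n_repeats) for X ~ Binomial(n_retrieved, ribotype_freq)."""
    return sum(
        comb(n_retrieved, k)
        * ribotype_freq**k
        * (1 - ribotype_freq) ** (n_retrieved - k)
        for k in range(n_repeats, n_retrieved + 1)
    )


# Hypothetical case: a ribotype making up 5% of rRNA-positive chambers
# appears in 3 of 10 retrieved colocalizations.
p = repeat_p_value(n_retrieved=10, n_repeats=3, ribotype_freq=0.05)
print(f"p = {p:.4f}")
```

The ribotype frequency itself would come from the randomly sampled rRNA-positive chambers, so its uncertainty carries over into the p-value when that reference library is small.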
genes, but it is possible to define other taxonomic groups as input reference libraries. For example, one can use MetaCAT to find the most abundant genes in a given environment involved in a certain metabolic pathway, the most abundant mitochondrion-related genes in a given sample, the most abundant antibiotic genes in a soil sample, and so on. MetaCAT can therefore be thought of as a useful tool for generating hypotheses regarding a given environment. (Requests to obtain a beta version of MetaCAT may be sent to arbel.tadmor@tron-mainz.de.)

Colocalizing Viral-SSU rRNA Genes on Digital PCR Arrays

Once a viral marker has been selected, a diversity of this marker can be retrieved from various bioinformatic sources (e.g., metagenomes and sequenced genomes), and degenerate primers targeting the marker of interest may be designed. Colocalization of viral genes is, however, complicated by the fact that the low replication fidelity of viruses makes it unlikely to recover the exact same allele twice from the dPCR arrays, contrary to the case of colocalizing two bacterial genes. P-values can, nevertheless, still be assigned to repeated SSU rRNA ribotypes retrieved from a given array, irrespective of the paired gene, using knowledge of the frequency of the given ribotype on the array. It is possible to estimate ribotype frequencies by randomly selecting chambers positive for the SSU rRNA gene and constructing a phylogenetic library of array ribotypes (Tadmor et al. 2011). Host-phage cophylogeny can then be reconstructed from genuine colocalizations, providing a unique glimpse
into the evolutionary dynamics of the system and shedding light on the flow of viral genes between hosts in the given environment. An overview of the workflow using dPCR to colocalize host-virus genes is provided in Fig. 4.

Summary and Outlook

The method outlined in this review provides a general scheme for analyzing host-virus interactions at the single-cell level without having to culture either host or virus. The method first involves a bioinformatic analysis of a metagenomic dataset or datasets from the environment of interest to recover a ubiquitous viral marker. This marker is then colocalized with a universal gene identifying the host by means of dPCR performed on single cells. The methods presented in this review are general and can be applied to other environments.

Cross-References

References

Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320(5879):1047–50.
Boyd EF, Brüssow H. Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. Trends Microbiol. 2002;10(11):521–9.
Casjens S. Prophages and bacterial genomics: what have we learned so far? Mol Microbiol. 2003;49(2):277–300.
Daw MA, Falkiner FR. Bacteriocins: nature, function and structure. Micron. 1996;27(6):467–79.
Edwards R, Rohwer F. Viral metagenomics. Nat Rev Microbiol. 2005;3(6):504–10.
Kalisky T, Quake SR. Single-cell genomics. Nat Methods. 2011;8(4):311–4.
Kalisky T, Blainey P, Quake SR. Genomic analysis at the single-cell level. Annu Rev Genet. 2011;45:431–45.
Koonin E, Senkevich T, Dolja V. The ancient virus world and evolution of cells. Biol Direct. 2006;1(1):29.
Kristensen DM, Mushegian AR, Dolja VV, Koonin EV. New dimensions of the virus world discovered through metagenomics. Trends Microbiol. 2010;18(1):11–9.
Mokili JL, Rohwer F, Dutilh BE. Metagenomics and future perspectives in virus discovery. Curr Opin Virol. 2012;2(1):63–77.
Nei M, Kumar S. Molecular evolution and phylogenetics. USA: Oxford University Press; 2000.
Ottesen E, Hong J, Quake S, Leadbetter J. Microfluidic digital PCR enables multigene analysis of individual environmental bacteria. Science. 2006;314(5804):1464–7.
Przech AJ, Yu D, Weller SK. Point mutations in exon I of the herpes simplex virus putative terminase subunit, UL15, indicate that the most conserved residues are essential for cleavage and packaging. J Virol. 2003;77(17):9613–21.
Rao VB, Feiss M. The bacteriophage DNA packaging motor. Annu Rev Genet. 2008;42:647–81.
Rohwer F, Edwards R. The phage proteomic tree: a genome-based taxonomy for phage. J Bacteriol. 2002;184(16):4529–35.
Tadmor AD, Ottesen EA, Leadbetter JR, Phillips R. Probing individual environmental bacteria for viruses by using microfluidic digital PCR. Science. 2011;333(6038):58–62.

Human Gut Microbial Genes by Metagenomic Sequencing

Synonyms

Genes in the human gut microbial community; Metagenome of the human gut microbiota

Definition

A gene is identified in human distal gut (colon) microbes when reads from high-throughput sequencing of fecal samples are assembled and an open reading frame (ORF) is predicted from the resulting DNA sequence. Such a gene could usually be mapped to a group of bacterial species and linked to certain functions. Metagenomic
studies on other parts of the gastrointestinal tract are often performed invasively using animals and are not discussed here.

Introduction

The human gut has long been known to contain microbial species. Until the advent of high-throughput metagenomic sequencing, however, these mysterious microbes largely eluded interrogations by their human host. Recent advancements described here and in other entries reveal awe-inspiring complexity, dynamics, and significance of the gut microbiota.

Eubacteria dominate the microbial community in the human gut (Scanlan and Marchesi 2008; Marchesi 2010; Parfrey et al. 2011). Both eubacterial and archaebacterial species are routinely classified to genus level according to their 16S rRNA gene sequences. Unfortunately, taxonomic classification of commensal eukaryotes in the gut has remained a tedious process (Parfrey et al. 2011). As a consequence, our understanding of the eukaryotic minorities in the gut lags far behind that of the bacterial communities. The term “gut microbes” is equivalent to “gut bacteria” hereafter.

Metagenomic sequencing of total DNA extracted from fecal samples constitutes a key step in forging our understanding of gut bacteria beyond taxonomy. The approach allows researchers to obtain complete genome sequences, identify genes, and predict functions. Such metagenomic information is especially precious for those bacteria that are yet to succumb to laboratory culture conditions.

This overview is intended to briefly summarize our current roll call of the various genes present in the human gut flora as well as the functional relevance of these genes to the microorganisms and human beings under normal and perturbed states.

Identification of Gut Microbial Genes

Next-generation, high-throughput, and cost-efficient short-read data (mainly produced by Illumina sequencing technology) have come of age in metagenomic studies. Considering the nonuniform abundance of gut microbial species and the high level of discordance between individual humans, deep sequencing and wide sampling are critical for a comprehensive understanding of the human gut flora. In 2010, high-throughput short-read sequencing was introduced into human gut microbiome research and showed great potential (Qin et al. 2010).

Bacterial DNA obtained from human fecal samples could be readily used for high-throughput sequencing on the Illumina platform. After a few quality control steps, the short reads from each sample were assembled de novo (Fig. 1), using software such as SOAPdenovo (Kultima et al. 2012). Protein-coding genes were then predicted from the assembled contigs (Kultima et al. 2012). Genes from multiple samples were pooled together and compared with one another to remove redundancy. Finally, a nonredundant gene catalog was generated and could serve as a basis for functional and phylogenetic analyses (Fig. 1) (Qin et al. 2010).

As an alternative to de novo assembly, mapping of reads to an existing gene catalog allows convenient identification of genes in a sample. Naturally, such a time-saving approach requires the gene catalog to encompass a complete set of high-quality reference genes.

Total Gene Number and Its Variability

Metagenomic sequencing of 124 Europeans (as part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project) resulted in a gut microbial gene catalog containing 3.3 million nonredundant genes (Qin et al. 2010). Although this gene number might still increase as more samples are sequenced, especially those from patients with a particular disease (e.g., in Qin et al. 2012), this number of known gut microbial genes is already 150-fold greater than the number of genes encoded by the human genome.

Two hundred ninety-four thousand one hundred ten of the gut microbial genes were found in at least 50 % of individuals, which were termed
Human Gut Microbial Genes by Metagenomic Sequencing 267 H
nonredundant genes, of which 204,056 ± 3,603 (around 38 %) were common genes. Thus, significant interpersonal differences exist in terms of the number, type, and sequence of the genes.
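The notion of a "common" gene used here (a gene found in at least 50 % of individuals) can be illustrated with a minimal Python sketch. The function name, the dictionary layout, and the toy gene IDs are assumptions made for illustration only; they are not taken from Qin et al. (2010).

```python
def common_genes(presence, min_fraction=0.5):
    """Return the genes detected in at least `min_fraction` of individuals.

    `presence` maps each individual ID to the set of gene IDs detected in
    that individual's sample (a hypothetical layout, for illustration only).
    """
    n_individuals = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, c in counts.items() if c / n_individuals >= min_fraction}


# Toy example with four individuals and made-up gene IDs.
toy = {
    "ind1": {"geneA", "geneB", "geneC"},
    "ind2": {"geneA", "geneC"},
    "ind3": {"geneA", "geneD"},
    "ind4": {"geneA", "geneB"},
}
print(sorted(common_genes(toy)))
```

Raising `min_fraction` narrows the set toward genes shared by nearly everyone, which is one simple way to see how the size of the shared gene pool depends on the chosen cutoff.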
nutrition has only become realized through studies of the gut (fecal) microbiota. Various digested or indigestible components of the diet arrive at the colon and constitute a major environmental factor shaping the gut microbial ecosystem (Fig. 2). Complex carbohydrates are fermented by bacteria of the phylum Firmicutes, producing short-chain fatty acids (SCFAs, including acetate, propionate, and butyrate) for use by the host cells. In contrast, if the host diet relies more on simple sugars, as has become common in the United States, enzymes for metabolizing mono- and disaccharides could be more prominent in the gut flora (Yatsunenko et al. 2012). Similarly, dietary intake of amino acids and vitamins appears to modulate the balance between their catabolism and anabolism by gut bacteria.
Bile acids (BAs) secreted by the host to emulsify dietary fats make a strong impact on the gut microbiota. On one hand, primary BAs are known to be converted to more effective secondary BAs through 7α-dehydroxylation by intestinal bacteria. On the other hand, with their amphipathic properties, BAs show a strong antimicrobial activity. Rats on a diet supplemented with the BA cholic acid recapitulated effects of high-fat diet on the gut flora (reported in mice), namely, an increased ratio of Firmicutes to Bacteroidetes and a declining microbial diversity (Islam et al. 2011; Ley et al. 2005). Thus, elevated bile secretion stimulated by high-fat diet likely plays a major role in reshaping the gut microbiome (Islam et al. 2011).
Antibiotic administration could lead to profound and long-lasting alterations in the intestinal microbiome (Dethlefsen and Relman 2010; Cho et al. 2012). The distortion is typically manifested as a sharp decrease in microbial diversity accompanied by an overgrowth of Proteobacteria, especially in pathogenic Enterobacteriaceae populations (Nyberg et al. 2007). Antibiotic intake exerts a strong selective pressure on the intestinal flora and increases transfer of antibiotic resistance genes (ARGs) among gut microbes, leading to an accumulation of resistant strains (Sullivan et al. 2001; Schjørring and Krogfelt 2011). These antibiotic-resistant pathogens and nonpathogens could persist in the gut well after removal of the selective pressure.
Notably, current evidence suggests that while commensal bacterial species vary between hosts of different genetic background and environmental factors, the individuality is smaller at the functional level, i.e., similar genes in different gut bacteria could serve similar purposes and are selected by similar factors (Spor et al. 2011).

Gut Microbiota and Diseases

A growing body of evidence suggests that the gut microbial flora is central to human health. Although we are very far from a definitive comprehension of healthy versus diseased gut
microbiota, it is fair to say that a productive and well-balanced symbiotic relationship with our little gut residents is of key importance for us human beings. Altered gut microbial composition has been reported in various gut-related diseases such as colorectal cancer and inflammatory bowel diseases (IBDs) and extends to conditions like anorexia, allergies, cardiovascular diseases, and even autism (Clemente et al. 2012; Tremaroli and Bäckhed 2012). These diseases are more or less accompanied by dysbiosis, a state where benign or beneficial gut microbes are overtaken by pathogens and normal processes like fermentation, synthesis of metabolites, barrier function, etc. become disrupted.
On a metagenomic level, the gut microbiome of leptin-deficient obese mice (ob/ob) showed an increased capacity for energy harvest from the gut, encoding enzymes that could initiate breakdown of otherwise indigestible polysaccharides (Turnbaugh et al. 2006). However, the end products of bacterial fermentation, SCFAs, especially butyrate, appear protective and negatively regulate inflammation in the gut (Maslowski et al. 2009). Butyrate synthesis genes in the gut flora were depleted in diabetes and symptomatic atherosclerosis patients compared to healthy controls (Qin et al. 2012; Karlsson et al. 2012). Together with studies on butyrate and IBDs (Thibault et al. 2010; Scharl and Rogler 2012), current results point to a key role of butyrate metabolism in colon health, with extensive interplays between the gut flora and the host.
Another common theme in gut microbial homeostasis may be the handling of oxidative stress. The gut metagenome of diabetes patients was enriched for genes involved in sulfate reduction and oxidative stress resistance (Qin et al. 2012). Atherosclerosis was associated with an underrepresentation of phytoene dehydrogenase gene and a matching reduction in the antioxidant β-carotene in patient serum (Karlsson et al. 2012). Oxidative stress is also known to contribute to IBDs such as Crohn’s disease (Iborra et al. 2011).

Metagenome-Wide Association Study for Diagnosis

To go beyond a descriptive account of genes present in healthy or unhealthy human gut microbiota, it could be very helpful to perform a metagenome-wide association study (MGWAS) for identification of disease markers and evaluation of disease prospect (Fig. 3). A standard genome-wide association study (GWAS) looks for genetic variants, typically single-nucleotide polymorphisms (SNPs) in a genome, and relates them to a phenotype such as a disease. MGWAS stems from the concept of “metagenome.” Accordingly, the relative abundance of a gene in a metagenome, instead of the presence of a SNP, is used to establish correlation with disease.
The proof-of-principle study for MGWAS was performed on type 2 diabetes mellitus (Qin et al. 2012). In a reference gene catalog updated from previous work (Qin et al. 2010), 3,298,811 genes were found in the healthy or diabetic cohorts (total n = 145). After filtering for shared genes and clustering based on numerical relationships and phylogeny, the dimensionality was reduced to 1,138,151 genes. The first stage of analysis concluded with 278,168 statistically significant gene markers for diabetes. In Stage II, new samples (n = 100 for each cohort) were sequenced and profiled with the markers from Stage I. The analysis further reduced the number of gene markers to 52,484. For lowest error rate, as few as 50 gene markers were found to be optimal and were successfully applied to diabetic/nondiabetic classification of 23 additional samples. Besides gene markers, markers from functional annotations (KEGG orthologous groups, eggNOG orthologous groups) and metagenomic linkage groups (MLG) that represent taxonomic units also present valuable information (Qin et al. 2012).
In addition to diabetes, the same study identified gene markers and orthologous group markers for IBDs and for enterotypes (Qin et al. 2012), raising the stakes for routine application of MGWAS to other microbiota-related diseases. It remains to be seen how factors such as age, gender, and BMI (body mass index) confound MGWAS in various diseases, especially during initial marker
Human Gut Microbial Genes by Metagenomic Sequencing, Fig. 3 Metagenome-wide association study for gut flora-related diseases. For each sample, sequencing reads are mapped to the reference gene catalog (Fig. 1) and relative abundance of genes is computed. Genes and species that are under- or overrepresented in patients are selected following a rigorous procedure. The analysis results in gene markers and metagenomic linkage groups that can be used for diseased/undiseased classification and potentially for prognosis and diagnosis
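The first step summarized in the caption, turning mapped reads into per-gene relative abundances, can be sketched in Python as follows. This is only an illustration under stated assumptions: the length normalization, the function names, and the toy numbers are not taken from Qin et al. (2012), and the simple difference of cohort means stands in for the rigorous statistical selection used in the actual study.

```python
def relative_abundance(read_counts, gene_lengths):
    """Convert mapped-read counts into length-normalized relative abundances.

    read_counts  -- dict: gene ID -> number of reads mapped to that gene
    gene_lengths -- dict: gene ID -> gene length in bases
    Both layouts are illustrative assumptions.
    """
    per_base = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(per_base.values())
    if total == 0:
        return {g: 0.0 for g in per_base}
    return {g: v / total for g, v in per_base.items()}


def cohort_mean(profiles, gene):
    """Mean relative abundance of one gene across a list of sample profiles."""
    return sum(p.get(gene, 0.0) for p in profiles) / len(profiles)


def mean_abundance_difference(case_profiles, control_profiles, gene):
    """Mean abundance in cases minus mean abundance in controls (toy statistic)."""
    return cohort_mean(case_profiles, gene) - cohort_mean(control_profiles, gene)


# Made-up gene lengths and read counts for one case and one control sample.
lengths = {"g1": 1000, "g2": 500, "g3": 2000}
case = relative_abundance({"g1": 10, "g2": 20, "g3": 0}, lengths)
control = relative_abundance({"g1": 30, "g2": 5, "g3": 20}, lengths)
print(case)
print(mean_abundance_difference([case], [control], "g2"))
```

In a real analysis the per-gene comparison would be replaced by a proper statistical test with multiple-testing correction, applied across cohorts of many samples rather than single profiles.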
palates, tonsils, throat, and saliva. The microbiome of the oral cavity (Dewhirst et al. 2010) and its niches have been examined based on 16S rRNA sequencing (Aas et al. 2005; Bik et al. 2010; Human Microbiome Project 2012a, b). The metagenome of the oral cavity has been studied to a limited degree prior to 2012 due to the complexity of the site (Alcaraz et al. 2012; Belda-Ferre et al. 2012; Xie et al. 2010). More than 700 prevalent species comprise the oral microbiome, but many taxa are present at less than 0.1 % of the microbial population (Dewhirst et al. 2010). As oral bacterial reference genomes are becoming available, primarily through the efforts of the Human Microbiome Project (Human Microbiome Project 2012a, b), it is becoming possible to attribute metagenomic sequences to organisms at genus and species level (Martin et al. 2012). The anchoring of metagenome sequence information to specific organisms in a taxonomic framework is key to developing a full description of the bacteria-bacteria and bacteria-host interactions that underlie human oral health and disease.
The Human Oral Microbiome Database (HOMD) was developed in response to the lack of any naming or taxonomic scheme for the thousands of human oral 16S rRNA clone sequences that were being generated in the early 2000s and dumped into GenBank without any taxonomic anchor. Investigators were publishing manuscripts using clone names (such as BU063) as provisional taxonomic names. The only way to phylogenetically place an oral clone was to personally align sequences and generate one’s own phylogenetic trees. We recognized that there was a need for a 16S rRNA-based provisional taxonomic scheme to name and provide reference sequences for unnamed taxa known only from clone or isolate 16S rRNA sequences. The naming scheme had to be provisional because formal naming under the bacterial code requires isolation in pure culture and full phenotypic characterization; 16S rRNA sequence by itself is insufficient for formal naming. The taxonomic scheme described more fully below is based on a Human Oral Taxon number which runs currently from 001 to 918.
At about the time we recognized the need to create a taxonomic framework for the oral microbiome, the National Institute of Dental and Craniofacial Research released a request for proposals on “The metagenome of the oral microbiome.” We responded with a proposal entitled “A foundation for the oral microbiome and metagenome,” which was funded as DE016937. The goals of the grant were (1) to set up the HOMD web-accessible database with a provisional taxonomic scheme and to present all oral genomes in a graphical interface, (2) to complete reference genomes for oral taxa, and (3) to obtain isolates of previously uncultivated taxa and make them available to the research community by placing them in national type culture collections. We have made steady progress in achieving these goals, and this project is currently in its seventh year of funding.

The HOMD Website

The HOMD contains various types of information on human oral microorganisms including taxonomic, genomic, and bibliographic. The purpose of the HOMD website (http://www.homd.org) is to provide an easy-to-use online interface to search, retrieve, and navigate among these different types of information. HOMD also provides web-based bioinformatics software tools for data mining and analyses.
Technically, the HOMD website is constructed using a LAMP system and hosted on the web server computers. The LAMP system provides a Linux operating system, Apache web service, MySQL relational database, and PHP dynamic web page rendering. Textual contents such as the taxonomy and metagenomic information are queried and results dynamically displayed in the web browser by the LAMP system. A dedicated high-performance computer cluster is deployed to handle computationally demanding analyses such as sequence homology searches.
The HOMD has been designed to be compatible with most commonly used web browsers such as Microsoft Internet Explorer, Firefox,
Human Oral Microbiome Database (HOMD) 273 H
Google Chrome, and Safari. We suggest the use of one of these popular web browsers to ensure the functionalities of HOMD web pages and tools. All the HOMD information and tools are viewable and available to the general public without having to log in or acquire a user account. The log-in function is mainly for the purpose of maintaining the website and the curation of the database information. If a user has been designated a curator, he or she will see additional administrative submenus.
Detailed functionalities, web interfaces, and tools as well as useful usage tips are presented below. Technical information such as the implementation and design of the HOMD has been published elsewhere (Chen et al. 2010).

Features of the HOMD Web Pages
The design of the website was based on the feedback of several researchers in the field of oral microbiology over the past several years. The user interface was designed to be user-friendly, intuitive, and practical. On top of every HOMD page (Fig. 1), there is a top banner for the HOMD logo, which automatically reduces to smaller size (in height) once the user navigates away from the home page so that the banner will not take up too much space from the requested content. Clicking the top banner image also brings the user back to the HOMD home page. The top navigation menu is located right below the top banner and is also accessible throughout all the HOMD pages. The top navigation menu provides access points to all HOMD’s tools and information on all the web pages.
Another useful feature of the HOMD web pages is the unique page ID system. The rightmost item displayed on the top navigation menu is the page ID – a unique code that distinctly identifies the current page that a user is viewing. For example, the page ID of the HOMD home page is “HP1” (Fig. 1), and once a user navigates away from the home page to, e.g., the Taxon Table page, the page ID automatically changes to “TT1.” This feature allows precise page referencing. This is particularly useful when a user needs to refer to a specific page on the HOMD site for discussion, bug reporting, or suggestion.
The HOMD home page also includes a top-down oriented expandable menu on the left side and an introductory paragraph in the center. On the right side are the Meta-Database Search, the Announcement, and the Database Update boxes. The Meta-Database Search is very useful for searching desired information across all the subsets of HOMD databases, including the taxonomy, the metagenomic information, as well as the dynamic genome annotations. The result lists the number of matches to the keyword and provides links leading to detailed information. The Announcement box displays the important system-wide updates and news for the HOMD. The Database Update box is automatically updated by the HOMD dynamic genome annotation pipeline (see “Dynamic Annotation of Genomic Sequences” section) to keep track of the status of the genome annotation.
HOMD also provides comprehensive documentation and update history of data and tools. The HOMD User’s Guide (i.e., the help documentation) was designed to help users to use the tools, navigate the information, and interpret the results provided by HOMD. The User’s Guide is accessible through the top navigation menu on all pages and is dynamically linked to the relevant guide for each different tool. For example, when users are viewing the Taxon Table page, the “How to Use This Page” menu item shown in the top navigation menu will lead directly to the page that explains the use of the Taxon Table. Alternatively users can also browse the entire user documentation by clicking the “Table of content” tab shown on top of each documentation page as well as the “User’s Guide” links on the top menu and side menu of the home page. Every document of HOMD can be searched either through the search box located at the bottom of the table of contents of the documentation page or through the Meta-Database Search box located at the top-right part of the home page.
The design of the online interfaces of HOMD has been driven by suggestions from HOMD users. HOMD is open to suggestions and feedback from the research community to further improve its interface and content. Currently, HOMD provides several different ways to communicate with the
Human Oral Microbiome Database (HOMD), Fig. 1 Screenshot of the HOMD home page
research team and research community. The contact information provides e-mail addresses for direct communication with the HOMD research team. There is also a mailing list for important updates and announcements. Users can use their own e-mail address to subscribe to the HOMD Mailing List (https://groups.google.com/forum/#!forum/homd-mail) by sending an empty e-mail to the e-mail address: homd-mail+subscribe@googlegroups.com. An automatic e-mail will be sent to the subscriber for confirmation. HOMD also provides a discussion platform for the research community (https://groups.google.com/forum/#!forum/homd-forum). Note that these web links may change over time. In any case, current or updated web links provided here will be available on the HOMD website.

The HOMD Database Schema
The information and data provided by HOMD are stored in several databases. The Oral Taxon IDs and the genome IDs serve as the keys to cross-link these databases. The database table structures and the contents can be downloaded from the HOMD FTP (file transfer protocol) site at ftp://ftp.homd.org to allow users to reconstruct the databases and perform advanced queries on their own computers.

Download Data from HOMD
Most of the data recorded in HOMD, including taxonomy, genomics, and 16S rRNA reference sequences, can be downloaded from the HOMD FTP site (ftp://ftp.homd.org). The FTP site provides both current and archived versions of the
data for comparison. The FTP site can be accessed directly in the web browser. Each folder comes with a “readme” text file explaining the data, data format, and potential usage. Selected data such as the aligned reference sequence dataset, aligned 16S rRNA datasets for each taxon, and an HOMD taxonomy database in Excel format can be downloaded from the links provided in the HOMD web pages.

Taxonomy

Compilation of the HOMD Taxa
The HOMD describes information linked to oral microbe species. For bacteria, or archaea, that have not been validly named, there is no definition of “species.” Molecular methods to identify novel species generally have used 16S rRNA sequencing of isolates or 16S rRNA-based analysis of clone libraries. These strains or clones can then be clustered into phylotypes or taxa based on their 16S rRNA sequences. A phylotype can be defined for any similarity cutoff. In HOMD, a cutoff of 98.5 % 16S rRNA sequence similarity was used to cluster the 16S rRNA sequences at the species level to define novel oral bacterial phylotypes. Each validly named species and novel phylotype cluster was given a unique integer number called Human Oral Taxon (HOT) ID.
The original collection of oral microbial taxonomy information came from a combination of literature, primarily reports from Forsyth Institute investigators (Dzink et al. 1985, 1988; Socransky and Haffajee 1994; Tanner et al. 1979, 1998) and from Lillian Holderman Moore and Ed Moore (Moore et al. 1982, 1983; Moore and Moore 1994) formerly at the Anaerobe Laboratory at the Virginia Polytechnic Institute. 16S rRNA sequences for these named species came either from sequences obtained in our laboratory or from GenBank. Over the past 20 years, our laboratory constructed and sequenced over 600 16S rRNA gene libraries and obtained over 35,000 clone sequences. The cloning, sequencing, aligning, treeing, and clustering methods used to create HOMD are described elsewhere (Dewhirst et al. 2010). In brief, sequences were manually aligned in a secondary structure-based database using the program RNA (Paster and Dewhirst 1988). Distance matrices and neighbor-joining trees were generated to determine the clustering of sequences. Sequences with similarity equal to or greater than 98.5 % were grouped together into a single taxon. Sequences were extensively checked for chimeras and several sequences and some provisional taxa were removed. As a result, several hundred apparently novel full 16S rRNA sequences were identified this way.
To share the information of both the named and novel human oral microbial taxa with the research community, we decided to build a database and design web query interfaces and tools. When the HOMD was publicly launched in 2010, there were a total of 619 Human Oral Taxa in the initial release of the HOMD database. The 753 reference 16S rRNA gene sequences upon which this analysis was done have been released publicly for download on the HOMD website as version 10. At the time of writing this chapter, the total number of taxa described in the HOMD taxonomy database has grown to 688, represented by a total of 833 reference 16S rRNA sequences (HOMD RefSeq Version 13.1).

Navigating the HOMD Taxa
The HOMD taxonomy information can be viewed and retrieved in several different ways. The information can be viewed online directly in a web browser or downloaded as text files. For the online web browser viewing, the taxonomy pages can be searched with keywords or by visual navigation with the Taxon Table (Fig. 2) and the Taxonomic Hierarchy (Fig. 3). The Taxon Table can also be downloaded as an Excel or tab-delimited plain text file from the Tools & Download page or through the HOMD FTP site.
The keyword search can be done through the Meta-Database Search box on the home page or on the Taxon Table page. Both search boxes look for input keyword(s) in all text fields of the HOMD taxonomy database table.
On the Taxon Table page, all the human oral microbial taxa are listed in a table ordered
Human Oral Microbiome Database (HOMD), Fig. 2 Screenshot of the Taxon Table
alphabetically by organism names. The order can be changed by clicking the column name HOT IDs, Genus, or Species names, to toggle the display sort order. Three commonly used filters are also provided to show only those taxa with “named species,” “unnamed cultivated species,” or “uncultured phylotypes.” Each taxon listed in the table contains links to the individual Taxon Description page (described later) and to the genomic information, if available.
The taxa can also be viewed in the taxonomic hierarchical order, i.e., from domain, phylum, class, order, family, genus, to species levels, on the Taxonomic Hierarchy page (Fig. 3). The hierarchical tree is fully collapsed by default and can be dynamically expanded at any given level (or all levels). The link, at the species level, brings users to the detailed Taxon Description page. The designation of each level is followed by two numbers enclosed in the square brackets indicating the number of taxa and genome sequences. For example, “Phylum Proteobacteria [107, 144]” indicates that in the phylum Proteobacteria, 107 taxa were identified in the oral cavity and 144 strains have genomic sequences available at HOMD. If a strain has been sequenced by multiple groups, or multiple strains sequenced for a species, we provide each sequence when available.
Another way to check the summary of the HOMD taxa is to view the number of taxa at various taxonomy levels. The Taxonomic Level page provides a list of taxa and the number of taxa at the next lower level for each of the 7 taxonomic levels: Currently, the numbers are Domain (2), Phylum (14), Class (24), Order (40), Family (83), Genus (183), and Species (688).
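For readers who copy level designations from the Taxonomic Hierarchy page into their own notes or scripts, the bracketed counts can be split out with a short Python sketch. The regular expression and the function name are assumptions for illustration; they only encode the "Rank Name [taxa, genomes]" convention described in the text and are not part of any HOMD tool.

```python
import re

# Matches labels such as "Phylum Proteobacteria [107, 144]".
LABEL_PATTERN = re.compile(
    r"^(?P<rank>\w+)\s+(?P<name>.+?)\s*\[(?P<taxa>\d+),\s*(?P<genomes>\d+)\]$"
)


def parse_hierarchy_label(label):
    """Split a hierarchy label into rank, name, taxon count, and genome count."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized hierarchy label: {label!r}")
    return {
        "rank": match.group("rank"),
        "name": match.group("name"),
        "taxa": int(match.group("taxa")),        # taxa identified in the oral cavity
        "genomes": int(match.group("genomes")),  # strains with genome sequences
    }


print(parse_hierarchy_label("Phylum Proteobacteria [107, 144]"))
```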
Human Oral Microbiome Database (HOMD), Fig. 3 Screenshot of the Taxonomic Hierarchy expanded at the order
level Bacteroidales
Human Oral Microbiome Database (HOMD), Fig. 4 Screenshot of the Taxon Description page
portion of the text “tax_NNN” is also clickable and links to the corresponding taxon page on the HOMD website. For example, the GenBank record for the partial 16S rRNA sequence of the Alloprevotella rava clone GB024 (Accession No. GU409552, http://www.ncbi.nlm.nih.gov/nuccore/GU409552) contains the text /db_xref="HOMD:tax_302" because the HOT ID for A. rava is 302. Clicking “tax_302” in this GenBank record in the web browser will bring the user to the corresponding taxon page on HOMD (http://www.homd.org/taxon=302). NCBI embeds external database reference IDs in the GenBank records for cross-database referencing. More information can be found at this link: http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref.
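The cross-reference described in this entry can also be followed programmatically. The Python sketch below pulls HOMD taxon IDs out of GenBank flat-file text and builds the taxon page address in the form quoted above. The function names and the record excerpt are hypothetical (the feature coordinates are made up), and the URL pattern is simply the one given in the text, which may change over time.

```python
import re

# Pattern for the qualifier quoted in the text: /db_xref="HOMD:tax_302"
DB_XREF_PATTERN = re.compile(r'/db_xref="HOMD:tax_(\d+)"')


def homd_taxon_ids(genbank_text):
    """Return the HOT IDs (as strings, preserving leading zeros) found in a record."""
    return DB_XREF_PATTERN.findall(genbank_text)


def homd_taxon_url(hot_id):
    """Build the HOMD taxon page URL in the form given in the text."""
    return f"http://www.homd.org/taxon={hot_id}"


# Hypothetical excerpt of a GenBank feature table; coordinates are invented.
record_excerpt = '''
     source          1..500
                     /organism="Alloprevotella rava"
                     /db_xref="HOMD:tax_302"
'''

for hot_id in homd_taxon_ids(record_excerpt):
    print(hot_id, homd_taxon_url(hot_id))
```

The IDs are kept as strings because Human Oral Taxon numbers are written with leading zeros (for example 001), which an integer conversion would drop.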
Status – This field displays the culturing status for the taxon. A taxon can be either a validly named cultivated species, an unnamed cultivated species, or an unnamed uncultured phylotype. This status is shown in this field and will be updated upon the change of actual status of the taxon.
Type strain/reference strain – If the taxon’s status is validly named cultivated species, the Type Strain is listed here; if the taxon is an unnamed isolate, the strain information will be listed as Reference Strain. If no cultivated strain is available yet, the Reference Strain field will be listed as “None, not yet cultivated.”
Classification – The Taxon Description page lists the nomenclatures of each taxonomic level from Domain to Species. This classification is defined by HOMD and may be different from the NCBI Taxonomy. The NCBI Taxonomy can be accessed using a dynamic link. The HOMD taxonomy is based on analysis of where each taxon falls in phylogenetic trees generated using several treeing methods and including over 100 non-oral reference taxa identified by searching the “greengenes” 16S rRNA gene database (http://greengenes.lbl.gov). For example, in HOMD, an organism such as Eubacterium saburreum is placed in the family Lachnospiraceae (because that is where it falls phylogenetically), rather than in the family Eubacteriaceae (because its incorrect genus name “Eubacterium” has not yet been revised). Synonyms of the taxon that are currently in use or were used before in the literature or publications are also provided.
16S rRNA gene sequence – GenBank accession number and link to the corresponding NCBI Entrez record for one or more 16S rRNA gene sequences associated with the taxon.
16S rRNA gene sequence alignment – This field provides the link to the downloadable clone sequences preliminarily aligned to the reference sequence to which the clones belong. The current set contains the approximately 35,000 clone sequences (Dewhirst et al. 2010) aligned for each taxon. The clone alignments are provided in concatenated FASTA format with the reference sequence(s) on top which were used as the template for alignment. To view the alignment in color format and for further adjustment, third-party alignment viewing software may be used, such as SeaView (http://pbil.univ-lyon1.fr/software/seaview.html) and BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Because some pairs of clone sequences may be nonoverlapping (i.e., 500-base sequences at opposite ends of the molecule), this file must be used with caution for tree construction.
Phylogeny – A phylogenetic tree showing the position of this taxon among related HOMD taxa is provided here. The tree images are in PDF format and can be viewed or downloaded with the link provided in this field. A link to a list of all the downloadable phylogenetic tree images encompassing all the HOMD taxa is also provided.
Prevalence by molecular cloning – The number of clones found for this taxon in an analysis of approximately 35,000 clones (Dewhirst et al. 2010). Based on the number of clones found, the rank abundance of the taxon (out of 619) is given.
Synonyms – Lists previous names for the organism if validly named. Isolate or clone designations are given as synonyms when they have appeared in the literature as “names” for the taxon, such as “BU063” (Zuger et al. 2007).
NCBI taxonomy – For validly named species, there is a link to the NCBI Taxonomy. NCBI has no taxonomy for unnamed taxa; hence, the reason HOMD was created.
PubMed search – The number of hits when the name (genus plus species) of this taxon is used in the PubMed search. HOMD automatically updates this hit number every 2 weeks. To get the most up-to-date search, simply click the “PubMed Link” to pull up the search result live from the NCBI PubMed site. In general, there are no results for unnamed taxa, hence the need for HOMD. When articles referencing these taxa (often through clone numbers) are found by HOMD curators or community members, they are manually added to the Taxon Description.
Nucleotide search – Similar search as above using sequences diverging by more than 10 bases
NCBI Entrez “nucleotide” as reference data- within a taxon.
base. The latest result (hit count) is displayed HOMD provides two primary sets of 16S
with link to NCBI for most updated search. rRNA gene reference sequence (RefSeq) for
Protein search – Similar search as above using download and for BLAST search. The first set is
NCBI Entrez “protein” as reference database. the HOMD 16S rRNA RefSeq. This set contains
The latest result (hit count) is displayed with sequences representing all currently named and
link to NCBI for most updated search. unnamed oral taxa. In the latest reference
Genomic sequence – Number of genomes that sequence set (version 13.1 at the time of writing),
have been sequenced is indicated here with there are 834 reference sequences representing
a link to a detailed list of these genomes. the 688 taxa. The second is the HOMD 16S
Hierarchy structure – An expandable/collapsible rRNA Extended RefSeq. This set contains addi-
view of a dynamically displayed taxonomy tional16S rRNA reference gene sequences that
tree indicating the position of the taxon on are distinctively different from existing taxa but
the page. have not yet been assigned with a taxon ID.
Cultivability – Conditions and media for growing The HOMD reference sequences are corrected
strains of this taxon, if available. consensus sequences. Many have been corrected
Phenotypic characteristics – Generic phenotypic and extended based on alignment with other
description of the taxon if the taxon has culti- sequences for that taxon and Ns and indels
vated member(s). removed. Therefore, for many sequences, there
Prevalence and source – Describes the fre- will be differences between the reference
quency and source of clones and isolates sequence and the GenBank sequence listed in
from different oral sites and states of health the header information. We have not yet updated
or disease when known. our own GenBank sequences and cannot update
References – Literature and publications those from other depositors. We believe these are
referencing this taxon. These references are currently the best reference sequences available
manually curated with up to ten key references and, for the purposes of BLAST analysis, have
which may also include older references not the advantage of being of a uniform length.
indexed in PubMed. On the HOMD 16S rRNA Sequence Identifi-
Community comments – Registered and logged- cation page (Fig. 5), users can copy and paste the
in users can provide their feedbacks related to query sequences in the text field or upload from
this taxon. The comment requires the approval user’s computer. The query sequences should be
of the HOMD curators before it is shown to the in the concatenated FASTA format. The maximal
public. number of query sequences allowed to upload in
a single search is 5,000. Since viewing of the
Identification of 16S rRNA Gene Sequence by BLAST results in the web browser for over
BLAST Search 5,000 query sequences becomes very slow, for
One of the most used HOMD software tools is the search over 5,000 sequences, please contact the
customized BLAST search specifically designed HOMD team. The HOMD 16S rRNA BLAST
to identify user-provided 16S rRNA sequences online tool was only designed for a modest number
against the comprehensive collection of the 16S of sequences, up to a couple of thousand, which
rRNA reference gene sequences. Currently there can be submitted in several batches. It is not capa-
are a total of 688 taxa defined based on version ble of handling larger numbers of sequence reads,
13.1 of the 16S rRNA reference sequences. Since such as hundreds of thousands of reads from the
a phylotype can include members with up to next-generation sequencing pipeline. For larger
1.5 % sequence divergence (23 bases for a full numbers of sequences, the search can be done on
1,500-base sequence), multiple reference a collaboration basis. HOMD provides secure FTP
sequences have been selected where we have (sFTP) upload for large batches of user sequences,
Human Oral Microbiome Database (HOMD) 281 H
Human Oral Microbiome Database (HOMD), Fig. 5 HOMD 16S rRNA Sequence Identification. (a) Query
sequence input interface; (b) Result page
and the search will be sent manually to the HOMD
BLAST server cluster on the user’s behalf and results
made downloadable through the sFTP site. The
upload page also provides options for adjusting
the BLAST search parameters, although the default
setting should be sensitive enough to pick up
matches with even short oligo sequences.
Once the query sequences are submitted, the
sequences are uploaded to the HOMD computer
servers and queued for the BLAST search. Once
all the searches are done, the results are presented
back to the submitter in a tabular format. Results
containing up to 20 top matches for each query
sequence can be downloaded in text or Excel file
formats. Original full BLAST results including
the alignments can also be accessed from the
result page. The match identity is presented as
straight BLAST results and as an adjusted percent
identity (API) calculated as
$$\mathrm{API} = 100 \times \frac{M}{M + \mathit{MM}}$$
where M is the matched (identical) and MM the
mismatch sequence length between the query and
the reference sequence, respectively. This
calculation excludes any gaps introduced during the
alignment process of the BLAST search. We
have found that this correction gives much better
values for single primer sequence reads where the
sequence adjacent to the primer often includes
indels. The top hits are ordered by their API
rank, and sequences with alignment shorter than
95 % of the query sequence are excluded from conveniently accessible by users. Icons or links to
ranking. The top four matched reference sequences available tools pertaining to a specific genome are
are listed by this method, and the table shown on automatically presented on the relevant page to users.
the web page contains links to the original Important genomic data and bioinformatics tools
BLAST results as well as to the Taxon Descrip- provided by HOMD are described below. Addi-
tion pages for reference sequences. The results tional information on tools is also available in the
for the 20 top matches can be downloaded as previous publication (Chen et al. 2005).
plain text or in Microsoft Excel format.
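The adjusted percent identity and the ranking rule described for the 16S rRNA BLAST results can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HOMD implementation: the `Hit` record, its field names, and the choice to measure alignment length as matches plus mismatches (gaps excluded) are hypothetical; the 95 % cutoff and the four listed hits are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit for a query (hypothetical record, not a HOMD data type)."""
    ref_id: str       # reference sequence identifier
    matches: int      # M: identical aligned positions
    mismatches: int   # MM: mismatched aligned positions
    query_len: int    # full length of the query sequence


def adjusted_percent_identity(matches: int, mismatches: int) -> float:
    """API = 100 * M / (M + MM); gap positions are left out entirely."""
    aligned = matches + mismatches
    return 100.0 * matches / aligned if aligned else 0.0


def rank_hits(hits, min_coverage=0.95, top_n=4):
    """Drop hits aligned over less than min_coverage of the query, then sort by API.

    Assumption: alignment length is counted as matches + mismatches.
    """
    kept = [h for h in hits
            if (h.matches + h.mismatches) / h.query_len >= min_coverage]
    kept.sort(key=lambda h: adjusted_percent_identity(h.matches, h.mismatches),
              reverse=True)
    return kept[:top_n]


# Toy values for illustration only.
hits = [
    Hit("ref_A", 480, 20, 500),
    Hit("ref_B", 495, 5, 500),
    Hit("ref_C", 300, 0, 500),   # perfect identity but covers only 60 % of the query
]
ranked = rank_hits(hits)
```

In this toy input the third hit has the highest identity but is excluded by the coverage filter, which is the behavior the 95 % rule is meant to enforce.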
Genome Table
HOMD organizes genomes in three viewing
Genomics options: Taxa with Annotated Genomes, Taxa
with Genomes in Progress, and View All
Genomics Tools Overview Genomes. The first option lists the oral taxa
Complementary to the taxonomy information, the with annotated (static or dynamic) genomic
HOMD also provides comprehensive information and provides links to all the genomes
information and tools for studying genomes of the available for each taxon. The View Genome
human oral microbes. HOMD genomics database button links to the Genome Table showing all the
serves as the curated repository for the molecular available genomes of a specific taxon. The
sequences of human oral microbiome, including Genome Table shows the Oral Taxon ID (HOT),
complete and partial genomics sequences, as well the Genus and Species names, Strain Culture
as 16S rRNA mentioned in the previous section. Collection, HOMD Sequence ID (SEQ ID), num-
Genomic sequences available at HOMD can be ber of contigs and singlets, combined sequence
fully assembled genomes, high-coverage genomes, length, and links to available tools and
or genome surveys. HOMD also keeps track of the information. The second option (Taxa with Genomes in
status of ongoing genome sequencing projects for Progress) lists those oral taxa with genome
human oral microorganisms. A Sequence Meta sequencing projects still in progress for which no
Information page is created to hold relevant sequence is yet available. The third option shows
genomics and sequence meta information if all the genomes in alphabetical order and
a sequencing project for a human oral microbe is provides searching and sorting functions for easy
announced and available in the NCBI Genome navigation. Each genome listed has a link to the
Project Database. The genome project status is Sequence Meta Information page described next.
updated biweekly based on information collected
from the NCBI Genome Project Database with an Sequence Meta Information
automatic query script. Once genomic sequences The Sequence Meta Information page provides
are publicly released, they are dynamically anno- detailed biological, molecular biological,
tated by HOMD (Dynamic Annotation). Annota- genetic, genomic, and taxonomic as well as anno-
tion done by other data centers, if available, is tation information for a particular strain that has
termed “static annotation” and is viewable in been, is being, or will be sequenced (Fig. 6).
a separate panel in the Genome Viewer Information on these pages is semiautomatically
(described below). Relevant tools are provided for updated. Updated information from both
viewing and searching the annotation. These tools Genomes OnLine and NCBI Genome Project
were first developed as part of the Bioinformatics Database is retrieved biweekly and compared
Resource for Oral Pathogens (BROP: http://www. with the existing database automatically. New
brop.org; Chen et al. 2005). The programs and the or modified Genomic Project information is
data-mining schemes used in HOMD are designed then added to the Sequence Meta Information
for both finished and unfinished (collections of pages with confirmation by curators. The
multiple contigs) genome sequences. The tools Sequence Meta Information page contains the
are integrated with the HOMD website and are following human-curated information related to
Human Oral Microbiome Database (HOMD), Fig. 6 Screenshot of the Sequence Meta Information page
the target organism: Oral Taxon ID, HOMD
Sequence ID (SEQ ID), Organism Name (genus,
species), Culture Collection Entry Number, Isolate
Origin, Sequencing Status, NCBI Genome
Project ID, NCBI Taxonomy ID, Genomes
Online Goldstamp ID, NCBI Genome Survey
Sequence Accession ID, JCVI (previously
TIGR) CMR ID, Sequencing Center, number of
contigs and singlets, combined length (Kbp), GC
percent, DNA molecular summary, ORF annotation
summary, and 16S rRNA gene sequence.
In addition, original external information such
as NCBI Genome Project Database, NCBI
Taxonomy Database, Genomes OnLine Database,
and rRNA in NCBI Nucleotide Database,
if available, is parsed into separate tables below
the Sequence Meta Information for convenient
referencing.
Full and High-Coverage Genomes
Full genomes are the oral microbial genomes that
have been fully assembled, while the high-coverage
genomes are not fully assembled but
represent coverage of most of the genomes.
Both types of genomes are annotated and deposited
in a public database such as GenBank.
HOMD aims to provide frequently updated genomic
annotation for oral bacterial genomes (see
below). In addition, HOMD provides graphical
genomic viewing for static annotations done by
other public data centers such as NCBI or JCVI.
Genome Surveys
One of the original major goals of the
NIH-funded project “A Foundation for the Oral
Microbiome and Metagenome,” DE016937, was
to partially sequence up to 100 representative
human oral microbial species. A total of 12
low-coverage partial genomic sequences were
sequenced and deposited in NCBI before this
project fused with the Human Microbiome Project.
The genome information for these 12 surveys
is still maintained on HOMD even though they
currently also have complete or high-coverage
genomes (The Forsyth Metagenomic Support
Consortium and Izard 2010). Since the launch
of the Human Microbiome Project, the HOMD
team has been providing genomic DNA from
human oral microbes to the four HMP sequencing
centers for high coverage rather than survey
sequencing (The Forsyth Metagenomic Support
Consortium and Izard 2010).
Dynamic Annotation of Genomic Sequences
One of the major features of the HOMD Genomic
Database is the automatic and frequent updating
of the genomic annotation pipeline for genomes of
oral isolates. Although the amount of sequence
data is still growing rapidly, the computational
power needed for bioinformatic analysis of this
data is catching up, and the cost and energy
consumption per CPU are decreasing due to the
availability of multi-core CPU formats. The lower cost
of computational power has made it feasible for
us to set up a small computation farm dedicated
to the annotation of human oral microbial
genomes. HOMD recruited a cluster of multi-core
multi-node computer servers to frequently
update the annotation. Current HOMD genome
annotation algorithms include (i) BLASTP
(http://www.ncbi.nih.gov/BLAST; Altschul
et al. 1997) search against weekly updated
NCBI nonredundant protein data (ftp://ftp.ncbi.nih.gov/blast/db/),
(ii) BLASTP search against
Swiss-Prot protein data (http://us.expasy.org/sprot/;
Boeckmann et al. 2003), and (iii)
InterProScan search against various sequence
databases (Zdobnov and Apweiler 2001;
http://www.ebi.ac.uk/interpro/). To provide data on the
functional potential of genomes, BLASTP search
results against Swiss-Prot are further processed
for the construction of KEGG metabolic pathways
and Gene Ontology Trees. We take advantage
of the fact that the well-annotated Swiss-Prot
protein sequence descriptions contain interlinks
to the ENZYME (Bairoch 2000) and Gene Ontology
(Camon et al. 2003) databases. The dynamic genome
annotation runs full time daily on the dedicated
computer cluster except during the weekend,
when the latest NCBI nonredundant protein
database, Swiss-Prot, and InterPro databases are
being downloaded to and updated on our server.
Currently a total of 324 genomes representing
306 taxa are being repeatedly annotated by this
pipeline. On average, each genome takes ~3 h to
be annotated; thus, the current re-annotation
frequency is approximately a month for all the
300+ genomes. Additional genomes are being
added to the annotation pipeline as more
sequences are made available by other public
sequencing projects such as the Human
Microbiome Project (http://www.hmpdacc.org).
A live update status of the genome annotation is
provided on the HOMD home page indicating the
latest genome annotated or updated. HOMD aims
to maintain frequent and dynamic computer
annotation for the genomic sequence of at least one
isolate from each oral taxon whenever sequences
are made publicly available, as well as static
annotation of all annotated releases.
Genome Explorer
Genome Explorer is the centralized web interface
that interconnects all the genomics resources in
HOMD (Fig. 7). The front end of Genome
Explorer is a user-friendly interface that allows
investigators to navigate among all the genomics
information provided at HOMD. HOMD Genomics
Tools can be accessed either by selecting the
tool or the genome first. If the user chooses
the desired tool first, the user is then directed to
the Genome Explorer interface for selecting
genomes. Once a target genome is chosen, the
interface dynamically presents all the tools,
including linked external databases, available
for the selected genome. Currently available
tools include Genome Viewer, Dynamic Annotation,
BLAST, Annotator, EMBOSS, KEGG pathways
(Kanehisa 2002), Gene Ontology Tree
(Ashburner et al. 2000), Genomewide ORF
Alignment, and Sequence Download. The back
end of Genome Explorer is a searchable annotation
database that integrates all the results generated
from the data-mining pipeline described
below. The search result is presented in
a paginated and sortable table that also provides
web links to (i) a summary page for an individual
ORF, (ii) Genome Viewer to show the exact
location of the target ORF in the genome, and
(iii) the original BLAST or InterProScan results.
The summary page provides all the information
and tools available for a specific ORF, including
all the data-mining results mentioned above, as
well as convenient links to other web tools for
Human Oral Microbiome Database (HOMD), Fig. 7 HOMD Genome Explorer displaying results of Dynamic
Annotation for the genome Aggregatibacter actinomycetemcomitans HK1651
performing fresh search and analysis. In short,
Genome Explorer is a one-stop interface for all
the genomic information available for each target
genome or gene.
different annotations can be viewed and compared
side by side in the Genome Viewer
(http://www.homd.org/index.php?name=GenomeExp&org=pgin&gprog=gview).
Human Oral Microbiome Database (HOMD), Fig. 8 HOMD Genome Viewer displaying multiple sources of
annotations for Aggregatibacter actinomycetemcomitans HK1651
a group of genomes related at any taxonomic
level (species, genus, etc.). The BLAST parameters
are dynamically presented after the genome
selection, and the results are available on the web
and for download in multiple formats.
The HOMD Genomic BLAST query interface
starts with the selection of the genomes to be
searched against. All the HOMD genomes available
for search are displayed and selectable in
a collapsible tree based on the taxonomy
hierarchy. As shown in Fig. 9, upon starting the
HOMD Genomic BLAST, the taxonomy hierarchical
tree is fully expanded by default and can be
dynamically collapsed at any given level. The
links, at the species level or genomes level, lead
to the detailed Taxon Description or Sequence
Meta Information page, respectively. Numbers
indicated in the square brackets at each level are
the numbers of oral taxa, genomes with meta
information, genomes with HOMD annotation,
Human Oral Microbiome Database (HOMD), Fig. 9 Screenshot of the HOMD Genomic BLAST tool – the genome
selection page showing 107 Bacteroides genomic sequences selected for BLAST Search
and genomes with NCBI annotation, respectively.
The genome selection is flexible and can
be a single genome, any randomly selected individual
genomes, a group of genomes at any taxonomy
level (from Domain to Species), all the
genomes dynamically annotated at HOMD, all
the genomes with static annotations by NCBI,
or a representative genome from all the species.
The total number of genomes selected is shown
on top of the page.
After the genomes are selected, users are
directed to the next page for providing the
query sequence and options for BLAST search
(Fig. 10). A summary of the selected genome(s) is
presented on top of this page with an option
for going back and modifying the selection.
Below the summary is the query sequence form.
The query sequence, in FASTA format, can be
copied and pasted into the sequence field or
uploaded directly from the user’s computer. Multiple
sequences are allowed, up to a limit of ten
sequences. BLAST parameters are dynamically
changed based on the type of query and subject
sequences. The query sequences can be either
nucleotide or protein sequences. The subject can
be whole genomic DNA sequences or nucleotide
or amino acid sequences of the annotated proteins
of the selected genomes. Once the sequence type
(nucleotide or protein) is selected by the user for both
query and subject sequences, suitable BLAST
programs are dynamically displayed for selection.
For example, if both query and subject
sequences are proteins, only BLASTP is available
for search; likewise, if both queries and
Human Oral Microbiome Database (HOMD), Fig. 10 The HOMD Genomic BLAST tool – query sequence input
and BLAST parameter adjustment page
subjects are nucleotides, the search can be done parameters. The search strategy including the
with BLASTN, BLASTX, or TBLASTX. Fur- query, subject, and BLAST parameters can be
thermore, alternative algorithms are available saved or downloaded for future reference. The
for nucleotide to nucleotide searches, including actual BLAST results are presented in a manner
MegaBLAST (Morgulis et al. 2008) and similar to the typical HTML format. They include
Discontiguous MegaBLAST (Morgulis a Graphical Overview section (Fig. 3) to display
et al. 2008). Similarly, for protein to protein the alignment of the “high-scoring pairs” (HSPs)
searches, available algorithms are BLASTP, between the query and the subject sequences.
PSI-BLAST (Altschul et al. 1997), PHI-BLAST HSPs are plotted against the query sequence and
(Zhang et al. 1998), and DELTA-BLAST highlighted by different colors based on align-
(Boratyn et al. 2012). For each BLAST program, ment scores. Every HSP on the plot is
only the parameters and options corresponding to hyperlinked with the corresponding pairwise
the selected program type and algorithm appear alignment in the Alignment section. Subject
on this page. Detailed information about BLAST sequences that matched the query are listed in
parameters is available under the link “Help.” For the Descriptions section, sorted by the expected
the advanced users, the command-line style (e) values. The Alignment section presents the
BLAST+ parameters can be added in Advanced alignments of the HSPs as a series of pairwise
Option section (Camacho et al. 2009). alignments. Each alignment contains a hyperlink
Upon submission of the BLAST search, the to the corresponding HOMD- or NCBI-annotated
requested job is sent to the back-end service for gene, if such information is available.
processing. The back-end service consists of To provide the research community with sat-
a computer cluster to handle multiple requests isfactory experience with and the convenient fea-
from the query interface. The selected genomes/ tures of the HOMD Genomic BLAST, we
nucleotides/proteins are dynamically compiled to currently allow up to ten query sequences to be
a virtual sequence database searchable by the searched in a single job request. Since the time
BLAST programs, using the “blastdb_aliastool” needed for the computation is linearly proportional
tool provided by BLAST+ (Camacho et al. 2009). to the numbers of both query and subject
The searched jobs are distributed to the computer sequences, we expect the maximal waiting time
nodes of the cluster, which is managed by the to be no longer than 10 min, provided no previous
TORQUE resource manager (http://www. job is waiting in the job queue. In fact, when
adaptivecomputing.com/products/open-source/ a total of ten protein sequences with the size of
torque). During the search process, the user is 500 amino acids in length were submitted to an
presented with an intermediate page to monitor empty queue to search against all the protein
the job status. This status page reports sequences of all HOMD genomes, the job was
a summary of the job as well as time/duration completed in about 400 s, without any prior jobs
elapsed since submission. The status page peri- waiting in the cluster queue. Special requests may
odically refreshes itself, effectively polling the be considered for jobs containing more query
server while the job runs. The BLAST result is sequences than the current limit, on a
automatically presented when the job completes. collaboration basis.
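The "virtual sequence database" step mentioned for the back-end service can be illustrated with a small Python sketch that only assembles a `blastdb_aliastool` command line for a set of already formatted BLAST databases. The `-dblist`, `-dbtype`, `-out`, and `-title` options are standard BLAST+ options; the function name, the database paths, and the job naming are hypothetical, and nothing here is claimed to reproduce how HOMD actually invokes the tool.

```python
import shlex


def alias_db_command(db_paths, dbtype, out_name, title):
    """Build (but do not run) a blastdb_aliastool call that groups existing
    BLAST databases under one alias, so a search can target them together.

    Assumption: each entry in db_paths is an existing BLAST database
    (created beforehand with makeblastdb) and contains no spaces.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    if not db_paths:
        raise ValueError("at least one database is required")
    return [
        "blastdb_aliastool",
        "-dblist", " ".join(db_paths),   # space-separated list in one argument
        "-dbtype", dbtype,
        "-out", out_name,
        "-title", title,
    ]


# Hypothetical per-genome protein databases selected by a user.
cmd = alias_db_command(
    ["genomes/seq_a_proteins", "genomes/seq_b_proteins"],
    "prot",
    "job_0001_selection",
    "User-selected genomes",
)
print(shlex.join(cmd))  # shell-quoted form, e.g. for logging or a batch script
```

The resulting alias name could then be passed to a BLAST+ program through its `-db` option, and the command itself could be wrapped in a job script for a queue manager such as TORQUE; both of those steps are left out of this sketch.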
BLAST results are presented dynamically in The number of the genomes hosted by HOMD
the output interface (Fig. 11). Users can check the database has been growing from approximately
details of BLAST job information and choose 600 genomes at launch (June 2011) to nearly
to download the results in different formats, 1,200 genomes towards the beginning of 2013.
such as HTML, archive, text, tabular, CSV, and We expect the number to continue to grow, in
XML. Additional jobs can also be submitted for concordance with the growth of the NCBI microbial
the same queries and subjects with modified genomes, as well as the progress of the Human
Human Oral Microbiome Database (HOMD), Fig. 11 The HOMD Genomic BLAST tool result summary page
showing different download options for the BLAST search results
I 294 Insights into Environmental Microbial Denitrification
microbial communities. Although the advances high acidity in the source zone (pH 3–4) also
in cultivation-independent molecular analyses suppresses microbial activity and diversity
of microbial communities have been well adver- (Fields et al. 2005; Hemme et al. 2010). Despite
tised (e.g., high-throughput amplicon sequencing the restrictive conditions, there is evidence for
(e.g., Caporaso et al. 2011), metagenomics significant nitrous oxide production in the near-
(e.g., Tringe et al. 2005), and metatran- source zone (Spalding and Watson 2008). As the
scriptomics (e.g., Poretsky et al. 2009)), parallel low pH is ameliorated down-gradient of the
advances in cultivation have also been made, source zone, nitrate, nitrous oxide, and soluble
including the use of lower organic carbon uranium are attenuated without active remedia-
media, extended incubation, single-cell encapsu- tion, due to both microbial and geochemical
lation approaches, and overall improved mimick- processes (Kowalsky et al. 2011).
ing of natural conditions within a culture vessel The contaminant levels in the near-source
(e.g., Bollmann et al. 2007; Kaeberlein et al. zone are alarming, and source zone remediation
2002; Zengler et al. 2002). Here, data from strategies have been examined, with limited suc-
metagenomic sequencing and isolation, physio- cess (Wu et al. 2007). The extraordinary levels of
logical testing, and whole-genome sequencing of nitrate must be removed before microbial reduc-
denitrifying bacteria from the highly contami- tion of U(VI) to U(IV) can proceed (Akob
nated subsurface of the Oak Ridge Integrated et al. 2008; Luo et al. 2005; Wu et al. 2006,
Field Research Challenge (ORIFRC) site are 2010), and down-gradient remediation has been
considered, along with the implications of this analysis for more effective as nitrate is essentially absent
understanding the environmental distribution and (e.g., Gihring et al. 2011). The presence of nitrous
ecological niche of denitrifying bacteria. oxide in the source zone wells suggested the
presence of in situ denitrification, prompting
The ORIFRC Site an interest in microorganisms capable of nitrate
The ORIFRC site is highly contaminated with reduction at in situ pH, with the hope that stimu-
spent uranium and a wide variety of other con- lation of these native organisms could aid in the
taminants (e.g., other radionuclides, heavy long-term removal of uranium from the site
metals, and volatile organic contaminants) as groundwater. Initial studies revealed significant
a result of long-term uranium enrichment for diversity in nitrite reductase genes in groundwa-
nuclear weapons, coupled with improper disposal ter at the site, including both genes encoding for
in unlined ponds (S-3 ponds) (Brooks 2001; copper-containing (nirK) and cytochrome (nirS)
Kostka and Green 2011; NABIR 2003; Watson forms (Palumbo et al. 2004; Yan et al. 2003).
et al. 2004). Although the ponds have been sub- Based on metagenomic analysis of acidic
sequently drained, much of the contaminant has groundwater from the site, Hemme et al. (2010)
migrated into the subsurface, where it serves to hypothesized that denitrification comprised the
feed a plume migrating down-gradient across the predominant form of metabolism in the near-
site (Watson et al. 2005). Uranium is the priority source zone microbial community due to the
contaminant of concern, though the nitrate in the low oxygen and lack of fermentation genes
near-source zone (adjacent to the former S-3 observed there. The overabundance of nitrate/
ponds) reaches extraordinarily high concentra- nitrite antiporters in the metagenome was
tions (in the range of 10–1,000 mM) due to the interpreted as a further indication of the strong
use of nitric acid in the processing of uranium. effect of the elevated nitrate on the source zone
The high level of nitrate complicates remediation microbial community.
strategies at the site by inhibiting microbial Prior to the metagenome sequencing of the
reduction of soluble hexavalent uranium to an acidic groundwater at the ORIFRC site,
insoluble mineral form of tetravalent uranium cultivation-independent molecular surveys had
(e.g., Finneran et al. 2002; Kostka and Green been performed to track denitrifying organisms.
2011; Shelobolina et al. 2003). The moderately As the denitrification phenotype is a polyphyletic
trait, and can be acquired readily via lateral gene targeting unique nirK genes, and whole-genome
transfer, ribosomal RNA gene sequencing is not sequences were also recovered from
suitable for identifying and tracking denitrifying non-denitrifying reference strains related to
organisms. Functional gene assays – targeting organisms isolated from the field site.
nitrate, nitrite, nitric oxide, and nitrous oxide Bacteria from six distinct genera of
reductases – have been performed for this pur- denitrifiers were isolated, including strains
pose. Yan et al. (2003) and Palumbo et al. (2004) of Hyphomicrobium (Alphaproteobacteria),
performed site-wide surveys of nitrite reductase Afipia (Alphaproteobacteria), Pseudomonas
genes at the ORIFRC site. No clear pattern relat- (Gammaproteobacteria), Rhodanobacter
ing the composition and relative abundance of (Gammaproteobacteria), Bacillus (Firmicutes),
nitrite reductase genes with groundwater geo- and Intrasporangium (Actinobacteria) (Green
chemical conditions was observed, however. et al. 2010). Under laboratory conditions, all
For example, a principal component analysis of strains were capable of growth with nitrate as
clusters of nirK (gene encoding for copper- the sole electron acceptor, though the Gram-
containing nitrite reductase) sequences grouped positive strains produced only nitrous oxide as
all wells across the pH gradient together, with the a terminal product, while Rhodanobacter spp.
exception of one high nitrate groundwater sam- produced a mixture of nitrous oxide and nitrogen
ple. In all wells, the most abundant nirK gas. Physiological and genetic characterization of
sequences were most similar to the nirK gene the isolates from the genus Rhodanobacter was
sequence derived from Hyphomicrobium prioritized, as these organisms had been detected
zavarzinii, and all sequences were most similar in great abundance in acidic groundwater as well
to gene sequences derived from Proteobacteria. as sediments from the near-source zone (Green
Thus, although a substantial diversity of nitrite et al. 2010, 2012). Bacteria from this genus were
reductase genes was observed, with many novel revealed to have extraordinarily high relative
gene sequences recovered, more recent data from abundance in the near-source zone, over multiple
genome and metagenome sequencing indicates sampling seasons, and were sometimes the only
that the predominant denitrifiers were not active organisms detected in RNA-based ana-
detected in single-gene surveys (Green lyses of groundwater samples (Green
et al. 2010, 2012; Hemme et al. 2010). et al. 2012). Highly similar strains were indepen-
dently isolated from ORIFRC site sediment using
Combined Cultivation and Direct Molecular a diffusion chamber approach (Bollmann
Studies of Denitrifying Bacteria et al. 2010), and in a metagenomic survey of
The study of denitrifying microorganisms at the acidic groundwater from the site, one of the dom-
ORIFRC field site was approached in inant organisms detected (so-called FW106 gI) is
a multipronged fashion, including (a) site-wide clearly a member of the genus Rhodanobacter
microbial community characterization using (Hemme et al. 2010). This organism contained
DNA extraction from sediment and groundwater, a full denitrification pathway.
coupled with high-throughput bacterial ribo- Despite the apparent numerical abundance of
somal RNA (rRNA) gene amplicon sequencing, members of the genus Rhodanobacter in the
(b) quantitative PCR (qPCR) analyses of bacte- acidic source zone, these organisms were not
rial small subunit (SSU) rRNA and nitrite reduc- detected in prior molecular surveys of denitrifi-
tase (nirK) gene abundance in groundwater and cation pathway genes at the ORIFRC site
sediment samples, (c) cultivation and physiolog- (Palumbo et al. 2004; Yan et al. 2003). Nor
ical testing of denitrifying bacteria from sediment could PCR amplification of nirS (cytochrome cd
and groundwater, and (d) de novo whole-genome 1-containing nitrite reductase), nirK, or nosZ
sequencing of denitrifying isolates. Subse- (nitrous oxide reductase) genes be achieved
quently, genomic DNA (gDNA) samples from using standard primer sets (Green et al. 2010).
the site were reanalyzed with novel primers Similar challenges were presented by the other
isolated strains, excepting Afipia. For the Hyphomicrobium strain, a novel primer set targeting nirK was designed based on a reference gene available in GenBank, but no similar reference sequences were available for the other strains. Subsequently, metagenome sequence data from acidic groundwater acquired at the site (Hemme et al. 2010) were surveyed, and two novel nirK sequences were identified. Using these de novo assembled sequences, primer sets were developed that allowed the amplification of a nirK gene from the Rhodanobacter isolates and from putative Rhodanobacter organisms from environmental genomic DNA (Green et al. 2010, 2012). Quantitative PCR analysis was used to quantify SSU rRNA and nirK gene abundance in groundwater from across the watershed, and this analysis revealed that nirK genes were present in abundance across the ORIFRC site, including nirK genes derived from Rhodanobacter (Green et al. 2012). Coupled with relative abundance measurements derived from qPCR of rRNA genes and from rRNA gene amplicon sequencing, this analysis revealed that Rhodanobacter were the most abundant organisms in the near-source zone, that nirK genes most similar to those from Rhodanobacter strains were most abundant in the near-source zone, and that Rhodanobacter organisms were active, not just present, in the near-source zone. Coupled with in vitro analysis of the physiological capabilities of Rhodanobacter strains in pure culture, these data led to the hypothesis that bacteria from the genus Rhodanobacter are the dominant near-source zone denitrifiers at the ORIFRC site. This hypothesis is supported by studies conducted in other ecosystems which demonstrate that Rhodanobacter spp. dominate under low pH, denitrifying conditions (e.g., van den Heuvel et al. 2010).

Direct PCR amplification of nitrite reductase genes from Rhodanobacter and other denitrifiers isolated from the site was not successful using standard primers, and subsequently, de novo shotgun genome sequencing and draft assembly of these bacterial denitrifiers was performed. The initial draft sequences of Rhodanobacter and Intrasporangium recovered complete nirK genes and helped determine the cause of PCR amplification failure. First, the putative nitrite reductase genes from these organisms were highly divergent from many sequences present in gene databases, and the sequences contained a large number of mismatches with the most commonly used primer sets for targeting bacterial nirK genes (e.g., 10 and 11 mismatches, respectively, between primer R3Cu and the first and second nirK genes of R. denitrificans 2APBS1 (Green et al. 2010; Hallin and Lindgren 1999)). In addition, most Rhodanobacter spp. have two highly divergent nirK genes located in different positions in the genome (Green et al. 2010; Kostka et al. 2012). Two strains of Rhodanobacter independently isolated (Bollmann et al. 2010) similarly contain two nirK genes apiece, and both are nearly (>99% similar) or completely identical to nirK genes from R. denitrificans 2APBS1T. Both forms of nirK are expressed under denitrifying conditions in R. denitrificans 2APBS1T, but the purpose of two copies of the gene is not yet clear (Green et al. 2012). One copy of the gene, colloquially called “nirK-B,” is most similar to nirK genes from certain Proteobacteria, including Betaproteobacteria from the genera Burkholderia and Ralstonia. The second copy, called “nirK-V,” is most similar to the nirK gene from Opitutus terrae PB90-1, within the phylum Verrucomicrobia.

To examine this phenomenon on a broader phylogenetic scale, Green et al. (2010) recovered complete nirK and nosZ genes from a number of microorganisms which had been sequenced by the Joint Genome Institute. These genes were aligned and primer binding sites were identified. This analysis revealed that the difficulty in amplifying nirK genes from ORIFRC site isolates is symptomatic of a broader difficulty in detecting denitrifying bacteria through single primer set amplification due to large numbers of mismatches between primer and gene sequences. The commonly used primer sets (including quantitative PCR primer sets) target a relatively narrow range of organisms, primarily within the Proteobacteria (Green et al. 2010). Thus, molecular approaches that depend upon single primers, even heavily degenerate primers, cannot be used
suitably to detect or quantify denitrifiers in environmental samples, and the true diversity and abundance of denitrifiers is most likely greatly underestimated from current surveys. Alternate approaches, which utilize the full availability of reference sequence data derived from de novo genome sequencing and from shotgun metagenome sequencing of environmental samples, must be developed to more fully assess the distribution of these important organisms.

Although the nitrite reductase gene is a particularly dramatic example, it is not unique in this regard, and other functional genes of significance to biogeochemical processes have shown similar levels of sequence diversity. The sequence diversity of nirK may be in part due to the multiple physiological roles for nitrite reduction (detoxification, respiration), different conditions under which the enzymes may be active (e.g., prior to anoxic conditions, after total anoxia), and multiple locations for nitrite reductases (periplasm, inner membrane) and for the different forms of the gene (copper nitrite reductase, nirK, and cytochrome-cd1 nirS). This broad sequence divergence but with retained function is present in other functional genes, including other genes in the denitrification pathway (e.g., nosZ; Green et al. 2010; Jones et al. 2013; Sanford et al. 2012).

Although many Rhodanobacter spp. isolated from the ORIFRC site subsurface were capable of complete denitrification, some members of the genus were incapable of growth on nitrate. Similarly, in a survey of the literature regarding Rhodanobacter, most strains were identified as aerobic bacteria, incapable of nitrate reduction. Strains isolated independently from the ORIFRC site were observed to be acid tolerant (arrest of growth was observed at pH 3.5–4), tolerant of high levels of nitrate (up to 250 mM), and moderately tolerant of various heavy metals, including uranium (Bollmann et al. 2010). The initial description of R. thiooxydans, the closest relative of R. denitrificans, indicated that the organism was capable of nitrate, but not nitrite, reduction (Lee et al. 2007). Subsequent work, however, demonstrated that these organisms are capable of complete denitrification from nitrate (Prakash et al. 2012; van den Heuvel et al. 2010). More recently, a novel species, R. caeni, was described as capable of nitrate reduction to nitrite, but no evidence for complete denitrification was demonstrated (Woo et al. 2012). Likewise, R. sp. strain A2-61, shown to form intracellular uranium-phosphate complexes, was unable to reduce nitrate (Sousa et al. 2013).

To understand the genetic basis of the differences in physiology with respect to denitrification, the genomes of five additional strains of bacteria from the genus Rhodanobacter were sequenced (Kostka et al. 2012). In total, three strains of denitrifying Rhodanobacter were sequenced (R. denitrificans 2APBS1T, R. denitrificans 116-2, R. thiooxydans) alongside three strains of apparent non-denitrifying (from nitrate) Rhodanobacter (R. fulvus Jip2 (Im et al. 2004), R. spathiphylli B39 (De Clercq et al. 2006), and R. sp. 115, isolated from the ORIFRC site (Kostka et al. 2012)). Preliminary analysis of the genomes of the six Rhodanobacter strains revealed that all members of the genus contained nearly complete denitrification pathways, including two copies of the nitrite reductase gene nirK (excepting R. spathiphylli, with only a single copy). All denitrifying isolates contained many genes in the dissimilatory denitrification pathway, but non-denitrifying isolates were missing several key genes involved in nitrate respiration, such as nitrate reductase genes (i.e., narG, narH, narJ, and narI). The genomic context of these genes was further examined, and it was observed that the nitrous oxide reductase genes (e.g., nosZ) showed the greatest synteny among all six genomes (Fig. 1). Since relatively few organisms conduct nitrous oxide reduction alone, it may be supposed that the high level of synteny in this gene and the lower synteny in other parts of the denitrification pathway favor the hypothesis that the common ancestor of the bacteria within the genus Rhodanobacter likewise contained a full denitrification pathway, with subsequent rearrangement of the genes in the pathway. Further clarity will be obtained with additional whole-genome sequences of related organisms from the Xanthomonadaceae.
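The primer-mismatch comparison reported in this entry (for example, 10 and 11 mismatches between primer R3Cu and the two nirK genes of R. denitrificans) can be illustrated with a minimal sketch. It is not the procedure used by Green et al. (2010); it only shows one plausible way to count mismatches between a degenerate primer and every candidate binding site on one strand of a gene. The function names and the toy sequences are hypothetical, and no real primer or nirK sequence is reproduced here. The degenerate-base table follows the standard IUPAC nucleotide codes.

```python
# Minimal sketch (assumption: same-strand comparison of equal-length windows).
# IUPAC degenerate nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(
        1
        for p, s in zip(primer.upper(), site.upper())
        if s not in IUPAC[p]
    )


def best_binding_site(primer, gene):
    """Slide the primer along the gene; return (fewest mismatches, start index)."""
    if len(gene) < len(primer):
        raise ValueError("gene is shorter than the primer")
    best = None
    for start in range(len(gene) - len(primer) + 1):
        window = gene[start:start + len(primer)]
        mismatches = count_mismatches(primer, window)
        # Strict '<' keeps the first (5'-most) site when several sites tie.
        if best is None or mismatches < best[0]:
            best = (mismatches, start)
    return best


# Toy data only: a hypothetical degenerate primer and a made-up gene fragment.
toy_primer = "ACGTRY"
toy_gene = "TTACGTACTT"
print(best_binding_site(toy_primer, toy_gene))
```

For a reverse primer, the primer (or the gene) would first have to be reverse-complemented; that step is left out to keep the sketch short.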
Insights into Environmental Microbial Denitrification from Integrated Metagenomic, Cultivation, and Genomic Analyses, Fig. 1 Gene order in the genomic region of the nitrous oxide reductase gene (nosZ) in denitrifying and apparent non-denitrifying strains of bacteria from the genus Rhodanobacter. Strong gene synteny is observed between denitrifying (highlighted in green) and apparent non-denitrifying lineages (highlighted in pink). Gene order in Marinobacter aquaeolei VT8 (Gammaproteobacteria, Alteromonadaceae), capable of anaerobic growth on nitrate, was included as an out-group organism with a complete genome sequence. Gene symbols: apbE, ApbE family lipoprotein; cheY-like, two-component system sensor histidine kinase-response regulator hybrid protein; dapE, succinyldiaminopimelate desuccinylase; DUF, protein of unknown function DUF2165; hip, high potential iron-sulfur protein; hisK, sensor histidine kinase; HYP, hypothetical protein; nosD, periplasmic copper-binding protein; nosF, ABC transporter related protein; nosL, NosL protein; nosR, nitrous oxide expression regulator, NosR; nosY, ABC-type transport system involved in multi-copper enzyme maturation, permease component; nosZ, nitrous oxide reductase; PGA, peptidase S45 penicillin amidase; tatA, twin-arginine translocation protein, TatA/E; tatB, twin-arginine-targeting protein translocase TatB; tatC, twin-arginine-targeting protein translocase subunit TatC; trxB, thioredoxin reductase oxidoreductase; badM/Rrf2, BadM/Rrf2 family transcriptional regulator; nifB, molybdenum cofactor biosynthesis protein A; ppiC, PpiC-type peptidyl-prolyl cis-trans isomerase
denitrification. Through a close coupling of cultivation-based and molecular approaches, characterization of denitrifying bacteria from the ORIFRC site has significant implications not just for broader characterization of denitrifying organisms but also for the application of PCR-based approaches to characterize microbial functional groups. With specific reference to denitrification, it was observed that the most commonly used primers targeting functional genes within the dissimilatory denitrification pathway were highly biased to a select group of genes largely derived from bacteria within the Proteobacteria and the genes from organisms outside this group could not conceivably be targeted with PCR due to the excessively large number of mismatches between primer and gene sequence. Thus, results generated from single-gene primer (even degenerate) sets must be interpreted carefully. A similar finding has been obtained for nitrous oxide reductase genes as well (Sanford et al. 2012). Since de novo genome and shotgun metagenome sequences generate gene sequences that are clearly identifiable as nitrite (or nitrous oxide) reductases but also impossible to target with common primers, new strategies must be developed to detect a broader collection of denitrifiers in the environment. As the organisms capable of denitrification are broadly distributed and are polyphyletic, functional gene analyses will continue to be essential to identify and quantitate denitrifying microorganisms and to characterize denitrifying microbial communities.

One of the essential extrapolations of these findings is that the true abundance of denitrification capability in bacterial lineages is underestimated due to two processes revealed in this study. First, the high sequence divergence present in functional genes in the denitrification pathway limits the detection of denitrification genes from isolates through PCR and sequencing. Second, the partial pathway observed in Rhodanobacter strains suggests that when searching for denitrification capabilities, other electron acceptors besides nitrate should be tested. In a sense, cultivation approaches and physiological testing of Rhodanobacter strains have been partially misleading regarding the potential ecological niche for these organisms, and only when coupled with whole-genome sequencing has the putative in situ functional capability of these organisms been revealed. In an analysis of Bacillus isolates and culture-collection strains, Verbaendert et al. (2011) revealed that nitrate was not always a suitable electron acceptor for verification of denitrification capability and that 20 % of denitrifying strains could use nitrite but not nitrate to initiate denitrification. They opine that the true abundance of denitrifiers is underestimated because typically only nitrate is used as an electron acceptor when testing for denitrification capability, and this is consistent with observations of isolates of the genus Rhodanobacter. Remarkably, they also observed that growth conditions can affect electron acceptor utilization, and this can further lead to missed identification of physiological capability. No doubt analogous situations for other genes, organisms, and functions are with us, waiting to be identified. Thus, it seems clear that for more robust physiological characterization of bacterial strains, genome-guided physiological testing must be implemented. Such an approach will have profound implications for the assessment of the ecological role of bacterial taxa.

Prior to the acquisition of multiple genomes from the genus Rhodanobacter, the denitrification phenotype in Rhodanobacter strains was hypothesized to result from a relatively recent lateral gene transfer rather than from vertical transmission, as appears to be the case (Green et al. 2010). Hemme et al. (2010) also opined that the inferred lateral gene transfer events most likely occurred after the introduction of contamination at the site. With multiple genomes in hand, phylogenetic analysis of the nitrite reductase genes from the whole-genome sequences of multiple Rhodanobacter strains revealed a phylogeny consistent with that of the rRNA genes from the same organisms. If there were lateral gene transfer events, these predated the last common ancestor of the genus Rhodanobacter, with the most parsimonious
interpretation being that nitrate reduction capability was later lost from certain members of the genus. The evolutionary history of the full denitrification pathway, however, appears to be fragmented; for example, the nirK genes do appear to be derived from a lateral gene transfer, but this transfer is not recent and certainly is independent of the ORIFRC site. The Rhodanobacter nosZ genes are more consistent with other Gammaproteobacterial denitrifiers. It is possible, though entirely speculative, that Rhodanobacter previously had type (or class) I soluble periplasmic nitrite reductases, like those present in Pseudomonas denitrificans, and these have been subsequently replaced by type II cytoplasmic membrane nitrite reductases. The ecological benefit derived from this is not clear yet, but may relate to activity under aerobic and anaerobic conditions, as has been observed for nitrate reductases (Bedzyk et al. 1999).

Summary

A combination of approaches to the study of denitrifying bacteria in a contaminated subsurface environment, including cultivation and physiological testing of denitrifying bacteria, de novo whole-genome sequencing, and shotgun metagenome sequencing, revealed key limitations to the application of more straightforward molecular approaches. Commonly used PCR primers targeting functional genes in the denitrification pathway are shown to be incapable of detecting a broad diversity of environmental denitrifiers. Likewise, some denitrifiers are incapable of nitrate reduction and may be misidentified in routine physiological testing of bacterial isolates. Bacteria from the genus Rhodanobacter, which can be abundant in highly contaminated environments with low pH, appear to be native denitrifiers, while metal resistance genes appear to have been acquired via lateral gene transfer. Overall, Rhodanobacter dominate in certain environments with low pH, heavy metal contamination, and conditions favoring a denitrification phenotype.

Cross-References

▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities
▶ Lateral Gene Transfer and Microbial Diversity

References

Akob DM, Mills HJ, Gihring TM, Kerkhof L, Stucki JW, Anastacio AS, Chin KJ, Kusel K, Palumbo AV, Watson DB, Kostka JE. Functional diversity and electron donor dependence of microbial populations capable of U(VI) reduction in radionuclide-contaminated subsurface sediments. Appl Environ Microbiol. 2008;74:3159–70.
Bedzyk L, Wang T, Ye RW. The periplasmic nitrate reductase in Pseudomonas sp. strain G-179 catalyzes the first step of denitrification. J Bacteriol. 1999;181:2802–6.
Bergaust L, Bakken LR, Frostegard A. Denitrification regulatory phenotype, a new term for the characterization of denitrifying bacteria. Biochem Soc Trans. 2011;39:207–12.
Bollmann A, Lewis K, Epstein SS. Incubation of environmental samples in a diffusion chamber increases the diversity of recovered isolates. Appl Environ Microbiol. 2007;73:6386–90.
Bollmann A, Palumbo AV, Lewis K, Epstein SS. Isolation and physiology of bacteria from contaminated subsurface sediments. Appl Environ Microbiol. 2010;76:7413–9.
Brooks SC. Waste characteristics of the former S-3 ponds and outline of uranium chemistry relevant to NABIR Field Research Center studies. Oak Ridge: NABIR Field Research Center; 2001.
Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A. 2011;108 Suppl 1:4516–22.
De Clercq D, Van Trappen S, Cleenwerck I, Ceustermans A, Swings J, Coosemans J, Ryckeboer J. Rhodanobacter spathiphylli sp nov., a gammaproteobacterium isolated from the roots of Spathiphyllum plants grown in a compost-amended potting mix. Int J Syst Evol Microbiol. 2006;56:1755–9.
Fields MW, Yan TF, Rhee SK, Carroll SL, Jardine PM, Watson DB, Criddle CS, Zhou JZ. Impacts on microbial communities and cultivable isolates from groundwater contaminated with high levels of nitric acid-uranium waste. FEMS Microbiol Ecol. 2005;53:417–28.
Finneran KT, Housewright ME, Lovley DR. Multiple influences of nitrate on uranium solubility during bioremediation of uranium-contaminated subsurface sediments. Environ Microbiol. 2002;4:510–6.
Gihring TM, Gengxin Z, Brooks SC, Campbell JH, Watson DB, Brandt CC, Yang Z, Criddle CS, Lowe K, Overholt WA, Wu W-M, Mehlhorn T, Kostka JE, Green SJ, Schadt CW. A limited microbial consortium is responsible for longer-term bioreduction of uranium in a contaminated aquifer. Appl Environ Microbiol. 2011;77:5955–65.
Green SJ, Prakash O, Gihring TM, Akob DM, Jasrotia P, Jardine PM, Watson DB, Brown SD, Palumbo AV, Kostka JE. Denitrifying bacteria from the terrestrial subsurface exposed to mixed waste contamination. Appl Environ Microbiol. 2010;76:3244–54.
Green SJ, Prakash O, Overholt WA, Cardenas E, Hubbard D, Akob DM, Tiedje JM, Watson DB, Jardine PM, Brooks SC, Kostka JE. Denitrifying bacteria from the genus Rhodanobacter dominate bacterial communities in the highly contaminated subsurface of a nuclear legacy waste site. Appl Environ Microbiol. 2012;78:1039–47.
Hallin S, Lindgren PE. PCR detection of genes encoding nitrite reductase in denitrifying bacteria. Appl Environ Microbiol. 1999;65:1652–7.
Hemme CL, Deng Y, Gentry TJ, Fields MW, Wu L, Barua S, Barry K, Tringe SG, Watson DB, He Z, Hazen TC, Tiedje JM, Rubin EM, Zhou J. Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J. 2010;4:660–72.
Im WT, Lee ST, Yokota A. Rhodanobacter fulvus sp. nov., a beta-galactosidase-producing gammaproteobacterium. J Gen Appl Microbiol. 2004;50:143–7.
Jones CM, Graf DR, Bru D, Philippot L, Hallin S. The unaccounted yet abundant nitrous oxide-reducing microbial community: a potential nitrous oxide sink. ISME J. 2013;7:417–26.
Kaeberlein T, Lewis K, Epstein SS. Isolating “uncultivable” microorganisms in pure culture in a simulated natural environment. Science. 2002;296:1127–9.
Kostka JE, Green SJ. Microorganisms and processes linked to uranium reduction and immobilization. In: Stolz JF, Oremland RS, editors. Microbial metal and metalloid metabolism: advances and applications. Washington, DC: ASM Press; 2011.
Kostka JE, Green SJ, Rishishwar L, Prakash O, Katz LS, Marino-Ramirez L, Jordan IK, Munk C, Ivanova N, Mikhailova N, Watson DB, Brown SD, Palumbo AV, Brooks SC. Genome sequences for six rhodanobacter strains, isolated from soils and the terrestrial subsurface, with variable denitrification capabilities. J Bacteriol. 2012;194:4461–2.
Kowalsky MB, Gasperikova E, Finsterle S, Watson D, Baker G, Hubbard SS. Coupled modeling of hydrogeochemical and electrical resistivity data for exploring the impact of recharge on subsurface contamination. Water Resour Res. 2011;47.
Lee CS, Kim KK, Aslam Z, Lee ST. Rhodanobacter thiooxydans sp. nov., isolated from a biofilm on sulfur particles used in an autotrophic denitrification process. Int J Syst Evol Microbiol. 2007;57:1775–9.
Luo J, Cirpka OA, Wu WM, Fienen MN, Jardine PM, Mehlhorn TL, Watson DB, Criddle CS, Kitanidis PK. Mass-transfer limitations for nitrate removal in a uranium-contaminated aquifer. Environ Sci Technol. 2005;39:8453–9.
NABIR. Bioremediation of metals and radionuclides. . .What it is and how it works. Berkeley: Lawrence Berkeley National Laboratory; 2003.
Palumbo AV, Schryver JC, Fields MW, Bagwell CE, Zhou JZ, Yan T, Liu X, Brandt CC. Coupling of functional gene diversity and geochemical data from environmental samples. Appl Environ Microbiol. 2004;70:6525–34.
Poretsky RS, Hewson I, Sun S, Allen AE, Zehr JP, Moran MA. Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre. Environ Microbiol. 2009;11:1358–75.
Prakash O, Green SJ, Jasrotia P, Overholt WA, Canion A, Watson DB, Brooks SC, Kostka JE. Rhodanobacter denitrificans sp. nov., isolated from nitrate-rich zones of a contaminated aquifer. Int J Syst Evol Microbiol. 2012;62:2457–62.
Sanford RA, Wagner DD, Wu QZ, Chee-Sanford JC, Thomas SH, Cruz-Garcia C, Rodriguez G, Massol-Deya A, Krishnani KK, Ritalahti KM, Nissen S, Konstantinidis KT, Loffler FE. Unexpected nondenitrifier nitrous oxide reductase gene diversity and abundance in soils. Proc Natl Acad Sci U S A. 2012;109:19709–14.
Shelobolina ES, O’Neill K, Finneran KT, Hayes LA, Lovley D. Potential for in situ bioremediation of a low-pH, high-nitrate uranium-contaminated groundwater. Soil Sediment Contam. 2003;12:865–84.
Sousa T, Chung AP, Pereira A, Piedade AP, Morais PV. Aerobic uranium immobilization by Rhodanobacter A2–61 through formation of intracellular uranium–phosphate complexes. Metallomics. 2013;5(4):390–397.
Spain AM, Krumholz L. Cooperation of three denitrifying bacteria in nitrate removal of acidic nitrate- and uranium-contaminated groundwater. Geomicrobiol J. 2012;29:830–42.
Spalding BP, Watson DB. Passive sampling and analyses of common dissolved fixed gases in groundwater. Environ Sci Technol. 2008;42:3766–72.
Integrated Database Resource for Marine Ecological Genomics 303 I
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308:554–7.
van den Heuvel RN, van der Biezen E, Jetten MS, Hefting MM, Kartal B. Denitrification at pH 4 by a soil-derived Rhodanobacter-dominated community. Environ Microbiol. 2010;12:3264–71.
Verbaendert I, Boon N, De Vos P, Heylen K. Denitrification is a common feature among members of the genus Bacillus. Syst Appl Microbiol. 2011;34:385–91.
Watson DB, Kostka JE, Fields MW, Jardine PM. The Oak Ridge field research center conceptual model. NABIR Field Research Center Report, Oak Ridge; 2004.
Watson DB, Doll WE, Gamey TJ, Sheehan JR, Jardine PM. Plume and lithologic profiling with surface resistivity and seismic tomography. Ground Water. 2005;43:169–77.
Woo SG, Srinivasan S, Kim MK, Lee M. Rhodanobacter caeni sp. nov., isolated from sludge from a sewage disposal plant. Int J Syst Evol Microbiol. 2012;62:2815–21.
Wu WM, Carley J, Fienen M, Mehlhorn T, Lowe K, Nyman J, Luo J, Gentile ME, Rajan R, Wagner D, Hickey RF, Gu BH, Watson D, Cirpka OA, Kitanidis PK, Jardine PM, Criddle CS. Pilot-scale in situ bioremediation of uranium in a highly contaminated aquifer. 1. Conditioning of a treatment zone. Environ Sci Technol. 2006;40:3978–85.
Wu WM, Carley J, Luo J, Ginder-Vogel MA, Cardenas E, Leigh MB, Hwang CC, Kelly SD, Ruan CM, Wu LY, Van Nostrand J, Gentry T, Lowe K, Mehlhorn T, Carroll S, Luo WS, Fields MW, Gu BH, Watson D, Kemner KM, Marsh T, Tiedje J, Zhou JZ, Fendorf S, Kitanidis PK, Jardine PM, Criddle CS. In situ bioreduction of uranium (VI) to submicromolar levels and reoxidation by dissolved oxygen. Environ Sci Technol. 2007;41:5716–23.
Wu W-M, Carley J, Green SJ, Luo J, Kelly SD, Nostrand J, Lowe K, Mehlhorn T, Carroll S, Boonchayanant B, Loffler FE, Watson DB, Kemner KM, Zhou J, Kitanidis PK, Kostka JE, Jardine PM, Criddle CS. Effects of nitrate on the stability of uranium in a bioreduced region of the subsurface. Environ Sci Technol. 2010;44:5104–11.
Yan TF, Fields MW, Wu LY, Zu YG, Tiedje JM, Zhou JZ. Molecular diversity and characterization of nitrite reductase gene fragments (nirK and nirS) from nitrate- and uranium-contaminated groundwater. Environ Microbiol. 2003;5:13–24.
Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, Keller M. Cultivating the uncultured. Proc Natl Acad Sci U S A. 2002;99:15681–6.
Zumft WG, Kroneck PM. Respiratory transformation of nitrous oxide (N2O) to dinitrogen by Bacteria and Archaea. Adv Microb Physiol. 2007;52:107–227.

Integrated Database Resource for Marine Ecological Genomics

Renzo Kottmann
Max Planck Institute for Marine Microbiology, Bremen, Germany

Synonyms

Database; Environmental data; Environmental genomics; GIS; Integration; Marine; Metagenomics

Definition

Megx.net, the integrated database resource for marine ecological genomics, is the first database to integrate bacterial and archaeal genes, genomes, and metagenomes from the marine environment with curated contextual metadata, as well as environmental data from heterogeneous resources.

Introduction

Over the last few years, microbial ecology and environmental microbiology have undergone a paradigm shift, moving from a single-experiment science to a high-throughput endeavor. Although the genomic revolution is rooted in medicine and biotechnology, it is currently the environmental sector, specifically the marine, which delivers the greatest quantity of data (Gilbert and Dupont 2011). Marine ecosystems, covering >70 % of the Earth’s surface, host the majority of biomass and significantly contribute to global organic matter and energy cycling. Microorganisms are known to be the “gatekeepers” of these processes, and insights into their lifestyle and fitness can enhance our ability to monitor, model, and predict future changes.

Recent developments in sequencing technology have made routine sequencing of whole
microbial communities from natural environments possible. Prominent examples in the marine field are the Global Ocean Sampling (GOS) campaign (Rusch et al. 2007), ICOMM, TaraOceans, Malaspina, and the Ocean Sampling Day 2014 of the Micro B3 project.

These large-scale sequencing projects bring new challenges to data management and software tools for assembly, gene prediction, and annotation, which are fundamental steps in genomic analysis. Several dedicated database resources have emerged to tackle the current need for large-scale metagenomic data management and analysis, among which are CAMERA (Sun et al. 2010), IMG/M (Markowitz et al. 2008), and MG-RAST (Meyer et al. 2008). Nevertheless, it is increasingly apparent that the full potential of comparative genome and metagenome analysis can be achieved only if the geographic and environmental context of the sequence data is considered. The metadata describing a sample’s geographic location and environment, and the details of its processing, from the time of sampling to sequencing and subsequent analyses, are important for modeling species’ responses to environmental change or the spread and niche adaptation of bacteria and viruses. Megx.net’s unique integration of contextual and sequence data allows microbial ecologists and marine scientists to better compare biological data to understand the complex interplay between organisms, genes, and their environment.

from the GOS microbial dataset. Finally, megx.net also incorporates all sequenced marine phage genomes in MegDB, which is the first step towards integrating viral genomic and biogeochemical data (Duhaime et al. 2011).

In an effort towards integrating microbial diversity with specific sampling sites, megx.net includes georeferenced small and large subunit rRNA gene sequences from the SILVA rRNA gene databases project (Quast et al. 2013). As of SILVA release 102, only 9 % (16S/18S) and 2 % (23S/28S) of over one million sequences in SILVA SSUParc (16S/18S) and LSUParc (23S/28S) databases are georeferenced.

All genomic sequences in megx.net are supplemented with contextual data from GOLD (Pagani et al. 2012), NCBI Genome Projects, and Moore Foundation’s Marine Microbial Genome Sequencing Project.

The main environmental data are retrieved from three sources:
1. World Ocean Atlas: a set of objectively analyzed (one decimal degree spatial resolution) climatological fields of in situ measurements
2. World Ocean Database: a collection of scientific, quality-controlled ocean profiles
3. SeaWiFS chlorophyll a data

These data are described at 33 standard depths for annual, seasonal, and monthly intervals. Together, the location and time data (x, y, z, and t) serve as a universal anchor and link environmental data to the sequence and contextual data.
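How a sample's location and time (x, y, z, and t) might serve as an anchor into gridded environmental data can be sketched as follows. This is an illustration only, not megx.net's implementation: the function names are hypothetical, the one-degree cell convention (lower-left corner) is an assumption, and the list of 33 depth levels is the commonly cited World Ocean Atlas standard-level set, which should be verified against the data release actually used.

```python
import math

# Assumption: the commonly cited 33 standard depth levels (metres).
# The text only states that there are 33 standard depths.
STANDARD_DEPTHS_M = (
    0, 10, 20, 30, 50, 75, 100, 125, 150, 200, 250, 300, 400, 500,
    600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1750,
    2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500,
)


def grid_cell(lat, lon):
    """Return the lower-left corner (lat, lon) of the one-degree cell."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    lon = ((lon + 180.0) % 360.0) - 180.0        # wrap to [-180, 180)
    cell_lat = min(math.floor(lat), 89)          # keep the pole in the last cell
    return cell_lat, math.floor(lon)


def nearest_standard_depth(depth_m):
    """Return the standard depth closest to the sampled depth (shallower on ties)."""
    if depth_m < 0:
        raise ValueError("depth must be non-negative")
    return min(STANDARD_DEPTHS_M, key=lambda d: abs(d - depth_m))


def anchor_key(lat, lon, depth_m, month):
    """Build a hypothetical lookup key (cell_lat, cell_lon, depth, month)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be within 1..12")
    cell_lat, cell_lon = grid_cell(lat, lon)
    return cell_lat, cell_lon, nearest_standard_depth(depth_m), month


# Example with made-up coordinates for a sample taken in June at 12 m depth.
print(anchor_key(54.18, 7.9, 12.0, 6))
```

A key of this kind could then index a monthly climatological field; seasonal or annual fields would use a coarser time component.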
Integrated Database Resource for Marine Ecological Genomics, Fig. 1 Geographic distribution of BLAST results of a proteorhodopsin from Dokdonia sp. PRO95. Blue crosses and labels indicate the number of significant BLAST hits in the GOS metagenome samples. The map is generated using the web service of the Genes Mapserver
retrieved via simple Web requests, as specified by the Web Map Service (WMS) standard. The base URL for WMS requests is http://www.megx.net/wms/gms, where one can also find a tutorial on how to use this service. Megx.net also provides access to MIxS reports in Genomic Contextual Data Markup Language (GCDML) XML files for all marine phage genomes through similar HTTP queries, e.g., http://www.megx.net/gcdml/Prochlorococcus_phage_P-SSP7.xml (Kottmann et al. 2008).

Current and Future Developments

Currently, megx.net is being further developed within the FP7 EU project Micro B3 as an open source project to become an integral part of the Micro B3 Information System. This information system builds on a handful of long-established data resources that span marine science. These data resources include SeaDataNet and its network of National Oceanographic Data Centers (oceanographic data), EurOBIS (macrobiological data), and EBI’s European Nucleotide Archive (EBI-ENA; molecular sequence data). While these resources exist to broaden and simplify access to data in their domains, integration of their data across domains requires megx.net to develop a set of new tools and Web services to facilitate seamless interoperability between the different data domains.

Summary

Megx.net’s unique integration of environmental and sequence data allows microbial ecologists and marine scientists to better contextualize and compare biological data, using, e.g., the Genes Mapserver and GIS tools. The integrated datasets facilitate a holistic approach to understanding the complex interplay between organisms, genes, and their environment. As such, megx.net is continuously improved to serve as a fundamental resource in the emerging field of ecosystems biology.

Cross-References

▶ A 123 of Metagenomics
▶ Computational Approaches for Metagenomic Datasets
▶ SILVA Databases
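The WMS access described in this entry can be illustrated by assembling a GetMap request against the base URL given in the text (http://www.megx.net/wms/gms). The sketch below only builds the URL and sends nothing over the network. The query parameters are the standard ones defined for WMS 1.1.1 GetMap requests, whereas the layer name is a placeholder: the actual layer names offered by the service are not stated here and would have to be taken from its GetCapabilities response or tutorial. Whether the service is still reachable at this address is likewise not assumed.

```python
from urllib.parse import urlencode

# Base URL as quoted in the text; its current availability is not assumed.
WMS_BASE_URL = "http://www.megx.net/wms/gms"


def build_getmap_url(layer, bbox, width=800, height=400,
                     image_format="image/png", base_url=WMS_BASE_URL):
    """Assemble a WMS 1.1.1 GetMap URL for one layer and a lon/lat bounding box.

    bbox is (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (min_lon < max_lon and min_lat < max_lat):
        raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                     # empty value requests the default style
        "SRS": "EPSG:4326",               # plain longitude/latitude coordinates
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# "hypothetical_layer" is a placeholder, not a documented megx.net layer name.
print(build_getmap_url("hypothetical_layer", (-180, -90, 180, 90)))
```

The resulting URL could be opened in a browser or passed to any HTTP client; a GIS client that speaks WMS would normally construct such requests itself from the base URL.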
Integrons as Repositories of Genetic Novelty 307 I
References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Duhaime MB, Kottmann R, Field D, Glöckner FO. Enriching public descriptions of marine phages using the MIGS standard: a case study assessing the contextual data frontier. Stand Genomic Sci. 2011;4(2):1.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3(1):347–71. Annual Reviews.
Hankeln W, Buttigieg PL, Fink D, Kottmann R, Yilmaz P, Glöckner FO. MetaBar – a tool for consistent contextual data acquisition and standards compliant submission. BMC Bioinforma. 2010;11:358.
Hankeln W, Wendel NJ, Gerken J, Waldmann J, Buttigieg PL, Kostadinov I, et al. CDinFusion – submission-ready, on-line integration of sequence and contextual data. PLoS ONE. 2011;6(9):e24797. Highlander SK, editor.
Hirschman L, Clark C, Cohen KB, Mardis S, Luciano J, Kottmann R, et al. Habitat-lite: a GSC case study based on free text terms for environmental metadata. OMICS J Integr Biol. 2008;12(2):129–36.
Kottmann R, Gray T, Murphy S, Kagan L, Kravitz S, Lombardot T, et al. A standard MIGS/MIMS compliant XML schema: toward the development of the Genomic Contextual Data Markup Language (GCDML). OMICS J Integr Biol. 2008;12(2):115–21.
Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I, Lykidis A, Mavromatis K, Hugenholtz P, Kyrpides NC. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008. PMID:17932063
Meyer F, Paarmann D, D’Souza M, Olson R, Glass E, Kubal M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9(1):386.
Pagani I, Liolios K, Jansson J, Chen I-MA, Smirnova T, Nosrat B, et al. The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012;40(Database issue):D571–9.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41(Database issue):D590–6.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, et al. The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2010;39(Database):D546–51.
Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2012;29(5):415–20. doi: 10.1038/nbt.1823.

Integrons as Repositories of Genetic Novelty

Bridget Mabbutt1, Chandrika Deshpande1, Visaahini Sureshan1 and Stephen J. Harrop2
1 Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia
2 School of Physics, University of New South Wales, Sydney, NSW, Australia

Synonym

Novel proteins engaged for LGT within the gene cassette/integron system

Definition

An important vehicle for lateral (or horizontal) gene transfer in bacteria is the integron: it enables the capture and expression of genes as small mobile elements, or gene cassettes. These mobile gene cassettes encompass a vast pool of genetic novelty, ostensibly for purposes of adaptation. In most cases, their functional annotation is obscured by their characteristically high sequence novelty. Our isolation and solving of protein structures encoded by the cassette metagenome reveals a relatively high proportion of completely novel folds. These newly defined crystal structures are found to encompass diverse topologies and fold families and delineate new protein domains.
Integrons as Repositories of Genetic Novelty, Fig. 1 Recovery of gene cassettes from integron arrays. The structure of an integron, showing core features including the intI gene (beige) with its Pint promoter, the attI attachment site and the Pc promoter. Three integrated gene cassettes (blue, red, and yellow) are shown. By using primers (green arrows) targeting the 59-be elements, cassette PCR has the capacity to recover gene cassettes and arrays independently of any specific encoded sequence. This allows recovery of entirely novel gene cassettes (Adapted from Boucher et al. 2007 and Stokes et al. 2001)
Integrons as Repositories of Genetic Novelty, Fig. 2 Ribbon depiction of novel cassette-encoded protein structures: (a) Hfx_cass2, (b) Vpc_cass2, (c) Hfx_cass5, (d) Vch_cass3, (e) Vch_cass14, (f) Hfx_cass1. Each subunit within the oligomeric organization is indicated in a different color. Putative binding sites, for interaction with either small molecule ligands or, potentially, other protein partners, are highlighted in cyan

a flexible loop. Pronounced acidic surface features extend perpendicular to each cavity due to Glu and Asp side chains of an outer helix. This unique binding groove, presented twice on opposing faces of the dimer and possibly gated by residues of the flexible loop, appears highly appropriate for hydrophobic and/or basic substrates or protein partners.
A distinct all-helical protein had also been identified in a gene cassette recovered from a V. metecus strain, Vpc_cass2 (PDB 3JRT). The fold incorporates a four-helix bundle with helical extensions wrapping about at midpoint (Fig. 2b); orthogonal packing of two chain pairs creates a globular-shaped dimer. Sequence homologs (Shewanella baltica and Moritella genomes, at ~50 % identity) highlight preservation of exposed residues (Lys63, Glu66, His109′, Val110′) clustered across the dimeric interface, indicating a possible substrate-binding site. This fold is weakly related to the substrate-binding domain of the kanamycin nucleotidyltransferase (KNTase-C) clan of proteins, yet the shape of the dimeric interface in Vpc_cass2 is distinct from that found in its closest KNTase-C relatives (e.g., HI0074 from Haemophilus influenzae). Lehmann and coworkers have documented substrate-binding/nucleotide-binding module pairs prevalent in bacterial genomes, particularly from harsh conditions and pathogens (Lehmann et al. 2003). Thus, the mobile gene cassette Vpc_cass2 may comprise one half of a bipartite system with the capacity to organize with a nucleotidyltransferase domain into a functional enzyme.

α + β Fold Members
A gene cassette also isolated from a sewage outfall (Halifax, Canada), Hfx_cass5, occurs as two domain-swapped α + β dimers organized into a tetramer (PDB 3IF4; Fig. 2c). Across the center of the tetramer, 3₁₀ helices of two opposing subunits stack via polar and charged groups. The flattened nature of the tetramer and the asymmetrical interactions of its component dimers result in two large faces with markedly different surface features. A small group of sequence homologs (55–71 % identity) includes gene cassettes from contaminated environments: a geographically distinct sewage outfall in Canada and an Australian industrial site (Stokes et al. 2001). Residues mediating the tetrameric organization are preserved across all members of this emerging sequence family, indicating this to be the functional form. Also conserved is the inter-module linker segment, which presents basic groups that line the pronounced surface clefts on both faces of the tetramer.

Derived from a strain of V. cholerae, the structure of Vch_cass3 (PDB 3FY6) reveals an unusual two-layered α + β organization. Within the dimer, central helices stack end-to-end, so separating and exposing two distinct sheet components (Fig. 2d). A long pronounced surface cleft is enclosed between the outer edge strands of these two sheets, flanked by acidic side chains. To date, two sequence homologs (~40 % identity) have been detected: within Desulfatibacillum alkenivorans from polluted water and a metagenomic sample of Antarctic bloom-forming cyanobacterium. These sequence relatives do not, however, retain the distinctive Asp/Glu residues surrounding the proposed binding cleft within the Vch_cass3 structure.

Another V. cholerae-derived gene cassette, Vch_cass14, also incorporates an α + β dimer, in this case within a two-layer sandwich fold (PDB 3IMO, Fig. 2e). Sequence relatives of this gene cassette have been found in the genomes of several soil- and water-dwelling bacteria. A particularly long and deep ligand cavity is internalized within this protein, appropriate for a linear hydrophobic substrate (e.g., fatty acid or alcohol). The features of this binding cavity are retained across all sequence relatives; 20 of the Vch_cass14 internal residues are conserved in its closest homologs. A high degree of conservation is also seen among residues responsible for mediating dimerization of the module, pointing to a dimeric functional protein. A notable feature of the dimer, possibly of functional importance, is the projection of positively charged surface clusters from the two exposed β-sheets.

α/β Fold Members
An unusual trimeric protein is encoded by Hfx_cass1, a gene cassette extracted from a salt marsh environment (Koenig et al. 2008). Although there are no sequence homologs in current databases, the unique three-layered α/β fold bears some topological relationship to the zinc transporter CzrB of Thermus thermophilus. This new cassette-encoded protein presents three clefts at each inter-subunit interface across the
Integrons as Repositories of Genetic Novelty, Fig. 3 Ribbon depiction of cassette-encoded new variants of known folds: (a) Cass2, (b) Bal32a, (c) iMazG. Each subunit within the oligomeric organization is indicated in a different color. Putative binding sites, for interaction with either small molecule ligands or, potentially, other protein partners, are highlighted in cyan
flattened trimer surface (Fig. 2f). The clefts are polar in nature, occupied in the crystal structure by water, and surrounded by pronounced acidic loops. Although the chemical organization of the binding site is unique to Hfx_cass1, some components are common to active site chemistry of enzymes known to engage with adenosine- and/or nicotinamide-based cofactors.

New Variants of Known Folds Encoded by Gene Cassettes

Cationic Drug-Binding Module
The structure (PDB 3GK6) of gene cassette Cass2 derived from environmental V. cholerae has identified an independent binding module related to domains of the AraC/XylS transcription activator system (Deshpande et al. 2011). Sequence analysis identifies the cassette-encoded protein to be representative of a group of independent binding modules undergoing lateral gene transfer within Vibrio and related species. Closest structural relatives of the Cass2 β-barrel (Fig. 3a) occur as domains of multidrug-binding proteins (including BmrR), incorporating a hydrophobic binding pocket with a signature glutamate side chain. Cass2 has been demonstrated to bind a range of cationic drug compounds. The structure of this module depicts a surface proximal to the drug-binding cavity with features homologous to those engaged for protein interaction within multidomain transcriptional regulators. Thus, it can be proposed that the Cass2 family has the capacity to form functional transcription regulator complexes and possibly represents evolutionary precursors to multidomain regulators of cationic compounds.

α + β Barrel Transporter
A gene cassette derived from industrially polluted soil has yielded a new member of the highly adaptable α + β barrel family of transport proteins and enzymes (Fig. 3b). The dimeric structure of Bal32a (PDB 1TUH (Robinson et al. 2005)) features cone-shaped binding pockets within each barrel, common to this superfamily for engaging small hydrophobic substrates or peptides. The Bal32a structure is, however, unique in that each of its central cavities is unusually deep and isolated from solvent by a flexible loop. A potential catalytic site of clustered polar groups within the barrel is equivalently positioned to corresponding active sites within structurally related enzymes. Although these enzymes likely share a common evolutionary ancestry, with preservation of active site features internal to the barrel, their very low overall sequence relationship to Bal32a (<20 % identity) suggests a wide adaptation of the α + β barrel fold for varied demands. Within its originating cassette array, the Bal32a gene cassette was immediately adjacent to a second cassette, Bal32b, encoding a likely membrane-associated protein. This suggests the two components may well function in concert as a combined binding and transport system.
Gene Cassettes Encode Novel Protein Folds with Distinct Binding Features

Regardless of the degree of novelty displayed, all gene cassette-derived structures appear to be consistent with adaptive functions (e.g., secondary metabolism, DNA modification) and possibly selective advantage (e.g., drug resistance). A tendency to form homo-oligomers has been a consistent observation across this structural survey of cassette proteins, with only one exception to date (the cationic drug-binding protein Cass2 from Vibrio (Deshpande et al. 2011)). This clear preference for oligomerization may be a consequence of the relatively short sequence lengths of gene cassettes within arrays, stabilizing small protein modules which can perhaps also be readily and flexibly mixed for different functions. Such modules may readily combine with

▶ Lateral Gene Transfer and Microbial Diversity
▶ Metagenomic Potential for Understanding Horizontal Gene Transfer

References

Bornberg-Bauer E, Alba MM. Dynamics and adaptive benefits of modular protein evolution. Curr Opin Struct Biol. 2013;23(3):459–66.
Boucher Y, Stokes HW. The roles of lateral gene transfer and vertical descent in vibrio evolution. In: Fabiano Lopes Thompson BA, Swings JG, editors. The biology of vibrios. Washington, DC: ASM Press; 2006. p. 84–94.
Boucher Y, Nesbo CL, Joss MJ, Robinson A, Mabbutt BC, Gillings MR, et al. Recovery and evolutionary analysis of complete integron gene cassette arrays from Vibrio. BMC Evol Biol. 2006;6:3.
Boucher Y, Labbate M, Koenig JE, Stokes HW. Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol. 2007;15(7):301–9.
Cambray G, Guerout A, Mazel D. Integrons. Annu Rev Genet. 2010;44:141–66.
Cohen O, Gophna U, Pupko T. The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol. 2011;28(4):1481–9.
Deshpande CN, Harrop SJ, Boucher Y, Hassan KA, Di Leo R, Xu X, et al. Crystal structure of an integron gene cassette-associated protein from Vibrio cholerae identifies a cationic drug-binding module. PLoS One. 2011;6(3):e16934.
Elsaied H, Stokes HW, Nakamura T, Kitamura K, Fuse H, Maruyama A. Novel and diverse integron integrase genes and integron-like gene cassettes are prevalent in deep-sea hydrothermal vents. Environ Microbiol. 2007;9(9):2298–312.
Hall RM. Integrons and gene cassettes: hotspots of diversity in bacterial genomes. Ann N Y Acad Sci. 2012;1267:71–8.
Joss MJ, Koenig JE, Labbate M, Polz MF, Gillings MR, Stokes HW, et al. ACID: annotation of cassette and integron data. BMC Bioinformatics. 2009;10:118.
Koenig JE, Boucher Y, Charlebois RL, Nesbo C, Zhaxybayeva O, Bapteste E, et al. Integron-associated gene cassettes in Halifax Harbour: assessment of a mobile gene pool in marine sediments. Environ Microbiol. 2008;10(4):1024–38.
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36(21):6688–719.
Labbate M, Boucher Y, Luu I, Chowdhury PR, Stokes HW. Integron associated mobile genes: Just a collection of plug in apps or essential components of cell network hardware? Mob Genet Elements. 2012;2(1):13–8.
Lehmann C, Lim K, Chalamasetty VR, Krajewski W, Melamud E, Galkin A, et al. The HI0073/HI0074 protein pair from Haemophilus influenzae is a member of a new nucleotidyltransferase family: structure, sequence analyses, and solution studies. Proteins. 2003;50(2):249–60.
Robinson A, Wu PS, Harrop SJ, Schaeffer PM, Dosztanyi Z, Gillings MR, et al. Integron-associated mobile gene cassettes code for folded proteins: the structure of Bal32a, a new member of the adaptable alpha + beta barrel family. J Mol Biol. 2005;346(5):1229–41.
Robinson A, Guilfoyle AP, Harrop SJ, Boucher Y, Stokes HW, Curmi PM, et al. A putative house-cleaning enzyme encoded within an integron array: 1.8 Å crystal structure defines a new MazG subtype. Mol Microbiol. 2007;66(3):610–21.
Robinson A, Guilfoyle AP, Sureshan V, Howell M, Harrop SJ, Boucher Y, et al. Structural genomics of the bacterial mobile metagenome: an overview. Methods Mol Biol. 2008;426:589–95.
Rothschild LJ, Mancinelli RL. Life in extreme environments. Nature. 2001;409(6823):1092–101.
Rowe-Magnus DA, Guerout AM, Biskri L, Bouige P, Mazel D. Comparative analysis of superintegrons: engineering extensive genetic diversity in the Vibrionaceae. Genome Res. 2003;13(3):428–42.
Roy Chowdhury P, Boucher Y, Hassan KA, Paulsen IT, Stokes HW, Labbate M. Genome sequence of Vibrio rotiferianus strain DAT722. J Bacteriol. 2011;193(13):3381–2.
Stokes HW, Holmes AJ, Nield BS, Holley MP, Nevalainen KM, Mabbutt BC, et al. Gene cassette PCR: sequence-independent recovery of entire genes from environmental DNA. Appl Environ Microbiol. 2001;67(11):5240–6.
Sureshan V, Deshpande CN, Boucher Y, Koenig JE, Stokes HW, Harrop SJ, et al. Integron gene cassettes: a repository of novel protein folds with distinct interaction sites. PLoS One. 2013;8(1):e52934.

IPRStats, Overview

Iddo Friedberg
Department of Microbiology, Miami University, Oxford, OH, USA

Abbreviations

EBI      European Bioinformatics Institute
GO       Gene Ontology
IMG/M    Integrated Microbial Genome/Metagenomics
pHMM     Profile hidden Markov model
PSSM     Position-specific scoring matrix
SQL      Structured Query Language
XML      Extensible Markup Language

Definition

IPRStats is a lightweight platform-independent open-source licensed software package for storing and visualizing metagenomic data annotated by InterProScan. IPRStats is unique in that it provides the user with the same annotation choices offered by the popular open reading frame annotation pipeline, InterProScan. IPRStats can be installed either as a Web server or as a stand-alone software.
Availability
IPRStats, Overview, Fig. 1 Overview of IPRStats. (a) Protein sequence information as a single FASTA file submitted to InterProScan (one or more proteins). (b) InterProScan XML output imported into IPRStats SQL database. (c) Display of sequence signature statistics. (d) Graphic display. (e) Table display. (f) Toggle between results from different InterPro member databases (Reproduced from seven under BMC CC 2.0 license, copyright owned by authors)
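
As an illustration of steps (b) and (c) in the figure, the following minimal sketch stores signature hits in an SQL table and counts how many distinct proteins carry each signature. It is not the IPRStats schema or code: the table layout, the function name `signature_counts`, and the example accessions are assumptions made for this sketch, and parsing of the InterProScan XML itself is deliberately left out.

```python
import sqlite3


def signature_counts(hits):
    """Count distinct proteins per (member database, signature).

    hits: iterable of (protein_id, member_db, signature_id) tuples.
    The three-column layout is a hypothetical schema for illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE hit (protein_id TEXT, member_db TEXT, signature_id TEXT)"
    )
    con.executemany("INSERT INTO hit VALUES (?, ?, ?)", hits)
    rows = con.execute(
        "SELECT member_db, signature_id, COUNT(DISTINCT protein_id) AS n "
        "FROM hit GROUP BY member_db, signature_id "
        "ORDER BY n DESC, signature_id"
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    example_hits = [
        ("p1", "Pfam", "PF00001"),
        ("p2", "Pfam", "PF00001"),
        ("p1", "Pfam", "PF00002"),
        ("p3", "SMART", "SM00001"),
    ]
    for member_db, signature_id, n in signature_counts(example_hits):
        print(member_db, signature_id, n)
```

Grouping by member database as well as signature mirrors panel (f), where results from different InterPro member databases are viewed separately.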
References

Eddy S. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–63.
Kelly RJ, Vincent DE, Friedberg I. IPRStats: visualization of the functional potential of an InterProScan run. BMC Bioinformatics. 2010;11:S13.
McDowall J, Hunter S. InterPro protein classification. Methods Mol Biol. 2011;694:37–47. doi: 10.1007/978-1-60761-977-2_3. PMID:21082426.
Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5(12):e1000605. doi: 10.1371/journal.pcbi.1000605. Epub 2009 Dec 11. PMID:20011109.
Zdobnov EM, Apweiler R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17(9):847–8.

I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments

Algorithms for Efficient In Silico Identification and Classification of Ribosomal RNA Gene Fragments in Metagenomic Datasets

Sharmila Mande, Tarini Shankar Ghosh and Mohammed Monzoorul Haque
Biosciences R & D, TCS Innovation Labs, Tata Research Development & Design Centre, Tata Consultancy Services Limited, Pune, MH, India

Synonyms

Classification of 16S rRNA gene fragments; In silico identification of 16S rRNA gene fragments

Introduction

Recent advances in high-throughput sequencing technologies have enabled life-science researchers to rapidly sequence and characterize the entire genomic content of microbial communities residing in diverse ecological niches. A key advantage of characterizing microbial communities in this fashion is that it enables the concomitant characterization of several microbes (constituting the community), most of which cannot be studied using traditional culture-based genomic approaches. Moreover, this approach (referred to as “Metagenomics”) is useful in understanding the interaction patterns between the resident microbes as well as between the microbes and the environment.

Characterizing and comparing the taxonomic as well as functional diversity of microbial communities (obtained from varied ecological niches) are the broad objectives of metagenomic projects. These objectives are attained using two well-established approaches (Fig. 1). In the first approach (commonly referred to as the amplicon-based approach), a quick snapshot of taxonomic diversity of a given environmental sample is obtained by specifically amplifying, cloning, and sequencing genes or gene fragments corresponding to one or more phylogenetic marker genes. The 16S rRNA gene is the most widely used phylogenetic marker gene employed in such amplicon-based approaches. Subsequently, bioinformatic approaches are used for taxonomically classifying these sequenced genes or gene fragments. The relative proportions of various taxonomic groups present in the metagenomic dataset (representing a given environmental sample) are then obtained from the identified taxa. In the second approach
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 1 An overview of different approaches adopted by metagenomic projects for profiling the taxonomic and/or functional diversity of a given environment. Advantages and limitations of each approach are also summarized. Black regions depicted in the genomic fragments correspond to entire or a fragment of 16S rRNA gene
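
The final step of the amplicon-based approach, deriving relative proportions of taxonomic groups from the taxa assigned to individual sequences, amounts to simple counting. A minimal sketch follows; the function name `relative_proportions` and the example labels are assumptions for illustration, and the per-sequence taxon assignments are taken as already produced by some classifier.

```python
from collections import Counter


def relative_proportions(assignments):
    """Return {taxon: fraction of sequences assigned to it}.

    assignments: iterable with one taxon label per classified sequence.
    An empty input yields an empty dict.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    labels = ["Firmicutes", "Firmicutes", "Proteobacteria", "Bacteroidetes"]
    for taxon, fraction in relative_proportions(labels).items():
        print(f"{taxon}\t{fraction:.2f}")
```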
(commonly referred to as the shotgun-sequencing approach), the genomic content of a given environmental sample is extracted and sequenced. Genomic fragments (referred to as “reads”) obtained from the sequencing platforms are then computationally analyzed in terms of taxonomy and function. Since the shotgun-sequencing approach generates millions of reads originating from random positions/locations within the genomes of various microbes constituting a given environmental sample, a subset of these reads (hereafter referred to as 16S rDNA fragments) is expected to originate from genomic regions that specifically encompass the 16S rRNA genes of the resident microbes. Identifying 16S rDNA fragments (from within millions of reads constituting a typical metagenomic dataset) and subsequently classifying them is therefore expected to aid in quickly deciphering the taxonomic diversity of a given metagenomic dataset. The following sections describe two algorithms, namely, i-rDNA and C16S, which are used for the identification and taxonomic classification of 16S rDNA fragments in metagenomic datasets, respectively.

i-rDNA: Algorithm for Identification of 16S rRNA Gene Fragments in Metagenomic Datasets
One of the simplest ways of identifying 16S rDNA fragments in a metagenomic dataset is by performing similarity searches of all reads constituting the dataset against a database containing known 16S rRNA gene sequences. Such similarity searches are typically performed using popular algorithms such as BLAST (Altschul et al. 1990) and BLAT (Kent 2002). Similarity of a read with sequences in the database is evaluated based on how it aligns with these sequences. Reads having significant similarity (similarity being defined in terms of alignment parameters such as e-value, identity, and alignment length) with database sequences are identified as 16S rDNA fragments. Since this approach enables identification as well as taxonomic classification of 16S rDNA fragments, it is currently incorporated as a standard procedure in popular metagenomic analysis platforms such as MG-RAST (Meyer et al. 2008) and CAMERA (Seshadri et al. 2007). Given the robustness of BLAST/BLAT algorithms, this approach has high sensitivity in identifying/classifying 16S rDNA fragments (even for reads with lengths <100 bp) originating from known and characterized genomes.

The BLAST-based approach, although it identifies 16S rDNA sequences with high sensitivity, requires huge compute power for performing alignments of millions of metagenomic reads with thousands of reference 16S rRNA gene sequences. This makes it unsuitable for practical use in research labs lacking access to high-end computational infrastructure. Another alignment-based methodology attempts to address/overcome this limitation by employing hidden Markov models (HMMs) that represent the universally conserved sequence architecture of the 16S rRNA gene (Huang et al. 2009). These HMMs, built separately for bacterial and archaeal kingdoms, reflect the sequence conservation pattern observed within the 16S rRNA genes of microbes belonging to these two lineages. For identification of 16S rDNA fragments, reads in a metagenomic dataset are individually aligned to these two HMMs. Reads obtaining significant alignment scores are then tagged as 16S rDNA fragments. Given that the alignments of individual reads are done only against two HMMs, rather than against thousands of individual reference 16S rRNA gene sequences (as in the case of the BLAST-based approach), the execution time as well as the requirements of compute power are significantly reduced. Moreover, this approach is observed to achieve a level of detection sensitivity similar to that of the BLAST-based approach.

Though the above-described HMM-based approach represents a rapid way of identifying 16S rDNA fragments (as compared to the BLAST-based approach), it still involves performing alignments of each individual read (in metagenomic datasets) against two HMMs. Consequently, adopting the HMM-based approach (on a standard workstation) for identification of 16S rDNA fragments within huge metagenomic datasets (e.g. the Human Microbiome Project containing more than
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 2 Available approaches for identification of 16S rRNA gene fragments in metagenomic datasets obtained using shotgun sequencing. Advantages and limitations of each approach are also summarized. Black regions depicted in the genomic fragments correspond to entire or a fragment of 16S rRNA gene
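
Among the approaches summarized in this figure, the composition-based prefilter used by i-rDNA can be sketched in a few lines: each read is turned into a 256-dimensional tetranucleotide frequency vector, compared with cluster centroids from a mapping file, and passed on to the HMM search only if enough of the nearest clusters are tagged “probable 16S”. The sketch below is not the published implementation. The Euclidean distance, the number of nearest clusters considered (`n_nearest`), the cut-off (`min_fraction`), and the in-memory representation of the mapping file are all assumptions made for illustration; the source text specifies none of them.

```python
from itertools import product
from math import dist

# All 4^4 = 256 tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_vector(seq):
    """Return the 256-dimensional tetranucleotide frequency vector of seq."""
    seq = seq.upper()
    counts = [0] * 256
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 256
    return [c / total for c in counts]


def is_probable_16s(read, mapping, n_nearest=3, min_fraction=0.5):
    """Decide whether a read should be passed to the downstream HMM search.

    mapping: list of (centroid_vector, tag) pairs, where tag is either
    "probable 16S" or "non-16S" (an assumed stand-in for the mapping file).
    n_nearest and min_fraction are hypothetical parameters.
    """
    vector = tetra_vector(read)
    nearest = sorted(mapping, key=lambda m: dist(vector, m[0]))[:n_nearest]
    if not nearest:
        return False
    tagged = sum(1 for _, tag in nearest if tag == "probable 16S")
    return tagged / len(nearest) >= min_fraction


if __name__ == "__main__":
    # Toy centroids built from artificial sequences, not from real genomes.
    mapping = [
        (tetra_vector("ACGT" * 50), "probable 16S"),
        (tetra_vector("GGGGCCCC" * 25), "non-16S"),
    ]
    print(is_probable_16s("ACGTACGTACGTACGTACGT", mapping, n_nearest=1))
```

In a real setting the centroids would come from the one-time k-means preprocessing of genomic fragments, and only reads for which `is_probable_16s` returns true would be aligned against the bacterial and archaeal HMMs.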
32 million sequences) is expected to take several hours to a few days. The recently published i-rDNA method (Mohammed et al. 2011) has addressed this issue by employing a sequence composition-based step prior to the similarity search step performed against the bacterial and archaeal HMMs (Fig. 2). This precursor step is based on the following premise/observations. Given that significant portions of 16S rRNA gene sequences are universally conserved across all prokaryotic lineages, genomic regions encompassing 16S rRNA genes are characterized by distinct sequence compositions (in terms of oligonucleotide usage patterns) as compared to the other regions of the genome. The i-rDNA method utilizes this observation to first identify a subset of reads which have an oligonucleotide composition similar to that of 16S rRNA gene sequences and subsequently provide this small subset of reads as input to the HMM-based approach. This step of prefiltering data (based on compositional characteristics) essentially reduces the volume of data which are provided as input to the HMM alignment step. The finer algorithmic details of the i-rDNA method are explained in the subsequent paragraphs.

The i-rDNA method first captures the oligonucleotide usage patterns which are specific to 16S rRNA gene sequences. This procedure is performed as a one-time preprocessing step. For this purpose, genomic fragments (of lengths 1,000 bp each) from all completely sequenced prokaryotic genomes are first obtained. Each fragment is then represented as a 256-dimensional vector containing the frequencies of all possible tetranucleotides. Subsequently, vectors
corresponding to all fragments are clustered (using k-means clustering algorithm) based on their tetranucleotide frequency patterns. This generates a feature vector space with a number of clusters. Centroids of the clusters are then calculated based on the fragments contained in them. Each cluster in the feature vector space is thus represented by its centroid. Given the unique sequence composition of the 16S rRNA gene, genomic fragments encompassing this gene are localized to a subset of these clusters. In the preprocessing step of the i-rDNA method, clusters containing significant proportions of 16S rRNA gene fragments (as compared to other clusters) are identified and tagged as “probable 16S” clusters (Fig. 3). This information is stored in the form of a mapping file that contains cluster centroids along with their respective tags (either probable 16S or non-16S).

The i-rDNA method identifies 16S rDNA fragments (from amongst all reads constituting a given metagenomic dataset) in the following manner. For each read, the distances of its tetranucleotide frequency vector to all the cluster centroids in the mapping file (obtained as described in the previous paragraph) are first computed. This step helps in identification of a set of clusters having tetranucleotide composition most similar to that of the read. If a significant proportion of the identified clusters are observed to be pre-tagged (in the mapping file) as “probable 16S,” the read is classified as a “probable 16S rDNA” fragment (Fig. 3). Only those reads classified as “probable 16S rDNA” fragments are provided as input to the downstream HMM search. Adoption of the above strategy in the published study (Mohammed et al. 2011) indicated a six to ten times reduction in the number of sequences provided as input to the HMM search, thereby drastically reducing the overall time for identifying 16S rDNA fragments (in metagenomic datasets). Furthermore, this noticeable reduction in the overall analysis time was observed to be achieved without any significant loss in detection sensitivity (Mohammed et al. 2011).

Table 1 provides an additional comparison of detection sensitivity and execution time for three approaches, namely, BLAST based, HMM based, and i-rDNA, for four simulated metagenomic datasets. These datasets were generated by providing 35 prokaryotic genomes as input to the MetaSim sequence simulator software (Richter et al. 2008). Sequences in each of these datasets simulated the lengths and error rates of four popular sequencing platforms, viz., Sanger (sequence length approximately 800 bp), 454-titanium (~400 bp), 454-standard (~250 bp), and Illumina (~110 bp). These comparative evaluations were performed on a standard Linux workstation having a 2.33 GHz dual core processor and 2 GB of RAM. Results in this table indicate the utility of the i-rDNA method in reducing the overall time taken for identification of 16S rDNA fragments in metagenomic datasets. The i-rDNA method is observed to be 50 and 8 times faster in identifying 16S rDNA fragments as compared to the BLAST-based approach and the HMM-based meta-rna program, respectively. As can be observed, this reduction in time for identification is not accompanied by a noticeable decrease in detection sensitivity.

C16S: Algorithm for Taxonomic Classification of 16S rRNA Gene Fragments in Metagenomic Datasets
Extraction and classification of 16S rRNA gene fragments is one of the quickest ways to estimate taxonomic diversity of any microbial community. Due to the presence of several characteristic features, the 16S rRNA gene has been used as an ideal taxonomic marker. Firstly, this gene is ubiquitously present within the genomes of all prokaryotic organisms. Secondly, given its role in key cellular processes (e.g., protein synthesis), the probability of this gene being involved in lateral gene transfer events is also minimal (Jain et al. 1999; Daubin et al. 2003). This property enables its use as a phylogenetic marker to study the evolutionary patterns in diverse prokaryotic lineages with high confidence. Furthermore, 16S rRNA genes are characterized by highly conserved regions (U1-U8) that flank hypervariable regions (V1-V9) (Jonasson et al. 2002). Universal/customized primers designed against these conserved stretches (which are adjacent to the
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments 321 I
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 3 A conceptual overview of the framework used by the i-rDNA method. (a) A schematic representation of the preprocessing step of i-rDNA method. A feature vector space is generated by performing a k-means clustering (using tetranucleotide frequencies) of genomic fragments from all completely sequenced microbial genomes. In this feature vector space, clusters C3, C5, C6, and C7, containing genomic fragments harboring portions of 16S rRNA gene in significant proportions, are tagged as “probable 16S” clusters. Red dots: fragments originating from genomic regions harboring portions of 16S rRNA gene. Blue dots: fragments not containing any portion of the 16S rRNA gene. Black dots: centroids corresponding to each of the clusters in the feature vector space. (b) Identification workflow of the i-rDNA method. Tetranucleotide frequency vectors corresponding to query reads
hypervariable regions) facilitate specific isolation, PCR-based amplification, and subsequent sequencing of the entire length (or specific portions) of 16S rRNA genes. The hypervariable regions within the sequenced 16S rRNA gene fragments are specific to each organism and thus serve as “taxonomic barcodes.” These “barcodes” can be used to classify 16S rRNA gene fragments sampled from a given environment into different taxonomic groups.
Various strategies are currently employed for classification of 16S rDNA fragments (Fig. 4). Overall, these strategies involve comparing the sequences and/or the compositions of 16S rDNA fragments with sequences/models corresponding to known taxonomic groups. Details of these strategies are described below. The BLAST-based approach (described in the previous section) is also employed for classifying 16S rDNA fragments. For this purpose, 16S rDNA fragments having significant hits with reference 16S rRNA gene sequences (from known and characterized microbes) are assigned to the taxa corresponding to the best hit(s). In this process, the quality of the BLAST hit (obtained between the query 16S rDNA fragment and reference 16S rRNA gene sequence) is judged based on user-specified thresholds of alignment parameters such as bit score, e-value, identity percentage, etc. Apart from the huge compute power requirement (for performing the alignment step), the BLAST-based approach has the following limitation. In a given metagenomic dataset, a large proportion of query 16S rDNA fragments typically originate from hitherto unknown taxa. Such sequences may belong to an entirely new species or genus or family or order or class or even a new phylum. Attempting to map such novel query 16S rDNA fragments to known taxonomic groups is expected to result in incorrect taxonomic classification. Although using a stringent set of BLAST thresholds (for evaluating alignment quality prior to assignment) is expected to reduce the misclassification rate (to some extent), a large number of 16S rDNA fragments may remain unassigned/unclassified. It may be noted that various read mapping algorithms, e.g., BWA (Li and Durbin 2010), Bowtie (Langmead et al. 2009), etc., have also been used for aligning query 16S rDNA fragments with sequences in reference databases. The premise and the overall methodology for inferring the taxonomic origin of query sequences, however, remain the same as in the BLAST-based approach.
Inferring the taxonomic origin of query 16S rDNA sequences can also be performed by mapping/aligning them to precomputed multiple sequence alignments (MSAs). These MSAs are generated by pre-aligning well-annotated 16S rRNA gene sequences belonging to organisms of known taxonomic lineages. A detailed description of the methods adopting such strategies is provided in another review (Sun et al. 2011). MSA-based approaches, though observed to provide robust taxonomic inferences, are critically dependent on the quality and the taxonomic coverage of the reference sequences which are used for generating the precomputed alignments. Furthermore, given the algorithmic complexity of the process of performing/generating multiple sequence alignment(s), an enormous amount of time and compute power is typically required for MSA-based analyses.
The widely popular RDP classifier (Wang et al. 2007) attempts to address the limitations associated with the above-described BLAST-based as well as MSA-based approaches. This method taxonomically classifies a query 16S rDNA fragment by comparing its compositional properties (e.g., oligonucleotide usage pattern)
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 3 (continued) (R1 and R2) are first mapped to the feature vector space (generated in the preprocessing phase of i-rDNA as described above in (a)). Read R1 maps to an area (within the feature vector space) that is in close proximity to cluster centroids C3, C5, C6, and C7 (all of which are pre-tagged as “probable 16S” clusters). Consequently, read R1 is identified as a “probable 16S rDNA” fragment. Read R2 is in close proximity to clusters C8, C9, and C10 (all of which are pre-tagged as “non-16S” clusters). Read R2 is therefore identified as a “non-16S rDNA” fragment
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Table 1 Performance of i-rDNA, meta-rna (an HMM-based identification method), and BLAST in terms of detection sensitivity and execution time. The approximate length of reads constituting each of the four simulated test datasets is indicated in brackets
                                           Detection sensitivity (%)      Execution time (in seconds)
Test dataset              Number of reads  i-rDNA   meta_rna   BLAST      i-rDNA   meta_rna   BLAST
Illumina (~110 bp)        1,000,000        93.1     94.6       98.1       102      1,110      6,317
454-Standard (~250 bp)    400,000          90.6     96.4       99.2       97       1,026      6,681
454-Titanium (~400 bp)    250,000          91.3     97.1       99.6       92       947        6,128
Sanger (~800 bp)          100,000          87.6     95.2       99.8       105      929        5,783
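The speed-up factors implied by the execution times in Table 1 can be recomputed directly. The short sketch below only divides the tabulated times; the dictionary layout and the function name `speedups` are choices made for this illustration.

```python
# Execution times in seconds, copied from Table 1.
times = {
    "Illumina (~110 bp)":     {"i-rDNA": 102, "meta_rna": 1110, "BLAST": 6317},
    "454-Standard (~250 bp)": {"i-rDNA": 97,  "meta_rna": 1026, "BLAST": 6681},
    "454-Titanium (~400 bp)": {"i-rDNA": 92,  "meta_rna": 947,  "BLAST": 6128},
    "Sanger (~800 bp)":       {"i-rDNA": 105, "meta_rna": 929,  "BLAST": 5783},
}

def speedups(row):
    """Return how many times faster i-rDNA is than each other method for one dataset."""
    return {m: row[m] / row["i-rDNA"] for m in ("meta_rna", "BLAST")}

for name, row in times.items():
    s = speedups(row)
    print(f"{name}: {s['BLAST']:.1f}x vs BLAST, {s['meta_rna']:.1f}x vs meta_rna")
```

On these tabulated times the ratios come out at roughly 55 to 69 relative to BLAST and roughly 9 to 11 relative to meta_rna, depending on the dataset.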
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 4 Different
approaches available for the taxonomic classification of 16S rRNA gene fragments
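Of the approaches in Fig. 4, the composition-based one (word frequencies scored with a Naive Bayesian model, followed by bootstrapping) lends itself to a compact sketch. The code below is a simplified illustration and not the RDP classifier itself: the smoothing formula, the number of bootstrap trials, and the fraction of words sampled per trial are assumptions, and only genus-level support is computed rather than a score for every taxon in the lineage.

```python
import math
import random
from collections import Counter

def kmers(seq, k=8):
    """Return the set of overlapping k-mer words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(references, k=8):
    """references: dict mapping genus name -> list of training sequences."""
    model = {}
    for genus, seqs in references.items():
        present = Counter()
        for s in seqs:
            present.update(kmers(s, k))  # number of sequences containing each word
        model[genus] = (present, len(seqs))
    return model

def log_score(words, genus_model):
    present, n = genus_model
    # Simple smoothing (an assumption) so that unseen words do not give log(0).
    return sum(math.log((present[w] + 0.5) / (n + 1)) for w in words)

def classify(seq, model, k=8, trials=100, rng=None):
    """Return (best genus, bootstrap support in [0, 1])."""
    rng = rng or random.Random(0)
    words = sorted(kmers(seq, k))
    best = max(model, key=lambda g: log_score(words, model[g]))
    subset = max(1, len(words) // 8)  # words resampled per trial (assumed fraction)
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(words) for _ in range(subset)]
        if max(model, key=lambda g: log_score(sample, model[g])) == best:
            hits += 1
    return best, hits / trials
```

A query would then be reported at genus level only when the bootstrap support exceeds a user-chosen threshold, and otherwise at a less specific rank.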
with models generated using compositional features of sequences of known taxonomic lineages. For this purpose, it first creates (as a preprocessing step) Naive Bayesian models that capture 8-mer oligonucleotide word frequencies in 16S rDNA sequences belonging to known genera. During the classification step, for a given query sequence, the RDP classifier first identifies a model (and the corresponding genus) whose 8-mer word frequencies are “most” similar to that of the query sequence. The classifier then employs a bootstrapping procedure to compute a confidence score of assignment to each taxon belonging to the taxonomic lineage of the
identified genus. The query sequence is then assigned to a taxon (within this lineage) that is at the most specific taxonomic level and also generates a confidence score that exceeds the user-specified confidence score threshold. Besides being alignment-free, a major advantage of the RDP classifier is the bootstrapping procedure employed to compute the confidence scores. The overall strategy of this procedure ensures the accurate assignment of novel 16S rDNA sequences (i.e., originating from hitherto unknown organisms) to related taxa at appropriately higher taxonomic levels. However, it is important to note that the overall process of classification involves scoring and identifying the “best” taxonomic lineage corresponding to the query sequence. This scoring, however, does not take into account the actual level of compositional similarity between the model corresponding to the “best” taxonomic lineage and the query sequence. Consequently, in cases where 16S rDNA fragments originate from taxonomic lineages that have minor representation in existing 16S rDNA databases, the classification accuracy of the RDP classifier has been shown to decrease (Biers et al. 2009).
In contrast to all three methods described above, the recently published C16S algorithm (Ghosh et al. 2012) employs genus-specific HMMs for the taxonomic classification of 16S rDNA fragments. The overall classification strategy is based on the following premise. 16S rDNA sequences contain alternating conserved and hypervariable regions. The latter regions are characterized by clade-specific sequence variation patterns. As a preprocessing step, the C16S algorithm captures these clade-specific patterns at the taxonomic level of genus. For this purpose, genus-specific HMMs are first generated and subsequently utilized for classifying query 16S rDNA fragments. During the classification phase, a query 16S rDNA fragment is first mapped to these precomputed genus-specific HMMs and the genus corresponding to the best scoring HMM is identified. The score obtained in this process is then utilized for dynamically identifying an appropriate level of taxonomic assignment for each query sequence. The strategy of correlating the HMM score with the taxonomic level of assignment is based on the empirical observation that the HMM score decreases with increasing taxonomic divergence between the taxa corresponding to the query and the HMM (Ghosh et al. 2012).
The classification methodology adopted by C16S has the following advantages. First, employing representative genus-specific HMMs significantly reduces the time and compute power as compared to that typically required by BLAST-based or MSA-based classification approaches. Second, the use of precomputed threshold scores in C16S ensures assignment of query sequences (originating from unknown organisms) at appropriately higher taxonomic levels, thereby reducing its misclassification rate as compared to that of the RDP classifier. Finally, given that the identified taxonomic levels are specific to the extent possible, the overall specificity of assignments by C16S is not compromised.
The above observations (with respect to classification efficiency of C16S) are also reflected in the results of a comparative evaluation between the C16S algorithm and the RDP classifier (run with default parameters). This evaluation was performed using five simulated 16S rDNA datasets (each comprised of 30,000 sequences). While one of these datasets consisted of full-length 16S rRNA gene sequences from taxonomically diverse microbes, the others consisted of 16S rDNA fragments that mimicked the length and the sequencing error rates associated with four popular sequencing platforms, viz., Sanger, 454-titanium, 454-standard, and Illumina. Furthermore, for each dataset, evaluation was performed in four different simulated metagenomic scenarios, wherein the input 16S rDNA sequences mimicked those originating from entirely new genera, families, orders, and classes, respectively. These simulated scenarios were generated by progressively removing the models corresponding to the genus, family, order, and class of the source organisms
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Table 2 Distribution of taxonomic assignments obtained using C16S and RDP classifier for five simulated metagenomic datasets (each comprised of 30,000 sequences)
                       Database scenario
                       Minus genus     Minus family    Minus order     Minus class
Assignment category    C16S    RDP     C16S    RDP     C16S    RDP     C16S    RDP
Illumina dataset (average read length ~110 bp)
Correct                86.7    81.2    91.2    86.2    88.7    86.2    85.3    84.9
Higher levels          16.7    10.2    21.2    16.6    39.2    34.7    65.7    63.9
Intermediate levels    56.4    56.7    52.2    48.5    25.6    26.3    0       0
Specific levels        13.6    14.3    17.8    21.1    23.9    25.2    19.6    21
454-Standard dataset (average read length ~250 bp)
Correct                95.7    80.5    92.6    88.2    84.5    80.7    84.6    84.3
Higher levels          12.6    4.5     19.8    8.3     37.8    31.5    60.4    60.2
Intermediate levels    56.4    46.7    47.8    45.9    18.4    22.6    0       0
Specific levels        26.7    29.3    25      34      28.3    26.6    24.2    24.1
454-Titanium dataset (average read length ~400 bp)
Correct                94.4    79.5    92.6    84.3    87.2    70.3    86.4    79.5
Higher levels          1.2     2.7     20.3    4.1     22.2    7.3     47.2    43.4
Intermediate levels    56.4    42.5    45.9    44.2    18.4    21.3    0       0
Specific levels        36.8    34.3    26.4    36.0    46.6    41.7    39.2    36.1
Sanger dataset (average read length ~800 bp)
Correct                82.2    58.1    88.4    73.1    90.1    59.6    88.2    68.6
Higher levels          11.2    2.1     19.9    3.1     37.0    6.0     48.2    32.4
Intermediate levels    34.1    20.2    41.1    32.2    6.4     25.8    0       0
Specific levels        36.9    35.8    27.4    37.8    46.7    27.8    40.0    36.2
Dataset with full-length 16S rRNA gene sequences
Correct                90.1    57.6    88.8    64.9    78.8    48.3    70.7    51.8
Higher levels          3.3     2.1     6.4     3.4     11.2    4.4     19.8    13.8
Intermediate levels    44      16      47.3    26.4    19.6    15.8    0       0
Specific levels        42.8    39.5    35.1    35.1    48.0    28.1    50.9    38.0
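The assignment categories reported in Table 2 follow the definitions given in the table notes, and they can be expressed as a small decision function. The sketch below is an illustration under assumptions: the rank list, the function name `categorize`, and the example lineage are introduced here, and the function presumes that no assignment is made at or below the rank whose models were removed.

```python
# Ranks from least to most specific; the first three count as "higher levels".
RANKS = ["root", "cellular organisms", "superkingdom", "phylum",
         "class", "order", "family", "genus"]
HIGHER = {"root", "cellular organisms", "superkingdom"}

def categorize(assigned_rank, assigned_taxon, source_lineage, removed_rank):
    """Categorize one assignment.

    source_lineage: dict mapping rank -> taxon for the true source organism.
    removed_rank: the least specific rank whose models were removed
                  ("genus", "family", "order", or "class").
    """
    # Correct only if the assigned taxon lies on the path from root to source genus.
    if source_lineage.get(assigned_rank) != assigned_taxon:
        return "incorrect"
    if assigned_rank in HIGHER:
        return "higher levels"
    # The "specific level" is the rank immediately above the removed one.
    specific_rank = RANKS[RANKS.index(removed_rank) - 1]
    if assigned_rank == specific_rank:
        return "specific levels"
    return "intermediate levels"
```

Under these definitions the "minus class" scenario leaves no rank between the phylum and the specific level, which is consistent with the zero intermediate-level entries in the "Minus class" columns of Table 2.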
(corresponding to the query 16S rDNA fragments) from the databases utilized by the RDP classifier as well as the C16S algorithm. Results of this evaluation (Table 2) indicate improved levels of classification accuracy of the C16S algorithm as compared to the RDP classifier. Interestingly, the improvement in performance, with respect to both classification accuracy and specificity, is especially pronounced in simulated scenarios wherein query sequences originate from hitherto unknown genomes lacking counterpart models at the levels of order and class (in the databases of both algorithms). Furthermore, for full-length and Sanger datasets, the classification accuracy and specificity of C16S are observed to be noticeably better than those of the RDP classifier.
Correct assignments are assignments made to taxa lying in the path between the root and the source genus of the query sequence. Correct assignments are further subcategorized into “specific levels,” “intermediate levels,” and “higher levels” as described below.
(a) Specific levels: If HMMs corresponding to genus or family or order or class are absent from the reference database, assignment of a query sequence is classified as “correct” at “specific level,” only if the assignment is
made to a correct taxon at the immediate higher taxonomic level. For instance, in a “new family” simulated database scenario (wherein HMMs corresponding to the source family of the query 16S rDNA fragment are absent from the reference database), an assignment of the query sequence to the corresponding order is categorized as correct at specific level.
(b) Intermediate levels: Correct assignments to taxa lying between the phylum level and the specific level (as described above) are classified as “correct” at “intermediate levels.”
(c) Higher levels: Assignments to root or cellular organisms or to superkingdom levels are categorized as correct assignments at “higher levels.”
Summary
One of the major objectives of most metagenomic projects is to profile and subsequently compare the spatial and temporal variations of microbial communities residing in diverse ecological niches. Analyzing such variations helps in the identification of microbial groups that confer specific characteristics to a given environment in terms of phenotype/function. Development of efficient in silico methods for identifying and classifying 16S rRNA genes (or gene fragments) from metagenomic datasets (obtained using amplicon-based or shotgun sequencing approach) is therefore an important computational problem. This article describes two recently reported methods, viz., i-rDNA and C16S, that cater to the tasks of identification and classification of 16S rDNA fragments in metagenomic datasets. The i-rDNA method represents an approach which is efficient in terms of execution speed as well as detection sensitivity. Given its ability to directly identify 16S rDNA fragments from metagenomic datasets (obtained using the shotgun sequencing approach), it holds the potential to completely bypass the experimental procedures (and the related costs of the same) associated with extraction, cloning, and sequencing of the 16S rRNA gene or gene fragments. On the other hand, the relatively higher classification accuracy of the C16S method (as compared to other contemporary classification methods) is expected to provide an accurate picture of taxonomic diversity of microbial communities inhabiting any given environment.
Cross-References
▶ Computational Approaches for Metagenomic Datasets
▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Microbial Diversity, Bar-Coding Approaches
▶ Nucleotide Composition Analysis: Use in Metagenome Analysis
▶ Phylogenetics, Overview
▶ RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data
References
Altschul SF, Gish W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Biers EJ, Sun S, et al. Prokaryotic genomes and diversity in surface ocean waters: interrogating the global ocean sampling metagenome. Appl Environ Microbiol. 2009;75(7):2221–9.
Daubin V, Moran NA, et al. Phylogenetics and the cohesion of bacterial genomes. Science. 2003;301(5634):829–32.
Ghosh TS, Gajjalla P, et al. C16S - a Hidden Markov Model based algorithm for taxonomic classification of 16S rRNA gene sequences. Genomics. 2012;99(4):195–201.
Huang Y, Gilna P, et al. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25(10):1338–40.
Jain R, Rivera MC, et al. Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A. 1999;96(7):3801–6.
Jonasson J, Olofsson M, et al. Classification, identification and subtyping of bacteria based on pyrosequencing and signature matching of 16S rDNA fragments. APMIS. 2002;110(3):263–72.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
Langmead B, Trapnell C, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589–95.
Meyer F, Paarmann D, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;19(9):386.
Mohammed MH, Ghosh TS, et al. i-rDNA: alignment-free algorithm for rapid in silico detection of ribosomal gene fragments from metagenomic sequence data sets. BMC Genomics. 2011;12 Suppl 3:S12.
Richter DC, Ott F, et al. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.
Seshadri R, Kravitz SA, et al. CAMERA: a community resource for metagenomics. PLoS Biol. 2007;5(3):e75.
Sun Y, Cai Y, et al. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform. 2011;13(1):107–21.
Wang Q, Garrity GM, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, # Springer Science+Business Media New York 2015
K 330 KEGG and GenomeNet, New Developments, Metagenomic Analysis
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 2 Screenshot of KEGG
Metagenomes page
The examples of T numbers include T30001 for planktonic microbial communities from North Pacific Subtropical Gyre (retrieved from NCBI), T30003 for human gut metagenome collected from a healthy Japanese adult male F1-S (retrieved from Metagenome.jp), and T30016 for human gut microbial gene sample from healthy Danish female (retrieved from MetaHIT).
For users interested in an organism (identified by the KEGG Organism code) or a sample (identified by the T number), embedded links make it easy to jump to the corresponding summary pages. Clicking “T30003”, for instance, in the KEGG Metagenomes page takes the user to the summary page specific for the sample T30003. KEGG provides this type of page for all genomes, metagenomes, pangenomes, and EST datasets. Also, users can search for genes of interest and jump to pathway maps, functional hierarchy, modules, etc.
KEGG PATHWAY Maps and BRITE Functional Hierarchy
KEGG PATHWAY maps (http://www.kegg.jp/kegg/pathway.html) and BRITE functional hierarchy (http://www.kegg.jp/kegg/brite.html) generally do not focus on a specific organism. BRITE contains a number of hierarchical classifications of vocabularies used in journal articles and other public data in academic communities. The “reference” pathway maps are the combined
pathways present in a number of organisms and are consensus among many published articles. Only the reference pathway map is manually drawn with in-house software called KegSketch, whereas all other organism-specific maps are computationally generated. The user can conduct a search limited to an organism of interest as well as a comprehensive search throughout all of the genome-sequenced organisms. In the pathway maps, rectangles and circles represent gene products (mostly proteins) and other molecules (mostly metabolites), respectively. The maps are colored in black and white in reference pathways, i.e., when no organism has been specified. When the user specifies an organism of interest, the organism-specific pathways include some colored rectangles indicating that the specified organism possesses the corresponding genes or proteins in the genome (Fig. 3, left). White rectangles indicate that no genes have been annotated to the corresponding function. This does not necessarily mean the organism does not possess the corresponding genes; it is possible that the genes have not been identified yet.
KEGG Module
KEGG has three different levels of resolution for visualizing pathways: global maps (Fig. 6), (conventional) pathway maps (Fig. 3, left), and pathway modules (Fig. 3, right). Mapping genes to global maps helps users to grasp the overview of the sample. Mapping genes to pathway maps is useful to check the functional capability of the genome or metagenome. There are some cases where the smaller functional units, as defined in KEGG Modules, are more helpful to conduct the detailed analysis. KEGG Modules include consecutive reaction steps, operon or other regulatory units, and phylogenetic units by genome comparison. KEGG has recently been focusing effort on the development and annotation of KEGG Modules, leading to an increase in the number of entries. KEGG Module (http://www.kegg.jp/kegg/module.html) collects functional units classified into the following four categories: (1) pathway modules – representing smaller pathway units than KEGG PATHWAY maps, such as M00002 (glycolysis, core module involving three-carbon compounds; see Fig. 3, right); (2) structural complexes – often forming molecular machineries, such as M00072 (oligosaccharyltransferase); (3) functional sets, for other types of essential sets, such as M00360 (aminoacyl-tRNA synthetases, prokaryotes); and (4) signature modules, as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin).
KEGG Orthology (KO)
Coloring the rectangles in the organism-specific pathways, i.e., estimating the presence/absence of the respective genes in pathway maps, is determined based on the KEGG Orthology (KO). KO collects the groups of orthologous genes having a common function and the same evolutionary origin. A group of orthologous genes (a KO entry) is given an identification number (K number) and in principle corresponds to more than one gene derived from more than one organism. Genes assigned to the same K number correspond to the same rectangle in a PATHWAY map (Fig. 3, left), MODULE (Fig. 3, right), and BRITE hierarchy. The top page of KO (http://www.kegg.jp/kegg/ko.html) provides the form to obtain an ortholog table (Fig. 4), which shows currently annotated genes in individual genomes for a given set of K numbers, together with coloring of adjacent genes on the chromosome. Each KEGG Module also contains a link to the corresponding ortholog table. The ortholog table is a useful tool to check completeness and consistency of genome annotations. KO entries for complete genomes are manually defined and annotated by the KEGG expert curators based on the phylogenetic profiles and functional annotations of the genes. On the other hand, KO entries for draft genomes, metagenomes, pangenomes, and EST datasets are automatically annotated by KAAS (KEGG Automatic Annotation Server), one of the GenomeNet tools.
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 3 Mapping human genome onto glycolysis pathway and module. Human genome mapped onto glycolysis pathway map00010 (left) and module M00002 (right). Rectangles colored in green indicate that human genome possesses the corresponding genes. KEGG Orthology entries are used to define KEGG Modules, which is part of pathway maps, as indicated by the red lines
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 4 Screenshot of the ortholog table for module M00002. Ortholog tables contain links to the genes corresponding to the orthologs (K numbers) in genome-sequenced species. Columns and rows represent orthologs and species, respectively
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 5 KEGG Automatic Annotation
Server (KAAS)
not only for comparing genomes but also for visualizing host-microbiome relationships such as in the human gut microbiome, host-symbiont relationships, and host-pathogen relationships. If a user inputs “hsa + pfa”, meaning human (Homo sapiens) plus a pathogen (Plasmodium falciparum 3D7), the resulting pathways will be double colored. The two colors represent the gene products from the two organisms. This option accepts any combination up to a total of ten genomes. For instance, the query “hsa + mmu + dme”, which means human (Homo sapiens) + mouse (Mus musculus) + fruit fly (Drosophila melanogaster), provides a three-colored map. Metagenomes can also be viewed with KEGG Mapper. Figure 6 shows the human genome and a human intestine metagenome mapped onto a global map, where green lines indicate genes that the human genome (only) possesses, red lines indicate gut metagenome (only) genes, and blue lines indicate genes possessed by both. Figure 7 shows an example of the reconstructed thiamine metabolism pathway by mapping human genome (hsa, colored in green) and human intestine metagenome (T30003, colored in pink). Thiamine diphosphate shown in this pathway works as an essential nutritional factor for humans, but it cannot be synthesized without the help of the symbiotic bacteria in the human intestine. By clicking one of the pink-colored rectangles (e.g., ThiC), a user can see the list of corresponding genes in the metagenome (Fig. 8). The possible common sets of functions between human genome and human gut
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 6 Mapping of human genome and
human intestine metagenome on a global map
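The three-color scheme of the global map in Fig. 6 amounts to a set comparison between the functions annotated in the two inputs. The sketch below illustrates that logic only; it is not KEGG Mapper code, the function name is hypothetical, and the K numbers and their set memberships in the usage lines are made-up placeholders.

```python
def color_global_map(human_kos, metagenome_kos):
    """Assign a line color to each K number, following the scheme described for Fig. 6."""
    colors = {}
    for ko in human_kos | metagenome_kos:
        if ko in human_kos and ko in metagenome_kos:
            colors[ko] = "blue"    # possessed by both
        elif ko in human_kos:
            colors[ko] = "green"   # human genome only
        else:
            colors[ko] = "red"     # gut metagenome only
    return colors

# Placeholder inputs: the memberships below are invented for illustration.
human = {"K00001", "K00002"}
metagenome = {"K00002", "K00003"}
print(color_global_map(human, metagenome))
```

Functions that end up red in such a comparison are the ones supplied only by the microbial community, which is the situation described for thiamine biosynthesis.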
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 7 Mapping of human genome and
human intestine metagenome on thiamine metabolism
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 8 Examples of the metagenome
sequences annotated in the place of ThiC
metagenome can also be compared in terms of the possessed KEGG Module entries. From the top page of a metagenome sample (e.g., T30003), the user can jump to the module page, where the thiamine biosynthesis module (M00127) is present (Fig. 9). In contrast, the human genome also has the corresponding page, but there is no such module, meaning that no gene in human genome has been annotated to have such a function.
Conclusion
This review introduced the KEGG and GenomeNet resources, putting emphasis on the
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 9 KEGG Module entries assigned for
a metagenome sample
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 1 Types of overviews. The traditional pie chart (a) shows abundances of organisms in a metagenome, summarized at the phylum level. Many phyla are still too small to compare, while genus- and species-level classifications for the larger phyla cannot be seen even though they would be large enough. The multilayer pie chart (b) depicts ranks more dynamically, dividing high-level classifications into more specific ones toward the outside of the circle. This allows more details to be shown for large phyla while small phyla are grouped and labeled
Krona: Interactive Metagenomic Visualization in a Web Browser 341 K
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 2 Zoomed multilayer pie chart. Standard zooming can show more detail for a region of a multilayer pie chart, but can move the center off screen and cause wedges to become nearly rectangular. As a result, it is less intuitive to discern relative abundances and hierarchical organization
there is still a trade-off between comparing the most abundant organisms and viewing their most specific classifications (Huson et al. 2007; Meyer et al. 2008). Krona uses multilevel pie charts to visualize both the most abundant organisms and their most specific classifications (Fig. 1). Rather than hiding lower ranks in its overview, Krona hides low-abundance organisms, which can be expanded interactively. Additionally, Krona’s browser-based implementation allows it to be much more portable than other interactive metagenomic visualization tools.
Overviews and Details
Interactive visualizations can make complex results more accessible by providing both high-level overviews and detailed views of specific portions as needed (Shneiderman 2002). Though an overview can (and usually must) omit some complexity, this view helps users determine which areas to view in further detail and provides context as they browse between sections. Multilevel pie charts are a good option for metagenomic overviews because they can convey hierarchy implicitly, nesting lower-level wedges within higher ones (Draper et al. 2009). This allows the abundances of multiple levels to be shown in the same view and using the same scale. As in other metagenomic visualizations, some nodes will have to be hidden for the overview to be informative. The benefit of multilevel pie charts is that the nodes are hidden based on abundance rather than specificity of their classifications. This gives priority to nodes that make
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 3 Polar zooming. Zooming in
polar space allows the zoomed region to retain the intuitive properties of the original multilayer
pie chart. A wedge in the overview (a, green) is stretched around the center (b–e) until it fills
the entire circle (f). The detailed view also serves as a new overview from which the process can
be repeated with smaller wedges
up the greatest portion of the sample, which are
typically of the most interest. A potential draw-
back, however, is that simply zooming in on the
smaller nodes would cause them to lose their
resemblance to a pie chart (Fig. 2). Krona avoids
this problem using polar zooming, in which
a wedge is stretched around the center until it
forms a new multilayer pie chart (Fig. 3). The
zoomed-in view also serves as a new overview for
further zooming, allowing even complex hierar-
chies to be explored with only a small amount of
navigation.
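
The angular stretching behind polar zooming can be sketched in a few lines. The Python fragment
below is only an illustration and is not taken from Krona itself: the function name polar_zoom,
its parameters, and the linear interpolation driven by a progress value t are assumptions made
for this example. It maps an angle inside a wedge to its position while the wedge boundaries move
toward 0 and 2*pi.

```python
import math

FULL_CIRCLE = 2.0 * math.pi


def polar_zoom(angle, wedge_start, wedge_end, t):
    """Map an angle inside a wedge to its position at zoom progress t.

    t = 0 leaves the angle unchanged; t = 1 stretches the wedge so that
    it spans the whole circle (0 to 2*pi). All angles are in radians.
    Hypothetical helper for illustration, not part of Krona.
    """
    if not wedge_start < wedge_end:
        raise ValueError("wedge_start must be smaller than wedge_end")
    # Current wedge boundaries, moving linearly toward 0 and 2*pi.
    start_t = (1.0 - t) * wedge_start
    end_t = (1.0 - t) * wedge_end + t * FULL_CIRCLE
    # Relative position of the angle inside the original wedge (0..1).
    fraction = (angle - wedge_start) / (wedge_end - wedge_start)
    return start_t + fraction * (end_t - start_t)


# A wedge covering a quarter of the chart and a child boundary at its middle.
start, end = math.pi / 2, math.pi
middle = (start + end) / 2
zoomed = [polar_zoom(a, start, end, 1.0) for a in (start, middle, end)]
```

Because every angle keeps its relative position inside the wedge, nested wedges keep their
proportions, which is why the zoomed view still reads as a multilayer pie chart.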
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 5 Classification confidence.
Classification confidence is mapped to a gradient from red (signifying low confidence) to green
(signifying high confidence), allowing it to be depicted in tandem with abundance and hierarchy
samples. These two methods can also be combined to provide a clearer picture of sample variation.

Applications in Metagenomics
Metagenomic analyses typically produce data from one of two categories: taxonomic and functional.
Taxonomic classifications, which place sequences on the tree of life, are inherently hierarchical
because of the various ranks in the tree (species, genus, etc.). Functional classifications, which
describe the roles of predicted proteins, are often made hierarchical by grouping specific
functions into more general ones. Since both types of data focus on quantities within hierarchies,
both are suited to visualization with Krona charts. To create Krona HTML files from these data,
many common formats can be imported with KronaTools, a software package for Unix-based systems.
Classifications can be directly imported from the RDP Classifier, Phymm/PhymmBL, FCP, MG-RAST, or
the Web-based bioinformatics platform Galaxy. For raw BLAST results downloaded from NCBI or the
METAREP metagenomic repository, KronaTools performs MEGAN-like (lowest common ancestor)
classification using NCBI taxonomy information. When importing
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 6 Comparing datasets.
Differences between samples are shown with an animated transition from one sample (a) to the next
(b). The persistence of wedge coloring between samples helps the user keep track of individual
wedges and draws attention to ones that change by significant amounts
Lateral Gene Transfer and Microbial Diversity

Tania Nasreen, Rebecca J. Case and Yan Boucher
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

Synonyms

Horizontal gene transfer (HGT); Lateral gene transfer (LGT)

Definition

LGT refers to genetic changes within an individual or a population that occur through the
acquisition of DNA from individuals that are not an organism’s direct cellular parent or
progenitor. One of its effects on microbial populations is to alter diversity through the
acquisition of genetic material or by homogenization of a population. If molecular sequences are
available for a community, statistical estimators can be used to calculate its total diversity
and structure so that it can be compared to other ecosystems.

biological diversity. This estimate of less than 1 % of prokaryotes being represented by cultures
(Torsvik et al. 1990) suggests that exploration of the untapped diversity of microbial species,
genes, pangenomes, metabolism, behaviors, and complex interactions will be a fruitful endeavor.
However, how we discover and understand microbial diversity has been heavily influenced by LGT. We
can compare any bacterial or archaeal 16S rRNA gene directly recovered from the environment to the
comprehensive public sequence databases. This allows the identification of this gene’s host based
on its similarity and phylogenetic placement relative to sequences from described (and therefore
cultured) prokaryotes. These sequences can also be compared to all the other 16S rRNA gene
sequences directly retrieved from the environment; however, without a described culture, little
can be inferred about their hosts’ physiology and thereby their role in an ecosystem. This is
further complicated by the prevalence of LGT, which makes it possible to infer few, if any,
phenotypic characteristics of a species, genus, family, or phylum. Therefore, a sequence rarely
tells us anything about the ecology of an organism, and its real value is that it can tell us
something about the biological diversity of an ecosystem.
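
The statistical estimators mentioned in the Definition can be illustrated with a short sketch. The
Python code below is a minimal example under stated assumptions, not the procedure of any
particular study: the input is a hypothetical list of read counts per operational taxonomic unit
(OTU), the Simpson measure is given in its Gini-Simpson form (1 minus the sum of squared
proportions), and Chao1 is given in its bias-corrected form. All function names are invented for
this illustration.

```python
import math


def shannon_index(counts):
    """Shannon entropy H = -sum(p * ln p) over OTU proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def gini_simpson_index(counts):
    """Gini-Simpson form 1 - sum(p^2): chance that two random reads differ in OTU."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)   # OTUs seen exactly once
    doubletons = sum(1 for c in counts if c == 2)   # OTUs seen exactly twice
    return observed + singletons * (singletons - 1) / (2.0 * (doubletons + 1))


# Hypothetical read counts per OTU in one sample (assumed data).
otu_counts = [50, 30, 10, 5, 2, 1, 1, 1]
summary = {
    "shannon": shannon_index(otu_counts),
    "gini_simpson": gini_simpson_index(otu_counts),
    "chao1": chao1(otu_counts),
}
```

Chao1 exceeds the observed number of OTUs whenever the sample contains more than one singleton,
reflecting the rare members that subsampling has probably missed.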
information to describe an ecosystem (Gravel et al. 2011). The diversity of a system is not simply
the number of organisms or unique DNA sequences identified. Probability-based estimators can be
used to extrapolate the total diversity from subsampling the diversity of operational taxonomic
units (OTUs) defined by a similarity threshold of the 16S rRNA gene sequence. This can be done for
populations with parametric (e.g., rarefaction) or nonparametric (e.g., Chao1) distributions. This
is analogous to capture-recapture methods of determining the population size of animals. For
example, to determine the population size of swamp wallabies, the ears of several wallabies are
tagged within a population, and subsequent sampling of the population can be used to estimate its
size by calculating the probability of recapturing tagged wallabies among non-tagged wallabies.
Molecular microbial ecology is much more powerful than such macroecology studies as it rarely
focuses on a single species, but rather the total bacterial and/or archaeal community and the
numerous populations that encompass thousands of species. Diversity estimators are then used to
calculate the total diversity and structure of the community using indices such as Simpson’s
Diversity Index (proportional distribution of all species), species evenness (distribution of
individuals among species), and Shannon Index (entropy of community measured from the richness and
evenness of community). These indices allow us to compare natural and experimental communities to
identify factors that influence diversity such as the volume of water in tree holes (Bell et al.
2005) or a chronosequence within a lichen (Mushegian et al. 2011).
Microbial systems rarely have a perceived intrinsic value in that people do not marvel at a
termite’s hindgut as they do old growth forests. Their value is in what they do, their function.
Diversity is a powerful measurement in microbial ecology as it has a major influence on the
productivity and stability (or resilience) of an ecosystem (Gravel et al. 2011). The indices
described above are useful in characterizing these systems as they can be used to compare their
productivity. However, in microbial systems, the productivity is not often studied (with the
exception of phytoplankton in aquatic systems). Often a specific process is of interest, such as
degradation of xenobiotics or denitrification. This presents one of the biggest dilemmas for
microbial ecologists as they cannot study a phylogenetic group (for which there are many 16S
rRNA-based primers and probes that could be used in targeted studies) and infer the function of
the group (Case et al. 2007). Macroecologists can infer that plants are primary producers at the
base of the food web and provide shelter for other species as habitat-forming species, which is
not possible for microecologists. This is the result of LGT, as this phenomenon facilitates the
movement of genes among phylogenetically distant organisms. This means that phylogeny based on
universal marker genes such as 16S rRNA is not a predictive tool of function in microbiology.
Molecular methods have been adapted to circumvent this conundrum so that functional genes (such as
hupL for hydrogen oxidation) can be directly targeted through PCR (Balskus et al. 2011). Such
functional genes can then be used in community fingerprinting, clone libraries, or CARD-FISH,
which has been adapted to identify mRNA to look at expression of specific genes inside cells. Such
gene-omic (sequencing of a single marker gene directly from an environmental sample) approaches
are popular for targeted studies and can be adapted to high-throughput sequencing techniques.
Datasets that include deep sequencing of a gene involved with a specific function can be used to
identify redundancy in a system. Such redundancy is important for the stability of an ecosystem
through environmental change, as genetic redundancy represents the diversity of organisms able to
perform a function within a system. The alternative to gene-omic approaches is metagenomics, whose
popularity has been greatly influenced by the disconnect created by LGT between phylogeny and
function. Metagenomics retrieves large nontargeted sequence datasets from an environment such that
metabolic networks and interactions can be inferred from the community’s metagenome. This method
can be
coupled to metatranscriptomics (RNA) and/or metaproteomics (proteins) to move beyond the genetic
potential of a metagenome to the transcribed and translated. These methods, however, have their
greatest power when targeted or used in low-diversity systems (Hugenholtz and Tyson 2008).

Mechanisms Responsible for the Generation of Genetic Diversity in Microbes

What is measured through gene-omic approaches such as the 16S rRNA gene or nifH clone libraries is
nucleotide sequence diversity. The latter, however, is not the only type of genetic diversity.
Metagenomic or genomic approaches allow the measurement of gene content diversity, which is the
measurement of differences in the genes found in various genomes or metagenomes. Both of these are
strongly affected by LGT, which influences not only the rate at which they change but also how
they change.
Sequence Diversity. The only force responsible for de novo creation of genetic diversity is
mutation. It can be defined as changes in the DNA sequence of a genome that is inherited from a
progenitor. The nature of such changes can vary: base pair substitutions, insertion/deletion of
one or more nucleotide(s), as well as larger or more complex changes (such as chromosomal
rearrangement or gene duplication) (Fig. 1). The physical causes of mutations are also diverse:
unforced DNA replication errors, errors during proofreading or post-replication mismatch repair,
and DNA damage leading to replication errors or inaccurate repair. Although mutation is
responsible for creating diversity, it is not the only phenomenon introducing variation in
particular groups or lineages of microbes. Genetic changes within an individual or a population
can occur through the acquisition of DNA from individuals that are not an organism’s direct
cellular progenitor. This process is LGT. In bacteria and archaea, it has two main steps. First,
foreign DNA penetrates the cellular envelope in one of three ways: transformation (the uptake of
DNA directly from the environment or from a membrane vesicle), conjugation (cell-to-cell contact
mediated by the apparatus encoded on a conjugative element or by a cytoplasmic fusion), or
transduction (introduction of DNA by a phage) (Fig. 1). Second, integration into the new host
genome is required, which can be achieved by homologous recombination (i.e., requiring a
homologous region of DNA between the donor and recipient), heterologous recombination (i.e., not
requiring a homologous region of DNA between the donor and recipient DNA), or extrachromosomal
maintenance and replication.
We can now obtain minimal LGT estimates through quantification of homologous recombination. This
type of LGT directly affects sequence diversity and is usually simply termed “recombination” in
most molecular population studies. This is because mathematical models currently used in
population genetics can only take into account changes in genetic material that is present in all
members of the population, therefore excluding acquisition of novel genetic material through
heterologous recombination and as extrachromosomal elements. Population recombination rates
therefore only include events in which foreign DNA, through replacement of a homologous locus by
recombination, is integrated in the host genome. Studies that have compared population mutation
and recombination rates in various prokaryotic lineages have found a relatively even split between
those in which mutation introduces most of the changes and those where homologous recombination is
responsible for most nucleotide variations (sequence diversity).
Gene Content Diversity. LGT also (if not predominantly) introduces change through the acquisition
of novel genetic material through heterologous recombination. This, in combination with gene loss
and gene duplication, leads to changes in the gene content of an organism. For example, strains of
the marine heterotrophic bacterial genus Vibrio, which are identical at one or more protein-coding
housekeeping genes, can be differentiated by genome size (up to 800 kb
Lateral Gene Transfer and Microbial Diversity, Fig. 1 Description of the processes generating genetic diversity in
bacteria and archaea
variation) (Thompson et al. 2005). Also, strains of the nitrogen-fixing soil bacterium Frankia
that are more than 97 % identical in their rRNA gene sequences – the conventional cutoff value for
a bacterial species – can differ by as many as 3,500 genes, which represents nearly half of their
7.5 Mb genomes (Normand et al. 2007).
Although gene content and sequence diversity are often correlated, this is not always the case.
According to empirical data, the correlation is hypothesized to hold for bacteria that partially
overlap in their ecological niche (Konstantinidis et al. 2006). Sequence diversity dominates for
bacteria with identical or almost entirely overlapping niches (little change in gene content), and
gene content diversity is more pronounced when bacteria occupy separate niches. Ecological
adaptation is therefore directly linked with gene content diversity but less so with sequence
diversity.
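
Gene content diversity, in contrast to sequence diversity, is measured on the presence or absence
of genes. One simple way to express it, offered here only as a sketch, is the Jaccard distance
between the gene sets of two strains. The choice of the Jaccard distance, the function names, and
the toy gene sets below are assumptions made for illustration and do not come from the studies
cited above.

```python
def gene_content_distance(genes_a, genes_b):
    """Jaccard distance between two gene sets: 1 - |shared| / |union|."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty gene sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def differing_genes(genes_a, genes_b):
    """Genes present in exactly one of the two strains."""
    return set(genes_a) ^ set(genes_b)


# Toy gene sets for two hypothetical strains sharing their housekeeping genes.
strain_1 = {"recA", "gyrB", "nifH", "hupL"}
strain_2 = {"recA", "gyrB", "tetA"}

distance = gene_content_distance(strain_1, strain_2)
unshared = differing_genes(strain_1, strain_2)
```

Two strains can be identical at every shared housekeeping gene (no sequence diversity at those
loci) and still show a large distance here, which is the situation described for Vibrio and
Frankia.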
Impact of LGT on the Phenotypic Diversity of Microbes

Microorganisms exhibit great diversity in their cellular structures, metabolic properties,
interactions, and ecological niches. It is well established that mutation (sequence diversity) has
contributed to the phenotypic diversification of microorganisms. However, growing numbers of
genomic studies suggest that LGT influences the acquisition of novel functions through its effect
on gene content, not sequence, diversity. For example, recent studies of the genomic context and
phylogenetic relatedness of proteorhodopsin genes suggested that they had been transferred by LGT
from marine Archaea to Proteobacteria. This single gene is hypothesized to provide its host with a
competitive advantage by allowing it to harness light energy for cellular function. As these
organisms reside in the photic zone of the ocean, proteorhodopsin allows them to take full
advantage of available UV energy (Frigaard et al. 2006).
In some species, most of the genetic variation and adaptation occurs through LGT. Although
Prochlorococcus species have a conserved core of genes, they show a significant variation in the
genes present on genomic islands. These represent the evolutionary hot spots inside their genomes.
It is hypothesized that these genomic islands are acquired by LGT and undergo extensive
rearrangement, suggesting a common mechanism of niche differentiation in microbial species. The
pathogenicity islands of pathogenic bacteria also share the same characteristics (Coleman et al.
2006). Some genomic island associated LGTs are thought to be mediated by phages, since they can
carry host genome fragments. For example, the cholera toxin gene in Vibrio cholerae is actually
encoded within the genome of a bacteriophage (CTXφ) that requires the toxin co-regulated pilus
(TcpA), an intestinal colonization factor, as its receptor. TcpA is encoded within the
pathogenicity island named VP1. However, this VP1 region mainly constitutes the genome of another
bacteriophage (Faruque and Mekalanos 2012). Thus, two individual LGT events involving these two
phages have the potential to turn almost any Vibrio cholerae strain into a potent human pathogen.
Various metabolic properties, virulence, and antibiotic resistance traits can also be carried on
plasmids or transposons or a combination of the two. This makes these genes more likely to be
transferred through LGT. For example, Tn10 is a transposon consisting of a pair of IS10 elements,
a tetracycline resistance determinant, and a regulatory gene. Similarly, transposon Tn5 consists
of two IS50 elements and a three-gene operon that confers resistance to kanamycin, bleomycin, and
streptomycin. Both of these transposons can be incorporated into the chromosomes of
phylogenetically diverse groups of bacteria. Plasmids are the other major mediator of antibiotic
resistance gene acquisition by LGT. Not only are plasmids themselves transfer agents, but they can
also change rapidly through LGT. For example, based on gene organization and sequence similarity,
plasmid pKF3-140 found in Klebsiella pneumoniae has been speculated to have originated from
Escherichia coli (plasmids p1ESCUM and pUTI89) and further modified by acquiring resistance genes
from different enteric bacteria by LGT.
Another genetic element facilitating LGT and phenotypic diversity is the integron. This genetic
element carries genes for site-specific recombination of elements known as mobile gene cassettes
in the host genome. It has been found that about 17 % of the sequenced bacterial genomes have
integrons. For example, many species of Pseudomonas contain integrons with a variable number of
gene cassettes (10–32) that are considered to have been obtained by LGT at the late stage of
species segregation (Vaisvila et al. 2001).
These are only a few representative examples of the contribution of LGT to the phenotypic and
genotypic diversity of microbial populations. Importantly, this diversity is not only driven by
natural selection. Microbes have evolved the ability to sense their environments and generate
methods (Pignatelli and Moya 2011; Charuvaka and Rangwala 2011; Mende et al. 2012).
The effect of average read length on gene annotation has been addressed by Wommack et al. (2008).
They simulated the subsampling of existing Sanger-sequenced metagenomic datasets producing shorter
(<400 bp) reads characteristic of the next-generation sequencing technologies 454 and Illumina.
Their simulations revealed that short reads can miss up to 72 % of the annotated functions
revealed by longer (~750 bp) Sanger reads and can detect only highly conserved sequences with
phylogenetically close relatives in reference databases (Wommack et al. 2008). The simulations
also indicate that even an increase in sampling depth with short reads (as promised by the
Illumina platform) does not improve the annotation achieved by long reads. In addition, a related
study using simulated datasets to assess the effect of sequencing error on gene prediction (Hoff
2009) revealed that all metagenomic gene prediction tools show a reduced accuracy at gene calling
with increasing sequencing error rates and that their individual performance seems to be affected
by the taxonomic composition of the samples, except when using Sanger reads with error rates below
0.15 % (Hoff 2009). Pignatelli and Moya (2011) adapted the FAMeS community profiles to the 454 and
Illumina sequencing platforms and at a deeper sequencing coverage and demonstrated that all de
novo assemblers produce a significant amount of chimeric contigs (up to 10 %) that have a profound
impact on the functional and phylogenetic annotation of metagenomic sequences. Since domain and
motif databases like Pfam and TIGRfam rely on short conserved sequences, they may give better
annotations at a more general functional level (Pignatelli and Moya 2011).
All these studies reveal that the assembly of metagenomic datasets is highly influenced by the
community composition complexity, depth of sequencing coverage, and average length of the
sequenced reads, discouraging the assembly of metagenomic datasets. Nevertheless, the recent
development of software specifically designed for the assembly of metagenomic datasets like Genovo
(Laserson et al. 2011), IDBA-UD (Peng et al. 2012), and MetaVelvet (Namiki et al. 2012) shows a
promising improvement in metagenomic assembly, although only for low-complexity communities with
phylogenetically distant members. An approach that should be used in all assembly benchmarking
studies is the comparison of the assembly obtained with the mixed simulated metagenomic dataset
against the assembly obtained with an independent assembly of each species since most simulated
datasets are produced from the annotated complete genomes from isolates, as done by Charuvaka and
Rangwala (2011) and Namiki et al. (2012).

Evaluation of Ecological Aspects
Computer simulations have long been used in community ecology for modeling communities (Garfinkel
1962) and testing the performance of diversity indexes (e.g., Heltshe and Forrester 1983). But the
use of computer-simulated datasets to study the diversity of microbial communities had to wait
until molecular methods were available to study microbial communities (Liu et al. 1997; Bent and
Forney 2008). Simulated communities are the only option to test the performance of diversity
metrics on metagenomic datasets, since currently no natural community has been sampled to
exhaustion and hence no real diversity measure is accurately known that we can compare our
estimations against. The design of an artificial community in vitro and its subsequent sequencing
(Morgan et al. 2010) is at best methodologically and economically unfeasible to test the
performance of several replicated datasets (Angly et al. 2012).
Three published studies exist that use simulated datasets to evaluate the performance of community
diversity metrics, two of which deal with 16S rRNA amplicon-derived datasets (Kuczynski et al.
2010; Parks and Beiko 2012) and one with metagenomic datasets (Bonilla-Rosso et al. 2012).
Bent and Forney (2008) were the first to implement large-scale sequencing simulations to evaluate
alpha diversity (species diversity in
individual samples) metrics from 16S rRNA amplicon clone libraries and T-RFLPs. They demonstrated
that most alpha diversity metrics are sensitive to the number of rare and uncommon species, which
are precisely the ones likely to be undersampled by 16S rRNA amplicon-based techniques (the
so-called tragedy of the uncommons). Moreover, they show that different methods applied on the
same community can produce radically different estimations for these metrics (Bent and Forney
2008). Using a replicated simulated dataset of nine communities in a cross-gradient of species
richness and dominance, Bonilla-Rosso et al. (2012) demonstrated that the use of conserved protein
genes in metagenomic datasets outperforms 16S rRNA genes at reflecting the original community.
Moreover, they show that the most common alpha diversity metrics derived from metagenomic samples
are biased because of insufficient sampling and variations in the taxonomic composition
representation. These last two studies point toward the use of scale-dependent metrics such as
Rényi profiles or Hill numbers as a better representation of alpha diversity’s multidimensional
nature.
Two studies have addressed the performance of beta diversity metrics (similarity in species
composition between samples) with simulated datasets. The use of simulated datasets to test
ecological hypotheses was first implemented with deep sequencing of 16S rRNA amplicons (Kuczynski
et al. 2010). Addressing the effect of the environment on community structure, they simulated
datasets to model communities that were either shaped along an environmental gradient or where the
environment partitioned them into discrete clusters. They found that the patterns from
environmental gradients were more easily detected than those from ecological clustering,
especially when differences between clusters were subtle. Moreover, qualitative methods (richness
based) performed better on clustered datasets, while quantitative methods (abundance based)
performed better on gradients, so both types of methods should be applied if the underlying
pattern is unknown. Finally, they demonstrate that patterns are more readily identified with
several low-coverage samples than with few deep-coverage datasets (Kuczynski et al. 2010). These
results were further extrapolated for similarity metrics that incorporate phylogenetic
information, and it was found that most distance metrics are highly intercorrelated and highly
robust to rooting, choice of threshold for defining OTUs, and the presence of basal lineages
(Parks and Beiko 2012).

Perspectives

Often obscured by the large amount of data produced, metagenomics is still a very young discipline
where a consensus set of rigorously tested analytical tools is still lacking. Moreover, the rapid
advance of sequencing technologies causes a constant development and diversification of their
accompanying bioinformatic tools and approaches that require an objective quantification of their
performance. This is worsened by the lack of theoretical understanding of the assembly, dynamics,
and functioning of natural microbial communities. The use of simulated datasets after sequencing
modeling is the best alternative to approach the benchmarking of technical and analytical
methodologies as well as the testing of theories and hypotheses. However, a much more efficient
benchmarking framework is still needed.
A set of source communities from which new datasets are to be simulated needs to be designed by
consensus within the academic community as the minimal standard benchmarking starting point, so
that the performance of bioinformatic tools can be compared across studies and sequencing
platforms. This was the original intention of the FAMeS dataset (Mavromatis et al. 2007), but
currently almost every new tool developed is tested against a tailored simulated dataset, in part
because the three FAMeS communities cover a narrow range of community composition options.
Ideally, this standard source community dataset should be designed in a way that spans a wide
spectrum across three dimensions of
MEMOSys: Platform for Genome-Scale Metabolic Models

Stephan Pabinger1,2 and Zlatko Trajanoski1
1 Division of Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
2 AIT – Austrian Institute of Technology, Health & Environment Department, Molecular Diagnostics,
Vienna, Austria

Synonyms

Bioinformatics platform for genome-scale metabolic models

Definition

MEMOSys is a web-based platform for constructing, managing, and storing genome-scale metabolic
models. It provides sophisticated query and data exchange mechanisms, offers an integrated version
control system, and allows researchers to easily compare models. MEMOSys is freely available at
http://www.icbi.at/memosys under the GNU Affero General Public License.

Introduction

Driven by recent innovations in sequencing technology, genome-scale metabolic models have been
compiled for a number of different organisms (Henry et al. 2010). Each model is in general a
network consisting of metabolites that are connected by reactions. Genome-scale models include all
reactions occurring in a living organism and are primarily reconstructed using the annotated
genome and literature information. Metabolic models can be used to provide an alternative approach
for integrating large amounts of data about biological systems to gain novel insights into their
interconnected functionality (Kay and Wren 2009). Moreover, they have already been used for a
variety of different purposes including strain engineering (Benedict et al. 2012), gene deletion
studies (Choi et al. 2010), biofuel production (de Jong et al. 2011), and interpretation of gene
and protein expression data (Gowen and Fong 2010).
The generation of new models is a well-documented iterative process comprising a multitude of
different steps (Thiele and Palsson 2010), where often 10 % of construction time is needed to
model 90 % of reactions and 90 % to collect the remaining 10 % (Rocha et al. 2008). Until the
final version of a model is assembled, usually several intermediate revisions are generated.
During this reconstruction process, simulated results are constantly compared to experimental
data, and if they do not agree, the model is critically reevaluated (Baart and Martens 2012). It
is therefore of great importance to be able to review all changes, extract previous versions,
compare different versions of one model, and have access to easy-to-use software for creating and
manipulating models.
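
What comparing different versions of one model involves can be shown with a small sketch. The
Python code below is purely illustrative and does not reproduce the MEMOSys data model or
interface: a model revision is assumed to be a plain dictionary that maps a reaction identifier to
its equation, and the function name, identifiers, and equations are hypothetical.

```python
def compare_revisions(old, new):
    """Return the reaction ids that were added, removed, or changed.

    Both arguments map a reaction id to its equation string (assumed layout).
    """
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        # Present in both revisions, but with a different equation.
        "changed": sorted(r for r in old_ids & new_ids if old[r] != new[r]),
    }


# Two hypothetical intermediate revisions of the same model.
revision_1 = {
    "R1": "glc + atp -> g6p + adp",
    "R2": "g6p -> f6p",
}
revision_2 = {
    "R1": "glc + atp -> g6p + adp + h",
    "R3": "f6p + atp -> fdp + adp",
}

diff = compare_revisions(revision_1, revision_2)
```

Keeping every intermediate revision makes this kind of comparison possible at any point of the
iterative reconstruction, for example after a simulation disagrees with experimental data.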
The MEtabolic MOdel research and development System (MEMOSys) (Pabinger et al. 2011, 2014) has
been developed to support the construction, modification, and management of genome-scale
metabolic models. It is a web-based bioinformatics platform that uses an automatic version
control system to store the complete developmental history of all model components. This allows
researchers to access the entire model at any time during the iterative model building process.
Furthermore, MEMOSys offers sophisticated query mechanisms and supports the exchange of models
using standardized formats.

Model Management

Database Structure
MEMOSys has been designed to store all properties of a metabolic model in a database. The model
itself is represented by a name, a unique model identifier, and the reactions, genes, and
metabolites it contains. In addition, it is assigned to an organism and may contain references
to an image that graphically represents the metabolic network. MEMOSys supports the upload of
arbitrary additional data files, which can be directly linked to stored models. Such files may
include experimental data sets that were used to validate the model during the reconstruction
process. In addition, analysis results from external tools can be directly attached to the
investigated model.
Each model has an arbitrary number of reactions, which are described by a multitude of
properties, including name, Enzyme Commission (EC) number, reactants, products, and
reversibility. Reactions can be linked to citations in order to provide primary literature
evidence and are assigned to a subsystem, which is used to group reactions into metabolic
pathways. MEMOSys supports the definition of lower and upper bound constraints, which are
automatically included when the model is exported into a file and can then be directly used in
constraint-based analyses. Reactants and products of reactions contain the metabolite itself and
the stoichiometric coefficient for that metabolite, and they are assigned to a compartment.
Compartments are linked to the corresponding Systems Biology Ontology (SBO) term and arranged in
a hierarchy to support fine-grained compartmentalization when exporting models. SBO is
a hierarchically arranged set of controlled, relational vocabularies of terms that are commonly
used in mathematical modeling.
MEMOSys uses an integrated balance check mechanism that validates that the elemental composition
of the consumed and produced reactants is balanced. The check is automatically executed when
reactions are modified or during the import of a new model.
Each organism of a model can be annotated with the corresponding BioCyc (Karp et al. 2005)
identifier. BioCyc is a biological database collection, which includes highly curated genome and
pathway information for individual organisms. In order to facilitate the assignment process,
MEMOSys dynamically fetches all available organisms from BioCyc and provides suggestions to
select the correct identifier.
Genes and their relationship to other genes and reactions can be described using hierarchical
structures and Boolean operators (e.g., [gene1 or gene2] and gene3). They are linked to the
corresponding BioCyc pages if the organism identifier and the unique gene symbol are provided.
In addition, for genes having a reference to the Universal Protein Resource (UniProt) (Magrane
and Consortium 2011) database, MEMOSys offers a mechanism to download the amino acid sequence of
the encoded protein and provides an integrated system to fetch additional information from the
UniProt entry. UniProt is a popular, freely accessible comprehensive resource containing protein
sequence data as well as functional and annotation information.
Genome-scale metabolic models rely on annotations to unambiguously identify model components.
History has shown that biologists have been using different notations and naming schemes for the
same gene or protein. MEMOSys allows researchers to annotate reactions, metabolites, genes, and
compartments with references to external databases using the minimum information requested in
the annotation of biochemical models (MIRIAM) (Le Novère et al. 2005) notation. Every MIRIAM
identifier is a single
unique string, which unambiguously references an object in an external resource and facilitates
scientific collaborations and model comparability. MEMOSys automatically transforms MIRIAM
annotations into web addresses and displays direct links to the external data sources.
Furthermore, the application includes a mechanism to easily define additional external
databases, which can then be used by all model components to create further references and
annotations.
Due to the iterative model building process, components may be modified several times by
different members of the reconstruction team. To facilitate the discussion between researchers,
MEMOSys features an integrated web board that allows attaching discussions to every model
component. Associated threads are shown at each component page, and the latest comments of all
discussions are displayed on the home screen. In addition, global threads can be created to
discuss general properties of models.

Querying System
MEMOSys uses enhanced lists to present and query data stored in the database. Every list can be
customized to display a selection of available attributes. The lists are fully sortable and
incorporate attributes from different tables into one view, which allows comprehensive data
representations. MEMOSys supports fine-grained searches in which different restrictions can be
combined to answer a specific question. In addition, the application offers an easy to use quick
search mechanism for reactions, metabolites, genes, and organisms.
As all model components are highly connected with each other, MEMOSys displays links to
referenced components throughout the system and allows free navigation within and across all
stored models.

Versioning
The construction of a metabolic model is an iterative task, which has been broken down into
96 steps (Thiele and Palsson 2010) generating several intermediate versions until the final
model is established. Therefore, MEMOSys integrates an automatic version control system, which
creates a new revision for every modification of a model component. This system allows
researchers to access the complete model history and query, compare, and export previous
versions of a model. Each modification can be annotated with a comment, and the complete change
history is displayed as a list at the respective component pages. The home screen of the
application lets the user specify which version of a model should be used and lists the latest
modifications for metabolites and reactions.

Data Access and Supervision

MEMOSys is a multiuser application using four different user classes to control data access:
(a) unregistered visitors are allowed to view accepted, publicly available versions of models;
(b) registered users can view, in addition to publicly available models, accepted versions of
the models assigned to them; (c) editors have access to all versions of their own models and are
able to create, update, and delete model components. In addition, they are allowed to upload
files to the application and import models; (d) administrators are editors with additional
rights to access all models, change the public availability of models, and accept modifications.
Each modification of a model component is at first marked as pending and needs to be confirmed
by an administrator. Upon approval of a modification, a new internal revision number is assigned
to the model. In addition to the automatically set revision number, administrators can assign
major version numbers to each model. MEMOSys differentiates between publicly available models,
which are visible to all visitors and contain all accepted modifications, and restricted models
that are only visible to registered users and editors of the assigned models.

Comparison

As the construction of a draft genome-scale metabolic model is increasingly becoming a routine
application, future developments will strongly rely on already existing reconstructions of
related organisms. In addition, researchers are often interested in the subtle differences
between organisms when exploring specific biological functionalities. Hence, MEMOSys offers
a flexible and intuitive mechanism to assess the similarity between models, allowing users to
compare any version of different models. Furthermore, it is possible to compare two versions of
the same model to identify development changes.
The first section of the comparison result presents Venn diagrams that graphically display the
calculated differences for reactions, metabolites, and genes. Next, restrictions on the used
models can be set to display only differences in selected compartments and subsystems. In
addition to the graphical representation, the application shows detailed lists of unique and
equal model components and uses tabs to switch between the reaction, metabolite, and gene lists.
Every model component is connected to the corresponding page where detailed information is
presented.

Data Exchange

MEMOSys features the export of current metabolite and reaction query result lists into Excel or
PDF files, where only the active result set is included. Since several methods and toolboxes
which analyze genome-scale metabolic models have been published over the last years (Baart and
Martens 2012), MEMOSys provides a sophisticated data exchange mechanism that allows the export
of models into valid SBML files. The Systems Biology Markup Language (SBML) (Hucka et al. 2003)
provides a common intermediate format that can be used to define models in regulatory networks,
metabolic pathways, signaling pathways, and gene regulation networks.
The exported files are compliant with the consensus yeast format (Herrgård et al. 2008) or with
the COBRA toolbox format (Schellenberger et al. 2011). Researchers are able to export all
available versions of a model and restrict the set of exported reactions by either including
only reactions that are in certain subsystems or using the result of a reaction query as input
for the export mechanism.
MEMOSys features three different ways to assign reactions and metabolites to compartments
(compartmentalization), which allow researchers to directly use exported models in analysis
tools that do not support a fine-grained assignment of reactions to compartments.
The system supports the import of models that are encoded in a valid format as defined by the
consensus yeast reconstruction group. In addition, existing models in SBML format can be used to
improve the annotation of stored model components (see Fig. 1).

Installation

The application itself and the source code of MEMOSys are freely available under the GNU Affero
General Public License. As MEMOSys is a web application, it is recommended to install it on
a server system and to set appropriate access permissions for potential users. A detailed user
guide and installation instructions are available at the distribution website. MEMOSys is
available for download at http://www.icbi.at/MEMOSys.

Summary

During the last years, numerous genome-scale metabolic models have been developed for
a multitude of different organisms. They are a promising approach to systematically analyze
complex cellular systems and have been successfully applied for improving gene annotation,
increasing the product yield, and predicting the effect of gene deletions.
The web-based MEtabolic MOdel research and development System (MEMOSys) is a versatile
bioinformatics platform for the management, storage, modification, and development of
genome-scale metabolic models. It facilitates the construction of new models by providing
a built-in version control system, which allows researchers to access the complete
reconstruction history.
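
The elemental balance check mentioned under Database Structure can be illustrated with a minimal Python sketch. This is not MEMOSys code: the formula parser, the function names, and the example reaction are assumptions made for illustration, and only simple formulas without parentheses or charges are handled.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def element_totals(side):
    """Sum atoms over one reaction side given as {formula: stoichiometric coefficient}."""
    totals = Counter()
    for formula, coefficient in side.items():
        for element, n in parse_formula(formula).items():
            totals[element] += coefficient * n
    return totals

def imbalance(reactants, products):
    """Return {element: produced minus consumed} for every unbalanced element."""
    consumed = element_totals(reactants)
    produced = element_totals(products)
    elements = set(consumed) | set(produced)
    return {e: produced[e] - consumed[e] for e in elements if produced[e] != consumed[e]}

# Hypothetical input with neutral formulas:
# glucose + ATP -> glucose 6-phosphate + ADP
balanced = imbalance({"C6H12O6": 1, "C10H16N5O13P3": 1},
                     {"C6H13O9P": 1, "C10H15N5O10P2": 1})
# Same reaction with the cofactors left out, so H, O, and P no longer balance
unbalanced = imbalance({"C6H12O6": 1}, {"C6H13O9P": 1})
```

An empty result means the reaction is balanced; a non-empty result names the elements that a curator would need to inspect.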
MEMOSys: Platform for Genome-Scale Metabolic Models, Fig. 1 Displayed is the user interface for
improving the annotation of metabolites and genes. After loading an SBML file, the system
identifies new or different annotations and offers the possibility to select the correct
annotation for each modification. In addition, empty model component fields can be filled with
new annotations, or all stored annotations can be replaced with the currently loaded ones
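
The annotation workflow summarized in the Fig. 1 caption (detect new or different annotations, then either fill empty fields or replace all stored values) can be sketched as follows. This is a hedged illustration rather than the MEMOSys implementation: the dictionary layout, the function names, the mode names, and the identifiers are assumptions.

```python
def compare_annotations(stored, loaded):
    """Classify loaded annotations as 'new' (stored field empty) or 'different'."""
    diff = {"new": {}, "different": {}}
    for key, value in loaded.items():
        current = stored.get(key)
        if not current:
            diff["new"][key] = value
        elif current != value:
            diff["different"][key] = (current, value)
    return diff

def merge_annotations(stored, loaded, mode="fill_empty"):
    """Merge loaded annotations into a copy of the stored ones.

    mode='fill_empty'  : only empty or missing fields receive loaded values
    mode='replace_all' : every loaded value overwrites the stored one
    """
    merged = dict(stored)
    for key, value in loaded.items():
        if mode == "replace_all" or not merged.get(key):
            merged[key] = value
    return merged

# Illustrative identifiers only; the keys mimic database names
stored = {"kegg.compound": "C00031", "chebi": ""}
loaded = {"kegg.compound": "C00267", "chebi": "CHEBI:17634"}

diff = compare_annotations(stored, loaded)
filled = merge_annotations(stored, loaded, mode="fill_empty")
replaced = merge_annotations(stored, loaded, mode="replace_all")
```

Keeping the comparison separate from the merge mirrors the caption: the user first sees what differs and only then chooses how to apply it.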
M 366 MetaBin
from these genomes are similar to reads of novel or yet unknown genomes because the
NRminusGenus or NRminusFamily databases do not contain any genome of that genus. Simulated read
datasets were created using the MetaSim program to represent Sanger (read length ~800 bp) and
454 (read lengths of ~400 and ~250 bp) sequences (Richter et al. 2008). A Perl script was
developed for generating 1,000 simulated reads of length ~75 bp and ~45 bp, respectively, from
each of the bacterial genomes, since the option to create short reads was absent in MetaSim.
The metagenomic sequences for a real metagenomic dataset were taken from human gut samples from
a single Spanish male individual generated by Illumina sequencing (V1.CD-2, age 49, BMI 27.76,
20,707,369 high-quality reads, library 090107)
(ftp://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/) (Qin et al. 2010). The sample
data sequences (Sargasso Sea Subsample 1) for Sargasso Sea were downloaded from
http://www-ab.informatik.uni-tuebingen.de/software/megan/old-datasets (Huson et al. 2007). This
set contains the first 10,000 reads from Sample 1 of the Sargasso Sea dataset (Venter
et al. 2004).
BLAST (version 2.2.22, ftp://ftp.ncbi.nih.gov/blast/) was downloaded from NCBI. MEGAN
(version 3.8)
(http://www-ab.informatik.uni-tuebingen.de/data/software/megan/download/welcome.html) (Huson
et al. 2007), SOrt-ITEMS (http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS) (Monzoorul
et al. 2009), and TACOA (version 1.0, http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html)
(Diaz et al. 2009) were retrieved from their respective sites. WebCARMA (version 1.0) was run
from its Web server (http://webcarma.cebitec.uni-bielefeld.de/cgi-bin/webcarma.cgi) (Gerlach
et al. 2009).

Algorithm Development
MetaBin provides significant improvements over currently existing homology-based methods for
better taxonomic assignments. It reduces (up to 1,000-fold) the amount of time needed to
generate the alignments by using Blat (Kent 2002) as a faster alignment method in place of
BLASTX, which makes the run time comparable to that of composition-based methods. This feature
makes it practical to use a more accurate and sensitive homology-based approach for both Web-
and console-based high-throughput analysis of large datasets.
A unique approach has been adopted which considers the taxonomic information from all verified
complete or partial ORFs present in a read and then assigns a taxonomic bin. This helps to make
correct assignments of reads of diverse lengths to different taxonomic bins. Since our procedure
comprehensively considers all imaginable cases, the results are more accurate and specific, and
the assignments are not limited by read length. (Details are provided in the manuscript, Sharma
et al. 2012.)
The taxonomic binning of the simulated read datasets was carried out using MetaBin and MEGAN,
and the assignments were counted at three levels, namely, “Genus,” “Phylum,” and “Higher.” The
“correct assignments” were those where the assigned phylum was the same as the expected phylum,
that is, the read was assigned to its own phylum. Only the intragenic reads were considered to
calculate sensitivity and the positive predictive value (PPV) because the NR reference database
contains only protein sequences, and thus the reads coming from known protein coding regions
(intragenic) are expected to find a match. The following standard formulae were used to
calculate sensitivity and PPV:

\[ \text{Sensitivity (\%)} = \frac{TP}{TP + FN} \times 100 \]

\[ \text{Positive predictive value (PPV) (\%)} = \frac{TP}{TP + FP} \times 100 \]

True positive (TP) = number of reads assigned with correct (expected) phylum
False positive (FP) = number of reads assigned to other (incorrect) phylum
False negative (FN) = number of unassigned intragenic reads plus number of reads assigned above
the phylum level (higher)
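
As a worked illustration of the two formulae, the following Python sketch counts TP, FP, and FN for a set of intragenic reads and computes sensitivity and PPV. The function names, the use of `None` for unassigned reads, the label `"Higher"` for assignments above phylum level, and the read counts are assumptions made for this example; they are not taken from the MetaBin code.

```python
def count_outcomes(assigned, expected_phylum):
    """Count TP, FP, FN for intragenic reads that all come from expected_phylum.

    assigned: list with one entry per read, either a phylum name,
              "Higher" (assigned above phylum level), or None (unassigned).
    """
    tp = fp = fn = 0
    for label in assigned:
        if label is None or label == "Higher":
            fn += 1          # unassigned, or assigned above the phylum level
        elif label == expected_phylum:
            tp += 1          # correct (expected) phylum
        else:
            fp += 1          # other (incorrect) phylum
    return tp, fp, fn

def sensitivity_and_ppv(tp, fp, fn):
    """Return (sensitivity %, PPV %); 0.0 is returned when a denominator is zero."""
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    ppv = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv

# Hypothetical counts for 100 simulated reads from a Firmicutes genome
reads = (["Firmicutes"] * 80 + ["Proteobacteria"] * 5
         + [None] * 10 + ["Higher"] * 5)
tp, fp, fn = count_outcomes(reads, "Firmicutes")
sensitivity, ppv = sensitivity_and_ppv(tp, fp, fn)
```

With these made-up counts, TP = 80, FP = 5, and FN = 15, so sensitivity is 80/95 and PPV is 80/85, each expressed as a percentage.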
The average sensitivity and PPV were calculated for all simulated read datasets aligned with the
complete NR database or the NR-G versions.

MetaBin Development
The MetaBin algorithm was developed in Perl (version 5.10.1), and the dendrogram images were
generated using the Perl GD module. Options are provided to change run parameters such as bin
size, minimum bit-score, and the bit-score range used to select hits; to create a dendrogram
image after comparing the proportions of each taxonomic group in the selected metagenomes; and
to display the respective proportions as a pie chart. The algorithm can be used for the
taxonomic assignments of both single- and paired-end sequence reads. A user-friendly website
(http://metabin.riken.jp/) was developed on our server including detailed instructions for
installation, usage, and updating of the taxonomy database. A free stand-alone executable
program is also provided and can be downloaded for different operating systems including
Windows, Linux, and Mac.

Results

The overall performance of MetaBin was found to be superior to the other available tools such as
MEGAN, SOrt-ITEMS, TACOA, and WebCARMA for all read datasets. It assigned a higher percentage of
reads to their correct genus and phylum, as compared to the other methods. Particularly for the
short (<100 bp) Illumina reads, it assigned up to 18 % more reads to their correct taxonomic
levels. This is a useful and unique ability of MetaBin to make more accurate assignments at the
lower and more specific taxonomic levels. For all simulated read datasets, the average
sensitivity and PPV of MetaBin were similar to or higher than those of MEGAN, especially for
short reads. For ~75 bp reads, MetaBin showed up to 6 % and 16.8 % higher average sensitivity as
compared to MEGAN and SOrt-ITEMS, respectively. For ~45 bp reads, MetaBin showed up to 32 % and
46 % higher average sensitivity as compared to MEGAN and SOrt-ITEMS, respectively.
The performance of MetaBin was also evaluated on real metagenomic data using the recent human
gut data obtained by Illumina sequencing (short reads) from a European male individual and
analyzed using MetaBin with Blat as the alignment program. Only those bins containing at least
10,000 reads were considered under default parameter conditions. The analysis of such a large
metagenomic dataset demonstrates the ability of MetaBin to work on real metagenomic datasets. In
this analysis, Bacteroidetes was found as the most abundant phylum (77.4 %) followed by
Firmicutes (16.8 %), Proteobacteria (3.5 %), Actinobacteria (1.7 %), Cyanobacteria (0.27 %), and
Euryarchaeota (0.24 %). These results corroborate previous observations (Kurokawa et al. 2007).
The performance was also evaluated using longer (~800 bp) reads obtained from the Sargasso Sea
dataset. Using this common dataset, the results of MetaBin, MEGAN, and SOrt-ITEMS were compared.
MetaBin and MEGAN both predicted a similar number of bins; however, MetaBin assigned
comparatively more reads (nearly twice the number of reads at the species level) to each of
these common bins, which shows its higher sensitivity and higher accuracy. The performance of
SOrt-ITEMS was poor compared to both MetaBin and MEGAN. A brief comparison of MetaBin was also
carried out with one of the composition-based methods (TACOA) and with another method based on
homology to protein families (WebCARMA) using the above dataset. Both the composition- and
protein family-based methods showed limitations in making comprehensive taxonomic assignments
and performed poorly as compared to homology-based methods.

The Web Server

Different pages are provided on the Web server with several options for carrying out online
taxonomic analysis. The main page is the
“Application” page, where the user can submit and carry out taxonomic analysis of either
sequence reads or Blastx output (Fig. 1).

MetaBin, Fig. 1 Screenshot of “application” page using a sample query

Two options, BLAT and BLAST, are provided to generate the alignments. The input sequences should
be submitted in FASTA format, for which the ORFs are predicted, and the qualified ORFs are
aligned against the NCBI NR database using Blat. This output is used to classify the sequences
into their appropriate taxonomic bins. Another option, BLAST, uses Blastx for generating the
alignments and takes much longer than Blat. The input parameters such as minimum bit-score (Blat
or Blastx output), bit-score range to select hits, and bin size (minimum number of reads needed
to form a taxonomic bin) can be changed or used as default. The “Results” page provides the
output files in tab-delimited format and displays thumbnail images of the taxonomic tree (*.png
file) and functional annotation of the reads using COGs functional classes. The results can be
downloaded from the website (Fig. 2).
The “Visualization” page provides several options for displaying the results and carrying out
comparative analysis (Fig. 3).
An option to upload the resultant *.json file generated after using the stand-alone version for
additional Web-based analyses is also provided. There are options to visualize the taxonomic
tree and prepare a “composition chart” for a single dataset. The composition chart gives an
overview of the microbial distribution in the dataset and shows their abundance values. Another
option is available to compare the taxonomic profiles of up to five metagenomic datasets by
“Compare Metagenome Profiles.”
The stand-alone console-based version is provided to analyze large metagenomic datasets locally
on the user’s system after installation. A free stand-alone executable program is available for
download for several operating systems including Linux, Mac, and Windows.

Discussion

Homology-based approaches are more common and considered to be more specific and useful for
diverse read lengths as compared to composition-based approaches. However, their implementation
on large metagenomic datasets is limited due to the longer analysis time. The MetaBin algorithm
overcomes this limitation and provides a significant improvement over the currently existing
homology-based methods for better and faster taxonomic assignments by using a more specific
ORF-based approach. It carries out more accurate and specific taxonomic assignments at both
genus and species levels. The replacement of BLAST by Blat in MetaBin makes it possible to
employ a more accurate and sensitive homology-based approach for the high-throughput analysis of
large datasets and also for the development of a Web-based community server. The performance of
this approach was validated using
both simulated reads and real metagenomic datasets. In addition, it can be a tool of choice for
large metagenomic datasets as demonstrated in this entry. It can be used for the taxonomic
assignment of sequence reads of diverse lengths (≥45 bp) derived from any existing sequencing
technology, and perhaps it is the only method which can be applied for the taxonomic binning of
reads of lengths as short as 45–75 bp with high accuracy and sensitivity. Thus, the MetaBin Web
server and program can be considered a significant improvement over currently

MetaBin, Fig. 2 Screenshot of results page for the sample query
Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinforma. 2009;10:56.
Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinforma. 2009;10:430.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–64.
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4:63–72.
Monzoorul HM, Ghosh TS, Komanduri D, Mande SS. SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009;25:1722–30.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3:e3373.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011;39:D38–51.
Sharma VK, Kumar N, Prakash T, Taylor TD. Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin. PLoS One. 2012;7:e34030.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinforma. 2004;5:163.
Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005;6:805–14.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.

MetaBioME

Vineet K. Sharma (1) and Todd D. Taylor (2)
(1) MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological
Sciences, Indian Institute of Science Education and Research, Bhopal, India
(2) Laboratory for Integrated Bioinformatics, Core for Precise Measuring and Modeling, RIKEN
Center for Integrative Medical Sciences, Yokohama, Japan

Synonyms

Biocatalysts; Commercially useful enzymes; CUEs

Definition

MetaBioME: Comprehensive metagenomic biomining engine.

Introduction

The relationship between man and microbes is as old as the age of man himself and it is no
wonder that man carries around ten times more of these little friends than of his own cells
(Gill et al. 2006). However, it has only been a few thousand years since man first learned to
harness the power of microbes, initially to accomplish crude and trivial fermentations like
brewing and curdling. With the evolution of man, today, these applications have been extended to
almost all areas such as agriculture, pharmaceuticals, industry, biotechnology etc., where
microbes have become indispensable. These applications have now become more refined, and the
most remarkable change is that microbial enzymes have replaced whole microbes in many such
processes.
These microbial enzymes, a.k.a. “biocatalysts,” offer ecologically friendly or “green” solutions
for the implementation of biochemical processes at a reduced cost and produce a large variety of
chemical substances without involving the use of polluting reagents that are often
characteristic of chemical synthesis (Ferrer et al. 2005). However, only a few enzymes are
currently known which can be used as biocatalysts due to the limited number of sequenced
microbes, which is principally limited by the fact that most (>98 %) of the microbes cannot be
cultured, a necessary step for their sequencing by traditional methods (Amann et al. 1995). This
as yet unculturable majority of microbes potentially conceals an enormous treasure of unknown
biological functions locked in their unidentified genes, proteins, and biochemical pathways.
Therefore, approaches aimed at mining environmental genetic diversity can significantly enhance
the enzyme repertoire and will be helpful in the discovery of novel biocatalysts with potential
biotechnological applications.
Another feasible, yet challenging, method is to create novel biocatalysts by using in silico
approaches and bioengineering to reshuffle the 20 known amino acids and mutate the existing
proteins. However, there exist nearly infinite possibilities for such an approach, and it is
impractical and costly to test them all. In this scenario, nature appears the veteran since it
began its bioengineering laboratory billions of years ago and has already created and tested an
intriguing diversity of biochemical pathways and their constituent enzymes that perform numerous
transformations of molecules in diverse biological systems with great precision and specificity.
Therefore, it is conceivable that the ideal biocatalyst may already exist in nature and a wise
strategy would be to augment our knowledge base by exploring the inherent diversity of nature.
To this end, metagenomics has emerged as a powerful culture-independent approach for exploring
the complexity of microbial genomes in their natural environments (Tringe and Rubin 2005a). Many
metagenomic projects have recently been conducted, such as metagenomic studies of soil, sea,
acid mines, human gut, termite gut, whale fall, etc. (Daniel 2005; Edwards and Rohwer 2005;
Kurokawa et al. 2007; Tringe et al. 2005; Turnbaugh et al. 2006; Tyson et al. 2004; Venter
et al. 2004a; Warnecke et al. 2007), and several large-scale worldwide metagenomic projects are
currently in progress or in planning. From these metagenomic projects, some important
biocatalysts have already been isolated such as lipases/esterases, proteases, nitrilases,
β-lactamases, hydrolases, cellulases, α-amylases, xylanases, oxidoreductases, and dehydrogenases
(Ferrer et al. 2005; Yun and Ryu 2005). Therefore, the upcoming information from further
metagenomic projects holds enormous prospect for the discovery of novel genes, biocatalysts, and
biochemical pathways, irrespective of the necessity for complete genomic sequences.
Novel biocatalysts can be detected in genomic or metagenomic libraries using three commonly used
strategies: (i) homology-driven screening, (ii) substrate-induced gene expression screening, and
(iii) activity-based analysis (Ferrer et al. 2005; Yun and Ryu 2005). While these methods have
certain advantages like high specificity and reliability, they require extensive mining of large
genomic or metagenomic libraries and result in a few positives per enzymatic screening (Ferrer
et al. 2005). This is further limited by the low quality of DNA, low coverage, host bias, and
need for better vector-host combinations for expression.
An alternative and promising approach which now exists involves direct shotgun sequencing of
metagenomic libraries (Tringe and Rubin 2005b). This approach was earlier considered too
expensive, since it required massive sequencing by conventional sequencers (Sanger). However,
the recent availability of a new generation of sequencers, like Roche 454, Illumina HiSeq, Ion
Torrent, etc., has made sequencing even more high-throughput, several orders of magnitude less
expensive, and most importantly cloning independent (Mardis 2008). Considering the sheer volume
of metagenomic samples and implementation of such high-throughput sequencing methods, combined
with high-throughput computational analysis, screening of potential biocatalysts is more
promising and is likely to accelerate the process of biocatalyst discovery.
MetaBioME, Fig. 1 Distribution of enzymes (EC classes) into nine application categories (chart
rows: Nutrition, Medical, Industry, Enzymatic Analysis, Environment, Energy, Biotechnology,
Agriculture, All Applications, All Enzymes)
In the present entry, we describe a computational platform and resource to identify novel
biocatalysts in metagenomic datasets using homology-based approaches. We have developed
a comprehensive Metagenomic BioMining Engine (MetaBioME) platform (Sharma et al. 2010), which
provides a unique resource for the identification of novel alternatives to the existing known
biocatalysts and novel biocatalysts in metagenomic datasets, which can be used as leads for
further experimental verification.

Results

The distribution of 510 biocatalysts in nine application categories indicates that the highest
number (234, 46 %) of biocatalysts is present in the “Biotechnology” category and the lowest
(3, 3 %) in the “Energy” category (Fig. 1).
Oxidoreductases (EC 1), which catalyze oxidation-reduction reactions, are most abundant in five
out of nine applications, namely, Enzymatic Analysis (95 %), Energy (75 %), General Applications
(74 %), Environment (45 %), and Medical (42 %). Transferases (EC 2), which perform the transfer
of functional groups from one molecule to another, are most abundant in three application
categories, namely, Agriculture (50 %), Nutrition (48 %), and Biotechnology (37 %). Hydrolases
(EC 3), which are involved in formation of two separate products from a single substrate by
hydrolysis, are most abundant only in Industrial applications (45 %). It is clear from the above
findings that oxidoreductases (EC 1) are most widely used as biocatalysts followed by
transferases (EC 2) and hydrolases (EC 3). It is also noteworthy that although hydrolases (EC 3)
constitute most of the enzymes among the six EC classes, they are not the most widely employed
biocatalysts. The biocatalysts belonging to the remaining three EC classes (4, 5, and 6) were
not as widely distributed or were completely absent from many of the nine application
categories.

Gene Prediction in Metagenomic Datasets (Except HFV)
The average contig length in the metagenomic datasets varied between 0.8 and 1.8 kb with the
exception of AMD (4.18 kb). The prediction of
ORFs by Glimmer and MetaGene showed considerable variation with MetaGene predicting up to twice the number of ORFs as compared to Glimmer. With the exception of AMD, having an average number of four ORFs per contig predicted by MetaGene and Glimmer, the average number of ORFs per contig for the remaining datasets was found to vary between 0.6 and 2.3. The median protein length in bacteria was reported in one study as 267 amino acids (801 base pairs) (Brocchieri and Karlin 2005). Since, in the above analysis, the average length of the contigs varies between 0.8 and 1.8 kb, and the average number of ORFs per contig varies between 0.6 and 2.3, it is likely that a significant portion of at least one ORF can be predicted in a contig of about 1 kb (Tringe et al. 2005). The ORFs predicted by Glimmer and MetaGene in all the metagenomic datasets were fed into the “Metabase” database, which is being used for the development of MetaBioME.

Identification of Potential Biocatalysts
Using MetaBioME’s homology-based approach, we identified 199 potential alternatives (49 % of total biocatalysts) to known biocatalysts in the metagenomic datasets using a stringent threshold of identity ≥ 50 % and coverage ≥ 90 %. Among the nine application categories, novel alternatives to known biocatalysts could be predicted for 39–50 % of total biocatalysts in each category. We further relaxed the above cutoff (identity ≥ 30 % and coverage ≥ 90 %) to identify an expanded list of potential alternate biocatalysts in the metagenomic datasets which could be used as leads for experimental verification. Using this relaxed cutoff, novel alternatives for a total of 305 (75 %) biocatalysts could be identified in the metagenomic datasets from all application categories. Among these potential biocatalysts, 20 were found in all nine metagenomic datasets, while 64 biocatalysts were rare and could be found in only one of the nine metagenomic datasets.

Description of Web Resource: MetaBioME
We used the above strategy, data, and results to develop a comprehensive resource “MetaBioME,” which can be queried using a publicly available Web interface at http://metasystems.riken.jp/metabiome (Sharma et al. 2010). The key idea of MetaBioME is to develop a computational tool for mining metagenomic datasets by using homology-based approaches to discover novel biocatalysts and novel alternatives for existing biocatalysts, with advanced analysis options for facilitating the validation of results. Therefore, for comprehensive querying, we have developed the following query pages:
MetaSearch: It houses a pre-classified set of 510 biocatalysts in nine application categories that can be searched for in different metagenomic datasets.
MetaXplorer: It contains the complete set of EC enzymes and options to search for their homologous ORFs in metagenomic datasets.
MetaAlign: It allows users to submit a gene or protein sequence of interest and search for the existence of a homologous ORF in metagenomic datasets.
The details of these query pages are provided below.

MetaSearch: Search for Biocatalysts in Metagenomic Datasets
The “MetaSearch” query page is designed to identify novel biocatalysts, categorized into nine main application categories in metagenomic datasets (Fig. 2).
This pre-classification helps the user to select biocatalysts belonging to any application area and search for them in metagenomic datasets. A search can be made by selecting one or more of the application categories and a single metagenomic dataset. Since the metagenomic datasets contain volumes of information, the number of hits reported for each query is expectedly large; therefore, we have currently restricted the option to select and search in only one metagenomic dataset per query. The queries can also be made by selecting different attributes such as EC number, enzyme name, Swiss-Prot ID, biochemical pathway, and substrate or products. Multiple keywords can also be submitted using Boolean operators. An option is also provided to limit the number of results by selecting “Best hit” or “Best 10 hits.”
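The identity and coverage cutoffs used to call potential alternatives, together with the “Best hit” and “Best 10 hits” limit, can be illustrated with a small filter. This is only a sketch under stated assumptions: the `Hit` record, the function name `qualifying_hits`, and the reading of both cutoffs as lower bounds are hypothetical and are not taken from the MetaBioME implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment between a known biocatalyst and a predicted ORF (hypothetical record)."""
    swissprot_id: str   # known biocatalyst sequence
    orf_id: str         # predicted metagenomic ORF
    identity: float     # percent identity of the alignment
    coverage: float     # percent of the known sequence covered by the alignment


def qualifying_hits(hits, min_identity, min_coverage, best_n=None):
    """Keep hits that meet both cutoffs (assumed to be lower bounds).

    Hits are ordered by coverage, then identity, highest first, and
    optionally truncated to the best `best_n` entries.
    """
    kept = [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]
    kept.sort(key=lambda h: (h.coverage, h.identity), reverse=True)
    return kept if best_n is None else kept[:best_n]


# Invented example values, for illustration only.
example = [
    Hit("P1", "orf1", 62.0, 95.0),
    Hit("P1", "orf2", 35.0, 92.0),
    Hit("P2", "orf3", 80.0, 60.0),
]
stringent = qualifying_hits(example, min_identity=50, min_coverage=90)
relaxed = qualifying_hits(example, min_identity=30, min_coverage=90)
best_only = qualifying_hits(example, 30, 90, best_n=1)
```

With these invented values the stringent cutoff keeps one ORF and the relaxed cutoff keeps two, which mirrors how relaxing the identity threshold expands the list of candidate leads.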
On submission of a query for a selected application category and metagenomic dataset, MetaBioME examines the alignments of all Swiss-Prot sequences known for all EC numbers present in that category with the ORFs predicted in contigs of the selected metagenomic dataset. The subsequent “MetaResults” page displays the qualified hits as a table sorted on the basis of percent coverage (completeness of the alignment) and provides a list of all matching Swiss-Prot IDs which showed at least 50 % coverage with the matched metagenomic contigs (Fig. 3).
Comprehensive information for each match can be retrieved by clicking on the Swiss-Prot ID link on the Results page which opens up the “MetaBioME Profile” page. The profile page summarizes information on the enzyme properties, reaction performed, pathway information (as available in KEGG), links to related publicly available databases, queried dataset,
MetaBioME, Fig. 3 Screenshot of “MetaResults” page showing the results of the submitted sample query
and application category. This information is followed by a table of predicted ORFs, where the ORFs are segregated as commonly predicted by Glimmer and MetaGene and uniquely predicted by Glimmer or MetaGene, respectively. The ORF showing the best match with the Swiss-Prot sequence is highlighted in green. This table is followed by the contig view window displaying the predicted ORFs as directional arrows as per the orientation of the ORFs on the contig. The best match is displayed as a green-colored arrow. Each arrow can be clicked to retrieve the nucleotide and protein sequences of the predicted ORFs. This window is followed by a table providing summary information for the best matching ORF. The next table provides information on the closest available PDB structure and displays the 3-D protein structure.
In order to provide a useful indicator for the goodness of the results, we have provided a “MetaBioME Rating,” which rates the best matching ORF on a scale of 1–5 stars, with a single star for the lowest match and five stars for the best match. In the case of a good match, users are advised to carry out an “Advanced Search,” which helps to confirm the goodness of the results by using a suite of options. Users can check the alignment of the Swiss-Prot sequence of the selected biocatalyst with the best matching ORF. Since conserved motifs likely play a key role in the activity of an enzyme, all Swiss-Prot sequences belonging to the same EC number can be aligned together or with the best matching ORF to find the overall sequence homology among these sequences. This helps in the identification of conserved motifs and confirms if the best matching ORF also possesses any conserved motifs which may be present in the Swiss-Prot sequences. As another functional confirmation, users can also look for the presence of conserved domains in the best matching ORF by aligning the sequence against the NCBI Conserved Domain Database (CDD).
Additionally, the user can also check if the same Swiss-Prot sequence of the biocatalyst in question is present in any other metagenomic
dataset by carrying out a homology search against other metagenomic datasets. Another search option is provided to determine if the novel predicted ORF sequence is already present or has a close match with any protein from a known genome available in the Non-Redundant (NR) database. These additional options are helpful in confirming the uniqueness of the novel identified biocatalyst.

CUEsXplorer: Explore Commercially Useful Enzymes (CUEs) Database
This page provides options for exploring the CUEs database for any application category or EC classification. It provides details about enzyme function and the curation summary of any selected enzyme (Fig. 4).

MetaXplorer: Search for Enzymes in Metagenomic Datasets
This query page provides an option to select and search for any enzyme from the six EC classes in metagenomic datasets (Fig. 5).
On selecting any EC class, a list box containing all EC numbers belonging to that class opens up. Selecting an EC number from this list box reveals an expanded page with information on the enzyme name, EC number, Prosite ID, enzymatic reaction, KEGG pathway, and list of all Swiss-Prot IDs belonging to that EC number. Any representative Swiss-Prot sequence can be selected and searched by TBLASTN in one or more metagenomic datasets selected from the drop-down menu. The results “MetaSearch Results” and profile “MetaBioME Profile” pages, for the submitted query, are similar to as explained in the earlier section. This query page provides users with an option to search all known enzymes, as available in EC, irrespective of their known role as biocatalyst, which is a subset of this set.

MetaAlign: Online Application to Search for Protein Sequences in Metagenomic Datasets
MetaAlign is an application powered by the BLAT (faster and less sensitive) and BLAST (slower and more sensitive) sequence alignment tools (Fig. 6).
It provides the user an option to carry out homology-based searches for single or multiple (multi-FASTA format) submitted nucleotide or protein sequences against the metagenomic sequences available in the ten metagenomic datasets. Larger files containing multiple sequences can also be uploaded, with an email being sent to the user on completion of analysis. The searches can be limited by selecting the threshold E-value and the number of resultant hits. The output format can also be specified as “tabular” or “full” (complete alignment).
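Limiting a search by E-value threshold and by number of hits can be reproduced offline on tabular output. The sketch below assumes the tabular output follows the standard 12-column BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); whether MetaAlign emits exactly this layout is an assumption, and the function name `filter_tabular_hits` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def filter_tabular_hits(tabular_text, max_evalue=1e-5, max_hits=10):
    """Return, per query, up to `max_hits` hits with E-value <= `max_evalue`.

    Assumes tab-separated rows in the standard 12-column BLAST tabular
    layout, with the E-value in column 11 and the bit score in column 12.
    """
    per_query = defaultdict(list)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue <= max_evalue:
            per_query[query].append((subject, evalue, bitscore))
    # Best hits first: lowest E-value, then highest bit score.
    return {
        query: sorted(hits, key=lambda h: (h[1], -h[2]))[:max_hits]
        for query, hits in per_query.items()
    }


# Invented rows, for illustration only.
sample = (
    "q1\torfA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "q1\torfB\t40.0\t90\t54\t0\t1\t90\t1\t90\t0.5\t30\n"
    "q1\torfC\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t300\n"
)
top_hit = filter_tabular_hits(sample, max_evalue=1e-5, max_hits=1)
all_good = filter_tabular_hits(sample, max_evalue=1e-5, max_hits=10)
```

Sorting before truncation is a design choice of this sketch: it ensures that the hit limit keeps the strongest matches rather than the first rows encountered.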
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 1 Taxonomy analysis of 500,000 reads from an in vitro-simulated microbial community Morgan et al. (2010). Each circle represents a taxon in the NCBI taxonomy and is scaled logarithmically to indicate how many reads have been assigned to it. In addition to the taxon name, each node is also labeled by the cumulative number of reads assigned to, or below, that node
The input to MEGAN is a file of DNA reads and a file containing all their hits in a reference database, usually in BLAST or SAM format. In addition, at start-up, MEGAN reads in the whole NCBI taxonomy. To perform a taxonomic analysis of a metagenome dataset, MEGAN processes each DNA read in turn, assigning each read to the node in the NCBI taxonomy that is the lowest common ancestor of the set of species associated with all reference sequences that were hit by the read. This approach is known as the LCA algorithm.
The LCA algorithm has a number of parameters, such as minScore, the minimum bit score that an alignment must achieve to be considered; topPercent, a further filter to remove all those hits whose bit score differs by more than the given percentage from the top scoring hit for the given read; and minSupport, the minimum number of reads that a node in the NCBI taxonomy must attract before it is shown in the final output.
Reads that have no hits are assigned to a special node labeled No Hits, whereas reads that have hits but cannot be assigned to a taxon are mapped to a special Unassigned node. In addition, reads consisting of highly repetitive sequence are assigned to a Low Complexity node.
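The LCA assignment and its three filters can be sketched on a toy taxonomy. This is a minimal illustration and not MEGAN's code: the child-to-parent dictionary, the function names, the parameter names `min_score`, `top_percent`, and `min_support`, and the choice to send reads whose hits all fail the score filter (or whose node fails the support filter) to Unassigned are assumptions made for the example.

```python
from collections import Counter


def lineage(taxon, parent):
    """Path from the root down to `taxon`, using a child -> parent map."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the root-to-taxon paths of all `taxa`."""
    lca = None
    for level in zip(*(lineage(t, parent) for t in taxa)):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def assign_read(hits, parent, min_score=50.0, top_percent=10.0):
    """Assign one read from its (taxon, bit_score) hits."""
    if not hits:
        return "No Hits"
    kept = [(t, s) for t, s in hits if s >= min_score]
    if not kept:
        return "Unassigned"  # assumption: hits exist but none is usable
    best = max(s for _, s in kept)
    cutoff = best * (1.0 - top_percent / 100.0)
    taxa = {t for t, s in kept if s >= cutoff}
    return lowest_common_ancestor(taxa, parent)


def summarize(assignments, min_support=2):
    """Count reads per node; nodes below `min_support` fall back to Unassigned."""
    shown = Counter()
    for node, n in Counter(assignments).items():
        special = node in ("No Hits", "Unassigned")
        shown[node if special or n >= min_support else "Unassigned"] += n
    return dict(shown)


# Toy taxonomy (hypothetical subset of names, root has no parent).
parent = {
    "root": None, "Bacteria": "root",
    "Proteobacteria": "Bacteria", "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}
a = assign_read([("Escherichia coli", 100), ("Salmonella enterica", 95)], parent)
b = assign_read([("Escherichia coli", 100), ("Bacillus subtilis", 95)], parent)
c = assign_read([("Escherichia coli", 100), ("Bacillus subtilis", 60)], parent)
```

In this toy case a read hitting two Proteobacteria species equally well is placed on Proteobacteria, a read hitting species from two phyla is placed on Bacteria, and a read with one clearly better hit stays at species level, which is the conservative behavior the LCA approach is meant to provide.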
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource 385 M
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 2 SEED-based functional analysis of 500,000 reads from an in vitro-simulated microbial community Morgan et al. (2010). The SEED classification tree has been partially expanded to show details on functional roles involved in flagellar motility. Each circle represents a SEED category and is scaled logarithmically to indicate the cumulative number of reads that have been assigned to it. In addition to the SEED name, each node is also labeled by the number of reads assigned to, or below, that node
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 3 KEGG-based functional analysis of 500,000 reads from an in vitro-simulated microbial community Morgan et al. (2010). The KEGG classification tree has been partially expanded to show details on KO groups involved in flagellar assembly. Each circle represents a KEGG category and is scaled logarithmically to indicate the cumulative number of reads that have been assigned to it. In addition to the KEGG name, each node is also labeled by the number of reads assigned to, or below, that node
pathways. In both cases, the classification can be represented as a tree with roughly 13,000 nodes.
To perform a SEED-based analysis, for each read in the input, MEGAN identifies the highest scoring hit to a reference sequence for which the corresponding functional role is known and then maps the read to that functional role. In a KEGG-based analysis, each read is mapped to a KO group in a similar fashion.
Both the SEED and KEGG classifications are displayed as trees in MEGAN, and the viewers provide the same interactive features as the taxonomy viewer. In addition, the KEGG viewer allows one to see how reads map to different enzymes in a given pathway; see Figs. 2 and 3.

Sequence Alignment

As pointed out above, the main computational step is to compute pairwise alignments between the set of DNA reads and all sequences in an appropriate reference database. Based on this,
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 4 MEGAN’s alignment viewer constructs and displays a multiple sequence alignment between all reads that map to the same reference sequence. The top track shows the reference sequence and the main panel displays the aligned reads. Letters shown in gray belong to the reads but are not part of the alignment
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 5 High-level comparison of taxonomic content of four different cDNA datasets from a seawater monitoring study (Gilbert et al. 2008). The four different datasets are represented by different colors, and each node shows a bar chart that indicates the number of reads assigned to that node, on a logarithmic scale
document and then, based on this, the program can be used to compute a tree, network, or MDS plot (not shown here).

Handling Large Data

As sequencing technologies continue to improve, the size of analyzed datasets continues to increase. MEGAN was reportedly used to perform the taxonomic analysis of 124 human gut samples involving around 600 gigabases of sequence (Qin et al. 2010). In an ongoing study, MEGAN is currently being used to analyze a set of 350 million reads with 1.3 billion BLASTX matches. While MEGAN is mainly designed for interactive use on a laptop or desktop computer, all features of the program can also be accessed in command-line mode, and thus analyses can also be performed on a server within the framework of a larger bioinformatic analysis pipeline.

Summary

MEGAN is an interactive tool for analyzing the taxonomic and functional content of metagenomic (and metatranscriptomic) datasets.
Metagenome of Acidic Hot Spring Microbial Planktonic Community 389 M
Input is a set of DNA reads and the result of comparing the reads against a reference database.
Taxonomic analysis is performed by placing DNA reads onto nodes of the NCBI taxonomy, whereas functional analysis is based on mapping reads to SEED and KEGG categories.
The program supports comparative analysis of multiple datasets. The program is written in Java and runs on all major operating systems. When run in command-line mode, the program can also be integrated into larger bioinformatic analysis pipelines.

Cross-References

▶ Metagenomics, Metadata, and Meta-analysis

References

Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto J-M, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Dore J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD, Wang J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–6.
Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights, Fig. 1 Photographs of the acidic hot spring El Coquito (EC); red circle indicates the planktonic fraction and black square indicates the biofilm surface formation
characteristics, also showed differences in terms of diversity indexes. However, certain bacterial phyla showed predominance in all of them: Proteobacteria, Aquificae, Chloroflexi, Cyanobacteria, Firmicutes, Nitrospirae, and Thermotogae. Based on cluster analysis of the microbial populations, these spring communities grouped together in a manner consistent with sample physicochemical parameters, with pH and sulfate concentration being the parameters that most influenced the population structure. Some springs were also characterized by site-specific bacterial taxa that distinguished each community. Thus despite their geographic proximity and similar origins, the environmental factors at each location have resulted in marked differences in the microbial assemblages present.

Metagenomic Approach: Taxonomic and Functional Assignment of Metagenome Sequences

Although 16S rRNA gene analysis is very useful for assessing microbial diversity, it does not provide ecologically relevant functional information. Thus a direct analysis of total metagenomic sequences becomes relevant. The current and most frequently used tools for taxonomic and functional classification of metagenomic reads are based on local alignments (BLAST) against different databases and associating best hits to taxa, specific genes, functional identifiers, or metabolic pathways (Montaña et al. 2012). An analysis was therefore carried out with 53 Mb of metagenomic information retrieved from a planktonic fraction of the EC hot spring (Jiménez et al. 2012). However, only 8,121 reads (2.9 %) of the total reads could be assigned to a taxonomic category, suggesting a great amount of previously undescribed sequences or a large amount of noncoding DNA present in these genomes (especially in micro-eukaryotes). A high number of sequences were related to Acidithiobacillales (represented by sequences related to Acidithiobacillus caldus, Acidithiobacillus ferrooxidans, and Acidithiobacillus thiooxidans) followed by Legionellales and Rhodospirillales that included Acidiphilium cryptum (1,681 assigned reads). A high proportion of sequences related to enzymes involved in transposition and integration of mobile genetic elements (transposases) were mapped to the A. cryptum JF-5 genome. By using BLASTX against the NCBI-nr database and the MEGAN v4.0 software, 19,876 sequences were associated with KEGG pathways, specifically to metabolism of carbohydrates (2,623), amino acids (2,584), energy (1,920), and nucleotides (1,431). A total of 87,023 reads (30.9 %) were assigned to 25 COG categories and most of the sequences were related to replication, recombination, and repair (10,712 reads), suggesting that these systems could be important in this ecosystem where high UV radiation, acidic pH, and high water temperature may cause significant damage to DNA. Deep sea hydrothermal vent chimneys and hot spring microbial communities are enriched in genes involved in mismatch DNA repair and homologous recombination, perhaps due to the need for extensive DNA repair systems to cope with extreme conditions that could have potential deleterious effects on their genomes (Klatt et al. 2011; Xie et al. 2011). In this study we also identified sequences associated with quorum sensing and cellular communication in biofilms, structures that could form on the surfaces of these acidic hot springs and could be relevant for ecosystem functionality (Fig. 1).

Metagenomic Approach: Nitrogen and Sulfur Transformations

Pathways involved in nitrogen and sulfur metabolism could be important in acidic hot spring habitats where terminal electron acceptors other than O2 may be relevant, such as nitrate, ferric iron, arsenate, thiosulfate, elemental S, sulfate, or CO2. Genes related to the dissimilatory reduction of nitrate to nitrite (narGHI genes), conversion of nitrite to N2 (nirK, nirS, norB, nosZ), and associated with ferredoxin-nitrite reductase (nirA) were found in the metagenome of EC hot spring (Fig. 2a). The presence of nif
Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights, Fig. 2 Partial (a) nitrogen and (b) sulfur pathways identified by KEGG affiliation of the sequences from EC hot spring. Boxes indicate the KEGG characteristic identified and numbers in gray circles indicate the amount of sequence reads affiliated to the KEGG function (Jiménez et al. 2012)
K genes (associated with sulfate-reducing Thermodesulfovibrio and sulfur-reducing bacteria Desulfitobacterium) also indicated that in addition to denitrification, nitrogen fixation could also be taking place in this acidic hot spring. Based on taxonomic affiliation, the dissimilatory nitrate reduction is most likely carried out by Proteobacteria-like organisms, while assimilatory reduction of nitrate was associated mostly with acidophilic micro-algae, Acidobacteria, Spartobacteria, and Alphaproteobacteria (Jiménez et al. 2012). Conversion of sulfate into adenylylsulfate and, further, the generation of sulfite and H2S were also predicted from sequence analysis of the EC metagenome. This included genes involved in conversion of adenylylsulfate to sulfite (aprAB; cysH), in sulfite reduction and H2S formation (cysI), and in the oxidation of sulfite to sulfate (sulfite oxidase enzymes) (Meyer and Kuever 2008). These pathways indicate that the oxidation of H2S and (or) SO2 could be linked to the acidity of the environment (Jones et al. 2012).

PCR-Target Approach: Proteorhodopsin-Like Genes in Andean Acidic Hot Springs

These Andean mountain hot springs are subjected to a large amount of solar light, yet taxonomic surveys identified only a few phototrophic bacteria (Bohórquez et al. 2012a; Jiménez et al. 2012). Thus a search was conducted to identify energy-harvesting bacterial proteorhodopsins (PRs) that could also contribute to productivity in these ecosystems (Bohórquez et al. 2012b). PRs are retinal-binding bacterial transmembrane proton pumps that can generate energy from light, which are therefore important in terms of carbon cycling and energy flux in various aquatic ecosystems (Fuhrman et al. 2008). PCR with degenerate primers designed to target an internal conserved region in the PR gene was used to identify putative PR sequences. Recovered sequences showed between 80 % and 100 % identity at the amino acid level with previously reported PR sequences from both freshwater and marine samples. These sequences contained conserved residues indicative of proton-pumping activity and of pigments that absorb green light. They harbored diversity at the amino acid level and clustered into three groups, showing similarity with both freshwater and marine sequences. The presence of these genes indicated that PR phototrophy might play a role in these oligotrophic high-mountain aquatic habitats exposed to abundant sunlight by providing a possible advantage that could contribute to survival.

Summary

The sequence-based exploration of the metagenomic content in Andean hot springs goes beyond the identification of taxa using 16S rRNA gene analysis and provides insight into metabolic potential and ecosystem function. Taxonomic surveys of EC spring and other similar springs indicated overall predominance of Bacteria over Archaea, even in the most acidic waters. Certain bacterial taxa predominated, but there were also site-specific groups at each spring, indicating that the surveyed microbiomes were different. The functional annotation showed that the microbial community in EC spring contained pathways involved in nitrogen and sulfur metabolism, as well as extensive DNA repair systems, possibly to cope with UV radiation at such high altitudes. Processes involved in denitrification, nitrogen fixation, and sulfide oxidation were likely linked to the acidity of the environment. Finally, the presence of PR sequences in these communities suggests that these genes might play an important role in bacterial survival in these aquatic ecosystems.

Cross-References

▶ A 123 of Metagenomics
▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Computational Approaches for Metagenomic Datasets
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ Lateral Gene Transfer and Microbial Diversity
▶ Metagenomic Potential for Understanding Horizontal Gene Transfer
▶ Metagenomics, Metadata, and Meta-analysis

References

Aguilera A, Souza-Egipsy V, González-Toril E, Rendueles O, Amils R. Eukaryotic microbial diversity of phototrophic microbial mats in two Icelandic geothermal hot springs. Int Microbiol. 2010;13(1):21–32.
Bohórquez LC, Delgado-Serrano L, Lopez G, Osorio-Forero C, Klepac-Ceraj V, et al. In-depth characterization via complementing culture-independent approaches of the microbial community in an acidic hot spring of the Colombian Andes. Microb Ecol. 2012a;63:103–15.
Bohórquez LC, Ruiz-Pérez CA, Zambrano MM. Proteorhodopsin-like genes present in thermoacidophilic high-mountain microbial communities. Appl Environ Microbiol. 2012b;78(21):7813–7.
Bouraoui H, Rebib H, Aissa MB, Touzel JP, O’donohue M, Manai M. Paenibacillus marinum sp. nov., a thermophilic xylanolytic bacterium isolated from a marine hot spring in Tunisia. J Basic Microbiol. 2013. doi:10.1002/jobm.201200275. [Epub ahead of print].
Fuhrman JA, Schwalbach MS, Stingl U. Proteorhodopsins: an array of physiological roles? Nat Rev Microbiol. 2008;6:488–94.
Jiménez DJ, Andreote FD, Chaves D, Montaña JS, Osorio-Forero C, et al. Structural and functional insights from the metagenome of an acidic hot spring microbial planktonic community in the Colombian Andes. PLoS ONE. 2012;7(12):e52069.
Jones B, Renaut R. Hot springs and geysers. In: Reitner J, Thiel V, editors. Encyclopedia of geobiology. Berlin: Springer; 2011. doi:10.1007/SpringerReference_187284 2012-09-10 14:32:43 UTC. Springer Reference (www.springerreference.com).
Jones DS, Albrecht HL, Dawson KS, Schaperdoth I, Freeman KH, et al. Community genomic analysis of an extremely acidophilic sulfur-oxidizing biofilm. ISME J. 2012;6:158–170.
Klatt CG, Wood JM, Rusch DB, Bateson MM, Hamamura N, et al. Community ecology of hot spring cyanobacterial mats: predominant populations and their functional potential. ISME J. 2011;5:1262–78.
Liu Z, Klatt CG, Wood JM, Rusch DB, Ludwig M, et al. Metatranscriptomic analyses of chlorophototrophs of a hot-spring microbial mat. ISME J. 2011;5:1279–90.
López-López O, Cerdán ME, González-Siso MI. Hot spring metagenomics. Life. 2013;2:308–20.
Mathur J, Bizzoco RW, Ellis DG, Lipson DA, Poole AW, et al. Effects of abiotic factors on the phylogenetic diversity of bacterial communities in acidic thermal springs. Appl Environ Microbiol. 2007;73(8):2612–23.
Meyer B, Kuever J. Homology modeling of dissimilatory APS reductases (AprBA) of sulfur-oxidizing and sulfate-reducing prokaryotes. PLoS One. 2008;3(1):e1514.
Montaña JS, Jiménez DJ, Hernandez M, Angel T, Baena S. Taxonomic and functional assignment of cloned sequences from high Andean forest soil metagenome. A Van Leeuw J Microb. 2012;101:205–15.
Myers N, Mittermeier RA, Mittermeier CG, da Fonseca GA, Kent J. Biodiversity hotspots for conservation priorities. Nature. 2000;403:853–8.
Norris PR. Acidophiles. In: Wiley J and Sons, editors. Encyclopedia of life sciences. 2001. p. 1-6. doi:10.1038/npg.els.000033. http://els.net. Accessed 11 Nov 2011.
Pentecost A, Jones B, Renaut RW. What is a hot spring? Can J Earth Sci. 2003;40:1443–6.
Rzonca B, Schulze-Makuch D. Correlation between microbiological and chemical parameters of some hydrothermal springs in New Mexico, USA. J Hydrol. 2003;280:272–84.
Siering PL, Clarke JM, Wilson MS. Geochemical and biological diversity of acidic, hot springs in Lassen volcanic National Park. Geomicrobiol J. 2006;23(2):129–41.
Stout LM, Blake RE, Greenwood JP, Martini AM, Rose EC. Microbial diversity of boron-rich volcanic hot springs of St. Lucia, Lesser Antilles. FEMS Microbiol Ecol. 2009;70(3):402–12.
Tirawongsaroj P, Sriprang R, Harnpicharnchai P, Thongaram T, Champreda V, et al. Novel thermophilic and thermostable lipolytic enzymes from a Thailand hot spring metagenomic library. J Biotechnol. 2008;133:42–9.
Wang S, Hou W, Dong H, Jiang H, Huang L, et al. Control of temperature on microbial community structure in hot springs of the Tibetan Plateau. PLoS ONE. 2013;8(5):e62901.
Wemheuer B, Taube R, Akyol P, Wemheuer F, Daniel R. Microbial diversity and biochemical potential encoded by thermal spring metagenomes derived from the Kamchatka Peninsula. Archaea. 2013:(136714).
Xie W, Wang F, Guo L, Chen Z, Sievert SM, et al. Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries. ISME J. 2011;5:414–26.
M 396 Metagenomes: 23S Sequences
Metagenomes: 23S Sequences, Fig. 1 (a) Comparison of number of 23S/28S (dark gray bars) and 16S/18S (light gray bars) rRNA fragments retrieved from each GOS sample dataset. (b) Average length of 23S/28S (dark gray circles) and 16S/18S (light gray circles) rRNA fragments from each GOS sample dataset in terms of number of aligned bases within the rRNA gene boundaries, excluding any fragment (23S/28S or 16S/18S) that contained less than 100 aligned bases. Sites marked with an “*” indicate that less than five fragments were retrieved
16S/18S rRNA. Furthermore, 23S/28S rRNA gene fragments are considerably longer than 16S/18S gene fragments (Fig. 1b). Where an average 23S/28S rRNA fragment has 836 aligned bases within the rRNA gene boundaries, a 16S/18S rRNA fragment has 713 aligned bases. More abundant and longer rRNA gene fragments may provide additional information in assessing taxonomic diversity, both with phylogeny and operational taxonomic unit-based methods, as well as increasing the chances to affiliate other gene fragments with specific lineages. Both 23S/28S and 16S/18S rRNA fragments are randomly distributed over the rRNA gene regions, meaning that no specific sequence region is over- or underrepresented.

Taxonomic Diversity Based on 23S and 16S rRNA Genes

Percentages of both 23S and 16S rRNA fragments associated with major marine bacterial and archaeal taxa show good agreement with each other (Fig. 2a, b). Specifically, based on 23S rRNA assignments, 43 % of the retrieved rRNA fragments are associated with Alphaproteobacteria, followed by 17 % Gammaproteobacteria, 9 % Actinobacteria, 8 % Cyanobacteria, 8 % Bacteroidetes, 3 % Betaproteobacteria, 2 % Euryarchaeota, and 0.4 % Crenarchaeota (Fig. 2a). However, less agreement in the assignment of 23S rRNA and 16S rRNA fragments is observed with less
Metagenomes: 23S Sequences, Fig. 2 Percentage of 23S (a) and 16S (b) rRNA fragments associated
with major marine bacterial and archaeal taxa among all GOS sample datasets, except GS038–GS046
and GS050. Percentages were calculated based on absolute numbers of fragments associated with
a given taxon
abundant marine taxa. For example, Chloroflexi- and Deferribacteres-associated fragments are
not observed in the 23S rRNA gene-based classification, which may be ascribed to the lack of
annotated clades for these taxa. In such cases, 16S rRNA gene-based classifications appear to
provide better estimations.

Similar trends are also observed in sample-by-sample distribution of taxa at the “class” level
for both 23S and 16S rRNA-based assignments, as compared to the previous overall assessment
(Fig. 3a, b). Alphaproteobacteria, Gammaproteobacteria, Actinobacteria, Cyanobacteria,
Flavobacteria, and Betaproteobacteria are the most abundant taxa in the majority of sample
datasets. However, differences are observed in the occurrence or relative abundance of minor
groups, such as Planctomycetacia or Aquificae. In certain cases, 23S rRNA-based assessments
predict higher relative abundances or occurrence in sample datasets for other taxa. Up to
12-fold more Epsilonproteobacteria-associated 23S rRNA fragments are found in sample dataset
GS000b compared to 16S rRNA fragments. Additionally, Lentisphaeria, which appears to be present
in ten sites according to 23S rRNA classifications, is observed only at two sites according to
16S rRNA gene classifications.

The former case, where 16S rRNA-based assignments estimated more taxa in more sample datasets,
demonstrates the current drawback of 23S rRNA-based classification (i.e., its lack of
resolution due to insufficient full-length reference sequences). On the other hand, the latter
observations demonstrate that when reference sequences are present for a taxon, the higher
number of 23S rRNA fragments retrieved can capture what is missed with 16S rRNA fragments.

Investigating relative abundances at lower taxonomic levels can shed light on more prominent
habitat-specific diversity patterns. However, with the current size and content of LSU rRNA
reference databases, the 23S rRNA has a distinct disadvantage in achieving this. As summarized
in Table 1, the percentage of 23S rRNA gene fragments that can be classified to a certain taxon
is comparable to the 16S rRNA gene-based classification at domain, phylum or class levels.
A decrease in percentage of classified 23S rRNA fragments was observed at lower levels, from
95 % at the class level down to even 17 % at the genus level. This can be explained by the
23,197 sequences of taxonomically classified cultured organisms in the SILVA SSU Ref dataset
(release 102) versus only 3,602 sequences in the LSU Ref dataset of the same release.
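As a rough illustration of how per-rank percentages of the kind summarized in Table 1 could be tallied, the sketch below counts, for each rank, the share of fragments whose assigned lineage reaches at least that rank. The function name `percent_classified`, the semicolon-separated lineage format, and the toy lineages are assumptions made for this example; the source does not describe the classification pipeline at this level of detail.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]


def percent_classified(lineages):
    """Return {rank: percent of fragments classified at least to that rank}.

    Each lineage is a semicolon-separated path such as
    "Bacteria;Proteobacteria". The number of non-empty path elements is
    taken as the deepest rank reached (an assumption of this sketch).
    """
    total = len(lineages)
    if total == 0:
        raise ValueError("no fragments given")
    depths = [len([t for t in lin.split(";") if t.strip()]) for lin in lineages]
    result = {}
    for depth, rank in enumerate(RANKS, start=1):
        reached = sum(1 for d in depths if d >= depth)
        result[rank] = 100.0 * reached / total
    return result


# Hypothetical toy input, not data from the study.
toy_lineages = [
    "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;SAR11;Pelagibacter",
    "Bacteria;Proteobacteria;Gammaproteobacteria",
    "Bacteria;Cyanobacteria",
    "Archaea",
]

for rank, pct in percent_classified(toy_lineages).items():
    print(f"{rank:<8}{pct:6.1f}")
```

With a smaller reference set, fewer lineages extend to family or genus, so the percentages fall off at the lower ranks in the way described for the 23S rRNA fragments.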
Metagenomes: 23S Sequences, Fig. 3 The relative abundance of 23S (a) and 16S (b) rRNA fragments
associated with different taxa (rows) at each GOS sample dataset (columns). Presence of a spot
indicates the presence of fragments associated with a given taxon, and the area of a spot
represents the relative abundance. Relative abundances are based on absolute counts of all
fragments from a given site associated with a certain taxon, which are then normalized
according to the total fragment counts from that site. Abundances are not normalized with
respect to single copy genes, and since rRNA operons can occur multiple times in a genome, the
numbers do not represent cell abundances. The taxa shown here are on the “class” level, except
Cyanobacteria, which is at the “phylum” level
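The normalization described in the Fig. 3 caption (per-taxon fragment counts divided by the total fragment count of the same site) can be sketched as follows. This is a minimal illustration only: the function name `relative_abundance`, the `(site, taxon)` input format, and the site labels are hypothetical, and, as the caption notes, the resulting fractions describe rRNA fragments rather than cell abundances.

```python
from collections import Counter


def relative_abundance(assignments):
    """Compute per-site relative abundances of taxa.

    assignments: iterable of (site, taxon) pairs, one pair per rRNA fragment.
    Returns {site: {taxon: fraction of that site's fragments}}.
    """
    per_site = {}
    for site, taxon in assignments:
        per_site.setdefault(site, Counter())[taxon] += 1

    result = {}
    for site, counts in per_site.items():
        total = sum(counts.values())  # total fragment count for this site
        result[site] = {taxon: n / total for taxon, n in counts.items()}
    return result


# Hypothetical toy input, not data from the study.
toy_assignments = (
    [("site_A", "Alphaproteobacteria")] * 3
    + [("site_A", "Cyanobacteria")]
    + [("site_B", "Gammaproteobacteria")] * 2
)

for site, fractions in relative_abundance(toy_assignments).items():
    print(site, fractions)
```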
Metagenomes: 23S Sequences, Table 1 Percentage of 23S and 16S rRNA gene fragments that can be
classified up to domain, phylum, class, order, family, and genus levels. Total number of
fragments classified are 20,036 and 12,491 for 23S and 16S rRNA, respectively, excluding
Eukarya and fragments with less than 300 aligned bases for LSU and less than 100 aligned bases
for SSU

Level     23S rRNA gene fragments (%)   16S rRNA gene fragments (%)
Domain    99.9                          100.0
Phylum    96.6                          100.0
Class     94.4                          99.1
Order     78.8                          96.3
Family    35.4                          80.0
Genus     16.6                          31.2

Specificity of Common 23S rRNA Primers and Probes

Including the 23S rRNA gene sequences identified in the GOS metagenome dataset in the SILVA LSU
Parc dataset increased its size by 12 % (SILVA release 102). Furthermore, they have not
undergone PCR amplification and hence provide a unique opportunity for testing the coverage of
previously described universal amplification primers, as well as widely used class-specific
probes.

The most recently developed primer sets (129f, 189f, 457r, 2490r) (Hunt et al. 2006), as well
as primer 2241r (Lane 1991), show reasonable group coverage for the 23S rRNA gene sequences
identified in the GOS dataset with an average of 85 % (Table 2), and the results are comparable
to those obtained from matching the primers against the SILVA LSU Parc dataset (release 102)
with a difference of only 2 %. The reference dataset used by Hunt and colleagues, with 2,176
sequences, is smaller than both the LSU Parc (average of 11,000 target group sequences) and
the GOS 23S (average of 5,400 target group sequences) datasets used in this study. However,
the authors have included environmental shotgun sequences from the Sargasso Sea pilot study
(Venter et al. 2004) in their dataset, which would account for the comprehensiveness of these
primers also in the GOS 23S dataset.

Contrary to these results, the primers developed for the amplification of variable regions of
bacterial 23S rRNA sequences (11a–97ar) (Van Camp et al. 1993) show very poor group coverage in
the GOS 23S dataset sequences, with generally less than 50 % coverage of the target group.
90 % group coverage is only observed for 69ar (Table 2). Although the primer binding sites were
highly conserved, this is counteracted by the very small dataset that these primers were based
on. Surprisingly, primers 53a to 97ar are observed to have higher group coverage within the GOS
23S rRNA sequences than within LSU Parc.

The two archaeal primers (LSU190-F and LSU2445a-R) (DeLong et al. 1999) show very low group
coverage in the GOS 23S dataset (Table 2), with 14 % and 5 %, respectively. Nevertheless, while
the percentages are higher in the LSU Parc, they do not exceed 50 %.

For the BET42a probe (Manz et al. 1992), 79 % group coverage is observed. This, as well as the
number of outgroup hits within the GOS 23S dataset, is close to that reported by a previous
evaluation (Amann and Fuchs 2008) (Table 2). Group coverage within LSU Parc (87 %) is in
accordance with Amann and Fuchs (Amann and Fuchs 2008) (Table 2), although considerably more
outgroup hits, 348 in LSU Parc versus 62, are observed.

The GAM42a probe coverage in the GOS 23S dataset (Table 2) is almost half (42 %) of the value
reported previously (76 %) (Amann and Fuchs 2008) and the corresponding evaluation of the LSU
Parc (78 %) dataset. Since the mismatches could result from sequencing errors, the alignments
of sequences with mismatches to the probe GAM42a were manually inspected. A few cases were
likely to be sequencing errors and were mainly observed in fragments obtained from ends of
sequencing reads. The majority of the mismatches revealed consistent, class-specific
mismatches. These mismatches are up to four bases and are found mainly between E. coli
positions 1,030–1,040. Although this evaluation of the GAM42a probe was based on a single
environment, the surface ocean, limitations and anomalous results with the GAM42a
Metagenomes: 23S Sequences, Table 2 Specificities of selected primers and probes, evaluated on the 23S/28S
rRNA gene fragments retrieved from the GOS metagenomes having more than 300 aligned bases within the rRNA gene
boundaries and on the SILVA Parc release 102 LSU dataset. Outgroup hits are the sum of both Archaea and Eukarya in
case of bacterial primers, both Bacteria and Eukarya in case of archaeal primers, only Eukarya in case of bacterial and
archaeal primers, and non-Betaproteobacteria and non-Gammaproteobacteria for BET42a and GAM42a probes
                                           GOS 23S/28S                             LSU Parc
Primer/probe     Target group              Size of        Group          Outgroup  Size of        Group          Outgroup
                                           target group   coverage (%)   hits      target group   coverage (%)   hits
129f [a]         Bacteria                  4,853          74             0         10,640         82             4
189f [a]         Bacteria                  5,285          87             0         11,508         87             0
457r [a]         Bacteria                  5,551          86             4         11,177         83             279
2241r [b]        Bacteria                  5,832          84             10        11,457         86             3,967
2490r [a]        Bacteria                  5,734          94             0         10,821         98             0
11a [c]          Bacteria                  5,256          20             0         11,478         39             0
23ar [c]         Bacteria                  5,619          23             0         10,526         49             4
43a [c]          Bacteria                  5,633          6              0         10,999         44             0
53a [c]          Bacteria                  5,320          3              0         10,594         1              0
62ar [c]         Bacteria                  5,540          8              0         11,455         5              0
69ar [c]         Bacteria                  5,731          90             0         11,443         87             0
93a [c]          Bacteria                  5,737          62             0         10,322         55             0
93ar [c]         Bacteria                  5,731          63             0         10,327         56             2
97ar [c]         Bacteria                  4,969          55             0         9,165          29             38
LSU190-F [d]     Bacteria and Archaea      5,348          14             0         11,741         24             0
LSU2445a-R [d]   Archaea                   142            5              0         262            28             0
BET42a [e]       Betaproteobacteria        209            79             63        570            87             348
GAM42a [e]       Gammaproteobacteria       980            42             1         2,877          78             10

References: [a] Hunt et al. 2006; [b] Lane 1991; [c] Van Camp et al. 1993; [d] DeLong et al. 1999; [e] Manz et al. 1992
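The two quantities reported in Table 2, group coverage and outgroup hits, can be illustrated with a small sketch. It assumes that a primer or probe "covers" a sequence when it occurs somewhere in that sequence with at most `max_mismatches` mismatches, with IUPAC ambiguity codes in the primer treated as wildcards; the source does not state the matching tool or criteria used, so this rule, the function names `matches` and `evaluate`, and the toy sequences are all assumptions. Reverse primers would need to be reverse-complemented first, which is left out here.

```python
# IUPAC nucleotide ambiguity codes (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(primer, seq, max_mismatches=0):
    """True if the primer occurs in seq with at most max_mismatches mismatches."""
    primer = primer.upper()
    seq = seq.upper().replace("U", "T")
    n = len(primer)
    for start in range(len(seq) - n + 1):
        mismatches = 0
        for p, s in zip(primer, seq[start:start + n]):
            if s not in IUPAC[p]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # window scanned without exceeding the mismatch limit
            return True
    return False


def evaluate(primer, records, target_group, max_mismatches=0):
    """records: iterable of (taxon, sequence) pairs.

    Returns (size of target group, group coverage in %, outgroup hits).
    """
    target_size = target_hits = outgroup_hits = 0
    for taxon, seq in records:
        hit = matches(primer, seq, max_mismatches)
        if taxon == target_group:
            target_size += 1
            target_hits += int(hit)
        elif hit:
            outgroup_hits += 1
    coverage = 100.0 * target_hits / target_size if target_size else 0.0
    return target_size, coverage, outgroup_hits


# Hypothetical toy primer and sequences, not taken from the study.
toy_records = [
    ("Bacteria", "TTACGATGG"),
    ("Bacteria", "TTACGGTGG"),
    ("Bacteria", "TTACGCTGG"),
    ("Bacteria", "CCCCCCCC"),
    ("Archaea", "GGACGATCC"),
    ("Eukarya", "GGGGGGG"),
]
print(evaluate("ACGRT", toy_records, "Bacteria"))
```

Raising `max_mismatches` trades specificity for coverage, which is one way to explore how consistent, class-specific mismatches such as those reported for GAM42a affect the numbers.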
probe have been reported previously for other environments as well, which were found to be
mainly due to polymorphisms at E. coli position 1,033 (Yeates et al. 2003; Barr et al. 2010).
Our observation confirms these reports, by adding additional polymorphisms before and after
this position. Consequently, the limitations of the GAM42a probe might be more severe than
previously thought, and therefore, we recommend the design and testing of novel
Gammaproteobacteria probes.

Summary

This comparative overview of 16S and 23S rRNA fragments retrieved from the GOS metagenomes
exemplifies the possibility and power of using 23S rRNA genes. High-quality taxonomic
classification for biodiversity analysis, as well as primer and probe design, depends on the
size and extent of the reference dataset used. The advantage of using the larger 23S rRNA genes
for biodiversity analysis, especially for the marine system, has been shown previously (Peplies
et al. 2004). Additionally, a recent study assessing the diversity of paralogous 23S rRNA genes
has shown that significant sequence diversification was observed in 184 species, further
supporting the suitability of this molecule for taxonomy (Pei et al. 2009). Although an obvious
limitation faced during this study was the small size of the 23S rRNA gene reference datasets,
this is likely to be overcome in the near future with the contribution of (meta-)genomic
sequences from
M 402 Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome
mega-sequencing projects, such as the Human Microbiome Project, the TerraGenome, the Tara
Oceans, or the Genomic Encyclopedia of Bacteria and Archaea. Moreover, studies assessing the
characteristics and sequence diversity of 23S rRNA genes in bacterial and archaeal genomes, in
combination with efforts to design, test, and reevaluate universal and group-specific primers
and probes, can renew the interest and utilization of this molecule. The application of
continually advancing, cheaper sequencing technologies to the undiscovered fraction of the 23S
rRNA gene sequences can result in a higher appreciation of this valuable phylogenetic marker.

References

Amann R, Fuchs BM. Single-cell identification in microbial communities by improved fluorescence
in situ hybridization techniques. Nat Rev Microbiol. 2008;6(5):339–48.
Barr JJ, Blackall LL, et al. Further limitations of phylogenetic group-specific probes used for
detection of bacteria in environmental samples. ISME J. 2010;4:959–61.
DeLong E, Taylor L, et al. Visualization and enumeration of marine planktonic Archaea and
Bacteria by using polyribonucleotide probes and fluorescent in situ hybridization. Appl Environ
Microbiol. 1999;65(12):5554–63.
Hunt DE, Klepac-Ceraj V, et al. Evaluation of 23S rRNA PCR primers for use in phylogenetic
studies of bacterial diversity. Appl Environ Microbiol. 2006;72(3):2221–5.
Lane DJ. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M, editors. Nucleic acid
techniques in bacterial systematics. Chichester/New York: Wiley; 1991. p. 115–75.
Ludwig W, Klenk HP. A phylogenetic backbone and taxonomic framework for prokaryotic
systematics. In: Boone DR, Castenholz RW, editors. The Archaea and the deeply branching and
phototrophic Bacteria. New York: Springer; 2001. p. 49–65.
Ludwig W, Schleifer KH. Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS
Microbiol Rev. 1994;15(2–3):155–73.
Ludwig W, Schleifer K. Phylogeny of Bacteria beyond the 16S rRNA standard. ASM News.
1999;65(11):752–7.
Ludwig W, Rossello-Mora R, et al. Comparative sequence analysis of 23S rRNA from
Proteobacteria. Syst Appl Microbiol. 1995;18:164–88.
Manz W, Amann R, et al. Phylogenetic oligodeoxynucleotide probes for the major subclasses of
Proteobacteria: problems and solutions. Syst Appl Microbiol. 1992;15(4):593–600.
Pei A, Nossa CW, et al. Diversity of 23S rRNA genes within individual prokaryotic genomes. PLoS
ONE. 2009;4(5):e5437.
Peplies J, Glöckner FO, et al. Comparative sequence analysis and oligonucleotide probe design
based on 23S rRNA genes of Alphaproteobacteria from North Sea bacterioplankton. Syst Appl
Microbiol. 2004;27(5):573–80.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA
ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acid
Res. 2013;41:D590–D596.
Rijk P, Peer Y, et al. Evolution according to large ribosomal subunit RNA. J Mol Evol.
1995;41(3):366–75.
Van Camp G, Chapelle S, et al. Amplification and sequencing of variable regions in bacterial
23S ribosomal RNA genes with conserved primer sequences. Curr Microbiol. 1993;27(3):147–51.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea.
Science. 2004;304(5667):66–74.
Yeates C, Saunders AM, et al. Limitations of the widely used GAM42a and BET42a probes targeting
bacteria in the Gammaproteobacteria radiation. Microbiology. 2003;149(5):1239–47.

Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome

Brian V. Jones (1) and C. G. M. Gahan (2)
(1) Centre Biomedical and Health Science Research, University of Brighton, School of Pharmacy
and Biomolecular Sciences, Brighton, East Sussex, UK
(2) Department of Microbiology, School of Pharmacy & Alimentary Pharmabiotic Centre, University
College Cork, Cork, Ireland

Definitions

Metagenome/metagenomics: The collective genomes of all members of a particular microbial
community may be referred to as the metagenome (or a genome of many).
Metagenomics refers to methods which seek to understand the composition, development, and
function of microbial ecosystems through analysis of the community metagenome.

Function-driven metagenomics: A metagenomic approach in which emphasis is placed on the
recovery of genes encoding a defined function of interest, through assays based on heterologous
gene expression. Typically metagenomic DNA is used to generate genetic libraries in a surrogate
host species that may be easily manipulated in the laboratory. Each clone in the library
(analogous to books in a conventional library) represents a fragment of metagenomic DNA from
a member of the microbial community under study. Libraries are then subsequently screened to
identify clones encoding and expressing activities of interest.

Large-insert library/genetic library: Due to the complexity of microbial communities, genetic
libraries constructed for function-driven metagenomic analysis often seek to clone large
fragments of metagenomic DNA (typically ranging from 40 to 200 kb in size, depending on the
specifics of the cloning system used). The term “insert” refers to the metagenomic DNA
fragments which are ligated, or “inserted,” into a plasmid vector that maintains them in the
surrogate host bacterium. Insert sizes of ~40 kb and over are usually referred to as “large
inserts,” giving rise to the term “large-insert library.”

Sequence-driven metagenomics: A metagenomic approach in which the emphasis is placed on the
generation and analysis of nucleotide sequence data from metagenomic DNA. Typically
sequence-based approaches are utilized to provide a broad overview of the population structure
and predicted functions undertaken by a microbial community.

Heterologous gene expression: Refers to the expression of genes in an organism from which they
did not originate. For function-driven metagenomics, this generally refers to the expression of
genes encoded by cloned fragments of metagenomic DNA in the surrogate host species used to
construct genetic libraries (typically Escherichia coli).

Bile Acids and Microbial Bile Acid Metabolism

Bile acids (BA) are cholesterol derivatives synthesized in the liver and linked with either
glycine or taurine to form conjugated bile acids (CBA) (Ridlon et al. 2006; Begley
et al. 2005a, b; Fig. 1). The dominant CBA in humans are glycine conjugates of cholic acid and
chenodeoxycholic acid, with CBA forming a major component of bile stored in the gall bladder
(Ridlon et al. 2006). In response to food intake, bile is secreted into the lumen of the
intestine where CBA facilitate the digestion of dietary fat, promoting the emulsification of
lipids and their subsequent absorption across the intestinal epithelium (Ridlon et al. 2006;
Begley et al. 2005a). However, the functions of bile acids are not limited to digestion, and BA
are also important signaling molecules that contribute to the regulation of diverse metabolic
processes (Thomas et al. 2008; Fig. 2). These include regulation of mucosal immune responses in
the intestine, as well as aspects of energy homeostasis and fat storage (Thomas et al. 2008;
Inagaki et al. 2006; Houten et al. 2006; Jones 2011; Watanabe et al. 2006; Fig. 2). As such, BA
are now no longer viewed as purely digestive secretions but also as metabolic integrators and
key regulators of intestinal homeostasis (Thomas et al. 2008; Hofmann and Eckmann 2006; Jones
2011).

The regulatory functions of bile acids are believed to act through two main receptors, the
nuclear receptor FXRalpha and the membrane receptor TGR5, for which bile acids are the natural
ligands (Thomas et al. 2008). These receptors are highly expressed in the liver and intestinal
tissues but also in numerous extraintestinal tissues (Thomas et al. 2008). Although the
majority of bile acids are efficiently reclaimed from the intestine and returned directly to
the liver for reuse (referred to as enterohepatic circulation), a portion enters the systemic
circulation and signals other organs through these receptors, coordinating cholesterol,
triglyceride and glucose metabolism, as well as fat storage (Thomas et al. 2008; Fig. 2).
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 1 Structure of
dominant conjugated bile acids in humans (Modified from Begley et al. 2005a). Major bile acid
species in the human bile acid pool are conjugated forms linked to either glycine or taurine
via amide bonds, with glyco-conjugates dominant in humans (Ridlon et al. 2006). The predominant
bile acid species are cholic acid (CA) and chenodeoxycholic acid (CDCA), which are generated de
novo in the liver from cholesterol and referred to as primary bile acids (Ridlon et al. 2006;
Thomas et al. 2008). De-conjugated CA and CDCA, as well as derivatives of these primary BA
formed in the intestine, are recovered and returned to the liver where they are conjugated and
re-assimilated into the bile acid pool. For comprehensive reviews, see Ridlon et al. (2006) and
Begley et al. (2005a)
CBA have also been implicated in the control of microbial growth in the small intestine via
toxic effects on colonizing bacteria (Begley et al. 2005a; Ridlon et al. 2006). This
antimicrobial effect is thought to repress bacterial growth in the small intestine and prevent
microbes proliferating to levels which are harmful to the human host. Local mucosal immune
responses in the intestine are also regulated by bile acids (through FXRalpha) and implicated
in microbial population control in this compartment (Inagaki et al. 2006). It is most likely
that bile acid mediated mucosal immune regulation works in synergy with the direct effects of
bile acids on resident microbes, to prevent bacterial overgrowth in the small intestine and
associated deleterious effects on host health (Inagaki et al. 2006; Hofmann and Eckmann 2006;
Begley et al. 2005a; Fig. 2).

However, once secreted into the intestinal lumen, CBA are subject to extensive
biotransformation by indigenous gut microbes, leading to the formation of a range of secondary
and tertiary products (Ridlon et al. 2006; Begley et al. 2005a; Jones et al. 2008; Fig. 3).
These modified bile acids display altered binding characteristics for bile acid receptors, with
microbial products of bile acid metabolism among the most potent agonists (Thomas et al. 2008).
This highlights the potential for microbes resident in the human gut microbiome to influence
wider aspects of host metabolism and phenotype, through interaction with bile acid signaling
pathways (Jones et al. 2008; Thomas et al. 2008; Jones 2011; Ogilvie and Jones 2012). Congruent
with this hypothesis is the accumulating body of evidence implicating microbial bile acid
metabolism as the basis of a long-standing dialogue between the human host and its gut
microbiome (Jones 2011; Inagaki et al. 2006; Gadaleta et al. 2011; Maran et al. 2009; Modica
et al. 2008; Duboc et al. 2013; Jones et al. 2008). As such, there is increasing interest in
understanding the role of this activity in human health and disease processes, with this
function of the gut microbiome likely to be a viable target for disease prevention through
manipulation, or augmentation of the intestinal microbial ecosystem.
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 2 Overview of
physiological functions undertaken by bile acids. Boxes shaded violet summarize the direct
functions of bile acids in the small intestine, attributed to their physical properties. Boxes
shaded blue summarize regulatory functions of bile acids, through interaction with the main
bile acid receptors TGR5 and FXRalpha. For comprehensive reviews of bile acid signaling, see
Thomas et al. (2008)
Overview of Bile Salt Hydrolases: Biochemistry, Structure, and Function

Bile salt hydrolases (BSH; EC 3.5.1.24) (also designated as choloylglycine hydrolases or
conjugated bile acid hydrolases) are members of the N-terminal nucleophilic (Ntn) hydrolase
superfamily of proteins and catalyze the hydrolysis of conjugated bile acids, linked with the
amino acids taurine or glycine (tauro-CBA, glyco-CBA), to liberate free primary bile acids and
amino acids (Fig. 3; Begley et al. 2006; Kumar et al. 2006). The wider enzyme superfamily also
contains the penicillin V acylase (PVA; EC 3.5.1.11) enzyme family, and BSH and PVA enzymes
share significant homology and catalyze hydrolysis of the same type of chemical bond.

While the main substrates for BSH and PVA enzymes (conjugated bile acids and penicillins,
respectively) vary considerably in structure, PVA has been shown to exhibit some moderate
activity against bile acids and some BSH enzymes demonstrate mild activity against penicillin
V (Kumar et al. 2006). This suggests that each enzyme group has preferential activity against
a specific substrate but that some overlap in activities also occurs (Kumar et al. 2006). The
sequence homology between these enzyme families has led to mis-annotation of PVA in some
bacterial genomes, for example, in the initial genome annotation of Listeria monocytogenes
(Begley et al. 2005b) and Lactobacillus plantarum WCFS1 (Lambert et al. 2008a). This
highlights a requirement for functional enzymatic
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 3 Major bile
acid transformations undertaken by the human gut microbiota (Modified from Jones 2011). Bile
salt hydrolase (BSH) catalyzes the initial de-conjugation of CBA to liberate free primary bile
acids and amino acids. Free primary bile acids are then available for further modification by
the gut microbiome and converted to secondary forms. A multistep 7-alpha dehydroxylation
pathway is responsible for generation of key secondary BA species which accumulate in the bile
acid pool. GCA, TCA, glyco- and tauro-conjugated cholic acid, respectively; GCDCA, TCDCA,
glyco- and tauro-conjugated chenodeoxycholic acid, respectively; CA, CDCA, free primary bile
acids cholic acid and chenodeoxycholic acid, respectively; DCA, LCA, free secondary BA
deoxycholic acid and lithocholic acid, respectively. For comprehensive reviews of microbial
bile acid transformations, see Ridlon et al. (2006) and Begley et al. (2005a)
analysis in order to determine substrate preferences and to guide annotation (Jones
et al. 2008; Lambert et al. 2008b).

The crystal structure has been solved for a number of BSH (Kumar et al. 2006; Rossocha
et al. 2005) and PVA (Suresh et al. 1999) enzymes and demonstrates a conservation in overall
structure suggestive of shared mechanisms of action and an evolutionary relationship between
BSH and PVA (Kumar et al. 2006). Detailed analysis of the structure of BSH and PVA enzymes
indicates that there is a significant difference in the organization of specific loops near the
active site in each case which may explain differences in substrate specificity (Kumar
et al. 2006).

Structural and functional analysis of BSH enzymes from different bacteria has revealed the
presence of conserved amino acids that are thought to be essential for bile hydrolysis. In
particular the thiol group of the Cys-1 amino acid has been shown to be essential for catalytic
activity (Kim et al. 2004; Lodola et al. 2012). In addition a number of amino acids including
Asp-20, Tyr-82, Asn-175, and Arg-228 are highly conserved across numerous BSH enzymes
(Begley et al. 2006) and have recently been shown to be essential for catalytic activity mainly
through electrostatic interactions with the Cys-1 sulfur atom (Lodola et al. 2012) (Begley
et al. 2006). Despite high levels of amino acid conservation, different BSH enzymes display
subtle differences in their preferred bile substrates with some enzymes exhibiting hydrolysis
of glyco- and tauro-conjugated bile acids and others demonstrating specific hydrolysis of
tauro-conjugated bile acids (Jones et al. 2008). BSH enzymes with specificity for
tauro-conjugated bile acids are highly represented among the Bacteroidetes and form a separate
phylogenetic group relative to other BSH enzymes but have not been characterized in detail
(Jones et al. 2008). Further biochemical analysis of a variety of BSH enzymes is warranted to
determine the structural variances that give rise to these subtle differences in bile acid
substrate range.

Metagenomic Analysis of Bile Salt Hydrolases (BSHs)

As the human gut microbiota is composed predominantly of microbes which are yet to be grown in
the laboratory, a range of culture-independent approaches have been developed and applied to
study this and other microbial communities (Handelsman 2004; Jones and Marchesi 2007; Qin
et al. 2010; Kurokawa et al. 2007; Gill et al. 2006). Metagenomic approaches constitute
a particularly powerful branch of the culture-independent techniques available for
characterization of microbial ecosystems, in which the collective genomes of all species
comprising a community are considered as a single, community-wide, genetic unit (the
metagenome) (Handelsman 2004). Access and analysis is guided by this basic principle, and
metagenomic approaches are rooted in the extraction of total, mixed community DNA (metagenomic
DNA) without any prior cultivation (Handelsman 2004). Recovered community DNA is then either
subject to direct analysis using high-throughput sequencing (shotgun metagenomics or
sequence-based metagenomics) or used to construct large-insert genetic libraries for
function-based screening (function-driven metagenomics) (Handelsman 2004) (Fig. 4).

The resulting data not only affords access to census-type information describing the
composition of a community (who is there?) but also permits access to the broader functional
content encoded by microbial ecosystems (what are they doing?) (Handelsman 2004; Jones
et al. 2008). Recently both function-driven and sequence-based metagenomic approaches have been
applied to analyze BSH activity in the gut microbiome and provide good examples of the capacity
for metagenomics to generate novel functional insights into a microbial community and, in the
case of the human microbiome, to understand its influence on host health (Jones et al. 2008;
Ogilvie and Jones 2012).

Function-Driven Metagenomic Analysis of Bile Salt Hydrolases: Due to the relative paucity of
information regarding the genes underpinning bile acid metabolism in the gut microbiome,
initial community-wide studies of this activity utilized a function-driven metagenomic
approach, to assess the diversity and phylogenetic distribution of BSH activity in this
ecosystem (Jones et al. 2008; Fig. 4). The reliance on heterologous gene expression in the
surrogate host (typically E. coli) and the requirement for a phenotypic screen for the trait of
interest are clear limitations of the function-based strategy but are offset by unique benefits
of this approach over other metagenomic techniques (Handelsman 2004). A major advantage of the
function-driven approach is that no prior knowledge or sequence data for the genes underpinning
an activity is required, which not only allows the application of metagenomics to poorly
studied microbial functions in a community (such as bile acid metabolism) but also permits the
recovery of novel, unrelated enzyme classes catalyzing a particular reaction (Jones
et al. 2008; Handelsman 2004). Furthermore, a clear confirmation of activity among the genes
identified is intrinsic to the function-driven approach. This is
M 408 Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 4 Overview of metagenomic approaches to study microbial ecosystems. Recovery of metagenomic DNA (Modified from Ogilvie et al. 2012): Metagenomic approaches begin with sampling the microbial ecosystem and extracting DNA from the mixed community as a whole, without any prior cultivation. This metagenomic DNA may then be subjected to one or more strategies to access the functional content of the ecosystem under study and/or explore the population structure and identify species present. Sequence-driven metagenomics: Metagenomic DNA may be subject directly to high-throughput sequencing (shotgun metagenomics) or first used as a template for PCR reactions intended to amplify key genes of interest. The latter is most typically used to amplify phylogenetic anchors, such as genes for 16S ribosomal RNA, which permit a census of the species present in a community. Sequences generated directly in the shotgun approach can subsequently be compared with well-characterized microbial genomes and/or assembled into large contigs and genes predicted, in order to assess the functions encoded by community members (with information on population structure also captured in this strategy where relevant gene
of major benefit in the analysis of enzymes such as BSH, which share a considerable degree of sequence homology with closely related enzymes in the wider Ntn_CGH-like (COG3049) family of proteins (Jones et al. 2008; Kumar et al. 2006). In particular, BSHs are closely related to penicillin V amidases, from which they are believed to have evolved, and comparison of sequence data alone is often insufficient for the accurate prediction of function in these enzymes (Kumar et al. 2006; Jones et al. 2008).

The function-driven approach employed to survey BSH activity in the gut microbiome was based on screening large-insert genetic libraries (constructed from metagenomic DNA derived from stool samples), using a simple plate-based assay to identify clones able to de-conjugate CBA (Fig. 5 Library construction and Screen). The basis of this screen is the complementation of the BSH-deficient E. coli host used to construct libraries and the subsequent de-conjugation of CBA incorporated into the bacterial growth media used for screening (Jones et al. 2008; Dashkevicz and Feighner 1989). Once liberated, free bile acids are no longer soluble and precipitate to form a halo around BSH-positive clones, allowing those harboring active BSH to be easily identified and recovered for further analysis (Jones et al. 2008; Fig. 5). Characterization of BSHs recovered from the human gut metagenomic library through function-based screening provided the basis to subsequently examine the distribution and evolution of this activity among members of the gut microbiome, the conservation of this function between distinct human microbiomes, and the role of this activity in gut-associated bacteria (Jones et al. 2008; Ogilvie and Jones 2012; Fig. 6).

Distribution of BSH Activity Among Members of the Human Gut Microbiome: Sequence data obtained from metagenomic clones encoding BSH activity was used to predict the phylogenetic origin of the BSHs obtained and determine which members of the gut microbiome encode this function (Jones et al. 2008). Although the taxonomic resolution afforded by this analysis was limited by a lack of conserved phylogenetic anchors in many metagenomic clones (such as 16S rRNA genes) and the limited availability of genome sequences from gut-associated bacterial species at the time of analysis (against which recovered BSH sequences could be compared), this survey nevertheless revealed a broad distribution of BSH activity within the gut microbiome (Jones et al. 2008).

All major bacterial phyla comprising the human gut microbiome (Bacteroidetes, Firmicutes, Actinobacteria) were shown to encode this function, highlighting the high level of redundancy and general stability of BSH activity within the community (Jones et al. 2008). Furthermore, BSH activity was also identified in the archaeal species Methanobrevibacter smithii, which commonly forms a part of the human gut microbiome (Jones et al. 2008). These observations further expanded the representation of BSH among community members and revealed this function to be
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 4 (continued) such as 16S rRNA genes are identified). Function-driven metagenomics: these approaches rely on the construction of large-insert genetic libraries and the heterologous expression of cloned genes in the surrogate host species (as used to explore BSH activity in the human gut microbiome; Jones et al. 2008). Although the requirement for genes originating in diverse and distantly related species to express functional proteins in the library host is a limitation of this method, unlike sequence-driven approaches, there is no requirement for prior information or well-characterized sequences from genes of interest. This is a major advantage of the function-driven approach which facilitates the identification of novel enzyme classes and is well suited to explore activities for which few initial examples of well-characterized genes or proteins exist. However, a second major caveat of the function-driven approach is that a suitable high-throughput screen for the activity of interest must also be available (see Fig. 5). Fosmids (vectors based on the E. coli F-plasmid) and BACs (bacterial artificial chromosomes) represent the most commonly used systems for construction of large-insert metagenomic libraries.
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 6 Relative abundance of bile salt hydrolases in the gut microbiome in health and disease (From Ogilvie and Jones 2012). Human gut microbiomes from the MetaHIT dataset were surveyed using sequences from BSHs with proven function to identify homologues of these genes (minimum of 35 % amino acid identity over 50 aa or more and an E-value of 1e-5 or lower) in the 124 individual gut microbiomes represented in this dataset (Qin et al. 2010). Identified BSH sequences were subsequently affiliated to different bacterial divisions based on sequence similarities and used to calculate the relative abundance of BSHs for major phylogenetic divisions in each gut microbiome (expressed as Hits/Mb DNA). ACT, Actinobacteria; BACT, Bacteroidetes; FIRM, Firmicutes; TOTAL, BSH relative abundance in the MetaHIT dataset as a whole irrespective of phylogenetic affiliations. Healthy, healthy individuals only (n = 99); UC, individuals with ulcerative colitis only (n = 21); CD, individuals with Crohn's disease only (n = 4). Error bars indicate standard error of the mean. Level of significance: * P < 0.01; ** P < 0.00
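The homology thresholds and the Hits/Mb normalisation quoted in the Fig. 6 caption can be illustrated with a minimal Python sketch. It is not the pipeline used by Ogilvie and Jones (2012). It assumes search results in a 12-column BLAST-style tabular layout, and the `SAMPLE|gene` query naming, the function names, and the demo values are all hypothetical.

```python
from collections import defaultdict

# Thresholds quoted in the Fig. 6 caption.
MIN_IDENTITY = 35.0  # percent amino acid identity
MIN_ALN_LEN = 50     # aligned amino acids
MAX_EVALUE = 1e-5


def parse_hits(lines):
    """Yield (query, identity, aln_len, evalue) from tabular search output.

    Assumes the 12-column BLAST tabular layout: column 1 is the query ID,
    column 3 percent identity, column 4 alignment length, column 11 E-value.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], float(fields[2]), int(fields[3]), float(fields[10])


def is_bsh_homologue(identity, aln_len, evalue):
    """Apply the three caption thresholds to one hit."""
    return (identity >= MIN_IDENTITY
            and aln_len >= MIN_ALN_LEN
            and evalue <= MAX_EVALUE)


def hits_per_mb(lines, sample_sizes_bp):
    """Return {sample: passing hits per Mb of sequence}.

    Hypothetical convention: query IDs look like 'SAMPLE|gene', so the sample
    name is the text before the first '|'. Each query is counted once.
    """
    passing = defaultdict(set)
    for query, identity, aln_len, evalue in parse_hits(lines):
        if is_bsh_homologue(identity, aln_len, evalue):
            passing[query.split("|", 1)[0]].add(query)
    return {s: len(passing[s]) / (bp / 1e6) for s, bp in sample_sizes_bp.items()}


def _row(query, identity, aln_len, evalue):
    # Build a 12-column demo line; columns not used here hold placeholders.
    cols = [query, "BSH_ref", str(identity), str(aln_len),
            "0", "0", "1", "1", "1", "1", str(evalue), "0"]
    return "\t".join(cols)


# Made-up demo data, for illustration only.
demo_lines = [
    _row("S1|g1", 42.0, 120, 1e-20),
    _row("S1|g2", 36.5, 60, 1e-6),
    _row("S2|g1", 55.0, 200, 1e-40),
    _row("S2|g2", 30.0, 200, 1e-40),  # fails the identity threshold
    _row("S2|g3", 60.0, 40, 1e-10),   # fails the alignment length threshold
]
demo_sizes = {"S1": 2_000_000, "S2": 4_000_000}
abundance = hits_per_mb(demo_lines, demo_sizes)
print(abundance)
```

Normalising by the amount of sequence per sample is what makes counts comparable between microbiomes sequenced to different depths.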
microbial BSHs suggests that these may be an example of a mutually acceptable arrangement between the host and its microbiome (Jones et al. 2008).

BSH Activity as a Conserved Feature of the Human Gut Microbiome: Although the initial application of function-driven screens provided much fundamental insight into bile acid metabolism by the gut microbiome, these studies also fill an additional role in generating baseline sequence data from genes with proven functions or activities. Such data in itself constitutes a useful and valuable resource for numerous other applications, including the accurate annotation and interpretation of shotgun metagenomes (and complete bacterial genomes), opening the way for larger-scale sequence-based surveys of key functions within microbial ecosystems. This is exemplified by the use of BSH recovered through function-driven metagenomics to
subsequently interrogate a range of sequence-based shotgun metagenomes, in order to examine the representation of this activity among distinct gut communities and other microbial ecosystems (Ogilvie and Jones 2012; Jones et al. 2008).

This approach was first applied to survey 15 human gut metagenomes and several non-gut metagenomes from a range of habitats (Jones et al. 2008). Comparison of the relative abundance of genes with homology to functional BSHs in human gut microbiomes with non-gut habitats revealed an enrichment of putative BSHs in the human gut microbiome (Jones et al. 2008). This is in keeping with the concept of CBA as an important habitat-associated selective pressure for gut microbes (absent in non-gut environments) and BSH as a conserved microbial adaptation to life in the mammalian intestinal tract (Jones et al. 2008).

When relative abundance of BSH homologues was compared between individual gut microbiomes, the potential for interindividual variation in abundance and types of BSH was also highlighted (Jones et al. 2008). Because BSH catalyzes the initial rate-limiting step in the wider pathway of microbial bile acid metabolism facilitated by the gut microbiome (Fig. 3), variation in overall levels of BSH should be a good predictor of the capacity for bile acid modification in a given microbiome (Jones et al. 2008; Ogilvie and Jones 2012). Furthermore, previous characterization of BSH types originating from the main phylogenetic groups in the human gut microbiome revealed differences in substrate range of enzymes encoded by different phyla, highlighting the potential for shifts in community structure to also alter aspects of bile acid metabolism by altering the prevailing bile acid modifications undertaken by gut microbes (Jones et al. 2008).

Metagenomic Analysis of Bile Salt Hydrolases in Health and Disease

Due to the role of bile acids in regulating metabolism and mucosal immune responses and the potential for the gut microbiome to influence this signaling network through bile acid transformations, alterations in capacity for bile acid metabolism in the human gut microbiome may play a role in the pathogenesis of numerous diseases (Jones et al. 2008; Ogilvie and Jones 2012; Jones 2011). For example, the products of microbial bile acid metabolism have been linked to the initiation and pathogenesis of colorectal cancer (CRC) through several mechanisms, including the direct carcinogenicity of some BA (Bernstein et al. 2005; Hill 1990; O'Keefe 2008; Debruyne et al. 2001).

Recent observations also implicate the perturbation of bile acid signaling as a potential mechanism contributing to the pathogenesis of CRC and inflammatory bowel diseases, with the dedicated bile acid receptor FXRalpha demonstrated to be protective against both CRC and Crohn's disease in murine models (Gadaleta et al. 2011; Modica et al. 2008; Duboc et al. 2013; Maran et al. 2009). Since activation of this receptor is implicated in the downregulation of mucosal immune responses and protection against autoimmune damage and induction of antiapoptotic pathways in the human gut (Gadaleta et al. 2011; Duboc et al. 2013), alterations to microbial bile acid metabolism leading to changes in the balance of BA species available for receptor binding have clear implications for disease initiation and progression.

The initial function-driven metagenomic analysis of BSH activity in the gut microbiome also provided the basic information to explore these theories further and to begin to explore the association between microbial bile acid metabolism and intestinal diseases (Jones et al. 2008; Ogilvie and Jones 2012). This is exemplified by the application of gut-derived BSH sequences (with proven activity) to explore changes in the BSH profile in the microbiomes of individuals with inflammatory bowel disease (Ogilvie and Jones 2012). Surveys of whole community shotgun metagenomes for genes homologous to functional BSH sequences revealed a distinct reduction in the relative abundance of BSH homologues in the gut microbiomes of individuals with Crohn's disease (CD), primarily within BSHs affiliated with
the Firmicutes division (Ogilvie and Jones 2012). These changes are in keeping with the well-documented dysbiosis and shift in community structure characteristic of CD (where the diversity of Firmicutes is markedly reduced) (Manichanh et al. 2006; Qin et al. 2010) and the role of FXRalpha signaling in regulation of mucosal immune responses (Gadaleta et al. 2011). These metagenomic-based predictions of changes in functional capacity of the CD gut microbiome related to bile acid metabolism have since been validated and a reduction in capacity for bile acid modification demonstrated in active disease (Duboc et al. 2013). The apparent deficiency of this function in the CD gut microbiome now raises the potential for targeting bile acid metabolism in the gut microbiota as a marker for disease risk or therapeutic intervention.

Summary

The analysis of bile acid metabolism in the human gut microbiome has benefited greatly from the application of metagenomics and provides an excellent example of how these powerful community-level approaches can rapidly provide significant insight into the functioning and development of microbial ecosystems. In the case of the human gut microbiome, and other host-associated microbial consortia, metagenomic approaches can also generate new understanding of how bacteria interact with and impact upon their higher host organisms.

In the case of bile acid metabolism by the gut microbiome, the deployment of metagenomics to explore this aspect of the indigenous intestinal microbiota has rapidly enhanced our understanding of this activity, its effect on human health, and its function within the gut microbiome. Our knowledge of bile acid metabolism by the gut microbiome has now been elevated to a point where tangible hypotheses regarding impacts on host health can be formulated and tested. Although much remains to be done and our understanding is far from complete, metagenomics will undoubtedly continue to play a key role in ongoing studies and has already yielded new targets for disease diagnosis, prophylaxis, or treatment which can now be explored further.

References

Begley M et al. The interaction between bacteria and bile. FEMS Microbiol Rev. 2005a;29:625–91.
Begley M et al. Contribution of three bile-associated loci, bsh, pva, and btlB, to gastrointestinal persistence and bile tolerance of Listeria monocytogenes. Infect Immun. 2005b;73:894–904.
Begley M et al. Bile salt hydrolase activity in probiotics. Appl Environ Microbiol. 2006;72:1729–38.
Bernstein H et al. Bile acids as carcinogens in human gastrointestinal cancers. Mutat Res. 2005;589:47–65.
Dashkevicz MP, Feighner SD. Development of a differential medium for bile salt hydrolase-active Lactobacillus spp. Appl Environ Microbiol. 1989;55:11–6.
Debruyne PR et al. Mutat Res. 2001;480–81:359–69.
Duboc H et al. Connecting dysbiosis, bile-acid dysmetabolism and gut inflammation in inflammatory bowel diseases. Gut. 2013;62:531–9.
Gadaleta RM et al. Farnesoid X receptor activation inhibits inflammation and preserves the intestinal barrier in inflammatory bowel disease. Gut. 2011;60:463–72.
Gill SR et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Hill MJ. Bile flow and colon cancer. Mutat Res. 1990;238:313–20.
Hofmann AF, Eckmann L. How bile acids confer gut mucosal protection against bacteria. Proc Natl Acad Sci U S A. 2006;103:4333–4.
Houten SM et al. Endocrine functions of bile acids. EMBO J. 2006;25:1419–25.
Inagaki T et al. Regulation of antibacterial defense in the small intestine by the nuclear bile acid receptor. Proc Natl Acad Sci U S A. 2006;103:3920–5.
Jones BV. Bacterial bile acid modification and potential pharmaceutical applications. J Appl Ther Res. 2011;8:94–100.
Jones BV, Marchesi JR. Accessing the mobile metagenome of the human gut microbiota. Mol Biosyst. 2007;3:749–58.
Jones BV et al. Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome. Proc Natl Acad Sci U S A. 2008;105:13580–5.
Kim GB et al. Cloning and characterization of the bile salt hydrolase genes (bsh) from Bifidobacterium bifidum strains. Appl Environ Microbiol. 2004;70:5603–12.
M 414 Metagenomic by RAPD Profiling
Vulic M, Dionisio F, Taddei F, Radman M. Molecular keys to speciation: DNA polymorphism and the control of genetic exchange in enterobacteria. Proc Natl Acad Sci U S A. 1997;94:9763–7.
Woese CR. Interpreting the universal phylogenetic tree. Proc Natl Acad Sci U S A. 2000;97:8392–6.
Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, et al. Genomic and functional adaptation in surface ocean planktonic prokaryotes. Nature. 2010;468:60–6.

selection bias is not involved. High-throughput methods can be employed for direct sequencing of the metagenome. The functional approach is used to explore genes that encode novel enzymes or drugs, but advancements are needed for function-based metagenomics by employing high-throughput screenings.
Introduction
under selective conditions. Recombinant clones that contain the target gene and produce the corresponding gene product in active form show optimum growth. This functional complementation was used to isolate a lysine racemase (Lyr) gene from a soil metagenome; E. coli BCRC 51734 cells were used as the host and D-lysine as the selection agent (Chen et al. 2009). This approach faces certain problems, including inaccurate transcription of target genes and improper assembly of the corresponding enzymes. Screening efficiency can be improved by enrichment of target microbes or by use of sensitive screening substrates (Streit and Schmitz 2004).

Sequence-driven screening methods rely on primers and probes of known conserved sequence, targeting phylogenetic or functional genes. The target clone is identified by PCR-based amplification or hybridization. PCR amplification of 30 genes encoding novel patellamide-like precursor peptides from Prochloron sp. symbionts living in consortia with marine sponges was reported by Schmidt and coworkers (Banik and Brady 2010). Fifteen new variants of the gene encoding the precursor to the microviridin peptide were identified by Ziemert and coauthors using a PCR-based methodology. Homology-based screening is carried out mostly by using degenerate PCR primers, RT-PCR, DNA microarrays, integron, and affinity capture methods of sequence-based screening, as reported in the literature (Xing et al. 2012). A relatively new method for genetic screening is substrate-induced gene expression screening (SIGEX), in which metabolism-related genes are selectively expressed in the presence of certain substrates. Chromatography-based screening techniques, known as compound configuration screening, have also been reported: clones are screened for their capability to produce new structural compounds that show different chromatographic peaks relative to the host cells. Microarray-based GeoChip technology has been developed to assess the genetic and functional diversity of microbial communities. The reactome array is a new, sensitive metabolite array that offers functional analysis of metabolic phenotypes (Streit and Schmitz 2004). A gene of interest can also be identified by random sequencing. Phylogeny can be linked with the functional gene by performing phylogenetic analysis of the flanking DNA.

Metagenomic Sequencing

The field of sequencing has changed gradually. Classical Sanger sequencing technology is being superseded by next-generation sequencing (NGS). The Sanger method is preferred for its low error rate, long read length (>700 bp), and large insert sizes, but it has the drawback of being labor-intensive. Array-based sequencing and in vitro amplification of target DNA fragments constitute second-generation DNA sequencing. Such technology is implemented in the 454 Genome Sequencer, Illumina Genome Analyzer, and SOLiD platforms (Xing et al. 2012). These next-generation approaches have the capacity for massively parallel sequencing of samples. Pyrosequencing allows sequencing of 100–200 bp of single-stranded DNA, employs luciferase-based real-time monitoring of pyrophosphate release (Guazzaroni et al. 2009), and has accuracy rates comparable to Sanger sequencing.

Metagenomics employs two approaches. The first is a system-based approach, in which the complete DNA sample is processed and analyzed; MG-RAST (Metagenomic Rapid Annotation Using Subsystem Technology), for example, characterizes HTS pyrosequencing runs (Larsen et al. 2012). The second is a species identification-based approach, which carries the risk of missing certain taxa during PCR-based amplification of specific regions. One efficient method of high-throughput analysis (HTS) of genes is based on microarrays, with which differential gene expression and environmental bacterial diversity can be monitored (Cowan et al. 2005). Second-generation sequencing technologies help in obtaining more information from complex microbial communities (Logares et al. 2012). Open reading frames and operons can be identified by analysis of longer contiguous sequences. Colony hybridization and pyrosequencing when combined with
Metagenomic Research: Methods and Ecological Applications 425 M
metagenomic approach helped in gaining information about the genetic organization and diversity of a specific operon.

Adding sample-specific oligonucleotide barcodes to PCR primers makes it possible to sequence a number of samples simultaneously at a relatively reduced cost, a strategy known as barcoding or multiplexing (Willner and Hugenholtz 2013). Third-generation sequencing is evolving fast. The first such technology to become available was the PacBio RS from Pacific Biosciences, in which an immobilized polymerase performs sequencing and four differently colored nucleotides are detected in real time (Logares et al. 2012). Another innovative sequencing platform, known as Ion Torrent, is based on the principle that DNA polymerization releases protons, which can be used to detect nucleotide incorporation. Read lengths >100 bp can be obtained with this technology. DNA nanoballs can be sequenced with a technology offered by Complete Genomics (Thomas et al. 2012).

Assembly, Binning, and Annotation

Recovering and characterizing the genome of cultured organisms requires assembly of short-read fragments into longer genome contigs. A reference-based assembly method is applied if closely related reference genomes are available. Large computational resources are required for de novo assembly (Thomas et al. 2012). A process based on sequence comparison of unknown DNA with reference databases, known as binning, helps to sort DNA sequences into groups representing genomes from closely related organisms. Metagenome sequence data is generally annotated by feature prediction and functional annotation. Feature prediction labels the sequences as genes, and functional annotation assigns taxonomic neighbors and putative gene function.

Data Handling and Statistical Analysis

Statistical approaches help metagenomics to link functional and phylogenetic information to the biological, physical, and chemical parameters that fully characterize a microbial community. Multivariate statistical analysis is provided by various tools such as the Primer-E package. This package helps in the generation of multidimensional scaling (MDS) plots, analysis of similarities (ANOSIM), and identification of the species that contribute most to differences between groups (SIMPER) (Thomas et al. 2012). A wide variety of bioinformatic tools and databases are available for metagenomic studies (Table 1).

Ecological Inferences

Community Studies

The ecological role of microorganisms can be highlighted by conducting a genome-wide analysis. The ecosystem is highly dynamic in structure, and by employing shotgun metagenomics, direct sequencing of community DNA can be achieved. Metagenomics generates environmental microbial community data that help in the investigation of the microbial environmental interactome (MEI) (Larsen et al. 2012). PCR-based methods such as amplified ribosomal DNA restriction analysis (ARDRA), denaturing gradient gel electrophoresis (DGGE), and terminal restriction fragment length polymorphism (T-RFLP) have been used for the characterization of community microorganisms. These techniques were applied to study the bacterial response in a pesticide-contaminated soil (Imfeld and Vuilleumier 2012). Subsurface oil reservoirs with high pressure, salt, heavy metals, and organic solvent concentration have been analyzed by metagenomics. In another study, permafrost samples from the Canadian High Arctic and Alaska were investigated in order to understand their potential linkage to global warming (Lewin et al. 2012). A microbial niche study was conducted on flowing acid mine drainage to determine the industrial community structure of a natural acidophilic biofilm growing on it (Streit and Schmitz 2004). The ECOMIC-RMQS project is a French initiative to characterize soil microbial communities. Innovative studies and methodologies can determine organism's possible habitat in
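The multivariate methods named under Data Handling and Statistical Analysis (MDS plots, ANOSIM) start from a matrix of pairwise dissimilarities between samples. The Python sketch below shows one way such a matrix might be computed. The text does not say which dissimilarity measure is used, so Bray-Curtis is an assumption here, chosen only because it is common for abundance data, and the sample names and counts are made up.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    0.0 means identical composition; 1.0 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def dissimilarity_matrix(profiles):
    """Return (sample names, square matrix) for a {name: counts} mapping."""
    names = sorted(profiles)
    matrix = [[bray_curtis(profiles[i], profiles[j]) for j in names]
              for i in names]
    return names, matrix


# Hypothetical taxon counts for three samples (same taxon order in each).
demo_profiles = {
    "soil_A": [10, 0, 5],
    "soil_B": [5, 5, 5],
    "soil_C": [0, 10, 0],
}
names, matrix = dissimilarity_matrix(demo_profiles)
for name, row in zip(names, matrix):
    print(name, [round(v, 3) for v in row])
```

A matrix of this kind is the usual input to an ordination such as MDS; the ordination itself is left to a dedicated statistics package.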
Metagenomic Research: Methods and Ecological Applications, Table 1 Bioinformatic tools and databases
commonly used in metagenomic studies
| Name | Description | Website |
|---|---|---|
| ARB | Tools for sequence database handling and data analysis | www.arb-home.de |
| CAMERA | Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis | http://camera.calit2.net |
| CARMA | Characterizing short-read metagenomes | www.cebitec.uni-bielefeld.de/brf/carma/ |
| COG | Clusters of Orthologous Groups | http://www.ncbi.nlm.nih.gov/COG/ |
| DDBJ | DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp/ |
| DOTUR | Defining Operational Taxonomic Units and Estimating Species Richness | http://www.plantpath.wisc.edu/fac/joh/dotur.html |
| EMBL | European Molecular Biology Laboratory | www.embl.de/services/bioinformatics/index.php |
| GAAS | Genome relative Abundance and Average Size | http://sourceforge.net/projects/gaas/ |
| GenBank | Genetic sequence database | www.ncbi.nlm.nih.gov/Genbank/metagenome.html |
| GOLD | Genomes Online Database | www.genomesonline.org |
| GSC | Genomic Standards Consortium | www.gensc.org |
| INSDC | International Nucleotide Sequence Database Collaboration | http://www.insdc.org/ |
| IMG/M | Integrated Microbial Genomes | http://img.jgi.doe.gov/ |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | http://www.genome.jp/kegg/ |
| LefSe | LDA Effect Size | http://huttenhower.sph.harvard.edu/galaxy/root?tool_id=lefse_upload |
| MEGAN | MEtaGenome ANalyzer | www-ab.informatik.uni-tuebingen.de/software/megan |
| Megx.net | Marine Ecological GenomiX | www.megx.net |
| MetaPhlAn | Metagenomic Phylogenetic Analysis | http://huttenhower.sph.harvard.edu/galaxy/root?tool_id=lefse_upload |
| GraPhlAn | Graphical Phylogenetic Analysis | http://huttenhower.sph.harvard.edu/galaxy/root?tool_id=lefse_upload |
| METAREP | JCVI Metagenomics Reports | http://jcvi.org/metarep/ |
| PyNAST | Python Nearest Alignment Space Termination | www.qiime.org/pynast/ |
| Naive Bayes classifier | Probabilistic classifier | http://www.statsoft.com/textbook/naive-bayes-classifier/ |
| MG-RAST | Metagenomic RAST | http://metagenomics.nmpdr.org |
| PHACCS | Phage Communities from Contig Spectra | http://sourceforge.net/projects/phaccs/ |
| RefSeq | Reference Sequence | http://www.ncbi.nlm.nih.gov/refseq/ |
| ShotgunFunctionalizeR | R-package for functional comparison of metagenomes | http://shotgun.zool.gu.se |
| SILVA | Comprehensive online ribosomal RNA sequence database | www.arb-silva.de |
| SINA | Bioinformatic tools for sequence alignment | www.arb-silva.de |
| SmashCommunity | Stand-alone metagenomic annotation and analysis pipeline | http://www.bork.embl.de/software/smash/ |
| Sort-ITEMS | Sequence orthology-based approach for improved taxonomic estimation of metagenomic sequences | http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS/ |
| STAMP | Statistical Analysis of Metagenomic Profiles | http://kiwi.cs.dal.ca/Software/STAMP |
| TACOA | Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach | http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html |
| TETRA | Fragment assignment by intrinsic tetranucleotide frequencies | www.megx.net/tetra |
| Treephyler | Fast taxonomic profiling of metagenomes | http://www.gobics.de/fabian/treephyler.php |
| Fast UniFrac | Comparison of microbial communities | http://bmf2.colorado.edu/fastunifrac |
| XplorSeq | Mac OSX software for sequence analysis | www.phyloware.com/Phyloware/XplorSeq.html |
| Xipe | Statistical comparison program | http://edwards.sdsu.edu/cgi-bin/xipe.cgi |
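Table 1 describes TETRA as assigning fragments by their intrinsic tetranucleotide frequencies. The Python sketch below illustrates that general idea of composition-based binning in its simplest form. It is not TETRA's actual algorithm: the cosine similarity used here is a stand-in chosen for brevity, reverse complements are ignored, and the reference names and sequences are made up.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Return the relative frequency of each tetranucleotide in seq.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_bin(fragment, references):
    """Return the name of the reference whose composition is most similar."""
    frag_vec = tetra_freq(fragment)
    return max(references,
               key=lambda name: cosine(frag_vec, tetra_freq(references[name])))


# Made-up reference sequences with very different base composition.
demo_refs = {
    "AT_rich": "ATATATTTAAATATTTAATA" * 5,
    "GC_rich": "GCGCGGCCGCGGGCCGCGCC" * 5,
}
demo_fragment = "ATTTATATAAATTTATAT"
print(assign_bin(demo_fragment, demo_refs))
```

Composition-based assignment needs no database hit for the fragment itself, which is why it complements the comparison-based binning described in the text.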
that occur naturally in the environment, a large number of species are able to compete to use the pollutant as a source of carbon, nutrients, or energy. At the other extreme, when the introduced pollutant is complex or synthetic in origin, there may be no local strains that are immediately capable of metabolizing it or reducing its toxicity.

A number of bioremediating microorganisms have been isolated from contaminated sites, but it is now generally understood that the information obtained from these isolates is insufficient to understand the workings of complex microbial communities. More complete genetic information from natural environments is required to understand how contamination affects microbial communities on the whole, and whether there is the potential for further optimization of bioremediation. The large-scale, culture-independent studies that are required to meet this end are now possible with the advent of new high-throughput sequencing technologies.

Aspirations for Metagenomics in Bioremediation

Understanding the differences between a contaminated environment and its uncontaminated equivalent is a major topic of study in bioremediation research, as it can help in determining how much of the natural function of the system has been altered by contamination. Metagenomic data can provide information about taxonomic and enzymatic diversity both pre- and post-contamination, which will allow the mining of potentially active genes and organisms. Accumulating metagenomes from a variety of contaminated and uncontaminated equivalent environments will make it possible to link changes in contaminant composition and concentration to specific genes and taxa. In addition, such studies will answer questions about the microbial ecology of the contaminated system, specifically how microorganisms respond to the disturbance created by the contaminant. Adjustments of nutrients, carbon sources, pH, temperature, oxygen, and water content are frequently parts of treatment scenarios applied to contaminated sites, so metagenomic studies of bioremediation will also provide information on how microbial communities respond to changes in a variety of environmental factors. To date, only a handful of such studies have been conducted (Table 1).

Types of Metagenomic Studies Used in Bioremediation

Strictly speaking, metagenomics involves the entirety of genetic information contained within a sample. More efficient sequencing now makes it possible to produce this data, but the effort required to thoroughly analyze such huge datasets is a limiting step in metagenomic studies. Even when full metagenomes are sequenced, analysis of the data will often focus on specific genes of interest. There is also a trade-off between the number of samples analyzed and the depth of sequencing possible. While it is tempting to completely sequence and annotate single samples, it is difficult to know how representative this sample is of an entire environment or in the case of composite samples, the variability that exists within the environment.

As a compromise, many studies of contaminated sites have used what has been referred to as gene-targeted metagenomics (Iwai et al. 2010), in which specific gene regions are amplified and then sequenced using high-throughput technologies. This has been used in bioremediation studies to look at specific functional genes (Bell et al. 2011; Iwai et al. 2010) as well as 16S rRNA gene diversity (e.g., Bell et al. 2011; Gihring et al. 2011). The limitations of gene-targeted metagenomics are that (1) genetic information that is not immediately of interest cannot be explored in the future, (2) novel genes that cannot be amplified by the selected primers are excluded from the analysis, and (3) information about the relative occurrence of the targeted genes within the sample will be lost.

Several recent reports have incorporated some type of metagenomics into the study of the
Metagenomics Potential for Bioremediation 431 M
Metagenomics Potential for Bioremediation, Table 1 Studies that have used metagenomics to study microbial
populations in contaminated substrates
Substrate | Contaminant | Treatment | Gene groups examined | Key finding | Sequencing type | References
Whole genome sequencing
Groundwater | Heavy metals, nitrate, organic solvents | None | 16S rRNA, metabolism, stress response | Significant loss of species and metabolic diversity following more than 50 years of contamination | PRISM 3730 capillary DNA sequencer | Hemme et al. 2010
Soil | Diesel | Monoammonium phosphate and aeration | 16S rRNA, alkyl group hydroxylases, extradiol dioxygenase, intradiol dioxygenase, gentisate/homogentisate dioxygenase | Shift from Gammaproteobacteria to Alphaproteobacteria and Actinobacteria after 1 year of remediation | Roche/454 GS FLX Titanium | Yergeau et al. 2012
Gene-targeted sequencing
Soil | JP-8 jet fuel | Monoammonium phosphate | 16S rRNA, alkB | Alphaproteobacteria in contaminated soils were more effective at incorporating added nitrogen than were other bacterial taxa | Roche/454 GS FLX Titanium | Bell et al. 2011
Rhizosphere soil | PCB | None | Toluene/biphenyl dioxygenases | Unexpected gene diversity, including 25 novel clusters | Roche/454 FLX | Iwai et al. 2010
Subsurface sediment | Uranium (VI) | Ethanol injection | 16S rRNA | Identified indicator taxa specific to various hydrochemical conditions and those that responded to treatment | Roche/454 FLX | Cardenas et al. 2010
Mangrove sediment | MF380 heavy fuel oil | None | 16S rRNA | Wide diversity in both contaminated and uncontaminated sediment, with indicator taxa detected for each | Roche/454 FLX | dos Santos et al. 2011
Groundwater | Uranium, sulfate, nitrate | Emulsified vegetable oil | 16S rRNA | Very narrow group of microorganisms that were stimulated by the treatment and/or involved in remediation | Roche/454 FLX | Gihring et al. 2011
Liquid media | Synthetic aromatic alkanoic acids | Added individual alkanoic acids | 16S rRNA | Microbial community was unique to the contaminant added, which varied in alkyl side branching | Roche/454 | Johnson et al. 2011
(continued)
microorganisms living in contaminated environments. Since the labor required to process data is beginning to outweigh the cost of sequencing as the limiting step in metagenomic analyses, a variety of screening methods have been used in bioremediation studies to optimize the output of information (Fig. 1). The various approaches to metagenomics that have been taken in bioremediation research are outlined below.
Multiplexed 16S rRNA Gene Sequencing
Because of its potential to quickly assign taxonomy to large numbers of microorganisms, 16S rRNA gene sequencing has gone through several waves of popularity in microbial ecology. Comparisons of the 16S rRNA gene profiles of environmental samples have taken off again with the advent of high-throughput sequencing (Tringe and Hugenholtz 2008) and are currently more popular than any other type of metagenomic study. One reason is that a large number of 16S rRNA gene entries exist in NCBI and EMBL, as do curated 16S rRNA gene databases such as the Ribosomal Database Project (http://rdp.cme.msu.edu/) and Green Genes (http://greengenes.lbl.gov/). As a result, profiles of community diversity can be conducted with only a cursory understanding of bioinformatics. While early techniques such as T-RFLP and DGGE gave some indication of the variation between samples, they only described small portions of microbial communities. Even clone library studies rarely sampled more than a few hundred clones, whereas multiplexed next-generation sequencing easily provides several thousand sequences per sample.
Since studies into bioremediation generally aim to identify effective pathways for converting or tolerating contaminants, how relevant is taxonomy? There is still debate surrounding how much functional redundancy exists between microbial species and how prevalent horizontal gene transfer (HGT) is within microbial communities, yet a recent metagenomic study shows that distinct bacterial species likely do exist (Caro-Quintero and Konstantinidis 2012). A number of 16S rRNA gene surveys have been conducted in contaminated environments and have been used to assess how microbial communities vary in relation to uncontaminated reference environments or how a community changes in a contaminated environment over time. In several of these studies, 16S rRNA gene-targeted metagenomics has identified indicator species that are specific to certain contaminants and environmental conditions (Cardenas et al. 2010; dos Santos et al. 2011). Similar multiplexed studies may be used to identify indicator species across multiple environments at similar stages of
Metagenomics Potential for Bioremediation, Fig. 1 Methods for integrating metagenomics into bioremediation
studies
contamination, and these indicator species could theoretically be used to assess the state of other contaminated sites.
The major advantage of the high-throughput sequencing approach when compared with earlier 16S rRNA gene profiling techniques is the depth of coverage. In mangrove sediment contaminated with heavy fuel, little change was seen at the phylum level following contamination, while large shifts were observed at finer taxonomic levels (dos Santos et al. 2011), an effect that may not have been visible using coarser profiling methods. Similarly, 16S rRNA gene pyrosequencing showed that a very narrow group of taxa were stimulated by emulsified oil injection in a uranium-contaminated aquifer (Gihring et al. 2011). With less sequencing coverage, it would be impossible to determine whether these were the only taxa stimulated or simply the most dominant members of the community.
Multiplexed Functional Gene Sequencing
In many bioremediation studies, specific catabolic, reducing, or oxidizing genes are the subjects of interest. In such cases, it may be desirable to simply amplify and sequence these targeted genes. As with 16S rRNA gene sequencing, many samples can be processed by multiplex sequencing for a limited cost. Degenerate primers have been used to amplify alkane monooxygenase genes from hydrocarbon-contaminated Arctic soil, and sequencing showed that those related to Alphaproteobacteria responded most positively to amendment with monoammonium phosphate (Bell et al. 2011). Amplicons were also obtained from a PCB-contaminated soil using degenerate primers targeting toluene/biphenyl dioxygenase genes, and sequencing identified a variety of novel dioxygenase gene clusters (Iwai et al. 2010). In terms of gene discovery, the major drawback of this approach is that gene identification depends on novel genes having significant homology at the primer-targeted regions. Even when the targeted genes are known, the chosen primers will bias the relative gene abundance within each sample. Amplicon size must also be considered, since many sequencing technologies have a maximum read length, although with time, this is becoming less of a concern.
Functional Screening
Since bioremediation is generally focused on which microbial communities most effectively degrade pollutants, it can potentially be straightforward to functionally screen for samples of interest. A study of contaminated Arctic soils compared the hydrocarbon-degrading efficiency of various soils in response to different in situ and ex situ treatments, with degradation occurring significantly more effectively in one location. Subsequently, a metagenomic analysis was conducted throughout a year-long time course on the soil that most rapidly degraded the contaminating hydrocarbons, along with an uncontaminated reference soil (Yergeau et al. 2012). Metagenomic studies that are conducted in vitro also involve an aspect of selection, as only microorganisms that are capable of growing in mixed culture prevail. Mixed culture studies are common, as they often evaluate the potential for bioremediation in treatment facilities. Metagenomics is starting to be applied to such studies, as in one case in which it was determined that the amount of branching in synthetic aromatic alkanoic acids led to vastly different microbial communities (Johnson et al. 2011).
Prescreening of DNA can also be conducted on large genomic fragments that are contained within plasmids, such as fosmids or cosmids. By transforming these vectors into hosts such as E. coli, the DNA fragment can be screened for the ability to mineralize or tolerate a specific contaminant. This strategy permits the identification of genes that are involved in the catabolism of particular pollutants, or that permit host survival, provided the essential pathway can be contained in a single DNA fragment and can be expressed in the host. Sequencing is also more targeted using this approach, as the sequencing of housekeeping and rRNA genes is limited.
To search for genes capable of degrading catechol, metagenomic DNA from a hydrocarbon-contaminated soil was fragmented, cloned into fosmid vectors, transformed into E. coli, and plated with catechol as a carbon substrate. A high
diversity of extradiol dioxygenase genes was observed, as well as a surprisingly high density of one extradiol dioxygenase per 3.6 Mb of DNA screened (Brennerova et al. 2009). A similar approach identified novel extradiol dioxygenase genes, as well as previously unknown arrangements of catechol-degrading pathways (Suenaga et al. 2009). The drawbacks of this approach are that the entire genetic pathway must be contained within a single plasmid; that the host may be unable to survive in the presence of any toxic gene products, meaning that not all relevant genes will necessarily be identified; and that some genes may not be expressed if the chosen host is not closely related to the organism from which the DNA fragment originated.
Full Metagenome Analysis
Full metagenomic sequencing, when possible, provides the greatest amount of information. With this approach, any number of post hoc analyses can be conducted on a dataset. While much of the genetic information obtained from a given environment may lack appropriate comparators in existing gene banks, collecting full metagenomic information will allow future researchers the opportunity to analyze the dataset. At the moment, a number of database projects are ongoing in an attempt to collect and annotate metagenomic data, including some from contaminated sites (e.g., http://www.hydrocarbonmetagenomics.com/).
To date, only a handful of complete metagenomic studies have been conducted in contaminated environments. While 16S rRNA gene studies are useful in determining the relative microbial diversity of environments, the metabolic potential of a microbial community may not be strictly linked to its taxonomic profile. Thus, full metagenomic studies can be used to assess how diversity relates to functional potential. A metagenomic study of a diesel-contaminated Arctic soil showed that a shift in 16S rRNA gene sequences from Gammaproteobacteria to Alphaproteobacteria and Actinobacteria mostly correlated with a shift in hydroxylases and dioxygenases that were affiliated with those same organisms (Yergeau et al. 2012), demonstrating that, in this case, there was significance to taxonomic affiliation. Similarly, most of the functional genes (stress response, metal resistance, etc.) identified in the metagenome of a heavy metal-contaminated groundwater community were traced to Gammaproteobacteria, the group that also dominated the 16S rRNA gene profile (Hemme et al. 2010).
Full metagenomes can also provide information on the relative abundance of genes of interest. PCR-based approaches introduce a primer bias prior to sequencing, whereas strict metagenomic analysis permits a more direct quantitative comparison. Within the contaminated groundwater metagenome, stress-response genes, such as those involved in DNA repair and heavy metal resistance, were more abundantly represented than would be expected in an uncontaminated community (Hemme et al. 2010). Most hydrocarbon-degrading genes were high in abundance in the contaminated Arctic soil metagenomes when compared with the uncontaminated reference soil, but extradiol aromatic ring-cleavage dioxygenase sequences decreased after a year of treatment, while other dioxygenases increased in abundance, and alkane hydroxylases remained constant throughout treatment (Yergeau et al. 2012). Caution should be exercised when using preparatory techniques such as whole genome amplification, since the quantitation of genes can be affected (Yergeau et al. 2010). Although the amount of DNA required for metagenomic sequencing is decreasing, whole genome amplification may still be necessary in very low biomass systems, as can be found in some highly contaminated environments.
Information Lacking from Bioremediation Literature
Genes Involved in Bioremediation
Key pathways involved in the bioremediation of major contaminants are known, but many novel enzymes and pathways are still being discovered. The lack of sequence conservation in some key gene families has made it difficult to determine their true diversity using PCR-based methods.
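The gene densities and relative abundances discussed in this entry, such as one extradiol dioxygenase per 3.6 Mb of screened DNA, come down to normalizing hit counts by the amount of DNA examined. The Python sketch below illustrates that arithmetic only. The function names, gene labels, and all counts are hypothetical placeholders chosen for illustration; they are not data or methods from the cited studies.

```python
# Minimal sketch: normalizing functional-gene hit counts from a metagenome.
# All names and counts are hypothetical placeholders, not data from the
# cited studies.

def genes_per_mb(hits: int, screened_bp: int) -> float:
    """Return gene density as hits per megabase (1 Mb = 1,000,000 bp)."""
    if screened_bp <= 0:
        raise ValueError("screened_bp must be positive")
    return hits / (screened_bp / 1_000_000)


def relative_abundance(counts: dict) -> dict:
    """Convert raw hit counts into fractions of all counted hits."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no hits to normalize")
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical example: 10 dioxygenase hits in 36 Mb of screened DNA.
density = genes_per_mb(10, 36_000_000)
mb_per_gene = 1 / density  # megabases of DNA screened per gene found

# Hypothetical hit counts for two gene groups in a single sample.
fractions = relative_abundance(
    {"alkane_hydroxylase": 30, "extradiol_dioxygenase": 10}
)

print(f"{density:.3f} genes per Mb; one gene per {mb_per_gene:.1f} Mb")
print(fractions)
```

Comparing such fractions between a contaminated sample and a reference sample is only meaningful when neither library has been distorted by primer bias or by preparatory steps such as whole genome amplification.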
In the case of genes that code for enzymes that are involved in normal forms of metabolism or other housekeeping functions within the cell, this diversity may be extensive. Metagenomic studies across contaminated environments will help correlate gene groups with contaminants, and this may identify roles for pathways that had previously been considered unimportant in the conversion or tolerance of contaminants.
Microbial species that are not directly involved in bioremediation can also represent a sizeable proportion of a contaminated community. Soils contaminated with hydrocarbons have still provided homes for populations of nitrifying bacteria (Deni and Penninckx 1999) and cyanobacteria (Yergeau et al. 2012), while the stimulation of the microbial reductive dechlorination of PCE and TCE by adding organic products tends to promote many microorganisms that are not involved in remediation (Strycharz et al. 2008). In addition, microorganisms that function in various nutrient cycles (e.g., nitrogen fixers) may be important to the functioning of the overall community. To date, it is not really known how much these other species affect functioning in contaminated environments or how bioremediation is affected if some processes are disrupted.
Extent of Horizontal Gene Transfer
It can be difficult to determine the taxonomic affiliation of plasmid-borne DNA, and certain key genes involved in bioremediation, such as naphthalene dioxygenases and alkane monooxygenases (Whyte et al. 1997), have been found on plasmids. Mobile genetic elements are known to be common in at least some natural environments, but it is not known how significant a role HGT plays in the adaptation of microbial communities to contamination.
In metagenomic studies, genes can be compared with the background DNA of the community metagenome, which can help in identifying the prevalence of HGT. Bioinformatic analysis of a metagenome under long-term contamination showed that roughly 12 transposons were present per Mb of DNA, which was similar to reference strains of Xanthomonas, the dominant community member. In addition, large differences in % G+C and codon bias between putatively transposed genes suggested a very recent origin for acetone carboxylases, mercuric resistance operons, and czcD divalent cation transporters (Hemme et al. 2010). The persistence of HGT after 50 years of continued contaminant stress suggests that it may be very important to the survival of microorganisms in a contaminated environment.
Horizontal gene transfer was also suspected when a mismatch between the number of cytochrome P450 genes affiliated with Rhodococcus and the relative abundance of Actinobacteria was observed in the metagenome of diesel-contaminated Arctic soils (Yergeau et al. 2012). A number of the genes detected in this study can be plasmid-borne, so this may be a common response. Future metagenomic analyses pre- and post-contamination may show how quickly this process can shape the genetic structure of microbial communities. If HGT is determined to be a major force shaping newly contaminated environments, the metagenomic screening of mobile elements alone may be another method of eliminating large amounts of housekeeping and redundant genetic information.
Quantitation
As mentioned, metagenomes that have not been modified by processes such as whole genome amplification may permit actual quantification of gene abundances. Whereas techniques such as qPCR and PCR-based diversity studies are subject to amplification biases, the metagenome represents all of the genetic information that could be extracted from a sample. Most previous attempts to quantify microbial allocation of gene resources to important processes in contaminated sites have relied heavily on PCR methods.
Some early metagenomic studies have already shown the potential of quantitation. The relative genomic allocation to the degradation of various components of jet fuel, a complex contaminant, was observed in a contaminated soil community. It was also observed that known hydrocarbon-degrading genes represented a disproportionate amount of the total metagenome
(Brennerova et al. 2009). An overabundance of genes conferring resistance to heavy metals, nitrate, and organic solvents was observed in a heavy metal-contaminated aquifer (Hemme et al. 2010). Semiquantitative approaches have also been used to determine relative shifts in species abundance and nitrogen incorporation in contaminated environments (Bell et al. 2011; Cardenas et al. 2010), and future studies using full metagenome analysis would permit actual quantification.
The Future of Metagenomics in Bioremediation
Technologies that facilitate metagenomic research are advancing quickly, and many studies that had previously been outside the realm of consideration are becoming possible. Companies such as PacBio and Nanopore are producing sequencers that will allow Kb reads of DNA, which will make it possible to assemble continuous genomes in mixed communities. Even with current technologies this is becoming feasible, as the entire draft genome of a novel permafrost methanogen was assembled by end-to-end linking of 113 bp paired-end reads that were produced in a metagenomic study using Illumina GAII technology (Mackelprang et al. 2011).
The combination of various high-throughput techniques will enable comprehensive studies of microbial communities and shed light on the links between species diversity, gene density, gene expression, protein production, and chemical transformation in contaminated environments. Stable isotope probing (SIP) is a technique that involves adding heavy isotope-labeled compounds to a substrate and allowing microorganisms to consume it and incorporate the labeled atoms into cellular components such as DNA, RNA, and phospholipids. In the case of DNA-SIP, all DNA from a treated sample is extracted and then centrifuged in CsCl gradients to separate the “heavy” (labeled) from the “light” (unlabeled) DNA. This technique has great potential in terms of identifying functionally active microbes, specifically those involved in contaminant breakdown, and a recent review describes the potential power of combining SIP with metagenomics (Chen and Murrell 2010). SIP-metagenomic analyses of contaminated substrates allow the genes and species that actively respond to pollutants to be separated from the huge amount of background genetic information that may remain from the initial, uncontaminated soil. The link between taxonomic affiliation and community function is already being explored through the combination of SIP and high-throughput sequencing (Bell et al. 2011), while advances in RNA-SIP will provide a comprehensive picture of how the addition of substrates, whether contaminants or amendments, directly affects transcription. At the moment, the CsCl gradients that are required to separate labeled and unlabeled nucleic acids are extremely cumbersome and limit the number of samples that can be processed within a given study.
However, a novel proteomic-SIP technique, using 2-dimensional liquid chromatography-tandem mass spectrometry (2D-LC-MS/MS), was able to examine the isotopic ratios of roughly 100,000 spectra while simultaneously searching a database of 31,966 protein sequences in under 24 h (Pan et al. 2011). The computing power required to conduct the analysis was enormous, but as with all high-throughput processing, this can be expected to change rapidly with time. The potential for applying the proteomic-SIP technique in bioremediation studies is enormous, as even small numbers of proteins produced by rare microorganisms can be tracked (Pan et al. 2011). This will be especially useful in examining bioremediation pathways that involve syntrophic interactions, or those involved in the processing of slowly degraded contaminants, in which nutrient flux and subsequent protein production are bound to be low.
In contaminated environments, metagenomics has been used to compare polluted substrates with uncontaminated reference substrates (e.g., Yergeau et al. 2012) and has also been used to directly measure species composition within the same matrix before and after contamination (dos Santos et al. 2011). These types of comparative studies are geared at understanding what genetic
information distinguishes a contaminated environment from similar pristine systems. One of the next major efforts in metagenomics is likely to be the identification of a core microbiome (Shade and Handelsman 2012). In other words, what genes and species are common across an environment and across multiple environments. With a more comprehensive idea of what core microbiomes exist, environments may be aligned by their conserved regions, much as sequences are now, and the true variability between environments can then be assessed. In the context of bioremediation, it will be important to understand whether there are critical genes and organisms that must respond positively to the introduction of a contaminant in order to achieve successful remediation. Genes promoted outside of this common core must then be the result of other environmental or stochastic processes.
Many current genomic studies focus on snapshots of genetic information in environmental samples, but the high growth rate of microorganisms means that many microbial communities are undergoing constant and rapid evolution. This suggests that longer-term metagenomic studies should be a focal point of future research. The metagenomic study by Hemme et al. (2010) of metal-contaminated groundwater showed that 50 years of pollutant stress had reduced species and metabolic diversity to a minimal level of complexity. While all necessary metabolic pathways were found, more than ten times fewer OTUs, with a similar loss in metabolic complexity, were present than were observed at an adjacent background site. Monitoring how evolution selects genes in contaminated environments over the long term will undoubtedly assist in the understanding and treatment of chronically contaminated sites, although the interpretation of large amounts of data will first require a solution to the human-processing bottleneck.
Summary
A variety of metagenomic approaches are available to bioremediation researchers. The choice of technique will depend heavily on the question that is being asked, as well as the resources that are available. While full metagenomic studies provide the greatest amount of data per sample, surveying for indicator species or gene diversity across a wide range of samples may be more appropriate in many cases. These methods may change quickly as technology continues to improve, but ultimately, the best approaches will be those that answer questions about how to most efficiently improve the bioremediation of contaminated sites.
References
Bell TH, Yergeau E, Martineau C, et al. Identification of nitrogen-incorporating bacteria in petroleum-contaminated Arctic soils by using [(15)N]DNA-based stable isotope probing and pyrosequencing. Appl Environ Microb. 2011;77:4163–71.
Brennerova MV, Josefiova J, Brenner V, et al. Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ Microbiol. 2009;11:2216–27.
Cardenas E, Wu WM, Leigh MB, et al. Significant association between sulfate-reducing bacteria and uranium-reducing microbial communities as revealed by a combined massively parallel sequencing-indicator species approach. Appl Environ Microb. 2010;76:6778–86.
Caro-Quintero A, Konstantinidis KT. Bacterial species may exist, metagenomics reveal. Environ Microbiol. 2012;14:347–55.
Chen Y, Murrell JC. When metagenomics meets stable-isotope probing: progress and perspectives. Trends Microbiol. 2010;18:157–63.
Deni J, Penninckx MJ. Nitrification and autotrophic nitrifying bacteria in a hydrocarbon-polluted soil. Appl Environ Microb. 1999;65:4008–13.
dos Santos HF, Cury JC, do Carmo FL, et al. Mangrove bacterial diversity and the impact of oil contamination revealed by pyrosequencing: bacterial proxies for oil pollution. PLoS One. 2011;6:e16943.
Gihring TM, Zhang GX, Brandt CC, et al. A limited microbial consortium is responsible for extended bioreduction of uranium in a contaminated aquifer. Appl Environ Microb. 2011;77:5955–65.
Hemme CL, Deng Y, Gentry TJ, et al. Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J. 2010;4:660–72.
Iwai S, Chai BL, Sul WJ, et al. Gene-targeted-metagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment. ISME J. 2010;4:279–85.
Johnson RJ, Smith BE, Sutton PA, et al. Microbial biodegradation of aromatic alkanoic naphthenic acids is affected by the degree of alkyl side chain branching. ISME J. 2011;5:486–96.
Mackelprang R, Waldrop MP, DeAngelis KM, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480:368–71.
Pan CL, Fischer CR, Hyatt D, et al. Quantitative tracking of isotope flows in proteomes of microbial communities. Mol Cell Proteomics. 2011;10:M110.006049.
Shade A, Handelsman J. Beyond the Venn diagram: the hunt for a core microbiome. Environ Microbiol. 2012;14:4–12.
Strycharz SM, Woodard TL, Johnson JP, et al. Graphite electrode as a sole electron donor for reductive dechlorination of tetrachlorethene by Geobacter lovleyi. Appl Environ Microb. 2008;74:5943–7.
Suenaga H, Koyama Y, Miyakoshi M, et al. Novel organization of aromatic degradation pathway genes in a microbial community as revealed by metagenomic analysis. ISME J. 2009;3:1335–48.
Tringe SG, Hugenholtz P. A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol. 2008;11:442–6.
Whyte LG, Bourbonnière L, Greer CW. Biodegradation of petroleum hydrocarbons by psychrotrophic Pseudomonas strains possessing both alkane (alk) and naphthalene (nah) catabolic pathways. Appl Environ Microb. 1997;63:3719–23.
Yergeau E, Hogues H, Whyte LG, et al. The functional potential of high Arctic permafrost revealed by metagenomic sequencing, qPCR and microarray analyses. ISME J. 2010;4:1206–14.
Yergeau E, Sanschagrin S, Beaumier D, et al. Metagenomic analysis of the bioremediation of diesel-contaminated Canadian high Arctic soils. PLoS One. 2012;7:e30058.
Metagenomics, Metadata, and Meta-analysis
Definition
The analytical approach of identifying emergent patterns in ecological properties of microbial communities by sequencing community structure and function and defining the physical, chemical, and biological parameters of the ecosystem.
Metagenomics is the study of all genetic material from all organisms in a defined sample (Handelsman et al. 1998). However it is defined, metagenomics is just a term used to describe a selection of tools and techniques that enable us to uncover the DNA from the organisms in an environment (which can comprise any ecosystem, from soil to human intestinal tract). Metadata (also known as contextual data) refers directly to information regarding the original sample, the extraction and handling of the DNA, and the sequencing platform and data processing information (Field et al. 2011; Yilmaz et al. 2011). Without such metadata, metagenomic sequence data would be redundant for anything other than basic gene discovery. Meta-analysis, which is the process of performing comparative investigation of features between datasets, is greatly enhanced by the combination of metagenomic data and metadata (Knight et al. 2012).
Metagenomics
in 2004 of direct sequencing approaches, which provided a different route to market compared to clone-dependent sequencing, has accelerated the implementation and data generation capability of this technique. Existing studies have been well reviewed in terms of the impact on community ecology interpretation and novel biochemical process identification (Gilbert and Dupont 2011).

Metadata

The ensuing data bonanza (Field et al. 2011) has driven the need for more robust and comprehensive standards for recording and sharing information about why, how, and from where the sequencing data were generated. One person’s metadata is another person’s primary data, and so the community outreach to determine the consensus for recording different data types and information has been a mammoth effort. The Genomic Standards Consortium (Field et al. 2011) has risen to be one of the most prominent and successful standards communities. The central tenet of the Genomic Standards Consortium is to promote mechanisms that standardize the description of genomes, metagenomes, and amplicon sequences and the exchange and integration of these data and associated metadata (www.gensc.org). The GSC has created three minimal information checklists, which collectively are known as the Minimal Information about ANY sequence (MIxS) checklists. The three standards are the Minimal Information about a Genomic Sequence (MIGS; Field et al. 2008), the Minimal Information about a Metagenomic Sequence (MIMS), and the Minimal Information about a Marker Gene Sequence (MIMARKS) (Yilmaz et al. 2011). These information checklists and the ancillary environmental data sheets describe the types of information the community would like to see associated with the sequence data, and importantly provide a description for recording these data using a defined standard. This enables a level playing field for the provision and sharing of data between organizations and PIs, and the checklists have been adopted by the International Nucleotide Sequence Database Collaboration (INSDC) and a considerable number of journals. The major proponent from the latter group is the GSC’s own journal, Standards in Genomic Sciences, which requires a detailed but standard description of the associated metadata for genome and metagenome reports (Gilbert et al. 2010a; Nelson et al. 2009).

Meta-analysis

Meta-analysis is defined as the combination of results from different studies that have similar or related research hypotheses. While not strictly a meta-analysis, the use of comparative metagenomics to explore the principles of microbial ecology stems from the common analysis of data generated by different studies in different ecosystems to explore central hypotheses, usually related to the overall distribution of taxonomic and functional attributes in the community. Initial efforts include comparative analysis of four metagenomic samples from soil and whale fall (Tringe et al. 2005), 87 viral and microbial metagenomic datasets from nine biomes (Dinsdale et al. 2008), metagenomic datasets from 86 viral and microbial communities (Willner et al. 2009), and more recently 77 metagenomes (Delmont et al. 2011). These studies have led to the conclusion that different environments have habitat-specific functional and taxonomic fingerprints that indicate environment-specific genomic adaptation. This conclusion comes with the caveat that each comparative study includes only a small number of metagenomes and that each metagenomic dataset comprises only a tiny fraction of the functional information present in any community. The latter point is made obvious by ultra-deep screening of microbial diversity, whereby even in marine coastal surface waters, the species richness can be astounding (>100,000 taxa per L of water; Caporaso et al. 2011).

Importantly, cross-sample comparisons should be performed in concert with dynamic
comparative analysis of the contextual environmental data. These physical, chemical, and biological data that describe the environment in which the microbial organisms under investigation were isolated are vital to interpreting the gradients of function and specific trends in gene persistence seen between samples and studies. Within one study, such as the Global Ocean Sampling (Rusch et al. 2007) or Western English Channel (Gilbert et al. 2010b), the link between environmental metadata and the functional or taxonomic sequence data can be implicit. However, in comparative studies, it is rare to be able to generate canonical correlations between specific functional gene abundances and different contextual metadata, as different studies tend to measure different parameters differently. The Earth Microbiome Project (www.earthmicrobiome.org) is working not just to create comparable data on the basis of standard methodological protocols (e.g., DNA extraction, PCR, sequencing) but also to obtain data with comparable contextual information, e.g., temperature measurements, latitude and longitude, ammonia concentrations, pH, etc. All these metadata are being collated into large-scale databases with the Genomic Standards Consortium’s MIxS checklists as the data framework, and so they represent the community consensus for these records.

Summary

Metagenomics studies now need to be performed using the principles of scientific investigation and excellent statistical experimental design, using replication and adequate controls to determine if the perceived biological variation actually could be used to explore basic ecological principles. The only appropriate way to perform good meta-analysis for metagenomic studies is to utilize excellent metadata, and this comes back to the design of the experiment, long before any molecular analysis has even been suggested. It also must leverage multidisciplinary effort to obtain the right data to answer the relevant questions.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Challenge of Metagenome Assembly and Possible Standards
▶ Computational Approaches for Metagenomic Datasets
▶ Metagenomic Research: Methods and Ecological Applications

References

Caporaso JG, Field D, Paszkiewicz K, Knight R, Gilbert JA. Evidence for a persistent microbial community in the Western English Channel. ISME J. 2012;6:1089–93.
Delmont TO, et al. Metagenomic mining for microbiologists. ISME J. 2011;5(12):1837–43.
Dinsdale EA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–9.
Field D, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 2008;26(5):541–7.
Field D, et al. The genomic standards consortium. PLoS Biol. 2011;9(6):e1001088.
Gewin V. Genomics: discovery in the dirt. Nature. 2006;439(7075):384–6.
Gilbert JA. Beyond the infinite – tracking bacterial gene expression. Microbiol Today. 2010;37(2):82–5.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3:347–71.
Gilbert JA, et al. Metagenomes and metatranscriptomes from the L4 long-term coastal monitoring station in the Western English Channel. Stand Genomic Sci. 2010a;3(2):183–93.
Gilbert JA, et al. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010b;5(11):e15545.
Handelsman J, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5(10):R245–9.
Hugenholtz P, Kyrpides NC. A changing of the guard. Environ Microbiol. 2009;11(3):551–3.
Knight R, et al. Designing better metagenomic surveys: the role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nat Biotechnol. 2012;30(6):513–20.
Nelson OW, Harrison SH, Garrity GM. Meeting report for SIGS1: first conference of the standards in genomic sciences eJournal. Stand Genomic Sci. 2009;1(1):72–6.
Rusch DB, et al. The sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5(3):e77.
Tringe SG, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci USA. 1998;95(12):6578–83.
Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes. Environ Microbiol. 2009;11(7):1752–66.
Yilmaz P, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2011;29(5):415–20.

MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes

Tse-Yi Wang1 and Huai-Kuang Tsai2
1Department of Medical Research, Mackay Memorial Hospital, New Taipei City, Taiwan
2Institute of Information Science, Academia Sinica, Taipei, Taiwan

Definition

MetaRank is a rank conversion scheme for analyzing microbial communities based on the relative order of member (taxonomic unit or functional group) abundances rather than their estimated values (e.g., proportions). It leverages a series of statistical hypothesis tests to compare member abundances within microbial communities and determine their ranks, providing an alternative rank-based method for characterizing metagenomes.

Introduction

Metagenomics is a field that involves sampling, sequencing, and analyzing the genetic material of microorganisms in microbial communities (Hugenholtz and Tyson 2008). A key question in metagenomics is whether and how changes in the microbial abundances of taxonomic units or functional groups relate to alterations of habitats (Hamady and Knight 2009). To characterize the relationship, it is important to compare microbial community compositions in different environments (Wooley et al. 2010).

Many statistical methods (e.g., Metastats (White et al. 2009), ShotgunFunctionalizeR (Kristiansson et al. 2009), STAMP (Parks and Beiko 2010)) have been developed for comparative metagenomics in an attempt to identify differentially abundant features between microbial communities. Most of these methods employ statistical hypothesis tests to determine whether member abundances are equal in distinct communities and focus on the quantitative differences between microbial community compositions. They are highly dependent on the precision of the estimated member abundances.

However, estimated abundances might deviate from the true abundances in habitats due to sampling biases and other systematic artifacts in metagenomic data processing (Ashelford et al. 2005; Brady and Salzberg 2009; Gomez-Alvarez et al. 2009; Mavromatis et al. 2007). Although systematic artifacts can be corrected through improvements in data processing techniques, sampling biases will remain unavoidable unless exhaustive data of the whole populations become available (Wooley and Ye 2010).

To reduce the effects of sampling biases, MetaRank performs a series of rank conversions for analyzing microbial communities based on the ranks of members rather than their estimated abundances. It leverages the fact that the ranks of highly abundant members are less affected by sampling biases because large values and, by extension, their relative order are robust against small deviations. It also utilizes statistical hypothesis testing to compare member abundances within communities and determine the ranks as follows: highly abundant members are assigned high ranks, and any two members whose abundances are not statistically significantly different are assigned the same rank.
Empirical tests on real datasets and synthetic samples (Kurokawa et al. 2007; Ley et al. 2006; Mavromatis et al. 2007) show that MetaRank is able to reduce the effects of sampling biases and helps to clarify the characteristics of metagenomes. The ranks converted by MetaRank have small normalized standard deviations, which clearly reveal the common traits within a set of metagenomes. The ranks also capably identify the discriminating features of microbial community compositions (Wang et al. 2011). In addition, it is noted that MetaRank, as a rank-based approach, shares the disadvantages of all nonparametric methods: there is a loss of information and a loss of the ability to provide parametric statistics for inference. Therefore, MetaRank is a useful rank-based alternative for analyzing metagenomes that complements parametric methods.

Methods

Given a metagenomic sample of a microbial community, MetaRank first employs binomial tests to iteratively select highly abundant members within the community, followed by multinomial tests to rank the selected members in each run.

Binomial Tests for Selecting Highly Abundant Members

For $N$ members in a microbial community, let $X_n$ represent the abundance of the $n$th member in the metagenomic sample and $\hat{p}_n$ (i.e., $X_n/S$) be the sample proportion of the $n$th member, where $n = 1, 2, \ldots, N$ and $S = X_1 + X_2 + \ldots + X_N$. Under the assumption that all nucleic acids of microorganisms in habitats are equally likely to be sampled and sequenced in metagenomic experiments, the abundance $X_n$ of the $n$th member in the sample is modeled as a binomial random variable:

$$X_n \sim \mathrm{Binomial}(S, p_n),$$

where $p_n$ is the unknown population proportion of the $n$th member in the habitat and is estimated by the sample proportion $\hat{p}_n$.

To select highly abundant members with proportions that are significantly higher than the average proportion ($1/N$), MetaRank applies hypothesis tests, $H_0$: $p_n \le 1/N$ vs. $H_a$: $p_n > 1/N$ for all $1 \le n \le N$. Since $X_n \sim \mathrm{Binomial}(S, p_n)$ with mean $E(X_n) = S p_n$ and variance $\mathrm{Var}(X_n) = S p_n (1 - p_n)$, the binomial distribution of the test statistic $X_n$ under $H_0$ is approximated by a normal distribution with z-statistic $Z_n$:

$$Z_n = \frac{X_n - E(X_n)}{\sqrt{\mathrm{Var}(X_n)}} = \frac{X_n - \frac{S}{N}}{\sqrt{\frac{S}{N}\left(1 - \frac{1}{N}\right)}} = \frac{\hat{p}_n - \frac{1}{N}}{\sqrt{\frac{1}{SN}\left(1 - \frac{1}{N}\right)}} \approx N(0, 1)$$

when the sample size $S$ is large enough such that $0 \le E(X_n) \pm 3\sqrt{\mathrm{Var}(X_n)} \le S$. Otherwise, the exact binomial test is applied when $S$ is small such that $E(X_n) - 3\sqrt{\mathrm{Var}(X_n)} < 0$ or $S < E(X_n) + 3\sqrt{\mathrm{Var}(X_n)}$.

The p-value for the exact binomial test is calculated as follows:

$$P[X_n \ge x_n] = \sum_{k = x_n}^{S} \binom{S}{k} \frac{1}{N^k} \left(1 - \frac{1}{N}\right)^{S-k}$$

where $x_n$ is the observed value of the test statistic $X_n$.

MetaRank considers members that reject the null hypothesis with statistical significance as highly abundant. For those that fail to reject the null hypothesis (assuming $N'$ members remain), MetaRank temporarily sets them aside and continues to select members whose proportions are significantly larger than the average ($1/N'$) in the next iteration. When none of the remaining members reject the null hypothesis, MetaRank terminates the selection procedure and considers all remaining members as rare members.
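The iterative selection described above can be sketched in Python as follows. This is a minimal illustration, not the MetaRank implementation: the function names, the significance level `alpha = 0.05`, and the choice to recompute the sample size over the remaining members in each iteration (so that 1/N' is their average proportion) are assumptions made for the example, and no multiple-testing correction is applied.

```python
from math import erfc, exp, lgamma, log, log1p, sqrt


def binom_pmf(k, S, p):
    """P[X = k] for X ~ Binomial(S, p), computed in log space (0 < p <= 1)."""
    if p >= 1.0:
        return 1.0 if k == S else 0.0
    log_pmf = (lgamma(S + 1) - lgamma(k + 1) - lgamma(S - k + 1)
               + k * log(p) + (S - k) * log1p(-p))
    return exp(log_pmf)


def upper_tail_p(x, S, p0):
    """One-sided p-value for H0: p <= p0 vs. Ha: p > p0, given count x out of S."""
    mean = S * p0
    sd = sqrt(S * p0 * (1 - p0))
    # Normal approximation only when 0 <= E(X) +/- 3*sd <= S.
    if sd > 0 and mean - 3 * sd >= 0 and mean + 3 * sd <= S:
        z = (x - mean) / sd
        return 0.5 * erfc(z / sqrt(2))  # P[Z >= z] for standard normal Z
    # Otherwise the exact binomial tail P[X >= x].
    return sum(binom_pmf(k, S, p0) for k in range(x, S + 1))


def select_abundant(counts, alpha=0.05):
    """Return (iterations, rare): members selected per iteration, then the rest."""
    remaining = dict(counts)
    iterations = []
    while remaining:
        S = sum(remaining.values())   # assumption: S recomputed over remaining members
        p0 = 1 / len(remaining)       # average proportion 1/N'
        selected = [m for m, x in remaining.items()
                    if upper_tail_p(x, S, p0) < alpha]
        if not selected:
            break
        iterations.append(selected)
        for m in selected:
            del remaining[m]
    return iterations, list(remaining)


# Hypothetical read counts for five members of one sample.
example = {"A": 700, "B": 200, "C": 50, "D": 30, "E": 20}
iterations, rare = select_abundant(example)
```

Members returned in earlier iterations correspond to the more abundant groups; the members left in `rare` never exceeded the average proportion significantly.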
rank. Second, the members selected in distinct iterations are ranked according to the order in which they were selected; thus, the members selected in the first iteration of the procedure are assigned higher ranks than all the others. Third, if two abundances (the $i$th and $j$th members) are selected in the same iteration, MetaRank determines their ranks ($R_i > R_j$, $R_i < R_j$, or $R_i = R_j$) by two hypothesis tests, $H_0$: $p_i \le p_j$ vs. $H_a$: $p_i > p_j$ and $H'_0$: $p_j \le p_i$ vs. $H'_a$: $p_j > p_i$. If $H_0$ is rejected, $R_i > R_j$; conversely, if $H'_0$ is rejected, $R_i < R_j$. However, if neither $H_0$ nor $H'_0$ is rejected, $R_i = R_j$.

Under the same assumption that all nucleic acids are equally likely to be sampled and sequenced, each abundance $X_n$ is modeled as a binomial random variable; any two abundances $X_i$ and $X_j$ are jointly modeled by the multinomial distribution (i.e., the generalization of binomial

$$\cdots \left(\frac{\hat{p}_i + \hat{p}_j}{2}\right)^{h+k} \left(1 - (\hat{p}_i + \hat{p}_j)\right)^{S-h-k}$$

where $x_i$ and $x_j$ are the observed values of $X_i$ and $X_j$.

As a result, the sorted abundances $X_{(1)} \le X_{(2)} \le \ldots \le X_{(m)} \le \ldots \le X_{(M)}$ are converted into ranks $1 \le R_{(1)} \le R_{(2)} \le \ldots \le R_{(m)} \le \ldots \le R_{(M)} \le M$, where the subscript in parentheses $(m)$ denotes the $m$th order in the community and $M$ is the total number of members. For members whose abundances cannot be distinguished from each other by hypothesis testing, MetaRank converts them into their average order; i.e., for any $m'$, $m''$ such that $R_{(m')} < R_{(m'+1)} = R_{(m'+2)} = \ldots = R_{(m''-1)} < R_{(m'')}$ (given $R_{(0)} = 0$ and $R_{(M+1)} = M + 1$), we have

$$R_{(m'+1)} = R_{(m'+2)} = \ldots = R_{(m''-1)} = \frac{m' + m''}{2}.$$
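The average-order rule for indistinguishable members can be illustrated with a short Python sketch. It assumes the pairwise hypothesis tests have already been run and that their outcome is summarized as the sizes of consecutive groups of indistinguishable members, listed from least to most abundant; the function name `average_order_ranks` and this input format are hypothetical and are not taken from MetaRank itself.

```python
def average_order_ranks(group_sizes):
    """Assign each member the average of the sorted positions its group occupies.

    group_sizes lists, from least to most abundant, how many consecutive
    members are statistically indistinguishable from one another.
    """
    ranks = []
    position = 0
    for size in group_sizes:
        first, last = position + 1, position + size
        # Positions first..last share the rank (first + last) / 2.
        ranks.extend([(first + last) / 2] * size)
        position = last
    return ranks


# Five members: one rare member, three tied members, one dominant member.
example_ranks = average_order_ranks([1, 3, 1])
```

In this example the three tied members occupy positions 2 to 4 and therefore share the average order 3, while the untied members keep positions 1 and 5.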
MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes, Fig. 1 The averages of CV, which is the normalized standard deviation, in the ranks converted by MetaRank, estimated proportions and ordinary ranks at the phylum level of the 5,000 synthetic samples for each sample-sequencing depth r ∈ {10 %, 20 %, . . ., 90 %}. Under distinct sample-sequencing depth, the averages of CV in the ranks converted by MetaRank are smaller than the ones in the others
metagenomes and synthetic samples (Kurokawa et al. 2007; Ley et al. 2006; Mavromatis et al. 2007; Wang et al. 2011). In synthetic samples, it is shown that, compared with the estimated proportions or the ordinary ranks of straightforwardly sorted abundances, the ranks converted by MetaRank have a smaller normalized standard deviation and are less affected by sampling biases. In real metagenomes, MetaRank is able to clarify the common traits and detect the discriminating features of those microbiomes.

the ranks converted by MetaRank, estimated proportions, or ordinary ranks. As shown in Fig. 1, the normalized standard deviations in the ranks converted by MetaRank are smaller than the ones in the estimated proportions and the ordinary ranks. Similar observations are also found at the taxonomic levels of class, order, family, genus, and other simulated datasets in the Wang et al. (2011) study. The results confirm that MetaRank is able to reduce the effects of sampling biases.
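For readers unfamiliar with the statistic, the normalized standard deviation (coefficient of variation, CV) compared above is the standard deviation divided by the mean. The Python sketch below shows the calculation for one member's values across replicate samples. The use of the population standard deviation and the example numbers are assumptions made for illustration; the text does not state which estimator the study used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Normalized standard deviation: population standard deviation / mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mu


# Hypothetical ranks of one member across eight synthetic samples.
example_cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

A smaller CV across resampled datasets indicates that the quantity (rank or proportion) is more stable under sampling variation.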
MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes, Fig. 2 The hierarchical clustering results of the ranks converted by MetaRank at the phylum level in 12 obese individuals at week 0 (I1W0, I2W0, . . ., I12W0) and 52 (I1W52, I2W52, . . ., I12W52), including the four lean controls (I13W0, I14W0, I13W52, and I14W52), based on UPGMA. The hierarchical agglomerative clustering (bottom-up clustering) initially treats each sample as a single cluster at the bottom and then successively agglomerates pairs of nearest clusters until all clusters have been merged into a single cluster at the top. Given a fixed distance of 0.2 (i.e., Pearson correlation 0.8), there are three main clusters, where the unweighted arithmetic mean of distances within clusters is smaller than 0.2
denoted by the convention, IxWy, where x represents the xth individual and y represents the time point. In the second dataset, four infant and nine adult samples were extracted from different individuals for COG-functional analysis.

When comparing metagenomes in the first dataset (Ley et al. 2006), MetaRank is able to clarify the common traits of similar samples (Wang et al. 2011). The taxonomic abundances in the obese samples and the lean controls are converted into ranks by MetaRank, followed by hierarchical clustering with UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Figure 2 illustrates the result of the simple case that only consists of the samples at week 0 and 52 (before and after diet). As shown in Fig. 2, given a fixed distance of 0.2 (i.e., Pearson correlation 0.8), there are three main clusters, where the unweighted arithmetic mean of distances within clusters is smaller than 0.2. The four lean controls are closely grouped together in one cluster that contains some obese samples at week 0 and all the obese samples at week 52 except one (I4W52). More than half of the
obese samples at week 0 are in the other two clusters. The result shows that after dieting almost all the obese samples are clustered together with the four lean controls. Similar results are observed in the members of the biome at the taxonomic levels of class, order, family, and genus (Wang et al. 2011).

Additionally, MetaRank is able to detect rank-based differences and identify discriminating features between metagenomes in the second dataset (Kurokawa et al. 2007). The abundances of functional groups in the infant and adult samples are first converted into ranks by MetaRank. Then the t-test is applied to identify rank-based differences between the infant and adult samples. When compared with proportion differences detected by a parametric method (only t-test without MetaRank), it is found that MetaRank, a nonparametric approach, helped to identify additional functional groups as discriminating features (Wang et al. 2011). Since nonparametric and parametric methods are complementary to each other in statistics (one cannot replace the other), MetaRank is thus a useful rank-based approach complementary to parametric methods.

Summary

Most statistical methods for comparative analysis of microbial community compositions rely on estimated abundances of members. However, when processing metagenomic data, sampling biases and systematic artifacts cause noisy deviations that may result in estimated abundances differing from true abundances. MetaRank, which converts highly abundant members into higher ranks, is designed to reduce the effects of noisy deviations. It leverages the fact that the ranks of highly abundant members are robust against small deviations. Empirical tests on synthetic samples and real metagenomes confirm that the ranks converted by MetaRank have small normalized standard deviations, facilitate the comparative analysis of metagenomes, and help to reveal the common characteristics or the discriminating features within a set of microbiomes. Therefore, MetaRank, as a nonparametric approach, provides a useful rank-based alternative for analyzing microbial community compositions.

Cross-References

▶ STAMP: Statistical Analysis of Metagenomic Profiles

References

Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005;71:7724–36.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–6.
Gomez-Alvarez V, Teal TK, Schmidt TM. Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009;3:1314–7.
Hamady M, Knight R. Microbial community profiling for human microbiome projects: tools, techniques, and challenges. Genome Res. 2009;19:1141–52.
Hugenholtz P, Tyson GW. Microbiology: metagenomics. Nature. 2008;455:481–3.
Kristiansson E, Hugenholtz P, Dalevi D. ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics. 2009;25:2737–8.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81.
Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Microbial ecology: human gut microbes associated with obesity. Nature. 2006;444:1022–3.
Mavromatis K, Ivanova N, Barry K, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010;26:715–21.
Wang TY, Su CH, Tsai HK. MetaRank: a rank conversion scheme for comparative analysis of microbial community compositions. Bioinformatics. 2011;27:3341–7.
White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009;5:e1000352.
Wooley JC, Ye Y. Metagenomics: facts and artifacts, and computational challenges. J Comput Sci Technol. 2010;25:71–81.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6:e1000667.

METAREP, Overview

20,864 million reads of Illumina data have been produced from healthy individuals. The comparison of the sequence reads to protein databases alone is estimated to generate data exceeding 12 terabytes (Human Microbiome Project Consortium 2012). We believe that the HMP typifies the scope and complexity of metagenomic projects that will come. The collection, integration, sharing, and comparison of these data represent a characteristic example of the current metagenomic data analysis challenges. Toward this end we have developed METAREP, an open-source and thus adjustable software that enables exploratory data analysis for projects of this size and larger (Goll et al. 2010, 2012).

A variety of other free metagenomic annotation and analysis software is accessible to researchers (Table 1). Efforts that include annotation workflows and free compute resources are provided by the US Department of Energy (IMG/M (Markowitz et al. 2012)), the European Bioinformatics Institute, the Argonne National Laboratory (MG-RAST (Meyer et al. 2008)), and the University of California, San Diego (CAMERA (Sun et al. 2011)). Efforts that require compute resources owned by the researchers (or rented via a cloud service) include CLOVR (Angiuoli et al. 2011), Galaxy (Goecks et al. 2010), and METAREP. The free annotation resources, however, are often tightly coupled to each center’s specific infrastructure including its compute resources. Thus they cannot easily be installed and modified to satisfy custom needs including privacy concerns and advanced data access management. In contrast, CLOVR, Galaxy, and METAREP are self-contained and can be run on other systems, and the source code can be adapted to handle project-specific needs. On the analysis side, most resources provide summary results that fit a certain workflow and are tailored toward answering a certain question. METAREP is an exception, as it supports generic exploratory data analysis for annotations from different workflows that can be
queried and filtered dynamically. For example, its functionality can be used to visualize how specific taxonomic or metabolic markers vary across samples. METAREP does not support a particular workflow but a generic annotation input format. As a consequence, it does not include annotation workflows. To bridge this gap, users can run a public annotation service or a custom local pipeline, format the data, and import it.

In the following sections we will describe how to import data, highlight features to analyze individual and multiple datasets, carry out BLAST searches, and customize the software.

Data Format and Import Process

The current METAREP tab-delimited format specification for 17 fields is shown in Table 2. Understanding this format is crucial for subsequent analysis. Following the outlined conventions will help users to leverage as much of the functionality as possible and understand what fields are supported. The format has been designed to accommodate common data types that are produced by many annotation workflows without being tied to a specific workflow. The disadvantage of this flexibility is that a custom parser needs to be written to format the output of a certain workflow according to this tab format before importing the data. However, in most cases, generating the METAREP tab-delimited format is trivial. In addition, METAREP provides data formatting functionality for two workflows: (1) the JCVI Prokaryotic Metagenomic Annotation Pipeline (JPMAP (Tanenbaum et al. 2010)) and (2) the HUMAnN metabolic reconstruction pipeline (Abubucker et al. 2012). The open-source code for formatting output from these two pipelines serves as a template for supporting other formats. The code base also includes a Perl utility script (scripts/perl/metarep_loader.pl) to import tab-delimited annotation files into METAREP projects (more details on how to use the import script can be found at https://github.com/jcvi/METAREP/wiki/Installation-Guide-v-1.4.0).

Except for the first two columns (peptide_id and library_id), which specify the unique ID of the respective annotation entry (gene/protein ID) and the library/dataset ID, respectively, columns are optional. This has the advantage that workflows that do not produce all of the data types are supported. The last two columns in Table 2 provide example values for each of the fields per pipeline. While the unique ID fields mentioned before store a single value, most of the other fields can store multiple values (as indicated in column 3). By convention, multiple values are double pipe separated. For example, information for a multi-enzymatic protein can be stored by setting the value of the ec_id field to “1.6.99.3||1.6.5.3”. By convention, the ec_id field stores the enzyme accessions according to IUBMB format. Higher-level enzyme classes are encoded using dashes for all unspecified levels, e.g., 3.4.-.-. The go_id field stores accessions defined by the Gene Ontology (Ashburner et al. 2000) with accessions being prefixed using uppercase “GO:”. The hmm_id is a generic field for hidden Markov model-based assignments. It takes Pfam accessions (PF234) (Punta et al. 2012), TIGRFAM accessions (TIGR23423) (Haft et al. 2003), superfamily accessions (SSF345) (Madera et al. 2004), and combinations of the same (separated by double pipes).

The blast_* fields store information of BLAST (Altschul et al. 1990) alignments (but can hold alignment information from other alignment software). In particular, the blast_tree field stores organismal information in the form of the lowest taxon using the NCBI Taxonomy as the reference taxonomy. For example, to indicate that a certain annotation entry belongs to Escherichia coli, the blast_tree field can be set to NCBI taxon ID “83333”. If multiple NCBI taxon IDs are provided, the lowest common ancestor will be determined during the data import process based on the NCBI taxon ID set provided by the user. The blast_evalue, blast_pid (proportion of identical amino acids), and blast_cov (proportion of coverage of query sequence) fields reflect alignment quality data types. The field values range from 0 to 1 and allow users to filter their data based on alignment quality (see searching and filtering). The ko_id field stores the KEGG Ortholog
METAREP, Overview, Table 2 Data format

Column | Field name | Multi-valued | Description | JPMAP example | HUMAnN example
1 | peptide_id | No | Unique entry ID | JCVI_PEP_1234123 | ptr:453118
2 | library_id | No | Dataset ID | SRS011061 | SRS011061
3 | com_name | Yes | Functional description | Sugar ABC transporter, periplasmic sugar-binding protein | LGMN, legumain, K01369 legumain [EC:3.4.22.34]
4 | com_name_src | Yes | Functional description source (description assignment) | Uniref100_A23521 | ptr:453118
5 | go_id | Yes | Gene Ontology ID | GO:0009265 | GO:0001509
6 | go_src | Yes | Gene Ontology source (assignment) | PF02511 | K01369
7 | ec_id | Yes | Enzyme commission ID | 2.1.1.148 | 3.4.22.34
8 | ec_src | Yes | Enzyme commission source | PRIAM | ptr:453118
9 | hmm_id | Yes | HMM ID | PF02511 | N/A
10 | blast_tree | Yes | NCBI Taxonomy ID | 246194 | 9598
11 | blast_evalue | No | BLAST E-value | 1.78E-20 | Median
12 | blast_pid | No | BLAST percent identity | 0.93 | Median
13 | blast_cov | No | BLAST sequence coverage | 0.82 | N/A
14 | filter | Yes | Filter tag | Repeat | N/A
15 | ko_id | Yes | KEGG Ortholog ID | N/A | K01369
16 | ko_src | Yes | KEGG Ortholog source | N/A | ptr:453118
17 | weight | No | Weight to adjust abundance of assignments | 1 | 43.23
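As an illustration of the tab-delimited layout summarized in Table 2, the following Python sketch writes and reads one annotation row. It is not part of METAREP: the helper names (`format_row`, `parse_row`, `split_values`) are hypothetical, the column order simply follows Table 2, and representing a missing optional value as an empty cell is an assumption that should be checked against the import script's documentation.

```python
# Column order as listed in Table 2 (17 tab-separated fields).
FIELDS = [
    "peptide_id", "library_id", "com_name", "com_name_src", "go_id", "go_src",
    "ec_id", "ec_src", "hmm_id", "blast_tree", "blast_evalue", "blast_pid",
    "blast_cov", "filter", "ko_id", "ko_src", "weight",
]
REQUIRED = ("peptide_id", "library_id")  # all other columns are optional
MULTI_VALUE_SEPARATOR = "||"             # double pipe joins multiple values


def format_row(entry):
    """Turn a dict of field values into one tab-delimited line (no newline)."""
    for name in REQUIRED:
        if not entry.get(name):
            raise ValueError(f"missing required field: {name}")
    cells = []
    for name in FIELDS:
        value = entry.get(name, "")  # assumption: empty cell when not provided
        if isinstance(value, (list, tuple)):
            value = MULTI_VALUE_SEPARATOR.join(str(v) for v in value)
        cells.append(str(value))
    return "\t".join(cells)


def parse_row(line):
    """Split one tab-delimited line back into a field-name -> cell mapping."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cells)}")
    return dict(zip(FIELDS, cells))


def split_values(cell):
    """Return the individual values stored in a (possibly multi-valued) cell."""
    return cell.split(MULTI_VALUE_SEPARATOR) if cell else []


# A multi-enzymatic protein, using example values quoted in the text.
row = format_row({
    "peptide_id": "JCVI_PEP_1234123",
    "library_id": "SRS011061",
    "go_id": ["GO:0009265"],
    "ec_id": ["1.6.99.3", "1.6.5.3"],
    "weight": 1,
})
record = parse_row(row)
```

Here `record["ec_id"]` holds the double-pipe-joined string, and `split_values` recovers the two enzyme accessions.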
accession (KO2134). Both the ec_id and ko_id fields are used to support two types of
pathway analysis (see pathway analysis section). Pathway analysis based on the ec_id
field allows analysis of 100 strictly metabolic pathways. Pathway functionality based
on ko_id is more comprehensive, supporting 200 additional non-metabolic pathways such
as transcription and translation. Depending on which field is populated, the
corresponding functionality is activated. Source fields (fields with a _src postfix)
describe the origin of a certain value. For example, an enzyme accession may have been
assigned based on a certain TIGRFAM model, a reference gene/protein homology hit, or
other methods. The ec_src field can be used to track this information. Finally, the
weight field allows users to assign a weight to a certain entry to adjust the absolute
and relative frequency of associated entry values. The field can be used for encoding
abundance information such as the number of reads that support a certain gene/protein
(in transcriptomic or assembly studies) or spectral counts in metaproteomic studies.
By default the weight field is set to 1.

When we subsequently refer to annotation attributes, we mean a selection of these
fields that are used throughout the software to provide summary statistics and compare
datasets. They refer to NCBI Taxonomy, Gene Ontology, Enzyme Classification, HMM, and
KEGG/Metacyc pathways and KEGG Ortholog fields. A feature refers to a certain value
that an annotation attribute can take. A feature-dataset matrix is a two-dimensional
matrix with features of a certain annotation attribute as rows and datasets as columns.
Cells represent the sum of weights of the respective feature-dataset combination (by
default it reflects the number of genes/peptides with that specific feature).

Single Dataset Options

Dataset Summary Statistics
The View Dataset Page displays the imported annotations and provides high-level
summaries of annotation attributes including detailed pathway summaries. The Data Tab
shows the imported data in tabular format. This is helpful to check if the data has
been correctly imported. The Summary Tab provides an overview of overall annotation
statistics including a high-level taxonomic breakdown. Subsequent tabs summarize
statistics for a corresponding annotation attribute. For each, the top 20 ranked
features with the absolute and relative counts are displayed. Users can adjust the
number of top features that are displayed (up to 1,000 ranks) and download the data in
tab-delimited format.

Dataset Search and Filter Options
The Search Page facilitates dynamic filtering of annotations and allows users to export
matching entries and associated statistics. Once a query is executed, the page
summarizes top 10 statistics for several annotation attributes in the form of lists and
pie charts. The page also lists individual matching annotation entries so that users
can confirm that the query correctly retrieved the desired results. The top 10
statistics, matching annotations, and underlying protein sequences (if configured, see
configuration) can be exported. To search a dataset, users can enter a search term and
select the field to search in from a drop-down box. Selections include ID-based and
name-based searches. The former performs exact searches; the latter executes fuzzy
name-based searches. For example, the user can enter 2.7.1.147 and select the Enzyme ID
field from the drop-down box to search for exact matches. Alternatively, the user can
carry out a fuzzy name-based search for “Glucokinase” which retrieves three matching
enzymes: ADP-specific glucokinase (2.7.1.147), glucokinase (2.7.1.2), and
phosphoglucokinase (2.7.1.10). For both search strategies, the selection triggers a
query generation process that creates a query that is compatible with the Solr/Lucene
query syntax (http://wiki.apache.org/solr/SolrQuerySyntax). The original search term is
prefixed by the search field, and multiple terms can be logically combined using the
AND, OR, and NOT keywords. In the ID-based example, the final query that will be
generated is “ec_id:2.7.1.147”. For the name-based example, the final query represents
a logical combination
(“OR”) of all individual matches, in this case menu) is to use Solr/Lucene wildcard characters.
“ec_id:2.7.1.147 OR ec_id:2.7.1.2 OR There are two supported wildcards: “?” and “*”.
ec_id:2.7.1.10”. The same principle is being applied The “?” performs a single character wild card
to pathway name-based searches. A search for search. For example, to find common names like
“starch and sucrose metabolism” using the fliF, fliC, and fliS, one can search for
name-based KEGG pathway name (EC) option “com_name_txt:fli?”. The “*” performs a
searches for all enzymes in that pathway by gen- multiple-character wild card search. For exam-
erating the following query: “ec_id:1.1.1.22 OR ple, to search for all transferases 2.1.1.1, 2.1.1.2,
ec_id:1.1.99.13 OR ec_id:2.4.1.1 OR 2.1.1.3, etc., one can enter “ec_id:2.*”. The quan-
ec_id:2.4.1.10 OR . . .”. While the drop down titative alignment information (proportion of
helps to build queries, experienced users can identical amino acids, proportion of covered
enter the Solr/Lucene-formatted queries directly. query amino acids, E-value) range queries can
This has the advantage of entering custom logical be applied that identify entries that fall between
combinations of particular fields of interest a minimum and maximum. For example, to filter
(a complete list of fields and example queries the data for 1.0E-20 ≤ E-value ≤ 1.0E-5,
are shown in Table 2). Note that if the value one can search the blast_evalue_exp (which
itself contains a colon (which is a special charac- stores the negative E-value exponent) for
ter of the Solr/Lucene language to separate field “blast_evalue_exp:[5 TO 20]”. To exclude the
names from values), it needs to be preceded by boundary values from the result list, the user
a backslash. For example, a search for can use “blast_evalue_exp:{5 TO 20}”. This is
“go_id:GO\:0000160” instead of “go_id:GO:0000160” equivalent to 1.0E-20 < E-value < 1.0E-5.
will return the desired results. Some fields When filtering for E-values, there is usually no
store complete hierarchies, including the defined lower bound. This can be reflected using
NCBI Taxonomy and the Gene Ontology. The a wild card character “*”. For example,
former is encoded in the blast_tree field, which “blast_evalue_exp:[5 TO *]” searches for all
stores the whole taxonomic lineage (according to entries with an E-value ≤ 1.0E-5. Finally, if
NCBI) for each entry in the form of NCBI taxon the sequence store path has been defined (see
IDs. For example, a protein entry with a species section “Installation and Configuration”), the user
assignment of “Escherichia coli” with NCBI can enter an amino acid sequence into the search
taxon 562 has the following nine NCBI taxon box and select the Search by Sequence option with
IDs stored in the blast_tree field: a certain minimum E-value. The software then
562 = Escherichia coli; 561 = Escherichia; executes a BLASTP search behind the scenes and
543 = Enterobacteriaceae; 91347 = Enterobac- returns the top-matching entry
teriales; 1236 = Gammaproteobacteria; 1224 = accessions (peptide_id field) concatenated by an
Proteobacteria; 2 = Bacteria; 131567 = OR and visualizes summary statistics for homolo-
cellular_organisms; 1 = root; gous proteins.
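
The query strings described above (field-prefixed terms, OR-combined matches, colon escaping, wildcards, and E-value exponent ranges) can also be assembled programmatically. The following Python sketch is illustrative only and is not part of METAREP: the helper names field_query, any_of, and evalue_exponent_range are hypothetical, and only the field names (ec_id, go_id, blast_tree, blast_evalue_exp) and the resulting query strings are taken from the text.

```python
from typing import Optional, Sequence


def field_query(field: str, value: str) -> str:
    """Prefix a value with its field name.

    A colon inside the value is escaped with a backslash, because the
    colon separates field names from values in the Solr/Lucene syntax.
    """
    escaped = value.replace(":", "\\:")
    return f"{field}:{escaped}"


def any_of(field: str, values: Sequence[str]) -> str:
    """Combine several values of one field with OR (name-based search result)."""
    return " OR ".join(field_query(field, v) for v in values)


def evalue_exponent_range(min_exp: int,
                          max_exp: Optional[int] = None,
                          inclusive: bool = True) -> str:
    """Range query on blast_evalue_exp, the negative E-value exponent.

    Square brackets include the boundary values, curly braces exclude
    them; a missing upper exponent is written as the wildcard '*'.
    """
    upper = "*" if max_exp is None else str(max_exp)
    left, right = ("[", "]") if inclusive else ("{", "}")
    return f"blast_evalue_exp:{left}{min_exp} TO {upper}{right}"


if __name__ == "__main__":
    print(field_query("ec_id", "2.7.1.147"))
    print(any_of("ec_id", ["2.7.1.147", "2.7.1.2", "2.7.1.10"]))
    print(field_query("go_id", "GO:0000160"))
    print(field_query("ec_id", "2.*"))
    print(evalue_exponent_range(5, 20))
    print(evalue_exponent_range(5))
    # Lineage filter: bacterial glucokinases via the stored taxon lineage.
    print(field_query("ec_id", "2.7.1.2") + " AND " + field_query("blast_tree", "2"))
```

Keeping the escaping in one place means every field query is built the same way, whether it comes from an ID-based search, an expanded name-based search, or a lineage filter on blast_tree.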
This allows users to find the entry by searching
for “blast_tree:562” (Escherichia coli) as well as Drill into Datasets Using Hierarchical
“blast_tree:2” (Bacteria) or any other IDs that are Datasets
part of that lineage. This can be very helpful for The Browse Dataset Pages are available for sev-
excluding or including proteins that were eral annotation hierarchies including NCBI Tax-
assigned to a certain taxonomic group. For exam- onomy, Gene Ontology, Enzyme Classification,
ple, “ec_id:2.7.1.2 AND blast_tree:2” can be and KEGG and Metacyc metabolic pathways.
used to filter the data for bacterial glucokinases. For KEGG two different pathway hierarchies
A search for “NOT blast_tree:9606” excludes can be selected: enzyme based and KO based.
entries that were assigned to “Homo sapiens”. The difference is that the enzyme-based version
Another way of fuzzy searching (in addition to uses enzyme assignments and maps them to
the name-based searches using the drop-down EC-based KEGG pathways (a subset of KEGG
METAREP, Overview, Fig. 1 Screenshot of the METAREP Browse Pathway (EC) page

pathways that are mainly related to metabolism), while the KO-based version uses KEGG
Orthologs to infer pathway membership and uses a more comprehensive set of pathways
including non-metabolic processes such as translation and transcription. The number of
hits is displayed for each node in the tree, and a user can click on a tree node and
expand further. After clicking a node, a summary of that node is shown in the right
panel featuring a pie chart calculated from its sub-nodes and top lists of functional
and taxonomic assignments. Once the user has reached the pathway level, for the KEGG
versions of the Browse Pathways pages, relative abundance of pathway members is
visualized on top of pathway maps (Fig. 1).

Multi-dataset Analysis Options

Compare Feature Abundance Profiles Across Datasets
The Compare Page unifies a variety of descriptive, graphical, and statistical analysis
options to compare annotation attributes of dozens of datasets. The page features three
distinct panels, the Dataset Select Panel, the Filter and Options Panel, and the
Results Panel (Fig. 2). The right upper dataset select box in the Dataset Select Panel
allows users to select datasets by dragging selected datasets to the left upper panel
or by clicking on the plus symbol. The dataset selection can be narrowed down by
entering keywords in the search textbox in the left upper panel. The Filter and Options
Panel provides a textbox to enter a Lucene query (see section “Dataset Search and
Filter Options” and Table 3). If applied, each dataset gets filtered and only
annotation entries that match the query are retained for the comparison. A typical
example is to apply a more stringent BLAST E-value baseline. Another example,
highlighted in {REF}, is to filter all datasets for a certain enzymatic marker, e.g.,
pyruvate dehydrogenase complex (“ec_id:1.2.4.1 OR ec_id:2.3.1.12 OR ec_id:1.8.1.4” or
the shorter version “ec_id:1.2.4.1 OR 2.3.1.12 OR 1.8.1.4”). A minimum count value can
be entered into the Min. Count Field to keep only features whose minimum count across
all datasets is equal to or higher than the specified count. By default this field is
set to 0, showing any features with at least one dataset having a count of one
(features with zero counts across all datasets are discarded). The main compare
METAREP, Overview, Fig. 2 Screenshot/conceptual overview of the METAREP Compare Page.
Current implementation of the METAREP Compare page (key options are highlighted in
green panels). The page allows users to compare absolute and relative abundance of
annotation categories across multiple datasets including taxonomic, pathway, enzyme,
and GO classifications. Visualization options include heatmap (shown), hierarchical
clustering, multidimensional scaling, and Mosaic Plots. Advanced Compare options
include statistical tests for pairwise dataset comparisons (Fisher’s Exact Test,
Equality of Proportions Test) as well as for comparing two dataset populations
(Wilcoxon Rank-Sum and a nonparametric t-test)

options can be selected from the drop down next to the Min. Count Field and are
organized by the following categories: Count Matrices, Statistical Tests (2 Datasets),
Statistical Tests (2 populations), and Plot options.

The Results Panel is automatically updated upon option selection displaying
feature-dataset matrices or graphical representations of the same. A certain annotation
attribute can be selected by clicking on the respective tab and exported using the disk
with the green array key. In the following, we describe the options in more detail.

Count Matrices: Applicable if at Least One Dataset Was Selected
Absolute Count Matrix shows a numeric representation of a feature-dataset matrix with
cells containing the number of counts for a feature-dataset combination.

Relative Count Matrix shows a numeric representation of a normalized feature-dataset
matrix with cells containing the number of counts for a feature-dataset combination
divided by the total dataset count. If a filter was entered, the cells represent the
number of counts for a feature-dataset combination divided by the total count of the
filtered dataset.

Heatmap Count Matrix shows a numeric representation of a row-normalized feature-dataset
matrix with cells containing the relative counts per dataset divided by the sum of
relative counts per row, i.e., across all datasets. Cells are color coded according to
their row-normalized counts. The color scheme can be changed using a drop-down menu.

Statistical Tests (2 Datasets)
The following dataset tests are applicable if two datasets were selected. As input for
the tests,
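
The three count matrices described above can be reproduced with a few lines of code. The Python sketch below is a hedged illustration rather than the METAREP implementation: the function names and the toy entries are hypothetical, and it assumes that the "total dataset count" is the sum of weights of all entries of a dataset that carry the selected annotation attribute.

```python
from collections import defaultdict


def absolute_matrix(entries):
    """Sum entry weights per (feature, dataset) combination.

    entries: iterable of (dataset, feature, weight) tuples.
    Returns {feature: {dataset: summed weight}}.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for dataset, feature, weight in entries:
        matrix[feature][dataset] += weight
    return {feature: dict(row) for feature, row in matrix.items()}


def relative_matrix(abs_matrix, datasets):
    """Divide each cell by the total count of its dataset (column)."""
    totals = {d: sum(row.get(d, 0.0) for row in abs_matrix.values())
              for d in datasets}
    return {feature: {d: (row.get(d, 0.0) / totals[d] if totals[d] else 0.0)
                      for d in datasets}
            for feature, row in abs_matrix.items()}


def heatmap_matrix(rel_matrix, datasets):
    """Divide each relative count by the sum of relative counts in its row."""
    result = {}
    for feature, row in rel_matrix.items():
        row_sum = sum(row[d] for d in datasets)
        result[feature] = {d: (row[d] / row_sum if row_sum else 0.0)
                           for d in datasets}
    return result


if __name__ == "__main__":
    # Toy data (invented): two datasets, two enzyme features, default weights.
    toy_entries = [
        ("d1", "2.7.1.2", 1.0), ("d1", "2.7.1.2", 1.0), ("d1", "2.7.1.147", 2.0),
        ("d2", "2.7.1.2", 1.0), ("d2", "2.7.1.147", 3.0),
    ]
    datasets = ["d1", "d2"]
    absolute = absolute_matrix(toy_entries)
    relative = relative_matrix(absolute, datasets)
    heatmap = heatmap_matrix(relative, datasets)
    for name, matrix in (("absolute", absolute), ("relative", relative), ("heatmap", heatmap)):
        print(name, matrix)
```

Because the weight field defaults to 1, the absolute matrix reduces to a plain count of genes or peptides per feature unless weights were supplied at import time.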
METAREP, Overview, Fig. 3 Compare Page plot options exemplified using a selection of
eight Global Ocean Survey (GOS) samples. Plots A, B, and D show the same selection of
datasets based on organismal composition on the family level (assigned based on the
best reference hit using BLAST) with the Minimum Count Option set to 5. Plot C
summarizes the same datasets for the phylum level. GS-11 and GS-12 were sampled from
the Chesapeake Bay, Annapolis, MD, USA, and Delaware Bay, NJ, USA, respectively. GS-30,
GS-31, and GS-32 were sampled close to the Galapagos Islands (GS-30 off Roca Redonda,
GS-31 Fernandina Island, and GS-32 mangrove on Isabella Island). For GS-11, GS-30, and
GS-32, two samples were taken from the same location. The hierarchical clustering and
heatmap dataset-based dendrograms and the MDS plot show that the replicated samples
cluster together. The dendrogram shows that, although the mangrove samples are distinct
from the rest of the Galapagos Islands, they are closer to each other when compared to
the East Coast samples. The heatmap shows an increase of Rhodobacteraceae (orange to
white) and a decrease in Comamonadaceae and Burkholderiaceae families (orange to red)
when comparing these two groups based on the % abundance level
METAREP, Overview, Fig. 4 Screenshot of the BLAST Sequence Page. A protein sequence of
interest can be searched against a selection of datasets. BLAST results can be
displayed and exported in various formats including annotation (shown), alignment
(shown in the zoom panel), and tabular
The INTERNAL_EMAIL_EXTENSION variable can be specified to identify internal users that
register with the instance and set permissions accordingly. By default, users that
register with the specified e-mail extension are granted full data access. The
GOOGLE_ANALYTICS_TRACKER_ID and GOOGLE_ANALYTICS_DOMAIN_NAME variables configure the
instance to synchronize Web usage with Google Analytics to track usage statistics.

The NUM_METASTATS_BOOTSTRAP_PERMUTATIONS variable sets the number of replicates to
determine the null distribution for the METASTATS test and can be increased to increase
the precision of the p-values (see White et al. 2009 for details).

To activate the METAREP blast functionality, searching and exporting of sequences, the
SEQUENCE_STORE_PATH variable needs to be defined. This path points to the location on
the Web server where the formatdb-formatted protein sequence files are kept (organized
by project ID and dataset, Table 4). The perl/scripts/metarep_format_sequence.pl
utility can be used to format sequence data according to this format. If an FTP server
is available, data sharing of a collection of custom files per dataset via the dataset
download option
can be activated by specifying the FTP_HOST, FTP_USERNAME, and FTP_PASSWORD variables.
The software identifies FTP data by looking for the project ID folder followed by a
tar-gzipped file that has a matching dataset name, i.e., <dataset-names>.tgz.

Example Hardware Configurations
The main requirements are driven by the amount of annotations that are to be stored in
index files and served by a Solr/Lucene server. The main impact on performance for a
single machine is the amount of memory available for result retrieval, caching, and
operating systems for file caching. If annotations are weighted, i.e., the weight field
is set to values other than 1, the CPU requirements increase (see Goll et al. 2012,
Fig. 6). We are currently running a two-server system that is served by two
load-balanced Dell PowerEdge R710 servers each having eight cores (2.66 GHz), 72 GB
RAM, and 2 × 600 GB HD. So far we have successfully indexed 190 M. Our HMP METAREP
instance that serves over 400 million weighted annotation entries runs on a single
server with two multi-threaded Xeon X7560 2.26 GHz processors with a total of 16 cores
(32 threads), 256 GB RAM, and 4 terabytes of disk space. For performance benchmarks,
please see Goll et al. (2010), Supplementary Fig. 1, and Goll et al. (2012), Fig. 6.

Additional Resources

As part of the NIH HMP project, the software was tested with short-read annotations
derived from over 14 trillion Illumina reads (Goll et al. 2012). The study includes
several scenarios on how to
investigate the NIH human microbiome data including how to analyze specific metabolic
markers, cluster datasets based on their metabolic profile, and identify pathways that
are differentially abundant between human body habitats. The data can be accessed at
www.jcvi.org/hmp-metarep. The following short video tutorial summarizes key
functionality (YouTube ID:7FPJaPyLjMk). The METAREP home page at www.jcvi.org/metarep
provides an anonymous login via the “Try It” button to evaluate the latest
functionality for a collection of ocean samples taken from the North Pacific
Subtropical Gyre (DeLong et al. 2006). The open-source code of the software and
developer information including how to contribute to the open-source project is
available at the project’s source code repository at https://github.com/jcvi/METAREP.
For questions and comments, please join the mailing list at www.jcvi.org/metarep or
directly send an e-mail to metarep@googlegroups.com.

References

Abubucker S, Segata N, Goll J, Schubert AM, Izard J, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol. 2012;8:e1002358. doi:10.1371/journal.pcbi.1002358.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10. doi:10.1016/S0022-2836(05)80360-2.
Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, et al. CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics. 2011;12:356. doi:10.1186/1471-2105-12-356.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25:25–9. doi:10.1038/75556.
DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311:496–503. doi:10.1126/science.1120250.
Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86. doi:10.1186/gb-2010-11-8-r86.
Goll J, Rusch DB, Tanenbaum DM, Thiagarajan M, Li K, et al. METAREP: JCVI metagenomics reports–an open source tool for high-performance comparative metagenomics. Bioinformatics. 2010;26:2631–2. doi:10.1093/bioinformatics/btq455.
Goll J, Thiagarajan M, Abubucker S, Huttenhower C, Yooseph S, et al. A case study for large-scale human microbiome analysis using JCVI’s Metagenomics Reports (METAREP). PLoS ONE. 2012;7:e29044. doi:10.1371/journal.pone.0029044.
Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–3.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85. doi:10.1128/MMBR.68.4.669-685.2004.
Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486:215–21. doi:10.1038/nature11209.
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J. The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res. 2004;32:D235–9. doi:10.1093/nar/gkh117.
Markowitz VM, Chen I-MA, Chu K, Szeto E, Palaniappan K, et al. IMG/M-HMP: a metagenome comparative analysis system for the human microbiome project. PLoS ONE. 2012;7:e40151. doi:10.1371/journal.pone.0040151.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi:10.1186/1471-2105-9-386.
Milligan GW. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika. 1980;45(3):325–342.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–301. doi:10.1093/nar/gkr1065.
Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11:207. doi:10.1186/gb-2010-11-5-207.
Sun S, Chen J, Li W, Altintas I, Lin A, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39:D546–51. doi:10.1093/nar/gkq1102.
Tanenbaum DM, Goll J, Murphy S, Kumar P, Zafar N, et al. The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data. Stand Genomic Sci. 2010;2:229–37. doi:10.4056/sigs.651139.
White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009;5:e1000352. doi:10.1371/journal.pcbi.1000352.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151. doi:10.1186/gb-2008-9-10-r151.
M 464 MetaTISA: Metagenomic Gene Start Prediction with
large number of incorrectly identified or poorly By default, the five highest-scoring BLAST
annotated reference sequences in the public matches are examined for origin in terms of
sequence databases (Bidartondo 2008; Hartmann organelle or taxonomic domain, and each origin
et al. 2011). is given a score based on the number of sequences
Metaxa (Bengtsson et al. 2011) is a software among the top five BLAST hits that belong to the
package that resolves the problem of extracting respective origin. The matches to the HMM pro-
and sorting SSU sequences to origin in an accu- files in the previous step are weighted together
rate and rapid way. The end result is a set of with these BLAST-based origin scores to make
FASTA files, each representing all SSU a final call on the most likely origin of the
sequences from a particular organelle or taxo- sequence fragment. In cases where the origin
nomic domain, for further analysis of species cannot be determined with certainty, but where
composition or other endeavors. there is a strong candidate, Metaxa assigns the
sequence to the most likely origin, but flags it as
potentially in need of manual inspection. If scores
Methods for origin are tied altogether, the sequence is
assigned into a special “uncertain” bin. In the
Extraction two latter cases, sequence alignments of the
The rRNA SSU gene is composed of eight to nine extracted fragment and the five best BLAST
hypervariable (“V”) regions flanked by more matches are computed automatically using
conserved domains (Hartmann et al. 2010). MAFFT (Katoh and Toh 2008), to assist the
Metaxa carries out the extraction of SSU user in the interpretation process.
sequence fragments from the metagenome using
the HMMER package (Eddy 2010) and Hidden Input and Output
Markov Models (HMMs) representing the most Metaxa takes input in the FASTA format and out-
conserved parts of the SSU gene, chiefly at the 5′ puts one FASTA file for each origin found.
and 3′ end of each V region. These HMMs are Optionally, Metaxa can also produce output in
modeled according to the same principles as table format. The entire running process is outlined
those of V-Xtractor (Hartmann et al. 2010). in Fig. 1.
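
The BLAST-based part of the origin call described under Classification (counting how many of the top five hits belong to each origin) can be sketched as follows. This Python example is a simplified assumption-laden illustration, not Metaxa's code: the function name call_origin is hypothetical, the weighting with HMM profile matches is omitted because the text does not specify it, a tie for the highest score is treated as "uncertain", and any non-unanimous winner is flagged for manual inspection.

```python
from collections import Counter


def call_origin(hit_origins, max_hits=5):
    """Tally the origins of the best BLAST hits and make a provisional call.

    hit_origins: origins of the BLAST hits, best hit first
                 (for example "bacteria", "chloroplast", "mitochondria").
    Returns (origin, needs_inspection).
    """
    considered = list(hit_origins)[:max_hits]
    scores = Counter(considered)            # one point per hit of that origin
    if not scores:
        return ("uncertain", False)

    ranked = scores.most_common()
    best_origin, best_score = ranked[0]

    # Assumption: a tie for the highest score goes to the "uncertain" bin.
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return ("uncertain", False)

    # Assumption: a winner that is not supported by every hit is flagged.
    needs_inspection = best_score < len(considered)
    return (best_origin, needs_inspection)


if __name__ == "__main__":
    print(call_origin(["bacteria"] * 5))
    print(call_origin(["bacteria", "bacteria", "bacteria", "chloroplast", "chloroplast"]))
    print(call_origin(["bacteria", "bacteria", "chloroplast", "chloroplast"]))
```

In the real tool the tally is only one of two inputs, so a production version would combine these scores with the HMM-based evidence before deciding.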
Since the Metaxa models represent a set of highly
conserved domains, false-positive matches can
be all but avoided as only high-scoring profile Performance
matches are considered. Metaxa features HMM
profiles representing the archaeal and bacterial Metaxa has been shown to classify more than
16S genes, the eukaryote 18S gene, the mitochon- 99.95 % of the core-release sequences in the
drial 12S and 16S genes, and the chloroplast 16S SILVA database according to their annotated ori-
gene. These sets of HMM profiles enable Metaxa gin, and it has a false-positive rate of 0.00012 %
to identify and distinguish among all these classes (Bengtsson et al. 2011). When evaluated on sim-
of SSU sequences. ulated metagenomic data comprising three sets of
100,000 sequences with fragment lengths of
Classification 1,000, 300, and 100 bp, Metaxa processed the
After extracting all SSU sequences from the datasets in 112, 47, and 35 min, respectively,
query dataset, Metaxa proceeds to classify the with very high accuracy down to typical
extracted SSU sequence fragments. This is 454 read lengths (300 bp), retaining fidelity for
performed by comparing each fragment to bacterial sequences even at read lengths as short
a carefully selected set of reference SSU as 100 bp (Fig. 2). This suggests that Metaxa is
sequences from GreenGenes, SILVA, CRW highly reliable for Sanger, as well as 454-derived,
(Cannone et al. 2002), and MitoZoa (Lupi metagenomes, and that it is useful even on
et al. 2010) using BLAST (Altschul et al. 1997). metagenomes generated using short-read
Metaxa, Overview 469 M
sequencing technologies, such as Illumina. For example, a set of sequences extracted using
Metaxa takes advantage of multiple processor Metaxa could be used for sequence diversity
cores, if available, and it has no software or analysis. However, because of the classification
hardware restriction on the number of input capabilities of Metaxa, it is also useful in sorting
sequences. out PCR-amplified SSU libraries before continu-
ing with species richness investigations such as
rarefaction or species accumulation analysis.
Applications Here, the ability of Metaxa to separate chloro-
plast and mitochondrial SSU sequences from
Metaxa has obvious uses in deriving taxonomic other SSU entries is crucial for the accuracy of
inferences from metagenomic sequence sets. the downstream analysis. Metaxa could also be
used as a tool to verify the authenticity of anno- processor cores where available. It can be used
tations in SSU sequence databases and reference as a tool for taxonomic analysis of metagenomes
libraries. as well as a classification tool for SSU amplicons.
Metaxa is freely available from
http://microbiology.se/software/metaxa/.
Availability
References
Summary
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z,
Metaxa is a high-performance software tool for Miller W, Lipman DJ. Gapped BLAST and
PSI-BLAST: a new generation of protein database
extracting and classifying SSU sequences from search programs. Nucleic Acids Res. 1997;25(17):
metagenomic datasets. The accuracy of the soft- 3389–402.
ware is very high, providing high sensitivity Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy
toward SSU fragments even at short-read lengths BD, Grelet G-A, Abarenkov K, Petri A, Alm
Rosenblad M, Nilsson RH. Metaxa: a software tool
while maintaining a false-positive rate of about
for automated detection and discrimination among
0.00012 %. Metaxa is fast compared to, e.g., ribosomal small subunit (12S/16S/18S) sequences of
BLAST, and it takes advantage of multiple archaea, bacteria, eukaryotes, mitochondria, and
Microbial Diversity, Bar-Coding Approaches 471 M
chloroplasts in metagenomes and environmental
sequencing datasets. Antonie van Leeuwenhoek. Microbial Diversity, Bar-Coding
2011;100(3):471–5.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Approaches
Sayers EW. GenBank. Nucleic Acids Res.
2009;37(Database issue):D26–31. James A. Foster
Bidartondo MI. Preserving accuracy in GenBank. Science Department of Biological Sciences, Institute for
(New York). 2008;319(5870):1616.
Cannone JJ, Subramanian S, Schnare MN, Collett JR, Bioinformatics & Evolutionary Studies (IBEST),
D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, University of Idaho, Moscow, ID, USA
Müller KM, Pande N, Shang Z, Yu N, Gutell RR. The
comparative RNA web (CRW) site: an online database
of comparative sequence and structure information for
ribosomal, intron, and other RNAs. BMC Bioinfor- Introduction
matics. 2002;3:2.
Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-
Amplicon fingerprints are useful for ecological
Mohideen AS, McGarrell DM, Bandela AM,
Cardenas E, Garrity GM, Tiedje JM. The ribosomal studies of microbial communities. Most studies
database project (RDP-II): introducing myRDP space to date have used these techniques for determin-
and quality controlled public data. Nucleic Acids Res. ing how many species are present (richness, or
2007;35(Database issue):D169–72.
alpha diversity) in what ratios (beta diversity),
Desai N, Antonopoulos DA, Gilbert JA, Glass EM, Meyer
F. From genomics to metagenomics. Curr Opin which populations or species are present, and
Biotechnol. 2012;23(1):72–6. what metabolic or ecological functions the com-
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie munity and its constituents may provide. These
EL, Keller K, Huber T, Dalevi D, Hu P, Andersen
data inform downstream analyses to determine
GL. Greengenes, a chimera-checked 16S rRNA gene
database and workbench compatible with ARB. Appl the response of microbial ecosystems to environ-
Environ Microbiol. 2006;72(7):5069–72. mental change, the relationship between human
Eddy S. HMMER. http://hmmer.janelia.org (2010), microbiota and health, the ecological succession,
Accessed 2012-05-15.
the co-evolutionary constraints within and
Hartmann M, Howes CG, Abarenkov K, Mohn WW,
Nilsson RH. V-Xtractor: an open-source, high- between communities and their environments,
throughput software tool to identify and extract hyper- and more (Foster et al. 2012a).
variable regions of small subunit (16S/18S) ribosomal This encyclopedia entry focuses on bacterial
RNA gene sequences. J Microbiol Method. 2010;
fingerprinting, since it has a longer history and is
83(2):250–3.
Hartmann M, Howes CG, Veldre V, Schneider S, more mature than fingerprinting techniques for
Vaishampayan PA, Yannarell AC, Quince C, other kingdoms of life. But these techniques are
Johansson P, Björkroth KJ, Abarenkov K, Hallam SJ, in principle applicable to all microbial organisms,
Mohn WW, Nilsson RH. V-REVCOMP: automated
including archaea and eukarya such as fungi,
high-throughput detection of reverse complementary
16S rRNA gene sequences in large environmental diatoms and tiny arthropods, and viruses
and taxonomic datasets. FEMS Microbiol Lett. (assuming they are organisms). Amplicons for
2011;319(2):140–5. bacteria have been in use since the beginning of
Katoh K, Toh H. Recent developments in the MAFFT
the molecular revolution and their gene products
multiple sequence alignment program. Brief
Bioinform. 2008;9(4):286–98. have been well characterized. However, potential
Lupi R, de Meo PD, Picardi E, D’Antonio M, Paoletti D, amplicons exist for all organisms. As dominant as
Castrignanò T, Pesole G, Gissi C. MitoZoa: a curated bacterial life is on Earth, it is by no means the
mitochondrial genome database of metazoans for
only microbial realm of interest. Nonetheless, it
comparative genomics studies. Mitochondrion. 2010;
10(2):192–9. is the focus of this entry.
Pruesse E, Quast C, Knittel K, Fuchs BM, The terminology herein is taken from the bac-
Ludwig W, Peplies J, Glöckner FO. SILVA: a com- terial ecology literature. A population is
prehensive online resource for quality checked and
a collection of individuals of the same type. In
aligned ribosomal RNA sequence data compatible
with ARB. Nucleic Acids Res. 2007;35(21): sexual organisms, a population is typically
7188–96. a collection of individuals from the same species.
In asexual organisms, however, the species con- and highly conserved, providing a reliable guide
cept is problematic. In any case, one may be for fast and accurate alignment of large sets of
interested in discriminating to a subspecific or sequences (Nawrocki et al. 2009). This gene is
strain level, or indeed to higher levels. Thus, the strongly conserved, since it is a critical part of the
definition of a population is relative to the spe- translational machinery in bacteria (and some
cific question under investigation. A community archaea). So it is in principle useful for recogniz-
is a collection of co-occurring populations. ing deep phylogenetic divergences. And finally
Therefore, the number of distinct populations in the 16S rDNA gene shows little evidence of hor-
a community is the richness of that community. izontal transfer, which makes it more useful as
The diversity of a community includes the rela- a phylogenetic marker. Woese and Fox first dem-
tive abundance of populations and their potential onstrated the utility of 16S rDNA analysis with
interactions. their discovery that archaea are a distinct king-
Amplicon fingerprinting techniques have dom of life (Woese 2004; Woese and Fox 1977).
developed in tandem with new sequencing tech- Several hypervariable regions in the 16S
nologies. Current fingerprinting approaches are rDNA gene provide enough sequence variation
particularly well adapted to modern high- to distinguish bacterial populations, sometimes to
throughput sequencing and have largely replaced the strain level. Hypervariable regions typically
older techniques based on electrophoresis or cap- contain loops in the rRNA secondary structure,
illary sequencers. The older approaches are still which change more as species evolve, since they
useful for crude estimates using older, and there- are not as structurally constrained as stems. Reli-
fore inexpensive and less used, equipment. How- able primers exist for nine regions, known as V1
ever, as the cost of new sequencing technologies through V9, that were short enough to be
drops, more modern amplicon fingerprinting completely sequenced by Sanger sequencing
approaches are likely to continue to replace when the primers were developed (Kim et al.
their predecessors. 2011). Hypervariable regions differ in the speci-
Amplicon fingerprinting techniques are cul- ficity and precision with which they can distin-
ture independent, meaning that it is unnecessary guish different types of organisms, so the choice
to grow cultures of individual populations or of amplicon primers is study specific (Schloss
communities before extracting DNA. This is par- 2010; Bazinet and Cummings 2012). As newer
ticularly significant in the microbial world, since sequencing technologies have increased the
most bacteria and archaea cannot currently be length of genetic fragments that can be
grown in the lab. Estimates show that as much sequenced, it has become standard practice to
as 97 % of existing microbial biodiversity is amplify from one end of one region to an end of
currently uncultivable (Whitman et al. 1998). another region. For example, V35 and V69,
These techniques enable ecological and func- which span regions 3–5 and 6–9, respectively,
tional analysis of communities that largely con- are common in the literature.
sist of otherwise inaccessible “biotic dark Since it has become possible to sequence
matter.” much larger fragments, it has become common
to attach “bar code adapters” to primers. This
makes it easier to multiplex samples from several
Choosing Amplicons different experimental treatments into single
sequencing runs and then separate the data algo-
With bacteria, the amplicon of choice has long rithmically. In theory, one could improve resolu-
been the gene for the small RNA subunit of the tion of fingerprinting techniques by multiplexing
ribosome, known as 16S rDNA for its size several primers for multiple hypervariable
(16 Svedberg units). Nearly universal primers regions, as if fingerprinting multiple fingers at
exist for several regions of this gene. The second- the same time. However, most projects currently
ary structure of 16S rDNA is well characterized work with only single sets of primers. However,
very soon it will be feasible to sequence the entire 16S rDNA gene, which of course will comprise all hypervariable regions, making the choice of primers irrelevant for microbial community fingerprinting. An intriguing possibility will be to multiplex fingerprinting from multiple genes that expand analysis beyond the bacterial kingdom, for example, multiplexing 16S rDNA and 18S rDNA amplicons.
Databases of full 16S rRNA sequences exist for hundreds of thousands of microbes (Cole et al. 2007; DeSantis et al. 2006). A typical workflow searches these databases for putative homologues to amplicons. The annotations for these hits then inform likely taxonomic and functional associations (Kuczynski et al. 2010).
Modern databases, however, have serious limitations. It is rarely possible to classify bacteria below the family level, since there are vastly more different populations than have been observed. As cultivation-independent sequencing methods grow more popular, new sequences in the databases tend to be from unclassified, and therefore unannotated, populations. Annotations in existing databases are highly biased toward pathogenic or other human-associated organisms. Very closely related genera, species, and strains can differ dramatically in their metabolic potential and preferred ecological habitats. Finally, different species vary widely in their 16S rDNA copy numbers, making it easy to confuse dosage effects and within-individual sequence variation with species abundances.
Other genetic targets may serve the same function as 16S does for microbial ecology, provided they exhibit sufficient variation, stability, and vertical inheritance. For example, the RNA polymerase β-subunit gene, rpoB, is a single-copy gene and has been recommended as an alternative to 16S rDNA. Other highly conserved housekeeping genes, such as cytochrome B (cytB), which is involved in electron transport in aerobic organisms, may be more appropriate for plant studies or deep resolution of Cyanobacteria. And of course, eukaryotes and some Archaea do not have 16S ribosomal subunits, so a more appropriate gene is their small subunit analogue, the 18S rDNA gene. Currently, these alternatives do not have databases comparable to those available for 16S rDNA and have fewer useful primers.
No fingerprinting technique based on a single gene, however carefully chosen, can hope to distinguish all microbes or fully elucidate all microbial metabolic and ecological functions. Even when it becomes feasible to routinely sequence entire 16S rDNA genes from individual cells, gene-based amplicon analysis will only produce gene genealogies rather than organismal phylogenies or full metabolic profiles. Multiplexing amplicon processing for several genes may improve phylogenetic resolution. But as it becomes feasible to sequence entire genomes for whole communities with shotgun metagenomics or single-cell genomics, it will become unnecessary to choose target amplicons at all.
Fingerprinting Techniques
Fragment-based techniques use the length of amplicon fragments as fingerprints. The spectra of these lengths indicate which microbial populations were in the original sample, assuming that there is sufficient variation in the amplicon fragments. We present the three most common fragment-based techniques here.
Temperature gradient and denaturing gradient gel electrophoresis (TGGE and DGGE) separate the DNA fragments by gel electrophoresis along a temperature or chemical denaturant gradient, so that fragments of similar length are resolved by their sequence-dependent melting behavior (Fischer and Lerman 1979). The resulting band patterns are then the community fingerprints. Presumably, more complex patterns represent more complex communities, and patterns from distinct populations contribute additively to the overall pattern, so that one can decompose the community fingerprint into constituent populations.
Automated ribosomal intergenic spacer analysis (ARISA) determines the spectra of the intergenic spacer region (ITS) between small and large ribosomal subunit genes in bacteria (Fisher and Triplett 1999). The flanking genes are highly conserved, making ITS a reasonable amplicon. Moreover, the length of the ITS is highly
variable between bacterial species, so a spectrum of ITS lengths is a reasonable fingerprint.
Terminal restriction fragment length polymorphism (TRFLP) analysis attaches fluorescent labels to the amplicon PCR primers before restriction digestion, so that only the terminal restriction fragments adjacent to a primer are labeled (Schütte et al. 2008). One can then separate the labeled fragments by size, for example, in a capillary sequencer. The spectra of the lengths of these fragments are then the fingerprint for the study sample.
All three fragment-based fingerprinting techniques have inherent biases and limitations, and all three are still commonly used. A PubMed search on 12 July 2012 for the terms “DGGE,” “ARISA,” and “TRFLP” returned 5658, 119, and 107 hits, respectively, with several recent citations indicating current use of all three techniques.
Bioinformatics has been critical for interpreting fragment-based amplicon fingerprint data. A common approach has been to perform in silico analyses of existing databases, to determine length spectra for known sequences. This provides a kind of “reverse telephone book” with which one can translate empirical fingerprints into possible population compositions. Two typical tools for this sort of analysis, focused on TRFLP and still in heavy use, are the Microbial Community Analysis (MiCA) suite and the TRFLP Analysis Program (TAP-TRFLP) (Shyu et al. 2007; Cole et al. 2009).
Sequence-based fingerprinting techniques use the amplicon sequences themselves as fingerprints, rather than their length spectra. Current sequencing technologies, also known as next-generation sequencing, have made it feasible to sequence millions of amplicons in a single run. Different sequencing technologies vary in their sequencing accuracy, typical type of sequencing errors, and length of amplicon (Foster et al. 2012b). Consequently, the vast majority of current amplicon fingerprinting projects use amplicon sequences rather than derived data such as lengths.
Bioinformatics to analyze sequence-based fingerprints is a very active area of research. New and improved algorithms are constantly emerging for cleaning and quality control of raw data, detecting erroneous sequences (such as chimeras), aligning sequences, clustering fingerprints by similarity, searching for similar annotated sequences in existing databases, and more. Two software packages aggregate state-of-the-art algorithms and pipelines and make them accessible to the typical user, namely, Quantitative Insights Into Microbial Ecology (QIIME) and MOTHUR (Caporaso et al. 2010; Schloss et al. 2009). Both packages are compatible with most computing platforms and are updated regularly with the newest algorithms from the research community. Both have extensive tutorials and reference documentation. MOTHUR is open source. Both packages perform most standard diversity analyses and produce datasets that can be imported into the R statistical environment for further analysis (Beck et al. 2011).
To summarize, amplicon choice remains important to fingerprinting analyses, though fragments of the 16S rDNA gene remain the amplicon of choice for bacterial community diversity studies. Amplicon sequences are becoming the fingerprints of choice, though derived data such as length spectra for restriction fragments or intergenic spacer regions are still widely used. Future sequencing technologies are sure to change the fingerprinting landscape significantly. Finally, amplicon fingerprinting analysis requires extensive bioinformatic support, and appropriate tools are available.
Cross-References
▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Metagenomics, Metadata, and Meta-analysis
▶ Microbial Ecology in the Age of Metagenomics: An Introduction
▶ New Computational Methodologies to Understand Microbial Diversity
▶ Next-Generation Sequencing for Metagenomic Data: Assembling and Binning
▶ Protein-coding Genes as Alternative Markers in Microbial Diversity Studies
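The in silico “reverse telephone book” described in this entry for TRFLP can be illustrated with a minimal Python sketch. It assumes a single labeled primer at the 5' end of each amplicon and uses the MspI recognition site CCGG, cut after the first base. The reference sequences, taxon names, and function names are hypothetical and are not taken from MiCA or TAP-TRFLP.

```python
from collections import defaultdict


def terminal_fragment_length(seq, site="CCGG", cut_offset=1):
    """Return the length of the labeled 5' terminal fragment.

    The fragment runs from the start of the amplicon to the first cut.
    Returns None when the recognition site does not occur.
    """
    pos = seq.upper().find(site)
    if pos == -1:
        return None
    return pos + cut_offset


def build_length_index(references, site="CCGG", cut_offset=1):
    """Map each predicted terminal fragment length to the taxa that produce it."""
    index = defaultdict(list)
    for name, seq in references.items():
        length = terminal_fragment_length(seq, site, cut_offset)
        if length is not None:
            index[length].append(name)
    return dict(index)


# Hypothetical, very short reference amplicons (illustration only).
references = {
    "taxon_A": "ATGCATGCCGGTTAACC",
    "taxon_B": "ATGCCGGATTTACCGGA",
    "taxon_C": "GGGTTTACCGGAT",
    "taxon_D": "ATGCATGCATGCATGCA",  # no CCGG site, so no labeled fragment
}

index = build_length_index(references)
for length in sorted(index):
    print(length, index[length])
```

In this toy index an observed peak at length 8 is compatible with both taxon_A and taxon_C, which illustrates why a length spectrum translates into possible, rather than unique, population compositions.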
Microbial Ecology in the Age of Metagenomics: An Introduction 475 M
References
Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinformatics. 2012;13(1):92. doi:10.1186/1471-2105-13-92.
Beck D, Settles M, Foster JA. OTUbase: an R infrastructure package for operational taxonomic unit data. Bioinformatics (Oxford, England). 2011;27(12):1700–1. doi:10.1093/bioinformatics/btr196.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6. doi:10.1038/nmeth.f.303.
Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM, et al. The Ribosomal Database Project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007;35:D169–72. doi:10.1093/nar/gkl889.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–5. doi:10.1093/nar/gkn879.
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72(7):5069–72. doi:10.1128/AEM.03006-05.
Fischer SG, Lerman LS. Length-independent separation of DNA restriction fragments in two-dimensional gel electrophoresis. Cell. 1979;16(1):191–200.
Fisher MM, Triplett EW. Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol. 1999;65(10):4630–6.
Foster JA, Moore JH, Gilbert JA, Bunge J. Microbiome studies: analytical tools and techniques. In: Altman RB, Dunker AK, Hunter L, Klein TE, editors. Pac Symp Biocomput. Singapore: World Scientific; 2012a. p. 200–2.
Foster JA, Bunge J, Gilbert JA, Moore JH. Measuring the microbiome: perspectives on advances in DNA-based techniques for exploring microbial life. Brief Bioinform. 2012b. doi:10.1093/bib/bbr080.
Kim M, Morrison M, Yu Z. Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes. J Microbiol Methods. 2011;84(1):81–7. doi:10.1016/j.mimet.2010.10.020.
Kuczynski J, Liu Z, Lozupone C, McDonald D. Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nat Methods. 2010;7(10):813–9. doi:10.1038/nmeth.1499.
Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics (Oxford, England). 2009. doi:10.1093/bioinformatics/btp157.
Schloss PD. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol. 2010. doi:10.1371/journal.pcbi.1000844.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41. doi:10.1128/AEM.01541-09.
Schütte UME, Abdo Z, Bent SJ, Shyu C, Williams CJ, Pierson JD, Forney LJ. Advances in the use of Terminal Restriction Fragment Length Polymorphism (T-RFLP) analysis of 16S rRNA genes to characterize microbial communities. Appl Microbiol Biotechnol. 2008;80(3):365–80. doi:10.1007/s00253-008-1565-4.
Shyu C, Soule T, Bent SJ, Foster JA, Forney LJ. MiCA: a web-based tool for the analysis of microbial communities based on terminal-restriction fragment length polymorphisms of 16S and 18S rRNA genes. J Microb Ecol. 2007;53(4):562–70.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci U S A. 1998;95(12):6578–83.
Woese CR. A new biology for a new century. Microbiol Mol Biol Rev MMBR. 2004;68(2):173–86. doi:10.1128/MMBR.68.2.173-186.2004.
Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A. 1977;74(11):5088–90.
Microbial Ecology in the Age of Metagenomics: An Introduction
Jianping Xu
Department of Biology, McMaster University, Hamilton, ON, Canada
Introduction
Microbial ecology is an interdisciplinary science related to microbiology and ecology. Its investigations range from analyzing the diversity of microorganisms within and among the different ecological niches on Earth to understanding the interrelationships among microorganisms, between microorganisms and macroorganisms, and between microorganisms and their abiotic environmental factors. Microbial diversity and
the interactions between microbes and other organisms can be analyzed at morphological, structural, physiological, and/or genetic levels. The recent advances in high-throughput technologies, especially in genome sequencing, are reshaping our understanding of microbial ecology. This entry introduces the fundamental concepts and issues in microbial ecology, with a brief focus on how metagenomics tools are impacting microbial diversity studies.
Microorganisms and Microbiology
A microorganism is any life form that cannot be easily seen by the naked human eye. Microorganisms encompass morphologically, structurally, and phylogenetically very diverse forms of life and traditionally include both acellular life forms such as viruses and cellular life forms in all three domains, the Bacteria, Archaea, and Eukarya (Woese 1987). Organisms in Bacteria and Archaea are completely microbial. Even in Eukarya, macroorganisms such as animals and plants represent only parts of two of at least eight superkingdoms within this domain, while the remaining six or more superkingdoms are exclusively microbial (Baldauf 2003). While most microorganisms cannot be seen at all by the naked eye, for many microorganisms, certain stages of their life cycles can be easily visualized. For example, mushrooms, the sexual reproductive structures of certain groups of fungi, are a common occurrence on forest floors at certain times of the year.
Microorganisms were first seen and described by Antonie van Leeuwenhoek in 1676 when he used a microscope to examine a variety of natural and human-made objects. Subsequent developments in methodologies for growing, purifying, and studying microorganisms ushered in a golden era of microbiology, which is still going strong today. Microorganisms have now been found in virtually every habitable niche on Earth, from hot springs to salt lakes, from frozen environments in Antarctica and glaciers at the top of mountains to hydrothermal vents at the bottom of the deepest oceans. Current estimates put the number of microbial cells on Earth at around 5.0 × 10^30, about eight orders of magnitude greater than the number of stars in the observable universe. Indeed, despite their small sizes, the large number of microbial cells on Earth makes microorganisms the single largest carbon sink, exceeding that of plants and animals. Their large number, broad ecological distribution, and vast diversity of metabolic pathways unparalleled by macroorganisms make microbes indispensable and central to our considerations of global geochemical cycles and environmental issues.
Most of the early methodologies for studying microorganisms are still widely used today, and many discoveries about the fundamental features of life were made using microorganisms as model systems. Among the many practical contributions of microbiology, microbiological discoveries have significantly impacted (and are continuing to impact) the control and prevention of diseases in plants, animals, and humans. However, techniques and methodologies alone were insufficient for establishing microbiology as a fledgling field of scientific investigation. Reductionist approaches and guidelines for hypothesis testing, such as Koch’s postulates for identifying the causative agents of infectious diseases, were pivotal for the development of microbiology. Interestingly, with the rapid developments both in high-throughput experimental tools (e.g., Xu 2014) and in bioinformatics software capable of analyzing large and diverse datasets, holistic views about microorganisms are beginning to attract significant scientific attention. Indeed, aside from the traditional subdisciplines such as microbial cell biology, biochemistry, physiology, genetics, ecology, and evolutionary biology, microbiology now also includes microbial genomics, systems microbiology, microbial community ecology, and ecosystem microbiology. In addition, the diverse subdisciplines of microbiology have become integral components of agriculture, forestry, animal husbandry, fishery, mining, environmental sciences, and medicine.
Microbial Ecology
Broadly speaking, microbial ecology is the scientific discipline that examines the relationships between microorganisms and their environments. Ecologically oriented studies of microbes were performed as soon as their existence was realized. However, the term microbial ecology came into frequent use only in the early 1960s, and its emergence as an independent field of investigation was propelled by both the awakening public interest in environmental issues and the increasing recognition of the essential roles of microbes in Earth’s geochemical cycling and in human welfare.
At present, microbial ecological investigations can be grouped into three broad types: (i) identifying the taxonomic, structural, and functional diversities of microorganisms in natural ecological niches; (ii) analyzing the relationships among microorganisms, between microorganisms and macroorganisms (plants and animals including humans), and between microorganisms and environmental factors (such as nutrients, temperature, pH, pressure, oxygen); and (iii) investigating the mechanisms that generate and maintain the diversity of microorganisms and their relationships with each other and with biotic and abiotic factors in natural environments. Among the three types of research activities, most metagenomics studies of microbial ecology have focused on microbial diversity in natural environments.
Below is a brief introduction to metagenomics and how metagenomics approaches have shaped our understanding of microbial diversity. For the impact of metagenomics tools on the other two aspects of microbial ecology, please refer to other entries in this encyclopedia.
Metagenomics
Metagenomics refers to the field of study that analyzes genetic materials obtained directly from environmental samples. Several other terms, such as environmental genomics, ecological genomics, and community genomics, have emerged over the years to describe direct analyses of environmental DNA (Marco 2009). However, metagenomics has emerged as the favorite term, and the prefix “meta-” is now used to describe the direct analyses of environmental RNA, proteins, and metabolites, corresponding respectively to meta-transcriptomics, meta-proteomics, and metabolomics (Fig. 1). Together, the direct analyses of biological molecules from natural environments constitute the field of “meta-omics” (Fig. 1).
The different subfields of meta-omics analyze complementary sets of biological molecules directly from the environments that together help provide holistic views of the natural biological communities. For example, analyses of environmental DNA samples can provide estimates of the taxonomic and genome diversities of organisms in ecological niches in nature, whereas the extracted RNA, protein, and metabolites provide information about the functions of the environmental genomes, including the degrees to which genes are transcribed and translated, and the types and amounts of metabolites generated in natural ecological niches. In addition, to properly analyze and integrate the diverse biological datasets, effective “meta-programs” are also needed, and several such programs are currently available (de Bruijn 2011).
Because biological materials (e.g., different types of microbial cells) can be very different from each other in terms of their size, morphology, and structure, obtaining DNA (and/or RNA, protein, and metabolites) directly from environmental samples that can realistically reflect their native biological states may require extensive sample treatments. Such treatments may include sorting biological samples (including different types of cells and viral particles) based on sizes, removing materials that inhibit downstream reactions, and applying different extraction methods that permit the lysis of cells with specific types of cell walls. Once the pools of targeted biological materials are obtained, additional treatments of these materials may be needed before they are channeled into high-throughput analytical platforms. Below is a brief overview of the applications of metagenomic tools on estimates of microbial diversity.
Microbial Ecology in the Age of Metagenomics: An Introduction, Fig. 1 Legend: an overview of meta-omics: the direct analyses of biological molecules such as DNA, RNA, protein, and metabolites using high-throughput technologies. To effectively utilize such data, suites of “meta-programs” are required to analyze and integrate the diverse meta-datasets (Modified from Xu 2010)
Estimates of Microbial Genetic Diversity Using Metagenomic Data
Depending on the objectives of research, microbial diversity in the environment can be expressed as a quantitative measure using several common indices such as phylogenetic diversity, species diversity, genotype diversity, gene diversity, and nucleotide diversity. Above the species level, microbial diversity can be quantified based on evolutionary distances among the observed taxonomic groups from a specific environment. Below the species level, microbial diversity can be described using population genetic parameters such as nucleotide diversity, gene diversity, and genotype diversity. Nucleotide diversity, gene diversity, and genotype diversity refer respectively to the probability that two randomly drawn bases at a specific site of the genome, alleles of a specific gene locus, and genotypes in a population will be different (Xu 2010). At the species level, microbial diversity is measured as species diversity. There are various measures of species diversity. One commonly used measure is the probability that two randomly drawn individuals in an environment belong to different species. This measure takes into account both the number of species (species richness) and the frequency of each species (species abundance) in the environment. Conceptually, this measure of species diversity is similar to those used for nucleotide diversity, gene diversity, and genotype diversity.
Microbial species diversity is among the most commonly analyzed and compared measures in microbial ecological studies. The earliest and still one of the most common metagenomics methods for estimating species diversity of prokaryotes (including both Bacteria and Archaea) in natural environments is the direct analysis of sequence variation at the 16S ribosomal RNA gene
(Pace et al. 1985). These analyses may involve the polymerase chain reaction (PCR), denaturing gradient gel electrophoresis (DGGE), cloning, and sequencing. A broadly accepted criterion to delineate prokaryote species is that two strains belong to the same species if their 16S rRNA genes show at least 97 % sequence similarity (de Bruijn 2011). In eukaryotic microbes such as fungi, a similar criterion (97 % sequence similarity) is often used, albeit for a different DNA fragment, the internal transcribed spacer (ITS) regions of the ribosomal RNA gene cluster (Schoch et al. 2012). However, in more recent analyses, direct sequencing of extracted environmental DNA using NGS technologies is increasingly used. These analyses suggest that the cultured microbes from most ecological niches represent <1 % of the true microbial species richness in their respective niches and that many of these uncultured microbes belong to distinct and previously unknown phylogenetic groups (de Bruijn 2011). Metagenomic analyses, especially those based on NGS technologies (Xu 2014), have generated very large datasets from environments including the human body (e.g., the human microbiome initiative; http://nihroadmap.nih.gov/hmp/) and the oceans (the Global Ocean Sampling surveys; http://www.jcvi.org/cms/research/projects/gos/overview/). Scientists from many countries participate in these large-scale projects.
The species diversity studies based on DNA sequences at the 16S rRNA gene are increasingly complemented by other types of data that augment our understanding of microbial diversity in natural environments. One type of such data is genetic variation among strains within a species. With high-throughput DNA sequencing, genetic variants of a gene fragment from different strains of the same species in the same ecological niche can be reliably identified (de Bruijn 2011). With sufficient genome coverage, it is also possible to uncover genome variants. Such information allows direct comparisons of gene frequencies and genotype frequencies among microbial populations from diverse ecological niches, including inferences of the modes of reproduction in nature (Xu 2010). The second type of complementary data is the messenger RNA (mRNA) sequences obtained from environmental samples. In combination with DNA sequence data, the mRNA data allow inferences of the potential physiological activities of the different groups of microorganisms in natural environments (de Bruijn 2011).
Summary
This entry serves as an introduction to microorganisms, microbiology, microbial ecology, and metagenomics. The impact of metagenomics on estimates of microbial diversity was briefly discussed. With the increasing application of high-throughput technologies in analyzing biological materials (DNA, RNA, proteins, and metabolites) directly from environments, the future of microbial ecology is looking brighter than ever.
Cross-References
▶ Microbial Diversity, Bar-Coding Approaches
References
Baldauf SL. The deep roots of eukaryotes. Science. 2003;300:1703–6.
de Bruijn F. Handbook of molecular microbial ecology I: metagenomics and complementary approaches. New Jersey: Wiley/Blackwell; 2011. p. 113–22.
Marco D. Metagenomics: theory, methods and applications. Norfolk: Caister Academic Press; 2009.
Pace NR, Stahl DA, Olsen GJ, Lane DJ. Analyzing natural microbial populations by rRNA sequences. ASM News. 1985;51:4–12.
Schoch CL, The Fungal Barcode Consortium. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci U S A. 2012;109:6241–6.
Woese CR. Bacterial evolution. Microbiol Rev. 1987;51:221–71.
Xu J. Microbial population genetics. Norfolk: Caister Academic Press; 2010.
Xu J. Next-generation sequencing: technologies and applications. Norfolk: Caister Academic Press; 2014.
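The species diversity measure described in this entry, the probability that two randomly drawn individuals belong to different species, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a prescribed method: individuals are drawn without replacement, the species counts are invented, and the function names are hypothetical.

```python
def species_richness(counts):
    """Number of species observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def species_diversity(counts):
    """Probability that two individuals drawn without replacement
    from the sample belong to different species."""
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two individuals")
    # Number of ordered pairs in which both individuals share a species.
    same = sum(n * (n - 1) for n in counts.values())
    return 1.0 - same / (total * (total - 1))


# Two invented communities with equal richness but different abundances.
even = {"sp1": 25, "sp2": 25, "sp3": 25, "sp4": 25}
skewed = {"sp1": 97, "sp2": 1, "sp3": 1, "sp4": 1}

for name, community in (("even", even), ("skewed", skewed)):
    print(name, species_richness(community), round(species_diversity(community), 4))
```

Both invented communities have a richness of 4, but the even community scores about 0.76 and the skewed one about 0.06, which shows how the measure reflects species abundance as well as species richness.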
M 480 Microbial Ecosystems, Protection of
patterns into community ecology, and a trait-centered perspective would be a tractable way for microbial ecology to address the significance of microbial diversity for ecosystem functioning. Considering that BEF studies in the plant sciences are also incorporating traits rather than species richness alone, the trait-centered approach may offer options for convergence of macro- and microbial ecology, which will be essential for including microbes in conservation policy.

Lack of Microbial Biogeography?

The conventional view of the distribution of microbial species through space and time has been dominated for decades by the “Baas-Becking” hypothesis: “everything is everywhere, but the environment selects.” Under this view, the lack of dispersal limitation in microorganisms would ensure a global distribution, while local deterministic factors would determine the relative abundance of “latent” and “flourishing” species. This view is in sharp contrast with plants and animals, which show clear taxa-area relationships and biogeography. The Baas-Becking legacy is likely one of the main reasons why microbial diversity is not on the biodiversity-conservation agenda. However, in the last decade a number of studies have demonstrated species-area relationships, biogeography, and spatial patterns at various scales for microbes (see Zhou et al. 2008). In addition, microbial endemism has been reported, while studies using high-throughput sequencing technology have clearly demonstrated the presence of habitat-specific communities shaped by edaphic factors and historical contingencies. A meta-analysis of all currently available 16S rRNA gene sequences revealed clear environmental distributions at the genus or species level, with soil and freshwater as the least selective habitats and marine, animal, and thermal habitats as the most selective (Tamames et al. 2010). The emerging pattern in microbial biogeography studies is that not all microbial communities occur everywhere and that local conditions can lead to unique associations of microbes. However, whether microbes obey the same distribution and community assembly rules as macroorganisms can only be answered when it is possible to study complete microbial populations at ecologically relevant scales.

Inability to Link Species Diversity to Function

Connecting individual microbial species to the biogeochemical processes they catalyze is a prerequisite for assessing BEF relationships in microbial communities. However, considering the lack of a species concept, the metabolic versatility, the large number of unknown species, and the scale issue involved, this is the central problem area in the field of environmental microbiology. The majority of studies in the literature have relied on correlating changes in activity to changes in community composition or diversity, and only a few articles can actually show a causal relationship. A myriad of techniques have been developed for linking diversity and function (see Wagner 2009). However, many of these techniques were based on the analyses of ribosomal RNA or mRNA transcripts of functional genes, indicating only the potential to be involved in specific processes. The use of stable isotope probing (SIP) has brought a major breakthrough in environmental microbiology (see Murrell and Whiteley 2011). The general approach is that stable-isotope-labeled (13C/15N) substrates are incorporated into taxonomically relevant molecules (RNA/DNA, lipids, proteins). Only the microbes which have actively incorporated the stable isotopes are detected when analyzing RNA/DNA or PLFA (phospholipid fatty acids) using GC-IRMS (gas chromatography-isotope ratio mass spectrometry) or proteins using GC-MS or LC-MS (liquid chromatography-mass spectrometry). The major disadvantages of SIP are the use of unnaturally high substrate concentrations in the case of DNA- and RNA-based SIP, the different label uptake rates per species, and cross feeding. More recent work has addressed the shortcomings of traditional SIP studies by using magnetic bead capturing of mRNA, Raman spectroscopy, and NanoSIMS (nanoscale secondary ion mass spectrometry) (see Murrell and Whiteley 2011), also in combination with metagenomic techniques, uncovering active species of which no cultured representatives are available or discovering
unknown pathways or genes involved in biogeochemical processes (see Chen and Murrell 2010). The most recent addition to the SIP repertoire combined microarray detection and NanoSIMS, achieving sensitivity to low label incorporation levels and high phylogenetic resolution without PCR amplification of the target community (Mayali et al. 2012). The challenge in applying SIP-based techniques will be in BEF experiments, where experimental designs allowing for causal and mechanistic conclusions require high sample throughput.

Resistance, Resilience, and Redundancy of Microbial Communities

The absence of microbial diversity from the BEF debate, conservation issues, and global biogeochemical process models is also caused by the paradigm of microbial omnipresence, high adaptability, and functional redundancy. Indeed, resilience after reduced diversity and redundancy of species carrying out similar functions has been demonstrated (see Bodelier 2011). But is this the rule? A number of studies have demonstrated a direct relationship between diversity and ecosystem process rate (see Bodelier 2011). Recently, a comprehensive meta-analysis demonstrated that in more than 70 % of 110 studies, microbial community composition was not resistant to disturbances (fertilization, CO2 increase, temperature, carbon amendment), where resistance is the degree to which community composition remains unchanged when disturbed (Allison and Martiny 2008). This held true for broad taxonomic groups (fungi, bacteria, archaea) as well as narrow groups with specific functions (methane oxidation, nitrification). The same study demonstrated that the resilience (i.e., the rate at which microbial community composition returns to its original composition after being disturbed) is on the order of years. Fertilization even led to differences in communities of N-cycling microbes (nitrifiers, denitrifiers) for more than 50 years (Hallin et al. 2009). Similar long-lasting effects have also been observed for methane-consuming microbes. Microbes consuming atmospheric methane are responsible for 6–10 % of global methane consumption. The process is sensitive to agricultural practices, and recovery after land abandonment can take decades, which coincides with an increase in diversity of these microbes (Fig. 1; Levine et al. 2011). Aspects of community composition other than richness per se have been demonstrated to regulate the stability of biogeochemical processes. The initial evenness of redundant community members was demonstrated to be important in resistance to salt stress in denitrifying communities (Wittebolle et al. 2009), indicating that the relative abundance of the populations in a community is an important determinative factor for process stability, even in redundant communities. Functional redundancy sensu stricto is difficult to assess in microbial communities, since it requires knowing the contribution of individual community members to processes and separating diversity from environmental factors. The stability of a particular function (e.g., methane conversion) in time is very likely affected by more properties or traits of species than the expression of that one particular functional gene only, e.g., response to inhibitors or general adaptation of species to a particular environment. Moreover, populations of interacting microbes on microbially relevant scales may not consist of many different species, also due to spatial arrangement or isolation, e.g., along roots, soil pores, plant leaves, biofilms, or microbial flocs in sewage treatment. The growing body of experimental evidence suggests that microbial communities can be sensitive to disturbances and that resilience is linked to diversity. However, the majority of studies are descriptive, correlative, or strongly reductionist in nature, not allowing for causal or mechanistic conclusions.

Closing the Gaps

It is obvious that the omission of microbial communities from the BEF debate and from the management and conservation of ecosystems is due to a lack of understanding of the functioning and composition of environmental microbial communities. The controversy between huge diversity and redundancy on the one hand and the lack of knowledge on 99 % of that diversity on the other hand means that we do not know what we have to protect and what might have been lost

Microbial Ecosystems, Protection of, Fig. 1 The recovery of methanotroph diversity and atmospheric CH4 consumption following row-crop agriculture. Increase in methanotroph diversity (open symbols) and CH4 consumption (closed symbols) as a function of time since cessation of agriculture. The data clearly show that agricultural use diminishes methanotrophic diversity as well as function and that it can take decades before recovery takes place. All measurements (diversity and consumption) are annual averages with error bars representing standard errors. Land-use treatments are as follows: agricultural management of historically tilled lands (AG), early successional fields abandoned from agriculture in 1989 (ES), successional forests abandoned from agriculture in the 1950s (SF), managed grasslands on never-tilled soil (MG), and deciduous forests (DF) (From Levine et al. 2011, with permission)

already. This controversy hampers the examination of the importance of microbial diversity for ecosystem functioning. Consequently, BEF studies in environmental microbiology are largely descriptive in nature and disconnected from ecological concepts. Approaches have been “top-down” or “bottom-up,” treating species/genotypes, community traits, and interactions as a “black box” (see Bodelier 2011). However, the rapid methodological developments of the last decades are narrowing down the limitations which kept environmental microbiology at the descriptive level. The “omic” techniques enable studying community ecology and physiology of known as well as unknown microbial species, and a systems biology approach for microbial communities is not out of reach (Raes and Bork 2008). In situ adaptation of community members as well as in situ profiling of whole-genome transcripts and proteins of individual species is feasible. In addition, methodology and concepts are emerging, enabling individual-based physiology and ecology and even interactions on a microbially relevant scale. Theoretical and conceptual approaches from macroecology are being applied to understand microbial community structure and to link it to ecosystem processes (Bodelier 2011). Ultrahigh-throughput community assessment methods will facilitate processing of large numbers of samples and replicates in order to obtain sufficient information allowing for experimental designs which yield mechanistic understanding of environmental microbial communities, eventually leading to the opening of the “black box.”

Microbial Community Conservation

The fact that there are no microbial species on the Red List, nor microbial communities in nature conservation policy, does not mean that there are no initiatives toward conservation of microbial
communities. From the medical as well as the biotechnological perspective, there is a need for the preservation of microbial genetic diversity, which is mainly done by storing isolated and described microbial species in public culture collections (e.g., ATCC (http://www.lgcstandards-atcc.org/), DSMZ (http://www.dsmz.de/), and NCIMB (http://www.ncimb.com/)). However, since most of the diversity is represented by uncultured and uncharacterized microbes that are part of environmental communities, we run the risk of losing genetic diversity whose value we do not yet know, in itself a good reason for conservation. Well-known examples of biotechnological spin-offs of environmental microbial communities have led to conservation efforts. The discovery of the heat-resistant Taq polymerase enzyme, used in PCR, in the bacterium Thermus aquaticus (http://en.wikipedia.org/wiki/Thermus_aquaticus) in hot springs in Yellowstone National Park has led to declaring these hot springs as conservation targets in order to preserve the microbial genetic potential, mainly for biotechnological applications (http://serc.carleton.edu/microbelife/topics/bioprospecting.html), thereby making these hot springs the first environmental microbial conservation areas. Another development that contributes substantially to the “protection” of environmental microbial communities is the TEEB (The Economics of Ecosystems and Biodiversity) initiative, which expresses the value of ecosystems, ecosystem services, and biodiversity in monetary terms (http://www.teebweb.org/). Although this valuing of ecosystems is controversial and anthropocentric, it has definitely created awareness of biodiversity among policy makers, politicians, and industry. The assessment of Earth's ecosystems, biodiversity, and ecosystem services by 1,300 experts (Millennium 2005) identified key areas of ecosystem protection and conservation in order to keep our planet habitable. In all of these ecosystems, microbes play pivotal roles, a fact which is generally being recognized. Soils in particular are a main focus when it comes to microbial processes because of the many ecosystem services soils and soil microbes provide and because soils harbor the largest source of microbial biodiversity. It is within soil conservation that many initiatives are taken toward conservation of soil biodiversity, like the EU soil framework directive in development (http://ec.europa.eu/environment/soil/biodiversity.htm), where microbes are also explicitly taken into account (Gardi et al. 2009). Combined with the already existing EU habitat conservation legislation (http://ec.europa.eu/environment/nature/natura2000/index_en.htm), important habitats containing a large part of microbial diversity on Earth are conserved. However, we still need to know what it is that needs to be preserved and what we can potentially lose or affect by climate change, habitat destruction, land-use change, urbanization, etc. This requires inventories of microbial diversity and functioning. Despite the serious limitations in methods to assess the sheer endless microbial diversity, there are a number of ongoing initiatives that come as close as possible. The Earth Microbiome Project (http://www.earthmicrobiome.org/) is an initiative to assess functional microbial diversity in more than 200,000 environmental samples, which will be collected and analyzed in a coordinated way and will be complemented with essential metadata which can be used to infer ecological or biogeographical aspects of the communities in the database. A similar initiative focusing on marine microbial communities has already been in place for a number of years (http://www.coml.org/international-census-marine-microbes-icomm), while the TerraGenome project specifically focuses on soils (http://www.terragenome.org/). Hence, many steps on the “roadmap toward microbial conservation,” as put forward by Cockell (Cockell and Jones 2009) a number of years ago, have been taken. Projects attempting to make microbial diversity inventories have been initiated, and scientific approaches to link microbial species to ecosystem functions are being developed. Nevertheless, “the Red List” species approach will definitely not be applicable to microbes, as already pointed out above. Hence, we need different approaches and concepts regarding “conservation units” for microbial
communities which are useful and understandable for policy makers and politicians. Habitat conservation is a good starting point, but we can probably also put forward “vulnerable” nonredundant environmental microbes which carry out important ecosystem functions, like methane oxidizers, which may be affected by anthropogenic disturbance that diminishes their functioning in the environment for decades (see Fig. 1). In addition, educating the public, policy makers, and politicians on the importance and sheer uniqueness of microbes and microbial communities will be of utmost importance in the process of getting microbes on the conservation agenda. If we do not protect them for their valuable functions, we should do it for the sake of ethics (Cockell 2011).

Summary

Despite the prominent role microbes and microbial communities play in all ecosystems on Earth, they are not considered in conservation policy or legislation. This is due to an utter lack of fundamental knowledge on crucial issues concerning environmental microbial communities. The species-oriented approach in conservation biology is not applicable to microbes, where there are difficulties in defining species and where more than 99 % of all species present in the environment are not known. In addition, we have no idea how important microbial diversity is for ecosystem functioning because the methodology to assess this is lacking. The most important problem is probably the notion that microbes are so abundant, diverse, and resilient that they are not threatened by extinction. However, rapid developments in the field of environmental microbiology, mainly in the application of genomic and isotopic techniques, have revolutionized our knowledge and demonstrate that microbes display biogeography and are sensitive to environmental disturbance and that for a number of environmentally relevant processes, community composition is linked to ecosystem functioning. Hence, microbes are not “untouchable” and omnipresent, but in order to get them onto the conservation agenda, we have to be able to assess which microbes are present in environments in order to be able to monitor changes with possible consequences for ecosystem functions. There are many initiatives underway seeking to make inventories of the functional diversity of microbial communities in marine, terrestrial, and freshwater habitats. This knowledge will facilitate assessing impacts and consequences of anthropogenic disturbances on microbial communities and their functioning in the future and pave the way for the protection of environmental microbial communities. For the time being, we have to rely on habitat conservation guidelines and legislation to ensure maintenance of microbial communities.

References

Allison SD, Martiny JBH. Resistance, resilience, and redundancy in microbial communities. Proc Natl Acad Sci U S A. 2008;105:11512–9.
Bodelier PLE. Toward understanding, managing, and protecting microbial ecosystems. Front Microbiol. 2011;2(80).
Chen Y, Murrell JC. When metagenomics meets stable-isotope probing: progress and perspectives. Trends Microbiol. 2010;18(4):157–63.
Cockell CS. Microbial rights? EMBO Rep. 2011;12(3):181.
Cockell CS, Jones HL. Advancing the case for microbial conservation. Oryx. 2009;43(4):520–6.
Ducklow H. Microbial services: challenges for microbial ecologists in a changing world. Aquat Microb Ecol. 2008;53(1):13–9.
Gardi C, et al. Soil biodiversity monitoring in Europe: ongoing activities and challenges. Eur J Soil Sci. 2009;60(5):807–19.
Green JL, Bohannan BJM, Whitaker RJ. Microbial biogeography: from taxonomy to traits. Science. 2008;320(5879):1039–43.
Hallin S, et al. Relationship between N-cycling communities and ecosystem functioning in a 50-year-old fertilization experiment. ISME J. 2009;3(5):597–605.
Hooper DU, Adair EC, Cardinale BJ, Byrnes JEK, Hungate BA, Matulich KL, Gonzalez A, Duffy JE, Gamfeldt L, O’Connor MI. A global synthesis reveals biodiversity loss as a major driver of ecosystem change. Nature. 2012. doi:10.1038/nature11118.
Levine UY, et al. Agriculture’s impact on microbial diversity and associated fluxes of carbon dioxide and methane. ISME J. 2011;5(10):1683–91.
Mayali X, Weber PK, Brodie EL, Mabery S, Hoeprich PD, Pett-Ridge J. High-throughput isotopic analysis of RNA microarrays to quantify microbial resource use. ISME J. 2012;6:1210–21.
Millennium Ecosystem Assessment. Ecosystems and human well-being: general synthesis. United Nations; 2005. www.millenniumassessment.org/en/synthesis.aspx
Murrell JC, Whiteley AS, editors. Stable isotope probing and related technologies. Washington, DC: American Society for Microbiology (ASM); 2011.
Pace NR. Mapping the tree of life: progress and prospects. Microbiol Mol Biol Rev. 2009;73(4):565–76.
Philippot L, Andersson SGE, Battin TJ, Prosser JI, Schimel JP, Whitman WB, Hallin S. The ecological coherence of higher bacterial taxonomic ranks. Nat Rev Microbiol. 2010;8:523–9.
Raes J, Bork P. Systems microbiology – timeline – molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol. 2008;6(9):693–9.
Tamames J, et al. Environmental distribution of prokaryotic taxa. BMC Microbiol. 2010;10.
Wagner M. Single-cell ecophysiology of microbes as revealed by Raman microspectroscopy or secondary ion mass spectrometry imaging. Annu Rev Microbiol. 2009;63:411–29.
Wittebolle L, et al. Initial community evenness favours functionality under selective stress. Nature. 2009;458(7238):623–6.
Zhou JZ, et al. Spatial scaling of functional gene diversity across various microbial taxa. Proc Natl Acad Sci U S A. 2008;105(22):7768–73.

Mining Metagenomic Datasets for Antibiotic Resistance Genes

Lisa Durso
Agroecosystem Management Research Unit, US Department of Agriculture, University of Nebraska, Lincoln-East Campus, Lincoln, NE, USA

Synonyms

Anthropogenic and human associated; Horizontal gene transfer and lateral gene transfer; Whole-genome sequencing and metagenomic sequencing

Definition

Metagenomics refers to samples in which the entire bacterial or microbial community DNA is used and to high-throughput DNA sequencing of microbial community DNA. These sample types and methods can be used to gather information on genes that code for antibiotic resistance.

Introduction

Antibiotics are medicines that are used to kill, slow down, or prevent the growth of susceptible bacteria. They became widely used in the mid-twentieth century for controlling disease in humans, animals, and plants and for a variety of industrial purposes. Antibiotic resistance is a broad term. Depending on the classification scheme used, there are between eight and twenty different classes of antibiotics, with multiple compounds in each class. These different categories represent different basic chemical structures and modes of action: some antibiotics inhibit cell wall synthesis, for example, while others target portions of the ribosome and a cell’s protein processing machinery. Just as there are many types of antibiotics, there are also many types of antibiotic resistance. Some types of resistance are specific for an individual antibiotic, while others, such as multidrug resistance efflux pumps, can confer resistance to multiple different kinds of antibiotics. It is also likely that there are naturally occurring antibiotics that have yet to be described.

Antibiotic resistance is a normal and natural phenomenon that can be documented even in ancient (permafrost from 30,000 years ago) and pristine habitats such as Antarctica and the Sargasso Sea (Allen et al. 2009; D’Costa et al. 2011; Durso et al. 2012). In addition to naturally occurring antibiotic resistance, there is no doubt that anthropogenic or human-associated use of antibiotics for health, food production, veterinary, and industrial purposes has dramatically impacted resistance. The continued emergence of antibiotic-resistant opportunistic and pathogenic infections in health-care settings has become a major public health concern, especially the emergence of bacteria that are resistant to multiple antibiotics or multiple classes of antibiotics. Yet few details are known about how
antibiotic resistance genes move through environmental, agricultural, and clinical settings. Metagenomics provides one tool to start characterizing antibiotic resistance genes across habitats.

The term “metagenomic” has multiple meanings. Historically it was used to describe the kind of sample that was collected and referred to collecting DNA or genomic information not just from a single organism or isolate but from a whole community, a metagenome, consisting of both cultured and uncultured organisms (metagenomic samples). More recently, the term metagenomic has come to describe a specific type of analysis that relies on high-throughput nucleic-acid sequencing of either 16S rDNA or whole-community DNA (metagenomic sequencing). In addition to providing metagenomic sequencing information, the new high-throughput sequencing methods can be used to generate whole-community RNA profiles (metatranscriptome) and whole-community protein profiles (metaproteome). This entry will examine studies using both metagenomic samples and metagenomic sequencing to gather information on functional genes that code for antibiotic resistance. Although the focus here will be on mining metagenomic data for information on antibiotic resistance genes, it is acknowledged that functional and gene-based metagenomic studies complement experiments involving gene expression, protein production, and phenotypic characterization of individual and community resistance.

The Antibiotic Resistome

The concept of an antibiotic “resistome” was first proposed in 2006 by D’Costa et al. to describe the sum total of all antibiotic resistance genes across the globe and all genetic elements that could give rise to antibiotic resistance genes (D’Costa et al. 2006). It includes resistance genes carried by pathogenic bacteria that cause illness, as well as by opportunistic and nonpathogenic bacteria. This concept provides a framework that unites antibiotic resistance in human, animal, and plant clinical applications with the broader pool of antibiotic-resistant bacteria present in the environment, including food, water, and soil. D’Costa et al. (2006) cultured spore-forming bacteria from soil and screened them against 21 antibiotics, including both old and new antibiotics and naturally occurring and synthetic antibiotics. Based on their results, they identified the soil as a reservoir of antibiotic resistance genes and proposed the idea of a pan-microbial resistome. Contrary to the general public perception that use of antibiotics in human medicine and agriculture is the root cause of antibiotic resistance, the antibiotic resistome hypothesis supports the idea of a naturally occurring global pool of antibiotic resistance and suggests that the environment (especially soil) serves as a reservoir of antibiotic resistance elements. In this model antibiotic resistance elements can be enriched and selected for by anthropogenic antibiotic use. However, unlike previous models, the concept of the antibiotic resistome expands the focus from the selection of pathogens via the direct use of antibiotics in clinical settings to a global pool of antibiotic resistance that can potentially be transferred from harmless bacteria into human, animal, and plant pathogens. Later work by the same group (Wright 2007; D’Costa et al. 2011) as well as others (Riesenfeld et al. 2004; Henriques et al. 2006; Aminov and Mackie 2007; Mori et al. 2008) provides supporting evidence for the natural occurrence of antibiotic resistance, especially in soil, and the global distribution of antibiotic resistance genes.

Conceptually, the relationship between increased anthropogenic use of antibiotics and increases in the number and types of antibiotic-resistant bacteria and antibiotic resistance genes is clear. On a practical level, many of the details regarding the ecology of antibiotic resistance and antibiotic resistance genes in the environment remain unknown. These include the fate and transport of naturally occurring and anthropogenically induced antibiotic resistance genes within and between environmental, agricultural, and clinical settings, as well as rates of gene transfer, rates of gene expression, and the impact of naturally occurring and anthropogenically introduced antibiotic concentrations on short- and long-term microbial community structure.
Antibiotic Resistance Genes

The genes that code for antibiotic resistance are carried either as part of the regular bacterial chromosome, which is passed vertically to individual daughter cells, or as part of mobile genetic elements such as plasmids and transposons, which can be transferred both vertically to daughter cells and horizontally to other strains or species of bacteria. These antibiotic resistance genes, sometimes called antibiotic resistance determinants or antibiotic resistance elements, code for a variety of different kinds of proteins involved in inactivating the antibiotic, removing the antibiotic from the cell, or modifying the target of the antibiotic so that it is not recognized by the drug. For any specific antibiotic, there may be multiple types of resistance mechanisms. Many of these mechanisms are complex and require the coordination of a suite of genes, so that for any individual antibiotic, there are multiple different antibiotic resistance genes.

There are two basic approaches to mining metagenomic datasets for antibiotic resistance genes: those that are database dependent and those that are discovery driven. The database-dependent systems are good for comparative studies that screen large numbers of samples or large numbers of genes and examine similarities or differences in antibiotic resistance gene patterns across samples. These methods rely on previously sequenced antibiotic resistance genes to provide the information used to design primers or to provide a list against which new sequences are compared. The limitation of database-dependent methods is that a particular gene must already have been sequenced in order to be in the database, and researchers can only screen against genes that have already been discovered, characterized, and deposited in the database. Discovery-driven methods, while time-consuming and low-throughput, can be used to describe novel antibiotic resistance genes. In this approach, DNA fragments from metagenomic samples are cloned into hosts such as E. coli, or into constructs such as bacterial artificial chromosomes (BACs), and then screened for a particular phenotype. The collection of DNA fragments in the new host is called a library. In the case of antibiotic resistance, the clone or BAC libraries are plated onto media containing a specific amount of antibiotic. If they grow in the presence of the antibiotic, they are considered resistant. If they do not grow, they are considered sensitive. In human medicine and clinical settings, there are well-defined standard methods that specify, by organism and antibiotic, the concentration needed to be considered resistant. In environmental and experimental settings, these standards do not exist, and there is no consistent definition across studies.

Studying Antibiotic Resistance Genes from Metagenomic Samples

Metagenomic samples can be mined for known as well as uncharacterized antibiotic resistance genes using functional screening of metagenomic clone libraries. After creating the libraries, clones are plated onto media containing the antibiotic of interest. Colonies that grow in the presence of the antibiotic are assumed to be carrying an antibiotic resistance gene from the original sample. The inserts from the resistant clones can be sequenced, and the sequences compared against a database of known antibiotic resistance genes. As early as 1997, these methods were used to characterize the diversity of quinolone resistance genes in soil (Waters and Davies 1997). This functional metagenomic approach has been used to target specific classes of antibiotic resistance genes, as well as for more general surveys of antibiotic resistance where libraries are screened against multiple antibiotics. For example, tetracycline resistance has been assayed from human mouth and organic pig samples (Diaz-Torres et al. 2003; Kazimierczak et al. 2009), and β-lactamase genes have been extracted from samples such as tropical surface waters and Alaskan soil (Henriques et al. 2006; Allen et al. 2009). The mining of functional genes focuses on two main types of samples. When trying to determine baseline levels of antibiotic resistance and evolutionary relationships of individual genes, pristine samples and those dating from before the use of
M 490 Mining Metagenomic Datasets for Antibiotic Resistance Genes
antibiotics are used (D’Costa et al. 2011). When searching for novel antibiotic resistance genes, complex samples are used, especially those with increased levels of antibiotic compounds such as feces or activated sludge (Sommer et al. 2009; Mori et al. 2008). It is also possible to use publicly available information to screen for potentially novel antibiotic resistance genes. Both the National Center for Biotechnology Information (NCBI) and the MG-RAST server (Meyer et al. 2008) have extensive DNA sequence datasets that are available to the public. Once identified via the public databases, antibiotic resistance genes of interest can then be characterized using other methods (Toth et al. 2010).
Studying Antibiotic Resistance Genes Using Metagenomic Sequencing Methods
One tool that is useful for exploring antibiotic resistance in metagenomic samples is MG-RAST (Meyer et al. 2008). MG-RAST, developed at Argonne National Laboratory and the University of Chicago, provides metagenomic data analysis tools for both public and private metagenomic sequencing sets. There are hundreds of publicly available metagenomes on the MG-RAST website (http://metagenomics.anl.gov). These can be accessed directly using the sample ID number or via the “browse metagenome” function. Researchers may submit their own metagenomic datasets to the site for analysis, with processing priority given to datasets that will be made immediately available to the public. After normalization, both taxonomic and functional data are extracted from the submitted sequences and made available for visualization via the website. There are many different classification schemes that are available for organizing data on MG-RAST. One system, called SEED (Overbeek et al. 2005), is designed to classify functional genes across genomes using a standardized system for categorizing genes or gene fragments. The SEED system of organization is hierarchical in nature, and each of the primary functional groups or systems is composed of subsystems. Examples of primary SEED functional groups are “cell wall synthesis,” “nitrogen metabolism,” and “virulence.” Within the virulence functional category is a subset of genes that are associated with “resistance to antibiotic and toxic compounds” (RATC). Drilling even further down into this particular functional group, gene fragments are binned by categories such as “aminoglycoside adenylyltransferases,” “beta-lactamase resistance,” and “resistance to fluoroquinolones.”
After identifying a metagenome, a list of antibiotic resistance genes can be accessed using the “analysis” icon. Under “Data Type” choose “Functional Abundance” and “Hierarchical Classification.” The Data Selection annotation source should be “subsystems” and the Data Visualization option should be “table.” Then, hit the “generate” button. After processing the data, a table will be displayed with three functional classification levels, along with abundance and quality data. The abundance results are clickable, and open a window that lists the taxonomic assignments of each of the hits, as well as a link to the actual sequence and M5nr nonredundant protein data. The M5nr database allows classification of the fragment across multiple classification schemes.
Because metagenomic sequencing is performed on whole-community DNA without a PCR step, the data generated can be considered quantitative. So in addition to describing which antibiotic resistance genes are present, metagenomic analysis can quantify the relative amounts or proportions of individual genes and/or gene classes, both within any individual sample and across samples from different habitats. As with all methods associated with tracking antibiotic resistance in the environment, there are no standard methods (Allen et al. 2010). However, control metrics available through MG-RAST, in particular a new metric called duplicate read inferred sequencing error estimation (DRISEE; Keegan et al. 2012), can serve as screening tools to decide on minimum quality standards for inclusion or exclusion of specific metagenomic samples for analysis.
These metagenomic sequencing tools can be used to start addressing questions related to the
ecology of antibiotic resistance in specific habitats and across ecosystems. Metagenomic analysis of 45 microbiomes across the globe revealed functional gene profiles that correlated with environment (Dinsdale et al. 2008). This idea was expanded to antibiotic resistance genes, providing an antibiotic resistance “fingerprint” for some samples (Durso et al. 2011). A metagenomic analysis of public datasets was performed specifically comparing RATC genes from agricultural and nonagricultural metagenomes (Durso et al. 2012). Among the 26 metagenomes studied, the total percent of RATC gene fragments (based on all classified fragments) ranged from 0.7 % for the Sargasso Sea sample to 4.4 % for the dog. The fecal samples (dog, fish, three human, and cattle) had the highest overall percent of RATC genes, while the marine samples (Chesapeake, Galapagos, Zanzibar, Gulf of Mexico, Key West, Madagascar, Gulf of Maine, and Sargasso Sea) had the lowest overall percent of RATC genes. In addition to having the highest proportion of RATC genes, the dog metagenome also displayed the highest diversity of RATC classes (31 classes) and the Sargasso Sea displayed the lowest diversity (7 classes). Using MG-RAST, individual classes of antibiotic resistance genes could be examined. The fish metagenome, for instance, had over ten times as many genes coding for mercuric reductase and mercury resistance (3.9 %), compared to the average for the other metagenomes (0.31 %), while the day 29 kimchi metagenome, a Korean fermented vegetable, had high levels of the two-protein Gram-positive multidrug resistance compared to the other metagenomes examined. In both of these examples, the metagenomic data reflect what we already know about the biology of these systems and suggest that metagenomic RATC data can be used to distinguish fundamental differences in microbial community ecology from diverse microbiomes.
In addition to information on specific antibiotic resistance genes, analysis of metagenomic sequencing data can also provide taxonomic information about a sample. The use of the 16S ribosomal RNA gene to classify bacteria is well known. Some of the fragments in a metagenomic sample that code for 16S rRNA genes can be pulled out and used for taxonomic purposes. In addition, MG-RAST has the ability to link protein-coding fragments with taxonomic assignments using SEED and other systems. Currently, the only way to access this linked information for individual reads from MG-RAST is through the “assignment” column on the functional gene table, so it is time-consuming to assemble this linked data, even for a single metagenome. Grouped data are more easily accessible in MG-RAST using the “workbench” function. In the functional table, the last column contains a box titled “to workbench.” Reads belonging to specific functional groups can be selected, and then a second taxonomic-specific analysis can be run exclusively on the reads in the workbench. Using these methods, information can be gathered on which bacteria are likely carrying specific antibiotic resistance genes and how the bacterial communities may change over time or space. Some types of antibiotic resistance, such as beta-lactamase, MDR efflux pumps, and fluoroquinolone resistance, are broadly distributed across many (>10) taxa, while other types of resistance genes such as tetracycline and vancomycin resistance are more taxonomically restricted (4 or 5 taxa each). Within individual antibiotic resistance classes, the taxonomic distribution of specific genes or gene classes varies by metagenome. For example, MDR efflux pump genes are associated mainly with Clostridia in animal agriculture metagenomes but are more frequently assigned to Gammaproteobacteria in coastal marine samples. Metagenomic sequencing enables researchers to track the change in microbial communities over time. One set of publicly available metagenomes follows the fermentation of kimchi over the course of a month. The antibiotic resistance gene profiles associated with the kimchi change dramatically as the fermentation progresses, and these specific changes can be tracked using metagenomic sequencing.
The strengths of these metagenomic sequencing methods are that they allow researchers to identify and gain a quantitative understanding of functional gene relationships across samples and geographies. It should be kept in mind that there are many places where artifacts of processing of
either the sample itself or the resulting sequence data can influence the results. Although these sequence-based metagenomic data are excellent for getting oriented in a system and providing an overview of what is potentially there, the output is of fairly low resolution and requires follow-up using other methods before detailed conclusions can be drawn. Nonetheless, there is great value in the information that these kinds of techniques can provide. Like the Lewis and Clark expedition, which mapped the entire US western frontier based on sampling a single route covering much less than 1 % of today’s public roads in the area, data generated by metagenomic sequencing methods provide a first step in exploring previously unknown territory. For antibiotic resistance, they offer the capacity to examine antibiotic resistance gene distribution on a global scale and the opportunity to begin to compare distribution of specific antibiotic resistance genes across samples and time.
Summary
The ecology of antibiotic resistance genes in the environment remains largely unexplored. Metagenomic tools provide the opportunity to identify novel antibiotic resistance genes, explore the epidemiology of antibiotic resistance genes across multiple habitats, and begin to define relationships between antibiotic resistance genes and the bacteria that likely carry them. The availability of public metagenomic datasets affords all researchers an opportunity to ask and answer questions about antibiotic resistance.
References
Allen H, Cloud-Hansen K, Wolinski J, et al. Resident microbiota of the gypsy moth midgut harbors antibiotic resistance determinants. DNA Cell Biol. 2009;28(3):109–17.
Allen H, Donato J, Wang H, et al. Call of the wild: antibiotic resistance genes in natural environments. Nat Rev Microbiol. 2010;8:251–9.
Aminov R, Mackie R. Minireview: evolution and ecology of antibiotic resistance genes. FEMS Microbiol Lett. 2007;271:147–61.
D’Costa V, McGrann K, Hughes D, et al. Sampling the antibiotic resistome. Science. 2006;311:374–7.
D’Costa V, King C, Kalak L, et al. Antibiotic resistance is ancient. Nature. 2011;477(7365):457–61.
Diaz-Torres M, McNab R, Spratt D, et al. Novel tetracycline resistance determinant from the oral metagenome. Antimicrob Agents Chemother. 2003;47(4):1430–2.
Dinsdale E, Edwards R, Hall D, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–33.
Durso L, Harhay G, Bono J, et al. Virulence-associated and antibiotic resistance genes of microbial populations in cattle feces analyzed using a metagenomic approach. J Microbiol Methods. 2011;84(2):278–82.
Durso LM, Miller DN, Wienhold BJ. Distribution and quantification of antibiotic resistant genes and bacteria across agricultural and non-agricultural metagenomes. PLoS One. 2012;7:e48325.
Henriques I, Moura A, Alves A, et al. Analysing diversity among β-lactamase encoding genes in aquatic environments. FEMS Microbiol Ecol. 2006;56:418–29.
Kazimierczak K, Scott K, Kelly D, et al. Tetracycline resistance of the organic pig gut. Appl Environ Microbiol. 2009;75(6):1717–22.
Keegan K, Trimble W, Wilkening J, et al. A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE. PLoS Comput Biol. 2012;8(6):e1002541. doi:10.1371/journal.pcbi.1002541.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Mori T, Mizuta S, Suenaga H, et al. Metagenomic screening for bleomycin resistance genes. Appl Environ Microbiol. 2008;74(21):6803–5.
Overbeek R, Begley T, Butler R, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Riesenfeld C, Goodman R, Handelsman J. Uncultured soil bacteria are a reservoir of new antibiotic resistance genes. Environ Microbiol. 2004;6(9):981–9.
Sommer MO, Dantas G, Church GM. Functional characterization of the antibiotic resistance reservoir in the human microflora. Science. 2009;325:1128–31.
Toth M, Smith C, Frase H, et al. An antibiotic-resistance enzyme from a deep-sea bacterium. J Am Chem Soc. 2010;132:816–23.
Waters B, Davies J. Amino acid variation in the GyrA subunit of bacteria potentially associated with natural resistance to fluoroquinolone antibiotics. Antimicrob Agents Chemother. 1997;41(12):2766–9.
Wright G. The antibiotic resistome: the nexus of chemical and genetic diversity. Nat Rev Microbiol. 2007;5:175–86.
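The RATC percentages discussed in this entry are simple proportions of classified fragments, and the statement that the fish metagenome carried "over ten times" the average share of mercury resistance genes follows directly from the quoted values of 3.9 % and 0.31 %. The Python sketch below makes that arithmetic explicit. The function names and the fragment counts in the usage lines are illustrative assumptions; they are not taken from MG-RAST or from Durso et al. (2012).

```python
def percent_of_classified(category_hits, classified_hits):
    """Share of classified fragments assigned to one category, in percent."""
    if classified_hits <= 0:
        raise ValueError("classified_hits must be positive")
    return 100.0 * category_hits / classified_hits


def fold_difference(sample_percent, reference_percent):
    """How many times larger one percentage is than a reference percentage."""
    if reference_percent <= 0:
        raise ValueError("reference_percent must be positive")
    return sample_percent / reference_percent


# Hypothetical counts: 44 RATC fragments among 1,000 classified fragments.
ratc_share = percent_of_classified(44, 1000)

# Percentages quoted in the entry: fish metagenome versus the average of the others.
mercury_fold = fold_difference(3.9, 0.31)

print(f"RATC share: {ratc_share:.1f} %")
print(f"Mercury resistance enrichment: {mercury_fold:.1f}-fold")
```

Because both quantities are proportions of classified fragments, they can be compared across metagenomes of different sizes, which is the sense in which the entry calls the data quantitative.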
Mining Metagenomic Datasets for Cellulases
David J. Rooks and Alan J. McCarthy
Microbiology Research Group, Institute of Integrative Biology, Biosciences Building, University of Liverpool, Liverpool, UK
Synonyms
Environmental DNA; Glycosyl hydrolases; Metagenomes; Metatranscriptomes
Definition
Metagenomic (DNA) or metatranscriptomic (cDNA) sequence datasets generated using DNA or RNA extracts are obtained directly from environmental samples. These include soil, water, gut contents, and degrading organic matter/plant biomass and biofilms; laboratory-incubated microcosms or mesocosms in which cellulose-degrading microorganisms are enriched also serve as sources of nucleic acids for the preparation of sequence datasets. Genes encoding glycosyl hydrolases and specifically those likely to be active against cellulose (cellulases) can be sought, most efficiently in the large sequence datasets generated by the application of pyrosequencing technologies.
Cellulose and Its Biodegradation
[...] between the polymer chains, and this is largely responsible for its recalcitrance. This network of bonds leads to a mostly uniform arrangement of fibers, and the resultant crystalline cellulose lacks enzyme-accessible surface morphologies, further enhancing resistance to hydrolysis (Zhou et al. 2009). Cellulose usually occurs naturally in close physical association with hemicelluloses, which are heteropolysaccharides that, in terrestrial plants, form the lignocarbohydrate matrix enveloping cellulose fibers and essentially constituting the plant cell wall structure. Cotton is the only naturally occurring pure form of highly crystalline cellulose. For microorganisms to hydrolyze and metabolize insoluble polymeric cellulose, extracellular cellulases must be produced and in multiple forms that act synergistically. The two primary models are those in which the enzymes are truly secreted, versus the cellulosome, a surface-bound multimeric complex of polypeptides comprising catalytic and non-catalytic components; the cellulosome has been likened to a polysaccharide processing nanomachine (Fontes and Gilbert 2010). There is a possible third model in which cellulose is bound to the bacterial cell surface and further processed in the periplasmic space (see Ransom-Jones et al. 2012). Three major types of enzymatic activities are found: (i) endoglucanases, (ii) exoglucanases (cellobiohydrolases), and (iii) β-glucosidases (cellobiases). The evidence for oxidative attack on cellulose has often been equivocal, but there are now data that establish the involvement of an enzyme (GH61) in cellulose depolymerization (Quinlan et al. 2011).
[...] are the most diverse enzymes that catalyze a single reaction. Automated data mining suggests that there are 15 glycoside hydrolase families that contain cellulases; families are defined by amino acid sequence similarity (CAZy; see below). Structural studies show that cellulases have eight different protein folds and contain a carbohydrate-binding module, which is usually linked to a catalytic-binding domain (Shoseyov et al. 2006). Glycosyl hydrolases with open active sites typically exhibit endocellulolytic activity (endoglucanases) and cleave β-1,4 links at amorphous sites in the polysaccharide chain to generate chain ends and ultimately oligosaccharides of various lengths (Horn et al. 2006). Those with tunnel-like active sites exhibit exocellulolytic activity and are cellobiohydrolases that act in a processive manner on the reducing or nonreducing ends to liberate either glucose or cellobiose as major products. β-Glucosidases convert cellobiose to glucose, completing the highly synergistic and complete enzymatic depolymerization of cellulose.
GH5 and GH9 have the largest number of biochemically characterized cellulases; however, this could be largely due to the abundance of these cellulases in the limited number of model cellulolytic organisms that have been studied (Sukharnikov et al. 2011). The database is frequently updated to provide rich sets of manually curated information on all groups of CAZymes, i.e., names, GenBank accession numbers, EC designations, 3D structure, and taxonomy, and the information can serve as an invaluable resource to identify CAZyme genes or gene fragments in both genomes and metagenomes. Although the collection of enzyme data in CAZy is invaluable for enzymologists, annotations could be significantly improved; the term “characterized” in CAZy is applied equally to proteins that have been analyzed biochemically and to those for which function has been computationally predicted (Stam et al. 2006).
Metagenomics
[...] can be provided, and it is always an analysis worth doing. To identify novel cellulases, more sophisticated bioinformatic approaches are required to search for domains and motifs indicative of enzymes with cellulose binding and/or catalytic functions. Sequence comparisons among proteins with suggestive domain architectures or genomic contexts in metagenomic DNA have the potential to identify novel cellulases; the discovery of a new carbohydrate-binding module in metagenomic DNA by Mello et al. (2010) is a particularly good example of what can be achieved by the continuing development of bioinformatic tools.
With complete sequences and their genomic context if located within larger sequenced DNA fragments, homology-based approaches can be extended. Firstly, structural modeling of members of likely cellulase families can identify those with unusual binding and catalytic sites that may therefore exhibit functional novelty. Secondly, domains of unknown function, which are likely to be putative cellulase or cellulase-related sequences because they are consistently linked by genome context, can be characterized through distant homology, non-homology, and structure-based approaches. This is exemplified by the identification of a novel cellulase from a sequenced marine bacterial genome through signature domains that assemble enzymes into plant cell wall degradative complexes (Bras et al. 2011).
Sequences with matches indicative of cellulases can of course be identified by BLAST searches against the CAZy database (see above) and through functional annotation pipelines such as SEED (Overbeek et al. 2005) and MG-RAST (Meyer et al. 2008) to provide taxonomic affiliations for functional and hypothetical protein-encoding genes. However, identification of even distant relationships for the short sequence read output (<500 bp) that is characteristic of pyrosequencing is a bioinformatic challenge. The danger of simply searching against databases of known cellulase gene sequences is that true novelty will be missed and the metagenomes will only be mined for variants of these known cellulases. Much longer sequences, ideally complete genes, are the best source material for bioinformatic prediction of potential cellulase function, and metagenomic/metatranscriptomic datasets can provide the probes to identify such genes and their neighbors in contemporaneously produced fosmid or bacterial artificial chromosome (BAC) libraries (Rooks et al. 2012). Subsequent cloning, overexpression, and purified protein production then provide sufficient material for the detailed structure/function characterization, combining classical biochemistry and structural biology approaches, necessary to establish that a novel cellulase has been teased out from the metagenome.
Future Prospects
Firstly, the tandem approach of using environmental RNA and DNA as the starting material to generate complementary metatranscriptomes and metagenomes, thus benefitting from the specific advantages of each, is becoming more feasible with developments in ribosomal RNA depletion and messenger RNA enrichment techniques. 454 pyrosequencing, which had predominated due to the relatively long read lengths (ca. 800 bp) that could be obtained, is receding to be replaced by next-generation sequencing technology that can deliver ever-increasing read lengths (currently ca. 500 bp by using paired end reads) in combination with read numbers in the 10^7 range. All of this is taking place in an economically competitive environment in which sequencing run costs continue to decrease. The bioinformatic bottleneck remains in terms of computer processing capacity, and thus time, but specifically in relation to mining metagenomes for genes encoding enzymes, the future is the ability to reliably predict and model protein structure and function in silico and thus identify truly novel cellulases among those numerous translated metagenomic sequences that lack homology with any known protein-coding sequences.
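As a minimal sketch of what searching short reads for protein-level motifs can involve, a read can be translated in all six reading frames and each translation scanned with a pattern. The Python below assumes the standard genetic code. The function names and the motif used in the usage line are placeholders chosen for illustration, not a real cellulase signature, and practical searches would rely on curated domain models rather than a single regular expression.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Map frame labels (+1..+3, -1..-3) to translated peptide strings."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_motif(read, motif_regex):
    """Return (frame, peptide position, matched text) for every motif hit."""
    hits = []
    for frame, peptide in six_frame_translations(read).items():
        for match in re.finditer(motif_regex, peptide):
            hits.append((frame, match.start(), match.group()))
    return hits


# Toy read and placeholder motif, for illustration only.
print(find_motif("ATGGCTTGGGAA", r"AW"))
```

A read of a few hundred base pairs yields peptides of roughly a hundred residues or fewer per frame, which illustrates why only short, local features can be detected on individual reads and why longer sequences are preferred for functional prediction.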
References
Bras JL, Cartmell A, Carvalho AL, et al. Structural insights into a unique cellulase fold and mechanism of cellulose hydrolysis. Proc Natl Acad Sci U S A. 2011;108:5237–42.
Cantarel BL, Coutinho PM, Rancurel C, et al. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res. 2009;37:233–8.
Damon C, Lehembre F, Oger-Desfeux C, et al. Metatranscriptomics reveals the diversity of genes expressed by eukaryotes in forest soils. PLoS ONE. 2012;7:e28967.
Davies G, Henrissat B. Structures and mechanisms of glycosyl hydrolases. Structure. 1995;3:853–9.
Duan CJ, Xian L, Zhao GC, et al. Isolation and partial characterization of novel genes encoding acidic cellulases from metagenomes of buffalo rumens. J Appl Microbiol. 2009;107:245–56.
Fontes CM, Gilbert HJ. Cellulosomes: highly efficient nanomachines designed to deconstruct plant cell wall complex carbohydrates. Ann Rev Biochem. 2010;79:655–81.
Gilbert JA, Field D, Huang Y, et al. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE. 2008;3:e3042.
Henrissat B, Teeri TT, Warren RA. A scheme for designating enzymes that hydrolyse the polysaccharides in the cell walls of plants. FEBS Lett. 1998;425:352–4.
Horn SJ, Sikorski P, Cederkvist JB, et al. Costs and benefits of processivity in enzymatic degradation of recalcitrant polysaccharides. Proc Natl Acad Sci U S A. 2006;103:18089–94.
Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Leininger S, Urich T, Schloter M, et al. Archaea predominate among ammonia-oxidizing prokaryotes in soils. Nature. 2006;442:806–9.
Lynd LR, Weimer PJ, van Zyl WH, et al. Microbial cellulose utilization: fundamentals and biotechnology. Microbiol Mol Biol Rev. 2002;66:506–77.
McDonald JE, Rooks DJ, McCarthy AJ. Methods for the isolation of cellulose-degrading microorganisms. Methods Enzymol. 2012;510:349–74.
Mello LV, Chen X, Rigden DJ. Mining metagenomic data for novel domains: BACON, a new carbohydrate-binding module. FEBS Lett. 2010;584:2421–6.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386–92.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Quinlan RJ, Sweeney MD, Lo Leggio L, et al. Insights into the oxidative degradation of cellulose by a copper metalloenzyme that exploits biomass components. Proc Natl Acad Sci U S A. 2011;108:15079–84.
Ransom-Jones E, Jones DL, McCarthy AJ, et al. The Fibrobacteres: an important phylum of cellulose-degrading bacteria. Microb Ecol. 2012;63:267–81.
Rooks DJ, McDonald JE, McCarthy AJ. Metagenomic approaches to the discovery of cellulases. Methods Enzymol. 2012;510:375–94.
Shoseyov O, Shani Z, Levy I. Carbohydrate binding modules: biochemical properties and novel applications. Microbiol Mol Biol Rev. 2006;70:283–95.
Stam MR, Danchin EG, Rancurel C, et al. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng Des Sel. 2006;19:555–62.
Sukharnikov LO, Cantwell BJ, Podar M, et al. Cellulases: ambiguous nonhomologous enzymes in a genomic perspective. Trends Biotechnol. 2011;29:473–9.
Takasaki K, Miura T, Kanno M, et al. Discovery of glycoside hydrolase enzymes in an avicel-adapted forest soil fungal community by a metatranscriptomic approach. PLoS ONE. 2013;8:e55485.
Voget S, Steele HL, Streit WR. Characterization of a metagenome-derived halotolerant cellulase. J Biotechnol. 2006;126:26–36.
Warnecke F, Luginbuhl P, Ivanova N, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450:560–5.
Zhou W, Schuttler HB, Hao Z, et al. Cellulose hydrolysis in evolving substrate morphologies I: a general modeling formalism. Biotech Bioeng. 2009;104:261–74.
Mock Community Analysis
Sarah Highlander
Genomic Medicine, J. Craig Venter Institute, La Jolla, CA, USA
Definitions
Mock community: A defined mixture of microbial cells and/or viruses or nucleic acid molecules created in vitro to simulate the composition of a microbiome sample or the nucleic acid isolated therefrom.
Microbiome: The microbes (bacteria, archaea, fungi, protists, and viruses) that inhabit
a specific environment or host, such as all the microbes that live in and on the human body.
Metagenome: The complete DNA (genomic) content of a microbiome sample. The term “metagenome” was first used by Handelsman et al. to describe the “collective genomes of soil microflora” (Handelsman et al. 1998).
Introduction
Although a few studies have reported creation of mock communities for environmental microbial systems, this review will be restricted to mock communities that have been developed for studies of the human microbiome.
Microbial Mock Communities
For the human sampling aspect of the Human Microbiome Project (HMP), the clinical centers at Baylor College of Medicine and the Washington University School of Medicine were tasked with obtaining microbiome samples from 18 different body sites. These samples were in the form of saliva, tooth scrapings, buccal swabs, vaginal swabs, nasal swabs, skin scrapings, feces, etc. Each sample had a different physical and microbial composition, yet it was necessary to have a standard and uniform method of DNA extraction for each. The method selected included chemical lysis with sodium dodecyl sulfate (SDS) and mechanical disruption by bead beating followed by column purification of the DNA from the cell lysate (http://www.hmpdacc.org/doc/HMP_MOP_Version12_0_072910.pdf) using the MO BIO PowerSoil DNA Isolation Kit (Carlsbad, CA). As a means to evaluate the DNA purification protocol, we created a mock cell community that consists of 22 bacterial strains and one archaeal strain, mostly representing strains found at different sites within the human body (Table 1). The strains were selected as having different features such as different cell wall compositions (gram positive, gram negative, spore formers, encapsulated, thick cell wall), aerobe or anaerobe, high and low percent G+C, and having completely sequenced genomes. The strains were grown under appropriate growth conditions to late logarithmic or stationary phase and then mixed at an equal ratio at a concentration of 10^8 cells/ml. The cell mix is available, free of charge, from BEI Resources (www.beiresources.org). A similar mixture, formulated in 40 % glycerol (BEI HM-281), was also created to be used as a viable mock community for single cell studies.
Mock Community Analysis, Table 1 Strains in the HMP mock cell community (BEI HM-280)
Genus species                    Strain number
Acinetobacter baumannii          ATCC 17978
Actinomyces odontolyticus        ATCC 17982
Bacillus cereus                  ATCC 10987
Bacteroides vulgatus             ATCC 8482
Bifidobacterium adolescentis     DSM 20083
Clostridium beijerinckii         ATCC 51743
Deinococcus radiodurans          ATCC 13939
Enterococcus faecalis            ATCC 47077
Escherichia coli                 ATCC 700296
Helicobacter pylori              ATCC 700392
Lactobacillus gasseri            ATCC 33323
Listeria monocytogenes           ATCC BAA-679
Methanobrevibacter smithii       ATCC 35061
Neisseria meningitidis           ATCC BAA-335
Porphyromonas gingivalis         ATCC 33277
Propionibacterium acnes          DSM 16379
Pseudomonas aeruginosa           ATCC 47085
Rhodobacter sphaeroides          ATCC 17023
Staphylococcus aureus            ATCC BAA-1718
Staphylococcus epidermidis       ATCC 12228
Streptococcus agalactiae         ATCC BAA-611
Streptococcus mutans             ATCC 700610
Streptococcus pneumoniae         ATCC BAA-334
We extracted DNA from the mock cell community using the HMP standard DNA isolation protocol, then performed 454 amplicon sequencing of the V1-V3 variable regions of the 16S ribosomal RNA gene. We failed to detect any M. smithii or bifidobacterial reads and recovered less than 1 % of the total reads corresponding to the following input organisms: Acinetobacter
baumannii, Actinomyces odontolyticus, Clostridium beijerinckii, Deinococcus radiodurans, Helicobacter pylori, Lactobacillus gasseri, Rhodobacter sphaeroides, or Streptococcus spp. In contrast, the relative abundance of Neisseria reads was approximately 35 % and the relative abundance of Bacillus and Enterococcus reads isolated from the mock community was approximately 15 % for each genus. These observations are likely due to a combination of the relative ability of an organism to be lysed and the percent match of the 16S primers to rRNA gene targets. For example, it is known that the 534R primer has numerous mismatches to actinobacterial 16S rRNA genes, particularly to the bifidobacteria, and an evaluation of primer mismatches to other members of the mock community revealed F27 mismatches to Acinetobacter, Pseudomonas, and Escherichia and numerous 534R mismatches to Methanobrevibacter, as described below.

Although we did not use our mock cell community for rigorous testing of lysis and DNA extraction methods for metagenomics, Yuan et al. have performed a systematic evaluation of common DNA extraction methods (Yuan et al. 2012), using a mock community composed of equal cell counts of 11 bacterial species chosen to represent different human body sites: E. coli, S. aureus, P. aeruginosa, S. agalactiae, Corynebacterium tuberculostearicum, Lactobacillus iners, Lactobacillus crispatus, Atopobium vaginae, Gardnerella vaginalis, and P. acnes. They compared six different DNA methods that combined different lysis (enzymatic, chemical, and bead beating) and DNA purification (silica column or phenol/chloroform plus isopropanol precipitation) methods. DNA yield and DNA integrity (shearing) were evaluated. Microbial abundance was measured by 454 sequencing of the 16S rDNA V1-V2 regions using a mixture of forward primers that were chosen to prevent bias against Lactobacillus spp. and Gardnerella spp. (Yuan et al. 2012), followed by statistical analyses that included accommodation for differences in 16S rRNA gene copy number per organism. Extraction methods that included bead beating (as included in the HMP protocol) delivered the best representation of the community structure. Addition of mutanolysin, but not lysozyme or lysostaphin, also enhanced recovery of the expected proportions of 16S rRNA gene sequences. In sum, L. iners was overrepresented using all techniques and the two gram-negative organisms, E. coli and P. aeruginosa, were underrepresented in all. Thus, the authors caution that none of the methods tested returned the actual representation of the input mock community.

In another comparison of extraction methods, Willner et al. created a mock community of 12 strains that included organisms relevant to respiratory infections and cystic fibrosis (CF) (Willner et al. 2012). The goal was to use this mock community to compare and evaluate methods for DNA extraction prior to their application to clinical bronchoalveolar lavage (BAL) samples obtained from CF patients. They also developed an in silico simulation of the mock BAL community using the software package Grinder (http://sourceforge.net/projects/biogrinder/). The mock community was composed of the following bacterial species from actively growing stocks (relative proportions in parentheses): P. aeruginosa (1), Burkholderia cepacia (0.1), S. aureus (0.1), Haemophilus influenzae (0.1), Moraxella catarrhalis (0.01), S. epidermidis (0.01), Klebsiella pneumoniae (0.01), N. meningitidis (0.001), Burkholderia multivorans (0.001), Legionella pneumophila (0.0001), S. pneumoniae (0.0001), and Neisseria gonorrhoeae (0.00001). Aliquots of the mock community were extracted using a “CTAB method,” a “saline protocol,” the NucleoSpin Tissue Kit (pellet and liquid protocols), and the MO BIO PowerSoil Kit. Community abundance was evaluated by 454 16S rDNA sequencing of the V8-V9 regions using a degenerate 1,114 F3 primer (Willner et al. 2012). Data were normalized to 900 reads per sample. At this level, few (<1 %) to no streptococcal reads were detected in most of the preparations, and no Legionella reads were detected in any of the preparations. In contrast,
the abundance of Neisseria reads was greater than that predicted by the in silico model. Unfortunately, each sample included Escherichia and Dechloromonas as contaminants and the CTAB samples had a high percentage of Stenotrophomonas. This made it difficult to draw conclusions about the efficiency and reproducibility of the methods employed.

Diaz et al. created two different oral bacterial mock cell mixes, one in even cell distribution and one in unequal distribution, using seven species that are representative of the tooth surface (Diaz et al. 2012): Streptococcus oralis, S. mutans, Lactobacillus casei, Actinomyces oris, Fusobacterium nucleatum, P. gingivalis, and Veillonella sp. Late logarithmic phase cultures were mixed in an even distribution based on cell counts and in an uneven mixture that replicated the proportions found in the oral cavity. These cell communities were lysed using a single protocol that included lysozyme treatment, overnight proteinase K digestion, and column purification of the DNA. As a control, genomic DNAs from the seven bacteria were mixed in equal proportion based on 16S rRNA gene copy number. All DNAs were used for 454 sequencing of the V1-V2 region of the 16S rRNA gene. Very few S. mutans or P. gingivalis reads were recovered, despite efficient sequencing of the control DNA, suggesting that the lysis method was inefficient for these two members of the mock community.

DNA Mock Communities

While mock communities composed of mixtures of cells were intended to be used to evaluate different DNA extraction methods, they also revealed biases in 16S rRNA gene amplification, sequencing, and classification. DNA mock communities have been created in attempts to address these issues, to examine sensitivity and presence of chimeric sequences, to serve as controls for protocol development, etc. DNA mock communities can be composed of mixtures of genomic DNA, of plasmid clones of genes (usually the 16S rRNA gene), or of PCR amplicons. Generally, these mock communities were created as controls and calibrators for 16S rRNA gene variable region sequencing on next-generation sequencing platforms, but they are useful in the context of metagenomic sequencing as well.

Turnbaugh et al. used genomic DNA from 67 gut bacterial strains (e.g., containing the genera Bifidobacterium, Collinsella, Bacteroides, Prevotella, Clostridium, Dorea, Roseburia, Ruminococcus, Streptococcus, Citrobacter, Enterobacter, Proteus, and Providencia) to create even and uneven mixtures as calibrators for 454 16S rDNA sequencing of the V2 region for a twin study of gut microbiomes (Turnbaugh et al. 2010). Following quality filtering, pyrosequencing, denoising and chimera removal, the estimated diversity (at 97 % species cutoff) of the three uneven mock communities was 75, 58, and 63, respectively, which was remarkably close to the 62 phylotypes expected in the community. Diversity was not estimated for the even communities, although the ratio of observed-to-expected sequences by phylotype was tabulated and reported. This revealed an absence of bifidobacterial reads, due to multiple primer mismatches, and overabundances of sequences mapping to other genera, including the Bacteroides and the clostridia. The authors acknowledged that these observations could be the result of a number of factors including variations in 16S rRNA gene copy number per strain and DNA quality.

During the development of standardized 454 16S rDNA sequencing protocols for the HMP, we created mock DNA communities using genomic DNAs from 21 of the strains listed in Table 1 (B. adolescentis and P. gingivalis were not included) plus Candida albicans MYA-2876. DNAs were prepared from individual cultures and each DNA preparation was validated for purity by Sanger paired-end sequencing of 384 full-length 16S rDNA clones obtained from each. Genomic DNAs were combined, based on 16S rRNA gene copy number, to form even or staggered mock communities. The even communities theoretically contained 10^5 16S rDNA copies from each species per amplification reaction, and the staggered communities had 16S rDNA copies that ranged from
10^3 to 10^6 copies from a particular species per reaction. All reactions contained approximately 1,000 copies of the C. albicans 18S rRNA gene (Haas et al. 2011; Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). These mock communities were used to develop an improved chimera detection tool, called ChimeraSlayer, and revealed a high level of chimerism in short variable region products (Haas et al. 2011). The mock community was essential to validate and benchmark methods for 16S rDNA sequencing by all four genome sequencing centers involved in the HMP (the Baylor College of Medicine Human Genome Sequencing Center, the Broad Institute, the J. Craig Venter Institute, and the Washington University Genome Sequencing Center) and revealed clear cases of primer mismatches that caused some genera to be underrepresented (Fig. 1).

Mock Community Analysis, Fig. 1 Deviation from the expected for the 16S rDNA sequencing of the 20 bacterial + one archaeal mock community. (a) Distribution of reads over the 18 genera; expected frequencies (gray) were determined by whole-genome shotgun sequencing of the mock community, and observed frequencies were determined by 454 reads (red) or Sanger 3730 reads (blue). Error bars represent standard error. (b) Lowest percent mismatch between primer and 16S rRNA gene copy by organism, sequencing technology, and variable gene region (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012)
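The mixing of genomic DNAs by 16S rRNA gene copy number, as described for the even and staggered HMP communities, can be illustrated with a short calculation. The sketch below is not the HMP protocol. It assumes an average mass of 650 g/mol per base pair of double-stranded DNA, and the strain names, genome sizes, and 16S copy numbers are hypothetical placeholders rather than values for the strains in Table 1.

```python
AVOGADRO = 6.022e23          # molecules per mole
BP_MASS_G_PER_MOL = 650.0    # assumed average mass of one base pair of dsDNA


def dna_mass_ng(target_16s_copies, genome_size_bp, copies_16s_per_genome):
    """Mass of genomic DNA (ng) carrying the requested number of 16S rRNA gene copies."""
    genomes_needed = target_16s_copies / copies_16s_per_genome
    grams_per_genome = genome_size_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return genomes_needed * grams_per_genome * 1e9


# Hypothetical strains: (genome size in bp, 16S copies per genome)
strains = {
    "strain_A": (5_000_000, 7),
    "strain_B": (2_000_000, 2),
}

# Even community: the same number of 16S copies of every strain per reaction
even_mix = {
    name: dna_mass_ng(1e5, size, copies)
    for name, (size, copies) in strains.items()
}

# Staggered community: the target copy number differs by strain
staggered_targets = {"strain_A": 1e3, "strain_B": 1e6}
staggered_mix = {
    name: dna_mass_ng(staggered_targets[name], size, copies)
    for name, (size, copies) in strains.items()
}

for name in strains:
    print(f"{name}: even {even_mix[name]:.4g} ng, staggered {staggered_mix[name]:.4g} ng")
```

The calculation makes the dependence on genome size and copy number explicit, which is why unfinished genomes or wrong copy-number assumptions can skew a mixture of this kind.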
Mock Community Analysis, Fig. 2 Quality filtering and chimera checking and removal improve estimates of community diversity as evaluated by rarefaction analysis. Operational taxonomic units (OTUs) are plotted versus number of 454 sequence reads for three 16S rDNA variable region windows, V1-V3, V3-V5, and V6-V9, before quality filtering (black), after quality filtering (green), and after quality filtering and chimera removal (red). The expected number of OTUs in the mock community was 18 (dotted line) (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012)

Sequencing the mock community clearly illustrated the need for quality filtering and chimera checking of 454 data of variable regions, as illustrated in Fig. 2. Without filtering, the diversity of the 21 species (18 operational taxonomic units) in the community is estimated to number in the hundreds. Following quality filtering and chimera removal, the community richness is only a few-fold higher than expected, especially for the V1-V3 and V3-V5 regions of the 16S rRNA gene.

The HMP mock community also revealed examples of misclassification, poor classifiability (lowest in V6-V9), and unexplained overrepresentation of some genera (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). This mock community has been used to evaluate how quality filtering impacts taxonomic classification of reads generated on the Illumina platform (Bokulich et al. 2013), and a modified version has been used to develop a dual-index method for 16S amplicon sequencing on the Illumina MiSeq platform (Kozich et al. 2013).

Another type of mock DNA community is a set of plasmid clones of nearly full-length 16S rRNA gene fragments (Wu et al. 2010). Here, PCR amplicons from Clostridium difficile, Bacteroides fragilis, S. pneumoniae, Desulfovibrio vulgaris, Campylobacter jejuni, Rhizobium vitis, Lactobacillus delbrueckii, E. coli, Treponema sp., and Nitrosomonas sp. were cloned into pTOPO vectors and then used to create even and staggered DNA mixtures, which were then used as templates for 454 sequencing of the V1-V2 regions of the 16S rRNA gene. The authors report that correct proportions of input 16S rDNA sequence type were recovered following 454 sequencing and analysis, although different polymerases used for replicates of the staggered community gave slightly different results. Use of cloned 16S rRNA genes as controls is convenient, although genes on
supercoiled high copy number plasmids are not likely to be good surrogates for chromosomal ribosomal genes.

Summary

DNA mock communities have identified problems with use of “universal” 16S rRNA gene primers for amplification of variable regions for microbiome sequencing and have revealed flaws in taxonomic classification systems, where known sequences were classified incorrectly (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). They have also shown the critical requirement for stringent read quality filtering and chimera removal of 16S rDNA sequencing reads, which has helped to reduce estimates of the size of the “rare biosphere” of the human microbiome.

Mock communities of cells have proved valuable as controls for development of uniform DNA extraction methods for microbiome samples, and DNA mixtures continue to be important as calibrators for 16S rDNA and metagenomic sequencing on changing high-throughput platforms. Neither type of mock community is perfect. Cell mixtures are easily contaminated; they may have incorrect cell counts due to clumping, dead cells, or the presence of bacteriophage; and they are limited to species that can be grown without difficulty. DNA mixtures may also be contaminated, so it is important to validate the purity of the preparations prior to mixing, and mixtures based on 16S rDNA copy number may be skewed if calculations or assumptions are incorrect, particularly if the genomes of the input DNAs are not finished. Cell communities are plagued with the same issues of amplification bias and misclassification discovered using DNA communities.

Despite the flaws inherent in mock communities, they are useful as a uniform benchmark for microbiome and metagenome technology development and evaluation. The concept could be expanded to include mock communities of viruses and fungi. Further, one could imagine developing mock communities composed of different types of molecules such as RNAs or peptides or known components of the metabolome as being useful controls for microbiome work.

Cross-References

▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Extraction Methods, Variability Encountered in

References

Bokulich NA, Subramanian S, Faith JJ, et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods. 2013;10:57–9.
Diaz PI, Dupuy AK, Abusleme L, et al. Using high throughput sequencing to explore the biodiversity in oral bacterial communities. Mol Oral Microbiol. 2012;27:182–201.
Haas BJ, Gevers D, Earl AM, et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21:494–504.
Handelsman J, Rondon MR, Brady SF, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE. 2012;7:e39315.
Kozich JJ, Westcott SL, Baxter NT, et al. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013;79:5112–20.
Turnbaugh PJ, Quince C, Faith JJ, et al. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci USA. 2010;107:7503–8.
Willner D, Daly J, Whiley D, et al. Comparison of DNA extraction methods for microbial community profiling with an application to pediatric bronchoalveolar lavage samples. PLoS ONE. 2012;7:e34605.
Wu GD, Lewis JD, Hoffmann C, et al. Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags. BMC Microbiol. 2010;10:206.
Yuan S, Cohen DB, Ravel J, et al. Evaluation of methods for the extraction and purification of DNA from the human microbiome. PLoS ONE. 2012;7:e33865.
M 504 Molecular Ecological Network of Microbial Communities
Molecular Ecological Network of Microbial Communities, Fig. 1 The study of microbial ecology from species
richness and diversity to interaction network
explaining the ecological network structures, dynamics, and mechanisms has become an essential part of ecology. However, studies of interactions among microbial species are much more difficult than those in macro-ecology, mainly because of the extremely high species diversity of microbial communities. In addition, most natural microbial species are uncultivable and invisible to the naked eye, which makes it more challenging to define network structure in a microbial community. Here, the definition of the phylogenetic molecular ecological network (pMEN) for microbial communities, network inference, and the common network properties are introduced first, followed by several key ecological questions that can be addressed through network analysis.

Phylogenetic Molecular Ecological Network for Microbial Communities

Owing to technical innovations in molecular biology, modern microbial taxonomy often relies on phylogenetic molecular markers, such as ribosomal RNA (rRNA) genes or some highly conserved coding genes (e.g., nifH, amoA, gyrB). In microbial diversity surveys, consequently, the operational taxonomic unit (OTU) is used to delimit microbial taxa by the similarity of those sequences (Achtman and Wagner 2008). Each OTU then represents a certain taxon, such as a species or a genus. The composition and diversity of microbial communities are therefore based on molecular OTUs rather than individual species. Recently, due to the rapid development of high-throughput sequencing technology, large numbers of microbial diversity surveys have been carried out in various environmental habitats through small subunit (SSU) rRNA sequencing projects. These massive, community-wide, replicated metagenomic data provide unprecedented opportunities to infer the interaction networks in microbial communities (Raes and Bork 2008).

As a result, an ecological network generated from metagenomic data really reflects the
Molecular Ecological Network of Microbial Communities, Fig. 2 The common steps of molecular ecological network (MEN) analysis. Two major parts are included: network inferences and network analyses. In each of them, several key steps are listed
relationships among molecular OTUs. Therefore, such molecule-based ecological networks in microbial communities are referred to as molecular ecological networks (MEN) (Zhou et al. 2010). The networks derived from functional gene markers are referred to as functional molecular ecological networks (fMEN) (Zhou et al. 2010), and those based on phylogenetic gene markers as phylogenetic molecular ecological networks (pMENs) (Zhou et al. 2011).

Network Inference Approach

For metagenomic data analysis, the abundance of each gene marker in a sample is measured by the number of sequences for sequencing data or by hybridization signal intensity for microarray data. The resulting gene richness and abundance are then used to describe the composition and structure of the microbial community. Based on such experimental data, a network graph can be constructed to illustrate the interactions of different gene markers (species) (Fig. 2). Constructing the connection diagram from the behavior of its components is known as network inference or reverse engineering (Faust and Raes 2012).

Various approaches for network inference have been developed and widely used in both genomic biology and ecology (Barabasi and Oltvai 2004; Faust and Raes 2012). Based on their mathematical algorithms, they can be classified into Bayesian network, relevance network, and ordinary and partial differential equation methods [reviewed by De Jong (2002)]. In addition, some graph theory-based methods were recently developed (Kramer et al. 2009). Among them, the relevance network method is the most commonly used approach due to its simple calculation procedure and high noise tolerance (Deng et al. 2012). In the relevance network method, a similarity is first measured between each pair of OTUs. This similarity measurement can be a Pearson, Spearman, biweight, or jackknife correlation, or mutual information (Hardin et al. 2007).

For network inference, another critical step is to identify a true link (Fig. 2). The key question is how similar two nodes must be for a link to be considered true. The most commonly used way to choose the similarity threshold is based on biological knowledge: some true interactions are confirmed by previous experimental discovery, and the similarity values of those interactions are then used to determine the threshold for other links. A network constructed with such an arbitrary threshold is subjective rather than objective (Barabasi and Oltvai 2004). There are also several systematic ways of determining the similarity threshold,
such as the significance level of correlation (p value), false discovery rate (FDR), permutation test, and random matrix theory (RMT)-based methods. Among them, p value-based and permutation test-based methods give the least strict threshold and lead to a large number of links in a messy network that may resemble a random network. The FDR-based method has the strictest threshold, which could generate a sparse network in which many true interactions are ignored. The RMT-based algorithm has advantages in this step (Luo et al. 2007). This method is able to identify a threshold automatically, based on the inherent properties of the similarity matrix. The results indicated that it robustly reveals meaningful relationships in high-throughput data in both genomics and ecology (Luo et al. 2007; Zhou et al. 2010, 2011).

Network Properties

After the species interactions have been inferred, many pMENs are formed for the communities in different habitats, such as soil, ocean, groundwater, and human guts (Deng et al. 2012; Faust and Raes 2012). Several common topological properties, such as small world, scale-free, and modularity (Table 1), were also observed in all kinds of pMENs, as in other biological networks from food webs in macro-ecology to complex regulation networks in molecular biology. These common network properties are important for the robustness and stability of complex systems (Barabasi and Oltvai 2004; Kitano 2004; Zhou et al. 2010, 2011).

The scale-free property is one of the most notable characteristics of complex systems. It describes the finding that most nodes in a system have few directly linked nodes (neighbors), while a few nodes have a large number of neighbors (Table 1). It implies that the roles of species in the microbial community might be quite different. A few microbial species could be generalists with higher connectivity, which are inclined to have closer relationships with environmental traits than other species (Zhou et al. 2011; Deng et al. 2012).

The “small world” property means that any two nodes in a network can be connected by passing through only a few linked neighbors (Table 1). The term originates in sociology, from the idea of six degrees of separation between us and everyone else on this planet. This property usually reflects the efficiency of a system and may be valuable for microbial communities. In a small-world community, energy, materials, and information can be easily transported within the entire system. In the microbial community, this characteristic drives efficient communication among different members so that relevant responses to environmental changes can be made rapidly (Zhou et al. 2011; Deng et al. 2012).

The modularity property means that a network can be decomposed into sub-networks, also called modules, according to its structure (Table 1). Each module in gene regulatory networks is considered a functional unit, which consists of several elementary genes and performs an identifiable task (Luo et al. 2007). Modularity in an ecological community may reflect habitat heterogeneity, physical contact, functional association, divergent selection, and/or phylogenetic clustering of closely related species (Olesen et al. 2007; Zhou et al. 2010). Microorganisms in the same module could also have similar ecological niches (Zhou et al. 2011; Faust and Raes 2012).

Besides these three common properties, there are many other topological indexes that could be used to measure the organization and structure of microbial networks, such as clustering coefficient, hierarchy, density, transitivity, and connectedness [for definitions and descriptions, see Deng et al. (2012)]. All these could become valuable indexes of microbial structure for studies of microbial ecology.

Network Interpretation Aspects

Once the network graph is drawn, we should disclose the ecological meanings behind this structure. Several key ecological questions can be addressed through network analysis procedures (Fig. 2).
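The relevance network procedure described under Network Inference Approach (pairwise similarity between OTUs, followed by a similarity threshold that decides which links are kept) can be sketched as follows. This is a minimal illustration, not the pipeline of any cited study: Pearson correlation is only one of the similarity measures named in the text, the abundance values are made up, and the threshold of 0.9 is an arbitrary assumption rather than one derived from random matrix theory.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def relevance_network(abundance, threshold):
    """Link two OTUs when the absolute correlation of their profiles reaches the threshold.

    abundance: dict mapping OTU name -> list of abundances across samples.
    Returns an adjacency dict mapping OTU name -> set of linked OTUs.
    """
    otus = sorted(abundance)
    links = {otu: set() for otu in otus}
    for i, a in enumerate(otus):
        for b in otus[i + 1:]:
            if abs(pearson(abundance[a], abundance[b])) >= threshold:
                links[a].add(b)
                links[b].add(a)
    return links


# Hypothetical abundances of four OTUs across five samples
abundance = {
    "OTU1": [10, 20, 30, 40, 50],
    "OTU2": [12, 19, 33, 41, 48],
    "OTU3": [50, 40, 30, 20, 10],
    "OTU4": [5, 30, 8, 27, 11],
}
network = relevance_network(abundance, threshold=0.9)
for otu in sorted(network):
    print(otu, sorted(network[otu]))
```

Using the absolute value keeps both positive and negative associations as links; a study interested only in co-occurrence would drop the `abs`.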
Molecular Ecological Network of Microbial Communities, Table 1 The most commonly used topological indexes and properties for complex networks

Network property: Connectivity
Mathematic measurement: $k_i = \sum_{j=1}^{m} a_{ij}$, where $m$ is the number of all neighbors (linked nodes) of node $i$ and $a_{ij}$ is the strength between nodes $i$ and $j$. For the unweighted network, $k_i$ equals the number of neighbors
Ecological implication: It was used to describe the number of interactions of each node, also named as node degree. In most complex systems, the nodes with the highest connectivity always played crucial roles and were usually considered as network centers. In pMEN, the study found that nodes with higher connectivity were inclined to have closer relationships with environmental traits (Zhou et al. 2011; Deng et al. 2012)

Network property: Scale-free
Mathematic measurement: $P(k) \sim k^{-\gamma}$, where $P(k)$ is the number of nodes with $k$ degrees, $k$ is connectivity, and $\gamma$ is a constant
Ecological implication: In most cases, the connectivity distributions of pMEN and other complex systems follow this power law, indicating most nodes in a network have few neighbors, while few nodes have large amount of neighbors. This phenomenon suggests the most species in the communities are peripherals, but a few of the species could be generalists and play more important roles than others

Network property: Small world
Mathematic measurement: $GD = \frac{\sum_{i \neq j} d_{ij}}{n(n-1)}$. GD is the abbreviation of the average geodesic distance, where $d_{ij}$ is the shortest path between nodes $i$ and $j$, and $n$ is the total number of all nodes
Ecological implication: A smaller GD means all the nodes in the network are closer, indicating each two nodes in the network could be connected by a small number of acquaintances, and so-called small-world network. Most pMENs are small-world network, which imply that the energy, materials, and information can be easily transported through entire systems (Deng et al. 2012)

Network property: Modularity
Mathematic measurement: $M = \sum_{b=1}^{N_M} \left[ \frac{l_b}{L} - \left( \frac{K_b}{2L} \right)^2 \right]$, where $N_M$ is the number of modules in the network, $l_b$ is the number of links among all nodes within the $b$th module, $L$ is the number of all links in the network, and $K_b$ is the sum of degrees (connectivity) of nodes which are in the $b$th module
Ecological implication: Modularity property was used to demonstrate a network which could be naturally divided into subcommunities, so-called modules. A modularity value can be calculated by Newman’s method (Newman 2006) whose value is between 0 and 1. Modularity in an ecological community may reflect habitat heterogeneity, physical contact, functional association, divergent selection, and/or phylogenetic clustering of closely related species (Olesen et al. 2007; Zhou et al. 2010)
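Three of the indexes in Table 1 (connectivity, average geodesic distance, and modularity) can be computed directly from an unweighted network. The sketch below is illustrative only: the six-node network and its two modules are hypothetical, the network is assumed to be connected, and the module partition is supplied by hand rather than found by Newman's method, which searches for the partition that maximizes M.

```python
from collections import deque


def connectivity(adj):
    """Node degree k_i for an unweighted network given as node -> set of neighbors."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def shortest_paths_from(adj, source):
    """Breadth-first search returning the shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def average_geodesic_distance(adj):
    """GD = sum of d_ij over all ordered pairs i != j, divided by n(n-1)."""
    n = len(adj)
    total = 0
    for node in adj:
        dist = shortest_paths_from(adj, node)
        if len(dist) != n:
            raise ValueError("network is not connected")
        total += sum(dist.values())
    return total / (n * (n - 1))


def modularity(adj, modules):
    """M = sum over modules b of [ l_b / L - (K_b / (2L))^2 ] for a given partition."""
    total_links = sum(len(neighbors) for neighbors in adj.values()) / 2
    m = 0.0
    for module in modules:
        # every within-module link is seen from both of its ends, hence the division by 2
        l_b = sum(1 for u in module for v in adj[u] if v in module) / 2
        k_b = sum(len(adj[u]) for u in module)
        m += l_b / total_links - (k_b / (2 * total_links)) ** 2
    return m


# Hypothetical network: two triangles joined by a single link (C-D)
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
degrees = connectivity(adj)
gd = average_geodesic_distance(adj)
mod = modularity(adj, [{"A", "B", "C"}, {"D", "E", "F"}])
print(degrees, gd, mod)
```

For this toy network, the two nodes joining the triangles have the highest connectivity, and treating each triangle as a module gives a positive modularity, as expected for a network with clear sub-structure.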
ingredients. Sweet potato mashes contain 10^4–10^8 CFU/ml of LAB and rice or barley mashes contain 10^4–10^5 CFU/ml or less. Lactobacillus brevis, L. fermentum, L. helveticus, L. hilgardii, L. kefiri, L. nagelii, L. paracasei, L. pentosus, L. plantarum, Leuconostoc mesenteroides, Leuc. citreum, Leuc. lactis, Lactococcus lactis, Enterococcus faecium, Pediococcus pentosaceus, and W. confusa/cibaria have been found in alcoholic fermentation mashes made from sweet potato (Endo 2005; Endo and Okada 2005b). Lactobacillus satsumensis is a novel species found in the mashes (Endo and Okada 2005a). Of these species, W. confusa/cibaria is the most frequently observed. Several species found in this fermentation are also seen in wine fermentation. This might be due to similar harsh environments (high alcohol content and low pH) in the fermentation of the two alcoholic beverages. Mashes made from rice or barley contain poorer LAB diversity than those made from sweet potato.

An interesting DNA sequence, which was characterized as uncultured Leuconostoc sp., was found in yeast-seed by DGGE profile. BLAST analysis of the sequence revealed low similarities (below 95 %) against known Leuconostoc spp. but high similarities (99.3 %) against uncultured Leuconostoc spp. (accession nos. EU469745 and AJ405013) (Endo 2005, 2011), suggesting the presence of unknown LAB in shochu mashes. The population of the organism is approximately 10^8 CFU/ml as determined by qPCR. Because of its predominance in the yeast-seed, it might be an acid-tolerant Leuconostoc sp., although Leuconostoc spp. are known to be acid sensitive.

Most LAB strains found in shochu mashes were resistant to 10–15 % (v/v) of alcohol, and, moreover, they were able to grow at pH 3.5 (Endo 2011). Very few strains were able to grow at pH 3.0. Most of the strains metabolize citrate in the presence of glucose. These characteristics suggest that LAB seen in shochu mashes have adapted to their habitat. Citrate metabolism by LAB produces several aroma compounds, including diacetyl, acetoin, and acetic acid. These compounds have both positive and negative impacts on the quality of the final product. Proper management of LAB might therefore yield shochu of better quality.

Summary

Shochu is a Japanese traditional distilled spirit made from starchy materials. During the fermentation, Aspergillus spp. saccharify the ingredients and Saccharomyces cerevisiae carries out alcoholic fermentation. Aspergillus spp. produce large amounts of citric acid during the fermentation and preserve the fermentation from spoilage. LAB generally have a low population and poor diversity at the beginning of fermentation (yeast-seed stage), but their population and diversity increase at the later stage (alcoholic fermentation stage). W. confusa/cibaria, Lactobacillus spp., and Leuconostoc spp. are usually seen in the fermentation. Such LAB have characteristics that allow them to survive in alcoholic and acidic environments, suggesting that LAB have adapted to their habitat.

Cross-References

▶ Culturing
▶ Evaluating Putative Chimeric Sequences from PCR-amplified Products
▶ Phylogenetics, Overview

References

Endo A. Lactic acid bacterial diversity during shochu fermentation. PhD thesis, Tokyo University of Agriculture; 2005.
Endo A. Diversity of lactic acid bacteria in fermented products. Jpn J Lactic Acid Bact. 2011;22:87–92.
Endo A, Okada S. Lactobacillus satsumensis sp. nov., isolated from mashes of shochu, a traditional Japanese distilled spirit made from fermented rice and other starchy materials. Int J Syst Evol Microbiol. 2005a;55:83–5.
Endo A, Okada S. Monitoring the lactic acid bacterial diversity during shochu fermentation by PCR-denaturing gradient gel electrophoresis. J Biosci Bioeng. 2005b;99:216–21.
MRL and SuperFine+MRL

Tandy Warnow
Institute for Genomic Biology, University of Illinois, IL, USA

Synonyms

Phylogeny = phylogenetic tree = tree; MRL = matrix representation with likelihood; MRP = matrix representation with parsimony

Introduction

The estimation of evolutionary trees is one of the basic challenges in biology (Felsenstein 2003), but current methods have great difficulties with large datasets, often due to computational issues. For example, methods like maximum likelihood (ML) and maximum parsimony (MP) are highly accurate techniques when they can be properly run, but both are NP-hard (a technical term that has the consequence that exact algorithms are not likely to be found except through exhaustive search techniques). As a result, ML and MP analyses on large datasets either cannot be run at all, or take a very long time to run, or return poor results. Since most accounts of the number of species suggest that the Tree of Life itself will involve many millions of species, truly large-scale phylogenetic estimation is beyond the reach of current methods. Instead, an alternative approach has been proposed, in which different research groups would calculate phylogenetic trees on subsets of the species set and then these trees would be combined into a tree on the full dataset. The techniques that combine trees into

steps: first the input source trees are each represented by a matrix over {0,1,?}, where each row represents a species and each column represents a branch in the source tree. These matrices are then concatenated together to form the “MRP matrix.” Finally, this matrix is analyzed using maximum parsimony heuristics, where maximum parsimony is the NP-hard optimization problem that seeks to find a tree on the species set with the smallest number of total changes.

In Swenson et al. (2011), MRP was compared to a collection of other supertree methods and found to be the most reliable with respect to accuracy and ability to analyze large datasets. However, that study also showed that the Quartets MaxCut (QMC) method developed by Snir and Rao (2012) was more accurate than MRP for those datasets on which QMC was able to run. An interesting variant on MRP was developed by Nguyen et al. (2012), in which the MRP matrix was analyzed under maximum likelihood, using a symmetric two-state model. This method, called MRL for “matrix representation with likelihood,” was shown to be more accurate than MRP on simulated datasets.

Thus, while MRP remains the most frequently used supertree method, MRL and QMC are new supertree methods that offer some advantages over MRP; furthermore, new supertree methods continue to be developed.

In Swenson et al. (2012), a new technique was developed called “SuperFine.” This is a meta-method that can be used with any supertree method (e.g., MRP, MRL, QMC, etc.), to produce a modified supertree method. For example, when SuperFine is used with MRP, it is referred to as SuperFine+MRP, and when it is used with MRL, it is referred to as SuperFine+MRL. SuperFine has two steps. The first step computes
a tree on the full taxon set are called “supertree a “strict consensus merger” (SCM) tree from the
methods,” and the resultant large tree is called set of input trees, where the SCM tree contains
a “supertree.” many high degree nodes (“polytomies”). The
There are many supertree methods (surveyed second step refines this tree by using the base
in Bininda-Emonds 2004), but matrix representa- method to refine each polytomy. The refinement
tion with parsimony (MRP), developed in Baum around each polytomy is performed by encoding
(1992) and Ragan (1992), is the most well known each of the source trees on a new leafset {1. . .d},
and most frequently used. MRP operates in two where d is the degree of the polytomy; these
smaller source trees are then passed to the base supertree method, which computes a supertree on {1, ..., d}, and this supertree replaces the polytomy. The refinements around the polytomies can be performed in parallel since they are independent. Hence, the second step is not only very fast, but very easily parallelized. In Swenson et al. (2012), they showed that SuperFine+MRP gave much more accurate trees than MRP and was also much faster. They also compared SuperFine+QMC and QMC and showed similar improvements. Finally, Nguyen et al. (2012) compared SuperFine+MRL and MRL and showed similar improvements. Thus, SuperFine is a method that can improve supertree methods.

MRL and SuperFine+MRL, Fig. 1 We present tree error and running times (in minutes) for supertree methods on ten replicates of 1,000-taxon datasets. The method given parenthetically indicates the heuristic used to solve MRP or MRL (e.g., PAUP* for MP and FastTree-2 (Price et al. 2010) or RAxML for ML). The scaffold density refers to the percentage of the taxa that are in the “scaffold” dataset. We show standard error for the missing branch rates, and the standard deviation for the running times. Averages are computed for those replicates with sufficient taxonomic overlap to perform an accurate supertree analysis: n = 10 for all scaffold densities except n = 7 for the 20% scaffold density and n = 9 for 50% scaffold density (reproduced (with permission from the publisher) from Nguyen et al. (2012, 7:3))

A comparison between these different methods (SuperFine+MRP, SuperFine+MRL, MRP, and MRL) is shown in Fig. 1. The experiment involves gene trees that evolve within a species tree under a birth-death process, and so may not contain all the taxa; however, some genes are universal and so contain all the taxa. These genes are then used to evolve sequences under different sequence evolution models. There are two types of gene trees: “clade-based” gene trees that are restricted to clades in the species tree and “scaffold” trees that are used to link the clade-based trees together. The scaffold trees are produced by random sampling of the taxa and then using the universal genes to construct a source tree. Thus, the clade-based source trees contain a subset of the taxa in a clade, while the scaffold trees contain a random subset of the taxa, but may, in some cases, contain all the taxa. The scaffold density refers to the percentage of the taxa in the scaffold tree. Scaffold source trees for the supertree problem are produced by selecting a scaffold density and then concatenating the alignments from the universal genes on the randomly selected scaffold taxa and computing a maximum likelihood tree on the concatenated alignment. Similarly, the clade-based source trees are computed by selecting a clade and then finding the genes that provide the best coverage for that clade (from the clade-based genes), concatenating the alignments, and computing a maximum likelihood tree on the concatenated alignments. Finally, supertrees are computed on the source trees using MRP, MRL, SuperFine+MRP, and SuperFine+MRL. The resultant species trees are then compared to each other with respect to the missing branch rate and running time. Figure 1 shows results obtained on 1,000-taxon simulated datasets and demonstrates that SuperFine+MRP
and SuperFine+MRL provide the best accuracy of all methods and are much faster than the other methods. It also shows that MRL outperforms MRP with respect to accuracy under low scaffold density conditions. Finally, MRP is “solved” using MP heuristics in PAUP* (Swofford 2003), while MRL is “solved” using either FastTree-2 (Price et al. 2010) or RAxML (Stamatakis 2006). Note that the choice of ML heuristic has an impact on the running time and accuracy of the MRL method. Note also that SuperFine+MRL(FastTree) matches the accuracy of SuperFine+MRP(PAUP*) but is much faster.

Summary

The construction of a large phylogeny, potentially spanning the Tree of Life, is considered to be one of the hardest computational problems in biology. A central approach to this problem involves using supertree methods, which combine source trees, each on a subset of the species, into a tree on the full set of species. While many supertree methods have been developed, MRP (matrix representation with parsimony) is the most well known and most frequently used supertree method. However, newer supertree methods, including MRL (matrix representation with likelihood) and QMC (Quartets MaxCut), have been introduced that provide accuracy comparable to or better than that of MRP. Finally, a new technique for “boosting” supertree methods has been developed. This method, called “SuperFine,” operates in two steps, where the first step constructs a consensus tree from the source trees and the second step uses the base supertree method to refine the consensus tree. Simulations show that SuperFine improves the accuracy of MRP, MRL, and QMC while also reducing the running time used by these methods. Thus, SuperFine is a general purpose meta-method for improving supertree estimation.

Funding

This work was supported by NSF grant DEB 0733029 to T.W.

References

Baum BR. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon. 1992;41:3–10.
Bininda-Emonds O, editor. Phylogenetic supertrees: combining information to reveal the tree of life. Dordrecht: Kluwer Academic Publishers; 2004.
Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2003.
Nguyen N, Mirarab S, Warnow T. MRL and SuperFine+MRL: new supertree methods. Algorithms Mol Biol. 2012;7:3.
Price M, Dehal P, Arkin A. FastTree 2 – approximately maximum likelihood trees for large alignments. PLoS ONE. 2010;5:e9490.
Ragan MA. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol. 1992;1:53–8.
Snir S, Rao S. Quartet MaxCut: a fast algorithm for amalgamating quartet trees. Mol Phylogenet Evol. 2012;62:1–8.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Swenson MS, Suri R, Linder CR, et al. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol Biol. 2011;6:7.
Swenson M, Suri R, Linder CR, et al. SuperFine: fast and accurate supertree estimation. Syst Biol. 2012;61:214–27.
Swofford DL. PAUP*: phylogenetic analysis using parsimony (*and other methods), version 4. Sinauer Associates. 2003.
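
As an illustration of the MRP encoding step described in the Introduction of this entry (one row per species, one column per branch of each source tree, entries in {0,1,?}), the following is a minimal Python sketch. It is not the implementation used in the cited studies. It assumes rooted source trees written as nested tuples of taxon names, an illustrative representation chosen here, and it codes each non-trivial clade as one column: 1 for taxa inside the clade, 0 for other taxa of that source tree, and ? for taxa absent from that source tree. The function names `clusters` and `mrp_matrix` are hypothetical.

```python
def clusters(tree):
    """Return (leaf set, list of non-trivial clades) for a rooted tree.

    A tree is either a taxon name (str) or a tuple of subtrees.
    This nested-tuple format is an assumption made for illustration.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
        if len(child_leaves) > 1:  # an internal branch defines a clade
            found.append(child_leaves)
    return leaves, found


def mrp_matrix(source_trees):
    """Concatenate the per-tree {0,1,?} matrices into one MRP matrix."""
    taxa = sorted(set().union(*(clusters(t)[0] for t in source_trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in source_trees:
        leaves, tree_clusters = clusters(tree)
        for cluster in tree_clusters:  # one matrix column per branch
            for taxon in taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this tree
                elif taxon in cluster:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


if __name__ == "__main__":
    trees = [(("A", "B"), ("C", "D")), (("B", "C"), "E")]
    for taxon, row in mrp_matrix(trees).items():
        print(taxon, row)
```

The resulting character matrix is what would then be handed to a parsimony heuristic (for MRP) or a two-state likelihood heuristic (for MRL); that analysis step is left to external software and is not sketched here.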
New Computational Methodologies to Understand Microbial Diversity

by BLAST. It is not suitable to make discoveries of a protein family with different subcellular localizations. The third type of methods builds machine learning models (e.g., support vector machine) and predicts protein localization using features, such as amino acid/dipeptide compositional bias, physicochemical properties of amino acids, and others. Since these sequence features are derived from whole protein sequences, most algorithms in this category are minimally affected by the incompleteness of peptide sequences. Examples are CELLO (Lu et al. 2004), SUBLOC (Hua and Sun 2001), and PSLDOC (Chang et al. 2008). Only the third approach is useful in the case of metagenomic peptides which are often fragmentary.

Because all algorithms have their own bias, the predictions from individual algorithms in the third category are frequently inconsistent. This is related to the fact that sorting signals targeting different subcellular locations usually share some similarities. For example, sorting signals targeting the periplasm and outer membrane both have N-terminal positively charged regions. In this case, prediction algorithms usually have some ambiguity for distinguishing these neighboring compartments. When an algorithm predicts a protein as a periplasmic protein with the highest confidence, it also implies that the protein has a probability of being located in its neighboring compartments, including the cytoplasm, inner membrane, outer membrane, and extracellular space, with higher probability assigned to the locations closest to the periplasm. Indeed, neighboring compartments are usually reported as suboptimal predictions by the component algorithms (CELLO, SUBLOC, and PSLDOC). The MetaP algorithm proposed recently considers neighborhood relations among subcellular localizations and also suboptimal predictions. It thus has the benefit of resolving conflicting predictions by the base algorithms and achieves higher precision and accuracy of prediction (Luo et al. 2009).

The predicted location of MetaP for a sequence s is the one that has the maximum sum of weighted voting for that subcellular localization. The prediction can be denoted formally as $P_s = \arg\max_i \sum_{j=1}^{N} P(i,j)$, where N is the total number of base predictors and i is the index of a predicted subcellular compartment: cytoplasmic (i = 1), cytoplasmic membrane (i = 2), periplasmic (i = 3), outer membrane (i = 4), and extracellular (i = 5). P(i, j) denotes the voting weight of the prediction of the jth element predictor for compartment i. It is defined as $P(i,j) = \sum_{K=0}^{M_j} 2^{-|C_K - i|} W_K$, where $M_j$ is the number of predictions of the jth predictor. It means that the voting weight of a prediction by the jth predictor for compartment i depends on the offset of the index $C_K$ of its predicted class with regard to the index i as well as its normalized score $W_K$. The voting weight $W_K$ for the Kth prediction is defined on the basis of its relative score by comparison with all other predictions made by this algorithm. Because raw scores of predictions from different component base algorithms are not directly comparable, the raw score $S_K$ is converted into a normalized probability $p(K) = p(S \le S_K)$ by calculating the percentage of predictions with lower raw scores among all predictions for a given algorithm. $W_K$ is then defined as $W_K = p(K)$.

The performance of MetaP and the component algorithms was evaluated using sets of testing sequences whose localizations were verified by experiments (Menne et al. 2000). For the purpose of testing the accuracy of fragmentary protein prediction, the N-terminus of each testing sequence was removed. This benchmark test showed that MetaP makes more accurate predictions of fragmentary peptide sequences than any component method.

MetaP was applied to several protein families of alkaline phosphatases using the Global Ocean Sampling (GOS) metagenomic data sets (Luo et al. 2009). Alkaline phosphatases are major hydrolytic enzymes of organic phosphoesters, which are the dominant forms of dissolved organic phosphorus in the ocean and provide an important source to meet bacterial phosphorus requirements. It was thought that marine bacterial alkaline phosphatases are exclusively
ectoenzymes. However, MetaP predicted that about 40 % of the alkaline phosphatases are located in the cytoplasm (Fig. 1). Further bioinformatic analysis suggested that the cytoplasmic alkaline phosphatases might play a role in hydrolyzing the imported small organic phosphorus compounds. In addition, application of MetaP to a metatranscriptomics data set showed diel variations in the fraction of transcripts encoding inner membrane and periplasmic proteins compared to cytoplasmic proteins (Fig. 2), suggesting a close coupling of photosynthetic extracellular release and bacterial consumption (Luo 2012).

New Computational Methodologies to Understand Microbial Diversity, Fig. 1 Subcellular localization distributions of APases recovered from the GOS metagenomic database (figure adapted from Luo et al. 2009)

An Evolutionary Genetic Method to Classify Metagenomic Reads Taxonomically

Metagenomic DNA represents genetic potential of the microbial community in an environment. Due to its unbiased nature, a majority of a metagenomic sample consists of DNA from those abundant microbial lineages. Therefore, it provides raw material for studying genome content of the abundant taxa in nature. On the other hand, metagenomic DNA is a mixture from all microbes in the sample, making it difficult to study genome content of a specific microbial lineage in a systematic way. It is therefore important to develop high-throughput computational approaches to systematically classify metagenomic genes taxonomically. This would lead to an improved understanding of the ecological functions of the abundant taxa in nature.

Definitively assigning sequences from diverse metagenomic data sets to taxonomic groups is problematic, however. Most applications rely on BLAST-based (Altschul et al. 1997) identification of best hits to an annotated sequence database. While the BLAST best hit approach is easy to use, its accuracy is decidedly influenced by the composition of the annotated database. Thus, a substantial fraction of best BLAST hits may not be the closest relatives phylogenetically, an issue that is exacerbated when taxonomic groups are not evenly represented in the database (Koski and Golding 2001). A second type of methods employs machine learning principles to classify metagenomic reads based on the nucleotide sequence characteristics (McHardy et al. 2007). These methods are also subject to the high
false-positive issue, which cannot meet the needs of many ecological studies.

New Computational Methodologies to Understand Microbial Diversity, Fig. 2 Differential gene expression in protein subcellular localizations between day and night in surface waters of North Pacific Subtropical Gyre. The letter above the bar indicates the significance level: (a), P < 0.001; (b), P < 0.05 (figure adapted from Luo 2012)

A bioinformatic approach was recently developed to assign metagenomic gene fragments to taxonomic groups by computing evolutionary distances of protein-coding DNA sequences (Luo et al. 2012). In a protein-coding DNA sequence, point mutation occurs both in synonymous sites which do not change the corresponding amino acid sequence and in non-synonymous sites which change the encoded amino acids. Thus, the evolutionary distances of protein-coding DNA sequences can be represented using synonymous ($d_S$) and non-synonymous ($d_N$) substitution rates. More specifically, $d_S$ is the number of synonymous substitutions per synonymous site, and $d_N$ is the number of non-synonymous substitutions per non-synonymous site. Since synonymous mutations are largely invisible to natural selection, synonymous sites are easily saturated with substitutions. In contrast, most non-synonymous mutations are deleterious, and many of them have been eliminated by purifying selection. Thus, $d_N$ is much smaller than $d_S$ in a vast majority of genes (Luo and Hughes 2012). Often, marine microbial ecologists are interested in highly diverged lineages (e.g., Roseobacter, SAR11, Vibrio, Prochlorococcus). At this level of divergence, the synonymous sites are saturated with substitutions among some members of the lineage (Luo and Hughes 2012). Therefore, $d_N$ is used to measure the evolutionary distances of protein-coding genes. The $d_N$ pipeline assigns a metagenomic gene to a microbial clade (e.g., the marine Roseobacter clade) based on the requirement that the mean evolutionary distance between a metagenomic gene and each of the reference orthologous genes from the clade members is smaller than the mean of all pairwise comparisons among the reference orthologous genes in that clade. Mathematically, the requirement can be expressed using

$$\frac{\sum_{1}^{n} d_{N,\mathrm{ref\text{-}meta}}}{n} < \frac{\sum_{1}^{\binom{n}{2}} d_{N,\mathrm{ref\text{-}ref}}}{\binom{n}{2}},$$

in which n is the number of reference orthologous genes, $d_{N,\mathrm{ref\text{-}meta}}$ is $d_N$ between a reference gene and the metagenomic gene fragment, and $d_{N,\mathrm{ref\text{-}ref}}$ is $d_N$ between two reference genes.

The $d_N$ pipeline takes in alignments, each consisting of reference orthologous genes belonging to the core genome of a monophyletic microbial clade and one metagenomic gene fragment with unknown taxonomic affiliation. Identification of putative gene fragments from metagenomic reads requires in silico translation of the reads in six reading frames and then selection of all fragments with a certain minimal length (e.g., 60 amino acids) between stop
codons. Then, BLAST identifies a set of putative metagenomic gene fragments that are homologous to the reference genes (Fig. 3). Each of the homologous metagenomic gene fragments will be aligned to the reference genes at the amino acid level, and the DNA sequences are imposed on the alignment. Next, the PAML software (Yang 1997) computes dN for each pairwise comparison in the DNA alignment.

New Computational Methodologies to Understand Microbial Diversity, Fig. 3 A flowchart of preprocessing steps for dN pipeline for high-confidence phylogenetic classification of metagenomic DNA fragments. The circles on the leftmost are Roseobacter genomes, in which pink-colored parts represent core genomes. The gene fragments in the GOS metagenome are categorized into three parts, in which pink colored are Roseobacter core genes, green colored are other bacterial genes homologous to the Roseobacter core genes, and blue colored are not homologous to the core genes which are not recovered by a BLAST similarity search. The dN pipeline is designed to filter out other bacterial genes in green, but a few true Roseobacter sequences are missing because of the conservative nature of the dN pipeline (figure adapted from Luo et al. 2012)

The output of the dN pipeline is a set of metagenomic gene fragments that are assigned to the microbial clade. Validation of the dN pipeline using phylogenetic analyses showed that the false-positive rate is smaller than 1 %. Since these classified metagenomic gene fragments are homologous to the core genomes of a microbial clade, which encode biological functions that are essential to basic cellular functionality, they are unlikely to provide valuable information about ecologically relevant processes. However, depending on the library design for sequencing, a read may be partnered with a paired-end read, both of which are from the same DNA molecule, and the paired-end read may carry an ecologically relevant gene. Thus, an important extension of the dN pipeline is to examine the paired end of the assigned reads. Here, the metagenomic gene fragment that is directly identified by the dN pipeline is named “anchor sequence,” and its paired-end read is named “mate pair sequence.” These assigned metagenomic genes are by no means a comprehensive list of genes affiliated with this microbial clade, since they can be only identified if they are core genes or physically linked to a core gene of that clade.

This whole procedure, including preprocessing, the dN pipeline, and the mate read analysis, was applied to assign metagenomic genes in the Global Ocean Sampling (GOS) data sets (Rusch et al. 2007) to the marine Roseobacter clade. The major finding is that the uncultivated Roseobacter populations differ systematically in several genomic attributes from their cultured representatives, including fewer genes for signal transduction and cell surface modifications but more genes for Sec-like protein secretion systems, anaplerotic CO2 incorporation, and phosphorus and sulfate uptake (Fig. 4). Several of these trends match well with characteristics previously identified as distinguishing r- versus K-selected ecological strategies in bacteria, suggesting that the r-strategist model assigned to cultured roseobacters may be less applicable to their free-living oceanic counterparts. Thus, genomic analyses of cultured roseobacters appear to be biasing our view of the lineage’s

New Computational Methodologies to Understand Microbial Diversity, Fig. 4 Differential representation of gene families in oceanic compared to cultured roseobacters (M versus A plot). Families plotting above the line are enriched, and those plotting below the line are depleted in the oceanic roseobacters. Non-gray symbols represent gene families with significant differential representation between the two metagenomes. Colors indicate gene families with similar functions: dark purple, anaplerotic CO2 incorporation; light purple, Sec secretion system; dark orange, signaling; light orange, nutrient transport; teal, antibiotic synthesis or resistance; maroon, C1 metabolism; dark green, cell surface properties; and light green, hypothetical. This plot shows differential representation for just one of three simulated metagenomic data sets that were constructed, all of which had congruent results (figure adapted from Luo et al. 2012)
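
Returning to the MetaP weighted-voting rule defined earlier in this entry, the following minimal Python sketch shows how the vote could be computed. It is an illustration under stated assumptions, not the published MetaP code: the exponent is taken to be negative, so that a prediction's weight halves with each step away from its predicted compartment (as the text on neighboring compartments implies), and the input format (for each base predictor, a list of `(compartment_index, normalized_weight)` pairs) and the name `metap_vote` are hypothetical.

```python
# Compartment indices follow the text: 1 cytoplasmic, 2 cytoplasmic membrane,
# 3 periplasmic, 4 outer membrane, 5 extracellular.

def metap_vote(predictions_by_predictor, n_compartments=5):
    """Return (best compartment index, per-compartment vote totals).

    predictions_by_predictor: one list per base predictor, each holding
    (predicted compartment index C_K, normalized weight W_K) pairs.
    This input layout is an assumption made for illustration.
    """
    totals = {}
    for i in range(1, n_compartments + 1):
        totals[i] = sum(
            2.0 ** -abs(c - i) * w  # weight decays with distance |C_K - i|
            for predictions in predictions_by_predictor
            for (c, w) in predictions
        )
    best = max(totals, key=totals.get)
    return best, totals


if __name__ == "__main__":
    # Three hypothetical base predictors, some with suboptimal predictions.
    votes = [
        [(3, 0.9), (4, 0.4)],
        [(4, 0.8), (3, 0.5)],
        [(3, 0.7)],
    ]
    best, totals = metap_vote(votes)
    print(best, totals)
```

In this toy input two predictors favor the periplasm and one favors the outer membrane; because neighboring compartments contribute half-weight votes to each other, the conflicting predictions still reinforce the periplasmic call.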

New Computational Methodologies to Understand Microbial Diversity, Fig. 5 Assumed sample proportions (denoted by circles) and 95 % confidence intervals (dashed lines). Y-axis: to get the probability, multiply the y-values by $10^{-5}$; X-axis: the number of assumed miscounted executor genes among $N_{130} = 327{,}741$ genes. The filled squares are the locations of zeros (figure adapted from Luo et al. 2011)
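
The dN assignment requirement stated earlier in this entry (mean dN between the metagenomic fragment and each reference gene must be smaller than the mean dN over all reference pairs) can be sketched in a few lines of Python. This is a hedged illustration only: the dN values are assumed to have been computed beforehand (the text uses PAML for that step), and the function name `assign_to_clade` and the dictionary layout are hypothetical.

```python
from itertools import combinations
from statistics import mean


def assign_to_clade(dn_ref_meta, dn_ref_ref):
    """Apply the mean-distance criterion for one orthologous gene family.

    dn_ref_meta: dict mapping each reference gene name to its dN against
                 the metagenomic fragment (n values).
    dn_ref_ref:  dict mapping frozenset({ref_a, ref_b}) to the dN between
                 two reference genes (n choose 2 values).
    Both layouts are assumptions made for illustration.
    """
    refs = sorted(dn_ref_meta)
    pair_values = [dn_ref_ref[frozenset(pair)] for pair in combinations(refs, 2)]
    mean_ref_meta = mean([dn_ref_meta[r] for r in refs])
    mean_ref_ref = mean(pair_values)
    return mean_ref_meta < mean_ref_ref


if __name__ == "__main__":
    ref_ref = {
        frozenset({"refA", "refB"}): 0.15,
        frozenset({"refA", "refC"}): 0.20,
        frozenset({"refB", "refC"}): 0.10,
    }
    close_fragment = {"refA": 0.10, "refB": 0.12, "refC": 0.08}
    distant_fragment = {"refA": 0.30, "refB": 0.30, "refC": 0.30}
    print(assign_to_clade(close_fragment, ref_ref))
    print(assign_to_clade(distant_fragment, ref_ref))
```

The criterion is deliberately conservative: a fragment is assigned only if it sits, on average, closer to the clade's reference genes than those genes sit to one another.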

New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 1 Outline of the methodology. (a) Workflow from sequencing to evaluation of the potential functionomes. (b) Detailed workflow of the three annotation servers, KAAS, MG-RAST, and MEGAN4, using query sequences after gene finding process of sequenced data; KAAS and MEGAN4 use BLASTP and BLASTX for amino acid and nucleotide query sequences, respectively, and MG-RAST uses only BLASTX. All use different databases, i.e., KEGG GENES for KAAS, M5nr (Wilke et al. 2012) for MG-RAST (M5nr includes the SEED as a subset), and NCBI-NR for MEGAN4, and different default threshold values for the BLAST hits. Each server converts the hit entries to the corresponding orthology IDs for functional annotation and pathway/module/subsystem mapping. Red-colored texts of KAAS indicate its improvements in the current study. This figure has been modified from the previous one (Takami et al. 2012)

communities in different environments. Thus, it is difficult to differentiate the functional potentials between different genomes and metagenomes by analysis based on COG classification.

Recently, more detailed and comprehensive functional categories provided in KEGG (Kanehisa and Goto 2000) and SEED (Overbeek et al. 2005) have been used for comparative genomics and as metagenomics tools to highlight functional features represented by KAAS (KEGG Automatic Annotation Server) (Moriya et al. 2007), MG-RAST (Meyer et al. 2008), and MEGAN (Huson et al. 2011) (Fig. 1). They all employ a similarity-based method for functional annotations, but utilize different databases for protein sequences, default threshold values, and orthology IDs for mapping annotated sequences to functional categories depending on their desired outputs, namely, pathways in KEGG or subsystems in SEED. Notably, KAAS has been applied to protein-coding sequences from several metagenomic samples, and their annotated KEGG pathways and other classifications are already available. The outputs of these systems include functional distributions of each sample by hierarchical classification using KEGG and/or SEED and comparisons between several samples when necessary. However, it is still difficult to evaluate the functional potentials via the current classification systems (such as pathway map-based analysis) because the functional information from different organisms such as microbes, plants, and animals has been mixed up.
On the other hand, KEGG MODULE, a newly defined database that collects pathway modules and other functional units, presents a promising tool for functional classification (Kanehisa et al. 2008). Because the KEGG modules cover major metabolisms and physiological processes necessary for functional characterization of each category of organisms, such as plants, animals, and microbes, a new evaluation method using the KEGG MODULE database was developed to resolve the difficulties in evaluating the potential functionome, and it was employed for comparative functional genomics and metagenomics (Takami et al. 2012). Based on this result, we also developed the metabolic and physiological potential evaluator (MAPLE) system. MAPLE provides a user-friendly Web interface not only for characterization of the potential functionome harbored in genomic and metagenomic sequences but also for comparative analyses of the module completion ratio (MCR) and mapping patterns to the KEGG modules (http://www.genome.jp/tools/maple/).

Development of New Evaluation Method for Potential Functionome

KEGG MODULE
KEGG MODULE (Kanehisa et al. 2008) is a collection of pathway modules and other functional units designed for automatic functional annotation or pathway enrichment analysis. Pathway modules such as the TCA cycle core module (Fig. 2a) are tighter functional units than KEGG pathway maps and are defined as consecutive reaction steps, operon or other regulatory units, and phylogenetic units obtained by genome comparisons. Other functional units include (1) structural complexes representing sets of protein subunits for molecular machineries such as photosystems (Fig. 2b), (2) functional sets representing other types of essential sets such as aminoacyl-tRNA synthetases, and (3) signature modules representing markers of phenotypes such as enterohemorrhagic E. coli pathogenicity signature for Shiga toxin. The KEGG MODULE falls into 56 small functional categories (Table 1), and the latest version is available from the KEGG FTP site (http://www.kegg.jp/kegg/download). Each module is defined by the combination of KO identifiers so that it can be used for annotation and interpretation purposes in individual genomes or metagenomes. Notations of the Boolean algebra-like equation for this definition include space-delimited items for pathway elements, comma-separated items in parentheses for alternatives, a plus sign to define a complex, and a minus sign for an optional item. Some modules have branching points in their reaction cascades, leading to different products or alternative reaction pathways. These modules are divided into several parts depending on the branching patterns and are redefined as submodules for accurate calculation of the completion ratio. The module completion ratio was calculated for each submodule to examine fine-grained functional categories (Takami et al. 2012).

Calculation of the Module Completion Ratio Based on a Boolean Algebra-Like Equation
The completion ratio of all KEGG functional modules in each organism was calculated based on a Boolean algebra-like equation. For this analysis, one genome was selected from each of the 1,041 available prokaryotic species as of March 2013. As one of the examples, M00009_1 is a core pathway module for the TCA cycle comprising eight components (Fig. 2a). In each KO number set, vertically connected KO identifiers indicate a complex and therefore represent “And” or “+” in the Boolean algebra-like equation, whereas horizontally located K numbers indicate alternatives and represent “Or” or “,” in the equation. When genes are assigned to all KO identifiers in each reaction according to the Boolean algebra-like equation, the module completion ratio (MCR) becomes 100 %. If genes are not assigned to KO identifiers in two components, the MCR is calculated as 75 % (6/8 × 100 = 75). On the other hand, M00163_1 comprising six components in cyanobacteria represents a complex module for photosystem I. If genes assigned to KO identifiers in two of those components are missing, the MCR is calculated as 66.7 % (Fig. 2b).
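
The MCR calculation described above can be sketched as a small evaluator for the Boolean algebra-like notation (space-delimited components, comma-separated alternatives in parentheses, "+" for complexes, "-" for optional items). The Python below is a minimal sketch under stated assumptions rather than the MAPLE implementation: it handles only the four notations listed in the text, the function names are hypothetical, and the module definition in the example uses invented K numbers (K90001 and so on), not a real KEGG definition.

```python
import re


def split_top_level(definition):
    """Split a module definition into its space-delimited components."""
    parts, depth, current = [], 0, ""
    for ch in definition.strip():
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == " " and depth == 0:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def satisfied(expr, assigned):
    """Evaluate one component against the set of assigned KO identifiers."""
    tokens = re.findall(r"K\d{5}|[()+,\- ]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def alternatives():  # "a,b": any alternative suffices ("Or")
        result = sequence()
        while peek() == ",":
            take()
            value = sequence()
            result = result or value
        return result

    def sequence():  # "a b" inside parentheses: all steps required
        result = term()
        while peek() == " ":
            take()
            value = term()
            result = result and value
        return result

    def term():  # "a+b": complex ("And"); "a-b": b is optional
        result = factor()
        while peek() in ("+", "-"):
            op = take()
            value = factor()
            if op == "+":
                result = result and value
        return result

    def factor():
        token = take()
        if token == "(":
            value = alternatives()
            take()  # consume the closing ")"
            return value
        return token in assigned

    return alternatives()


def module_completion_ratio(definition, assigned):
    """Percentage of components whose Boolean condition is met."""
    components = split_top_level(definition)
    hits = sum(satisfied(c, assigned) for c in components)
    return 100.0 * hits / len(components)


if __name__ == "__main__":
    # Hypothetical eight-component module with invented K numbers.
    toy = ("K90001 (K90002,K90003) K90004+K90005 K90006-K90007 "
           "(K90008+K90009,K90010) K90011 K90012 K90013")
    found = {"K90001", "K90003", "K90004", "K90005",
             "K90006", "K90010", "K90011"}
    print(module_completion_ratio(toy, found))
```

With two of the eight components unassigned, the toy module reproduces the 6/8 × 100 = 75 calculation given in the text; a six-component module with two components missing gives 66.7 % in the same way.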

New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 2 KEGG functional modules. (a) A pathway module. The module M00009 comprising eight components is defined for the citrate cycle (TCA cycle) core module and represented as a Boolean algebra-like equation of KO identifiers or K numbers for computational applications. The relationship between this module and the corresponding KEGG pathway map is also shown by indicating corresponding K number sets in the module and EC numbers in the pathway map using the same index. In each K number set, vertically connected
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE 529 N
Assignment of the Query Sequences to KO Identifiers
Because KAAS is an efficient tool for assigning KO identifiers to genes from complete genomes based on a BLAST search of the KEGG GENES database combined with a bidirectional best-hit method (Moriya et al. 2007), the KAAS system is used to assign KO identifiers to protein sequences from metagenome projects and to users’ own data from other genome and metagenome projects. The KAAS system was recently modified slightly to improve the accuracy of KO assignments by (i) using a variable bit-score threshold instead of a fixed one (60 in the original KAAS system) to avoid missed annotations when there are sufficient high-scoring hits for KO assignment and (ii) considering taxonomic information of each KO when more than one candidate KO is obtained (Fig. 1) (Takami et al. 2012). This modification improved the positive predictive value (#true positives/#all positives) by 2–5 % in the KO reassignment tests for 30 selected species. The latest stand-alone KAAS system for Linux and Mac OS X is available from the Web site of KAAS HELP (http://www.genome.jp/tools/kaas/help.html). This new KAAS was used to estimate how the accuracy of the KO assignment depends on the database (Fig. 3). Escherichia coli was selected as a representative prokaryotic species, and four different types of data sets were constructed: without E. coli and closely related species (1,239 species), without all species within order Enterobacteriales (1,200 species), without all species within class Gammaproteobacteria (1,040 species), and without all species within phylum Proteobacteria (755 species). The draft genome of E. coli from infants in Trondheim, Norway (accession, ERX127960), was used for this analysis because the assembled genome from the short-read sequences produced by a 454 GS FLX Titanium sequencer contains several sequencing errors. The amino acid sequences of complete CDSs identified from the draft genome were randomly fragmented to 50, 60, 80, 100, 120, 150, and 200 residues in length, and each fragment was subjected to verification of database dependency based on the accuracy of KO identifier assignment (Fig. 3). In general, because most microbes thriving in natural environments are uncultivable, many genes in environmental metagenomes do not show significant similarity to those from known species in the public genome database. Especially when microbial genomes belonging to the same phylum as the query microbe are missing in the genome database, the accuracy rate of KO assignment to proteins phylogenetically distant from known phyla is expected to be low. In fact, when all species within phylum Proteobacteria were excluded from the data set, the accuracy rate of KO assignment to full proteins of E. coli decreased to 80 %, but an accuracy rate of approximately 70 % was maintained even for proteins fragmented to about 100 residues (Fig. 3). Considering these results, even if genes from unidentified phyla of the so-called candidate divisions are included in the metagenomes, the KAAS system can presumably assign KO identifiers to genes longer than 300 bp (100 amino acids) with an accuracy rate of approximately 70 %.

Distribution Patterns of the Module Completion Ratio in 1,256 Prokaryotic Species

KEGG modules are modular functional units derived from the KEGG pathways and are categorized into pathway modules, structural complexes, functional sets, and genotypic
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 2 (continued) K numbers indicate a complex and therefore represent “And” or “+” in the Boolean algebra-like equation, whereas horizontally located K numbers indicate alternatives and represent “Or” or “,” in the equation. (b) A structural complex module. The structural complex module M00163 comprising six components is defined for the type I photosystem. The Boolean algebra-like equation and the corresponding KEGG pathway map are also shown. This figure has been redrawn with the updated KEGG module database from the previous one (Takami et al. 2012)
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 1 Breakdown of small functional categories of the KEGG modules

Pathway modules                                         Structural complex modules
Cofactor and vitamin biosynthesis                       Saccharide and polyol transport system
Central carbohydrate metabolism                         Phosphotransferase system (PTS)
Aromatics degradation                                   ATP synthesis
Lipid metabolism                                        Phosphate and amino acid transport system
Aromatic amino acid metabolism                          Mineral and organic ion transport system
Carbon fixation                                         ABC-2 type and other transport systems
Methane metabolism                                      Bacterial secretion system
Glycan metabolism                                       Metallic cation, iron-siderophore, and vitamin B12 transport system
Sterol biosynthesis                                     RNA processing
Fatty acid metabolism                                   Ubiquitin system
Lysine metabolism                                       Spliceosome
Other carbohydrate metabolism                           Protein processing
Glycosaminoglycan metabolism                            Repair system
Terpenoid backbone biosynthesis                         DNA polymerase
Cysteine and methionine metabolism                      Peptide and nickel transport system
Nitrogen metabolism                                     Replication system
Branched-chain amino acid metabolism                    RNA polymerase
Lipopolysaccharide metabolism                           Proteasome
Purine metabolism                                       Photosynthesis
Pyrimidine metabolism                                   Carbohydrate metabolism
Polyamine biosynthesis                                  Ribosome
Alkaloid and other secondary metabolite biosynthesis    Glycan metabolism
Sugar metabolism
Other terpenoid biosynthesis                            Functional set modules
Serine and threonine metabolism                         Two-component regulatory system
Arginine and proline metabolism                         Aminoacyl-tRNA
Phenylpropanoid and flavonoid biosynthesis              Nucleotide sugar
Sulfur metabolism
Histidine metabolism                                    Signature modules
Other amino acid metabolism                             Pathogenicity

signatures. Each KEGG module is designed for automatic functional annotation by a Boolean algebra-like equation of KEGG Orthology IDs. However, it remains uncataloged as to which species possess common modules or if certain modules demonstrate universality or rareness between specific species, phyla, etc. Specific information regarding the phylogenetic profiles of each module holder would be especially useful for annotating metagenomes. Thus, the distribution patterns of the completion ratios of the KEGG modules were examined in the 1,256 prokaryotic species whose genomic sequences have been completed. Although distribution of the module completion ratios in the 1,256 species varied greatly depending on the kind of module, it could be categorized into four patterns (universal, restricted, diversified, and nonprokaryotic) regardless of the module type (pathway, structural complex, signature, or functional set), when considering 70 % of all species to represent a majority measurement for the patterns (Table 2 and Fig. 4).

Pattern A defined as “universal” comprised modules completed by more than 70 % of the 1,256 species (Fig. 4a). Of 226 pathway modules containing submodules, modules grouped into pattern A account for only 7.5 % (Table 2) and mainly belong to the categories of central carbohydrate metabolism and cofactor and vitamin biosynthesis. Pattern B defined as “restricted” comprised modules completed by less than 30 % of the species (Fig. 4b) and accounted for 17.3 % of all the pathway modules, and 37 modules were rare modules completed by less than 10 % of the 1,256 species (Table 2). Pattern C defined as “diversified” accounted for 40.3 % of all the pathway modules and comprised modules ranging widely in completion ratios. M00012_1 (the glyoxylate cycle comprising five components) is one of the representatives of pattern C (Fig. 4c). One or several KO identifiers were assigned to each reaction in this module; however, KO identifiers, except for K01637 and K01638 assigned to the third and fourth components, were also assigned to other pathway modules such as the TCA (Krebs) cycle (M00009_1), first carbon oxidation (M00010_1), reductive
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 3 Effect of database dependency on accuracy of the KO assignment. Purple triangles show the results using the data set without proteins from the genera Escherichia, Salmonella, Shigella, and Yersinia (1,239 species). Similarly, green squares, brown diamonds, and blue dots show the results without proteins from the order Enterobacteriales (1,200 species), class Gammaproteobacteria (1,040 species), and phylum Proteobacteria (755 species), respectively. KO identifiers specific to the genera Escherichia, Salmonella, Shigella, and Yersinia (16 KO identifiers), order Enterobacteriales (90), class Gammaproteobacteria (203), or phylum Proteobacteria (370) were removed in advance from the protein data set. Here, the accuracy is defined by the sensitivity TP/(TP + FN), where TP and FN are the numbers of true positives and false negatives, respectively. The truncated proteins were also used to confirm the effect of amino acid (a.a.) sequence lengths on the accuracy of KO assignments as described in the text. This figure has been slightly modified from the previous one (Takami et al. 2012)
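The two accuracy measures used in this entry, sensitivity TP/(TP + FN) and positive predictive value (#true positives/#all positives), can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function name and the counting convention are assumptions: a gene whose true KO is not recovered counts as a false negative, and a gene given a wrong KO additionally counts as a false positive. The gene and KO labels in the usage lines are placeholders.

```python
from typing import Dict, Optional


def score_ko_assignments(truth: Dict[str, str],
                         predicted: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Compare predicted KO identifiers with reference KO identifiers per gene."""
    tp = fp = fn = 0
    for gene, true_ko in truth.items():
        ko = predicted.get(gene)  # None when no KO was assigned
        if ko == true_ko:
            tp += 1
        else:
            fn += 1               # the true KO was not recovered
            if ko is not None:
                fp += 1           # assumption: a wrong KO is also a false positive
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "sensitivity": sensitivity, "ppv": ppv}


# Placeholder data: four reference genes, one wrong call, one gene left unassigned.
truth = {"g1": "K_a", "g2": "K_b", "g3": "K_c", "g4": "K_d"}
predicted = {"g1": "K_a", "g2": "K_b", "g3": "K_x"}
print(score_ko_assignments(truth, predicted))
```

Under this convention an unassigned fragment lowers only the sensitivity, while a wrongly assigned fragment lowers both measures.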
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 2 Classification of the KEGG modules based on the module completion ratio of 1,256 prokaryotes

Entries are No. of modules (%).

Completion  Definition of    Pathways [226]         Structural complexes [331]  Functional sets [86]    Signatures [9]
pattern     module type      Total      Rare        Total        Rare           Total      Rare         Total     Rare
A           Universal        17 (7.5)   0 (0)       9 (2.7)      0 (0)          1 (1.2)    0 (0)        0 (0)     0 (0)
B           Restricted       39 (17.3)  37 (47.4)   133 (40.2)   99 (81.1)      77 (89.5)  67 (97.1)    8 (88.9)  8 (88.9)
C           Diversified      91 (40.3)  41 (52.6)   70 (21.1)    23 (18.9)      5 (5.8)    2 (2.9)      1 (11.1)  1 (11.1)
D           Nonprokaryotic   79 (35.0)  0 (0)       119 (36.0)   0 (0)          3 (3.5)    0 (0)        0 (0)     0 (0)
[] shows the total number of KEGG modules, including branched modules. “Rare” indicates modules completed by less than 10 % of the 1,256 prokaryotic species. Universal, modules completed by more than 70 % of the 1,256 prokaryotic species. Restricted, modules completed by less than 30 % of the 1,256 prokaryotic species. Diversified, modules that vary in the module completion ratio among the 1,256 prokaryotic species. Nonprokaryotic, modules not completed by any prokaryotic species
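The percentages in Table 2 can be recomputed from the module counts. The sketch below is a reading aid under one stated assumption: the "Total" percentages are taken relative to all modules of a type, while the "Rare" percentages appear to be taken relative to the number of rare modules of that type (for example, 37 of 78 rare pathway modules). That second denominator is inferred from the numbers and is not stated in the table. The dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# Counts transcribed from Table 2 as (total, rare) for patterns A to D.
table2: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Pathways": {"A": (17, 0), "B": (39, 37), "C": (91, 41), "D": (79, 0)},
    "Structural complexes": {"A": (9, 0), "B": (133, 99), "C": (70, 23), "D": (119, 0)},
    "Functional sets": {"A": (1, 0), "B": (77, 67), "C": (5, 2), "D": (3, 0)},
    "Signatures": {"A": (0, 0), "B": (8, 8), "C": (1, 1), "D": (0, 0)},
}


def percentages(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[float, float]]:
    """Return (total %, rare %) per pattern for one module type."""
    n_modules = sum(total for total, _ in counts.values())
    n_rare = sum(rare for _, rare in counts.values())
    result = {}
    for pattern, (total, rare) in counts.items():
        total_pct = round(100.0 * total / n_modules, 1)
        # Assumed denominator: all rare modules of this module type.
        rare_pct = round(100.0 * rare / n_rare, 1) if n_rare else 0.0
        result[pattern] = (total_pct, rare_pct)
    return result


for module_type, counts in table2.items():
    print(module_type, percentages(counts))
```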
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 4 Typical completion patterns to the KEGG modules by 1,256 prokaryotic species. (a) Universal modules. The modules completed by more than 70 % of 1,256 prokaryotic species. M00018_1, which is threonine biosynthesis (aspartate-homoserine-threonine), is one of the examples of the pattern A-1. (b) Restricted modules completed by less than 30 % of 1,256 prokaryotic species. M00038_1, which is tryptophan metabolism, is one of the examples of the pattern B. (c) Diversified modules. These are the modules that vary in the module completion ratio among 1,256 prokaryotic species. M00012_1, which is glyoxylate cycle, is one of the examples of the pattern C. (d) Nonprokaryotic modules completed by no prokaryotic species. M00014_1, which is glucuronate pathway, is one of the examples of the pattern D. Breakdown of taxonomic variations that complete each KEGG module is summarized in Table 3. This figure has been redrawn with the updated KEGG module and genome databases from the previous one (Takami et al. 2012)
TCA cycle (M00173_1), and C4-dicarboxylate cycle (nicotinamide adenine dinucleotide (NAD)+-malic enzyme type) (M00171_1). Some KO IDs assigned to many of the modules, categorized into pattern C, were also assigned to several other independent modules. Thus, when the module completion ratio is low, the relationship between the module completion ratio of the targeted module and others to which the same KO identifiers are assigned should be considered. Pattern D, which accounted for 35.0 % of all pathway modules, comprised nonprokaryotic
modules that are not completed by prokaryotic species (Fig. 4d).

Of the 331 structural complex modules containing submodules redefined from modules with various complex patterns, 133 modules were categorized into pattern B (40.2 %) and 99 were rare modules (Table 2). Pattern C accounted for only 21.1 % in the structural modules compared with 40.3 % in the pathway modules. Thus, it was hypothesized that most of the structural complex modules, except for pattern D, are shared only in limited prokaryotic species.

Nonprokaryotic modules account for 35 % of pathway and 36 % of structural complex modules, respectively, and other modules were classified into various taxonomic patterns such as prokaryotic, Bacteria specific, and Archaea specific based on the MCR profiles (Table 3). These four patterns indicate the universal and unique nature of each module and also the versatility of the KO identifiers mapped to each module. Thus, the four criteria and taxonomic classification for each module should be helpful for the interpretation of results based on module completion profile.

Application of the Evaluation Method for Potential Functionome to Genomic and Metagenomic Analyses

Comparative Functionome Analysis of Bacilli Based on the KEGG Modules
Bacillus and its related species in genera such as Oceanobacillus and Geobacillus reclassified from genus Bacillus (Bacillus-related species) are known to thrive in a wide range of environmental conditions: pH 2–12, temperatures between 5 and 78 °C, salinity from 0 % to 30 % NaCl, and pressures from 0.1 MPa (atmospheric pressure) to at least 30 MPa (pressure at a depth of 3,000 m) (Takami 2006). The genome structure of these species within family Bacillaceae is comparatively similar, and the core structure comprising more than 1,400 orthologous groups is well conserved among Bacillaceae (Uchiyama 2008). Therefore, moderately related bacillar genomes from eight species with different phenotypic properties were selected to test our evaluation method for potential functionome using KEGG modules, in order to differentiate the functional potentials harbored in their genomes.

The gene products from eight bacillar genomes were assigned to KO identifiers constructing each module in 139 pathway, 112 structural complex, and 25 functional set modules. There was a significant difference in the module completion ratio by eight bacilli in terms of at least 25 pathway, 40 structural complex, and 15 functional set modules (Fig. 5a, b). In particular, the completion ratio in Oceanobacillus iheyensis, a mesophilic, extremely halotolerant alkaliphile, was very low in three modules for NAD biosynthesis, phosphatidylethanolamine biosynthesis, and biotin biosynthesis. These three modules were completed by all bacilli except for O. iheyensis although they are categorized into one of the diversified modules (pattern C). Conversely, the module for tryptophan biosynthesis belonging to pattern C was completed by only O. iheyensis, although other species partially completed it. Through these results it was evident that O. iheyensis differs from other bacilli in its metabolic potentials.

Some of the completed structural complex modules were found to be shared in bacilli with the same phenotypic properties or to be independently species specific (Fig. 5b). For example, the Firmicutes-specific modules for the teichoic acid transport system were shared only among three mesophilic neutrophiles (B. subtilis, B. amyloliquefaciens, and B. licheniformis), although this module is widely shared in other genera such as Staphylococcus, Clostridium, and Listeria within phylum Firmicutes. On the other hand, two other modules, the iron (III) transport system and phosphonate transport system, which are shared in many prokaryotic species within various phyla and belonged to pattern C, were shared only among three mesophilic alkaliphiles (B. halodurans, B. pseudofirmus, and O. iheyensis). Although it has been previously reported that the orthologous genes for the phosphonate transport system were shared between O. iheyensis and B. halodurans
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 3 Breakdown of taxonomic patterns of the KEGG modules

Pathway [226]
Major taxonomic pattern                              Number (%)
Nonprokaryote                                        79 (35.0)
Prokaryote                                           50 (22.1)
Bacteria                                             30 (13.3)
Proteobacteria                                       27 (11.9)
Euryarchaeota                                        10 (4.4)
Proteobacteria/Actinobacteria                        5 (2.2)
Firmicutes                                           4 (1.8)
Proteobacteria/Firmicutes/Actinobacteria             3 (1.3)
Chloroflexi                                          2 (0.9)
Crenarchaeota                                        2 (0.9)
Cyanobacteria                                        2 (0.9)
Actinobacteria/Crenarchaeota                         1 (0.4)
Chlamydiae/Cyanobacteria                             1 (0.4)
Chloroflexi/Deinococcus-Thermus/Euryarchaeota        1 (0.4)
Euryarchaeota/Crenarchaeota                          1 (0.4)
Firmicutes/Euryarchaeota                             1 (0.4)
Proteobacteria/Acidobacteria                         1 (0.4)
Proteobacteria/Actinobacteria/Acidobacteria          1 (0.4)
Proteobacteria/Actinobacteria/Bacteroidetes          1 (0.4)
Proteobacteria/Actinobacteria/Cyanobacteria          1 (0.4)
Proteobacteria/Cyanobacteria                         1 (0.4)
Proteobacteria/Firmicutes                            1 (0.4)
Proteobacteria/Verrucomicrobia                       1 (0.4)

Structural complex [331]
Major taxonomic pattern                              Number (%)
Nonprokaryote                                        119 (36.0)
Bacteria                                             55 (16.6)
Prokaryote                                           51 (15.4)
Proteobacteria                                       36 (10.9)
Firmicutes                                           17 (5.1)
Actinobacteria                                       5 (1.5)
Cyanobacteria                                        5 (1.5)
Archaea                                              4 (1.2)
Proteobacteria/Firmicutes                            4 (1.2)
Euryarchaeota/Crenarchaeota                          3 (0.9)
Euryarchaeota/Crenarchaeota/Nanoarchaeota            3 (0.9)
Proteobacteria/Firmicutes/Fusobacteria               3 (0.9)
Euryarchaeota                                        2 (0.6)
Firmicutes/Tenericutes/Actinobacteria                2 (0.6)
Proteobacteria/Actinobacteria                        2 (0.6)
Proteobacteria/Aquificae                             2 (0.6)
Proteobacteria/Firmicutes/Actinobacteria             2 (0.6)
Actinobacteria/Cyanobacteria                         1 (0.3)
Actinobacteria/Verrucomicrobia/Nitrospirae           1 (0.3)
Firmicutes/Fusobacteria                              1 (0.3)
Firmicutes/Spirochaetes                              1 (0.3)
Proteobacteria/Actinobacteria/Deinococcus-Thermus    1 (0.3)
Proteobacteria/Actinobacteria/Verrucomicrobia        1 (0.3)
Proteobacteria/Bacteroidetes/Aquificae               1 (0.3)
Proteobacteria/Chlamydiae                            1 (0.3)
Proteobacteria/Chlorobi                              1 (0.3)
Proteobacteria/Chlorobi/Deferribacteres              1 (0.3)
Proteobacteria/Cyanobacteria                         1 (0.3)
Proteobacteria/Cyanobacteria/Chlorobi                1 (0.3)
Proteobacteria/Firmicutes/Deferribacteres            1 (0.3)
Proteobacteria/Firmicutes/Spirochaetes               1 (0.3)
Proteobacteria/Tenericutes                           1 (0.3)
Proteobacteria/Thermodesulfobacteria                 1 (0.3)

Functional set [86]
Major taxonomic pattern                              Number (%)
Proteobacteria                                       26 (30.2)
Firmicutes                                           19 (22.1)
Bacteria                                             11 (12.8)
Actinobacteria                                       6 (7.0)
Cyanobacteria                                        6 (7.0)
Nonprokaryote                                        3 (3.5)
Prokaryote                                           3 (3.5)
Firmicutes/Fusobacteria                              2 (2.3)
Proteobacteria/Nitrospirae                           2 (2.3)
Firmicutes/Tenericutes/Thermotogae                   1 (1.2)
Proteobacteria/Acidobacteria/Deferribacteres         1 (1.2)
Proteobacteria/Acidobacteria/Planctomycetes          1 (1.2)
Proteobacteria/Chrysiogenetes/Firmicutes             1 (1.2)
Proteobacteria/Cyanobacteria                         1 (1.2)
Proteobacteria/Firmicutes/Chlamydiae                 1 (1.2)
Proteobacteria/Nitrospirae/Deferribacteres           1 (1.2)
Proteobacteria/Spirochaetes/Verrucomicrobia          1 (1.2)

Signature [9]
Major taxonomic pattern                              Number (%)
Proteobacteria                                       5 (55.6)
Euryarchaeota                                        1 (11.1)
Proteobacteria/Actinobacteria                        1 (11.1)
Proteobacteria/Thaumarchaeota                        1 (11.1)
Proteobacteria/Verrucomicrobia/Nitrospirae           1 (11.1)

[] shows total number of the KEGG modules containing branched modules
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 5 Comparison of module completion patterns in eight phenotypically different Bacillus-related species. (a) Pathway modules showing remarkable differences appeared among the eight species. (b) Structural complex modules showing remarkable differences appeared among the eight species. Upper plot indicates common or specific modules in the species possessing each phenotype. Green letters show rare modules completed by less than 10 % of 1,256 prokaryotic species. Alphabet in parentheses shows the patterns of completion profile based on the module completion ratio as shown in Table 2 and Fig. 4. bsu, B. subtilis; bao, B. amyloliquefaciens; bli, B. licheniformis; bha, B. halodurans; B. pseudofirmus; oih, O. iheyensis; gka, G. kaustophilus; and gth, G. thermoglucosidasius. This figure has been redrawn with the updated KEGG module database from the previous one (Takami et al. 2012)
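A comparison of this kind can be explored with a simple scan over a table of module completion ratios. The sketch below is only one possible way to flag modules that every species but one completes, or that a single species alone completes, which are the two situations described for O. iheyensis in the text. The entry does not state how differences were scored, so the rule, the function name, the module labels, and the MCR values are all hypothetical and are not read from Fig. 5.

```python
from typing import Dict, Optional

# Hypothetical MCR values (%) keyed by module label and species code.
mcr: Dict[str, Dict[str, float]] = {
    "module_X": {"bsu": 100.0, "bli": 100.0, "bha": 100.0, "oih": 33.3},
    "module_Y": {"bsu": 75.0, "bli": 75.0, "bha": 50.0, "oih": 100.0},
    "module_Z": {"bsu": 100.0, "bli": 100.0, "bha": 100.0, "oih": 100.0},
}


def lone_exception(profile: Dict[str, float], complete: float = 100.0) -> Optional[str]:
    """Return the one species whose completion status differs from all others.

    Returns None when no single species stands apart.
    """
    done = [species for species, value in profile.items() if value >= complete]
    not_done = [species for species in profile if species not in done]
    if len(not_done) == 1 and len(done) > 1:
        return not_done[0]   # completed by all but one species
    if len(done) == 1 and len(not_done) > 1:
        return done[0]       # completed by only one species
    return None


for module, profile in mcr.items():
    print(module, lone_exception(profile))
```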
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 6 Comparison of module completion patterns in humans and human gut microbiomes from 13 healthy individuals. (a) Typical pathway modules showing remarkable differences in the module completion ratio appeared among human gut microbiomes from 13 healthy individuals. (b) Typical pathway modules possessing complementary relationships between humans and human gut microbiomes in the module completion ratio. (c) Typical pathway modules for which the completion ratio in the human gut microbiome is very low in contrast to that in humans. Detailed information of the 13 individuals has been previously described (Kurokawa et al. 2007). This figure has been redrawn with the updated KEGG module database from the previous one (Takami et al. 2012)
microbiome did not seem to utilize exogenous lysine, leucine, and aromatic amino acids such as tryptophan and tyrosine (Fig. 6c). To our knowledge, this is a novel finding on the nutritional preference of gut microbes. This may be one of the mutualistic representations of gut microbiomes to avoid nutritional competition with the host because these aromatic amino acids are precursors of various biological substances such as catecholamines, melatonin, serotonin, thyroid hormones, and NAD. Thus, the new evaluation method based on the KEGG modules is expected not only to highlight the metabolic linkage between host and commensal microbes but also to identify microbiome-based biomarkers for particular diseases.

Summary

A new evaluation method for potential functionomes based on the KEGG modules was developed. Using this new method, a significant difference in module completion ratio among eight bacilli in terms of at least 25 pathway, 40 structural complex, and 15 functional set modules was highlighted, although how the differentiated functional modules confer phenotypic properties directly or indirectly is unclear thus far. Because the coverage of KEGG modules over whole metabolic and signaling networks is continuously increasing, differences in module completion ratio will provide some important clues to the understanding of phenotypic properties. Furthermore, variations in the functional potential of human gut microbiomes from 13 healthy individuals could be characterized by the pathway and structural complex module units, and the complementarity between biochemical functions in human hosts and nutritional preferences in human gut microbiomes was identified.

Functional annotation of metagenomic sequences remains difficult because metagenomic data targeting various environments still contain incomplete genes from various unidentified species that are absent from the reference database. In this entry, the KAAS system was used for functional annotation of the human metagenomes and also applied to estimate database dependency on the accuracy of the KO assignment using the E. coli draft genome. As a result, the KAAS system could correctly assign genes to KO groups with an accuracy rate of approximately 80 %, even if the gene hosts were not classified into known phyla within the reference database. Thus, this method will work well for comparative functional analysis in metagenomics and is able to target unknown environments containing various uncultivable microbes within unidentified phyla, although further verification studies on database dependency for metagenomics should be performed. Based on this method, we developed the metabolic and physiological potential evaluator (MAPLE) and provided a user-friendly Web interface not only for the characterization of potential functionome harbored in the genomic and metagenomic sequences but also for comparative analyses for the MCR and mapping patterns to the KEGG modules (http://www.genome.jp/tools/maple/).

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ Human Gut Microbial Genes by Metagenomic Sequencing
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ Metagenomic Research: Methods and Ecological Applications

References

Eckburg PB, Bik EM, Bernstein CN, et al. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–8.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
Kanehisa M, Araki M, Goto S, et al. KEGG for linking genomes to life and environment. Nucleic Acids Res. 2008;36:D480–4.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiome. DNA Res. 2007;14:169–81.
Li JV, Ashrafian H, Bueter M, et al. Metabolic surgery profoundly influences gut microbial-host metabolic cross-talk. Gut. 2011;60:1214–23.
Markowitz VM, Chen I-MA, Palaniappan K, et al. IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res. 2012;40:D115–22.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
Moriya Y, Itoh M, Okuda S, et al. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35:W182–5.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Takami H. Genomic diversity of extremophilic Gram-positive endospore-forming Bacillus-related species. In: Williams CR, editor. Trends in genome research. New York: NOVA Publisher; 2006. p. 25–85.
Takami H, Taniguchi T, Moriya Y, et al. Evaluation method for the potential functionome harbored in the genome and metagenome. BMC Genomics. 2012;13:699.
Uchiyama I. Multiple genome alignment for identifying the core structure among moderately related microbial genomes. BMC Genomics. 2008;9:515.
Willke A, Harrison T, Wilkening J, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics. 2012;13:141.

Next-Generation Sequencing for Metagenomic Data: Assembling and Binning

Henry C. M. Leung, Yi Wang, S. M. Yiu and Francis Y. L. Chin
Department of Computer Science, The University of Hong Kong, Hong Kong, China

affects all species in the world. The traditional method of studying microorganisms requires culturing a single kind of microbe and studying each microbe by its genome, one at a time, based on next-generation sequencing (NGS) technology (Perna et al. 2001). However, as a single kind of microbe usually cannot live alone and over 99 % of microbes cannot be cultivated in the laboratory (Rappe and Giovannoni 2003; Eisen 2007), the traditional culture-based method cannot analyze the interactivity of a microbial community well. Metagenomics, which studies all microbes in a community as a whole, was introduced to solve this problem. Based on the NGS technology (Shendure and Ji 2008), instead of sequencing each single cultivated microbe one by one, metagenomics sequences all microbes in an environmental sample directly as a community without cultivation (Weinstock 2012; Gilbert and Dupont 2011; Hunter et al. 2012; Tremaroli and Backhed 2012; Wooley et al. 2010). Thus, genomes of microbes that could not be studied before can now be obtained and analyzed.

However, the complexity of a microbial community is high. There can be tens of thousands of kinds of microbes in a single sample. As genomes of these microbes coexist in the sample, reads (short DNA fragments) obtained from genomes of different microbes are mixed and need to be separated after the NGS step. More seriously, as the abundance of different microbes in a sample can vary by several orders of magnitude (Qin et al. 2010), few reads are sequenced from the low-abundance species, and these reads may be treated as erroneous reads. Thus, several approaches have been developed for analyzing metagenomic data depending on the property of samples and research objectives.

Sequencing Biomarker
biologists usually design primers for capturing short regions in the genomes of various microbes, e.g., fingerprinting polymerase chain reaction (PCR) on 16S rRNA genes. Each 16S rRNA gene is a 1.5-kilobase-long gene encoding part of the prokaryotic ribosome. Although the gene sequence varies among different bacteria, there are some conserved regions (for the ribosome function) in the 16S rRNA gene such that primers can be designed for capturing the 16S rRNA gene of different bacteria. Moreover, species whose 16S rRNA genes are 97 % identical are usually placed in the same operational taxonomic unit (OTU) (Weinstock 2012). Thus, sequencing the 16S rRNA genes can determine which kinds of bacteria are in a sample and their relative abundances (16S rRNA genes of high-abundance bacteria will be sequenced more often than those of low-abundance bacteria, resulting in more reads covering these genes). Similarly, the 18S rRNA gene encodes part of the eukaryotic ribosome and can be sequenced for identifying eukaryotes in a sample.

However, as the read lengths of most popular sequencing techniques are shorter than 1.5 kb (typical length of a 16S rRNA gene), biologists can only sequence a portion of 16S rRNA genes, and the accuracy of identification depends on the read length. Traditional Sanger sequencing techniques can produce 1-kb-long reads, which can cover a larger portion of 16S rRNA genes. However, the throughput is so low that 16S rRNA genes of many species may not be sequenced and the relative abundances of species may not be estimated well. One of the next-generation sequencing techniques, 454 pyrosequencing, can produce several orders of magnitude more reads than the Sanger sequencing technique, but the read length is about 400 bases, which can cover only a short portion of the 16S rRNA gene, and thus the sensitivity of identifying different microbes in a sample will decrease. The Illumina platform, another next-generation sequencing technique, can produce several orders of magnitude more reads than 454 pyrosequencing; however, the read length is at most 250 bases, resulting in lower sensitivity than 454 pyrosequencing.

Besides the problem of sequencing the whole 16S rRNA gene with high throughput, there is another problem in analyzing metagenomic data using 16S rRNA genes (or 18S rRNA genes). Microbes can transfer genes from one to another without reproduction (horizontal gene transfer), and thus the 16S rRNA gene of one kind of microbe may be transferred to another microbe, which introduces problems in analyzing metagenomic data. In real situations, microbes can have multiple copies of 16S rRNA genes, varying from 1 to 15 (Case et al. 2007; Klappenbach et al. 2001), and horizontal gene transfer makes the abundances difficult to estimate. Recently, other housekeeping genes, e.g., rpoB, amoA, pmoA, nirS, nirK, nosZ, and pufM, have been used (in addition to the 16S rRNA gene) for identifying different species in a metagenomic sample.

Sequencing Whole Genome

Since using a single or only several biomarkers to represent a species may be problematic, another way to analyze metagenomic data is sequencing the whole genomes of different microbes in the sample. With the help of high-throughput next-generation sequencing techniques, biologists can sequence the whole genomes of all microbes in a sample with reasonably high sequencing depth.

Assembling Reads
As the read lengths of next-generation sequencing are much shorter than the genomes of microbes, analyzing sequenced reads directly is difficult, especially for the Illumina platform. One possible way is assembling overlapping short reads into longer contigs before analysis (Mende et al. 2012). Although there are many existing assembling algorithms (Vyahhi et al. 2012; Peng et al. 2010) designed for genomic data, they cannot be applied to metagenomic data directly for the following reasons:
1. Abundances of different microbes vary in metagenomic data. Since erroneous reads
Next-Generation Sequencing for Metagenomic Data: Assembling and Binning 541 N
introduce ambiguity for assembling, existing genomic assemblers try to identify erroneous
reads and remove them before assembling. Based on the assumption that erroneous reads are
sampled fewer times than correct reads, these genomic assemblers usually consider those reads,
or length-k substrings of reads (called k-mers), with a low sampling rate (multiplicity) as
erroneous. These erroneous reads are removed before assembling. However, since the abundances
of microbes vary a lot in metagenomic data, correct reads and k-mers from low-abundance
microbes could be sampled far fewer times than the erroneous reads and k-mers from
high-abundance microbes. These genomic assemblers then fail to remove erroneous reads and
k-mers and produce either very short contigs or incorrect long contigs.
2. Common regions across different microbes. Due to horizontal gene transfer and the existence
of common housekeeping genes, some common patterns could appear in multiple genomes. As the
read length can be shorter than these common patterns, genomic assemblers cannot determine the
genomic sequences of microbes near their common patterns. Although a similar problem also
appears in assembling genomic data, the number of common patterns in metagenomic data is much
larger than in genomic data (Peng et al. 2011). As a result, shorter or erroneous contigs will
be produced by existing genomic assemblers.
3. Huge data size. As the number of microbes in a metagenomic dataset is huge, a high
sequencing depth is required to obtain enough reads (say 10× coverage) from each microbe
(especially for the low-abundance microbes). Thus, the total amount of input reads (e.g.,
200G nucleotides in the metagenomic data of cow stomach (Qin et al. 2010), over 100G of
nucleotides required for studying soil metagenome (Frisli et al. 2013)) for assembling
metagenomic data can be much larger than for genomic data. How to store and process this huge
amount of reads becomes a big problem.

Due to the above problems, several assemblers have been developed for assembling metagenomic
data, including Genovo (Laserson et al. 2011) for 454 pyrosequencing and MetaVelvet (Namiki
et al. 2012), Ray Meta (Boisvert et al. 2012), Meta-IDBA (Peng et al. 2011), and IDBA-UD
(Peng et al. 2012) for the Illumina platform. Since 454 pyrosequencing reads are longer than
those produced by the Illumina platform and the number of input reads is much smaller, Genovo
stores all the input reads and calculates their pairwise overlap relationships. It then
calculates the probability that a set of reads was sampled from the same contig using a
Bayesian approach and applies a series of hill-climbing steps to obtain a set of contigs with
the highest likelihood. However, this approach fails when the number of input reads increases
(Boisvert et al. 2012). Because of the huge number of input reads, MetaVelvet, Ray Meta,
Meta-IDBA, and IDBA-UD all assemble contigs using the de Bruijn graph approach. A de Bruijn
graph represents the connections among a set of reads using k-mers, the length-k substrings of
the reads. Each k-mer in the reads is represented by a vertex, and there is an edge from
vertex u to vertex v if and only if k-mers u and v appear consecutively in at least one read,
i.e., the length-(k-1) suffix of u is the same as the length-(k-1) prefix of v. Thus, a contig
is represented by a path in the de Bruijn graph. Because of sequencing errors and common
regions among different genomes, paths representing different genomes may overlap and the de
Bruijn graph becomes complicated. Existing metagenomic assemblers apply different approaches
to decompose the de Bruijn graph or to determine contigs directly from it. Meta-IDBA
decomposes the de Bruijn graph based on the observation that there are more interconnections
between k-mers sampled from the same genome than between k-mers sampled from different
genomes. After decomposition, paths representing different genomes will be separated and can
be reconstructed more easily. MetaVelvet decomposes the de Bruijn graph based on the
multiplicities of k-mers.
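
The de Bruijn graph construction described in this entry can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions, not the implementation of any assembler named here: the function names (kmers, build_de_bruijn, extend_contig) and the toy reads are hypothetical, reverse complements and sequencing errors are ignored, and contigs are extended naively along vertices that have exactly one outgoing edge.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return (multiplicity, edges) for the k-mers of the given reads.

    Each k-mer is a vertex; an edge u -> v is added when u and v occur
    consecutively in at least one read.
    """
    multiplicity = defaultdict(int)
    edges = defaultdict(set)
    for read in reads:
        prev = None
        for kmer in kmers(read, k):
            multiplicity[kmer] += 1
            if prev is not None:
                edges[prev].add(kmer)
            prev = kmer
    return dict(multiplicity), {u: sorted(vs) for u, vs in edges.items()}


def extend_contig(start, edges):
    """Follow the path from start while each vertex has exactly one successor."""
    contig, current, seen = start, start, {start}
    while len(edges.get(current, [])) == 1:
        nxt = edges[current][0]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        current = nxt
    return contig


# Two toy reads that overlap on the k-mer "GTAC" (hypothetical data).
reads = ["ACGTAC", "GTACGG"]
multiplicity, edges = build_de_bruijn(reads, k=4)
print(multiplicity["GTAC"])          # sampled once in each read
print(extend_contig("ACGT", edges))  # path through the overlapping k-mers
```

The multiplicity table is the quantity that multiplicity-based error removal and graph decomposition operate on; in a metagenomic sample its values reflect both sequencing errors and the differing abundances of the source genomes.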
By determining local peaks in the distribution of k-mer multiplicities, MetaVelvet decomposes
the de Bruijn graph according to the multiplicities. As k-mers sampled from different genomes
may have similar multiplicities and k-mers sampled from the same genome could have different
multiplicities (due to sequencing bias), IDBA-UD calculates the average multiplicity of k-mers
in the same contig and uses it to determine erroneous k-mers and k-mers sampled from different
genomes. As the threshold is determined locally, it can decompose the de Bruijn graph more
accurately than Meta-IDBA and MetaVelvet, which use global thresholds. Ray Meta uses another
approach to construct the contigs. Instead of decomposing the de Bruijn graph, it applies a
heuristics-guided graph traversal to reconstruct the contigs. Although all the above
assemblers try to reconstruct contigs from metagenomic data, short contigs (several thousand
nucleotides) and chimeric contigs (misassembled contigs that join sequences from different
genomes) can still result because of the high diversity of metagenomic data.

Since the number of k-mers is large, research has been performed on storing the de Bruijn
graph using less memory. Several efficient data structures have been developed based on the
Bloom filter (Chikhi and Rizk 2012; Pell et al. 2012). A Bloom filter uses a hash table (a bit
array) and several hash functions to store the existence of k-mers. When storing a k-mer, each
hash function calculates an address based on the pattern of the k-mer, and all these addresses
are set to 1 in the hash table. Thus, the existence of a k-mer in the reads can be determined
by checking several bits in the hash table. Although there may be some false-positive k-mers,
the number of false positives is small when the hash table is large enough and there are
multiple hash functions.

Binning
After reconstructing contigs, each long contig can be aligned to known reference genomes in
the database for identifying the microbes in the samples (Huson et al. 2011). Even when there
is no similar reference genome in the database, gene sequences may be predicted (Rho
et al. 2010) using different classifiers, which helps in analyzing the metabolism of the
unknown microbes. However, for contigs sampled from microbes without a reference genome and
for low-abundance microbes without enough reads for assembling long contigs, a binning
approach is required. Note that since most microbes cannot be cultivated and their genomes are
still unknown, many reads and contigs cannot be aligned to a reference genome in the database.

Binning reads and contigs means clustering reads and contigs sampled from the same microbe
using common properties of the reads. Composition-based methods use generic features, e.g.,
GC content, codon usage, dinucleotide distribution, and 4-mer distribution, to classify reads
sampled from different genomes. Existing supervised or semi-supervised binning algorithms
(Brady and Salzberg 2009; McHardy et al. 2006) construct a classifier to determine the source
of reads based on reference genomes in the database. Compared with alignment-based methods,
these algorithms do not require the exact reference genome. Instead, a classifier can be
constructed from a similar genome in the database so that more reads can be binned. However,
as there is a limited number of reference genomes in the database, many reads still cannot be
classified correctly. Some binning algorithms are designed to cluster reads sampled from the
same genome using properties of the reads directly, without any reference genomes.

MetaCluster 3.0 (Yang et al. 2010) clusters reads based on 4-mer distribution. Given two long
reads from the same genome, the occurrence frequencies of different 4-mers on the two reads
should be similar (Zhou et al. 2008). MetaCluster 3.0 calculates the pairwise Spearman
distance of reads based on their 4-mer distributions and clusters the reads using the k-means
clustering method. However, MetaCluster 3.0 can only handle metagenomic data with similar
abundances and long read lengths (500 bp or more). In order to bin short reads of length about
100 bp, AbundanceBin (Wu and Ye 2011) and TOSS (Tanaseichuk et al. 2012) consider the
occurrence frequency of k-mers (k = 25) in all the reads. k-mers that occur frequently should
be sampled
from high-abundance microbes, while k-mers that occur rarely should be sampled from
low-abundance microbes. Based on this assumption, AbundanceBin and TOSS can bin reads
according to the k-mer frequencies. However, when the abundances of two microbes are similar
(abundance ratio within 1:3), these algorithms fail to separate the reads sampled from the two
microbes. MetaCluster 4.0 further improves on MetaCluster 3.0 by combining overlapping short
reads into long virtual contigs and estimating the 4-mer or 5-mer distribution of the virtual
contigs. As the virtual contigs are much longer than the short reads, the 4-mer distribution
of the virtual contigs can be estimated accurately. By constructing a huge number of small
clusters and merging clusters with similar 4-mer distributions, MetaCluster 4.0 (Wang
et al. 2012a) can handle metagenomic data with microbes of different abundances. However,
these unsupervised binning algorithms cannot handle low-abundance microbes well because they
cannot distinguish reads sampled from these low-abundance microbes from the erroneous reads
sampled from high-abundance microbes. MetaCluster 5.0 (Wang et al. 2012b) is designed for
binning reads from both high- and low-abundance microbes. It performs binning in two rounds.
In the first round, its target is to bin reads sampled from high-abundance microbes using
restrictive parameters for constructing virtual contigs and clustering reads. Reads sampled
from low-abundance microbes can be handled in the second round using less restrictive
parameters. By applying multiple rounds of binning, MetaCluster 5.0 can bin reads from
microbes with sequencing depth as low as 6× in a metagenomic dataset containing 100 microbes.
However, it still cannot bin reads sampled from microbes with sequencing depth lower than 6×.

Conclusion

Assembling and binning reads are two important procedures for analyzing metagenomic data. The
high biodiversity and large variations in abundances of genomes in metagenomic data make the
problems challenging. A common practice for analyzing metagenomic data is to assemble short
reads into longer contigs and then try to identify microbes in the sample by aligning the
contigs and unassembled reads to reference genomes. As most of the microbes have no reference
in the database, the unaligned reads and contigs should be binned together using generic
features, e.g., GC content, codon usage, dinucleotide distribution, and 4-mer distribution. A
previous study shows that binning contigs instead of reads can improve the accuracy of
binning, because long contigs carry more generic information than short reads. However, little
research has been performed on how to improve the result of assembling using binning.
Moreover, researchers usually use the information of reference genomes only by alignment and
supervised binning. In fact, similar genomes in the database may be used to improve the
performance of de novo assembling. As the performance of existing de novo assemblers and
binning algorithms on real biological data is not satisfactory, further research on combining
assembling, binning, and the use of reference genomes may be a possible way to improve the
performance of analyzing metagenomic data.

References

Boisvert S, Raymond F, Godzaridis E, et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13(12):R122.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–6.
Case RJ, Boucher Y, Dahllof I, et al. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol. 2007;73:278–88.
Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algoritm Bioinforma. 2012;7534:236–48.
Eisen JA. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007;5(3):e82.
Frisli T, Haverkamp TH, Jakobsen KS, et al. Estimation of metagenome size and structure in an experimental soil microbiota from low coverage next-generation sequence data. J Appl Microbiol. 2013;114(1):141–51.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3:347–71.
Hunter CI, Mitchell A, Jones P, et al. Metagenomic analysis: the challenge of the data bonanza. Brief Bioinform. 2012;13(6):743–6.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Klappenbach JA, Saxman PR, Cole JR, et al. rrndb: the ribosomal RNA operon copy number database. Nucleic Acids Res. 2001;29:181–4.
Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18(3):429–43.
McHardy AC, Martin HG, Tsirigos A, et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2006;4:63–72.
Mende DR, Waller AS, Sunagawa S, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE. 2012;7(2):e31386.
Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Pell J, Hintze A, Canino-Koning R, et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci. 2012;109(33):13272–7.
Peng Y, Leung HC, Yiu SM, et al. IDBA - a practical iterative de Bruijn graph de novo assembler. Res Comput Mol Biol. 2010;6044:426–40.
Peng Y, Leung HC, Yiu SM, et al. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics. 2011;27:i94–101.
Peng Y, Leung HC, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with high uneven depth. Bioinformatics. 2012;28:1420–8.
Perna N, Plunkett III G, Burland V, et al. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001;409:529–33.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Rappe MS, Giovannoni SJ. The uncultured microbial majority. Annu Rev Microbiol. 2003;57:369–94.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94(3):441–8.
Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–45.
Tanaseichuk O, Borneman J, Jiang T. A probabilistic approach to accurate abundance-based binning of metagenomic reads. Algoritm Bioinforma. 2012;7534:404–16.
Tremaroli V, Backhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Vyahhi N, Pyshkin A, Pham S, et al. From de Bruijn graphs to rectangle graphs for genome assembly. Algoritm Bioinforma, LNCS. 2012;7534:249–61.
Wang Y, Leung HC, Yiu SM, et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J Comput Biol. 2012a;19:241–9.
Wang Y, Leung HC, Yiu SM, et al. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012b;28:i356–62.
Weinstock GM. Genomic approaches to studying the human microbiota. Nature. 2012;489:250–6.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667.
Wu YW, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011;18(3):523–34.
Yang B, Peng Y, Henry CM, et al. Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinforma. 2010;11 Suppl 2:S5.
Zhou F, Olman V, Xu Y. Barcodes for genomes and applications. BMC Bioinforma. 2008;9(1):546.

NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data

Ravi K. Patel and Mukesh Jain
Functional and Applied Genomics Laboratory, National Institute of Plant Genome Research
(NIPGR), New Delhi, India

Synonyms

Format converters; Illumina; NGS data quality control; NGS data trimming; Roche 454

Definition

NGS QC Toolkit is a Perl-based stand-alone program package for the quality control (QC) of
next-generation sequencing (NGS) data. In addition to QC tools, it consists of many subsidiary
tools for handling and processing of data obtained from Illumina and Roche 454 sequencing
platforms. The open-source toolkit is freely available at
http://www.nipgr.res.in/ngsqctoolkit.html.
Introduction

The need for fast and high-throughput sequencing has led to the development of NGS
technologies. The advent of these technologies has transformed genomics research by providing
an opportunity to study genetic information at single-base resolution in a cost-effective
manner (Metzker 2010). However, several artifacts are usually present in NGS data due to
technical errors and limitations associated with different NGS platforms. These sequence
artifacts, including read errors, poor-quality reads, and primer/adaptor contamination, might
affect downstream sequence analysis, such as de novo genome and transcriptome assembly, gene
expression studies, and single nucleotide polymorphism detection. To avoid misleading
conclusions, it is necessary to filter the NGS data for these sequence artifacts (Benaglio and
Rivolta 2010).

NGS platform vendors have developed commercial QC pipelines dedicated to mitigating the effect
of limitations associated with their platforms. However, even after processing through these
pipelines, many sequence artifacts remain in the data. Several efforts have been made to
resolve one or another of these sequence artifacts, but many of them are specific to a
particular sequencing platform. NGS QC Toolkit (Patel and Jain 2012) can handle many of the
known sequence artifacts in Illumina and Roche 454 sequencing data. It is a stand-alone and
user-friendly toolkit written in the Perl programming language, with a modularized structure
supported by several subroutines for various tasks, which allows better maintainability. The
toolkit comprises many easy-to-use tools for quality check and filtering, trimming, generating
statistics, and conversion between different file formats/variants for Illumina and Roche 454
sequencing data (Fig. 1).

QC Workflow
The toolkit provides dedicated tools for the QC of single-end (SE) and paired-end (PE) data
from Illumina (IlluQC tools) and Roche 454 (454QC tools) sequencing platforms. Although various
parameters in these tools are set to sensible default values, they can be adjusted by the
users to optimize QC analysis, which makes these tools versatile for different NGS assays.
IlluQC has the ability to identify different FASTQ file variants (Cock et al. 2009) and set
the quality scoring system accordingly for further analysis. Reads are analyzed based on their
quality, and the poor-quality reads not fulfilling the user-specified criteria are discarded.
The filtered reads are checked for primer/adaptor sequence contamination, and the matching
reads are discarded. The high-quality filtered data is exported as output along with various
quality statistics. 454QC tools read FASTA files and filter reads based on the specified
length cutoff at several stages in the analysis. The tool can also trim reads containing
homopolymer(s) longer than a specified length. Further, the quality check and primer/adaptor
sequence match are performed similarly to the IlluQC tools. However, unlike the IlluQC tools,
454QC tools trim the respective ends of the read showing a primer/adaptor match. Eventually,
the high-quality reads are exported in FASTA format. Processing of Roche 454 PE data (using
454QC_PE.pl) requires an additional step of finding the linker sequence to separate and
process both end reads simultaneously.

Key Characteristics

While NGS QC Toolkit shares its features with many other QC tools (Schmieder and Edwards 2011;
Cox et al. 2010; Lassmann et al. 2009; Pandey et al. 2010), it also provides a few unique
attributes for the QC analysis of NGS data. In addition to high-quality filtered data output,
it is also equipped with modules for generating several different kinds of statistics in
graphical format along with text files to help users better understand the data quality (Patel
and Jain 2012).

Reduced Computational Time and Storage Space Requirement
Continued improvement in NGS technologies has achieved larger read length and manyfold
increase in throughput. To reduce the time required for the QC of several gigabases of
sequence data, parallel computing has been implemented in the QC tools. A significant decrease
in analysis time was evident using parallelized QC tools on multi-core computer systems (Patel
and Jain 2012). Nevertheless, the tools can also be run on single-core computers without any
additional requirement. Another challenge with huge NGS datasets is the increased storage
space requirement, which is considerably reduced by the use of compressed (gzip) files. The
high-quality filtered output data in compressed gzip files can be used directly for downstream
analysis, which saves a large amount of storage space.

Conservation of PE Data Integrity
PE sequencing data helps increase sequence coverage and confidence in the alignment, which is
crucial for downstream analysis. However, surprisingly, few QC pipelines other than the NGS QC
Toolkit maintain the pairing information of the PE data in the filtered data. QC tools analyze
both reads of each pair concurrently and export the high-quality filtered PE data along with
the unpaired reads (when only one read of the pair passes QC filters). In this way, QC tools
maintain PE data integrity and try to retain all important high-quality sequencing data.

Homopolymer Trimming
A major artifact is introduced in Roche 454 pyrosequencing data by the use of pyrophosphate
for the detection of incorporated bases. It was found that the linearity of signal intensity
is disturbed when a longer homopolymer is encountered (Margulies et al. 2005). This artifact
may affect the downstream analysis due to frameshifts. 454QC tools provide an optional
parameter to trim the homopolymer of the given minimum threshold length.

FASTQ Variant Detection
The use of inconsistent variants of the FASTQ format by different sequencing platforms makes it
tough for users to apply appropriate tools for the analysis, because the quality scoring
system varies with the variants (Cock et al. 2009). To make the analysis easier, IlluQC tools
are programmed to first identify the input FASTQ variant automatically and set the appropriate
scoring system for further QC analysis.

Additional Tools

Apart from QC tools, a number of additional tools are provided in the toolkit to manage and
generate statistics for the NGS data (Fig. 1). A set of sequence format converter tools
converts between different variants of the FASTQ format based on the equations described
previously (Cock et al. 2009). It also provides tools for conversion between FASTQ and FASTA
formats. The TrimmingReads.pl tool is capable of trimming reads based on two criteria. It can
trim a given number of bases from the 5′ and/or 3′ end of the reads. Another mode of trimming
is to trim low-quality bases from the 3′ end of the reads using a user-defined threshold value
of quality scores. HomopolymerTrimming.pl, as the name suggests, clips the 3′ read end from
the first nucleotide of a homopolymer of user-defined cutoff length. A tool newly introduced
upon request from users, AmbiguityFiltering.pl, helps to filter reads containing ambiguous
bases (N/X content) or to trim flanking ambiguous bases. A couple of tools, AvgQuality.pl and
N50Stat.pl, generate statistics to help nonexpert users access various sequence statistics.

Installation

The toolkit requires a Perl interpreter and a few additional Perl modules like GD (optional;
required to generate QC graphs) and String::Approx. Users need to download the NGSQCToolkit
zip folder from the website. The toolkit is ready to use just after unzipping the folder. The
distribution includes all the tools along with a user manual, which provides important links
for the module installation and describes the tools and their usage in detail. Tools can
report the missing dependencies if required modules are not found or are improperly installed.
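
The 3′ quality trimming mode described for TrimmingReads.pl, and the way quality scores depend on the FASTQ variant, can be illustrated in Python. This is a minimal sketch, not the toolkit's Perl code. The function names are hypothetical, and the offset guess is a common heuristic (an assumption here, not the detection procedure used by IlluQC): Sanger-style files encode scores with ASCII offset 33, older Illumina and Solexa files use offset 64, and no character below ASCII 59 can occur in an offset-64 file.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic only: characters below ASCII 59 cannot occur in offset-64 files,
    so seeing one implies offset 33. High-quality offset-33 data may be misjudged.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def trim_3prime(seq, qual, threshold, offset):
    """Remove bases from the 3' end while their quality score is below threshold."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: six bases with score 40 followed by two with score 2.
seq = "ACGTACGT"
qual_sanger = "IIIIII##"    # offset 33: 'I' -> 40, '#' -> 2
qual_illumina = "hhhhhhBB"  # offset 64: 'h' -> 40, 'B' -> 2

offset = guess_phred_offset([qual_sanger])
print(offset, trim_3prime(seq, qual_sanger, threshold=20, offset=offset))
```

Decoding the same quality string with the wrong offset shifts every score by 31, which is why the variant has to be identified before any quality-based filtering or trimming.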
[Figure 1. Recoverable labels: Quality Control; Format Conversion; Trimming: TrimmingReads.pl (trimming of reads from the ends), HomoPolymerTrimming.pl (trimming of reads at the 3′ end from the homopolymer of user-specified length); N50Stat (calculation of various statistics for sequences in FASTA format).]
NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data, Fig. 1 Various QC and
data processing tools included in the NGS QC Toolkit
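
The statistics reported by N50Stat are not listed in this entry. As an illustration only, the widely used N50 statistic (the length L such that sequences of length at least L contain at least half of all bases) can be computed as in the following Python sketch. The function names and the toy FASTA text are hypothetical, and the code is not taken from the toolkit.

```python
def fasta_lengths(text):
    """Return the length of each sequence in FASTA-formatted text."""
    lengths = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)         # header line starts a new record
        elif lengths:
            lengths[-1] += len(line)  # sequence may span several lines
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")


fasta_text = ">contig1\nACGT\nAC\n>contig2\nGG\n"
lengths = fasta_lengths(fasta_text)
print(lengths, n50(lengths))
```

Because the sequences are visited from longest to shortest, the function returns as soon as the running total reaches half of all bases.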
References

Cock PJA, Fields CJ, Goto N, et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009;38:1767–71.
Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010;11:485.
Lassmann T, Hayashizaki Y, Daub CO. TagDust-a program to eliminate artifacts from next generation sequencing data. Bioinformatics. 2009;25:2839–40.
Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11:31–46.
Pandey RV, Nolte V, Schlotterer C. CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies. BMC Res Notes. 2010;3:3.
Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619.
Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27:863–4.

Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome

Digvijay Verma and Tulasi Satyanarayana
Department of Microbiology, University of Delhi, New Delhi, India

Synonyms

Community genomics; Culture-independent approach; Environmental genomics; Endoxylanase;
Endo-β-1,4 xylanase; Thermo-alkali-stable xylanase; Xylanase

Definition

For retrieving genes encoding thermo-alkali-stable xylanases by the culture-independent
(metagenomic) approach, the DNA extracted from hot and alkaline environmental samples is
restricted and the fragments are cloned. The clones are screened to select colonies with the
desired xylanase gene, the insert is sequenced, and the gene is subcloned and expressed. The
recombinant xylanase is purified, characterized, and tested for its applicability in
generating xylo-oligosaccharides from agro-residues and in pulp bleaching.

Introduction

Hemicellulosic components are an integral part of lignocellulosic residues and the second most
abundant renewable polymer of plant cell walls after cellulose. Xylan is the main constituent
in hemicelluloses of lignocellulosic agro-residues. β-1,4-linked xylosyl residues form the
backbone of xylan that makes it a homopolysaccharide. Since xylan contains several groups such
as arabinosyl, acetyl, and glucuronosyl residues that are present in the side chains, xylans
are heteropolysaccharides (Hori and Elbein 1985; Coughlan and Hazlewood 1993).
Heteropolymeric xylan requires the synergistic action of multiple xylanolytic enzymes for
complete degradation. The complex xylanolytic system includes endoxylanase (1,4-β-D-xylan
xylanohydrolase; EC 3.2.1.8), β-xylosidase (1,4-β-D-xylan xylohydrolase; EC 3.2.1.37),
α-glucuronidase, α-L-arabinofuranosidase, and acetyl xylan esterase. The CAZy database
(http://www.cazy.org/fam/acc_GH.html) classifies xylanases into six glycosyl hydrolase
families: GH5, GH8, GH10, GH11, GH30, and GH43 (Collins et al. 2005). Family 10 and 11
xylanases are, however, the most widely distributed in nature. Owing to their low molecular
weight and substrate stringency, family 11 xylanases are considered true xylanases, while GH10
xylanases have broad substrate specificity and higher molecular weight.

Xylanases have successfully been used in various industries like ramie fiber degumming, food
processing, and textile, biofuels, feed, and paper/pulp industries. However, xylanases must be
alkalistable and thermostable to withstand the
extreme conditions prevailing in the paper indus- Germany). Hundred nanogram of insert DNA
tries in the pre-bleaching of kraft pulp. Although and 300 ng of Bam HI digested and
several xylanases have been reported from a large dephosphorylated p18GFP vector were ligated
number of microorganisms, most of them do not by using T4 DNA ligase overnight at 16 °C. The
have adequate thermostability and alkalistability ligation mixture was transformed into competent
for their utility in paper and pulp industries. E. coli DH10B cells by heat shock method. The
Majority of xylanases have been obtained from metagenomic library was spread and screened for
the culturable 0.1–1 % of the total microbial xylanase activity on 0.3 % (w/v) RBB-xylan
diversity existing in natural environments. The (4-O-methyl-D-glucurono-D-xylan-remazol bril-
culture-independent metagenomic approaches liant blue R) (Sigma, St. Louis, MO, USA)
permit retrieval of genes encoding useful LB-ampicillin agar plates. The transformants
enzymes from environmental samples without were grown at 37 °C overnight and observed for
involving laborious and elaborate methods of the zone of xylan hydrolysis.
cultivation of microbes. The immense demand
for alkalistable and thermostable xylanases Screening for Xylanase and Sequence
encouraged us to adapt this innovative strategy Analysis
for retrieving genes that encode thermo-alkali- The pure clone (TSDV-MX1) showing clear zone
stable xylanases from environmental of xylanase hydrolysis was sequenced using M13
metagenomes. forward and reverse primers followed by differ-
In this investigation, a metagenomic library ent internal primers using Applied Biosystem
was constructed and screened for clones with 373 stretch automated sequencer (Applied
xylanase activity. Xylanase-encoding gene Biosystems, Foster City, CA, USA) at Nucleic
(Mxyl) (accession no. AFP81696) was subcloned Acid Sequencing Facility of the University of
and expressed, and the recombinant xylanase was Delhi South Campus, New Delhi (India), for
purified and characterized. To the best of our obtaining full sequence of the insert. The ORFs
knowledge, this is the first report on retrieving were identified by using the NCBI’s open reading N
thermo-alkali-stable GH 11 family xylanase by frame (ORF) finder tool (http://www.ncbi.nlm.
a metagenomic approach. nih.gov/gorf/gorf.html). BLASTN and BLASTP
of NCBI were used to align the nucleotide and
amino acid sequences, respectively. Multiple
Methodology alignments of the amino acids were carried out
using the CLUSTALW program (http://www.ebi.
Collection of Samples and Construction of ac.uk/clustalW). The phylogenetic analysis was
Metagenomic Library done using MEGA 2.1 with neighbor-joining
The samples of compost soil were collected in strategy.
sterile polyethylene bags from the vicinity of
a hot water spring near Fukuoka Japan and stored Construction and Expression of Plasmids
at 4 ºC. The pH of the samples is in the acidic pET28a-Mxyl and pET22b-Mxyl
range (3.0–4.5). Soil DNA was extracted The xylanase gene was amplified and ligated into
according to Verma and Satynarayana (2011). the digested vectors followed by transformation
Metagenomic DNA was processed for into competent E. coli XL1 blue cells to obtain
constructing the metagenomic library. Five mg pET28-Mxyl and pET22-Mxyl. The recombinant
of metagenomic DNA was partially digested constructs were confirmed by colony PCR
with 0.5 U of restriction enzyme Sau3AI. The followed by double digestion of the construct
fragments of 3–12 kb were eluted from agarose with restriction enzymes. The clones having
gel (1.2 %, w/v) by gel extraction kit according to xylanase gene were transformed into E. coli
manufacturer’s protocol (Macherey-Nagel, BL21(DE3) and processed for sequencing.
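The ORF identification step described above relied on NCBI's ORF finder. As a rough illustration of what such a tool does, the following minimal Python sketch scans all six reading frames for stretches running from ATG to the next in-frame stop codon. It is not NCBI's implementation: the function names, the `min_len` threshold, and the demo sequence are hypothetical, and only the standard start codon (ATG) and stop codons (TAA, TAG, TGA) are assumed.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300) -> List[Tuple[str, int, int, str]]:
    """Return (strand, start, length_bp, orf_sequence) tuples, longest first.

    'start' is a 0-based offset on the scanned strand; length includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, start, length, s[start:i + 3]))
                    start = None  # close the ORF and look for the next ATG
    return sorted(orfs, key=lambda o: o[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy sequence, not the Mxyl insert.
    demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    for strand, start, length, orf in find_orfs(demo, min_len=9):
        print(strand, start, length, orf)
```

For a real insert of several kilobases, a larger `min_len` (a few hundred base pairs) would typically be used to suppress short spurious ORFs before the candidates are compared against databases with BLAST.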
The recombinant plasmid having the accurate sequence was then transformed into E. coli BL21 (DE3) competent cells for the expression of recombinant proteins from pET28a-Mxyl and pET22b-Mxyl. The expression was induced by adding isopropyl-β-D-1-thiogalactopyranoside (IPTG) to a final concentration of 1 mM and the culture was further cultivated at 30 °C. The samples were collected at 1 h intervals for determining the enzyme titers. Localization of the recombinant protein was determined by collecting the intracellular, extracellular, and periplasmic fractions from the cells followed by assay for xylanase (Verma and Satyanarayana 2012).

Site-Directed Mutagenesis
Multiple sequence alignment of recombinant xylanase with the known xylanases revealed Glu117 and Glu209 to be catalytically important residues. This was confirmed experimentally by site-directed mutagenesis using the GeneArt site-directed mutagenesis kit (Invitrogen, Carlsbad, USA). Two point mutations (Glu117Asp and Glu209Asp) were created in the metagenomic xylanase gene and expressed in E. coli BL21(DE3) cells. The induced mutations were confirmed by sequencing.

Xylanase Assay
Xylanase was assayed according to Archana and Satyanarayana (1997) at 80 °C and pH 9.0. One unit of xylanase is defined as the amount of enzyme required to liberate 1 µmol of reducing sugar as xylose ml⁻¹ min⁻¹ under the assay conditions.

Purification and Biochemical Characterization of rMxyl
The rMxyl was purified by affinity chromatography using Ni2+-NTA agarose (Novagen, Germany) (Verma and Satyanarayana 2012). The characteristics of the recombinant xylanase like the effect of pH, temperature, metal ions, inhibitors and detergents on enzyme activity, thermostability, and substrate specificity have been studied. Kinetic properties of the recombinant enzyme (Km and Vmax) on different xylans from birchwood, beech wood, and oat spelt were calculated from Lineweaver-Burk double reciprocal plots.

Saccharification of Agro-residues/Hydrolysis of Xylan
One percent (w/v) standard xylo-oligosaccharides (X2–X6) and agro-residues (wheat bran, corncobs, and sugarcane bagasse) were treated with recombinant xylanase (10 U–20 U/g) to find out the hydrolysis of XOs and lignocellulosic substrates. All the substrates (wheat bran, corncobs, and sugarcane bagasse) were suspended in glycine-NaOH buffer (pH 9.0) and incubated at 80 °C. Aliquots at the desired intervals were collected and analyzed on silica-based TLC plates (Merck, Germany) to determine the hydrolysis products. The saccharification of agro-residues was determined using DNSA reagent (Miller 1959).

Results

Construction of metagenomic library, DNA sequencing, and bioinformatics analysis.
When 5.0 µg of high molecular weight (20–30 kb) metagenomic DNA was digested with Sau3AI and the fragments were ligated into p18GFP vector, the library was constructed with an efficiency of 3.6 × 10⁴ clones per µg of DNA; the insert sizes were in the range of 3.0–8.0 kb with an average size of 5.5 kb. On screening, a clone having xylanase gene was spotted on an RBB-xylan-containing LB-amp plate. The full sequence of the insert was 6.231 kbp in size, and BLAST analysis revealed its prokaryotic origin. The complete insert contained nine transcriptional units with a complete ORF of a 1,077 bp long xylanase gene. The sequence showed putative −35 (CACGCCA) and −10 (TAAAAA) sequences and a ribosomal binding site (AGGGG) upstream of the xylanase gene followed by the complete ORF having ATG and TAA as start and stop codons, respectively (Fig. 1). The xylanase displayed five conserved regions (I–V) of GH11 xylanase having two
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 1 Deduced amino acid sequence of recombinant xylanase (rMxyl) and its nucleotide sequence. The red underlined region is the leader sequence; cyan-highlighted regions represent the GH11 catalytic domain. Gray-highlighted regions are compositionally biased regions that were not used in the database search and are proposed as linker regions. The bluish-green-highlighted region depicts the substrate-binding domain
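As a quick arithmetic cross-check of two figures reported in the text (cloning efficiency with mean insert size, and the length of the xylanase ORF), the following small Python sketch may help. The function names are hypothetical, and the calculation assumes that the reported ORF length of 1,077 bp includes the stop codon; it only restates numbers given in the text and makes no claim about the total size of the library.

```python
def cloned_dna_mb(clones: float, avg_insert_kb: float) -> float:
    """Total cloned DNA (megabases) for a clone count and a mean insert size in kb."""
    return clones * avg_insert_kb / 1000.0


def residues_from_orf(orf_bp: int) -> int:
    """Number of amino acids encoded by an ORF whose length includes the stop codon."""
    if orf_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf_bp // 3 - 1  # subtract the stop codon


if __name__ == "__main__":
    # 3.6e4 clones per microgram of DNA, average insert of 5.5 kb (values from the text)
    print(f"Cloned DNA per microgram: {cloned_dna_mb(3.6e4, 5.5):.0f} Mb")
    # 1,077 bp ORF reported for the xylanase gene
    print(f"Residues encoded by the ORF: {residues_from_orf(1077)}")
```

Under these assumptions the ORF encodes 358 residues, which agrees with the protein length stated in the Discussion.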
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 2 Multiple sequence alignment of xylanase with other xylanases available in database. GenBank accession numbers and source microorganisms are as follows: 182406872 (glycosyl hydrolase family 11 precursor [uncultured bacterium]), 17826947 (Pseudomonas sp. ND137), 29367333 (uncultured Cellvibrio sp.), 388259220 (Cellvibrio sp. BR), 302868167 (Micromonospora aurantiaca ATCC 27029), 386849796 (Actinoplanes sp. SE50/110), 194368056 (Streptomyces sp. S27). Five signature sequences: I (AYLTLYGW), II (VEYYIVDN), III (FWQYWSV), IV (HFDAWASLG), and V (MATEGY) of GH11 family are colored. Two catalytically important residues (Glu 117 and Glu 209) are marked with black circle
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 3 Phylogenetic tree of recombinant xylanase. rMxyl showed highest homology with xylanase of Cellulomonas fimi ATCC 484 followed by uncultured microbial GH 11 xylanase. Neighbor-joining (NJ) tree is constructed by using MEGA 4.0 software. Bootstrap values (n = 1,000 replicates) are represented as percentage. The scale bar depicts the allowed changes per amino acid position
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 4 Analysis of rMxyl using SDS-PAGE (15 % polyacrylamide gel). (a) Lane 1, protein marker; Lanes 2 and 3, washes with 20 and 30 mM imidazole. Recombinant xylanase was eluted using different concentrations of imidazole (100, 200, 250, 300, 400, 450, 500 mM). Purified xylanase showed a molecular mass of ~42 kDa on staining with Coomassie Brilliant Blue R-250. (b) Zymogram analysis of purified xylanase using Congo red staining method

The rMxyl is active in the temperature range between 40 °C and 100 °C (Fig. 5b) with optimum at 80 °C and retains more than 90–95 % activity after exposure to 60 °C and 70 °C for 3 h. The enzyme has a T1/2 of 2.0 h at 80 °C and 15 min at 90 °C (Fig. 5c). The recombinant enzyme did not lose activity after 3 h exposure to pH 8.0 and 9.0, and thereafter, it declined (50 % residual activity after 4 h). Approximately 20–45 % loss in activity was recorded on either side of the pH optimum after 1 h incubation (Fig. 5d). Mg2+, Sn2+, and Fe2+ stimulated rMxyl activity, while Hg2+ and Mn2+ strongly inhibited enzyme activity even at 1 mM. Other metal ions exerted varied inhibitory action on xylanase. More than 30 % activity was lost in the presence of Mn2+ (Table 1). NBS and PMSF inhibited the activity to a significant extent even at 1 mM concentration. β-ME and DTT strongly inhibited enzyme activity. A stimulatory effect of EDTA was recorded on xylanase activity. Most of the metal ions did not affect enzyme activity at 1 mM concentration. Xylanase

Saccharification of Agro-residues/Hydrolysis of Xylan
The rMxyl hydrolyzed xylan from various sources. The enzyme activity was highest on birchwood xylan (relative activity 100 %) in comparison with that on xylan from beech wood (97 %) and arabinoxylan (80 %). There was no activity on carboxymethylcellulose (CMC) and other non-xylan polysaccharides (starch, pullulan, and chitin). The Km and Vmax values of the enzyme on birchwood xylan are 8.0 ± 1.21 mg/ml and 300 ± 9.12 µmol/min/mg, respectively. The saccharification of wheat bran was high (15.2 %) as compared to that of corncobs (9.89 %) and sugarcane bagasse (4.71 %). Various xylo-oligosaccharides were detected in the hydrolysates (Fig. 6).

Discussion

Although several xylanases have been reported from diverse microbiota using traditional culture-dependent approaches, the majority of them do not endure the extreme temperature and alkaline conditions prevailing in industrial processes. An alternate strategy was, therefore, adopted to retrieve a thermo-alkali-stable xylanase gene (Mxyl) by culture-independent metagenomic approach. The metagenomic library constructed with the DNA extracted from the compost-soil samples yielded a clone that produced xylanase. Although the compost soils are in the acidic pH range, an alkalistable and thermostable endoglucanase had been reported from rice straw compost (Son-Ng et al. 2009).
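The Km and Vmax values reported in the Results were obtained from Lineweaver-Burk double reciprocal plots. As a minimal sketch of that calculation, the Python code below fits a straight line to 1/v against 1/[S] by ordinary least squares and reads Vmax from the intercept and Km from the slope. The substrate concentrations and the noise-free velocities are synthetic values generated here for illustration (using Km = 8.0 and Vmax = 300, in the units given in the text); they are not the authors' measurements, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def lineweaver_burk(substrate: Sequence[float], velocity: Sequence[float]) -> Tuple[float, float]:
    """Estimate (Km, Vmax) from a least-squares fit of 1/v = (Km/Vmax) * (1/[S]) + 1/Vmax."""
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in velocity]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # Km / Vmax
    intercept = mean_y - slope * mean_x    # 1 / Vmax
    vmax = 1.0 / intercept
    km = slope * vmax
    return km, vmax


if __name__ == "__main__":
    # Synthetic, noise-free Michaelis-Menten data (illustrative only).
    true_km, true_vmax = 8.0, 300.0
    s_values = [1.0, 2.0, 4.0, 8.0, 16.0]  # hypothetical substrate concentrations, mg/ml
    v_values = [true_vmax * s / (true_km + s) for s in s_values]
    km, vmax = lineweaver_burk(s_values, v_values)
    print(f"Km = {km:.2f}, Vmax = {vmax:.1f}")
```

With real assay data the reciprocal transform amplifies error at low substrate concentrations, so a direct nonlinear fit of the Michaelis-Menten equation is often used as a cross-check.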
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 5 Effect of pH and temperature on the activity and stability of rMxyl. (a and b) The recombinant xylanase was incubated in various buffers (pH 3–12) and at various temperatures (40–100 °C) and assayed for xylanase activity. (c) Recombinant xylanase was incubated in glycine-NaOH buffer without substrate and kept at various temperatures. Aliquots were collected at various time intervals and stored at 0 °C for calculating residual activity. (d) Similarly, the enzyme was incubated in various buffers (pH 8–11), and aliquots from different time intervals were used in xylanase assays
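The half-lives reported for rMxyl (2.0 h at 80 °C and 15 min at 90 °C) can be converted into inactivation rate constants if thermal inactivation is assumed to follow first-order kinetics. That assumption is not stated in the text, so the Python sketch below is only illustrative; the function names are hypothetical.

```python
import math


def decay_constant(half_life_h: float) -> float:
    """First-order inactivation rate constant (per hour) from a half-life in hours."""
    return math.log(2) / half_life_h


def residual_fraction(half_life_h: float, t_h: float) -> float:
    """Fraction of activity remaining after t_h hours, assuming first-order decay."""
    return math.exp(-decay_constant(half_life_h) * t_h)


if __name__ == "__main__":
    for label, t_half in (("80 C", 2.0), ("90 C", 0.25)):
        k = decay_constant(t_half)
        print(f"{label}: k = {k:.3f} per h, "
              f"residual after 1 h = {100 * residual_fraction(t_half, 1.0):.1f} %")
```

Residual-activity curves such as those in Fig. 5c are what such a model would be fitted to; deviations from a single exponential would indicate that the first-order assumption does not hold.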
The culture-independent approach has started yielding useful biocatalysts from the hidden Pandora’s Box of non-culturable microbial diversity. The protein encoded by the xylanase gene comprises 358 amino acids, of which 16 are acidic and 21 basic. The predicted molecular weight, pI, and instability index of recombinant xylanase are ~40 kDa, 8.8, and 33.44, respectively. The xylanase contained a 43-amino-acid-long leader sequence at the N-terminal region followed by a catalytic domain (44th–212th) of GH11 family interrupted by a short stretch of arginine- and threonine-rich non-catalytic region (WSVRQ2R2TG2TIT2). In addition, a serine-rich Q linker region (S2GS2DITVG2TS2G2TS2G2S3G2S10G4) has also been detected from amino acid 213 to 248 just after the catalytic domain. Such repeated amino acids make linker regions that usually separate the catalytic domain from the carbohydrate-binding domain (Gilkes et al. 1991). Moreover, linkers have also been reported as integral parts of various xylanases that connect thermo-stabilizing domains, surface
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Table 1 Effect of modulators on rMxyl activity

Metal ions     1 mM             5 mM             10 mM
Mg2+           106.45 ± 1.05    99.65 ± 0.98     87.38 ± 0.45
Fe2+           108.65 ± 0.75    116.01 ± 0.27    93.67 ± 1.32
Sn2+           110.43 ± 0.67    76.12 ± 0.44     45.17 ± 0.63
Ni2+           91.21 ± 0.22     79.01 ± 1.34     32.84 ± 0.43
Zn2+           91.67 ± 0.32     76.64 ± 0.78     32.89 ± 0.89
Pb2+           81.33 ± 0.67     20.78 ± 0.32     9.65 ± 0.67
K+             81.21 ± 1.08     20.62 ± 0.12     12.67 ± 0.45
Ag2+           73.48 ± 0.53     54.55 ± 0.69     27.83 ± 0.98
Ca2+           72.43 ± 0.43     35.45 ± 0.21     12.09 ± 0.19
Mn2+           71.76 ± 0.63     27.34 ± 1.32     9.67 ± 0.27
Ba2+           66.45 ± 0.67     23.91 ± 0.34     18.65 ± 0.33
Cd2+           54.67 ± 0.43     29.33 ± 0.49     12.87 ± 0.65
Co2+           59.15 ± 1.23     29.63 ± 0.65     12.54 ± 1.12
Na+            61.43 ± 0.78     39.75 ± 1.06     27.35 ± 0.78
Cu2+           29.12 ± 0.18     15.76 ± 0.76     10.09 ± 0.87
Hg2+           0                0                0

Inhibitors     1 mM             5 mM             10 mM
NBS            46.66 ± 0.12     35.67 ± 0.09     20.12 ± 0.11
IAA            103.45 ± 0.54    89.75 ± 0.32     69.85 ± 1.56
β-ME           0                0                0
DTT            0                0                0
EDTA           105.65 ± 1.23    107.19 ± 1.01    89.98 ± 0.56

Detergents     0.1 % (v/v)      0.5 % (v/v)
Tween 20       103.45 ± 1.32    105.67 ± 0.98
Triton X100    108.32 ± 0.96    104.05 ± 0.92
SDS            97.34 ± 1.32     65.89 ± 0.19

Control        100 ± 0.12       100 ± 0.23       100 ± 0.67
layer homology domains, and dockerin domains which play a role in stabilizing the protein. Amino acid homology and hydrophobic cluster analysis categorized this high molecular weight xylanase into GH11 family. Metagenomic origin, distinct characteristics, lower homology, and higher molecular weight (>30 kDa) make this a novel xylanase. The integrated N-terminal pelB signal sequence in pET22b(+) directed the enzyme to the periplasm, which further led to secretion into the extracellular environment.
The site-directed mutagenesis of two residues of glutamate to aspartate resulted in a complete loss of xylanase activity due to disruption in the double-displacement mechanism. In order to take advantage of the thermostability of the recombinant xylanase, it was subjected to high temperature prior to purification by Ni2+-NTA agarose resins. This step reduced the extra load of non-His-tagged, less thermostable, and contaminant host proteins (Mamo et al. 2006; Verma and Satyanarayana 2012).
The rMxyl exhibits optimum activity at higher temperature (80 °C) and pH (9.0), which is similar to xylanases produced by Dictyoglomus thermolacticum, Thermotoga maritima, Bacillus stearothermophilus, and Geobacillus thermoleovorans having optimal activity at or above 80 °C (Uchino and Fukuda 1983; Mathrani and Ahring 1992; Khasin et al. 1993; Verma and Satyanarayana 2012). The activity and stability of rMxyl at higher pH are the crucial properties of
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 6 Profile of xylo-oligosaccharides liberated by the action of rMxyl. Lane (A1–A4)*: spots of X1, X2, and X3 were detected from wheat bran. Lane (B1–B4)*: hydrolysate from corncobs showed prominently X2 and X3, while X3, X4, and X5 were detected from hydrolysate of sugarcane bagasse (C1–C4)*. Lane M: standards of various XOs. X1 xylose, X2 xylobiose, X3 xylotriose, X4 xylotetraose, X5 xylopentaose. *: 1/2/3/4 time intervals of 5, 15, and 30 min and 1 h, respectively
xylanases for their applicability in paper processing industry. The shelf-life of rMxyl is more than 3 months at 4 °C, over which it retains greater than 90 % activity. The recombinant xylanase is optimally active at 80 °C and pH 9.0, which distinguishes it from already reported xylanases. The xylanase of Thermotoga maritima has Topt of 90 °C, but it gets inactivated fast at pH 6.0 (Yoon et al. 2004). Similarly, alkalistability at higher pH is reported for many xylanases, but these are active at lower temperatures (Khasin et al. 1993). The recombinant xylanase of GH10 family from Bacillus halodurans showed both properties together, having optima at 75 °C and pH 9.0, but it loses 50 % activity at 65 °C after 4 h and gets inactivated very fast at 80 °C (Mamo et al. 2006). The metagenomic xylanase, on the other hand, has good thermostability at higher temperatures (60 °C, 70 °C and 80 °C) with only 20–30 % loss after 3 h exposure. The most significant aspect of this investigation is obtaining a highly alkalistable (pHopt. 9.0) and thermostable (Topt. 80 °C) xylanase from environmental samples by a metagenomic approach.
Cations (Mg2+, Sn2+, and Fe2+) stimulated the rMxyl activity, while Hg2+ and Mn2+ at 1 mM significantly inhibited the activity. The inhibition of xylanase by Hg2+ suggests the presence of tryptophan residues that oxidize indole ring, thereby inhibiting the xylanase activity. The inhibition of xylanase activity by Cu2+ is similar to the majority of the xylanases (Matteotti et al. 2012). In Glaciecola mesophila KMM 241, EDTA caused ~25 % enhancement in activity (Guo et al. 2009). NBS inhibition suggests the involvement of tryptophan in xylanase activity. Total loss of xylanase activity by β-ME and DTT suggests the distortion of disulfide linkages present between cysteine residues (Maalej et al. 2009; Matteotti et al. 2012). Detergents exerted a slight stimulatory effect on the recombinant xylanase, which is a common feature of the other xylanases. However, rMxyl was inhibited by SDS.
The rMxyl hydrolyzed birch wood and beech wood xylans efficiently. The structural similarity of beech wood and birch wood xylans may be the reason for the high activity. The enzyme exhibited almost similar activities on oat spelt
xylan and arabinoxylan. Oat spelt xylan is a type of arabinoxylan very rich in arabinose (xylose/arabinose = 66:34) (Gruppen et al. 1992; Kormelink and Voragen 1993). Interestingly, the rMxyl liberated xylo-oligosaccharides from xylan in just 5 min, and this was sustained on prolonged incubation. Several xylanases have been reported from various microorganisms that liberate xylo-oligosaccharides following xylan hydrolysis. Alkaline xylanases show better action on agro-residues by lowering the steric hindrance caused by cellulose and enhancing the solubility of hemicellulosic materials (Gruppen et al. 1992). The metagenomic xylanase finds application in food industry for the production of xylo-oligosaccharides as prebiotics (Vazquez et al. 2000).

Summary

Most of the xylanases retrieved by culture-dependent and culture-independent approaches exhibit optimal activity in the pH and temperature ranges of 6.0–8.0 and 40–60 °C, respectively. The xylanase (rMxyl) obtained in this investigation through metagenomic approach displays alkalistability as well as thermostability. This is the first report on the xylanase with twin stabilities obtained through a culture-independent approach. A very low similarity in amino acid sequence of the enzyme with other known xylanases makes it a novel xylanase. The possibility of obtaining thermo-alkali-stable xylanase from composts may lead to an intense search for similar enzymes in this and other related niches.

References

Archana A, Satyanarayana T. Xylanase production by thermophilic Bacillus licheniformis A99 in solid-state fermentation. Enzyme Microb Technol. 1997;21:12–7.
Collins T, Gerday C, Feller G. Xylanases, xylanase families and extremophilic xylanases. FEMS Microbiol Rev. 2005;29:3–23.
Coughlan MP, Hazlewood GP. β-1,4 D-xylan-degrading enzyme systems: biochemistry, molecular biology and applications. Biotechnol Appl Biochem. 1993;17:259–89.
Gilkes NR, Henrissat B, Kilburn DG, et al. Domains in microbial β-1,4-glycanases: sequence conservation, function, and enzyme families. Microbiol Rev. 1991;55:303–15.
Gruppen H, Hamer RJ, Voragen AGJ. Water unextractable cell wall material from wheat flour. 2. Fractionation of alkali extracted polymers and comparison with water extractable arabinoxylans. J Cereal Sci. 1992;16:53–67.
Guo B, Chen X, Sun C, et al. Gene cloning, expression and characterization of a new cold-active and salt tolerant endo-β-1,4-xylanase from marine Glaciecola mesophila KMM 241. Appl Microbiol Biotechnol. 2009;84:1107–15.
Hori H, Elbein AD. The biosynthesis of plant cell wall polysaccharides. In: Higuchi T, editor. Biosynthesis and biodegradation of wood components. Orlando: Academic; 1985. p. 109–35.
Khasin A, Alchanati I, Shoham Y. Purification and characterization of a thermostable xylanase from Bacillus stearothermophilus T-6. Appl Environ Microbiol. 1993;59:1725–30.
Kormelink FJM, Voragen AGJ. Degradation of different [(glucurono)arabino] xylans by a combination of purified xylan-degrading enzymes. Appl Microbiol Biotechnol. 1993;38:688–95.
Maalej I, Belhaj I, Masmoudi NF, Belghith H. Highly thermostable xylanase of the thermophilic fungus Talaromyces thermophilus: purification and characterization. Appl Biochem Biotechnol. 2009;158:200–12.
Mamo G, Delgado O, Martinez A, et al. Cloning, sequencing analysis and expression of a gene encoding an endoxylanase from Bacillus halodurans S7. Mol Biotechnol. 2006;33:149–59.
Mathrani IM, Ahring BK. Thermophilic and alkaliphilic xylanase from several Dictyoglomus isolates. Appl Microbiol Biotechnol. 1992;38:23–7.
Matteotti C, Bauwens J, Brasseur C, et al. Identification and characterization of a new xylanase from gram-positive bacteria isolated from termite gut (Reticulitermes santonensis). Protein Expr Purif. 2012;83:117–27.
Miller GL. Use of dinitrosalicylic acid reagent for determination of reducing sugars. Anal Chem. 1959;31:426–8.
Son-Ng I, Li CW, Yeh Y, et al. A novel endoglucanase from the thermophilic bacterium Geobacillus sp. 70PC53 with high activity and stability over broad range temperatures. Extremophiles. 2009;13:425–35.
Uchino F, Fukuda O. Taxonomic characteristics of an acidophilic strain of Bacillus producing thermophilic acidophilic amylase and thermostable xylanase. Agric Biol Chem. 1983;47:965–7.
Vazquez MJ, Alonso JL, Dominguez H, et al. Xylo-oligosaccharides: manufacture and applications. Trends Food Sci Technol. 2000;11:387–93.
Verma D, Satyanarayana T. Cloning, expression and applicability of thermo-alkali-stable xylanase of Geobacillus thermoleovorans in generating xylo-oligosaccharides from agro-residues. Bioresour Technol. 2012;107:333–8.
Verma D, Satyanarayana T. An improved protocol for DNA extraction from alkaline soil and sediment samples for constructing metagenomic libraries. Appl Biochem Biotechnol. 2011;165:454–64.
Yoon HS, Han NS, Kim CH. Expression of Thermotoga maritima endo-β-1,4-xylanase gene in E. coli and characterization of the recombinant enzyme. Agric. Chem. Biotechnol. 2004;47:157–160.


Novel Approaches to Pathogen Discovery in Metagenomes

Jun Hang
Viral Diseases Branch, WRAIR, Silver Spring, MD, USA

Synonyms

Community genomics; Metagenomics and pathogen identification; Microbiome and virome; Pathogenomics

Definitions

Pathogen discovery: identification of causative microbial or viral agent(s) for an illness or asymptomatic infection. The identification may refer to etiological diagnosis for individuals, epidemiology investigation on population scale, and animal or environmental surveillance on orphan pathogens.
Metagenomics: genomic study on a population of biologically or functionally close microorganisms as a whole community, without separation of components into pure culture isolates.

Introduction

The best-known statement on pathogen discovery probably is the so-called Koch’s postulates, in which isolation of the disease-causing microbe and determination of its etiological features are of the essence (Falkow 2004; Lipkin 2010). There are fascinating and tragic stories in medical history of human volunteers or doctors who sacrificed health or even their lives to test pathogens on themselves to satisfy the postulates. The principles guided the development of clinical microbiology and remain important guidelines, if not the rules, even in the era of molecular biology and genomics. Nevertheless, studies have shown that the vast majority of microorganisms cannot be readily grown or are not cultivable at all (Handelsman 2004). This is also true for pathogens; in other words, there are numerous varieties of potential pathogens that exist and evolve in the environment; it is just a matter of time when and where they will emerge or reemerge to cause sporadic cases or outbreaks. In addition, the manifestation of some diseases is attributable to coexistence of multiple organisms or an imbalanced microbial community at host tissues.
Technical approaches to pathogen diagnostics evolve along with scientific discovery and technological innovation in microbiology as well as other disciplines. A variety of techniques are used in clinical labs, including the traditional microbiology tests, rapid serological assays, and various molecular assays. They are well designed and validated with reliable sensitivity and specificity (Lipkin 2010). Many of them are automated for improved speed, convenience, and accuracy. However, in spite of the great effectiveness and robustness, the threat from emerging pathogens remains real. In particular, because of the rising globalization and drastic climate changes, novel pathogens and new variant strains have more often appeared and spread. There are chances that a highly virulent pathogen may escape detection by conventional methods and can cause a widespread outbreak and public health crisis with dramatic economic loss and social consequences.
To answer the emergent challenge, novel approaches utilizing the advanced technologies have been developed to effectively identify pathogens as well as elucidate pathogenesis mechanisms in a comprehensive way (Lipkin 2010; Olsen et al. 2012). Metagenomics analyzes all genomic
information in a specifically defined population. The deep and comprehensive metagenomic information allows individual organisms of interest to be interrogated in the context of the whole community and with its phylogenetic relatives (Joseph and Read 2010). This strategy has transformed the way we perceive the microbial world. The related laboratory and bioinformatics approaches were successfully used in identifying causal pathogens for outbreaks and providing vital insights into the source and/or evolutionary origins (Koser et al. 2012). Approaches based on rich knowledge from metagenomics are being vigorously applied to pathogen discovery and are believed to be a clear path for the future of clinical diagnostics (Eisen and MacCallum 2009; Olsen et al. 2012).

Strategy and Schemes

Genomic approaches to detection of pathogens in clinical specimens are either based on known genomic information (sequence dependent) or designed to capture unique and disease-relevant as well as redundant and irrelevant sequences altogether (sequence independent) (Olsen et al. 2012). Metagenomics was initially developed in the era of Sanger sequencing (Fredericks and Relman 1996; Handelsman 2004) and truly thrived with the emergence of the next-generation sequencing (NGS) technologies, which make DNA sequencing much less expensive and hugely productive (Petrosino et al. 2009). It is now feasible and affordable to either sequence a number of amplicons at exceedingly high depth to capture rare variants or sequence all DNA and/or RNA by design in a complex sample. NGS allows direct sequencing of microbial contents without microbiological cultivation for isolation and enrichment. Numerous molecular biology techniques for sample preparation prior to sequencing and bioinformatics tools for data mining and analyses were developed (Thomas et al. 2012). Experiment design and the choice of technical and analytical approaches are vital for the sensitivity and accuracy of pathogen discovery.

16S Ribosomal RNA Gene Sequencing for Human Microbiota Assessment and Identification of Bacterial Pathogens

Bacterial 16S rRNA gene sequence has long been used to classify bacteria down to taxonomy levels of genus or lower. In contrast to amplification, cloning, and sequencing of full length 16S rRNA genes by the Sanger method, NGS enables massive acquisition of a million or more 16S rRNA gene segment sequences in a single run to decipher bacterial composition (species richness and abundance) in a community (Kuczynski et al. 2012). Sequence across two to three variable regions has been suggested to contain taxonomic information unique enough for classification. Roche 454 pyrosequencing is currently the method of choice due to its relatively long read length and low sequence error rate. Read lengths average 300–500 bases for the Roche GS FLX Titanium system and 500–800 bases for the recent FLX+ system. FLX+ application on amplicon sequencing is currently under development and yet to be validated for 16S sequencing, which will achieve longer read length without compromising sequence quality. Different from genome sequencing, in which reads are assembled by overlapping to obtain a consensus sequence, in 16S-based metagenomic analysis, 454 sequencing reads are classified individually, i.e., each read is one operational taxonomic unit (OTU). Therefore, high-performance sample preparation and sequencing procedures, stringent data processing, and analytical pipeline are critical for achieving and maintaining accuracy and sensitivity. Many studies to compare materials and methods for optimization have been published (Kuczynski et al. 2012). One significant open resource is the Data Analysis and Coordination Center (DACC) from the National Institutes of Health (NIH) Common Fund supported Human Microbiome Project (HMP) and is available at website
http://www.hmpdacc.org/. The fundamental knowledge on healthy human microbial communities and the developed metagenomics techniques and analytic tools are being brought into the clinical arena with encouraging successes in diagnosing difficult diseases and complex outbreaks (Loman et al. 2012). Pathogenomics is showing its power and clinical importance by revealing the genomic and metagenomic basis of complicated syndromes that cannot be explicitly understood with conventional clinical tests. In consequence, improved therapeutic practices, reduced medication costs, and more-informed disease prevention measures can result in dependable public health protection.
Microbial Metagenomics and Single-Cell Sequencing
There are considerable and ongoing efforts to characterize collective whole genomes in an entire community. For example, several research teams used Illumina’s NGS technology and a shotgun sequencing approach to generate several hundred gigabases of microbial sequences for extensive cataloging of genes in human gut microbes (Qin et al. 2010; Arumugam et al. 2011). From a number of studies, the depth and comprehensiveness of our knowledge on human microbiomes is unprecedented, and it would not be possible without the advanced NGS technologies and the associated sophisticated bioinformatics tools (Kuczynski et al. 2012; Thomas et al. 2012). However, such an unselective shotgun metagenomic strategy requires tremendous computational power and may not be efficient or cost-effective enough for routine pathogen diagnostics. There are multiple approaches that may help overcome the hurdles to the wide use of metagenomics in clinical settings. The HMP and other international programs aim to build a database of fully annotated complete genome sequences for bacteria of clinical and human health relevance. The high-quality reference genome database with rich and definitive genomic, genetic, functional, and phenotypic information will be the key to a metagenomics-based clinical test (Joseph and Read 2010). Other components essential to the feasibility include a streamlined sample processing and sequencing system with automation, a convenient data collection and management procedure, and an efficient bioinformatics pipeline that works in concert with reference information for sequence analysis and is amenable to integration with medical records and interactive communication with worldwide disease networks and specific study consortiums (Koser et al. 2012).
In addition to the promising clinical use of whole-genome metagenomics, the scientific and technical resources gained from metagenomics quests have a multitude of utilities that can make existing pathogen discovery methods design and perform better (Fournier and Raoult 2011). For instance, with comprehensive genomic information on the microbial community corresponding to the specimens, multilocus sequence typing (MLST), PCR-based molecular assays, microarray-based assays, etc. can be made more specific for the targets with reduced nonspecificity. Moreover, assay results can be interpreted with a better estimate of the probability of miscalling and of false positives, and therefore concluded with increased confidence.
Another promising approach is single-cell genome sequencing for pathogen discovery. Individual microorganisms or parasites are physically isolated out of a complex community, i.e., the clinical matrix, either under microscopy by morphology or using devices such as flow cytometry cell sorting. Both methods are well established and already routinely used in clinical laboratories. A harvested single cell or a homogeneous pool of cells is then subjected to amplification and sequencing. Multiple displacement amplification (MDA) from a single cell has been shown to be robust and faithful for downstream sequencing and microarray applications. Studies showed 95 % or higher genome coverage by using single-cell genomic sequencing (Pallen et al. 2010). The culture-free approach coupled with lab-on-chip microfluidic cell harvesting and processing automation may make its way to become suitable for clinical diagnostic use.
[Figure 1 image. Panel (a) labels: Specimen collection, storage, transportation, log and archiving, deidentification, etc.; Solids (cells, bacteria); Liquids (virus, DNA/RNA); DNA/RNA extraction; DNA/RNA; Next-generation sequencing; Sequence reads. Panel (b) labels: Amplicon V5-V3; Primer A.]
Novel Approaches to Pathogen Discovery in Metagenomes, Fig. 1 Pathogen discovery workflow. (a) Flow diagram of the main procedures to pathogen identification. (b) 16S-based targeted metagenomics for determination of bacterial composition. Top panel shows 16S rRNA gene and hypervariable regions 1–9. Center panel shows three amplicons commonly used in 16S-based metagenomic sequencing. The arrows indicate sequencing direction. Bottom panel shows fusion primers for PCR amplification of 16S rRNA gene segments. Primer A/B and key sequences are compatible to the choice of downstream NGS platform, e.g., Roche/454 GS. Amplicon(s) for each sample can be barcoded individually using sequences such as 454’s 10-nt Multiplexing identifier (MID) sequences. (c) Unbiased random amplification. Random reverse transcription is primed by random hexamers or octamers tailed with specific sequence. Subsequent random PCR uses the random primers and the primer matching with the specific sequence. The double-stranded random amplicons can be sequenced with NGS for viral sequence identification
Unbiased Random Amplification and Sequencing for Viral Pathogen Identification
While microbial metagenomics for bacterial pathogen identification is still at its early stage, viral metagenomics has become a robust approach for hunting novel viral pathogens when viral culture and molecular assays cannot make the diagnosis (Djikeng and Spiro 2009; Mokili et al. 2012). Because of the vast number and variety of viruses in nature and the high frequency of evolution events including nucleotide mutagenesis, sequence recombination, and segment reassortment, it is not rare that a novel virus or a new virus variant escaped initial detections or was misdiagnosed and caused an outbreak. De novo approach with no requirement for known sequence is therefore advantageous for viral pathogen discovery. One technique breakthrough is to identify novel viral sequence by unbiased random amplification and massive sequencing with NGS platforms. The process illustrated in Fig. 1 includes the following major steps: sample preparation which may require extra preprocessing to enrich viruses and reduce non-virus contents, random reverse transcription and anchored random PCR amplification, sequencing the random amplicons by NGS, and data mining for identification of viral pathogen
Novel Approaches to Pathogen Discovery in Metagenomes, Fig. 2 Bioinformatics strategy for identification of viral sequences. Metagenomic sequencing data are processed using streamlined multiple sequence analyzing tools to search for disease-related sequence hits. Two typical analysis paths are shown as examples. It is crucial for the efficiency to reduce redundancy (e.g., sequence assembly) and human genome sequences (decontamination) prior to database alignment while retaining relevant sequences. Reduced volume sequences are subjected to thorough alignments to the specified as well as mega databases
[Figure 2 flowchart, box labels as extracted: Metagenomic random sequencing raw reads; Pre-filtering: Trim QV17 bases; Remove reads shorter than 50 bases; Trim primer sequence from both ends; then two paths. Path 1: Sequence clustering; Remove host sequences (decontamination). Path 2: De novo assembly; Unassembled reads; Contigs. Each path continues with: Bacterial hits (bacteria database); Viral hits (virus database); Other hits (nr/nt database); Artificial/Novel sequences. Final boxes: Taxonomic Annotation; Summary Report.]
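The pre-filtering box of Fig. 2 can be illustrated with a minimal Python sketch. It rests on assumptions that the figure does not confirm: "Trim QV17 bases" is read here as trimming bases with a Phred quality below 17 from the 3' end of each read, and reads are represented as hypothetical (name, sequence, qualities) tuples rather than parsed from a real FASTQ file. The function names are illustrative only.

```python
MIN_QUALITY = 17  # "QV17" in Fig. 2, interpreted as a Phred threshold (assumption)
MIN_LENGTH = 50   # reads shorter than 50 bases are removed, as stated in Fig. 2


def trim_3prime(seq, quals, min_quality=MIN_QUALITY):
    """Drop trailing bases whose quality score is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def prefilter(reads, min_quality=MIN_QUALITY, min_length=MIN_LENGTH):
    """Quality-trim each read, then keep only reads of at least min_length bases."""
    kept = []
    for name, seq, quals in reads:
        seq, quals = trim_3prime(seq, quals, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, quals))
    return kept


if __name__ == "__main__":
    reads = [
        ("r1", "A" * 60, [30] * 55 + [10] * 5),   # trimmed to 55 bases, kept
        ("r2", "C" * 60, [30] * 40 + [5] * 20),   # trimmed to 40 bases, removed
        ("r3", "G" * 30, [30] * 30),              # already too short, removed
    ]
    for name, seq, _ in prefilter(reads):
        print(name, len(seq))
```

Other trimming policies (sliding windows, trimming from both ends) are equally plausible readings of the figure; the thresholds are kept as parameters for that reason.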
Qin J, Li R, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Relman DA. Microbial genomics and infectious diseases. N Engl J Med. 2011;365(4):347–57.
Thomas T, Gilbert J, et al. Metagenomics – a guide from sampling to data analysis. Microb Inform Exp. 2012;2(1):3.
Nucleotide Composition Analysis: Use in Metagenome Analysis
Isaam Saeed
Optimisation and Pattern Recognition Group, Melbourne School of Engineering, The University of Melbourne, Parkville, Australia
Synonyms
Binning; Genome signature; Nucleotide frequency
Definition
The composition of nucleotide bases in a microbial genome is not random and is instead biased toward different compositional structures that vary between organisms. These biases occur as identifiable patterns in oligonucleotide base composition, and it is by these patterns that otherwise anonymous metagenomic sequences are grouped into inferred populations. This allows for more in-depth analysis of the functional potential of a sampled microbial community in the context of constituent members (inferred populations).
Introduction
The composition of nucleotide bases in a microbial genome is not random and is instead biased toward different compositional structures that vary between organisms. These biases occur as identifiable patterns in oligonucleotide base composition, and it is by these patterns that otherwise anonymous metagenomic sequences can be grouped into inferred populations enabling in-depth functional analysis.
Extensive sequencing of microbial DNA made possible the large-scale analysis of this genome base composition. Such analyses have revealed that the various patterns in base composition may be related to specific molecular machinery within microbial cells that help shape base composition. These biasing effects are thought to be mediated by the processes of DNA repair and replication, mutations, and base-step conformational tendencies that operate in concert to give rise to the characteristic base composition of different microbial genomes (Karlin et al. 1997).
Since the sequencing methodology of metagenomics does not preserve the association between sequenced reads and their genome of origin, functional analysis of a metagenome can only provide an overall snapshot of what a microbial community can potentially do. However, if the association between a sequence in a metagenome and the original genome (or population) from which it was sampled can be inferred, then the resulting functional analysis can probe deeper into the inner workings of a microbial community. Processing sequences in this manner prior to functional analysis is referred to as binning. There are currently two major ways to address the binning problem (McHardy and Rigoutsos 2007): the first classifies sequences using a database of preexisting knowledge of microbial organisms; and the second groups related sequences based on the common patterns that arise from biases in the base composition of microbial genomes. The latter approach reflects the exploratory nature of metagenomics, given that the majority of microorganisms cannot be cultivated in a laboratory environment and therefore they may not be represented in current databases as yet.
When considering the use of patterns (or genome signatures) in nucleotide base composition for binning, there are two major factors that will influence the quality of the resulting set of identified groups (inferred populations). The first is the taxonomic resolution of patterns to be
used, which is governed by the between-genome distinctness of a pattern. The second is the accuracy at which these patterns can be grouped, which is governed by the within-genome conservation of a pattern.
A Simple Binning Strategy Using GC Content
It is well established that there are differences in GC content between various microbial genomes. The benefit of this to binning is that these biases can often be used as a representative pattern to group related sequences that share similar GC content. Although localized GC content can vary throughout a genome, if large enough sequences are available in a metagenome, then the assumption that the observed GC content is representative of the full genome composition still holds. It should be noted that GC content is not a unique property of individual genomes and if it is used it will group sequences coarsely (in terms of microbial taxonomy). In such a scenario, if GC content is significantly different between the identified groups, then it can be assumed that these groups are unrelated, but it is not conclusive to say that the sequences within each group are related unless further analysis is conducted (due to the nonuniqueness property of GC content). To increase the taxonomic resolution of binning using GC content, for example, it can often be combined with other complementary features. An approach of this sort was used to group sequences of the metagenome of an acid mine drainage biofilm (Tyson et al. 2004), where GC content was combined with local assembly depth to distinguish the dominant populations that shared similar GC content.
Generally, metagenomes with a small number of dominant species tend to be easier to assemble, and the resulting contig lengths can often make GC content a viable option to group sequences in such data sets. However, there are still limitations to this approach, and for more complex metagenomes, GC content has been superseded by higher-order statistics of base composition, referred to as oligonucleotide (or n-mer) frequencies.
Nucleotide Frequency
With the advent of large-scale, high-throughput DNA sequencing, the increased sample size of sequenced DNA molecules provided a foundation for extensive statistical evaluation of nucleotide composition in different genomes. Further studies of genomic composition, in light of the increasing number of available genome sequences, pioneered the use of higher-order statistics to describe signatures in microbial genomes. The underlying principle of these signatures is based on the observation that specific oligomers are under-/overrepresented in different genomes and that similar biases occur in related genomes. Nucleotide frequency is among the most widely used ways of representing these biases and is calculated by counting all occurrences of fixed length oligos (or n-mers) within a sequence and then normalizing by the total number of oligos in that sequence to arrive at an estimate of the oligonucleotide frequency content. The features of microbial genomes based on nucleotide frequency, which have been successfully applied to metagenomic studies, include the following: the dinucleotide odds ratio, codon signatures/trinucleotide frequencies, and tetranucleotide frequencies.
Dinucleotide Odds Ratio
Among the earliest of these nucleotide frequency signatures that was found to be biologically relevant was the dinucleotide odds ratio, which was based on early in vitro studies on differences in dinucleotide content between various organisms (Karlin et al. 1997). This signature considered the dinucleotide frequency content of a sequence and factored out the effect of mononucleotide frequencies using a normalization scheme based on a Markov model, as given by
$$\rho^{2}_{i} = \frac{f_{XY}}{f_{X}\, f_{Y}},$$
where X and Y represent the first and second mononucleotide in the dinucleotide to be normalized and f represents the frequency of mono-/dinucleotides. The derived statistic, also referred to as the dinucleotide odds ratio, could adequately describe biases specific to various
microbial organisms. For example, it was observed that there is a general TpA avoidance mechanism across various microbial genomes and a CpG underrepresentation in thermophilic microbes. Distances between sequences represented using the dinucleotide odds ratio are evaluated using the Manhattan distance, also referred to as the δ* distance. When this odds ratio differs from 1, the resulting statistic provides a means to estimate the under-/overrepresentation of specific dinucleotides, given by the limits 0.78 and 1.23, respectively (Karlin et al. 1997). Although several genome-wide biases were found when using this odds ratio statistic, the discrimination between larger sets of microbial genomes (more representative of real-world metagenomes) is still better handled by higher-order frequencies.
Codon Signatures/Trinucleotide Frequency
Gene sequences are relatively conserved within a genome, as any changes at critical locations may cause the gene product to be defective. This motivated the use of signatures based on this knowledge to capture more representative patterns in microbial genomes. Codon usage in the gene sequences is thought to be mediated by the overall genome composition and is also related to the flexibility of the choice of codons due to the degeneracy of the genetic code. Trinucleotide frequencies (i.e., frequencies of all possible 3-mers: AAA, AAC, AAT, . . ., GGG) can be used to capture some of these biases, and alternatively, an extension to the dinucleotide odds ratio is also able to capture these codon signatures using dinucleotides (Karlin et al. 1998). This codon signature is constructed as follows:
$$\gamma_{XY}(1,2) = \frac{f_{XY}(1,2)}{f_{X}(1)\, f_{Y}(2)}$$
$$\gamma_{XY}(2,3) = \frac{f_{XY}(2,3)}{f_{X}(2)\, f_{Y}(3)}$$
$$\gamma_{XY}(3,4) = \frac{f_{XY}(3,4)}{f_{X}(3)\, f_{Y}(4)},$$
where the indices represent the nucleotide base at the first, second, or third nucleotide within a codon (with index 4 referring to the first base of the next codon). This signature requires at least 50 full-length genes from a given genome to make a stable estimate, and these ratios can be biased within a genome (not only between genomes), depending on the set of gene classes that comprise the genes in a genome. Due to these issues, such signatures may cause difficulty in grouping sequences for the purpose of binning.
Tetranucleotide Frequency
Tetranucleotide frequency (all possible combinations of 4-mers, of which there are 256) offers greater discrimination between species in a metagenome than lower-order nucleotide frequencies. For this reason, tetranucleotide frequency is perhaps the most widely used in clustering metagenomic sequences. Moreover, it has been found to capture a species-specific signature (a reasonably strong phylogenetic signal at lower taxonomic ranks), which not only makes it a more powerful alternative for clustering metagenomic sequences but also offers biologically meaningful groupings of sequences (Teeling et al. 2004). This was also confirmed by Mrazek (2009), who correlated 16S rRNA distances with various signatures and found that tetranucleotide frequency was able to outperform other feature sets. It has also been found that tetranucleotide frequency can be used to find conserved signatures flanking 16S rRNA genes, which can in turn be used to assign classes to the identified groups of sequences (Chan et al. 2008).
Strand Bias
Prior to the use of these signatures, it should be noted that for oligonucleotide frequencies, the feature vector requires correction for biases between strands (Tyson et al. 2004). This is often remedied by counting the number of n-mers on the original sequence, as well as on the reverse complement, and then taking the average of the two prior to normalization.
Normalization Techniques for Nucleotide Frequency
Given the observed nucleotide frequencies for each sequence, it is often necessary to normalize each observation prior to further analysis. (Note:
it is still possible to simply take the frequencies of the observed number of n-mers in a sequence.)
Markov Normalization
The dinucleotide odds ratio is a special case of Markov normalization. In the general case, the maximal-order Markov normalization of an observed nucleotide frequency vector is given by
$$\rho^{k}_{i} = \frac{f_{1 \ldots k}\; f_{2 \ldots k-1}}{f_{1 \ldots k-1}\; f_{2 \ldots k}},$$
where the appropriate statistic for tetranucleotide frequency, for example, is given when k = 4. This normalization scheme essentially aims to filter out lower/higher nucleotide frequencies. Lower-order normalization schemes are possible, but they have poorer correlation properties with phylogenetic distances (Mrazek 2009).
Z-Score Normalization
Another approach to normalization uses the Z-score transform to assess the statistical significance of observed n-mers (Tyson et al. 2004). For tetranucleotide frequency, the Z-score normalization is computed as follows: the expected value for a given tetramer is calculated by
$$E(n_1 n_2 n_3 n_4) = \frac{N(n_1 n_2 n_3)\, N(n_2 n_3 n_4)}{N(n_2 n_3)},$$
and the variance is calculated using
$$\sigma^{2}(n_1 n_2 n_3 n_4) = E(n_1 n_2 n_3 n_4)\, \frac{[N(n_2 n_3) - N(n_1 n_2 n_3)]\,[N(n_2 n_3) - N(n_2 n_3 n_4)]}{N(n_2 n_3)^{2}},$$
which gives the required normalization for each tetramer:
$$Z(n_1 n_2 n_3 n_4) = \frac{N(n_1 n_2 n_3 n_4) - E(n_1 n_2 n_3 n_4)}{\sqrt{\sigma^{2}(n_1 n_2 n_3 n_4)}}.$$
The Oligonucleotide Frequency-Derived Error Gradient (OFDEG)
An extension to oligonucleotide frequency-based features is the oligonucleotide frequency-derived error gradient (OFDEG) (Saeed and Halgamuge 2009). OFDEG was observed by exploring the relationship of nucleotide frequency between short fragments and the original DNA sequence from which they are sampled. The observation is based on a resampling method, where instead of using the entire sequence to estimate the maximum likelihood point estimate of nucleotide frequency, the OFDEG measure resamples the nucleotide base composition of varying length subsequences to capture the distribution of oligomeric frequencies.
For example, it is straightforward to compute the un-normalized nucleotide frequency of an entire genome (referred to in this definition as an occurrence vector). Similarly, the occurrence vector for a short subsequence sampled from anywhere along the genome can be easily computed. Intuitively, the error between these two occurrence vectors (defined in terms of Euclidean distance) would be large. Nevertheless, the error is recorded and another subsequence, of increased length, is sampled again from anywhere along the genome. Trivially, the error between the occurrence vector of this new subsequence and the occurrence vector of the genome would be reduced. This process is continued until the length of subsequences is equivalent to the length of the genome, while keeping track of the error at each sampling instance. The resulting error as a function of subsequence length is found to be linear (up to a given subsequence length). The rate of error reduction (or gradient) of this linear trend, within the bounds of the linear region, is referred to as the OFDEG. It has been found that this linear gradient is different for various genomes and is remarkably consistent within genomes as well as between fragments of a genome. The measure essentially captures the relative magnitude of biases in nucleotide base composition in a manner similar to entropic measures and has been used in combination with other complementary features to successfully group related sequences in various simulated and real-world metagenomes.
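The resampling procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the published OFDEG implementation: the use of tetranucleotides, the number of resamples per length, the choice of lengths (which stands in for "the linear region"), and all function names are assumptions made for the example.

```python
import math
import random
from itertools import product

K = 4  # tetranucleotides (assumption; other n-mer lengths follow the same pattern)
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def occurrence_vector(seq):
    """Un-normalized count of every K-mer in seq; K-mers with other symbols are skipped."""
    counts = [0] * len(KMER_INDEX)
    for i in range(len(seq) - K + 1):
        j = KMER_INDEX.get(seq[i:i + K])
        if j is not None:
            counts[j] += 1
    return counts


def mean_error(seq, full, length, resamples, rng):
    """Average Euclidean error between the full-sequence vector and random subsequences."""
    total = 0.0
    for _ in range(resamples):
        start = rng.randrange(len(seq) - length + 1)
        total += math.dist(full, occurrence_vector(seq[start:start + length]))
    return total / resamples


def ofdeg(seq, lengths, resamples=20, seed=0):
    """Least-squares slope of mean error against subsequence length."""
    rng = random.Random(seed)
    full = occurrence_vector(seq)
    errors = [mean_error(seq, full, n, resamples, rng) for n in lengths]
    mean_x = sum(lengths) / len(lengths)
    mean_y = sum(errors) / len(errors)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, errors))
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    return sxy / sxx


if __name__ == "__main__":
    rng = random.Random(1)
    toy = "".join(rng.choice("ACGT") for _ in range(5000))  # synthetic stand-in sequence
    print(ofdeg(toy, lengths=range(500, 2501, 500)))
```

Because the error shrinks as subsequences grow, the fitted gradient is negative; in a binning setting it would be computed per contig and used alongside other features, as the text describes.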
Combining Features to Increase the Taxonomic Resolution of Binning
training set in relation to the metagenome under investigation.
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, © Springer Science+Business Media New York 2015
O 574 Open Resource Metagenomics
metagenomic genes will be those whose function could not have been predicted by sequence alone; these genes would be more likely to encode products with truly novel properties.
A major advantage of metagenomic libraries is that once they are made, they can be a permanent resource, a snapshot of the microbial community that the DNA was extracted from. The same library, if stored properly, can be screened multiple times, indefinitely. Below we outline several methodological considerations for maximizing benefit from open resource metagenomic libraries.
Although they have sometimes been used, small-insert libraries are not optimal for functional metagenomics. The smaller the insert, the less chance that individual clones will contain full operons, including the regions required for control of gene expression. As a result, the use of bacterial artificial chromosome (BAC; Kakirde et al. 2012) and cosmid/fosmid (Aakvik et al. 2009; Neufeld et al. 2011; Taupp et al. 2011) vectors enables the cloning of fragments that are large enough to include multiple operons. Such large-insert libraries require fewer clones to ensure that they are representative.
Depending on DNA yields, quality, and size, metagenomic libraries of environmental microbial communities may yield several million clones. If such libraries are distributed into 384-well plates, this would represent over 2,500 plates per million clones. Plate storage would require extensive freezer space, and screening such libraries, one clone at a time, would be prohibitively laborious and costly, even with the use of robotic manipulation. An alternative strategy we recommend is to recover and maintain the libraries as pools of clones. This procedure involves physical harvesting and mixing of all individual colonies from all initial library plates, followed by the preparation of aliquot suspensions for cryopreservation and subsequent distribution.
Another important consideration is that different host backgrounds will selectively express only a subset of an environmental metagenome. For example, Bacteroides genes use specialized promoters for their transcription (Mastropaolo et al. 2009). Host-specific limitations on gene expression include posttranscriptional controls, including translation initiation, codon usage, protein folding, enzyme activation, and transport. Also, wild-type and mutant strains that are most appropriate for a given screen might be available only in a host background that does not support replication of a given vector. This is especially true when using vectors that only replicate in Escherichia coli and other Gammaproteobacteria. For these reasons, it is advantageous to choose or design vectors that can be maintained in diverse host backgrounds.
Metagenomic libraries are often constructed for specific applications, such as to screen for a desired enzyme activity. Unlike the situation for single culture isolates, which must be deposited in accessible culture collections or otherwise made available as a requirement of publication of research results involving them, there is no such requirement or expectation for metagenomic libraries. This is unfortunate, as high-quality metagenomic libraries are technically challenging and costly to construct, and their full value is often not realized if their use is restricted to one or a few research groups.
Achieving Metagenomic Resource Sharing
We formally proposed that to ensure maximum value, metagenomic libraries should be made publicly available to members of the research community, without restriction (Neufeld et al. 2011). This is the concept of open resource metagenomics. We proposed that libraries be pooled to ensure ease of archiving as frozen stocks and for subsequent distribution and handling. We also recommended that cosmid libraries be used, because they allow the efficient cloning of large inserts of >30 kb. To facilitate screening in a diversity of host backgrounds, cosmid vectors with broad host range origins of replication are recommended, as well as Gateway recombinational systems for easy transfer of inserts to other vectors. An example of such a resource is
the Canadian MetaMicroBiome Library project (CM2BL; http://cm2bl.org), which houses a collection of Canadian soil metagenomic libraries in an IncP cosmid Gateway vector. The largest library in this collection contains over eight million clones. To assist users in deciding which libraries to choose for a given application, extensive metadata and taxonomic sequence information is accessible in an online database.
Summary
The open resource metagenomics initiative aims to increase the availability of metagenomic libraries to the research community as a public and scientific resource. The principle of free and open sharing of metagenomic libraries is central to this initiative, including direct access to associated metadata and DNA sequences. Increased gene discovery as a result of the use of these libraries not only has the potential to provide novel, biotechnologically useful genetic material but should increase the overall understanding of gene functions and their relationship to DNA sequence.
References
Aakvik T, Degnes KF, Dahlsrud R, Schmidt F, Dam R, Yu L, Völker U, Ellingsen TE, Valla S. A plasmid RK2-based broad-host-range cloning vector useful for transfer of metagenomic libraries to a variety of bacterial species. FEMS Microbiol Lett. 2009;296:149–58.
Ekkers DM, Cretoiu MS, Kielak AM, Van Elsas JD. The great screen anomaly – a new frontier in product discovery through functional metagenomics. Appl Microbiol Biotechnol. 2012;93:1005–20.
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Kakirde KS, Wild J, Godiska R, Mead DA, Wiggins AG, Goodman RM, Szybalski W, Liles MR. Gram negative shuttle BAC vector for heterologous expression of metagenomic libraries. Gene. 2012;475:57–62.
Mastropaolo MD, Thorson ML, Stevens AM. Comparison of Bacteroides thetaiotaomicron and Escherichia coli 16S rRNA gene expression signals. Microbiology. 2009;155:2683–93.
Neufeld JD, Engel K, Cheng J, Moreno-Hagelsieb G, Rose DR, Charles TC. Open resource metagenomics; a model for sharing metagenomic libraries. Stand Genomic Sci. 2011;5:203–10.
Rondon MR, August PR, Betterman AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, Macneil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM. Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol. 2000;66:2541–7.
Taupp M, Mewis K, Hallam SJ. The art and design of functional metagenomic screens. Curr Opin Biotechnol. 2011;22:465–72.
Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012;2:3.
Wexler M, Johnston AB. Wide host-range cloning for functional metagenomics. Methods Mol Biol. 2010;668:77–96.
P
Phylogenetics, Overview
Phylogenetics: A Root and Branch Analysis of the Tree of Life
Roy Sleator
Department of Biological Sciences, Cork Institute of Technology, Cork, Co. Cork, Ireland
Synonyms
Evolutionary relatedness
Definition
Phylogenetics, derived from the Greek terms phylon (meaning “tribe”) and genetikos (meaning “genitive” or origin), is the study of the evolutionary history of species, organisms, genes, or proteins through the construction and analysis of mathematical entities known as trees or phylogenies.
Introduction
Darwin’s The Origin of Species marked the birth of phylogeny, a discipline whose primary aims are to classify all living organisms, grouping all extant descendants of a given ancestor within specific groups or clades; to provide insights into the shared properties of members within each clade; and to allow retrodiction, i.e., the ability to infer ancestral properties based on observable characteristics of extant organisms.
A significant limitation of traditional morphology-based phylogeny approaches is the fact that reconstructing ancient evolutionary events requires a vast sum of character changes. Furthermore, many of these morphological characters are likely under selective pressure and subject to convergence (Sleator 2010). Based solely on this criterion, most organisms lack sufficient phenotypic characters to perform effective comparative analyses (Lopez and Bapteste 2009).
The development of modern DNA and protein sequence technologies has however effectively eliminated this limitation. Modern phylogenetic analysis involves the progressive alignment of nucleic acid and/or protein sequences between extant organisms. A hypothesis is then produced to explain the repartition of character states, and the results are presented as a phylogenetic tree, which is simply a graphic representation of the computed output.
The accelerating accumulation of molecular sequence data arising from recent concerted large-scale genomic and metagenomic sequencing projects (Sleator et al. 2008) continues to afford new opportunities and perspectives for dissecting evolutionary relationships. Indeed, while early molecular phylogenetic approaches centered on individual DNA sequences coding for RNA or proteins, or the derived amino acid sequences of the latter, more recent analysis of
Phylogenetics, Overview, Fig. 1 Tree-building methods. Schematic overview of the major analytical approaches to phylogenetic tree building
whole genomes has led to the development of phylogenomics – a powerful approach to analyze complete genome sequences as a metasequence (Forterre and Gadelle 2009).

Discussion

Tree-Building Methods: The Major Analytical Approaches
While several methods exist for inferring evolutionary relatedness, most can be classified as either distance- or character-based methods (outlined in Fig. 1). Distance (or algorithmic) methods employ an algorithm incorporating a model of evolution (e.g., amino acid substitution) to compute a distance matrix from which a phylogenetic tree is calculated by means of progressive clustering. Specifically, distances in the matrix relate to the number of differences between each pair of sequences (either DNA or protein). The model of evolution specifies how amino acid substitutions occurred in the protein sequences since they last shared a common ancestor. Finally, the tree is constructed from the numerical data in the matrix, with the most closely related sequences occupying a position on the tree which is distant from the less closely related sequences. Both the neighbor-joining (NJ) and the unweighted pair group method using arithmetic averages (UPGMA) approaches to tree building are distance-based methods. Although fast and readily available in user-friendly software packages such as MEGA (Tamura et al. 2011), distance-based methods have a number of significant limitations. NJ, for example, provides only a single tree as opposed to character-based methods which compute a consensus tree from several optimal or near optimal candidates. Furthermore, NJ may compute different trees depending on the order in which the constituent sequences are added. Finally, given that differences are presented as distance values, it is impossible to identify the specific character changes that support a branch (Soltis and Soltis 2003a).

Character-based methods (also referred to as tree-searching methods) search for the most probable tree for a specific sequence set based on characters at each position of the sequence alignment and a model of evolution. The most common character-based approaches include maximum parsimony (MP), maximum likelihood (ML), and to a lesser extent Bayesian methods.

MP seeks to find the tree or trees that are compatible with the minimum number of substitutions among sequences, i.e., the fewest evolutionary changes. An advantage of MP is that it provides diagnosable units (i.e., specific sets of characters) for each clade and branch lengths in terms of the number of changes on each branch of the tree. However, a significant limitation of the MP approach is that it requires strict assumptions of consistency across sites and among lineages. Thus, MP performance is significantly affected when mutational rates differ between conserved and hypervariable regions or if evolutionary rates are highly variable among evolutionary lineages. Finally, parsimony lacks an explicit model of evolution.
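The distance-based and parsimony ideas described above can be illustrated with a minimal Python sketch. It is not the implementation used by MEGA or any other package; the function names (p_distance, upgma, fitch, parsimony_score) and the four toy sequences are hypothetical and chosen only for illustration. The sketch assumes an ungapped alignment, uses the simple proportion of differing sites as the distance (no model of evolution), clusters with UPGMA, and then counts the minimum number of changes on the resulting tree with Fitch's small-parsimony algorithm.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs):
    """Cluster aligned sequences by UPGMA; return the tree as nested tuples."""
    size = {name: 1 for name in seqs}  # cluster -> number of leaves it holds
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    while len(size) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(size, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # Size-weighted mean, i.e. the average over all leaf pairs.
                dist[frozenset((merged, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

def fitch(tree, seqs, site):
    """Fitch pass for one column: (possible ancestral states, changes)."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes yet
        return {seqs[tree][site]}, 0
    left_states, left_changes = fitch(tree[0], seqs, site)
    right_states, right_changes = fitch(tree[1], seqs, site)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No shared state: one substitution is required at this node.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, seqs):
    """Minimum number of changes needed to explain the alignment on the tree."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, seqs, i)[1] for i in range(n_sites))

# Hypothetical toy alignment (four taxa, eight sites).
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACTTGCGT", "D": "TCTTGCGT"}
tree = upgma(seqs)
print(tree)                         # (('A', 'B'), ('C', 'D'))
print(parsimony_score(tree, seqs))  # 4
```

The distance matrix alone yields the grouping, whereas the parsimony score is obtained by returning to the characters themselves, which reflects the point made above that distance values do not reveal which character changes support a branch.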
ML methods are based on specific probabilistic models of evolution and search for the tree with maximum likelihood under these models. The model of evolution may be empirical, derived from general assumptions about the evolution of sequences, or parametric, based on values estimated from the dataset. The major advantage of likelihood approaches is that they are based on powerful statistical theory which facilitates the application of robust statistical hypothesis testing and significant refinements to the resulting phylogenetic trees. However, while these strong statistical foundations make ML techniques arguably the most powerful approach in terms of phylogenetic reconstruction, paradoxically this strength is also a significant weakness, in that ML approaches are computationally intensive and, as a result, significantly slower than alternative approaches. As such, ML analysis can only be practically applied to a limited number of sequences (Soltis and Soltis 2003a).

In practice, both distance- and character-based methods tend to be used in tandem. An initial tree may be estimated by a distance-based method and used to test the parameters of the model of evolution. The most appropriate of these might then be used in a maximum likelihood tree search.

Testing the Reliability of a Tree
There are two approaches to finding the best tree: those that use optimality criteria that can be evaluated for any given tree (used for MP and ML) and those that involve the progressive clustering of sequence subsets (used for NJ and UPGMA). In the optimality methods trees are evaluated one by one by either exhaustive, branch and bound, or heuristic searches. Exhaustive searches evaluate all possible bifurcating trees to find a globally optimal topology; such an approach is only feasible for a relatively small number of taxa (<10). Rather than evaluating every possible tree, the branch and bound approach first chooses a local optimum value for tree length representing the total number of evolutionary changes on the tree; any tree length greater than the local optimum is automatically discarded, thus saving time and computational expense. Branch and bound searches are effective up to ~20 taxa. Heuristic (or “best guess”) searches employ a “hill climbing” approach; an initial tree is chosen and subsequently modified; changes leading to an inferior tree descend the hill and the tree is rejected; changes leading to an improvement ascend the hill – when no further improvement is possible, the search is terminated. Although this is an extremely fast approach, there is no guarantee that the returned tree is the global optimum (the summit) rather than merely a local optimum (a foothill’s plateau).

Once an optimum tree is chosen, some statistical measure of internal support for clades must also be provided to show that the tree is sufficiently robust and biologically meaningful. To this end a variety of methods have been proposed to verify the evolutionary reliability of trees, of which the most commonly used is bootstrap analysis. Bootstrapping can be divided into parametric and nonparametric approaches (Wrobel 2008). Nonparametric bootstrapping is a numerical resampling approach in which replicate alignments, referred to as bootstrap or pseudo-alignments, are formed from the dataset by random sampling with replacement. This process is repeated many times (depending on the size of the dataset and the specifications of the analysis), usually with a default setting of 1,000 replicates. Bootstrap values are conservative measures of phylogenetic accuracy with values of 70 % or more representing “true” clades in experimental phylogenies. Parametric bootstrapping on the other hand creates replicate samples using numerical simulation as opposed to resampling. This approach is usually applied to test competing hypotheses.

Although generally effective, the bootstrap approach rests on a number of assumptions which are not optimal when applied to molecular sequence analysis (for an overview see Box 1). In addition to bootstrapping another measure of internal support which is often used in phylogenetic analyses is jackknifing. Although similar to bootstrapping, jackknifing involves one significant difference; rather than resampling the data, this approach uses only subsets of the available data (i.e., resampling without replacement to
create a smaller dataset), the purpose of which is to account for the presence of possible “outlier” characters which might have a disproportionate influence on the resulting tree.

Other less common approaches to measuring internal support include the decay index for parsimony analyses (Hernandez Fernandez and Vrba 2005) and the posterior probabilities generated in Bayesian inference (Wrobel 2008).

Box 1. Limitations of Bootstrap Analysis when Applied to Molecular Sequences
• The statistical basis of bootstrap analysis requires that all positions of an alignment are independently and identically distributed. However, this assumption fails to hold true for either nucleotide or amino acid sequences. For example, in proteins certain di-residues (in the primary structure) are either over- or underrepresented (Karlin et al. 1991), while strong correlations are observed between positions that interact within the 3D structure (Karlin et al. 1994).
• Bootstrap analysis is hampered by unequal evolutionary rates. If mutational rates are too high or uneven among lineages, the bootstrap proportion P is usually an overestimate (Soltis and Soltis 2003b).
• Molecular sequences are not representative of a homologous population, and as such resulting bootstrap values may not signify reliable clusters (Brocchieri 2001).

Difficulties Associated with Creating Reliable Phylogenetic Trees
Phylogenetic inferences are only as good as the alignments they are drawn from – “Garbage in; garbage out.” The majority of current alignment protocols are based on dynamic programming (DP) procedures which seek to identify the maximal alignment score, a value determined by the choice of scoring matrix (e.g., PAM or BLOSUM) and the assignment of gap penalties. Rather than searching for the optimal alignment of n sequences in an n-dimensional space, most DP methods employ fast heuristic or “greedy” approaches, progressively aligning pairs of sequences. However, while effective, this approach has a number of shortcomings in terms of phylogenetic analysis (for an overview of these shortcomings, see Box 2).

An alternative approach involves the application of motif finding algorithms which select common sequence motifs and align only these most conserved domains with no allowance for gaps or insertions (Lawrence et al. 1993).

In addition to alignment difficulties, two of the most significant problems associated with assessing tree reliability are long-branch attraction (LBA) associated with mutational saturation and lateral gene transfer (LGT) mediated, at least in part, by viruses and mobile genetic elements (Sapp 2007). As mutations accumulate during evolution, a point of mutational saturation is reached at which there is no further divergence between taxa (Brocchieri 2001). From this point on it becomes impossible to estimate evolutionary distance; furthermore very divergent sequences tend to be attracted together (Fig. 2) – hence the name – thus skewing their true position (Lopez and Bapteste 2009).

Box 2. Sequence Alignment Shortcomings
• Heuristic methods, although fast, only provide a best guess or estimate of the optimal alignment.
• Alignments are sensitive to the choice of similarity matrix (for amino acid sequence alignments) and gap penalty which are user adjustable – thus requiring human intervention.
• Hierarchically aligning pairs of sequences is prone to generate biases and dominance by the most similar sequences.

What Next...?
Phylogenomics – the merging of phylogenetics and genomics – is perhaps the most exciting recent development in the field of evolutionary mapping (Delsuc et al. 2005). Rather than concentrating on a single phylogenetic marker, whole-genome phylogenomic approaches involve comparisons of gene content: the
Phylogenetics, Overview, Fig. 2 Long-branch attraction. A simulated example of long-branch attraction. (i) The real tree of the relationships among five taxa, with two taxa (B and C) having long evolutionary branches. (ii) An inferred tree of the taxa in which B and C are artificially grouped together because of the phenomenon of long-branch attraction
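The resampling schemes discussed under "Testing the Reliability of a Tree" (nonparametric bootstrapping and jackknifing) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the procedure of any particular package: the function names and the toy alignment are hypothetical, and closest_pair is only a stand-in for a real tree-building step (it reports the two most similar taxa instead of a full tree). In practice each replicate would be passed to a distance, MP, or ML method and the clades of the resulting trees would be tallied.

```python
import random
from itertools import combinations

def bootstrap_alignment(seqs, rng):
    """Pseudo-alignment: sample columns with replacement, same length as input."""
    n = len(next(iter(seqs.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}

def jackknife_alignment(seqs, rng, keep=0.5):
    """Smaller dataset: sample a fraction of columns without replacement."""
    n = len(next(iter(seqs.values())))
    cols = sorted(rng.sample(range(n), int(n * keep)))
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}

def closest_pair(seqs):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    def differences(pair):
        return sum(x != y for x, y in zip(seqs[pair[0]], seqs[pair[1]]))
    return frozenset(min(combinations(seqs, 2), key=differences))

def clade_support(seqs, clade, build=closest_pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `build` recovers `clade`."""
    rng = random.Random(seed)
    hits = sum(build(bootstrap_alignment(seqs, rng)) == clade
               for _ in range(replicates))
    return hits / replicates

# Hypothetical toy alignment: A and B are identical, every other pair
# differs at every site, so the grouping {A, B} is recovered in all replicates.
toy = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "CATGCATG", "D": "GTCAGTCA"}
print(clade_support(toy, frozenset({"A", "B"}), replicates=100))  # 1.0
```

With real data the support value falls between 0 and 1 and is usually reported as a percentage, which is the figure compared against the 70 % guideline mentioned above. Swapping bootstrap_alignment for jackknife_alignment inside clade_support gives the corresponding jackknife estimate.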
taxa to be known for every analyzed fragment. A fraction of a species’ genome sequence, typically 100 kb or more, suffices as reference data. Reference data can be obtained by identifying contigs with conserved marker genes such as 16S rRNA from the sample itself or by additional sequencing of large insert libraries containing marker genes (Warnecke et al. 2007; Pope et al. 2010). Oligomer-based assignment therefore is advantageous for taxonomic assignment of metagenomes from microbial communities with few available sequenced genomes of their members or of related species. Oligomer-based taxonomic assignment is faster than alignment-based methods, as no sequence similarity searches in a large collection of reference sequences are required. This makes it well suited for the analysis of large next-generation sequence samples. For short fragments of less than 1 kb or for assignment over long taxonomic distances, homology-based methods tend to be more accurate (Patil et al. 2011). With PhyloPythia, the relative frequencies of 4–6 mer oligomer patterns with up to two wildcard characters in a sequence fragment are used as features to train ensembles of multi-class support vector machine classifiers with a Gaussian kernel for individual taxonomic ranks. These are subsequently combined for the assignment of variable length sequence fragments. PhyloPythiaS uses an ensemble of structured support vector machines with a linear kernel trained with the relative frequencies of 4–6 mer oligomers in sequence fragments. The structured output formulation allows a classifier to be learned simultaneously for the entire taxonomy, taking into account commonalities of clades with partially shared evolutionary histories.

Summary

Metagenomics uses random shotgun sequencing to recover genome sequence information from microbial communities without the need for cultivation of their member species. It thus gives access to the vast portion of the microbial world that cannot be cultured with standard techniques. Bioinformatics methods are subsequently applied to process the data. Assembly software is used to generate genomic sequence fragments, which could in principle originate from any member species of the microbial community. In taxonomic assignment or “binning,” the fragments are assigned to individual species or higher-ranking clades from which they originate. PhyloPythia and its successor PhyloPythiaS are oligomer signature-based classifiers for the taxonomic assignment of metagenome sequence fragments. Oligomer signature-based taxonomic assignment is faster than alignment-based methods, as no sequence similarity searches in a large collection of reference sequences are required. Oligomer signature-based assignment is well suited for the taxonomic assignment of metagenomes from microbial communities with few available sequenced genomes of their members or of related species. For microbial community members with draft genomes reconstructed by taxonomic binning, a functional analysis based on gene content and reconstruction of metabolic potential can be performed.

Cross-References

▶ A 123 of Metagenomics
▶ Genome Portal, Joint Genome Institute
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis

References

Droge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinforma. 2012;13(6):646–55.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010;2010(1):pdb prot5368.
Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, Schroth G, et al. Metagenomic discovery of
biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;3(2):REVIEWS0003.
Illumina. 2012. Available from: http://www.illumina.com/Documents/systems/hiseq/datasheet_hiseq_systems.pdf.
Iverson V, Morris RM, Frazar CD, Berthiaume CT, Morales RL, Armbrust EV. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Grechkin Y, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
McHardy AC, Garcia-Martin H, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4(1):63–72.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Patil KR, Haider P, Pope PB, Turnbaugh PJ, Morrison M, Scheffer T, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8(3):191–2.
Pope PB, Denman SE, Jones M, Tringe SG, Barry K, Malfatti SA, et al. Adaptation to herbivory by the tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proc Natl Acad Sci U S A. 2010;107(33):14793–8.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9(8):811–4.
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, Green JL, et al. PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput Biol. 2011;7(1):e1001061.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51.
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9(10):R151.
Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics. 2012;28(7):1033–4.

Plasmid Capture from Metagenomes

Brian V. Jones
Center for Biomedical and Health Science Research, University of Brighton, School of Pharmacy and Biomolecular Sciences, Brighton, East Sussex, UK

Definitions

Metagenome: The collective genomes of all members of a bacterial community.
Mobile metagenome: The total pool of mobile genetic elements associated with a bacterial community.
Mobile genetic element (MGE): A discrete genetic unit capable of mediating its own transfer between distinct DNA molecules and/or between distinct host cells of the same or different species. Plasmids, transposons, insertion sequences, conjugative transposons, integrons, and bacteriophage are all examples of MGE.
Plasmid: Closed circular DNA molecule that replicates within host cells as an autonomous extrachromosomal element.
Plasmidome: Plasmid fraction of the mobile metagenome. May be defined as the total pool of plasmids associated with a microbial community and a component of the mobile metagenome as a whole.
Horizontal gene transfer: Transfer and acquisition of genetic material between distinct cells or species, outside of and in addition to the normal process of inheritance (vertical gene transfer).
plasmids from the non-cultivatable fraction of microbial communities, which account for the vast majority of bacterial species in these ecosystems. The development of such tools has been, and will continue to be, a major challenge with current approaches each exhibiting distinct strengths and weaknesses when applied to community-level analysis of plasmids (Summarized in Table 1).

A particular issue faced by all approaches to survey microbial plasmidomes, as well as other facets of a given mobile metagenome, is the difficulty in evaluating the ability of any method to provide universal access to the plasmidome and identify any bias in the plasmids that may be identified and recovered (Ogilvie et al. 2012). Unlike analysis of the core chromosomal content of a microbiome, where detailed surveys of
Plasmid Capture from Metagenomes, Table 1 Relative merits of approaches available for analysis of microbial plasmidomes and plasmid capture from metagenomes (Modified from Ogilvie et al. 2012)

Plasmid isolation strategy: Endogenous isolation
Advantages:
• Original bacterial host is known
• May be used for all cultivatable bacteria
• Applicable to all plasmid types
Disadvantages:
• Requires host cultivation restricting utility for study of natural communities
• Reliance on plasmid encoded traits if surrogate host species required for plasmid characterization
Reference: Reviewed in Smalla and Sobecky (2002); Heuer and Smalla (2012); Ogilvie et al. (2012)

Plasmid isolation strategy: Exogenous isolation
Advantages:
• Culture independent
• Selective isolation of self-transmissible or mobilizable elements
• Potentially capable of isolating all plasmid types (circular and linear) and sizes
• Can isolate plasmids irrespective of abundance in community
Disadvantages:
• Relies on plasmid encoded traits for plasmid transfer, selection, and maintenance in surrogate host
• Original bacterial host unknown
• Range of plasmids isolated dependent on mating conditions used and dictated by numerous “unknown” environmental variables influencing host cell physiology and plasmid transfer kinetics
Reference: Bale et al. (1988)

Plasmid isolation strategy: PCR-based detection
Advantages:
• Culture independent
• High throughput
• Sensitive
• Scope for accurate quantitation of plasmids
Disadvantages:
• Original bacterial host unknown
• Complete characterization of plasmid detected generally impossible
• Limited to detection of known and characterized plasmid lineages used for primer design
Reference: Götz et al. (1996)

Plasmid isolation strategy: TRACA
Advantages:
• Culture independent
• Suitable for development of high-throughput strategies
• Can isolate plasmids irrespective of abundance in a community
• Fully independent of plasmid encoded traits
• Sequence-based characterization of plasmids facilitated by known Tn sequence in plasmids
• Potentially applicable to all circular plasmids and bacterial communities
• May permit capture of MGE other than plasmids when present as circular DNA molecules
Disadvantages:
• Original bacterial host unknown
• Transposon may inactivate genes of interest, impeding phenotypic characterization
• Currently available Tn elements and surrogate host may limit range of plasmids isolated
• Linear plasmids not captured
• Transformation step may introduce size bias
• Plasmids belonging to same incompatibility group as Tn origin may not be captured due to stability issues in surrogate host
• Potential for bias towards numerically dominant plasmids
Reference: Jones and Marchesi (2007b); Jones et al. (2010); Warburton et al. (2011); Zhang et al. (2011)

Plasmid isolation strategy: Standard metagenomic libraries (BAC/Fosmid)
Advantages:
• Culture-independent
• Suitable for development of high-throughput strategies
• Initial capture independent of plasmid encoded traits
• Sequence-based characterization facilitated
Disadvantages:
• Original bacterial host unknown
• Likely bias towards numerically dominant plasmids
• Screening relies on plasmid encoded traits expressed in surrogate host species
• Not specifically designed for plasmid capture, and non-plasmid sequences dominate libraries
• Generally only incomplete, partial plasmids identified
• General compatibility of library construction methods with plasmid capture unknown
• Plasmids belonging to same incompatibility group as vector (BAC/Fosmid) may not be represented due to instability of clones in surrogate host
• Plasmids belonging to same incompatibility group as vector may not be captured due to stability issues in surrogate host
Reference: Kazimierczak et al. (2009)

Plasmid isolation strategy: Shotgun sequencing of plasmidomes
Advantages:
• Culture-independent
• Suitable for development of high-throughput strategies
• Independent of plasmid encoded traits
• Potential for complete access to circular elements within a bacterial plasmidome
Disadvantages:
• Original bacterial host unknown
• Removal of contaminating chromosomal DNA potentially problematic
• Not suitable for survey of linear plasmids with present strategies for removal of chromosomal DNA
• Accurate assembly of complete plasmids will likely require a more comprehensive set of reference plasmid genomes than presently available
• Pre-sequence processing of plasmid DNA (removal of chromosomal fragments and plasmid DNA amplification) likely to introduce bias into final dataset. Requires subsequent quantitative analysis to confirm relative abundance of particular plasmids
Reference: Zhang et al. (2011); Kav et al. (2012)
population structure can be first undertaken using conserved housekeeping genes present in all bacterial chromosomes (such as genes encoding 16S rRNA), no such global survey is possible for plasmids (Ogilvie et al. 2012). As such, surveys of microbial plasmidomes are impeded by a fundamental lack of knowledge regarding the composition of these malleable gene pools, making the development and validation of methods which access a representative cross section of the plasmidome virtually impossible at present. Nevertheless, available strategies still offer the potential to provide much insight into microbial plasmidomes and have been applied to study a range of microbial ecosystems yielding important fundamental insights into the composition and functional content of associated plasmid pools.

Endogenous isolation: The simplest and most widely used approach to study plasmids is the direct isolation of plasmid DNA from host bacteria. This approach classically involves the cultivation of host species, usually with selection for particular traits of interest believed to be plasmid encoded (reviewed in Smalla and Sobecky 2002; Jones and Marchesi 2007a; Ogilvie et al. 2012).
Extracted plasmid DNA is subsequently transferred into a new host, ideally of the same species, with E. coli K12-type strains most commonly deployed. Plasmids are then typically characterized based on the phenotypes they confer upon host species but are increasingly examined at the nucleotide level, and plasmids sequenced as part of whole genome sequencing projects may also be considered as examples of endogenous isolation.

Aside from its simplicity and general applicability to all plasmid types (including linear plasmids), the major benefit of this approach is the identification of the natural host species of a particular plasmid. Conversely, the reliance on cultivation of host bacteria, as well as the reliance on plasmid encoded traits and their expression in surrogate host species (selectable markers and plasmid replication machinery), severely restricts the utility of this approach for access to the plasmidome. However, the general strategy of direct isolation of plasmid DNA from host cells can also be applied to the total community without prior cultivation, and when combined with high-throughput sequencing or other culture-independent approaches, this direct extraction method forms the basis for many “metagenomic” strategies for plasmidome analysis (discussed in detail below).

Exogenous isolation: Exogenous isolation approaches were the first to address some of the limitations inherent in endogenous approaches for community level analysis of plasmidomes (Bale et al. 1988; Hill et al. 1992). Exogenous methods rely on the natural ability of plasmids to initiate or participate in cell-cell transfer between distinct host species. This strategy accesses plasmids using a selectable surrogate host species (most typically E. coli) in biparental or tri-parental matings with the donor population, during which plasmids may be transferred from donor cells in the community to the selectable recipient (Fig. 1; Bale et al. 1988; Hill et al. 1992; reviewed in Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Essentially this system utilizes the surrogate host as a “fishing net,” to pick up plasmids circulating within the donor community under study, and plasmid carrying recipient cells are subsequently identified by cultivation on media selectable for the recipient organism (often rifampicin resistance), as well as plasmid encoded traits. Biparental matings, involving only the donor community and selectable recipient, can be used to retrieve self-transmissible plasmids capable of initiating autonomous conjugal transfer processes (Fig. 1). Alternatively donor cells carrying a “helper” plasmid may also be introduced along with the selectable recipient, in a tri-parental mating approach (Hill et al. 1992). In this case, the “helper” plasmid sets up plasmid conjugation apparatus, which can subsequently be exploited by plasmids that may be mobilized between cells, but are not capable of independent transfer (Fig. 1). In particular, the retrieval of self-transmissible elements may be seen as a strength of the exogenous isolation approach, since these elements are likely to be the most informative and important in understanding MGE-mediated prokaryotic gene flow both within and between microbiomes.

Although this method offers a number of significant advantages over endogenous approaches, the capture of plasmids is still reliant on plasmid encoded traits, including the presence of selectable markers, as well as the ability of plasmids to successfully replicate in the surrogate host species used (Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Plasmids lacking the traits selected for, or unable to replicate successfully in surrogate hosts, will not be captured using these approaches. In addition, the cell-cell transfer of plasmids is influenced by numerous environmental variables, as well as the physiological status of donor and recipient cells, with metabolically inactive community members unlikely to participate in conjugal transfer processes. These factors also impact on plasmid transfer rates, the types of plasmid that can be acquired and the portion of the plasmidome that may be accessed (Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Collectively, these factors restrict the range of plasmids that may be captured and limit the utility of this approach for studying microbial plasmidomes.
Plasmid Capture from Metagenomes, Fig. 1 Overview of exogenous isolation approaches for the acquisition of plasmids from microbial communities. Arrows indicate plasmid transfer between donor (mixed microbial community), recipient, and “helper” populations. Purple arrows indicate plasmid transfer in biparental matings in which selectable recipient cells are used to acquire self-transmissible plasmids directly from the donor population. Green arrows indicate transfer events in tri-parental matings, in which cells harboring a self-transmissible “helper” plasmid are utilized to initiate conjugal transfer events with the donor population and the selectable recipient, in order to acquire mobilizable but non-self-transmissible plasmids
Direct plasmid detection by PCR: A range of PCR primers have been developed in order to distinguish between plasmids of different families based on backbone sequences, but these have also been employed as surveying tools to identify the presence of particular plasmid types in total community DNA extracts (Götz et al. 1996; Smalla et al. 2000b). While this approach is potentially useful in gaining an overview of the types of plasmids comprising a particular plasmidome and their relative abundance (if utilized with a quantitative PCR strategy), its usefulness is currently limited by the relatively small number of plasmid genomes available from which discriminatory primer sets may be established. This limits the range of plasmids encompassed in such surveys to those families already isolated and characterized. A further disadvantage is that, along with a lack of data on host range, no information on the functional content of plasmids is offered by this method, and there is little or no scope to characterize detected plasmids in greater detail. As such, this approach does not at present constitute a viable strategy for in-depth and comprehensive analysis of entire plasmidomes, but may be used to augment other strategies and provide further information on isolated plasmids. Despite the present limitations, the usefulness of this approach is likely to grow as more sequence information and associated data are generated, and greater numbers of habitat-associated reference data sets become available in the future.
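The dependence of PCR-based surveys on available reference genomes can be illustrated with a small in silico screen. The following is only a minimal sketch under stated assumptions: the sequences, plasmid names, and primer pair are invented for illustration, matching is exact (no mismatches or degenerate bases), and each reference is treated as a linear molecule, so an amplicon spanning the arbitrary start of a circular plasmid sequence would be missed. It is not a method taken from the studies cited above.

```python
# Hypothetical sketch: which reference plasmids would a primer pair detect?
# All sequences and names below are toy data, not real plasmids or primers.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_amplicon(template, forward, reverse, max_len=2000):
    """Return the predicted amplicon on this strand, or None.

    The forward primer must match the template exactly, and the reverse
    primer must match the opposite strand downstream of it.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    amplicon = template[start:end + len(reverse_site)]
    return amplicon if len(amplicon) <= max_len else None


def detected_families(references, primer_sets):
    """Map each primer set to the reference ids it is predicted to amplify."""
    hits = {}
    for family, (fwd, rev) in primer_sets.items():
        hits[family] = [
            ref_id
            for ref_id, seq in references.items()
            # Check both strands of the reference sequence.
            if find_amplicon(seq, fwd, rev)
            or find_amplicon(reverse_complement(seq), fwd, rev)
        ]
    return hits


# Toy reference set: only pToyA carries both primer binding sites.
references = {
    "pToyA": "TTGACCGTAAGGCT" + "A" * 20 + "CCGATTGCAGTCAA",
    "pToyB": "GGGGCCCCAAAATTTT" * 3,
}
primer_sets = {"toy_family": ("GACCGTAAGG", "CTGCAATCGG")}

print(detected_families(references, primer_sets))
```

A plasmid family with no sequenced representative cannot appear in `references`, so no primer set can be designed or validated for it, which is the limitation described in the text.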
Plasmid Capture from Metagenomes, Fig. 2 Overview of culture-independent metagenomic approaches for microbial plasmidome analysis. Acquisition of plasmid DNA: Plasmid DNA may be harvested and processed in a number of ways before use in strategies to capture plasmids or access the plasmidome. Plasmid DNA may be acquired from either total metagenomic DNA extracts of the microbial community or specific plasmid extraction methods. Recovered pools of plasmids may subsequently be processed to remove contaminating chromosomal sequences and, for certain plasmidomes, to amplify the recovered plasmid DNA if necessary. Plasmid capture and plasmidome access: Recovered plasmidome extracts may then be used in conjunction with one or more culture-independent approaches for plasmid capture or general access to the plasmidome. Available culture-independent approaches include the generation of standard metagenomic libraries, the use of the TRACA plasmid capture approach, or direct shotgun sequencing of amplified plasmids.
Transposon-aided capture (TRACA): The culture-independent transposon-aided capture system (TRACA) has been specifically designed for the acquisition of plasmids from whole communities and to overcome some of the main limitations of endogenous and exogenous approaches (Jones and Marchesi 2007b). The basic premise of this system is to retrofit all plasmids with a suitable selectable marker and an origin of replication compatible with the surrogate host biomachinery, using an in vitro transposon (Tn) system encoding this information (Fig. 2).
Following Tn integration, plasmids are transformed into a surrogate bacterial host, and cells carrying plasmids are selected on the basis of antibiotic resistance genes harbored by the inserted Tn. In this way, plasmids may be acquired independently of the traits they encode, and their replication in the surrogate host is facilitated (Jones and Marchesi 2007b). This provides access to plasmids in a bacterial community
regardless of functions encoded and has been successfully applied to study plasmids in a number of environments, including the human gut, the oral cavity, and activated sludge (Jones and Marchesi 2007b; Warburton et al. 2011; Zhang et al. 2011).
Although the TRACA system offers major advantages over other approaches, this method does not circumvent all issues and may be subject to a unique limitation in regard to the size of plasmids that can be captured when using this approach (reviewed in Jones and Marchesi 2007a; Ogilvie et al. 2012). Plasmids isolated by this system to date have all been in the smaller size range (~14 kb and smaller), indicating that the TRACA system may be biased towards the capture of small plasmids or even unable to acquire larger plasmids altogether. The reasons behind this potential size restriction are presently unclear, although the transformation step in which Tn-tagged plasmids are introduced into surrogate host cells is known to work more efficiently with smaller DNA molecules, and there is also potential for a size bias to be introduced during the purification of plasmid DNA (Jones and Marchesi 2007b).
It is also possible that the size range of plasmids captured by this system will be a function of the plasmidome composition and the predominance of smaller plasmids in the ecosystems that have been explored with this method to date (Ogilvie et al. 2012). Although there are presently no definitive data available on the average plasmid size in any given microbial ecosystem, initial evidence suggests that physical features of plasmids, such as size, are responsive to pervading environmental and ecological conditions in the same way as host chromosomes (Slater et al. 2008). Overall, it is most probable that both the composition of the plasmidome and inherent attributes of the TRACA system dictate the profile of plasmids captured by this approach. Regardless of these potential limitations, the TRACA method provides an additional and useful tool for the exploration of bacterial plasmidomes, overcoming some of the major disadvantages of other methods. There is also much scope to improve the existing TRACA approach and expand the range of plasmids that may be acquired with this system.
Retrieval of plasmids from standard metagenomic libraries: Access to plasmid sequences contained in standard metagenomic libraries derived from total community DNA has also been described (Fig. 2; Kazimierczak et al. 2009). In particular, the isolation of plasmids or plasmid fragments from such libraries of the organic pig gut microbiome has been demonstrated and included those with the ability to replicate autonomously when liberated from the library vector and reconstructed by self-ligation (Kazimierczak et al. 2009). Despite the novelty of this approach, this strategy suffers from the same drawbacks as endogenous and exogenous methods in its reliance on plasmid-encoded traits for initial plasmid identification and subsequent demonstration of autonomous replication in surrogate host species (Kazimierczak et al. 2009). Furthermore, this approach is not at present designed to specifically retrieve plasmids, but rather total community DNA, which is dominated by chromosomal sequences. As such, this approach is not presently suitable for the specific analysis of microbial plasmidomes, and in the original study by Kazimierczak et al. (2009), libraries were analyzed for clones encoding antibiotic resistance genes, rather than plasmid sequences per se. However, there is clearly scope to utilize this method to further explore existing metagenomic data sets and enhance the interpretation of these valuable resources by illuminating mobile genetic elements captured in these repositories.
Shotgun sequencing of plasmidomes: More recently the first true applications of the metagenomic approach to study plasmidomes have been described (Fig. 2; Zhang et al. 2011; Kav et al. 2012). In these studies, plasmid DNA was extracted from the target community without any prior enrichment or cultivation and subjected to high-throughput sequencing, and fragments of plasmid genomes were subsequently assembled from the resulting reads (Zhang et al. 2011; Kav et al. 2012). This permitted a global survey of plasmid-encoded functions present in the bovine plasmidome (Kav et al. 2012), as well as an
activated sludge microbial community (Zhang et al. 2011), demonstrating proof of principle for the shotgun sequencing approach to plasmidome analysis.
Although this approach should in theory be able to offer total and unbiased access to the entire plasmidome of a given microbial community, in practice limitations and potential biases remain. For example, in the study by Kav et al. (2012), sufficient plasmid DNA for sequencing was only obtained after amplification of the recovered plasmid DNA by rolling circle amplification. As such, there is potential for some plasmids to be preferentially amplified over others, introducing bias into the resulting data set. In addition, the complete removal of contaminating chromosomal sequences is also challenging, and despite the availability of “plasmid safe” DNases which do not act on circular molecules, total elimination of chromosomal DNA from plasmid extracts appears to constitute a bottleneck in this strategy (Zhang et al. 2011; Kav et al. 2012), with linear plasmids also likely to be removed during this process. As such, there is further potential to alter the composition of the plasmid pool obtained during this stage of plasmid DNA preparation.
There is also potential for errors in assembly due to the mosaic nature of these elements, a situation that may be exacerbated by the presence of any contaminating chromosomal sequences. In this regard, the availability of reference plasmid genomes captured by methods which acquire whole, intact plasmids (such as exogenous isolation and TRACA) will constitute a highly valuable resource that will significantly enhance the power and accuracy of the shotgun plasmidome approach (Fig. 2), and some researchers have already begun to combine these strategies (Zhang et al. 2011). Finally, extensive sequencing will likely be required for most plasmidomes, in order to move beyond representation of numerically dominant plasmids (particularly for assembly of complete replicons) and provide the depth of coverage required to access the full diversity of a given plasmidome.
Despite these potential issues, it is clear that the shotgun sequencing approach to plasmidome analysis constitutes a major advance in accessing plasmids resident in microbial communities, in terms of both depth of coverage and the cross section of plasmids that may be covered. Further development of such approaches, in parallel with the development of more detailed and extensive reference data sets from plasmids captured through TRACA or exogenous approaches, for the first time places the comprehensive analysis of a microbial plasmidome within reach.
Retrieval of Host Range Data Following Plasmid Capture from Metagenomes
A major drawback of all culture-independent community-level approaches for investigation of microbial plasmidomes, and capture of plasmids from metagenomic data sets, is the loss of host range data inherent in these strategies (Table 1). All such strategies effectively divorce acquired plasmids or plasmid sequences from any phylogenetic affiliation, undermining a primary motivation for undertaking many such surveys: a fundamental understanding of gene flow in these communities. Despite this, several approaches may be used to supplement the initial culture-independent plasmid capture strategy and provide some indication of plasmid phylogenetic affiliation and long-term host range.
Plasmids captured through culture-independent approaches may subsequently be utilized to develop fluorescent probes suitable for use in fluorescence-activated cell sorting (FACS) applications (reviewed in Ogilvie et al. 2012). The development and use of such probes in FACS systems permits intact cells harboring target genes or sequences to be separated from the rest of the microbial community and subsequently identified through culture-independent molecular genetic approaches, such as 16S rDNA sequence analysis (Zwirglmaier et al. 2004). This strategy, termed Ring-FISH (recognition of individual genes by fluorescence in situ hybridization), has previously been implemented and demonstrated as a feasible approach for the recovery of cells encoding genes of interest, including those encoded by plasmids.
Alternatively, a range of in silico approaches have been applied to plasmid host affiliation (reviewed in Ogilvie et al. 2012). Plasmid sequences may be compared directly to curated sequence databases where phylogenetic information on plasmid genomes and other genes is available. The homology of plasmid sequences to database entries may then be used to infer phylogeny of captured plasmids (Jones and Marchesi 2007b; Jones et al. 2010; Kav et al. 2012; Zhang et al. 2011). However, the mosaic nature of plasmids and the potential for a single element to be composed of genetic material with highly diverse origins, coupled with inherent biases in public databases due to the paucity of available plasmid genomes, undermine the accuracy of this approach, particularly when applied to fragmentary data sets such as metagenomic libraries and shotgun plasmidomes.
Alternatively, strategies based on correlation of nucleotide usage patterns in plasmids with bacterial chromosomes have also been described (Campbell et al. 1999; Suzuki et al. 2010). These are based on the premise that over time, plasmids and other MGE that are long-term residents of a given host species adapt to their host at the nucleotide level and acquire a corresponding “genomic signature” in terms of nucleotide usage profiles (Campbell et al. 1999; Suzuki et al. 2010). As this underlying genomic signature has been shown to permit discrimination between chromosomal sequences of different bacterial species, there is also scope to employ plasmid nucleotide usage patterns to retrieve host range information. Dinucleotide and trinucleotide usage patterns, based on the abundance of all possible two-nucleotide or three-nucleotide combinations in a given DNA sequence, have been used in this way and shown to provide insight into plasmid host range, at least in terms of potential long-term bacterial host species to which plasmids are well adapted (Campbell et al. 1999; Suzuki et al. 2010). There is much scope to incorporate such analyses into culture-independent surveys of bacterial plasmidomes, as downstream processing steps that may provide some of the phylogenetic inference lacking in metagenomic approaches.
Summary
There is now much evidence to support the concept of distinct, community-associated plasmidomes and wider mobile metagenomes (reviewed in Jones 2010; Ogilvie et al. 2012). However, the mobile and promiscuous nature of many MGE (including many plasmids) makes this a much less clearly defined genetic reservoir, and membership of a particular mobile metagenome will be far less exclusive than for the core chromosomal complement of the associated microbiome (Jones 2010). A greater understanding of the composition and functional capacities of these mobile metagenomes, and key MGE such as plasmids, will be important for understanding and ultimately manipulating many important microbial ecosystems, as well as providing fundamental insight into the mechanisms of gene flow within and between distinct microbiomes. Although no available method for accessing microbial plasmidomes represents a panacea for the study of these dynamic gene pools, the application of tools currently available, particularly when used in combination, holds much potential for greatly expanding our knowledge of plasmid diversity, abundance, and functionality within microbial mobile metagenomes.
References
Bale MJ, Day MJ, Fry JC. Novel method for studying plasmid transfer in undisturbed river epilithon. Appl Environ Microbiol. 1988;54(11):2756–8.
Campbell A, Mrazek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A. 1999;96:9184–9.
Espinosa M, Cohen S, Couturier M, et al. Plasmid replication and copy number control. In: CM Thomas (ed) The horizontal gene pool, bacterial plasmids and gene spread. Amsterdam: Harwood Academic Publishers; 2000. p. 207–48.
Götz A, Pukall R, Smit E. Detection and characterization of broad-host-range plasmids in environmental bacteria by PCR. Appl Environ Microbiol. 1996;63:1980–6.
Heuer H, Smalla K. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol Rev. 2012. doi:10.1111/j.1574-6976.2012.00337.x.
Hill K, Weightman AJ, Fry JC. Isolation and screening of plasmids from the epilithon which mobilise recombinant plasmid pD10. Appl Environ Microbiol. 1992;58:1292–300.
Jones BV. The human gut mobile metagenome: a metazoan perspective. Gut Microbes. 2010;1(6):417–33.
Jones BV, Marchesi JR. Accessing the mobile metagenome of the human gut microbiota. Mol Biosyst. 2007a;3:749–58.
Jones BV, Marchesi JR. Transposon-aided capture (TRACA) of plasmids resident in the human gut mobile metagenome. Nat Methods. 2007b;4:55–61.
Jones BV, Sun F, Marchesi JR. Comparative metagenomic analysis of plasmid encoded functions in the human gut microbiome. BMC Genomics. 2010;11:46.
Kav AB, Sasson G, Jami E, et al. Insights into the bovine rumen plasmidome. Proc Natl Acad Sci U S A. 2012;109:5452–7.
Kazimierczak KA, Scott KP, Kelly D, Aminov RI. Tetracycline resistome of the organic pig gut. Appl Environ Microbiol. 2009;75:1717–22.
Ley RE, Peterson DA, Gordon JI. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006;124:837–48.
Lozupone CA, Hamady M, Cantral BL, et al. The convergence of carbohydrate active gene repertoires in human gut microbes. Proc Natl Acad Sci U S A. 2008;105:15076–81.
Novick RP. Plasmid incompatibility. Microbiol Rev. 1987;51:381–95.
Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.
Ogilvie LA, Firouzmand S, Jones BV. Evolutionary, ecological and biotechnological perspectives on plasmids resident in the human gut mobile metagenome. Bioeng Bugs. 2012;3(1):1–19.
Reyes A, Haynes M, Hanson N, et al. Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature. 2010;466:334–8.
Schlüter A, Szczepanowski R, Pühler A, et al. Genomics of IncP-1 antibiotic resistance plasmids isolated from wastewater treatment plants provides evidence for a widely accessible drug resistance pool. FEMS Microbiol Rev. 2007;31:449–77.
Slater FR, Bailey MJ, Tett AJ, Turner SL. Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol. 2008;66:3–13.
Smalla K, Sobecky PA. The prevalence and diversity of mobile genetic elements in bacterial communities of different environmental habitats: insights gained from different methodological approaches. FEMS Microbiol Ecol. 2002;42:165–75.
Smalla K, Osburne AM, Wellington EMH. Isolation and characterisation of plasmids from bacteria. In: CM Thomas (ed) The horizontal gene pool, bacterial plasmids and gene spread. Amsterdam: Harwood Academic Publishers; 2000a. p. 207–48.
Smalla K, Krögerrecklenfort E, Heuer H, et al. PCR-based detection of mobile genetic elements in total community DNA. Microbiology. 2000b;146:1256–7.
Smillie CD, Smith MB, Friedman J, et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature. 2011. doi:10.1038/nature10571.
Strom SL. Microbial ecology of ocean biogeochemistry: a community perspective. Science. 2008;320:1043–5.
Suzuki H, Yano H, Brown CJ, Top EM. Predicting plasmid promiscuity based on genomic signature. J Bacteriol. 2010;192(22):6045–55.
Warburton P, Allan E, Hunter S, et al. Isolation of bacterial extra-chromosomal DNA from human dental plaque associated with periodontal disease, using transposon-aided capture (TRACA). FEMS Microbiol Ecol. 2011;78:349–54.
Zhang T, Zhang X-X, Ye L. Plasmid metagenome reveals high levels of antibiotic resistance genes and mobile genetic elements in activated sludge. PLoS ONE. 2011;6:e26041.
Zwirglmaier K, Ludwig W, Schleifer KH. Recognition of individual genes in a single bacterial cell by fluorescence in situ hybridization – RING-FISH. Mol Microbiol. 2004;51(1):89–96.
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies
Martin Wu
Department of Biology, University of Virginia, Charlottesville, VA, USA
Synonyms
Automated Phylogenomic Inference Application (AMPHORA)
Introduction
The small subunit ribosomal RNA (SSU rRNA or 16S rRNA) has been widely used in microbial systematics and diversity studies. The advantages of using the 16S rRNA gene as a marker gene are numerous. First, it is found in every single
cellular organism. Secondly, because regions of the 16S rRNA sequence are highly conserved, the 16S rRNA gene can be PCR amplified from a wide diversity of taxa using “universal” primers and sequenced, bypassing the need to isolate and culture the organisms in question. Consequently, millions of 16S rRNA reference sequences are available for microbial classification and identification (Cole et al. 2009).
Although 16S rRNA has been the “gold standard” in microbial diversity studies, it has several shortcomings. First, because the 16S rRNA gene makes up only a tiny fraction of a genome (~0.1 %), its application as a marker gene in classifying metagenomic sequences is seriously limited. Secondly, the widely recognized bias in 16S rRNA PCR skews the estimation of the relative abundance of species in a population (Acinas et al. 2005). Thirdly, the 16S rRNA gene copy number varies substantially from species to species, which further complicates the effort to accurately estimate microbial composition (Kembel et al. 2012). To circumvent these problems, protein-coding genes such as rpoB, pyrG, recA, and HSP70 have been used as alternative phylogenetic markers to complement rRNA-based analyses (Ludwig and Klenk 2000; Santos and Ochman 2004). Because protein genes are conserved at the amino acid level and not at the nucleotide level, they evolve faster and thus have more power to resolve the relationships of closely related species than the 16S rRNA gene. Unfortunately, for the same reason, it is extremely difficult to design “universal primers” that can be used to PCR amplify protein-coding genes from distantly related species (Santos and Ochman 2004). As a result, protein-coding genes have seen very limited use in broad-spectrum microbial surveys.
Recent explosive growth in genomic sequences has changed the landscape. Thousands of complete bacterial genomes are available and many more are being sequenced (Pagani et al. 2012). Each genome sequence brings with it thousands of protein-coding genes, vastly expanding the amount of data available for protein marker genes. In metagenomic studies, genomes of a mixed microbial population are sequenced directly from environments without prior isolation, culturing, and PCR amplification. Metagenomics therefore overcomes a major hurdle for using protein genes for microbial diversity studies in that it makes the sequences of protein genes readily accessible. Because metagenomic sequencing is random in nature, microbial composition estimated from metagenomic sequencing is less biased than that from a 16S rRNA PCR-based survey. Using single-copy protein-coding genes for relative species abundance estimation further eliminates the bias associated with copy-number variation of the 16S rRNA gene.
The rapid growth of genomic data also presents challenges for using protein-coding genes in microbial diversity studies. In order to answer the question of “who is there” in metagenomic studies, there is a pressing need for an automated, high-throughput, high-quality application for metagenomic phylotyping. Several factors should be considered for such an application. First, because genes can be exchanged in bacteria and archaea, it is imperative to use only genes that are recalcitrant to lateral gene transfer for phylotyping. Secondly, for accurate estimation of the microbial composition, only single-copy protein genes should be used as the marker genes. Thirdly, tree-based phylotyping involves multiple steps including marker identification, sequence alignment, tree reconstruction, and taxonomy assignment. For large-scale phylogenetic analysis, several technical hurdles need to be overcome to make high-quality sequence alignments prior to the phylogenetic inference.
Description
AMPHORA is an automated phylogenomic inference application (Wu and Eisen 2008; Wu and Scott 2012). It offers speed, reliability, and high-quality analyses using protein-coding genes as alternative marker genes for microbial diversity studies. The main components of AMPHORA are illustrated in Fig. 1 and are described in detail below.
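As a brief aside on the copy-number argument made in the Introduction, the effect can be shown with a few lines of arithmetic. This is a minimal sketch, not part of AMPHORA: the taxon names, cell counts, and 16S copy numbers are invented, and sequencing is idealized so that read counts are exactly proportional to gene copies in the sample.

```python
# Hypothetical sketch: 16S copy-number bias versus a single-copy marker.
# Taxa, cell counts, and copy numbers are made-up illustration values.


def relative_abundance(counts):
    """Convert raw counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def copy_number_corrected(counts_16s, copies_per_genome):
    """Divide 16S counts by per-genome copy number before normalizing."""
    genome_equivalents = {
        taxon: n / copies_per_genome[taxon] for taxon, n in counts_16s.items()
    }
    return relative_abundance(genome_equivalents)


# A toy community with equal numbers of cells of two taxa.
cells = {"taxon_A": 100, "taxon_B": 100}
copies = {"taxon_A": 1, "taxon_B": 7}  # assumed 16S copies per genome

# Idealized read counts: proportional to gene copies present.
reads_16s = {t: cells[t] * copies[t] for t in cells}
reads_marker = dict(cells)  # a single-copy gene yields one copy per cell

naive = relative_abundance(reads_16s)
corrected = copy_number_corrected(reads_16s, copies)
marker = relative_abundance(reads_marker)

print("naive 16S:     ", naive)
print("corrected 16S: ", corrected)
print("single-copy:   ", marker)
```

In this toy case the uncorrected 16S counts overstate the taxon with seven copies, while the single-copy marker reproduces the true 1:1 ratio without needing to know any copy numbers, which is the property the text attributes to single-copy protein-coding markers.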
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies, Fig. 1 A flowchart illustrating the major components of AMPHORA
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies, Fig. 2 Bacterial composition of
the GOS dataset analyzed using AMPHORA
AMPHORA Analysis of the Global Ocean Sampling Dataset
AMPHORA was used to phylotype the environmental shotgun sequencing reads of the Global Ocean Sampling (GOS) expedition (Rusch et al. 2007). From the 41 million predicted peptides, 213,583 peptides were identified that corresponded to the 31 bacterial and 104 archaeal marker genes. Using the number of reads per marker, it was estimated that 95.4 % of the reads in the GOS dataset belonged to bacteria while 4.6 % of the reads were from archaea, indicating that the ocean surface water is dominated by bacteria. The relative abundance of major bacterial groups is shown in Fig. 2. Alphaproteobacteria is the most abundant group overall, making up 47.8 % of the bacterial population. This is mainly due to a single clade of Pelagibacter ubique that constituted 35.8 % of the bacterial population sampled in GOS.
Because all the marker genes in AMPHORA are single-copy genes, the relative abundance of sequences in each marker gene can be used as an approximation for the relative organismal abundance in the population. In agreement, the relative abundance of the Pelagibacter ubique clade estimated by AMPHORA (35.8 %) is very close to previous quantitative estimations by fluorescence in situ hybridization showing that, on average, cells of the clade account for one-third of the ocean surface bacterioplankton communities (Morris et al. 2002). Also as expected, the bacterial diversity profiles are remarkably consistent between the different marker genes (Fig. 2).
Summary
Metagenomics has the potential to transform the way we study microbial diversity. To fully realize this potential, it is important to develop a set of well-curated protein-coding genes as alternative
marker genes. AMPHORA builds on a set of universally conserved, single-copy protein genes that are ideal for analyzing bacterial diversity. It facilitates the large-scale phylogenetic analysis of these marker genes and should be of broad application in the study of microbial evolution and ecology.
References
Acinas SG, Sarma-Rupavtarm R, Klepac-Ceraj V, et al. PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microbiol. 2005;71:8966–9.
Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 2011;60:291–302.
Cammarano P, Creti R, Sanangelantoni AM, et al. The archaea monophyly issue: a phylogeny of translational elongation factor G(2) sequences inferred from an optimized selection of alignment positions. J Mol Evol. 1999;49:524–37.
Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–52.
Cole JR, Wang Q, Cardenas E, et al. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–5.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
Grundy WN, Naylor GJ. Phylogenetic inference from conserved sites alignments. J Exp Zool. 1999;285:128–39.
Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Hwang UW, Kim W, Tautz D, et al. Molecular phylogenetics at the Felsenstein zone: approaching the Strepsiptera problem using 5.8S and 28S rDNA sequences. Mol Phylogenet Evol. 1998;9:470–80.
Jain R. Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci. 1999;96:3801–6.
Kembel SW, Wu M, Eisen JA, et al. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput Biol. 2012;8:e1002743.
Koski LB, Golding GB. The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001;52:540–2.
Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991;8:378–85.
Landan G, Graur D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 2007;24:1380–3.
Loytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–5.
Ludwig W, Klenk H-P. Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics. In: Boone DR, Castenholz RW, Garrity GM, editors. Bergey’s manual of systematic bacteriology, vol. 1. New York: Springer-Verlag; 2000. p. 49–65.
Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinforma. 2010;11.
Morris RM, Rappe MS, Connon SA, et al. SAR11 clade dominates ocean surface bacterioplankton communities. Nature. 2002;420:806–10.
Morrison DA, Ellis JT. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol. 1997;14:428–41.
Pagani I, Liolios K, Jansson J, et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012;40:D571–9.
Rusch DB, Halpern AL, Sutton G, et al. The sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5:e77.
Santos SR, Ochman H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environ Microbiol. 2004;6:754–9.
Sorek R, Zhu Y, Creevey CJ, et al. Genome-wide experimental determination of barriers to horizontal gene transfer. Science. 2007;318:1449–52.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151.
Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics. 2012;28:1033–4.
Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS ONE. 2012;7(1):e30288.
Proteomics and Metaproteomics
Rembert Pieper, Shih-Ting Huang and Moo-Jin Suh
J. Craig Venter Institute, Rockville, MD, USA
Synonyms
Global proteomics; Protein profiling of microbial communities; Proteomics of biological systems
Proteomics and Metaproteomics, Fig. 1 Protein profiles of (a) Shiga toxin-producing E. coli (serotype O157:H7) and (b) a murine stool fraction enriched in bacteria displayed in 2D gels. Samples of ~150 µg protein were loaded onto pH 4–7, 25 cm immobilized pH gradient (IPG) strips and isoelectrically focused applying 64 kVh. Following reduction and alkylation of proteins in the IPG strips, proteins were separated according to size in second dimension 8–18 %T SDS-PAGE gels (25 × 20 cm) for 1.8 kVh. Gels were stained with the dye Coomassie Brilliant Blue G250 (Courtesy of Christine Peterson and Prashanth Parmar, who contributed the gel electrophoresis data depicted in the figure)
Proteomics and Metaproteomics, Fig. 2 Shotgun proteomics workflow. After generation of a cell lysate or protein extract, proteins are digested with an endoproteinase (e.g., trypsin). The peptide mixture is separated on a reversed phase C18 or a strong anion exchange LC column. Peptide fractions are lyophilized and applied to nano LC-MS/MS sequentially. The mass spectra combined from all fractions are collected and interpreted with an algorithm and a relevant protein sequence database. Identified peptides are assigned to proteins of origin and counted to obtain a protein quantity estimate. Abundance profiles from different samples can be displayed in the form of heat maps
to assign a peptide sequence based on its original m/z value and tandem MS data that deliver a series of daughter m/z values for N- and C-terminal fragment ions (Fig. 3b). The MS and subsequent MS/MS analyses are performed in automated duty cycles defined by the LC-MS instrument software so that tens or even hundreds of thousands of MS and subsequent MS/MS scans are performed in series. The aggregate of data from these scans describes the proteome in the shotgun proteomics experiment. Due to the fact that the matching of theoretical MS/MS (peptide fragment) spectra and experimental spectra is performed with probabilistic models defined by a software and its inherent algorithm(s), shotgun proteomics data yield a number of peptide (and protein) identifications at a specific false discovery rate, often in conjunction with an MS score matrix that also attributes a measure of correct peptide identification. Algorithms integrated in such software tools are used to score PSMs and assign the highest scoring PSM to a peptide. Protein inferences are made by assigning peptides to a distinct protein sequence in the database (Fig. 3c). The larger the database of theoretical PSMs, the more computationally challenging it is to determine the best peptide match for an experimental mass spectrum. Herein lies one of the fundamental challenges of metaproteomics: protein sequence databases to be searched not only contain sequences derived from one but numerous fully annotated genomes or large metagenomic read assemblies that are partially annotated. Their content of predicted protein sequences is substantially increased. MS platforms have recently moved towards ultra-high pressure LC for high-resolution peptide separations and high-resolution, high-mass accuracy MS, such as the Orbitrap and Quadrupole-TOF instruments. Excellent peptide separation has the benefit that more peptides derived from low-abundance proteins are enriched in fractions and more likely to be selected during the MS data-dependent duty cycle for MS/MS analysis. High-mass accuracy and resolution enhance the confidence in peptide (sequence) assignments via PSMs. For a detailed review of LC-MS platforms used for proteomics applications, see (Yates et al. 2009).
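
The scoring and counting steps described above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular search engine: the function names, the toy scores, and the protein labels "protA" and "protB" are hypothetical, and the rule of counting only peptides that map to a single protein is an assumption made here to keep the example small. The sketch keeps the highest-scoring PSM for each spectrum, assigns each retained peptide to its protein of origin, and counts spectra per protein as a rough quantity estimate.

```python
from collections import defaultdict


def best_psm_per_spectrum(psms):
    """Keep the highest-scoring peptide-spectrum match for every spectrum.

    psms: iterable of (spectrum_id, peptide, score); a higher score is better.
    """
    best = {}
    for spectrum_id, peptide, score in psms:
        if spectrum_id not in best or score > best[spectrum_id][1]:
            best[spectrum_id] = (peptide, score)
    return best


def spectral_counts(best, peptide_to_proteins):
    """Count spectra per protein, using peptides unique to one protein only.

    Skipping shared peptides is an assumption of this sketch; it sidesteps
    the protein inference problem rather than solving it.
    """
    counts = defaultdict(int)
    for peptide, _score in best.values():
        proteins = peptide_to_proteins.get(peptide, [])
        if len(proteins) == 1:
            counts[proteins[0]] += 1
    return dict(counts)


# Toy data: scores and protein labels are invented for illustration.
psms = [
    ("s1", "EFVGGGYVTVLVR", 42.0),
    ("s1", "AAAK", 10.0),
    ("s2", "EFVGGGYVTVLVR", 37.5),
    ("s3", "LLSDR", 25.0),
    ("s4", "GGHK", 30.0),
]
peptide_to_proteins = {
    "EFVGGGYVTVLVR": ["protA"],
    "LLSDR": ["protB"],
    "GGHK": ["protA", "protB"],  # shared peptide, ignored by the counter
}

best = best_psm_per_spectrum(psms)
print(spectral_counts(best, peptide_to_proteins))
```

In a metaproteomic database, many peptides resemble "GGHK" in this toy example: they occur in conserved proteins of several species, so the share of spectra that can be counted unambiguously shrinks as the database grows.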
Proteomics and Metaproteomics, Fig. 3 Principles of peptide mass fingerprinting (PMF) and tandem mass spectrometry (MS/MS). (a) PMF of a purified 2D gel protein spot (e.g., from MALDI-TOF/TOF analysis). (b) Snapshot of MS and subsequent MS/MS scan in a shotgun proteomics dataset. Top: MS spectrum representing a peptide mixture derived from a variety of proteins, with one peptide peak at m/z 698.81 (+2 ion charge). Bottom: the peptide at m/z 698.81 was selected for CID fragmentation in a Velos Pro ion trap instrument and the sequence EFVGGGYVTVLVR assigned based on y- and b-ion series. For m/z values of the PMF and m/z values for a peptide and its MS/MS data, assignments to a protein in the searched database are made (c)
matched protein sequences derived from Bacteroides and Bifidobacterium, confirming relatively high abundance of these genera in the human gut. Despite the application of a bacterial enrichment procedure, 30 % of the PSMs represented matches to human proteins, including a large proportion of those active in cell-cell adhesion and innate immunity. This finding supported the notion that the host immune system interacts extensively with its gut microbiome. Analysis of urinary tract metaproteomes linked to asymptomatic bacteriuria (Fouts et al. 2012) resulted in protein identifications from two to five different opportunistic pathogens and provided preliminary evidence for host-bacterial interactions, specifically a battle for iron. Human lactotransferrin, an iron sequestration protein, and iron acquisition proteins and receptors from E. coli and Klebsiella pneumoniae were identified in the same samples.

Summary and Outlook

In conclusion, proteomics is a highly advanced discipline that contributes to science at the biological systems level. Metaproteomics has clear potential to elucidate functional interactions of coexisting microbial species and, if applicable, those with their eukaryotic host environments. Major challenges to enable in-depth and accurate metaproteomic profiling efforts for highly diverse communities remain to be addressed. Only a fraction of the genomes represented in complex microbial communities have been sequenced. Comprehensive metagenomic sequence datasets are very promising resources for advanced proteomic data searches. However, such datasets can be incomplete and may have sequence inaccuracies and significant redundancy which, in turn, affect the reliability of assignments of peptides and proteins on the species level via PSMs derived from MS-based proteomic datasets. Further improvement of metagenomic assembly and computational methods will benefit the quality of metaproteomic datasets since their analysis depends on predicted protein sequence data.
A particular challenge pertains to the high amino acid sequence identities among highly conserved (housekeeping) proteins of related species in a microbiome. Since protein identification in shotgun proteomics relies on peptide sequence data followed by in silico assignment to proteins, this sequence conservation impedes taxonomic profiling on the species level, analogous to the short reads of NextGen sequencing technologies. Nonetheless, metaproteomic data already contribute effectively to the elucidation of the metabolic capacity of complex biological systems and the cross-talk of such systems with their host environments. Robust computational algorithms and workflows will have a positive impact on the future of metaproteomics. Use of multiple “omics” technologies allows insights into complex intra- and extracellular biological processes and their cross-talk and integration into a biological system.

References

Chen YT, Chen HW, et al. Multiplexed quantification of 63 proteins in human urine by multiple reaction monitoring-based mass spectrometry for discovery of potential bladder cancer biomarkers. J Proteomics. 2012;75(12):3529–45.
de Godoy LM, Olsen JV, et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature. 2008;455(7217):1251–4.
Elliott MH, Smith DS, et al. Current trends in quantitative proteomics. J Mass Spectrom. 2009;44(12):1637–60.
Fouts DE, Pieper R, et al. Integrated next-generation sequencing of 16S rDNA and metaproteomics differentiate the healthy urine microbiome from asymptomatic bacteriuria in neuropathic bladder associated with spinal cord injury. J Transl Med. 2012;10(1):174.
Gorg A, Weiss W, et al. Current two-dimensional electrophoresis technology for proteomics. Proteomics. 2004;4(12):3665–85.
Hall-Stoodley L, Costerton JW, et al. Bacterial biofilms: from the natural environment to infectious diseases. Nat Rev Microbiol. 2004;2(2):95–108.
Ho Y, Gruhler A, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415(6868):180–3.
Kuhner S, van Noort V, et al. Proteome organization in a genome-reduced bacterium. Science. 2009;326(5957):1235–40.
Markert S, Arndt C, et al. Physiological proteomics of the uncultured endosymbiont of Riftia pachyptila. Science. 2007;315(5809):247–50.
Mueller LN, Brusniak MY, et al. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res. 2008;7(1):51–61.
Nagaraj N, Wisniewski JR, et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol. 2011;7:548.
O’Farrell PH. High resolution two-dimensional electrophoresis of proteins. J Biol Chem. 1975;250(10):4007–21.
Olsen JV, Vermeulen M, et al. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci Signal. 2010;3(104):ra3.
Picotti P, Rinner O, et al. High-throughput generation of selected reaction-monitoring assays for proteins and proteomes. Nat Methods. 2010;7(1):43–6.
Pieper R, Gatlin CL, et al. The human serum proteome: display of nearly 3700 chromatographically separated protein spots on two-dimensional electrophoresis gels and identification of 325 distinct proteins. Proteomics. 2003;3(7):1345–64.
Pieper R, Huang ST, et al. Characterizing the dynamic nature of the Yersinia pestis periplasmic proteome in response to nutrient exhaustion and temperature change. Proteomics. 2008;8(7):1442–58.
Prokisch H, Scharfe C, et al. Integrative analysis of the mitochondrial proteome in yeast. PLoS Biol. 2004;2(6):e160.
Ram RJ, Verberkmoes NC, et al. Community proteomics of a natural microbial biofilm. Science. 2005;308(5730):1915–20.
Rodriguez-Valera F. Environmental genomics, the big picture? FEMS Microbiol Lett. 2004;231(2):153–8.
Speers AE, Cravatt BF. Activity-based protein profiling (ABPP) and click chemistry (CC)-ABPP by MudPIT mass spectrometry. Curr Protoc Chem Biol. 2009;1:29–41.
van Noort V, Seebacher J, et al. Cross-talk between phosphorylation and lysine acetylation in a genome-reduced bacterium. Mol Syst Biol. 2012;8:571.
Verberkmoes NC, Russell AL, et al. Shotgun metaproteomics of the human distal gut microbiota. ISME J. 2009;3(2):179–89.
Wolf-Yadlin A, Sevecka M, et al. Dissecting protein function and signaling using protein microarrays. Curr Opin Chem Biol. 2009;13(4):398–405.
Wolters DA, Washburn MP, et al. An automated multidimensional protein identification technology for shotgun proteomics. Anal Chem. 2001;73(23):5683–90.
Yates JR, Ruse CI, et al. Proteomics by mass spectrometry: approaches, advances, and applications. Annu Rev Biomed Eng. 2009;11:49–79.
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

Norman J. MacDonald1, Donovan H. Parks1,2 and Robert G. Beiko1
1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
2 Australian Centre for Ecogenomics, University of Queensland, Brisbane, QLD, Australia

Definition

Algorithm, software, and Web service for taxonomic classification of metagenome fragments using both homology and compositional information.

Introduction

means uniform and they often cannot distinguish closely related organisms. Further compounding the problem is the reliance on short-read-based approaches to metagenome sequencing, which can generate reads less than 200 nucleotides in length, and short or ambiguous assemblies in many cases. Successful classification methods use homology (e.g., BLAST comparisons against genes or proteins from a set of reference genomes) or composition (e.g., distribution of tetranucleotide sequences) for classification, with a newer generation of “hybrid” classifiers using both (e.g., PhymmBL; Brady and Salzberg 2009). We have developed RITA, a hybrid approach that uses streamlined approaches to rapidly generate homology and composition information and combines these sets of predictions in a supervised classification pipeline that sorts sequences into different classification groups based on the strength and agreement of the two types of predictions.
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, # Springer Science+Business Media New York 2015
For the stand-alone version, a locally installed copy of either the BLAST+ software suite or USEARCH (Edgar 2010) is necessary.
Reference Databases: Since RITA is a supervised classifier, it requires a reference database of sequenced genomes with associated taxonomic information. Genomic information is typically acquired from the NCBI database of sequenced genomes and can be performed automatically using the scripts provided with the FCP software package. From these sequenced genomes (and optionally, similarly formatted files provided by the user), RITA can build reference models for both composition and homology. If rank-flexible classification (described below) is to be performed, a set of 16S rRNA gene sequences corresponding to the reference sequenced genomes will be required as well. Instructions on acquiring and preparing these can be viewed at http://kiwi.cs.dal.ca/Software/RITA.
Input Data: The user must provide their metagenomic sequences in a FASTA-formatted file. The sequences can be of any length. If rank-flexible classification is desired, a list of sampled 16S rRNA gene sequences must also be provided.

The RITA Pipeline

The primary objective of RITA is to make taxonomic assignments that consider both the agreement between composition and homology and the strength of evidence from both types of classification technique. Homology-based classification is performed using local alignment-based comparative tools such as BLAST (Altschul et al. 1997). Many variants of BLAST have been developed which differ in the type of sequence information being compared (e.g., nucleotide, 6-way translated nucleotide, amino acid), as well as sensitivity and speed. Although RITA can be configured to run in a number of different ways, the default approach is based on the sequential use of three different algorithms: Discontiguous MEGABLAST for fast but low-sensitivity comparisons between a nucleotide query and nucleotide database, BLASTN for slower but more sensitive nucleotide-nucleotide comparisons, and BLASTX for sensitive comparisons between a translated query sequence and reference database of protein sequences. The objective in using this ordering is to place the fastest algorithms first, which removes the need to run the slower algorithms on all query sequences. The stand-alone version of RITA also includes the option to use UBLAST (Edgar 2010), which aims to prioritize searches against a reference database in order to avoid searching the entire database. Approaches such as LCA (Huson et al. 2007), MetaPhyler (Liu et al. 2011), and CARMA (Gerlach and Stoye 2011) use phylogenetic information for taxonomic classification, but our trials of RITA showed no additional benefit to the use of phylogenetic trees in the classification scheme we describe below.
For compositional classifications, we encode each reference genome as a series of nucleotide words (i.e., k-mers) of a fixed length to generate frequency distributions of each word. These frequency profiles are then used to train a naïve Bayes (NB) classifier (Parks et al. 2011), which assigns likelihoods to each query fragment based on the match between its k-mer profile and those representing the different genomes in the reference database. The genome with the largest likelihood for a given fragment is the best compositional match to that fragment. The crucial assumption of the NB classifier is of independence among input k-mers: while this assumption is clearly violated by k-mer decompositions of DNA sequences (for instance, the frequency of the 6-mer AAAAAA will be closely tied to that of AAAAAC), in practice this does not impact on the performance of the classifier. Phymm is a compositional classifier that uses more-sophisticated Markov models of sequence composition: while these are better at describing the compositional profile of a genome, in practice they are much slower and no more accurate than our NB approach.
The RITA pipeline combines homology and composition information by first assessing whether the predictions of composition and homology agree. While homology alone outperforms composition alone in most classification tasks, the genomic patterns reflected in compositional profiles provide complementary information, and agreement between the two types of
data is not trivially obtained. If agreement is found for a given fragment and the first BLAST algorithm considered, then the fragment will be classified with the predicted taxonomic label and assigned to group 1, the highest-confidence group. If the predictions of composition and homology disagree, then classification using homology alone will be attempted in the following manner. When running RITA, the user specifies a minimum margin for homology-based classification based on e-values: the default value is 20 orders of magnitude. If the globally best e-value is greater than the best e-value from a different taxonomic group by an amount greater than or equal to this margin, then the result is considered as strong evidence for assignment to the best-matching group, and the fragment is assigned to group 2. If the fragment remains unclassified, the same procedure is followed for subsequent BLAST algorithms with potential classifications to group 3, group 4, etc. If all homology-based options have been exhausted, classification is made to one of two groups based on the NB classifier alone. Similar to the homology margin described above, the globally best NB likelihood is compared to the best likelihood from a different taxonomic group. If this ratio exceeds a user-specified amount, then the fragment is assigned to the higher-confidence composition-only group. If the ratio does not exceed this amount, then the fragment is assigned to the last and lowest-confidence group.
The procedure above describes rank-specific classification, where all fragments are classified at a given taxonomic rank, for instance, phylum or genus. However, different groups of microbes may be more or less represented by sequenced genomes, and there may be more evidence to make precise assignments to some groups than to others. In the extreme case, some bacterial phyla are represented by a single sequenced individual, making it impossible to distinguish between genera and other groups within this phylum. One solution to this problem is to classify all fragments to a very high rank such as phylum or class, but this discards precision in cases where it may be available. Our solution is to use a rank-flexible version of RITA that assigns an appropriate taxonomic group and rank based on the strength of available evidence. To perform rank-flexible classification of a metagenome sample using RITA, the user must provide a list of 16S rRNA genes that were identified from the sample. These genes are used to limit the taxonomic scope of the RITA predictions. The provided 16S rRNA genes are mapped into a tree of all 16S genes from the reference database of sequenced genomes. All genomes represented within a minimal clade containing one of the sampled 16S genes will be flagged as assignable to a taxonomic rank that is no more precise than the rank covering all members of that clade. For example, if a sampled 16S gene maps to the reference tree such that all of its sister taxa are from the same order, then RITA will consider matches to those taxa to be equivalent at the rank of order. In this manner, the level of classification is determined by the density of reference genome sampling around the observed 16S rRNA gene sequences from the environmental sample.

Interpreting RITA Results and Factors Affecting Prediction Accuracy

RITA Output: RITA returns detailed results of both composition- and homology-based models. Most critical in the RITA output is a tab-separated file that lists the predictions associated with each DNA sequence. Examples of RITA output are given in Table 1, with some taxonomic ranks omitted to fit each result on a single line:
The first column contains the name of the sequence as obtained from the sequence file. The second and third columns give the confidence group associated with the prediction, first by number and then by name. Group 1, “NB_DCMEGABLAST,” indicates agreement between the first homology prediction method used (Discontiguous MEGABLAST) and the NB classifier, while group 2 corresponds to a prediction made based on a strong separation between the best and second-best groups according to homology. The fourth column shows the taxonomic rank at which the prediction was made, and the remaining columns give the
labels associated with that prediction, with the final column showing the actual genome that yielded the best prediction.
Summarizing Results: In most cases, we do not recommend using all classes when building taxonomic summaries of the contents of a metagenome. In particular, the accuracy of the final two classes (which are based on composition only) tends to be very poor when sequences are short. If high precision is desired, then a user can focus their attention on either group 1 alone or those groups in which homology is a factor in the prediction. In the example above, this would include groups 1–4 and exclude the last two groups, 5 and 6. However, when sequences are long (>2,500 nt in length) due to assembly, predictions based on composition alone are more reliable and can be included in the final set of predictions. Also, if the user has a reasonable expectation of “who is there” based on, e.g., taxonomic assignment of marker genes, this knowledge can be used as the basis for accepting a subset of predictions from the last two groups.
Factors Affecting the Accuracy of RITA Predictions: Several factors have been tested and shown to impact on the accuracy of RITA predictions. Among the most notable are:
Reference genome availability. Classification of a fragment to a taxonomic group at a given rank obviously depends on the existence of at least one sequenced genome from this group in the reference database. Even at the level of genus, inclusion of multiple reference genomes is desirable to adequately map out the pan-genome for homology-based predictions and to capture compositional variation within the group. Compositional signal is highly variable within order, class, and phylum, and best matching to homologs is difficult as well due to the confounding effects of gene loss and lateral gene transfer. Consequently the classification accuracy on fragments from genomes that are taxonomically novel (i.e., that have no relatives in the reference database at ranks such as order or class) will be extremely poor. This presents a significant challenge when samples are known to be enriched in poorly represented phyla such as Verrucomicrobia, Acidobacteria, or the many candidate phyla that lack sequenced representatives. If human microbiome samples are being processed using RITA, it is highly desirable to add the draft genomes sequenced by the Human Microbiome Consortium (Markowitz et al. 2012) to increase the coverage of common human-associated taxonomic groups: the effects of including or excluding these genomes are shown in Fig. 1.
Short fragments. The effect of fragment length on classification accuracy has been extensively characterized (McHardy and Rigoutsos 2007; Brady and Salzberg 2009; MacDonald et al. 2012). While hybrid classifiers such as RITA can give accuracy in excess of 50 % even on metagenomic fragments 50 nt in length, a high degree of misclassification is likely and many false-positive predictions can be anticipated. Restricting predictions to the “agreement” groups such as group 1 is highly desirable in this case.
Long fragments. A different problem is seen when applying RITA to long, assembled metagenomic fragments. Since RITA considers only the best BLAST match for a given fragment, the homology prediction for a long assembly will be based on one of many genes. If the prediction associated with this gene is incorrect (for instance, if it was recently transferred into the sequenced organism from a different genome), then homology and composition will likely disagree, and the entire fragment will likely be assigned to a
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data, Fig. 1 RITA classifications of 33,000 metagenomic fragments from obese twin gut metagenomes (Turnbaugh et al. 2009). D-BLASTN = Discontiguous MEGABLAST. The number of assignments to each RITA group is shown, with different colors indicating assignments to different genera. (a) Assignments made to a set of reference genomes, excluding the draft genomes sequenced by the HMP. A majority of sequences are assigned to the low-confidence “NB ratio” category. (b) Classifications of the same data set with inclusion of the HMP reference genomes, showing a doubling of the number of assignments to the highest-confidence group (NB and D-BLASTN) and a near-halving of assignments to the NB ratio group. Plots were generated by the RITA Web server
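
The group assignment logic of the RITA pipeline can be sketched in Python as follows. This is an illustrative reading of the description in this entry, not the RITA source code: the function names, the add-one smoothing of k-mer frequencies, the treatment of a lone homology hit as decisive, the e-value floor of 1e-300, and the default NB margin of 10 log units are all assumptions introduced for the example. Inputs are keyed by taxonomic group at the chosen rank, and each homology run is a mapping from group to its best e-value (an empty mapping means no hits). With three homology runs, the sketch yields groups 1 to 4 for homology-supported assignments and groups 5 and 6 for composition-only assignments, which matches the group numbering used in the text.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_nb(references, k):
    """Per-reference log P(k-mer); add-one smoothing is an assumption."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    model = {}
    for label, seq in references.items():
        counts = kmer_counts(seq, k)
        total = sum(counts[w] for w in words) + len(words)
        model[label] = {w: math.log((counts[w] + 1) / total) for w in words}
    return model


def nb_log_likelihoods(fragment, model, k):
    """Naive Bayes score: sum of log-probabilities of the fragment's k-mers."""
    counts = kmer_counts(fragment, k)
    return {label: sum(n * logp[w] for w, n in counts.items() if w in logp)
            for label, logp in model.items()}


def evalue_gap(hits):
    """Best group and its lead over the runner-up, in orders of magnitude."""
    ranked = sorted(hits, key=hits.get)
    if len(ranked) == 1:
        return ranked[0], math.inf  # assumption: a lone hit is decisive
    log_e = [math.log10(max(hits[t], 1e-300)) for t in ranked[:2]]
    return ranked[0], log_e[1] - log_e[0]


def rita_like_group(nb_loglik, homology_runs, evalue_margin=20.0, nb_margin=10.0):
    """Return (predicted group label, confidence group number)."""
    nb_ranked = sorted(nb_loglik, key=nb_loglik.get, reverse=True)
    nb_best = nb_ranked[0]
    # Group 1: composition agrees with the first (fastest) homology search.
    first_hits = homology_runs[0] if homology_runs else {}
    if first_hits and min(first_hits, key=first_hits.get) == nb_best:
        return nb_best, 1
    # Groups 2, 3, ...: homology alone, one group per search algorithm.
    for i, hits in enumerate(homology_runs):
        if hits:
            label, gap = evalue_gap(hits)
            if gap >= evalue_margin:
                return label, 2 + i
    # Last two groups: composition alone, split by the NB likelihood lead.
    n = len(homology_runs)
    lead = (math.inf if len(nb_ranked) == 1
            else nb_loglik[nb_ranked[0]] - nb_loglik[nb_ranked[1]])
    return nb_best, (n + 2 if lead > nb_margin else n + 3)


# Toy example with two invented reference "genomes".
refs = {"taxonA": "ACGT" * 50, "taxonB": "GGCC" * 50}
model = train_nb(refs, k=3)
nb = nb_log_likelihoods("ACGTACGTACGT", model, k=3)
runs = [{"taxonB": 1e-12, "taxonA": 1e-10}, {}, {"taxonB": 1e-50, "taxonA": 1e-20}]
print(rita_like_group(nb, runs))
```

In the toy call, composition favors taxonA while the first search weakly favors taxonB, so the fragment falls through to the third search, where the e-value gap is large enough for a homology-only assignment.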
S 620 SATe-Enabled Phylogenetic Placement
had good accuracy for genes that evolve under low rates of evolution, but when the rate of evolution increased, then their accuracy dropped substantially. Two observations resulted: first, under a high rate of evolution, the reference alignment and tree would be difficult to estimate, an observation that has been made elsewhere. However, more surprisingly, when the sequences evolved under a high rate of evolution, even with a good alignment and tree, the technique for computing the extended alignment did not have good accuracy.
Mirarab et al. developed a divide-and-conquer technique for improving this approach to phylogenetic placement, which they termed SEPP. This technique operates by using the reference tree to divide the dataset into subsets (as used in SATé-II (Liu et al. 2012)) and then uses HMMER (Eddy 1998) to compute an HMM on each subset using the induced alignment from the reference alignment. Thus, SEPP stands for SATé-enabled phylogenetic placement. Thus, instead of using a single HMM to represent the reference alignment, a collection of HMMs is used, each on a different subset of the taxa. The calculation of the extended alignment for each query sequence is then made by using HMMER to score the fit between the query sequence and each of the subset HMMs, and the one that has the best score is used to align the query sequence to the alignment on that subset. Because the subset alignments are all in agreement with the reference alignment on the full dataset, transitivity then provides the alignment of the query sequence to the full dataset. In this way, the extended alignment of each query sequence can be computed. Once the extended alignment is calculated, the query sequence can be inserted into the reference tree using maximum likelihood, just as in EPA and pplacer.
SEPP also allows the user to limit the subtree of the reference tree into which the query sequence will be placed through an additional parameter. Thus, SEPP takes two parameters: the number of leaves in the subtree on which SEPP builds an HMM (based on the induced alignment) and the number of leaves in the (perhaps larger) subtree that contains the alignment subset, into which the query sequence is then placed. Both parameters influence the accuracy and running time of SEPP.
Thus, the most important difference between SEPP and EPA and pplacer is just how the extended alignment is computed. The technique in SEPP for calculating the extended alignment is based on decomposing the taxon set into subsets using the reference tree, and so the important issue is how the taxon set is decomposed. They used the centroid edge decomposition strategy first employed in the SATe multiple sequence alignment method (Liu et al. 2012). This strategy removes an edge that breaks the taxon set roughly in half and then repeats the process on each subtree until the desired number of subtrees is computed. Thus, SEPP is SATe-enabled phylogenetic placement.
SEPP takes two parameters: the size of the “alignment subsets” that the reference tree is decomposed into and the size of the larger subsets (called “placement subsets”) into which the query sequences can be placed after their extended alignments are computed. Both parameters impact accuracy and speed. For example, smaller alignment subsets result in better accuracy but increase the running time. Similarly, larger placement subsets improve accuracy but increase the running time. The experimental study showed that setting both parameters identically and decomposing to ten subsets gave a good trade-off between accuracy and running time.
The experimental study in Mirarab et al. (2012) showed that this default setting for SEPP gave improved accuracy compared to pplacer and PaPaRa; results from this study are reproduced below in Fig. 1. The test datasets have 500 query sequences (half “long” and half “short,” where long sequences have a length on average of 250 and short sequences have a length on average of 100), and the placement methods insert these query sequences into a reference tree and alignment on 500 full-length sequences (average length 1,000 nt). Mirarab et al. (2012) also showed that SEPP provided improved computational performance over these methods with respect to both time and peak memory usage for very large datasets.
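
The centroid edge decomposition described above can be sketched in Python as follows. This is a simplified illustration rather than the SEPP or SATé implementation: the tree is represented as a plain list of (node, node) edge tuples, the function names are hypothetical, and the choice to always split the subset with the most taxa next is an assumption. Each step removes the edge that divides the taxa of a subtree most evenly and repeats until the requested number of subsets is reached.

```python
from collections import defaultdict


def side_of(adj, start, cut):
    """Nodes reachable from `start` once the edge `cut` is removed."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen and {u, v} != set(cut):
                seen.add(v)
                stack.append(v)
    return seen


def centroid_split(edges, taxa):
    """Remove the edge that splits `taxa` most evenly; return both halves."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = None
    for u, v in edges:
        side = side_of(adj, u, (u, v))
        imbalance = abs(len(taxa) - 2 * len(side & taxa))
        if best is None or imbalance < best[0]:
            best = (imbalance, (u, v), side)
    _, cut, side = best
    # Every remaining edge lies wholly on one side of the cut.
    near = [e for e in edges if e != cut and e[0] in side]
    far = [e for e in edges if e != cut and e[0] not in side]
    return (near, taxa & side), (far, taxa - side)


def decompose(edges, taxa, n_subsets):
    """Split the largest subset repeatedly until n_subsets are obtained."""
    parts = [(list(edges), set(taxa))]
    while len(parts) < n_subsets:
        parts.sort(key=lambda p: len(p[1]))
        big_edges, big_taxa = parts.pop()
        if len(big_taxa) < 2:  # nothing left to split
            parts.append((big_edges, big_taxa))
            break
        parts.extend(centroid_split(big_edges, big_taxa))
    return sorted(sorted(t) for _, t in parts)


# Toy caterpillar tree with eight taxa (t1..t8) and internal nodes i1..i6.
EDGES = [("t1", "i1"), ("t2", "i1"), ("i1", "i2"), ("t3", "i2"),
         ("i2", "i3"), ("t4", "i3"), ("i3", "i4"), ("t5", "i4"),
         ("i4", "i5"), ("t6", "i5"), ("i5", "i6"), ("t7", "i6"),
         ("t8", "i6")]
TAXA = {"t%d" % i for i in range(1, 9)}

print(decompose(EDGES, TAXA, 4))
```

The search over edges here is quadratic in the tree size, which is adequate for illustration only; a production implementation would compute the leaf counts on both sides of every edge in a single traversal.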
[Fig. 1 plot; y-axis: Delta Error (edges)]
than both EPA and pplacer, because instead of using a single HMM to represent the entire reference alignment, it uses multiple HMMs, each on a different subset of the taxa. Although formu-
[Fig. 1, step (iv): 1. Gel sizing; 2. Blunt end polishing; 3. Cloning]
(i) PCR amplification. The V1 region is amplified using BsgI-Bact64F (5′-dual biotin-TTT GAC CGT GCA GCY TAA YRC ATG CAA GTCG-3′) and BsgI-Bact109R (5′-dual biotin-TTT GAC CGT GCA GYY CAC GYG TTA CKC ACC CGT-3′). Each primer has an extension region that contains a recognition site for BsgI (bolded and underlined), and the most 5′ nucleotide of this extension region is labeled with at least one biotin or biotin-tetra-ethyleneglycol (biotin-TEG) molecule. The quality and quantity of the PCR products are evaluated using PAGE (8 %T, 19:1) mini gel. Then, the PCR products are purified using the QIAquick PCR Purification Kit (QIAGEN, Valencia, CA) or by ethanol precipitation following extraction with phenol/chloroform. Hot-start PCR using a hot-start Taq DNA polymerase or a hot-start dNTP mix is recommended in the PCR amplification to reduce formation of primer dimers, which can contaminate the RSTs.
(ii) Digestion of PCR products and primer removal. The purified PCR products are digested with BsgI, a type IIs restriction endonuclease that cuts 16 base pairs (bp) downstream from the recognition site. The released RSTs are separated from the primers using streptavidin-coated magnetic beads, such as Dyna 280 beads (Dynal, Oslo, Norway), which immobilize the primers that have a biotin label at the 5′ end.
(iii) Concatenation of individual RSTs. Each of the freed RSTs has one 2-nt overhang at both 3′ termini, and these overhangs facilitate annealing of individual RSTs in head-to-tail orientation in series (Fig. 1).
S 624 Serial Analysis of V1 Ribosomal Sequence Tags
[Bar chart: number of rRNA sequences (y-axis, 0 to 3.0M) per year (x-axis, 1992 to 2011 and SILVA r111); bar labels range from 473 to 2,492,653]
SILVA Databases, Fig. 1 Growth of the ribosomal RNA databases since 1992
comprehensive rDNA datasets comprising sequences from the Bacteria, Archaea, and
Eukarya domains. All sequences are checked for anomalies, carry a rich set of
sequence-associated contextual information, and have multiple taxonomic classifications
and the latest validly described nomenclature. The SILVA datasets are based on the
EMBL/EBI Nucleotide Sequence Database (EMBL-Bank), a member of the International
Nucleotide Sequence Database Collaboration (INSDC) comprising all publicly available
DNA sequences. They are generated by an automatic software pipeline for the extraction
of SSU and LSU rDNA sequences as well as quality control. The alignment is based on the
latest comprehensive ARB (Ludwig et al. 2004) alignments. The datasets are extensively
annotated by third-party data integration. Substantial manual curation of the alignment
and taxonomy is performed on each public release. SILVA dataset updates and new online
features are continuously released on the SILVA web portal (http://www.arb-silva.de),
which provides detailed statistics and documentation of the resource.

SILVA Datasets

The SILVA project provides datasets for all SSU and LSU rDNA sequences found in
EMBL-Bank that fulfill the SILVA quality criteria. Since their first public release in
February 2007, based on EMBL-Bank release 89, these datasets have increased in size by
a factor of 10 and 5 for the SSU Parc and LSU Parc datasets, respectively. Moreover,
the growth is clearly exponential (Fig. 1) as is the growth of the general DNA sequence
databases. Detailed information on the current SILVA database content can be found in
the documentation section of the SILVA web portal.
The SILVA SSU and LSU rDNA datasets each consist of two subsets: (1) the “Parc”
datasets comprising the complete SILVA
database content and (2) the “Ref” datasets comprising high-quality subsets of
sequences in the Parc datasets. For the SSU dataset, additionally, two subsets are
provided, (3) the “Ref NR” dataset, a nonredundant version of the Ref subset, and (4)
a type strain dataset provided by the “All-Species Living Tree Project” (LTP) (Munoz
et al. 2011), which is also available for the LSU rDNA. All datasets and individual
subsets can be downloaded as ARB files for direct use with the ARB software package
(Ludwig et al. 2004). In the future, only SSU Ref and SSU Ref NR datasets will be
offered in the ARB format to avoid excessive hardware demands on the user side.

SILVA Parc
All sequences in the Parc datasets have a minimum length of 300 aligned nucleotides
within the boundaries of the rRNA genes. Sequences are only accepted if they have less
than 2 % ambiguities or homopolymers or vector contamination. Additionally, after the
alignment, minimal quality requirements for sequence quality and base-pair score, as
well as alignment identity and quality, are applied. For details, please refer to the
section “Quality Control” and the respective dataset documentation page (e.g.,
http://www.arb-silva.de/documentation/background/release-111/).

SILVA Ref
The SILVA Ref datasets represent subsets of the corresponding SILVA SSU and LSU Parc
datasets. They comprise only “full-length” or nearly “full-length” sequences. An SSU
sequence is considered to be of “full length” if it contains at least 1,200 aligned
bases within the rRNA gene boundaries. For sequences classified as Archaea, this
threshold has been lowered to 900 aligned bases to avoid losing the majority of
sequences. LSU sequences are considered “full length” if they are at least 1,900 bases
long.
More stringent thresholds for alignment quality and identity are applied for the Ref
datasets. Consequently, the Ref datasets contain considerably fewer sequences than the
Parc datasets, specifically about one-fourth in the case of the SSU Parc database and
about one-tenth for the LSU Parc database (ratios for the SILVA 111 release).
Sequences originating from the “Human Skin Microbiome” (HSM) (Grice et al. 2009), the
“Mouse Wound Microbiota” (MWM) (Grice et al. 2010), and the “Guerrero Negro Hypersaline
Microbial Mat” (GNHM) large-scale sequencing projects are excluded from the SSU Ref
dataset. Instead, these sequences, with more than 490,000 (SILVA 111) long sequence
reads in total, are provided in a dedicated dataset. This is done to further restrict
the size of the SILVA SSU Ref dataset and to avoid overrepresentation of sequences of
a specific origin.
For both SILVA Ref datasets, the ARB files are supplemented with a manually classified
“guide tree,” incrementally built using the ARB parsimony tool with filters to remove
highly variable positions and followed by removal of sequence entries represented by
anomalous tree branch lengths. These trees also represent the basis for the SILVA
taxonomy (see section “SILVA Taxonomy” below).

SILVA Ref NR (Nonredundant)
For users interested in a representative SSU rDNA sequence collection, the SILVA
project offers a nonredundant (NR) version of the SSU Ref subset. This dataset is
created by applying clustering at 99 % (up to SILVA 108) and 98 % (from SILVA 111 on)
sequence identity. Of each cluster, only the longest sequence is kept. This reduces the
size of the dataset to less than 50 % of its original size, even though the sequences
omitted in the SSU Ref dataset from the HSM, MWM, and GNHM projects (see above) are
included for clustering. Sequences from cultivated species are preserved in all cases
to serve as an anchor for taxonomy. The resulting SSU Ref NR dataset with its manually
curated “guide tree” can be used as a representative dataset for classification,
phylogenetic analysis, and probe design. It is the recommended dataset to be used as
a starting point for all users interested in environmental rDNA sequence analysis.
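The length criteria that separate the Parc and Ref datasets can be summarized in a few lines of code. The sketch below is illustrative only and is not the SILVA pipeline: the `Entry` record and the `length_tier` function are hypothetical names, only the length thresholds quoted above are modeled, and the additional alignment quality and identity thresholds for Ref are left out. For simplicity the LSU threshold is applied to the same aligned-base count, although the text states it in bases.

```python
from dataclasses import dataclass

# Length thresholds quoted in the text.
PARC_MIN_ALIGNED = 300
REF_MIN_LENGTH = {"SSU": 1200, "SSU_ARCHAEA": 900, "LSU": 1900}


@dataclass
class Entry:
    gene: str           # "SSU" or "LSU"
    domain: str         # "Bacteria", "Archaea", or "Eukarya"
    aligned_bases: int  # aligned bases within the rRNA gene boundaries


def length_tier(entry):
    """Return 'Ref', 'Parc' (Parc only), or None, using length criteria alone."""
    if entry.aligned_bases < PARC_MIN_ALIGNED:
        return None
    if entry.gene == "SSU":
        key = "SSU_ARCHAEA" if entry.domain == "Archaea" else "SSU"
    else:
        key = "LSU"
    return "Ref" if entry.aligned_bases >= REF_MIN_LENGTH[key] else "Parc"
```

For example, a 1,000-base archaeal SSU sequence would qualify for Ref by length, whereas a bacterial SSU sequence of the same length would remain in Parc only.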

SILVA Taxonomy

A substantial revision of the classification of all prokaryotic sequences in the Ref
datasets was first published with SILVA release 100. Based on the “guide trees,” all
phylogenetic assignments are manually curated, taking into account taxonomic
information provided by Bergey’s Taxonomic Outline of the Prokaryotes (Garrity
et al. 2004); the taxonomic outlines for volumes 3, 4, and 5 of Bergey’s Manual; and
the List of Prokaryotic names with Standing in Nomenclature (Euzéby 1997). Furthermore,
extensive effort is spent to represent prominent uncultured and not validly published
environmental clades, groups, and taxa, respectively. The majority of these clades and
groups are annotated in the “guide tree” based on literature surveys and personal
communications. Taxonomic groups consisting only of sequences from uncultured organisms
are named after the clone sequence submitted earliest. Due to this exhaustive manual
approach, SILVA currently contains the most up-to-date and detailed bacterial and
archaeal taxonomic classification.
To also create an improved and unified taxonomy for Eukarya based on 18S rDNA
sequences, the Eukaryotic Taxonomy Working Group (ETWG) was founded in October 2011.
The first version of these efforts is deployed with SILVA release 111.

SILVA SEED Datasets
SILVA uses customized and specialized reference datasets for specific tasks within its
software pipeline. Such internal reference datasets are called SEEDs.

SEEDs for Alignment
As of July 2012, the SEED used for SSU rRNA gene sequence alignment has 50,000
alignment positions including all gaps and consists of about 57,000 high-quality,
aligned SSU rDNA reference sequences. The alignment SEED of the LSU rRNA gene comprises
150,000 positions but includes only about 3,000 aligned sequences. Both SEEDs contain
representative sequences from the Bacteria, Archaea, and Eukarya domains and are
manually curated and continuously enhanced.

SEEDs for Quality Control
The SEED used for the detection of sequence anomalies in SSU sequences is based on the
corresponding alignment SEED with all sequences removed if any indication of an anomaly
was found. This reduces the size of the SEED by a factor of 6. The detection of
anomalies is not done for LSU rDNA sequences because none of the available tools can be
applied.
For identification of vector contaminations, a SEED based on the EMVEC (EBI) and UniVec
(NCBI) reference datasets is used with all sequences removed resembling an rDNA
sequence.

Data Retrieval

Three strategies are applied to retrieve SSU and LSU rDNA sequences from EMBL-Bank:
• A keyword search is used to extract annotated SSU and LSU rDNA sequences.
  Additionally, a set of relaxed keywords is applied to account for sequences with
  spelling mistakes in the annotation.
• A whitelist taken from the Ribosomal Database Project (RDP) (Cole et al. 2005) is used
  to retrieve sequences that are not covered by the keyword search.
• HMMs (one for each of the three domains of life for both LSU and SSU) taken from the
  RNAmmer tool (Lagesen et al. 2007) are searched against the complete EMBL-Bank.
  Sequences that match one of the HMMs and were not already imported by one of the two
  previous approaches are added.
In all cases, the entries in the datasets are flagged by their origin of retrieval.

Alignment

After import, sequences are aligned using the SINA software (SILVA Incremental Aligner)
(Pruesse et al. 2012). Similar to the ARB project, the tool follows the concept of an
incremental alignment. Briefly, no de novo multiple sequence alignment is created;
instead, the highly accurate manual alignment of closely related sequences found in the
corresponding alignment SEED is used as a template to align each sequence included in
the SILVA datasets. This approach guarantees a high-quality alignment of rDNA
sequences.

Quality Control

Every imported and aligned SSU and LSU gene sequence has to pass a multistage quality
inspection to ensure the high quality of the SILVA datasets. Sequences are checked for
sequence and alignment quality using various parameters. Sequences are excluded from
the SILVA releases in case they fail any of the applied tests or show reduced quality
based on combined quality values. Additionally, sequences are tested for anomalies but
no filtering is done by the SILVA project based on these results. The information is
provided to the users for individual filtering of the datasets, if required.
Detailed statistics on the SILVA quality control can be found on the SILVA web portal
for all SILVA releases, e.g.,
http://www.arb-silva.de/documentation/background/release-111/.

Sequence Quality
The SILVA sequence quality checks test for ambiguous bases, extended homopolymeric
stretches, and vector contaminations.
Ambiguous bases are nucleotides representing valid characters according to the
International Union of Pure and Applied Chemistry (IUPAC) DNA encoding but do not
resolve to “A”, “C”, “G”, or “T”. A maximum of two percent of ambiguous nucleotides
within the rRNA gene boundaries is allowed by SILVA.
Homopolymers are stretches of identical nucleotides that commonly appear with a maximum
of up to four nucleotide repetitions in native rDNAs. In contrast, extended stretches
within a sequence represent an indication of reduced sequence quality caused by the
sequencing process. As a consequence, if homopolymers of five or more nucleotides are
found within a sequence and these stretches account for more than 2 % of the sequence
within the rRNA gene boundaries, the sequence is excluded from the SILVA datasets.
Unaligned overhangs of a sequence are checked against the vector SEED using BLAST
(Altschul et al. 1998) to identify cloning artifacts. If it is likely that the
unaligned part of a sequence is a vector sequence and the unaligned part is longer than
the aligned part, the sequence is excluded from SILVA. Sequences in SILVA are not
allowed to contain more than 2 % vector contamination.
The three parameters are combined into an overall “sequence quality” value. This score
represents the mean of the three individual parameters. It is normalized to values in
the range of 0–100, such that 100 represents the best possible quality of a sequence.
All thresholds to reject a sequence were defined based on statistical analysis of the
retrieved SSU and LSU rDNA sequences.

Alignment Quality
Four characteristics of the alignment process are evaluated in the pipeline and
a sequence is rejected if it fails to pass one of these: the base-pair score, the
alignment quality, the alignment identity, and the alignment length within the
boundaries of the rRNA gene.
The base-pair score is calculated from the number of bases involved in helix binding
according to the secondary structure model of Gutell et al. (1994).
The alignment quality score is a measure of the identity of the query sequence to the
reference sequences that are used as a template for the alignment. High values (>90)
indicate that closely related sequences have been found in the alignment SEED and that
the resulting alignment is likely to be accurate. Low values suggest that further
manual inspection of the particular sequence is needed.
Additionally, the alignment identity of the query sequence to its closest relative in the
alignment SEED is considered to guarantee the specificity of the alignment. Two
positions in the alignment are considered identical if both positions have the same
unambiguous nucleotide according to the IUPAC encoding.
To fit the SILVA unified scoring scheme, the base-pair and alignment quality scores are
normalized to values between 0 and 100, such that 100 represents the maximum score.

Chimera/Anomaly Detection
To detect sequence anomalies, a customized version of the Pintail software (Ashelford
et al. 2005) is used. This software checks whether a pair of sequences is mutually
anomalous (e.g., chimeric) by computing a distance profile and comparing it to
a predicted distance profile. The result is “yes,” “likely,” or “no,” depending on the
amount of measured deviation from expectation. From this operation, the SILVA Pintail
score is constructed by running each sequence against the ten most similar sequences
retrieved from the chimera SEED. Sequences that have passed all tests with “no” (not
anomalous) get a score of “100 %,” whereas all tests returning “likely” would yield
a 50 % score. Only SSU sequences are checked for anomalies because the Pintail software
does not contain profiles for sequences other than 16S rDNAs.

Third-Party (Meta) Data

One of the unique features of the SILVA datasets is extensive data integration based on
various third-party resources and manifold linkage of the SILVA database entries to
external data sources.

Taxonomies
Every sequence in the SILVA databases carries the EMBL-Bank taxonomy assignment. Where
available, the greengenes (DeSantis et al. 2006) and RDP (Cole et al. 2005) taxonomies
are added for comparison. All entries of the SILVA Ref datasets are also assigned to
the taxonomy of the SILVA project (see section “SILVA Taxonomy”).
For LSU rDNA sequences, only the EMBL-Bank and SILVA taxonomy are available due to
a lack of additional resources.

Nomenclature
With every SILVA release, all organism names are updated according to the “Nomenclature
Up-to-Date” website of the “Deutsche Sammlung für Mikroorganismen und Zellkulturen”
(DSMZ). All synonyms and name replacements are recorded.

Strain Annotation
The strain field of an entry in the SILVA datasets is annotated using SILVA-specific
labels if an entry matches one or more of the following criteria:
• The label “e[G]” is added if an entry is part of the list of genomes offered by the
  EBI.
• The label “l[T]” is added if the entry is part of the type strain datasets of “The
  All-Species Living Tree” Project (Munoz et al. 2011).
• The label “s[T]” is added if an entry is listed as a type strain by the StrainInfo
  project (Dawyndt et al. 2005).
• The label “s[C]” is added if an entry is a cultured strain according to the
  StrainInfo project.
• The label “r[T]” is added if an entry is listed as a type strain by the RDP project.
Furthermore, manually curated habitat information and GPS coordinates are assigned to
each entry based on information provided by the megx.net project (Kottmann et al. 2010).

SILVA Website/Online Service

One of the problems associated with the ever-increasing number of sequences is the
hardware resources required to store and analyze the data. To allow users to still work
with these datasets, features that require comprehensive reference datasets, such as
probe and primer evaluation for testing the in silico accuracy of oligonucleotide
signatures, are now offered by the SILVA web portal. Additionally,
the SILVA website offers extensive data retrieval
SILVA Databases, Fig. 2 The entry of Amorphus coralli (DQ097300) within the genus Amorphus displayed in the
SILVA “Taxonomy Browser”
functions for the compilation of individual sequence subsets from the comprehensive
online database as well as preconfigured, quality-constrained subsets for direct
download.

Taxonomy Browsing, Searching, and Download Cart
The SILVA “Taxonomy Browser” allows navigation through a selected taxonomy by clicking
on the respective nodes. The browser starts by showing all taxonomic groups of the
highest level of the selected taxonomy. When one of these groups is selected, a new
list view appears with all subgroups, preserving the former levels within a horizontal
scroll bar layout. If a sequence entry is selected, a detailed summary will be opened.
This summary shows the full annotation of an entry and a traffic-light-like view of the
main quality parameters (Fig. 2).
The browser can also be used to create customized subsets of the SILVA databases and to
display the results of the online services provided by SILVA. For each taxonomic group
in the browser, the fraction of corresponding sequences in the cart can be highlighted
(Fig. 2).
The advanced search functionality offered on the SILVA website allows the user to
easily compile custom subsets of sequences. Besides simple searches, e.g., for
accession numbers, organism names, taxonomic entities, or publication DOI/PubMed IDs,
complex queries including several database fields are also possible. Constraints such
as the sequence length or quality values can be used to further filter the sequences.
Customized sequence subsets compiled by the user including the results of the SILVA
online services can be collected in the SILVA cart system and downloaded in various
formats.
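
Length and quality constraints of the kind mentioned above can also be applied locally to a downloaded subset. The sketch below is an illustration under stated assumptions, not SILVA code: the function names are hypothetical, the 2 % limits and the five-nucleotide homopolymer definition are taken from the “Sequence Quality” section, the default minimum length of 1,200 is borrowed from the SSU Ref criterion, and percentages are computed over the whole input string rather than within the rRNA gene boundaries.

```python
import re

UNAMBIGUOUS = set("ACGT")


def percent_ambiguous(seq):
    """Percentage of characters that are not A, C, G, or T (U is read as T)."""
    seq = seq.upper().replace("U", "T")
    if not seq:
        return 0.0
    return 100.0 * sum(base not in UNAMBIGUOUS for base in seq) / len(seq)


def percent_homopolymer(seq, min_run=5):
    """Percentage of the sequence covered by runs of >= min_run identical bases."""
    seq = seq.upper()
    if not seq:
        return 0.0
    pattern = r"([ACGTU])\1{%d,}" % (min_run - 1)
    covered = sum(len(m.group(0)) for m in re.finditer(pattern, seq))
    return 100.0 * covered / len(seq)


def passes_constraints(seq, min_length=1200, max_ambiguous=2.0, max_homopolymer=2.0):
    """Apply a length constraint plus the two sequence-level quality limits."""
    return (len(seq) >= min_length
            and percent_ambiguous(seq) <= max_ambiguous
            and percent_homopolymer(seq) <= max_homopolymer)
```

Vector contamination is not covered here because it requires a search against a vector reference set.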
SILVA Databases, Fig. 3 The web interface and results of the SILVA “TestProbe” service

Alignment, Sequence-Based Searches, and Classification
Users can align their own sequences using the SILVA SSU and LSU SEEDs with a fully
configurable online version of the SILVA aligner (SINA). The aligned sequences can be
downloaded in either ARB or FASTA file formats.
Submitted sequences can also be searched against one of the predefined datasets (Parc,
Ref, or NR). This function will return a list of closely related sequences which can be
added to the cart system for building and downloading customized datasets.
Finally, the “Least Common Ancestor” feature of the aligner can be used to classify
sequences against any of the taxonomies provided by the SILVA project.

TestProbe
The SILVA probe match and evaluation tool called “TestProbe” detects and displays all
occurrences of a given probe or primer sequence within any specified SILVA datasets or
subsets thereof. It is offered to test and visualize in silico specificity and target
group coverage (sensitivity) of rDNA-targeting probes and single primers against the
SILVA datasets. The tool can be
SILVA Databases, Fig. 4 The web interface and results of the SILVA “TestPrime” service
configured to allow up to five mismatches between probe and target sequences and
mismatches can be weighted. The resulting number of matches and non-matches is shown as
a set of pie charts (Fig. 3), and an additional list provides sequence names, accession
numbers, and a graphical representation of the probe’s binding site within all matches.
Sequences in this list can be added to the cart system for subsequent download.

TestPrime
Similar to the SILVA “TestProbe” tool, “TestPrime” allows searching for all sequences
within the SILVA datasets or subsets thereof which are targeted by a given pair of
primers. The number of allowed mismatches can be configured and results are shown in
overview pie charts (Fig. 4) and the corresponding sequences can be selected for
download.

Summary

The SILVA project provides comprehensive, quality-controlled, richly annotated, and
aligned reference rDNA datasets to support the molecular assessment of biodiversity, as
well as
investigations of the evolution of organisms. Applications of these datasets range from
basic research in microbiology and molecular ecology to the detection of contaminants
and pathogens in biotechnology and medicine. The taxonomically fully classified Ref and
Ref NR datasets are perfectly suited for the classification of metagenomic or
amplicon-based next-generation sequencing data.
The combination of SILVA datasets with the ARB software suite provides an easy-to-use
workbench for researchers to perform in-depth sequence analysis and phylogenetic
reconstructions as well as manual curation of rDNA datasets. Furthermore, the SILVA
datasets have become an integral part of the MOTHUR (Schloss et al. 2009), QIIME
(Caporaso et al. 2010), and MG-RAST (Meyer et al. 2008) analysis tools and pipelines.

Cross-References

▶ A 123 of Metagenomics
▶ Computational Approaches for Metagenomic Datasets

References

Altschul S, Madden T, et al. BLAST and PSI-BLAST: a new generation of protein database
  search programs. FASEB J. 1998;12:A1326.
Ashelford KE, Chuzhanova NA, et al. At least 1 in 20 16S rRNA sequence records currently
  held in public repositories is estimated to contain substantial anomalies. Appl
  Environ Microbiol. 2005;71:7724–36.
Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community
  sequencing data. Nat Methods. 2010;7:335–6. Nature Publishing Group.
Cole JR, Chai B, et al. The Ribosomal Database Project (RDP-II): sequences and tools for
  high-throughput rRNA analysis. Nucleic Acids Res. 2005;33:D294–6.
Dawyndt P, Vancanneyt M, et al. Knowledge accumulation and resolution of data
  inconsistencies during the integration of microbial information sources. IEEE Trans
  Knowl Data Eng. 2005;17(8):1111–26.
DeSantis TZ, Hugenholtz P, et al. Greengenes, a chimera-checked 16S rRNA gene database
  and workbench compatible with ARB. Appl Environ Microbiol. 2006;72:5069–72.
Euzéby JP. List of bacterial names with standing in nomenclature: a folder available on
  the internet. Int J Syst Bacteriol. 1997;47(2):590–2.
Garrity GM, Bell JA, et al. Taxonomic outline of the prokaryotes. Bergey’s manual of
  systematic bacteriology. 2nd ed. New York: Springer; 2004. Release 5.0.
Grice EA, Kong HH, et al. Topographical and temporal diversity of the human skin
  microbiome. Science. 2009;324:1190–2.
Grice EA, Snitkin ES, et al. Longitudinal shift in diabetic wound microbiota correlates
  with prolonged skin defense response. Proc Natl Acad Sci U S A. 2010;107:14799–804.
Gutell RR, Larsen N, et al. Lessons from an evolving rRNA: 16S and 23S rRNA structures
  from a comparative perspective. Microbiol Rev. 1994;58:10–26.
Kottmann R, Kostadinov I, et al. Megx.net: integrated database resource for marine
  ecological genomics. Nucleic Acids Res. 2010;38:D391–5.
Lagesen K, Hallin P, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA
  genes. Nucleic Acids Res. 2007;35:3100–8.
Ludwig W, Strunk O, et al. ARB: a software environment for sequence data. Nucleic Acids
  Res. 2004;32(4):1363–71.
Meyer F, Paarmann D, et al. The metagenomics RAST server – a public resource for the
  automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics.
  2008;9:386.
Munoz R, Yarza P, et al. Release LTPs104 of the all-species living tree. Syst Appl
  Microbiol. 2011;34:169–70.
Pruesse E, Peplies J, et al. SINA: accurate high throughput multiple sequence alignment
  of ribosomal RNA genes. Bioinformatics. 2012;28:1823–9.
Schloss PD, Westcott SL, et al. Introducing mothur: open-source, platform-independent,
  community-supported software for describing and comparing microbial communities. Appl
  Environ Microbiol. 2009;75:7537–41.

Simultaneous Quantification of Multiple Bacteria

Annalisa Ballarini and Olivier Jousson
Laboratory of Microbial Genomics, Centre for Integrative Biology (CIBIO), University of
Trento, Trento, Italy

Synonyms

Composition assessment; Abundance determination
gene for high-throughput detection of microbial communities both in the environment and
clinical samples (Brodie et al. 2007; Ghosh et al. 2009; Wu et al. 2010). Several
versions were developed, the latest being the G3 version (Hazen et al. 2010). The G3
comprises 1.1 million DNA probes and covers nearly 60,000 operational taxonomic units.
In order to increase the reliability, some microarray designs, including the
PhyloChips, define multiple target regions within the marker adopting a so-called
multiple probe concept to increase the overall detection score accuracy.
HOMIM (Preza et al. 2009) is the acronym for Human Microbial Identification Microarray,
a tool developed to detect simultaneously 300 bacterial species from the oral
microbiome, including non-cultivable ones. The target bacteria were selected among the
ones identified by 16S rRNA sequencing in healthy roots and root caries in the elderly.
Experiments performed with this array showed a general agreement in the results with
16S rRNA gene sequencing analysis. Since 2008, a core facility at the Forsyth Institute
(Cambridge, Massachusetts) has provided a service, based on this platform, to rapidly
screen clinical samples from the oral cavity, esophagus, and lungs.
The HITChip (human intestinal tract chip) (Rajilic-Stojanovic et al. 2009) is
a microarray-based metagenomic tool designed for profiling the human gastrointestinal
microbiota. This phylogenetic microarray comprises 4,809 oligonucleotide probes and
discriminates 1,140 species via two hypervariable regions of the small subunit
ribosomal RNA (SSU rRNA) gene. The validation performed with SSU rRNA clones and
clinical samples proved that this microarray provides a highly reproducible fingerprint
and also has quantification potential. In particular, tests performed with synthetic
mixtures showed it can detect 40 different amplicons and also those with a relative
abundance of 0.1 %. The HITChip was shown to correctly identify a universal microbiota
at genus-level resolution.
Overall, the most used phylogenetic marker is the 16S rRNA gene. The presence of highly
conserved regions flanking the variable 16S rRNA target sequence allows introducing
a PCR-based amplification step for bacterial target enrichment (e.g., GreenChipPm).
This pre-amplification step ensures an increased sensitivity in detection but does
increase the processing time and may introduce biases in relative abundance
quantitation.
Besides that, being universally conserved within the bacterial kingdom, the 16S rRNA
gene may not be sufficient for specific and reproducible bacterial identification,
especially in complex systems. In fact, the high conservation of this gene across taxa
has been reported to cause cross-hybridization events, affecting both resolution and
abundance determination, and to fail to discriminate below the genus level for many
clades.
Recently, as an alternative to 16S or single-marker arrays for microbial profiling,
a multiple-marker phylogenetic microarray has been designed, the BactoChip (Ballarini
et al. 2013). The array design was based on the notion that metagenomic sequencing data
offer a powerful view on the microbial diversity of the sampled communities and an
increasingly higher number of complete and annotated bacterial genomes are publicly
available.

The BactoChip: A Multi-marker Phylogenetic Microarray for Species-Level Resolution

The BactoChip (Ballarini et al. 2013) was designed with the aim to overcome the issues
of resolution and abundance determination of 16S-based microarrays and thus approach
the throughput and specificity of sequence-based techniques. To date, one version of
the BactoChip has been described, detecting via a PCR-independent approach a set of 54
bacterial species belonging to multiple genera of clinical interest. The number of
target bacteria was limited by the availability of typed strains for experimental
validation and of complete bacterial genome sequences for computational microarray
design. However, the developed method for marker selection may be extended to the whole
microbial world, thus allowing high accuracy of
microbial composition assessment even in complex samples.
Computational and Experimental Design. The BactoChip in silico design is based on the
knowledge deriving from metagenomic datasets and complete bacterial genome sequences.
The computational tool for DNA marker identification employed a pairwise identity
threshold above 99 % to define core genes for most species, where core genes are those
shared by all available sequenced strains of the same species. Unique genes (i.e., core
genes unique for each bacterial species) were then selected by removing all core genes
with blastn hits outside the target species. Probes targeting an average of 10 markers
per bacterial species were designed to have similar physicochemical parameters and were
directly synthesized on “custom high-definition Agilent DNA Comparative Genomic
Hybridization arrays 8x15K” (Agilent Technologies, Santa Clara, CA, USA). Besides
internal control and other probes, the BactoChip includes 2,094 marker gene probes
targeting 54 bacterial species.
Testing on Pure Isolates, Synthetic Communities, and Clinical Samples. The BactoChip
was validated by performing hybridization experiments with 37 bacterial species
individually, multiple congeneric species, and synthetic bacterial communities of up to
15 microorganisms. Also, it was tested with oral microbiomes from two healthy subjects
spiked with 5 different species at known relative abundance. Single reference strains
used for validation were collected from the LGC Standards ATCC, the Leibniz Institute
DSMZ, or university hospitals. Synthetic communities were obtained by mixing single
strains in known DNA quantities. Oral microbiomes were collected from saliva, DNA was
extracted with standard protocols, and the bacterial load was determined by real-time
PCR. The BactoChip identified univocally almost all tested species (97.3 %) from 19
genera with near-perfect accuracy (AUC > 0.99). In case of malfunctioning probes (false
negative or false positive), the presence of multiple probes per marker gene and
multiple genes per species prevented species misidentification. Testing performed with
multiple congeneric bacterial species from the Staphylococcus genus showed how this
microarray design can resolve to the species level even genera known to be poorly
resolved by the 16S marker genes. The performance of the BactoChip in identifying
bacteria and determining relative abundances was tested by means of synthetic bacterial
communities comprising 9 and 15 different species at even and staggered concentrations.
The species-level specificity was also confirmed in this experimental setting. The
microarray quantified both bacterial communities with high accuracy, with an overall
high correlation (0.97, p < 10^-10) between reference relative abundance values and
estimated ones. Experiments performed on saliva microbiomes isolated from healthy
volunteers, spiked with reference species in known amounts, proved the feasibility of
this approach for microbiome profiling, and detected the native and spiked-in species
within clinical samples over a 100-fold dynamic range.

Summary and Conclusions

High-throughput metagenomic technologies have provided an extensive amount of data on
microbial composition, functions, and dynamics, accelerating the development of
complementary or alternative methods for environmental studies, clinically oriented
studies, and routine diagnostics. Definitely, next-generation sequencing technology
leads, without the need of a priori knowledge, to the maximum amount of information on
the genomic sequences’ composition of a microbial sample. However, this technology
requires complex computational analyses to extrapolate information of interest and
still requires high costs and processing times.
Among the alternative molecular-based techniques currently available (multiplex,
real-time PCR, or array-based assays), microarrays represent the most promising
technique for parallel detection and relative abundance quantitation of bacteria in
complex microbial samples, combining high throughput with a user-friendly rapid
protocol and a low cost per sample. Besides
S 640 Simultaneous Quantification of Multiple Bacteria
corresponding STAMP profile. Additional columns may specify any other data relevant to the samples being considered. Within STAMP, these additional columns can be used to define groups (i.e., collections of one or more samples) over which statistical tests and plots can be calculated. An example metadata file is given in Table 2.

Statistical Analysis of Metagenomic Profiles

STAMP provides statistics for assessing biologically relevant differences between pairs of metagenomic samples or treatment groups. Two-sample (e.g., Fisher’s exact test, G-test), two-group (Welch’s t-test, White’s nonparametric t-test), and multigroup (ANOVA, Kruskal-Wallis H-test) statistical hypothesis tests are provided for identifying statistically significant features. Features with p-values below a nominally chosen threshold (e.g., 0.05) can reasonably be assumed to be enriched or depleted due to ecological differences between samples or treatment groups as opposed to representing a sampling artifact. STAMP also reports effect size statistics such as the difference or ratio between proportions in order to aid in determining if a statistically significant feature is of biological relevance. Consideration of effect sizes is essential as small, biologically uninteresting differences may be statistically significant when sample sizes are large. Confidence intervals are computed for all effect size statistics. These indicate the range of effect size values that have a specified probability (typically 95 %) of being compatible with the observed data and are an important additional statistic for reasoning about biological relevance.

Metagenomic profiles typically consist of several hundred or thousand features. Care must be taken when performing multiple hypothesis tests. For example, a profile consisting of 1,000 features is expected to contain about 50 features with a p-value less than 0.05 simply due to chance variation, even when no true differences exist. STAMP provides two techniques for correcting p-values when multiple hypothesis tests are being performed. The first controls the familywise error rate using a correction method such as Bonferroni, Holm-Bonferroni, or Šidák. This adjusts the reported p-values so that the probability of observing one or more false positives is less than a specified probability. During data exploration, this approach can be too conservative and it may be beneficial to adjust the p-values using a false discovery rate procedure. Under this approach, a q-value is calculated for each feature that indicates the expected proportion of false positives within the set of features with a smaller q-value (Benjamini and Hochberg 1995). Additionally, STAMP can filter features using a number of criteria in addition to p- or q-values in order to focus on biologically interesting features, e.g., those with a large effect size or consisting of a substantial number of reads.

Exploration of Metagenomic Profiles

STAMP provides the following interactive, publication-quality plots for exploring metagenomic profiles:

Bar plots indicate the proportion of sequences of each feature within a pair of samples or the proportion of sequences of a single feature across all samples (Fig. 1a).
Box plots illustrate how the proportion of sequences of a single feature is distributed within different treatment groups using a box-and-whiskers graphic (Fig. 1b). Box-and-whiskers graphics show the median of the data as a line, the mean of the data as a star, the 25th and 75th percentiles of the data as the bottom and top of the box, and use whiskers to indicate the most extreme data point within 1.5*(75th–25th percentile) of

STAMP: Statistical Analysis of Metagenomic Profiles, Table 2 Example metadata file
Sample Id   Location   Phenotype   Gender   Sample size
Sample 1    Canada     Obese       Female   4,000
Sample 2    Canada     Lean        Male     2,000
Sample 3    Italy      Lean        Female   3,000
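The multiple test corrections named in the text (Bonferroni, Holm-Bonferroni, Šidák, and the Benjamini-Hochberg false discovery rate procedure) can be illustrated with their textbook formulas. The following is a minimal sketch in plain Python, not STAMP's own implementation; the function names and the example p-values are assumptions introduced here for illustration only.

```python
def expected_false_positives(n_features, alpha):
    """Expected number of features with p < alpha when no true differences exist."""
    return n_features * alpha


def bonferroni(pvalues):
    """Familywise error rate control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Familywise error rate control assuming independent tests."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm-Bonferroni step-down adjustment (never larger than Bonferroni)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment; returns one q-value per feature."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]  # hypothetical uncorrected p-values
    print(expected_false_positives(1000, 0.05))
    print(bonferroni(example))
    print(holm(example))
    print(benjamini_hochberg(example))
```

On the same input, the familywise corrections give adjusted values at least as large as the false discovery rate adjustment, which reflects why the text calls them potentially too conservative during data exploration.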
the median. Data points outside of the whiskers are shown as crosses.
PCA plots give the first three principal components of a metagenomic profile as determined by applying principal component analysis (Fig. 1c). Clicking on a marker within the plot indicates the sample represented by the marker.
Post hoc plots contrast each pair of groups considered in a multigroup statistical hypothesis test (Fig. 1d). It indicates the mean proportion of sequences within each group, the difference

[Figure 1, panels a–d. Recoverable labels: Bacteroides; PC2 (21.2%); p-value: Enterotype 1 : Enterotype 2 <0.001, Enterotype 3 : Enterotype 2 ≥0.1]
STAMP: Statistical Analysis of Metagenomic Profiles, Fig. 1 Exploration of the gut microbiota of 32 individuals reported by Arumugam et al. (2011) to form three distinct clusters or enterotypes. (a) Bar plot showing the relative proportion of Bacteroides. Samples are colored according to the enterotype to which they have been assigned. (b) Box plot showing the distribution in the proportion of Bacteroides from samples assigned to each enterotype. (c) Principal coordinate analysis plot determined from the proportion of reads assigned to each genera within a sample. (d) Post hoc plot for Bacteroides indicating (1) the mean proportion and standard deviation within each enterotype, (2) the difference in mean proportions between each pair of enterotypes along with 95 % confidence intervals, and (3) a p-value indicating if the mean proportion is equal for a given pair
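The two-group comparison that underlies these plots, a difference in mean proportions with a confidence interval and Welch's t-test, can be sketched as follows. This is a hedged illustration using the standard Welch formulas, not STAMP's code. The function names, the example proportions, and the critical value passed to `confidence_interval` (taken from a t table for 6 degrees of freedom) are assumptions; converting the t statistic into a p-value needs a t-distribution function from a statistics library and is left out.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (difference in means, standard error, t statistic, degrees of freedom).

    Uses Welch's unequal-variance formulas; each group needs at least two
    values and the groups must not both have zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two samples")
    v_a = variance(group_a) / n_a  # squared standard error of mean A
    v_b = variance(group_b) / n_b  # squared standard error of mean B
    diff = mean(group_a) - mean(group_b)  # effect size: difference in mean proportions
    se = math.sqrt(v_a + v_b)
    t_stat = diff / se
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return diff, se, t_stat, df


def confidence_interval(diff, se, t_crit):
    """Confidence interval for the difference, given a t critical value."""
    return diff - t_crit * se, diff + t_crit * se


if __name__ == "__main__":
    # Hypothetical proportions of one genus in two treatment groups
    group_1 = [0.30, 0.34, 0.32, 0.36]
    group_2 = [0.20, 0.22, 0.18, 0.24]
    diff, se, t_stat, df = welch_t(group_1, group_2)
    print(diff, t_stat, df)
    # 2.447 is the two-sided 95 % critical value for 6 degrees of freedom
    print(confidence_interval(diff, se, 2.447))
```

Reporting the interval alongside the test statistic follows the text's point that effect sizes and their confidence intervals, not p-values alone, indicate biological relevance.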
in mean proportions for each pair of groups along with the confidence interval of this effect size statistic, and a p-value indicating if the mean proportion is equal for a given pair.
Extended error bar plots display the p-value, effect size, and associated confidence interval for all unfiltered features in a metagenomic profile (Fig. 2). In addition, a bar plot indicates the proportion of sequences assigned to a feature in each sample or group. This provides all information required to reason about the biological relevance of a feature in a single plot.
Scatter plots indicate either the proportion of sequences or mean proportion of sequences assigned to each feature within a pair of samples or a pair of treatment groups, respectively. This plot is useful for identifying features that are clearly enriched in one of the two samples or groups. When considering a pair of samples, confidence intervals calculated with the Wilson score method can be shown. For a pair of treatment groups, different statistics indicating the spread of the data can be displayed (e.g., standard deviation, minimum and maximum proportions).

All plots provide a range of customization options. For example, PCA plots can be restricted to the first two principal components, and individual panels of the extended error bar plot can be selectively hidden. Plots can be saved in either vector (PDF, PS, EPS, SVG) or raster (PNG) formats. The resolution of raster files can be set to allow for generation of plots suitable for printed publication or display on posters.
Tabular views of statistical results are also provided and columns can be sorted to help identify interesting patterns. Tables can be saved as tab-separated value files for subsequent display in any text editor or spreadsheet program or for inclusion as supplemental information in publications.

Summary

Statistics can greatly aid in the comparison of metagenomic profiles. STAMP provides a simple graphical environment for performing statistical analyses that are tailored to the needs of comparative metagenomic studies. It provides a range of statistical hypothesis tests and can identify statistically significant features between pairs of samples or defined treatment groups. Different multiple test correction methods are provided in order to account for the large number of features

[Figure 2, extended error bar plot. Genera and uncorrected p-values shown: Peptostreptococcus 0.017; Heliobacterium 0.029; Parvimonas 0.033; Aliivibrio 0.054; Bradyrhizobium 0.062; Anaerococcus 0.088; Geobacillus 0.098]
STAMP: Statistical Analysis of Metagenomic Profiles, Fig. 2 Exploration of compositional differences in the gut microbiota of males and females sampled by Arumugam et al. (2011). The extended error bar plot indicates all genera where Welch’s t-test produces an uncorrected p-value < 0.1. All genera are overabundant within the gut microbiota of males (M) compared to females (F)
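The Wilson score method mentioned for scatter plots of sample pairs gives a confidence interval for a single proportion, such as the fraction of reads in one sample assigned to a feature. A minimal sketch of the standard formula follows. The function name and the read counts in the usage example are assumptions (the total of 4,000 reads is borrowed from the example metadata in Table 2); this is not STAMP's own code.

```python
import math


def wilson_interval(successes, total, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    successes: reads assigned to the feature; total: all reads in the sample;
    z: standard normal quantile (default corresponds to 95 % coverage).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical: 200 of 4,000 reads in a sample are assigned to one feature
    print(wilson_interval(200, 4000))
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and remains informative for features with very few or zero reads, which are common in metagenomic profiles.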
typical of metagenomic profiles and to aid in data exploration. The biological relevance of significant features can be assessed through a range of publication-quality plots that provide key statistics such as effect sizes and confidence intervals. Interactive filtering allows the most biologically interesting features to be quickly identified and plots of specific features to be generated. STAMP’s wide range of statistics and simple interactive interface make it a valuable tool in comparative metagenomic studies.

Cross-References

▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3

References

Arumugam M, Raes J, Pelletier E, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–80.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
Lingner T, Aßhauer KP, Schreiber F, Meinicke P. CoMet – a web server for comparative functional profiling of metagenomes. Nucleic Acids Res. 2011;39 Suppl 2:W518–23.
MacDonald NJ, Parks DH, Beiko RG. Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Res. 2012;40:e111.
Markowitz VM, Ivanova NN, Szeto E, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36(Database issue):D534–8.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010;26:715–21.
Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.

Subtractive Hybridization Magnetic Bead Capture: Molecular Technique for Recovery of Full-Length ORFs from Metagenomes

Don Cowan, Sandra Ronca and Jean-Baptiste Ramond
Centre for Microbial Ecology and Genomics (CMEG), Genome Research Institute (GRI), University of Pretoria, Hatfield, Pretoria, South Africa

Synonyms

Recovery of full-length ORFs from metagenomic DNA

Definition

Subtractive hybridization magnetic bead capture (SHMBC) is a sequence-based metagenomic technique for the recovery of full-length ORFs from heterogeneous metagenomic DNA samples.

Introduction

It is widely acknowledged that the vast majority (~99 %) of microorganisms present in the environment are resistant to culture using classical microbiological methods. Approximately half of the total estimated bacterial phyla (61) are still to be cultured (Vartoukian et al. 2010). However, environmental microbial communities constitute a valuable resource for biotechnology and are a valid target for identification of novel genes and/or biological compounds such as biocatalysts or secondary metabolites (Sharma et al. 2005). In order to bypass the limitations of microbial culturing and to discover new microbial genes and functions, two approaches have been implemented, either culture-based, through the development of innovative strategies and media
[Figure 1, schematic. Recoverable labels: Heat denaturation; Subtractive hybridization: 1. Hybridization / 2. Wash]
Subtractive Hybridization Magnetic Bead Capture: Molecular Technique for Recovery of Full-Length ORFs from Metagenomes, Fig. 1 Schematic subtractive hybridization magnetic bead capture (SHMBC) protocol

used as a pre-enrichment tool prior to performing post-functional studies or sequence-based functional analyses.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Metagenomic Research: Methods and Ecological Applications
▶ Mining Metagenomic Datasets for Antibiotic Resistance Genes
▶ Mining Metagenomic Datasets for Cellulases
▶ Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies

References

Cowan D, Meyer Q, Stafford W, Muyanga S, Rory Cameron R, Wittwer P. Metagenomics gene discovery: past, present and future. TRENDS Biotechnol. 2005;23:321–9.
Harris RP, Groth DM, Ledger J, Lee CY. Identification of sex specific DNA regions in the snake genome using a subtractive hybridization technique. Proc Assoc Advmt Anim Breed Genet. 2009;18:572–5.
Knauth S, Schmidt H, Tippkotter R. Comparison of commercial kits for the extraction of DNA from paddy soils. Lett Appl Microbiol. 2013;56:222–8.
Latisnere-Barragan H, Lopez-Cortes A. Isolation of phaC gene from marine bacteria Paracoccus homiensis strain E33 by magnetic beads subtractive hybridization. Ann Microbiol. 2012;62:1691–5.
Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. Comparison of next-generation sequencing systems. J Biomed Biotechnol. 2012. doi:10.1155/2012/251364.
Meiring T, Mulako I, Tuffin MI, Meyer Q, Cowan DA. Retrieval of full-length functional genes using subtractive hybridization magnetic bead capture. Methods Mol Biol. 2010;668:287–97. Clifton, NJ.
Meyer QC, Burton SG, Cowan DA. Subtractive hybridization magnetic bead capture: a new technique for the recovery of full-length ORFs from metagenome. Biotechnol J. 2007;2:36–40.
Morales SE, Holben WE. Linking bacterial identities and ecosystem processes: can ‘omic’ analyses be more than the sum of their parts? FEMS Microbiol Ecol. 2011;75:2–16.
Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform. 2010;11:457–72.
Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
Roh C, Villatte F, Kim B-G, Schmid RD. Comparative study of methods for extraction and purification of environmental DNA from soil and sludge samples. Appl Biochem Biotechnol. 2006;134:97–112.
Schmeisser C, Steele H, Streit WR. Metagenomics, biotechnology with non-culturable microbes. Appl Microbiol Biotechnol. 2007;75:955–62.
Sharma R, Ranjan R, Kapardar RK, Grover A. ‘Unculturable’ bacterial diversity: an untapped resource. Curr Sci. 2005;89:72–7.
Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microbiol Inform Exp. 2012;2:3–3.
Vartoukian SR, Palmer RM, Wade WG. Strategies for culture of ‘unculturable’ bacteria. FEMS Microbiol Lett. 2010;309:1–7.
Wang J, McCord B. The application of magnetic bead hybridization for the recovery and STR amplification of degraded and inhibited forensic DNA. Electrophoresis. 2011;32:1631–8.
Wang HH, Zhao CY, Li F. Rapid identification of mycobacterium tuberculosis complex by a novel hybridization signal amplification method based on self-assembly of dna-streptavidin nanoparticles. Braz J Microbiol. 2011;42:964–72.
Waschkowitz T, Rockstroh S, Daniel R. Isolation and characterization of metalloproteases with a novel domain structure by construction and screening of metagenomic libraries. Appl Environ Microbiol. 2009;75:2506–16.
Zhou JZ, Bruns MA, Tiedje JM. DNA recovery from soils of diverse composition. Appl Environ Microbiol. 1996;62:316–22.
T
Taxa Counting Using Specific Peptides of Aminoacyl tRNA Synthetases

David Horn
School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel

Synonyms

Short Read Analysis; Specific Peptides; Taxa Counting

Definition

Motifs that appear on Aminoacyl tRNA Synthetases can serve as specific peptides (SP) whose presence in a metagenome indicates which taxa it contains. This is used to devise a method, based on gene fragments rather than on 16S rRNA sequences, which allows for taxa counting from short read metagenomic data. It is exemplified on human gut microbial data.

Introduction: The SP Approach

Specific peptides (SPs) are short deterministic motifs whose presence in the protein sequence is a good predictor of an enzymatic activity of the protein. SPs were introduced by Kunik et al. (2007), and their predictive powers on full protein sequences were established by Weingart et al. (2009). Their results are the basis of the webtool http://horn.tau.ac.il/DME11.html which supplies enzymatic assignments for queried protein sequences. This methodology has been applied directly to short reads, obtaining enzymatic and taxonomic signatures of data, by Weingart et al. (2010). These authors have extracted a set of SPs that are associated with single proteins of the aaRS families, known as the S61 set (because the EC numbers of these enzymes, indicating their 4-level enzymatic classification, start with 6.1.1.). The application of SPs to taxa counting in metagenomic data has been developed by Persi et al. (2012). To ensure high precision of the prediction process, it is required that the length of the SPs in the S61 set is at least nine amino acids. The resulting list contains 3,949 SPs.

The Taxa Counting Algorithm

For short read data one first converts all genomic reads to amino acid strings in the six possible reading frames. One then identifies all reads that share a single SP. Choosing the largest group of such reads, one tries to group the short reads into sets such that all reads within a set are consistent with one another (i.e., can be fused with each other) and every set is inconsistent with the other ones. Although this mathematical problem is NP-complete, one may devise simple algorithms that carry it out efficiently (Persi et al. 2012). The strong consistency conditions can be relaxed to allow for errors, such that reads within a set may differ from each other by one amino acid, and different sets have to differ from each other by at least two amino acids. The number of different sets becomes a lower bound on the number of different taxa. For short reads, distinguishing between species belonging to the same genus is impossible. Depending on the length of the short reads, however, chances are high for distinguishing between different families, classes, and phyla. For the case of long sequences or extensive contigs, one can resort to searching for sequences that share several SPs of the same aaRS enzyme. This allows one to address the question of counting different species or even different strains of the same species.

Tests and Applications

Persi et al. (2012) have compared the S61 SP approach with the 16S rRNA analysis on an artificial metagenome composed of 64 genomes of different species that represent bacterial taxonomic diversity. For some of the principal phyla, they selected pairs of strains of the same species, such that the resolutions of the taxonomic delineation of the two methods can be tested and compared. The SP approach has been shown to match the accuracy provided by the 16S analysis and sometimes even to surpass it. The novel method has then been applied to species counting in the human gut microbiome employing the data of Qin et al. (2010). These data were based on samples taken from 124 individuals. In addition to raw short read data, the authors have presented genomic contigs, as well as a nonredundant set of 3.3 M ORFs derived from full genomic analysis (also called “prevalent genes”). The analysis of the prevalent genes has led Qin et al. (2010) to conclude that there exist more than 1,000 different species in their metagenomic data. Persi et al. (2012) have argued that the prevalent genes, when analyzed using the S61 approach, display only half this count. If, however, the full set of contigs is analyzed, an estimate of over 1,000 different species and strains is obtained. The number of different genera has been estimated to be relatively small, presumably of the order of a few tens.
Of particular interest is the application of the novel method to short read data. Here this method is unique. It allows for a quick estimate of species count directly from raw data. Short read singletons that are often discarded from metagenomic analysis, because they cannot combine with other short reads to form longer contigs, can be readily included in this analysis. Moreover, one can test the sensitivity of the results to sample size, to the minimal distance d allowed between reads that are classified in the same taxa, and to noise in the data.
The raw data contain errors, and every misidentification of an amino acid will affect taxa counts. The probability of such errors was estimated to be below 1 %. This was then tested by inserting artificial random errors at the level of 1 % into analyzed reads. The results showed that the d ≥ 2 counts of the set with artificial errors are similar to the d ≥ 1 estimates drawn from the raw data. One may therefore conclude that limiting oneself to d ≥ 2 analysis of the raw data suffices to eliminate the majority of errors in the data. Sample sizes of order 1,000 short reads of the Qin et al. (2010) data lead to counts of 200 or more taxa. The counts keep increasing linearly with sample size, indicating that greater depth unravels larger numbers of strains and species. Focusing on large distances between reads, such as d ≥ 7, the taxa counts in the analysis of Persi et al. (2012) saturate at about 60, providing a stable bound on the number of species that are expected to have quite large Hamming distances (over 150) between their relevant protein sequences. Finally it is interesting to note that an analysis of Persi et al. (2012) carried out for all short reads of one of the subjects has shown 10 % novel species with respect to the contigs of Qin et al. (2010), and about 45 % novelties when compared to all UniProt enzymes.

Discussion

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ Human Gut Microbial Genes by Metagenomic Sequencing
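The relaxed grouping step of the taxa counting algorithm, in which reads sharing an SP are partitioned into sets separated by a minimal amino acid distance d, can be illustrated with a simple greedy pass. This is only a sketch under stated assumptions: it is not the algorithm of Persi et al. (2012), it assumes the reads have already been translated and trimmed to equal-length peptide windows aligned on the shared SP, and the function names and example peptides are hypothetical. Because each peptide is compared only with one representative per set, members of one set may differ from each other by more than their distance to the representative, and the result can depend on input order.

```python
def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have equal length")
    return sum(x != y for x, y in zip(a, b))


def count_taxa(peptides, d=2):
    """Greedy estimate of the number of mutually distinct peptide sets.

    A peptide opens a new set only if it differs from every existing set
    representative by at least d amino acids; otherwise it is absorbed
    into an existing set.
    """
    representatives = []
    for pep in peptides:
        if all(hamming(pep, rep) >= d for rep in representatives):
            representatives.append(pep)
    return len(representatives)


if __name__ == "__main__":
    # Hypothetical nine-residue windows around one shared specific peptide
    reads = ["MKTAYIAKQ", "MKTAYIAKQ", "MKTAYIAKR", "MKSAYLAKQ", "GGGAYIAKQ"]
    for d in (1, 2, 3):
        print(d, count_taxa(reads, d))
```

Raising d merges near-identical peptides, which mirrors the text's observation that a d ≥ 2 analysis absorbs most single-residue sequencing errors while larger values of d yield smaller, more stable counts.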
against the NCBI NR protein database. All protein fragments in the database that have an alignment with the metagenomic sequence are extracted. These sequences, as well as the protein translation of the metagenomic DNA sequence, as given by the BLAST alignment between the metagenomic query and the best database hit, are used to create a small protein BLAST database. In the second step of CARMA3, the reciprocal BLAST search, the extracted protein fragment that corresponds to the best BLAST hit is searched against this database using BLASTp. Since the protein fragment that is searched against the database is included in the database, this database sequence produces a perfect alignment and yields the best BLAST bit score.
Let ti be the taxonomic affiliation of the ith best BLAST hit in the reciprocal search, and let x be the (unknown) species of the metagenomic sequence. Clearly t1, which is also the taxonomic affiliation of the best BLAST hit in the first BLAST search, is the phylogenetically closest known relative of x. Since the taxonomy assignment t1 is usually located at taxonomic rank species, strain, or substrain, and metagenomic sequences mostly come from species that are phylogenetically more distantly related, using t1 as taxonomic classification for x would be an overprediction. Therefore, the purpose of this method is to approximate the lowest common ancestor of t1 and x, which would be the best possible taxonomic classification.
The reciprocal search provides similarity scores in terms of BLAST bit scores between t1 and all other database sequences. Since the taxonomic affiliations of the other database sequences, except the metagenomic sequence, are known, the reciprocal search provides means to correlate BLAST bit scores with phylogenetic distances. Database sequences that are more closely related to t1 tend also to have higher reciprocal bit scores than the less closely related sequences. A toy example for this is given in Fig. 1a.
Each ti is projected onto the lowest common ancestor of ti and t1, a taxon within the lineage of t1. For each taxon in the lineage of t1 that gets projections from a subset of ti, an interval is defined by the minimum and the maximal reciprocal bit scores from the BLAST hits in this subset. Intervals for the reciprocal search example are depicted in Fig. 1b. These intervals can be used to assign a metagenomic sequence to a taxon in the lineage of t1 based on its reciprocal score. In general (case a) this method tries to assign the metagenomic sequence to the lowest taxonomic rank at which its reciprocal score is still within the borders of the interval at that rank. If such an interval does not exist (case b), the lowest taxonomic rank is chosen for which all bit scores are still lower than the bit score of the metagenomic sequence.
Two examples for the taxonomic classification are given in Fig. 1b. Metagenomic read q with unknown phylogenetic affiliation x has a reciprocal score of 90. Since the bit score of q is higher than the bit score of the single hit in the interval at rank family (t5 = 80), but smaller than any hit in the interval at taxonomic rank genus (t4 = 95 and t2 = 120), x gets assigned to the rank family (case b). The second metagenomic read q′ with reciprocal score of 105 is within the borders of the interval at rank genus and thus x′ gets assigned to the rank genus (case a).
Real data often do not show the properties assumed in this model, and sometimes reciprocal scores are missing for a taxonomic rank. To accommodate this, CARMA3 additionally employs techniques like polishing, linear interpolation, and a fallback method, described in detail in the original publication (Gerlach and Stoye 2011). CARMA3 is also available in a variant that is based on HMMER3 homology searches against the Pfam (Finn et al. 2010) database. In this variant the metagenomic sequences are aligned against Pfam family alignments from which reciprocal scores can be computed that are required for the taxonomic classification. Both the BLAST and the HMMER variants of CARMA3 can also be used for the taxonomic classification of amino acid sequences.

Results and Discussion

CARMA3 is available via the WebCARMA pipeline that takes metagenomic reads as input and outputs taxonomic and functional classifications. The pipeline runs on the compute cluster of the Bielefeld University Bioinformatics Resource Facility at the Center for Biotechnology (CeBiTec) and is freely accessible at http://webcarma.cebitec.uni-bielefeld.de. The complete source code of CARMA3 (C/C++) has been released under the GPL and is available for download from the WebCARMA homepage.
CARMA3 has been evaluated in various experiments including simulated and real metagenomes. In the following the results of two of these experiments are shown. The first experiment is a qualitative comparison of CARMA3 with SOrt-ITEMS and MEGAN using simulated data. The simulated metagenome consists of 25 randomly chosen bacterial genomes from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). N = 25,000 metagenomic reads were simulated using MetaSim (Richter et al. 2008) with the default 454 sequencing error model resulting in an average read length of 265 bp. The second experiment is an example of the applicability of CARMA3 in the case of very large metagenomes that can be produced, for example, by the Illumina sequencing technology. In this experiment the real data set consists of 3.3 million nonredundant microbial genes of the gene catalogue of the human gut microbiome (Qin et al. 2010). Fecal samples from different individuals were sequenced with the Illumina Genome Analyzer (GA) which yielded 576.7 Gb of sequence. The reads were assembled into longer contigs, and a gene finder was used to detect open reading frames (ORFs). Similar ORFs were clustered to obtain the final nonredundant gene set. This gene set was downloaded and the ORFs were translated into protein sequences using the NCBI Genetic Code 11.

Comparison with Other Methods Using Simulated Data

To evaluate the different BLAST-based methods regarding their ability to classify sequences of unknown source organism, three BLAST NR protein databases were created: “order-filtered,” without sequences from species that share the same order as any of the species from the simulated metagenome; “species-filtered,” without sequences from species in the simulated metagenome; and “All,” the complete NR database.
The BLASTx runs for CARMA3, SOrt-ITEMS, and MEGAN against these three databases were performed with default E-value threshold (-e 10), soft sequence masking (-F “mS”), and frameshift penalty 15 (-w 15). To ensure comparability, CARMA3 used the same thresholds as SOrt-ITEMS regarding the BLASTx hits, a minimal bit score of 35, and a minimal alignment length of 25. The parameter
Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3, Table 1 Comparison of the taxonomic classification accuracy of the different BLASTx-based methods CARMA3, SOrt-ITEMS, and MEGAN using the order-filtered database
                CARMA3             SOrt-ITEMS         MEGAN
Rank            TP       FP        TP       FP        TP       FP
Superkingdom    12,696   861       12,576   786       12,626   1,849
Phylum          8,989    1,224     9,254    1,736     8,079    1,985
Class           4,066    1,495     4,062    1,937     3,649    2,479
Order           –        2,507     –        4,011     –        4,975
Family          –        1,186     –        2,565     –        4,087
Genus           –        210       –        798       –        4,041
Species         –        23        –        0         –        3,544
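The interval-based rank assignment described for CARMA3 (cases a and b) can be sketched with the bit scores quoted in the worked example: a family interval containing the single hit with score 80 and a genus interval spanning 95 to 120. This is a simplified illustration, not the CARMA3 source code. The function name, the rank list, and the dictionary layout are assumptions, intervals not stated in the text are omitted, and the polishing, linear interpolation, and fallback steps are not modeled.

```python
# Taxonomic ranks ordered from lowest (most specific) to highest
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]


def assign_rank(read_score, intervals):
    """Assign a read to a rank in the lineage of the best hit.

    intervals maps a rank name to (min_score, max_score) of the reciprocal
    bit scores projected onto that rank. Returns (rank, case), where case is
    "a" or "b", or (None, None) if neither rule applies.
    """
    present = [rank for rank in RANKS if rank in intervals]
    # Case a: lowest rank whose interval contains the read's reciprocal score
    for rank in present:
        low, high = intervals[rank]
        if low <= read_score <= high:
            return rank, "a"
    # Case b: lowest rank at which all bit scores are below the read's score
    for rank in present:
        _, high = intervals[rank]
        if high < read_score:
            return rank, "b"
    return None, None


if __name__ == "__main__":
    # Scores quoted in the text: t5 = 80 (family); t4 = 95 and t2 = 120 (genus)
    example_intervals = {"genus": (95, 120), "family": (80, 80)}
    print(assign_rank(90, example_intervals))   # read q
    print(assign_rank(105, example_intervals))  # read q'
```

With these inputs the sketch reproduces the two assignments given in the text: the read with score 90 falls to rank family under case b, and the read with score 105 is placed at rank genus under case a.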
for the minimal number of reads that are required numbers TP, FP, and U sum up to the total num-
to report a taxon in SOrt-ITEMS and MEGAN ber N of reads used in the evaluation and U equals
was set to 1 in all experiments. To ensure comparability N − TP − FP, U is not explicitly given in the
of MEGAN with the other two results.
BLAST-based methods, the top percent parame- The complete table for all results can be found
ter was increased from ten (default) to 15 resulting in the original publication. Table 1 below shows
in more conservative predictions. Although the results for the evaluation on the order-filtered
CARMA3 is a parameter-free method, an artifi- database. While CARMA3 performs better than
cial parameter p was introduced to slightly SOrt-ITEMS at rank class, since it has the same
increase the sensitivity at the cost of decreased number of true positives but fewer false positives,
specificity, in order to yield a sensitivity compa- for the ranks superkingdom and phylum, it is not
rable to that of the other two methods. This clear which method is better. At the taxonomic
allowed for evaluating each of the methods ranks order to genus, where the metagenomic
based on their number of false positives. The sequences have been filtered away, CARMA3
values of p were 1.024 for order-filtered, 1.033 has much fewer (37–74 %) false positives
for species-filtered, and 1.15 for the unfiltered than SOrt-ITEMS. CARMA3 has better results
database. than MEGAN at all taxonomic ranks, while SOrt-
The taxonomic classification methods assign ITEMS has better results than MEGAN at all
to a metagenomic read one taxon and therefore taxonomic ranks below superkingdom. The
also one taxonomic rank. This taxon implicitly results for the species-filtered and the complete
provides a taxonomic classification also for the NR database, where closely related or identical
higher taxonomic ranks. For example, the taxon reference species are available in the database,
Gammaproteobacteria at the taxonomic rank show that in such a setting CARMA3 performs T
class implicitly provides the taxonomic classifi- similar to the other two methods.
cation Bacteria at the taxonomic rank
superkingdom. The taxonomic ranks below the Taxonomic Classification of the Human Gut
predicted taxon can be considered to be classified Microbiome with CARMA3
as “unknown.” Therefore, for each taxonomic A taxonomic classification based on BLAST has
rank, a metagenomic read can either be correctly the advantage of a high sensitivity, which is in
classified and counts as a true positive (TP), can particular important if no closely related refer-
be wrongly classified and counts as a false posi- ence species are available. The main bottleneck
tive (FP), or it is not classified and counts as of this approach is the computation time required
unknown (U). As for each taxonomic rank the for the BLAST search. Over 98 % of the total
running time of a CARMA3 analysis is due to the initial BLASTx search against the NR database. While a BLASTx analysis of a complete 454 run is feasible on a compute cluster in the order of hours or a few days, this approach seems to be less practical for the analysis of all unassembled reads produced by a complete run of an Illumina sequencing machine that produces one to two orders of magnitude more bases in total than a 454 sequencing machine in a single run.
One way to overcome this limitation is the usage of data reduction techniques. This is a common strategy to handle the amount of data produced by Illumina sequencing machines (Qin et al. 2010; Hess et al. 2011). Typical steps involve the assembly of reads into longer fragments, gene detection with a gene finder to detect open reading frames (ORFs), clustering of highly similar ORFs, and translation of the nonredundant ORFs into protein sequences. Such a metaproteome has, in contrast to the full set of unassembled Illumina reads, a size that makes the analysis with the BLASTp variant of CARMA3 possible on a compute cluster in the order of hours or a few days. To evaluate the applicability of CARMA3 on amino acid sequences derived from assembled Illumina reads, the BLASTp variant of CARMA3 was used to analyze the gene catalogue of the human gut microbiome (Qin et al. 2010). The results were compared to the taxonomic classification of another study of the human intestinal microbial flora based on 13,355 prokaryotic 16S ribosomal RNA gene sequences (Eckburg et al. 2005).
Both methods, the 16S rDNA analysis and CARMA3, identify Firmicutes and Bacteroidetes as the most abundant phyla, followed by Proteobacteria, Actinobacteria, Verrucomicrobia, and Fusobacteria. Also, in both analyses, the phylum Firmicutes consists mainly of the class Clostridia. Nearly all genera of the Clostridia that have been predicted by the 16S rDNA analysis, like Eubacterium, Ruminococcus, Dorea, Butyrivibrio, and Coprococcus, have also been predicted by CARMA3. Also most of the species of Clostridia like E. rectale, E. hallii, R. torques, R. gnavus, F. prausnitzii, D. formicigenerans, and D. longicatena that are found by the 16S rDNA analysis could be confirmed by CARMA3. However, the species E. hadrum and R. callidus that have been found by 16S rDNA were not found by CARMA3. The genus Clostridium which is the taxon found by CARMA3 to have the highest abundance in the class Clostridia is not reported by the 16S rDNA analysis. The reason for this might be that the 16S rDNA sequence of Clostridium bartlettii, which mostly contributes to the genus Clostridium and is known to be found in human feces, might not have been available at the time of the 16S rDNA analysis (Song et al. 2004). Also the species R. inulinivorans and R. intestinalis of the genus Roseburia, which are found by CARMA3 but not by the 16S rDNA analysis, are known to occur in human feces (Duncan et al. 2002; Scott et al. 2011). For the second most abundant phylum, the Bacteroidetes, the authors of the 16S rDNA analysis report a high variability in the distribution of phylotypes in samples from different subjects. Nevertheless, all phylotypes reported by the authors of the 16S rDNA analysis, B. vulgatus, Prevotellaceae, B. thetaiotaomicron, B. caccae, and B. fragilis, were among the 11 or, in case of B. putredinis, among the 22 most abundant taxa predicted by CARMA3 (Gerlach et al. 2011, Supplementary Figs. S22–S25).
The comparison of the taxonomic predictions of the 16S rDNA analysis and CARMA3 has revealed a high consistency in the results of both methods. This shows that CARMA3 can also be used for the taxonomic classification of amino acid sequences obtained from assembled Illumina reads.

Summary

CARMA3 is a method for the taxonomic classification of assembled and unassembled metagenomic sequences that can be used in combination with BLAST- and HMMER-based homology searches. Except for the homology search and the fallback scenario, this method is parameter-free. In addition, for the HMMER-based variant, it also provides a functional
classification of the metagenomic sequence.
Typically, a metagenomic sample contains many novel species that have not been sequenced before. Such a scenario has been simulated with the order-filtered database, and it also has been shown that in most cases CARMA3 not only performs better than existing BLAST-based methods, but most strikingly, it is better at avoiding FP predictions on lower taxonomic ranks when only remote homologues are available for the classification of novel species.
One reason for the high accuracy of CARMA3 is that reciprocal hits provide a reasonable estimation of the last common ancestor of the metagenomic sequence and its best hit in the sequence database. In contrast to the other BLAST-based methods, this method is not based on the LCA and therefore does not discard reciprocal hits that can provide valuable information for the taxonomic classification.
A drawback of using BLASTx is its running time. The computational bottleneck of the CARMA3 pipeline is the homology search, in particular the BLAST search. In the evaluation the initial BLAST search accounted for over 98 % of the total running time. However, this is a problem shared with all BLAST-based approaches. Furthermore, it has been shown in the evaluation that this problem can be dealt with by the use of data reduction strategies which include assembly and gene detection steps.
Currently available biological sequence databases are known to be biased because they mainly contain sequences of species that are culturable. Although the authors have tried to minimize the effect of this bias on the results of their evaluation by creating the order-filtered database, this bias has to be kept in mind when generalizing the evaluation results to metagenomic reads from unculturable species.

Cross-References

▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ PhyloPythia(S)

References

Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005;12:281–90.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinforma. 2009;10:56.
Duncan SH, Hold GL, Barcenilla A, Stewart CS, Flint HJ. Roseburia intestinalis sp. nov., a novel saccharolytic, butyrate-producing bacterium from human faeces. Int J Syst Evol Microbiol. 2002;52(Pt 5):1615–20.
Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–8.
Eddy SR. Profile hidden Markov models (review). Bioinformatics. 1998;14(9):755–63.
Finn RD, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38(Database issue):D211–22.
Gerlach W, Stoye J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011;39(14):e91.
Gerlach W, Jünemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinforma. 2009;10:430.
Gish W, States DJ. Identification of protein coding regions by database similarity search. Nat Genet. 1993;3(3):266–72.
Haque MM, Ghosh TS, Komanduri D, Mande SS. SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009;25(14):1722–30.
Hess M, Sczyrba A, Egan R, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
T 660 The Vaginal Microbiome in Health and Disease
which is rarely dominant in bacterial vaginosis (BV), L. iners can be detected at high levels in most subjects with and without BV, and in many studies it is the only Lactobacillus species detected in women with BV. It has been postulated that this may be because L. iners may be better adapted to the conditions associated with BV, i.e., the polymicrobial state of the vaginal flora and elevated pH. Alternatively, it could be the relative resistance of L. iners to unknown factors that led to the demise of other Lactobacillus species during the onset of BV or to a relative lack of antagonism of L. iners to BV-associated anaerobes, so that their dominance predisposes the individual to the acquisition of BV.

Community Group Variations Among Different Ethnic Groups
The vaginal microbiome and pH in asymptomatic, sexually active women who were fairly equally represented according to self-reported ethnic group (Hispanic, Black, Asian, White) have been studied (Ravel et al. 2011). The proportion of each community group and pH among the four ethnic groups varied significantly. Bacterial communities dominated by lactobacilli were found significantly more commonly in Asian and White women (80.2 % and 89.7 %, respectively) compared to only 59.6 % and 61.9 % in Hispanic and Black women, respectively. Similarly, median pH values were significantly higher in Black and Hispanic women compared to Asian and White women.

Abnormal Vaginal Flora

Abnormal vaginal flora may occur because of a sexually transmitted infection (STI), e.g., trichomoniasis, or through colonization by an organism that is not part of the normal vaginal community. Alternatively, abnormal vaginal flora may result from overgrowth or increased virulence of an organism that is a constituent part of normal vaginal flora such as Escherichia coli. Alterations in vaginal flora do not necessarily imply disease or result in symptoms. Disease results from the interplay between microbial virulence, numerical dominance, and the innate and adaptive immune response of the host (Smith 1934). The most common disorder of vaginal flora is BV, which is a polymicrobial condition characterized by a decrease in the quality or quantity of lactobacilli and by a 1000-fold increase in the number of other organisms, determined by culture-dependent techniques, particularly anaerobes such as Mycoplasma hominis, G. vaginalis, and Mobiluncus species. BV is increasingly associated with adverse outcomes in gynecology such as pelvic inflammatory disease, postabortal sepsis, infertility, post-hysterectomy vaginal cuff infections, and the acquisition of STIs such as gonorrhea, Chlamydia, trichomoniasis, and HIV. In pregnancy, BV has been associated with early and late miscarriage, recurrent abortion, postpartum endometritis, and preterm birth.

Atopobium vaginae: Under-detected and Underappreciated
The genus Atopobium is a member of the family Coriobacteriaceae and forms a distinct branch within the phylum Actinobacteria. Following sequence analysis, three species formerly designated Lactobacillus minutus, Lactobacillus rimae, and Streptococcus parvulus, within the lactic acid-producing group of bacteria, have been reclassified as the genus Atopobium. In 1999, an organism similar but not identical to these three species was isolated from the vagina of a healthy woman in Sweden, and the organism was named Atopobium vaginae (Rodriguez et al. 1999). Since that time, using molecular-based techniques, A. vaginae has frequently been detected in the vagina and is found much more commonly in women with BV than in those with normal flora (Lamont et al. 2011). A. vaginae is strictly anaerobic and is very sensitive to clindamycin in vitro, but is highly resistant to nitroimidazoles such as metronidazole and secnidazole.

High Diversity of Flora in Bacterial Vaginosis Compared with Normal Flora
Using various molecular-based techniques and the Amsel clinical criteria, or Nugent score to
classify normal or abnormal flora, a number of studies have demonstrated a high diversity of organisms in women with BV compared to women with normal flora. Collectively, these studies demonstrate the presence of species such as A. vaginae, Porphyromonas asaccharolytica, bacterial vaginosis-associated bacteria (BVAB)-1, BVAB-2, and BVAB-3 in the order Clostridiales and species of Megasphaera, Leptotrichia, Dialister, Chloroflexi, Eggerthella, Olsenella, Streptobacillus, and Shuttleworthia which are either novel or unfamiliar to clinicians (Lamont et al. 2011). For many of these undetected or under-detected organisms, there is evidence of disease association. The renamed Atopobium parvulum, Atopobium minutum, and Atopobium rimae have been associated with oral infections, dental and tubo-ovarian abscesses, and abdominal wound infections, supporting the view that these organisms can be pathogenic to the host. Leptotrichia sanguinegens/amnionii has been reported in association with postpartum endometritis, adnexal masses, and fetal death and has been detected in the amniotic fluid of women with preterm labor, preterm prelabor rupture of the membranes, and preeclampsia. Also, in a study of 45 women with salpingitis and 44 controls (women seeking tubal ligation), bacterial 16S rDNA sequences were found in the fallopian tube specimens of 24 % of cases, but in none of the controls. Bacterial phylotypes closely related to Leptotrichia species and A. vaginae were among those identified in the cases. In addition, Dialister pneumosintes was found as the sole agent in the blood culture from a woman with suppurative postpartum ovarian thrombosis.
It has also been demonstrated that many of these organisms have specificity for BV and that the number of phylotypes found in association with BV is statistically significantly greater than the number detected in the presence of intermediate flora (a distinct entity in its own right) (Taylor-Robinson et al. 2003) or normal flora. This statistic largely results from the extreme dominance of lactobacilli in healthy women, which makes detection of other species unlikely, even when they are present at levels of 100,000 or more cells/sample. In summary, these studies have demonstrated that different subjects with BV have different microbial profiles, indicating heterogeneity in the composition of bacterial taxa in women with BV. Women without BV had bacterial communities dominated by Lactobacillus species, accounting for 86 % of all sequences. In contrast, women with BV did not possess a single dominant phylotype, but instead had a diverse array of vaginal bacteria, often at relatively low abundances.

The Diagnosis of Bacterial Vaginosis
Bacterial vaginosis can be diagnosed clinically, microscopically, enzymatically, and chromatographically, using qualitative or semiquantitative culture methods or using composite clinical criteria. Currently, the gold standard is the Nugent score (Nugent et al. 1991), but the number of diagnostic methods testifies to the fact that no single test is ideal and that they can all provide false-positive and false-negative results.

Confounding Factors
Findings from molecular-based studies are now highlighting possible explanations for why diagnosis by microscopy may be inconsistent and why molecular methods may replace them:
1. Mobiluncus: One of the three organisms quantified as part of the Nugent score is Mobiluncus. Several cloning and sequencing studies have only rarely identified Mobiluncus. Fluorescence in situ hybridization (FISH) technology has demonstrated that BVAB-1 has curved-rod morphology, similar to Mobiluncus morphotypes, and it is possible that during microscopic examination of vaginal smears, Mobiluncus species may have been overrepresented and mistaken for BVAB-1. Alternatively, as species-specific PCR agrees with the Nugent score, Mobiluncus may be missed in universal PCR studies because it frequently falls below a threshold titer where it can be detected.
2. Atopobium: The urea produced by Atopobium species is associated with halitosis, and similarly, species of Megasphaera cause beer spoilage by producing turbidity, off-flavors
and off-colors. Accordingly, if two genera associated with malodorous metabolites can be found in the vagina of healthy women and amines can be found in women without BV, then diagnostic techniques to diagnose BV, based on amine production and odor formation, may need to be reconsidered. Microscopically, Atopobium species are gram-positive, elliptical cocci, or rod-shaped organisms that occur singly, in pairs, or in short chains. The variable cell morphology of Atopobium renders it well camouflaged among the mixture of other species present in bacterial communities where the Nugent score is 4. A. vaginae is fastidious, grows anaerobically, and forms small pinhead colonies on culture that are easily missed. Although phylogenetically different from other lactic acid-producing bacteria, they are not phenotypically exceptional, and it is not difficult to see why the significance of this organism based on culture, microscopy, and phenotype may be overlooked and underappreciated.
3. Symptomatic relationships: Using species-specific primers, the relationships between five fastidious organisms associated with BV were compared with BV diagnosed by Amsel and/or Nugent scores, and also with the individual Amsel clinical criteria (Haggerty et al. 2009). The two biovars of Ureaplasma urealyticum (Ureaplasma parvum and Ureaplasma urealyticum – biovar 2) were associated with vaginal discharge and raised pH, but not with BV by either Amsel or Nugent criteria or any of the individual Amsel clinical criteria. In contrast, with Leptotrichia sanguinegens/amnionii, A. vaginae, and BVAB-1, an elevated pH >4.5 was a universal feature, and they were all associated with BV by both Amsel and Nugent criteria and with the finding of >20 % of epithelial cells as clue cells, a feature that has already been reported. A positive test for amine odor upon the addition of 10 % solution of potassium hydroxide was significantly more likely in women testing positive for BVAB-1. Douching is a recognized risk factor for BV, and the detection of Leptotrichia and A. vaginae was three times more likely, and BVAB-1 twice as likely, when women reported douching.

Diagnosis of BV Using Qualitative and Quantitative Molecular Techniques
Some organisms or combinations of organisms have high sensitivities or specificities for the diagnosis of BV using the Amsel criteria and the Nugent score (Fredricks et al. 2005; Fredricks et al. 2007). Using quantitative real-time PCR, the association of individual organisms with BV diagnosed by Nugent score was examined qualitatively. At a threshold of 10⁸ DNA copies/ml, Lactobacillus species was predictive of normal flora (sensitivity 44 %; specificity 100 %). BVAB-1, BVAB-2, and BVAB-3 alone, or in combination, had high specificity for BV diagnosed by Amsel criteria.
Since A. vaginae and G. vaginalis are frequently detected in association with BV, a number of authors using molecular-based techniques have examined the possibility of combining these two organisms as a means of diagnosing BV. Using DNA quantitation, 19 out of 20 BV samples had either a DNA level for A. vaginae ≥10⁸ copies/ml or G. vaginalis ≥10⁹ copies/ml, and nine out of 20 had both. The combination of an A. vaginae DNA level ≥10⁸ copies/ml and a G. vaginalis DNA level ≥10⁹ copies/ml demonstrated the best predictive criteria for the diagnosis of BV with excellent sensitivity (95 %), specificity (99 %), negative predictive value (NPV, 99 %), and positive predictive value (PPV, 95 %) (Menard et al. 2008).

Culture-Independent Studies in Pregnancy

Culture-independent techniques have been used to measure prevalence, diversity, and abundance of organisms, particularly ureaplasmas in amniotic fluid, in association with suspected cervical insufficiency, preterm labor, preterm prelabor rupture of membranes (PPROM),
small-for-gestational-age babies, preeclampsia, and the potential for bacteria from the oral cavity to colonize amniotic fluid. However, apart from combining pregnant women with nonpregnant women to increase sample numbers, the information with respect to the vaginal microbiome in pregnant women is limited, particularly with respect to the outcome of pregnancy, especially preterm birth. Using species-specific primers, Wilks et al. quantified the production of H2O2 by lactobacilli from swabs taken at 20 weeks of gestation from the vagina of 73 women considered to be at high risk of preterm birth (Wilks et al. 2004). The levels of H2O2 production varied between species of Lactobacillus. The presence of lactobacilli producing high levels of H2O2 was associated with a reduced incidence of BV at 20 weeks of gestation and subsequent chorioamnionitis. The authors postulated that H2O2-producing lactobacilli reduced the incidence of ascending genital tract colonization in pregnancy, which leads to infection and preterm birth. In a longitudinal study of 100 pregnant women, vaginal swabs were obtained at mean gestational ages of 8.6, 21.2, and 32.4 weeks, respectively (Verstraelen et al. 2009). In the first trimester, 77 women had normal or Lactobacillus-dominated flora, 13 of whom developed abnormal flora in the second or third trimester. When the first-trimester normal flora was dominated by L. gasseri or L. iners, there was a tenfold risk of conversion to abnormal flora. In contrast, normal flora comprising L. crispatus had a fivefold decreased risk of conversion to abnormal flora. This may be because only a small percentage of L. gasseri and L. iners strains produce H2O2.
Knowledge of the vaginal microbiome in pregnancy is limited to only a few studies (Verstraelen et al. 2009; Hernández-Rodriguez et al. 2011; Aagaard et al. 2012), none of which analyzed samples collected longitudinally. Recently, using 16S rDNA sequencing in normal pregnant women sampled longitudinally, the vaginal microbiome was found to be different from that of nonpregnant women; also the vaginal microbiome during pregnancy is more stable than in the nonpregnant state (Romero et al. 2014).

Conclusions

Stability and resilience of the vaginal ecosystem are now recognized to be of importance in the health of a bacterial community as well as the response to perturbations. The relative abundance of certain phylotypes correlates well with low or high Nugent scores, which are used for the diagnosis of normal flora or BV. The inherent difference within and between women in different ethnic groups strongly argues for a more refined definition of the subtypes of bacterial communities normally found in healthy women and the need to appreciate differences between individuals so they can be taken into account in risk assessment and diagnosis of disease.

References

Aagaard K, Riehle K, Ma J, Segata N, Mistretta TA, Coarfa C, et al. A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy. PLoS ONE. 2012;7:e36466.
Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–7.
Fredricks DN, Fiedler TL, Marrazzo JM. Molecular identification of bacteria associated with bacterial vaginosis. N Engl J Med. 2005;353:1899–911.
Fredricks DN, Fiedler TL, Thomas KK, Oakley BB, Marrazzo JM. Targeted PCR for detection of vaginal bacteria associated with bacterial vaginosis. J Clin Microbiol. 2007;45:3270–6.
Gajer P, Brotman R, Bai G, Sakamoto J, Schütte U, Zhong X, et al. Temporal dynamics of the human vaginal microbiota. Sci Transl Med. 2012;4:132ra152.
Haggerty CL, Totten PA, Ferris M, Martin DH, Hoferka S, Astete SG, et al. Clinical characteristics of bacterial vaginosis among women testing positive for fastidious bacteria. Sex Transm Infect. 2009;85:242–8.
Hernández-Rodriguez C, Romero-González R, Albani-Campanario M, Figueroa-Damián R, Meraz-Cruz N, Hernández-Guerrero C. Vaginal microbiota of healthy pregnant Mexican women is constituted by four Lactobacillus species and several vaginosis-associated bacteria. Infect Dis Obstet Gynecol. 2011;2011:851485.
Lamont R, Sobel J, Akins R, Hassan S, Chaiworapongsa T, Kusanovic J, et al. The vaginal microbiome: new information about genital tract flora using molecular based techniques. BJOG. 2011;118:533–49.
T 666 tRNA Gene Database Curated Manually by Experts
tRNA Gene Database Curated Manually by Experts, Fig. 1 Basic functions of tRNADB-CE
supports two types of sequence search: sequence similarity search “BLASTN” and pattern search. In pattern search (i.e., oligonucleotide sequence search), we can focus the search area on the stems/loops of cloverleaf structures and combine the areas in various patterns. After selecting tRNA genes of interest using the sequence search procedures, multiple alignments with ClustalW and downloads of aligned sequences and obtained dendrograms are available (Fig. 1b).

Identical Sequence Groups and Their Use as Phylogenetic Markers for Environmental Metagenomic Sequences

When we conducted the clustering of tRNA gene sequences, except the 3′ CCA terminal sequence, from complete and draft genomes of Bacteria and Archaea by sequence alignment using the CD-HIT (Li and Godzik 2006), we found high phylogenetic preservation of tRNA genes; a particular tRNA sequence was found only in a particular lineage of phylogenetic groups. We designated here the tRNA group with an identical sequence as “identical sequence group: ISG” (Fig. 2a) and listed the numbers of ISGs for each phylotype (Fig. 2b) and for each anticodon (Fig. 2c). tRNAs with one anticodon type were classified and listed according to the ISG along with the phylotype information of each tRNA (Fig. 2d), and thus, the range of phylotypes found for each ISG could be examined. If we focused on ISGs composed of more than five sequences, approximately 95 % of ISGs were conserved at a phylum level, showing most tRNAs to be good phylogenetic markers at least at the phylum level. The ISGs could provide
tRNA Gene Database Curated Manually by Experts, Fig. 2 List and search for identical sequence group (ISG)
a strategy for selecting reliable phylogenetic markers. In addition, approximately 65 % of ISGs were conserved even at the genus level, showing the possible existence of good genus-specific markers. By combining the data provided by this database with other detailed knowledge of a particular tRNA obtained by experiments or from literature, users may obtain useful phylogenetic markers (e.g., genus-specific markers) by themselves.
Interestingly, among tRNA genes found in metagenomic sequences derived from environmental samples, approximately 25 % of tRNA genes were identical in sequence to genes from species-known prokaryotes. Using tRNAs found in an environmental sample that were assigned to ISGs, we could predict the microbial community structure in an environmental ecosystem at least at the phylum level (Fig. 2e). The database also has a function for searching for sequences with 97 % or 95 % sequence identity (2- or 3-nt difference, respectively) (Fig. 2a). By using tools in the database and specific markers found by users (e.g., genus-specific markers), users can clarify microbial populations in an ecosystem by themselves.
The present strategy can be applied even to data of short sequences obtained with new-generation sequencers, such as those in the Sequence Read Archive (SRA) at NCBI. In metagenomic analyses using new-generation sequencers, the phylogenetic characterization of short sequences with existing bioinformatics methods was particularly difficult, except for sequences unambiguously mapped on a known sequenced genome. Because complete tRNA genes can be found even from short genomic fragments of around 100 bases, tRNA genes should become one of the most effective means for identifying
microbial populations in an ecosystem in the case of metagenome studies conducted with next-generation sequencers.
When we consider the rapid growth of genomic and metagenomic sequences accumulated in DDBJ/EMBL/GenBank, our present strategy to search for reliable tRNAs, including manual curation by experts, may be inadequate. Our group previously developed BLSOM (Batch-Learning Self-Organizing Map) for oligonucleotide composition, which clustered (self-organized) genomic sequence fragments according to the phylogenetic group (Abe et al. 2003). The oligonucleotide BLSOM was successfully applied to the phylogenetic classification of a large quantity of metagenomic sequences (Abe et al. 2005).
When we conducted BLSOM for the tetra- and pentanucleotide compositions of bacterial tRNAs, tRNAs were accurately separated according to the amino acid, showing the BLSOM to be an additional informatics strategy for the assignment of reliable tRNAs. When we focused on tRNAs with the same anticodon, tRNAs were separated according to the phylotype on BLSOM, showing that the BLSOM is also applicable to the phylogenetic assignment of tRNAs present in metagenomic sequences.
Summary
By compiling the tRNAs of known prokaryotes with identical sequences, we found high phylogenetic preservation of tRNA sequences, especially at the phylum level. Furthermore, a large number of tRNAs obtained by metagenome analyses had sequences identical to those found for known prokaryotes. The identical sequence group, therefore, can be used as molecular phylogenetic markers to clarify microbial community structures of environmental ecosystems. The tRNADB-CE allows users to obtain phylotype-specific markers (e.g., genus-specific markers) by themselves and to clarify microbial community structures in ecosystems in detail. tRNADB-CE can be accessed freely at http://trna.ie.niigata-u.ac.jp.
References
Abe T, Kanaya S, Kinouchi M, Ichiba M, Kozuki T, Ikemura T. Informatics for unveiling hidden genome signatures. Genome Res. 2003;13:693–702.
Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005;12:281–90.
Abe T, Ikemura T, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi H. tRNADB-CE: tRNA gene database curated manually by experts. Nucleic Acids Res. 2009;37:D163–8.
Abe T, Ikemura T, Sugahara J, Kanai A, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi H. tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Res. 2011;39:D210–3.
Kinouchi M, Kurokawa K. tRNAfinder: a software system to find all tRNA genes in the DNA sequence based on the cloverleaf secondary structure. J Comp Aided Chem. 2006;7:116–26.
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–6.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–64.
U
U 672 Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview
(Piel 2011). Third, a cis-regulatory element is often required for the expression of a gene or operon. However, the inserts cloned into a plasmid, cosmid, or fosmid vector might not allow for the cloning of both a gene or operon and its cis-regulatory element into the same clone. The lack of a cis-regulatory element can prevent the metabolic phenotype of interest from being detected during activity-based screening of clone libraries. Fourth, cloned metagenomic DNA fragments are less stable in a plasmid or a cosmid vector than in a BAC vector. Therefore, BAC libraries have unique utility in metagenomic studies of microbiomes. In this entry, the construction, screening, bioinformatic and biochemical analysis, and utilization of BAC clone libraries are overviewed.
Isolation of High Molecular Weight DNA
Because the purpose of BAC cloning is to recover contiguous regions of microbial genomes, the DNA fragments recovered from a microbiome sample should be significantly longer than those required for fosmid or other cloning vectors. Many studies have compared DNA extraction methods suitable for metagenomic applications (Delmont et al. 2011), and this section will review those methods that are appropriate for BAC cloning.
Extraction of high molecular weight (HMW) DNA from a microbiome is always a significant issue due to the inherent conflict between the need to recover DNA from diverse microorganisms while preserving DNA integrity. Direct DNA extraction methods, by which DNA is recovered directly from an environmental sample, provide high DNA yields from phylogenetically diverse microorganisms; yet the vast majority of the DNA is sheared and likely less than 100 kb. In contrast, indirect DNA extraction methods, in which microbial cells are first isolated from sample matrices prior to DNA extraction, result in a lower DNA yield, but the resultant DNA is of significantly greater molecular weight compared to direct extraction methods. However, indirect DNA extraction can produce less representative metagenomic DNA because some microbial cells can be difficult to isolate from the sample matrices. The choice of a direct or an indirect extraction method depends on the nature of the environmental sample. For example, for a sample with high levels of contaminants, such as soils and sediments, there may be an advantage in using indirect extraction to significantly reduce humic acid levels co-extracted with the DNA. However, indirect extraction methods may not yield sufficient DNA from samples with lower cellular abundance, such as sediments and aquifers. Therefore, it is important to use an empirical approach and evaluate multiple methods for HMW DNA extraction.
Several protocols have been developed to overcome the difficulties in extracting HMW DNA for metagenomic studies. Stein et al. (1996) introduced an innovative technique that embeds microbial cells in an agarose gel matrix. These embedded cells are lysed in situ to prevent mechanical shearing of the microbial DNA. The Nycodenz extraction technique is another technique that is used to avoid mechanical shearing of HMW DNA (Berry et al. 2003). This technique prevents physical damage to bacterial cells by cushioning them during the high-speed centrifugation step. Some environmental contaminants are also removed during the centrifugation.
The use of multiple extraction methods for a single sample may increase the ultimate yield and phylogenetic representation of the recovered metagenomic DNA. The adoption of a particular DNA extraction method should be evaluated by the DNA yield and the representation of diverse microbial genomes within the recovered DNA. The diversity represented in the recovered DNA can be assessed by molecular phylogenetic analysis based on sequencing of the universally conserved 16S rRNA gene. Although phylogenetic diversity analysis can be influenced by many factors, including biases inherent in PCR, it is a rapid analysis to assess the phylogenetic composition of samples and libraries. Next-generation sequencing (NGS) of 16S rRNA gene amplicons permits a greater depth of
sequencing coverage compared to traditional Sanger sequencing technology.
Subsequent to extraction, the metagenomic DNA may need to be purified to achieve efficient cloning. Purification is especially needed to remove contaminants from metagenomic DNA extracted from soil samples because soil samples can contain high levels of humic acids and other phenolic compounds that tend to be co-extracted with the DNA and hamper downstream processes (e.g., restriction digestion and cloning of the DNA). Multiple approaches have been developed to purify metagenomic DNA, including the use of CTAB, hydroxyapatite column purification, or formamide denaturation. Formamide can be more effective in removing some contaminants from HMW DNA due to its inherent capability to denature DNA and remove contaminants that are tightly intercalated between the DNA bases and strands (Liles et al. 2008). Formamide or polyvinylpolypyrrolidone (PVPP) can also help remove nuclease contaminants. Hydroxyapatite chromatography has advantages over other purification approaches as this method can efficiently fractionate nucleic acids with different conformations (i.e., dsDNA, ssDNA, dsRNA, and ssRNA) while helping remove sample contaminants (Andrews-Pfannkoch et al. 2010). This differential elution of nucleic acids can easily be accomplished by changing the phosphate concentrations at a constant temperature or a combination of increasing temperatures and phosphate concentrations of the elution buffers.
Fragmentation and Size Selection of Metagenomic DNA
Fragments of metagenomic DNA with a uniform length of about 150 kb are prepared either enzymatically or mechanically. Enzymatic fragmentation relies upon partial restriction digestion. However, the extent of partial digestion is difficult to control. Additionally, partial restriction digestion can result in nonrandom DNA fragmentation and a significant reduction in DNA size. An alternative to partial restriction digestion is mechanical shearing, which results in random fragmentation of metagenomic DNA. This method has been demonstrated with multiple eukaryotic genomes and recently applied to construction of soil BAC libraries (Kakirde and Nasrin et al., unpublished data).
A major challenge in constructing high-quality BAC libraries is to retain large DNA fragments while removing small ones that can be preferentially cloned. Multiple strategies are available to recover and clone large metagenomic DNA fragments. Pulsed field gel electrophoresis (PFGE) is the most frequently used method for size selection of partially digested or sheared metagenomic DNA. Alternatively, agarose gel electrophoresis can be used as it can provide better resolution of HMW DNA. Because sucrose gradient centrifugation can only resolve DNA fragments of 5–60 kb, it is not a suitable method to separate HMW DNA fragments for BAC library construction.
Construction of BAC Clone Libraries
Several BAC vectors have been developed for metagenomic cloning that enable transfer and expression of cloned DNA in multiple heterologous hosts. The initial development of the pBELOBAC11 vector (Kim et al. 1996) was instrumental in permitting BAC cloning of environmental DNA (Rondon et al. 2000). However, the inherent restriction of the pBELOBAC11 vector to single-copy within an E. coli host cell was a severe limitation for downstream analysis of the resultant BAC libraries. The development of inducible-copy control BAC vectors by inclusion of an RK2 origin of replication enabled more facile BAC library construction (Wild et al. 2002), and different derivatives of these vectors were constructed and made commercially available (Lucigen, Middleton, WI; Epicentre, Madison, WI). The inducible-copy control BAC vectors were further modified by including the complete mini-RK2 replicon within the BAC vector to enable shuttling of BAC clones into multiple heterologous hosts (Kakirde et al. 2011), greatly expanding the host range for heterologous expression analysis of BAC libraries.
The size-selected metagenomic DNA fragments are ligated into an appropriate BAC vector using a DNA ligase. DNA fragments prepared by partial restriction digestion can be directly ligated to the chosen BAC vector that has been linearized with the same restriction enzyme. If randomly sheared DNA fragments are cloned, however, both ends of each fragment need to be repaired to blunt ends. To increase cloning efficiency, adaptors of an appropriate restriction enzyme can also be ligated to the repaired ends. After ligation, the insert-carrying BAC vector is transformed into highly competent E. coli cells, typically using electroporation. It should be noted that even if a shuttle vector capable of transfer to other hosts is used, it is advisable to first use E. coli for BAC library construction to take advantage of high transformation efficiency, even if the ultimate expression host is not E. coli. It should also be noted that this is the most difficult step in BAC cloning and that the larger number of fosmid libraries reported in the literature compared to BAC libraries is merely a reflection of the more facile fosmid cloning. Once transformants are isolated on respective antibiotic selection plates, a representative number of colonies should be evaluated for the percentage of BAC clones with insert and the average insert size.
If the library statistics are satisfactory for further analysis, then colonies can be archived. The archiving of a BAC library, which typically consists of a vast number of clones, is an important step since researchers frequently need to access clones for confirmation, verification, screening, and other analyses. BAC clones are usually suspended in a cryoprotectant medium, which is usually 10–15 % glycerol or 8 % dimethyl sulfoxide (DMSO) in the original growth medium of the bacterial host. The BAC clones are usually grown in 96-well or 384-well format for high throughput handling and screening.
Screening of BAC Libraries
Unlike shotgun metagenomic sequencing, BAC libraries provide the opportunity to identify and access the functional diversity that can be determined phenotypically. The strategies for activity-based screening of BAC libraries depend on the nature of the compounds or enzymes of interest and should be designed carefully (Taupp et al. 2011). The advent of new technologies has enabled the application of high throughput screening (HTS) approaches to identify clones with the desired phenotypes (producing certain compounds or enzymes) from a large number of BAC clones. The analysis of cell lysates, DNA, or supernatants from BAC clones can be performed with a great diversity of screening targets to identify the clones carrying the desired activities or genetic targets (Lakhdari et al. 2010). For example, BAC clones expressing the desired activity can be identified by adding an indicator substrate of the enzymes of interest to the growth medium. Depending on the nature of the assay, the active clones may be detected by visual inspection of an indicator agar plate, flow cytometry, a spectrophotometer, or a fluorescent microtiter plate reader (Taupp et al. 2011).
One of the first proof-of-concept studies for identifying a functional natural product from a BAC library was accomplished via sequence-based screening. In the seminal paper of Béjà et al. (2000), the first bacterial proteorhodopsin, a light-driven proton pump, was discovered by identifying specific BAC clones that contained a 16S rRNA gene sequence and then identifying the other linked functional genes contained within the same clone. While this study used a fosmid library, this approach is equally applicable to BAC libraries and has been used to describe some of the functional diversity associated with as-yet-uncultured bacteria (Liles et al. 2008). This approach is inherently limited by the metagenomic DNA sequences that are immediately adjacent to an rRNA operon, but was nonetheless a useful method in the initial exploration of BAC libraries.
Enzymatic activities expressed from BAC clones may be identified via many different methods, including colorimetric or fluorescent assays, as well as indicator media (Taupp et al. 2011). For example, lipase-producing BAC clones can be detected on LB
agar plates supplemented with 1.0 % tributyrin by formation of a halo around individual clones due to tributyrin hydrolysis. Cellulases and xylanases can be detected using agar plates supplemented with carboxymethyl cellulose (CMC) and soluble xylan, respectively. Other enzymatic activities identified using a BAC approach include but are not limited to esterase, alcohol dehydrogenase, amidase, amylase, protease, chitinase, dehydratase, and β-lactamase (Lorenz and Eck 2005).
Identification of secondary metabolites expressed from a heterologous host is dependent upon having large-insert BAC clones, along with suitable transcriptional and translational machinery. The best examples of a functional metagenomic approach to identify antimicrobial activities were the isolation of turbomycin A and B (Gillespie et al. 2002), the identification of antibacterial activities expressed in cosmid libraries in different proteobacterial hosts (Craig et al. 2010), and identification of gene clusters involved in synthesis of antifungal activities (Chung et al. 2008).
In addition to activity-based screening, sequence-based screening is a widely used approach to find genes or gene clusters involved in particular functions within a BAC library. For example, an alternative to functional expression of libraries to identify antibacterial-active clones is to first identify clones that contain known pathways involved in secondary metabolite synthesis and then to express these pathways in a related host, permitting isolation of novel analogs of known metabolites. This has been demonstrated previously in the case of polyketide synthases (PKS) and nonribosomal peptide synthetases (NRPS). Feng et al. first identified the type II PKS biosynthetic system in two different cosmid clones by sequence-based homology screening of a cosmid library (Feng et al. 2010). Their sequence-based screening followed by heterologous expression of a type II PKS biosynthetic gene cluster identified three new fluostatins that were previously uncharacterized in cultured species (Feng et al. 2010). This approach can be equally applicable to screening BAC libraries.
Sequencing and Bioinformatic Analysis of BAC Libraries
The inserts of BAC clones that exhibit certain metabolic activities can be sequenced to determine the coding sequence, structural and regulatory features of the gene(s), and potential phylogenetic markers. Three different common sequencing strategies are typically used.
Subcloning and Sequencing of Individual BAC Clones
The insert of a BAC clone is first fragmented mechanically or enzymatically using a restriction enzyme. The resultant smaller inserts can be cloned into a plasmid vector (e.g., pUC vector) and then sequenced individually using the Sanger sequencing technology (see the “Subcloning” section below). The full length of the BAC insert can be assembled from the sequenced subclones (see “Sequence Assembly” section below). Because it is time-consuming to subclone and sequence a large number of BAC clones, this approach is primarily used to sequence one or a few BAC clones of interest.
End Sequencing
Both ends of a BAC clone insert can be sequenced using the Sanger sequencing technology and the primers that specifically anneal to the vector regions that flank the insert (Pope and Patel 2008). This approach only allows sequencing a short region at both ends of a BAC clone, and thus only limited sequence information can be determined. In contemporary studies, end sequencing is primarily used to match a BAC clone with its corresponding sequence that is determined using shotgun sequencing of pooled BAC clones (see the “Shotgun Sequencing” section below). The genetic information of each BAC clone can then be analyzed with respect to its phenotypic activities observed during activity screening.
Shotgun Sequencing of Pooled Select BAC Clones
Recent advancement in DNA sequencing technologies, especially the NGS technologies, made
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 1 Sample sites and abundance of viral and microbial counts
Hot spring  Temp (°C)  pH    Cells/mL    Viruses/mL   Virus:microbe ratio  Virus/mL in concentrate  Virus/mL theoretical (a)  Efficiency
Bear Paw    74         7.34  4.3 × 10^6  1.44 × 10^6  0.33                 1.48 × 10^8              7.21 × 10^9               2.1 %
Octopus     93         8.14  9.0 × 10^5  3.07 × 10^5  0.34                 2.18 × 10^8              1.53 × 10^9               14.2 %
(a) Based on a concentration factor of 5,000 (500 L to 100 mL)
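
As a rough check on Table 1, the derived columns can be recomputed from the measured counts. The Python sketch below is illustrative only: the dictionary layout and the function name are assumptions made for this example, and the input values are transcribed from the table.

```python
# Concentration factor stated in the table footnote: 500 L reduced to 100 mL.
CONCENTRATION_FACTOR = 5_000

# Measured values transcribed from Table 1 (counts per mL).
springs = {
    "Bear Paw": {"cells": 4.3e6, "viruses": 1.44e6, "concentrate": 1.48e8},
    "Octopus": {"cells": 9.0e5, "viruses": 3.07e5, "concentrate": 2.18e8},
}


def summarize(site):
    """Return (virus:microbe ratio, theoretical virus/mL, recovery efficiency)."""
    vmr = site["viruses"] / site["cells"]
    theoretical = site["viruses"] * CONCENTRATION_FACTOR
    efficiency = site["concentrate"] / theoretical
    return vmr, theoretical, efficiency


for name, site in springs.items():
    vmr, theoretical, efficiency = summarize(site)
    print(f"{name}: VMR={vmr:.2f}, theoretical={theoretical:.3g}/mL, "
          f"efficiency={efficiency:.1%}")
```

The recomputed efficiencies come out at roughly 2 % and 14 %, in line with the table; small differences in the theoretical column reflect rounding of the published counts.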
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 1 TEM images of viruslike particles directly isolated from YNP hot springs. Images from Bear Paw (Panels a and b) and Octopus (Panels c and d) hot springs are shown. The bar in each figure is 200 nm (Images are courtesy of Sue Brumfield and Mark Young, Montana State University. Reproduced with permission from Schoenfeld et al. (2008))
(Wommack and Colwell 2000). The virus/microbe ratios (VMRs) in the hot springs were much lower than in moderate-temperature environments (typically 3–10). These low VMRs may be related to the observation that none of the cultured thermophilic crenarchaeal viruses proliferate via lytic infections, a lifestyle that would result in large burst sizes at the same time as the microbial population is reduced. Actual yields of viruses were significantly below theoretical yields (Table 1) for both hot springs. It is not known if this loss was systematic and, therefore, biased the metagenomic analysis. Tailed, rod-shaped, and filamentous morphologies were observed in the concentrates (Fig. 1). Morphologies of viral particles in the concentrates represent most morphological families of known thermophilic viruses. Tailed morphologies are commonly associated with bacteriophages and euryarchaeotal viruses (Geslin et al. 2003; Yu et al. 2006); rod-shaped and filamentous morphologies are more commonly associated with crenarchaeal viruses (Prangishvili and Garrett 2004).
Library Construction and Sequencing
Advances in sequencing capacity make analyses of large numbers of clones feasible; however,
challenges in sampling and library construction have prevented the widespread use of metagenomic shotgun sequencing for studying viral populations. At around 50 ag of DNA per virus, abundances of 10^5–10^6 viruses per mL correspond to 5–50 ng of viral DNA per liter. In practice, processing of hundreds of liters of spring water generally yielded no more than 100 ng of DNA, much lower than is normally required for library construction. This low yield of virus precluded cesium chloride purification of the viral particles, as is commonly used for marine viral metagenomic library construction. Viral DNA also contains cytotoxic genes and modified nucleotides that induce host restriction systems. A linker-dependent, anonymous method of DNA amplification was used to access this diversity, allowing construction of 3–8 kb insert libraries with none of the potential modified nucleotides. This library construction method has been used in the analysis of several cultivated and uncultivated viral genomes (Breitbart et al. 2003, 2004a; Seguritan et al. 2003; Lindell et al. 2004; Paul et al. 2005; Bench et al. 2007) but never fully described. Viral DNA was physically sheared, and short (20 bp) linkers were ligated to the DNA fragments to serve as priming sites for PCR. Amplified fragments were cloned into a transcription-free pSMART vector to minimize cloning bias due to cytotoxic sequences (Godiska et al. 2005). The use of flanking synthetic linkers provides identical primer annealing sites for each viral template in the mixture, which significantly limits amplification bias. A noteworthy characteristic of this approach is that it selects exclusively for dsDNA viruses. All cultivated thermophilic bacteriophage and archaeal viruses have dsDNA genomes except certain Thermus-specific Inoviruses, which have ssDNA genomes (Yu et al. 2006). Notably, several viral nucleic acid preparations from these and other springs sampled as part of this study had RNase-digestible material (data not shown), suggesting that RNA viruses inhabit these hot spring environments.
A total of 28,883 Sanger sequence reads were determined from Bear Paw (7,685 reads) and Octopus (21,198 reads) hot springs. Paired-end reads averaged 981 nucleotides each or nearly 30 Mb total. Assuming an average genome size of 50 kb, which is supported by agarose gel electrophoresis of the viral genomic DNA (data not shown), this sequencing depth represents about 600 viral genomic equivalents. The quality of the libraries is highly dependent on the amount of DNA used in their construction. The sequence reads of the Octopus library contained very few anomalies that would suggest amplification bias or cloning artifacts. Some of the reads from the Bear Paw library were less random than the Octopus library, as demonstrated by several cases of sequence stacking.
Contaminating cellular DNA in viral DNA preparations was greatly reduced by filtration and nuclease treatment. Only viral preparations substantially free of microbial cells as judged by epifluorescence microscopy were used for library construction. Detection of rDNA sequences (5S, 16S, and 23S) in the libraries was used to identify contaminating cellular DNA. These sequences are absent in known viral genomes but highly conserved in microbial cells. A typical bacterial genome contains 15 rRNA genes (Coenye 2003). Most hyperthermophilic archaeal and bacterial genomes contain three to six rRNA genes, although the genomes of thermophilic Geobacillus that grow in the temperature range of Bear Paw contain up to 30 rRNA genes (Feng et al. 2007). BLASTn analysis identified only four rDNA sequences in the 10.4 microbial genome equivalents sequenced from the Octopus library (two 23S and two 16S) and eight in the 3.8 microbial genome equivalents from the Bear Paw library, suggesting viral enrichment was quite high, particularly for the Octopus library. This inference is supported by a high similarity to sequences of cultivated viruses (shown below) and a large number of BLASTx similarities to genes associated with viral functions. In particular, the hundreds of presumptive genes for viral functions, such as replication, transcription, translation, lysogeny, recombination, lysis, and structural proteins (Table 2), are consistent only with a predominately viral origin of the sequences.
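
The yield and depth estimates in this section follow from simple unit arithmetic. The Python sketch below reproduces them; the figures of 50 ag of DNA per virus, 981 nucleotides per read, and 50 kb per genome come from the text, while the function names and structure are assumptions made for illustration.

```python
ATTOGRAM_IN_GRAMS = 1e-18
NANOGRAMS_PER_GRAM = 1e9
ML_PER_LITER = 1_000


def viral_dna_ng_per_liter(viruses_per_ml, ag_per_virus=50):
    """Expected viral DNA (ng) in one liter of water at a given abundance."""
    grams = viruses_per_ml * ML_PER_LITER * ag_per_virus * ATTOGRAM_IN_GRAMS
    return grams * NANOGRAMS_PER_GRAM


def genome_equivalents(n_reads, mean_read_nt, genome_size_nt=50_000):
    """Return (total sequenced nucleotides, number of genome equivalents)."""
    total_nt = n_reads * mean_read_nt
    return total_nt, total_nt / genome_size_nt


low = viral_dna_ng_per_liter(1e5)
high = viral_dna_ng_per_liter(1e6)
print(f"Viral DNA per liter: {low:.0f} to {high:.0f} ng")

total_nt, equivalents = genome_equivalents(7_685 + 21_198, 981)
print(f"Total sequence: {total_nt / 1e6:.1f} Mb, "
      f"about {equivalents:.0f} genome equivalents at 50 kb")
```

Under these inputs the total is a little over 28 Mb, or somewhat fewer than 600 genome equivalents, which matches the rounded figures quoted above.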
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 2 Functional grouping of predicted genes in the viral metagenomes
                                                                 Bear Paw  Octopus
Total reads                                                      7,685     21,198
No BLASTx similarity                                             2,545     8,469
COGs functional category                                         Number of reads matching a keyword  Percent with a keyword match
                                                                 Bear Paw  Octopus                   Bear Paw  Octopus
F. Nucleotide transport and metabolism                           1,445     2,130                     35.09 %   37.81 %
J. Translation, ribosomal structure, and biogenesis              221       336                       5.37 %    5.96 %
K. Transcription                                                 278       325                       6.75 %    5.77 %
L. Replication, recombination and repair                         688       989                       16.71 %   17.55 %
O. Posttranslational modification, protein turnover, chaperones  181       213                       4.40 %    3.78 %
None virus specific                                              350       596                       8.50 %    10.58 %
No match to a keyword                                            955       1,045                     23.19 %   18.55 %
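
The percentage columns of Table 2 appear to be computed relative to the sum of the category rows in each library (4,118 reads for Bear Paw and 5,634 for Octopus) rather than to the total number of reads. The Python sketch below checks that reading of the table; the variable and function names are hypothetical, and the counts are transcribed from the table.

```python
# Read counts per category, transcribed from Table 2.
bear_paw = {
    "F": 1445, "J": 221, "K": 278, "L": 688, "O": 181,
    "None virus specific": 350, "No match to a keyword": 955,
}
octopus = {
    "F": 2130, "J": 336, "K": 325, "L": 989, "O": 213,
    "None virus specific": 596, "No match to a keyword": 1045,
}


def category_percentages(counts):
    """Percent of each category relative to the sum of all listed categories."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


for label, counts in (("Bear Paw", bear_paw), ("Octopus", octopus)):
    print(label, sum(counts.values()), category_percentages(counts))
```

With this denominator the recomputed values agree with the published percentages for every row, which supports the interpretation but does not prove that it is how the original authors derived them.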
respectively, suggesting that the appropriate machinery for lateral gene transfer exists in hot spring viral genomes (Canchaya et al. 2003).
Other sequence similarities provide evidence of ongoing gene transfer within these populations. Helicase genes shared among viruses and cells from all domains have been considered examples of nonorthologous replacement of cellular genes by viral genes (Filee et al. 2003). Hundreds of reads showed sequence similarity to the superfamily II helicases of a wide range of cells and viruses. For example, the 2 kb Octopus contig 158 had significant similarity to helicases of bacterial, archaeal, and eukaryotic cells as well as to phage and archaeal viruses (Table 3).
Also common in the metagenomic libraries are presumptive ribonucleotide reductases (14 and 50 in Bear Paw and Octopus springs, respectively) and thymidylate synthase (seven and 51, respectively) genes. The conservation of these genes between viral and cellular genomes of all domains and the biochemical activities of the gene products imply that viral genes played a key role in the transition from RNA-based to DNA-based genomes (Forterre 2005). DNA polymerase (pol) genes have also been proposed
The degree to which metagenomic reads assemble has been used to assess the diversity of the viral populations. Previous studies have used >95 % identity over 20 nucleotides as the assembly stringency (Breitbart et al. 2002, 2004a; Breitbart 2003; Angly et al. 2006). Using this criterion, the power law rank-abundance model built into the Phage Communities from Contig Spectrum tool (PHACCS, 5) predicted 1,440 and 1,310 viral types in Bear Paw and Octopus hot springs, respectively, with no one viral type representing more than about 2 % of the population (Table 4). For reference, 1,650, 3,350, 7,180, 7,340, and 2,390 viral genotypes were reported in estuarine, nearshore marine, open ocean, marine sediments, and fecal viral assemblages, respectively (Breitbart et al. 2002, 2003, 2004a; Angly et al. 2006; Bench et al. 2007), with no single viral species representing more than 2–3 % in any case.
There are several limitations in assessing actual numbers of viral species from metagenomic libraries. First, these models assume viral genomes evolve uniformly. However, different regions of viral genomes are clearly more conserved than others (Lindell et al. 2004). Genetic diversity outside the
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 4 Sequence assembly data and estimation of viral diversity
                       Bear Paw       Octopus       Totals
Sequence reads         7,685          21,198        28,883
                       Bear Paw 95 %  Octopus 95 %  Bear Paw 50 %  Octopus 50 %
Contigs assembled      6,191          13,543        4,850          4,788
Avg. reads per contig  1.239          3.129         1.587          4.427
Largest contig (nt)    3,503          4,554         8,007          35,089
Power law richness     1,440          1,310         548            283
Evenness score         0.946          0.954         0.933          0.936
Most abundant virus    2.14 %         1.88 %        3.93 %         4.88 %
Shannon-Wiener score   6.88           6.85          5.88           5.29
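
The diversity rows of Table 4 are related by standard formulas. The evenness scores are numerically consistent with Pielou's evenness, J = H' / ln(S), where H' is the Shannon-Wiener score and S is the richness. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, the power law exponent is an arbitrary placeholder (the fitted exponent is not given in the text), and this is not the PHACCS implementation.

```python
import math


def shannon_index(proportions):
    """Shannon-Wiener index H' (natural log) from relative abundances."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def pielou_evenness(shannon, richness):
    """Pielou's evenness J = H' / ln(S)."""
    return shannon / math.log(richness)


def power_law_proportions(richness, exponent):
    """Relative abundances where the abundance at rank r scales as r**(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, richness + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# (richness, Shannon-Wiener score) pairs transcribed from Table 4.
table4 = {
    "Bear Paw 95 %": (1440, 6.88),
    "Octopus 95 %": (1310, 6.85),
    "Bear Paw 50 %": (548, 5.88),
    "Octopus 50 %": (283, 5.29),
}
for label, (richness, shannon) in table4.items():
    print(label, round(pielou_evenness(shannon, richness), 3))

# Hypothetical community: 1,440 types with an assumed exponent of 0.5.
community = power_law_proportions(1440, 0.5)
print(round(shannon_index(community), 2), f"{max(community):.2%}")
```

The evenness values recomputed from the table agree with the published scores to within roughly 0.001. The final two lines only show how richness, Shannon-Wiener score, and the share of the most abundant type are linked for one assumed exponent; they are not an attempt to reproduce the fitted model.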
conserved regions is probably higher than these models indicate. Second, the generation of new viral species by mosaicism, modular evolution, or lateral gene transfer (Villarreal and DeFilippis 2000; Canchaya et al. 2003; Weinbauer and Rassoulzadegan 2004) would not be detected using assembly of <1 kb sequence reads. On the other hand, given the dynamic nature of viral genomes, this approach is well suited to a view of the diversity and evolution of viruses that considers genes or groups of genes rather than whole genomes. Finally, assembly at >95 % nucleotide identity fails to account for molecular diversity among related viral types, which is higher than that of cellular species. In fact such stringency would fail to associate viruses that, based on classical criteria (host range, morphology, replication lineages, and physicochemical and antigenic properties), are considered to be related (Lucchini and Brussow 1999; Hatfull et al. 2006; Kwan et al. 2006) although they may share as little as 50 % nucleotide identity over much of their genomes.
Lower Stringency Assemblies Reveal Population Heterogeneity
To accommodate genomic heterogeneity inherent to viral populations, sequences were also assembled at 50 % identity (Table 4). As expected, the numbers of viral types decreased to 548 and 283 in Bear Paw and Octopus, respectively. These lower stringency assemblies proved quite useful for associating sequences of related, but not identical, viral types and for studying diversity among these related viruses. At 95 % identity, the largest contigs were 3.5 and 4.6 kb for Bear Paw and Octopus, respectively (Table 4). At 50 % identity, Octopus reads assembled into 17 contigs of greater than 10 kb, including contigs of 35 kb and 19 kb, comprised of >1,000 reads each. In each case, reads were evenly distributed across the contigs. The 17 contigs of >10 kb comprise a total of 7.04 Mbp (33 % of total metagenomic sequence) or about 140 viral equivalents. The four strongest BLASTx hits to the 35 kb contig belonged to thermophilic crenarchaeal viruses Acidianus rod-shaped virus (ARV), Sulfolobus islandicus rod-shaped viruses 1 (SIRV1) and 2 (SIRV2), and Sulfolobus islandicus filamentous virus (SIFV) (Table 5). The only significant similarity for the 19 kb contig was to the thermophilic crenarchaeal virus, Pyrobaculum spherical virus (PSV). In the Bear Paw library, with roughly one third as many reads, the largest contig that assembled at 50 % identity was 8 kb. Five hundred thirty-four (7 %) of the reads assembled into 19 contigs >4 kb. These include 0.5 Mbp of reads or ten viral equivalents.
The larger composite contigs allow associations that were impossible at standard stringency. More than 200 million bases have been sequenced from marine viral metagenomic libraries, but only one small phage genome has been
Use of Viral Metagenomes from Yellowstone Hot Springs 689 U
reconstructed (Angly et al. 2006). To validate the low-stringency assemblies and to further study the molecular biology of the viruses, the 4 kb cognates of one contig of four reads that assembled at 50 % NAID were PCR amplified, cloned, and sequenced (Schoenfeld 2014). This confirms that at least this assembly accurately reflects the virome sequence. Furthermore, this contig includes an apparent replisome, and amplification based on the low-stringency assembly allows study of an operon that, due to its size, could not otherwise be recovered from the fragmentary metagenomic data.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 5 Numbers of 95 % contigs with tBLASTx similarities (E < 0.001) to the respective cellular genomes

                 Bear Paw   Octopus
Archaea
  Pyrobaculum      124        684
  Aeropyrum         62        626
  Sulfolobus        38        326
  Acidianus         25        185
Bacteria
  Aquifex          474      1,138

Certain contigs provide compelling evidence that the 50 % assemblies associate genuine orthologous sequences. An example is Bear Paw contig 327 (Fig. 2). Eleven open reading frames (ORFs) were identified by the GeneMark algorithm (Lukashin 1998). BLASTp analysis of each shows strongest similarity to the putative coding sequences of PSV (Haring et al. 2004). Nucleotide identities were as high as 88 %, gene order is perfectly preserved relative to the cultured virus, and gene overlap is identical between the composite contig and the cultivated virus. Interestingly, two different ORFs of the PSV genome, gp4 and gp5, are apparently related to each other, since both had significant similarity to the same region of the consensus contig. In both the cultured viral genome and the consensus contig, the gp7 PSV gene overlaps gp6 in the opposite orientation.

Contig 722 from the Octopus spring library provided a unique opportunity to associate population diversity of an assembled metagenome with the biochemistry of the gene products (Fig. 3). This 16.5 kb contig, assembled at 50 % identity, includes 187 reads (average coverage of 11 reads per nucleotide position). GeneMark predicted 26 ORFs of greater than 100 nucleotides, including an apparent replication operon. The genes with the strongest similarity to four of these ORFs encode primase, uracil DNA glycosylase, family B DNA polymerase, and nucleotide excision repair nuclease (dnaG, udg, polB, and ERCC4 genes, respectively). Homologs of these ORFs belong to crenarchaeal DNA replication/repair complexes (Roberts and White 2003; Dionne and Bell 2005; Barry and Bell 2006). The predicted polB gene showed 28 % identity to Pyrobaculum islandicus polB2 (Kahler and Antranikian 2000) and has an archaeal codon profile (data not shown). Sequences from three of the discrete clones that comprise the polB gene in this contig have been expressed in E. coli to produce a functional thermostable DNA polymerase (data not shown). This contig also contains apparent homologs to a zinc finger-like protein and a transposon-like integrase/resolvase (tnp), functions commonly associated with viruses and phages. Another ORF with highest similarity to the CRISPR-associated sequence cas4 (Haft et al. 2005) is unlikely to be part of a functional CRISPR system. Unlike authentic Cas sequences, this one is virus-derived and is not proximal to a CRISPR sequence or other typically associated sequences. More likely this gene is a separate member of the Cas4 COG, presumably a RecB-like exonuclease (Haft et al. 2005).

To correlate the level of sequence divergence with predicted gene function, SNP frequency was aligned to the 50 % assembly consensus sequence of the contig. Overall distribution of SNPs in the contig was 0.705 per 10 bp. Replication-associated genes showed noticeably lower molecular diversity than the other ORFs. SNP distribution in the dnaG, udg, polB, and ERCC4 homologs was 0.565, 0.617, 0.569, and 0.548 per 10 bp, respectively, while the distribution in the Zn finger, cas4, and thyA homologs was 0.979, 1.31, and 0.728 per 10 bp, respectively.
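The per-gene SNP densities reported for contig 722 can be illustrated with a small calculation. The sketch below is only an assumption about how such a value might be computed: it takes reads already aligned to a consensus (a hypothetical toy format in which `-` marks positions a read does not cover), divides the mismatches at each position by the number of reads covering it, and sums the result over 10 bp windows. The function name and input format are illustrative and are not taken from the original study.

```python
from typing import List


def snps_per_window(consensus: str, reads: List[str], window: int = 10) -> List[float]:
    """Return coverage-normalized mismatch density for each window.

    Every read must be as long as the consensus; '-' marks positions
    that the read does not cover (toy pre-aligned format, an assumption).
    """
    per_site = []
    for i, ref in enumerate(consensus):
        # Bases from reads that actually cover position i.
        covering = [r[i] for r in reads if r[i] != "-"]
        if not covering:
            per_site.append(0.0)  # no coverage, no evidence of a SNP
            continue
        mismatches = sum(1 for base in covering if base != ref)
        per_site.append(mismatches / len(covering))
    # Sum the per-site values over consecutive windows.
    return [sum(per_site[s:s + window]) for s in range(0, len(per_site), window)]


consensus = "ACGTACGTACGTACGTACGT"
reads = [
    consensus,                    # read identical to the consensus
    "ACTTACGTAC" + "-" * 10,      # one mismatch, covers the first window only
    "-" * 10 + "GTACGTACGA",      # one mismatch, covers the second window only
]
densities = snps_per_window(consensus, reads)
print(densities)  # [0.5, 0.5]
```

Averaging such window values over the coordinates of each predicted ORF would give one number per gene, comparable in form to the values quoted above.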
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 2 Genes and gene order are highly conserved between a cultured crenarchaeal virus and a consensus contig from the Bear Paw library. Contig 372 (5,492 bp, 71 reads) was assembled at 50 % identity from the Bear Paw library. Open reading frames identified by GeneMark algorithm were compared by BLASTp to proteins in GenBank. Similarities to Pyrobaculum spherical virus proteins are shown with percent coding identity. The gene names are based on the annotation in GenBank and are named in order of their location on the viral chromosome. Direction of transcription is indicated by the arrows (Reproduced with permission from Schoenfeld et al. (2008))
[Figure 3: contig of 16,542 bp, 187 reads, annotated "87% two reads per strand". Top panel: read coverage. Middle panel: SNPs per 10 bp (vertical axis, 0 to 4) against position in bp (horizontal axis, 0 to 16,000). Bottom panel: ORFs.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 3 Alignment of nucleotide polymorphisms with coding sequences in a 16.5 kb consensus contig from Octopus hot spring. Contig 722 was assembled at 50 % identity from the Octopus library. Sequence coverage is shown on the top, with each line representing a separate read. Single-nucleotide polymorphisms per ten base pairs were normalized to the number of reads covering the respective nucleotide (middle) and are aligned with predicted open reading frames from the consensus sequence in the contig and the gene name of the strongest BLASTx similarity (bottom). Direction of transcription is shown by the arrows. Similarities to known genes were identified by BLASTp (Reproduced with permission from Schoenfeld et al. (2008))
Similarities to Known Viral and Microbial Genomes Imply Phylogeny

tBLASTx analysis was used to infer phylogenetic origin of the 95 % assembled contig sequences. A large fraction of the contigs (41 % from Bear Paw and 63 % from Octopus) had no tBLASTx similarity (E < 0.001) to any sequence in GenBank (Fig. 4). Although it is typical for viral metagenomic libraries analyzed in this way to have a high proportion of sequences without identifiable homologs, these libraries contained
[Figure 4: pie charts of contig classification. Panel a: Bear Paw. Panel b: Octopus. Categories: no similarity (41.2 % in Bear Paw, 62.8 % in Octopus), bacteria, archaea, eukarya, phage, archaeal virus, eukaryotic virus.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 4 Broad classification of viral metagenomic contigs based on tBLASTx similarities. Contigs assembled at 95 % identity from Bear Paw and Octopus reads (Panel a and b, respectively) were compared to sequences in GenBank to infer phylogeny. Shown are frequencies of contigs with no significant sequence similarity in GenBank (E < 0.001) and those with sequence similarity to Bacteria, Archaea, Eukarya, and their respective viruses (Reproduced with permission from Schoenfeld et al. (2008))
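Genome-signature methods such as the GSPC analysis discussed in this entry start from oligonucleotide usage counts. The sketch below shows only that first step, under stated assumptions: a sliding-window count of tetranucleotides over both strands, normalized to frequencies, and a plain Euclidean distance between two profiles. The function names and the distance measure are illustrative choices; the statistics of the published method are described in Pride and Schoenfeld (2008) and are not reproduced here.

```python
from collections import Counter
from itertools import product
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 words


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tetra_profile(seq: str) -> dict:
    """Tetranucleotide frequencies counted on both strands."""
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            word = strand[i:i + 4]
            if set(word) <= set("ACGT"):  # skip windows with N or gaps
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence too short or contains no valid tetranucleotides")
    return {word: counts[word] / total for word in ALL_TETRAMERS}


def profile_distance(p: dict, q: dict) -> float:
    """Euclidean distance between two profiles (illustrative measure only)."""
    return math.sqrt(sum((p[w] - q[w]) ** 2 for w in ALL_TETRAMERS))


contig = "ACGTTGCAAGGCTTAACCGGATATCGCGTA"
profile = tetra_profile(contig)
print(len(profile), round(sum(profile.values()), 6))
```

Counting both strands makes a contig and its reverse complement give the same profile. It also doubles the number of windows: a 1.9 kb contig yields 2 × (1,900 − 3) = 3,794 windows, an average of about 14.8 per tetranucleotide.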
approaches use differences in codon usage, which are generally conserved between hosts and viruses (Lucks et al. 2008). Genome signature-based phylogenetic classification (GSPC) analyzes differences in di-, tri-, and tetranucleotide utilization patterns to associate phylogenetic relationships, which are influenced by codon usage bias, as a basis for correlating hosts and viruses (Pride et al. 2006; Yooseph and Sutton 2008).

A GSPC study based on tetranucleotide utilization in the Yellowstone viral metagenomes from Bear Paw and Octopus hot springs was reported in Pride and Schoenfeld (2008), which includes the details of the analysis and the statistical support. To be statistically significant, the analysis used only contigs >1.9 kb (3.8 kb when analyzing both strands) assembled at 95 % identity. Contigs of this size should include 95 % of tetranucleotide combinations at least 7.5 times. Approximately 19.3 % and 39.0 % of the Bear Paw and Octopus metagenomic contigs, respectively, representing the more abundant viruses, conformed to these criteria. The GSPC analysis classified 20 of 22 Bear Paw contigs and 69 of 70 Octopus contigs, a much higher proportion of the reads than either BLASTx or PhyloPythia with significantly stronger statistical support (see Pride and Schoenfeld 2008). The method is useful to group contigs by relatedness, which might assist assembly, and to infer phylogenies and hosts. The GSPC analysis suggests that Octopus viruses belong primarily to archaeal families Globuloviridae and Fuselloviridae (56 of 69) while Bear Paw members belong primarily to the bacteriophage order Caudovirales (includes Myoviridae, Podoviridae, and Siphoviridae) (17 of 20). The analysis also estimates that 80 % of the Octopus contigs have archaeal signatures, while 77 % of Bear Paw contigs had bacterial signatures, a finding consistent with BLASTx analysis.

The apparent predominance of archaeal viruses seems inconsistent with the reported dominance of Octopus sediments and filaments by Bacteria (Blank et al. 2002; Rachel et al. 2002). Furthermore, the viral populations appear much more diverse than would be predicted based on the low diversity of microbes in the sediments and filaments. The BLASTx, GSPC, and diversity data all suggest that the viruses are infecting hosts other than the sedentary surface bacteria, implying significant proliferation either in the pool or in the vent. The viruses used in this study were planktonic isolates collected close to the outflow source immediately after emergence, making it even less likely that the hosts were surface microbes in the filament, sediments, or water column.

Alignment of the Metagenome to Cultivated Viral Genomes

Overall, only 3.4 % of the high stringency (95 % assembly) contigs from the two libraries showed similarity to known viral sequences. Most of these similarities were to cultivated thermophilic crenarchaeal viruses (Table 6). Similarity to the only non-thermophilic virus, phage Twort (Kwan et al. 2005), was limited to the helicase gene, which shares similarity with that of SIFV (see above). The two libraries shared comparable frequencies of sequence similarity to archaeal viruses and bacteriophage. Notable exceptions were Acidianus rod-shaped virus and Sulfolobus islandicus rod-shaped virus 1 and 2, for which the Octopus library showed a higher frequency of homology than the Bear Paw library, and S. tengchongensis spindle-shaped virus 1, for which homology was less common in Octopus than in Bear Paw.

Alignment of the metagenomes to whole genome sequences of six cultivated thermophilic viruses revealed striking conservation of certain sequences (Fig. 5). Almost the entire genome of Pyrobaculum spherical virus (PSV) has similarity to sequences in both metagenomic libraries, with median identities of 60 % and 51 % to the Bear Paw and Octopus, respectively. Sequence similarities to the other crenarchaeal viruses and to bacteriophage YS40 were limited to a few specific ORFs, but the degree of similarity was relatively high in those regions. Interestingly, nearly all of the ORFs showing high levels of homology are among the few thermophilic
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 6 Numbers of 95 % contigs with tBLASTx similarities to cultured viral sequences

                                                                                                                          Number of tBLASTx similarities
Virus                                                               References                          Accession            Bear Paw   Octopus
ARV, Acidianus rod-shaped virus                                     Vestergaard et al. 2005             AJ875026                36        228
SIRV 1 and 2, Sulfolobus islandicus rod-shaped virus 1 and 2        Blum et al. 2001; Peng et al. 2001  AJ344259, AJ414696      30        217
PSV, Pyrobaculum spherical virus                                    Haring et al. 2004                  AJ635161                44        152
SIFV, S. islandicus filamentous virus                               Arnold et al. 2000                  AF440571                 7         46
STSV1, S. tengchongensis spindle-shaped virus 1                     Xiang et al. 2005                   AJ783769                26         22
ATV, Acidianus two-tailed virus                                     Prangishvili et al. 2006b           AJ888457                 8         17
YS40, Thermus thermophilus YS40 phage                               Naryshkina et al. 2006              DQ997624                15         41
TTSV1, Thermoproteus tenax spherical virus 1                        Ahn et al. 2006                     AY722806                 6         12
Twort, Staphylococcus phage Twort                                   Kwan et al. 2005                    AY954970                 4         21
crenarchaeal virus genes for which a function has been assigned or inferred (Fig. 5 and references therein). These regions of high conservation are genes associated with virion components, DNA replication, transposition, recombination, or nucleic acid metabolism.

The degree of alignment to cultivated viruses was surprising. PSV was isolated from Obsidian hot spring (74 °C, pH 5.6), about 30 km away from both Octopus and Bear Paw. The geochemistry of this thermal feature is distinct from the springs in this study (Shock et al. 2005), and life within includes a highly diverse population of Archaea and Bacteria (Barns et al. 1994; Hugenholtz et al. 1998), most of which have not been detected in Octopus hot spring (Reysenbach et al. 1994; Blank et al. 2002) or elsewhere. In contrast, Thermoproteus tenax spherical virus, which is quite similar to PSV in terms of sequence, morphology, and habitat (Ahn et al. 2006), had very limited similarity to the YNP viral metagenomic sequences (not shown). The other viruses showing high similarity to the metagenomic sequences were isolated on different continents and, with the exception of YS40, occurred in highly acidic springs. This observation is more remarkable because the microbial populations of acidic and neutral hot springs are quite distinct (Reysenbach et al. 2002). The one other virus cultivated from Yellowstone, SSV-RH (Wiedenheft et al. 2004), had no significant tBLASTx similarity to any of the metagenomic samples.

Identification of CRISPR Spacer Cognate Sequences in the Octopus Viral Metagenome

Evidence has been accumulating recently associating CRISPR (clustered regularly interspaced short palindromic repeats) systems with acquired resistance to lateral gene transfer from viruses and episomal elements (reviewed by van der Oost et al. 2009). CRISPRs were first discovered as repetitive sequences found in most bacterial and virtually all archaeal genomes. These systems are functionally analogous, but not homologous, to eukaryotic RNA interference and appear to limit the lateral transfer of genes by targeting them for nucleolytic degradation prior to their integration into the genome. The emerging view is that sequences in the repeat region of the CRISPR system correspond to sequences in viral or episomal genes and are transcribed in the host cell as part of a targeting system to neutralize viral infections. However, little direct evidence of conservation between the CRISPR spacer sequences and viral genomes has been found in natural environments. The first
[Figure 5: six panels, one per cultivated viral genome: PSV (28,337 bp), SIRV1 (32,308 bp), ARV (24,655 bp), ATV (62,730 bp), STSV (75,294 bp), and YS40 (152,372 bp). Vertical axis: percent coding sequence identity (0 to 100 %). Horizontal axis: position on the viral genome in bp. Legend: Bear Paw, Octopus. Conserved regions are labeled with gene function abbreviations.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 5 Alignment of Octopus and Bear Paw viral metagenomic library contigs with six cultured virus genomes. Contigs assembled at >95 % identity from the viral metagenomic libraries were compared by tBLASTx to the genomes of PSV, SIRV1, ARV, ATV, STSV, and YS40. Each bar represents the alignment of a unique metagenomic sequence to the indicated location on the cultivated viral genome, shown on the horizontal axis. Percent coding sequence identities are shown in the vertical axis. The threshold for inclusion is E-value < 10^-3. Red bars indicate Bear Paw alignments; blue bars indicate Octopus alignments. Also shown are the known or predicted functions of the conserved coding sequences (rep replication related, vir virion component, gt glycosyltransferase, tnp transposase, cp coat protein, dam adenine DNA methylase, ts thymidylate synthase, dut dUTPase, dcm cytosine DNA methylase, hel helicase, rec recombinase, rnr ribonucleotide reductase) (Arnold et al. 2000; Blum et al. 2001; Peng et al. 2001; Haring 2004; Kessler et al. 2004; Vestergaard et al. 2005; Xiang 2005; Ahn et al. 2006; Naryshkina 2006; Prangishvili 2006b) (Reproduced with permission from Schoenfeld et al. (2008))
Synechococcus genomes isolated from the same hot spring within 2 years of one another, provided a unique opportunity to identify the genes targeted by a CRISPR system and observe coevolution of a CRISPR system and its target in host and viral genomes (Heidelberg et al. 2009). The two Synechococcus strains contained sequences with the hallmarks of a CRISPR system. Like other such sequences, the CRISPR spacers had no BLASTn or tBLASTx similarity to any sequences in GenBank. When compared to the microbial metagenome, 180 elements had similarity to CRISPR spacer sequences. Of these, four shared similarity with 23 sequences in the Octopus viral metagenome.

Interestingly, two CRISPR spacer sequences shared by the isolates and the microbial and viral metagenomes had similarity to different regions of one gene in the viral metagenome. The assembly of 23 reads covering this single gene indicates that this was one of the most abundant and conserved elements in the entire metagenome, which, by itself, would seem to make it an attractive target for a presumed antiviral system. The data provided by the viral metagenome reveal that the target of the spacers was a likely lysozyme gene, the conservation of which may be explained by evolutionary constraints due to the interaction with a host cell wall. Inspection of the lys gene assembly revealed the apparent coevolution of the CRISPR system and its viral target (Table 7). Of the 23 viral metagenome reads, only five had detectable nucleic acid identities (NAID).

The sequence of one CRISPR spacer is shown in Line 1. Shown below are sequences from the
virome with similarity to this CRISPR spacer or the same region in reads identified by similarity to a second independent CRISPR spacer or a translation of one of these. Conserved nucleotides are shown as dots; those that diverge from the CRISPR 1 spacer are shown as letters. The percent nucleic acid identities (% NAID) to CRISPR 1 and the percent amino acid similarity and identity (% AASIM and % AAID, respectively) to the predicted translation of CRISPR 1 are also shown (adapted from Heidelberg et al. 2009). The remainder had sequence variances that reduced NAID to as low as 70 %; however, all of these nucleotide variations were silent or conservative with respect to the amino acid sequence, which would likely allow the sequence to evade targeting by the CRISPR system, but not affect the enzymatic function of the gene product. These data suggest a high rate of coevolution or "germ warfare" between the viruses and their hosts in this extreme environment.

Similarities Between the Two Hot Springs' Viral Populations

The two libraries were compared to one another to determine any variation between the viral populations in the two very different thermal environments. Contigs assembled at 95 % from the two libraries were compared to each other by tBLASTx and BLASTn (Table 8). The differences between the two analyses should be the result of noncoding nucleotides. Since gene densities are high in viral genomes and there is very little intergenic sequence, these differences are mainly due to silent codon variations, which should be largely free of selective pressure.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 8 Nucleotide and coding similarities between the viral populations of Octopus and Bear Paw hot springs

                                                                         tBLASTx         BLASTn
Frequency (number) of Octopus contigs with similarity to Bear Paw contigs  43 % (5,843)    21 % (2,876)
Frequency (number) of Bear Paw contigs with similarity to Octopus contigs  26 % (1,593)    21 % (1,339)
Average length of similarity (nucleotides)                                 298             175
Average identity                                                           74 %            87 %
Average expect value                                                       1.38E-05        3.00E-05

Most remarkable is the similarity between the two libraries by either analysis. By tBLASTx, 5,843 of the Octopus contigs (43 %) and 1,593 of the Bear Paw contigs (26 %) shared amino acid coding similarity. By BLASTn, 2,876 (21 %) and 1,339 (21 %) of the respective contigs shared nucleotide similarity. The average percent identities were 74 % and 87 % and the expect values were 1.38E-05 and 3.00E-05, although the average length of sequence alignment (298 and 175 bp) was modest in both cases. This level of similarity did not allow extensive assembly of contigs from the two libraries, even at 50 % identity, presumably due to the short lengths of alignment (not shown). Taken together, these data suggest a mosaic-like pattern of overlap of much of the coding content in the two hot springs, although entire viral genomes or even entire genes are not necessarily fully conserved. The fact that the degrees of identity at the nucleotide level and at the translational level were relatively close suggests that this overlap is not due solely to selective pressure on the coding sequence, but must be explained by other mechanisms. This extensive conservation of viral sequences between the two hot springs in this study is surprising, given that microbial populations are highly temperature dependent (Reysenbach et al. 2002) and the surface temperatures of these hot springs differ by 19 °C (74 °C vs. 93 °C).

Conservation and Distribution of Viruses in Thermal Environments

Taken together, the above analyses suggest that (1) viral populations in the water columns are largely independent of microbial populations reported in the pools and (2) viral genomes, particularly certain genes, are more conserved both
regionally and globally than might have been predicted. The regional and global conservation of viral sequences is an intriguing area for further study. There are examples of globally distributed genes among marine viral assemblages (Breitbart and Rohwer 2005; Short and Suttle 2005). Since the oceans are contiguous across the earth, an obvious distribution mechanism exists. Groups of highly similar Sulfolobus viruses (Wiedenheft et al. 2004) and Thermus phages (Yu 2006) have been isolated from thermal springs on different continents. In these cases, viruses were isolated from environments of similar pH and temperature and were cultivated on the same host under similar laboratory conditions. Gene homologs to these viruses were detected despite the absence of these selective conditions. Conversely, most crenarchaeal virus morphotypes have been detected in enrichments from YNP (Rice et al. 2001; Rachel et al. 2002; Wiedenheft et al. 2004); however, little is known about conservation of genes in these enrichments.

The mechanism and basis of this conservation of viral sequence is open to speculation. It is possible that viruses sharing common genes adapt to the different host populations of the environment. Alternatively, hot springs may be inoculated by airborne viruses from other springs. It is also possible that the viruses acquire genes from mesophilic viruses, although this explanation has no support in this study. Lineages of conserved viral genes may be older than the separation of the continents. Another explanation is proliferation of the viruses deeper in the vent. Thermophilic Bacteria and Archaea, potential hosts for viruses, have been detected in thermal aquifers several km beneath the earth's surface at abundances similar to those measured in this study (Moser et al. 2005) and many are distributed worldwide. While it is impossible to separate the contribution of the subsurface viruses from any proliferation at the surface in the two pools in this study, samples from thermal springs with no pool at all, collected within seconds of their emergence, have viral abundances similar to or somewhat higher than those measured in this report (Breitbart et al. 2004b), suggesting subsurface proliferation is at least a significant contributor to viral populations at the surface. Subsurface proliferation of viruses would also explain the apparent disconnect between the planktonic viral populations in the pool and the reported sedentary microbial populations, described above. An implication of subsurface proliferation of viruses is that the habitable portion of the subterranean aquifer could be continuous across much of the Yellowstone caldera or even much larger areas. A second implication is that, given the higher pressures in the vents, the temperature limit of life in the subterrestrial aquifers could significantly exceed the temperatures measured at the surface.

Computer Analysis

Availability of computer programs is described in the original publications (Schoenfeld et al. 2008; Heidelberg et al. 2009; Pride and Schoenfeld 2008).

References

Ahn DG, Kim SI, Rhee JK, Kim KP, Pan JG, et al. TTSV1, a new virus-like particle isolated from the hyperthermophilic crenarchaeote Thermoproteus tenax. Virology. 2006;351:280–90.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320:1047–50.
Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005;6:41.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4:e368.
Arnold HP, Zillig W, Ziese U, Holz I, Crosby M, et al. A novel lipothrixvirus, SIFV, of the extremely thermophilic crenarchaeon Sulfolobus. Virology. 2000;267:252–66.
Barns SM, Fundyga RE, Jeffries MW, Pace NR. Remarkable archaeal diversity detected in a Yellowstone National Park hot spring environment. Proc Natl Acad Sci USA. 1994;91:1609–13.
Barry ER, Bell SD. DNA replication in the archaea. Microbiol Mol Biol Rev. 2006;70:876–87.
Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, et al. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73:7629–41.
Blank CE, Cady SL, Pace NR. Microbial composition of near-boiling silica-depositing thermal springs throughout Yellowstone National Park. Appl Environ Microbiol. 2002;68:5123–35.
Blum H, Zillig W, Mallok S, Domdey H, Prangishvili D. The genome of the archaeal virus SIRV1 has features in common with genomes of eukaryal viruses. Virology. 2001;281:6–9.
Breitbart M, Rohwer F. Here a virus, there a virus, everywhere the same virus? Trends Microbiol. 2005;13:278–84.
Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. 2002;99:14250–5.
Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, et al. Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol. 2003;185:6220–3.
Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci. 2004a;271:565–74.
Breitbart M, Wegley L, Leeds S, Schoenfeld T, Rohwer F. Phage community dynamics in hot springs. Appl Environ Microbiol. 2004b;70:1633–40.
Brock TD, Brock ML. Measurement of steady-state growth rates of a thermophilic alga directly in nature. J Bacteriol. 1968;95:811–5.
Brown JR, Doolittle WF. Archaea and the prokaryote-to-eukaryote transition. Microbiol Mol Biol Rev. 1997;61:456–502.
Canchaya C, Fournous G, Chibani-Chennoufi S, Dillmann ML, Brussow H. Phage as agents of lateral gene transfer. Curr Opin Microbiol. 2003;6:417–24.
Cann AJ, Fandrich SE, Heaphy S. Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes. 2005;30:151–6.
Coenye T, Vandamme P. Intragenomic heterogeneity between multiple 16S ribosomal RNA operons in sequenced bacterial genomes. FEMS Microbiol Lett. 2003;228:45–9.
Culley AI, Lang AS, Suttle CA. The complete genomes of three viruses assembled from shotgun libraries of marine RNA virus communities. Virol J. 2007;4:69.
Daubin V, Ochman H. Start-up entities in the origin of new genes. Curr Opin Genet Dev. 2004;14:616–9.
Dionne I, Bell SD. Characterization of an archaeal family 4 uracil DNA glycosylase and its interaction with PCNA and chromatin proteins. Biochem J. 2005;387:859–63.
Feng L, Wang W, Cheng J, Ren Y, Zhao G, et al. Genome and proteome of long-chain alkane degrading Geobacillus thermodenitrificans NG80-2 isolated from a deep-subsurface oil reservoir. Proc Natl Acad Sci USA. 2007;104:5602–7.
Filee J, Forterre P, Sen-Lin T, Laurent J. Evolution of DNA polymerase families: evidences for multiple gene exchange between cellular and viral proteins. J Mol Evol. 2002;54:763–73.
Filee J, Forterre P, Laurent J. The role played by viruses in the evolution of their hosts: a view based on informational protein phylogenies. Res Microbiol. 2003;154:237–43.
Forterre P. The two ages of the RNA world, and the transition to the DNA world: a story of viruses and cells. Biochimie. 2005;87:793–803.
Forterre P. The origin of viruses and their possible roles in major evolutionary transitions. Virus Res. 2006;117:5–16.
Fournier RO. Geochemistry and dynamics of the Yellowstone National Park Hydrothermal System. In: Inskeep W, editor. Geothermal biology and geochemistry in YNP. Bozeman: Thermal Biology Institute; 2005.
Geslin C, Le Romancer M, Erauso G, Gaillard M, Perrot G, et al. PAV1, the first virus-like particle isolated from a hyperthermophilic euryarchaeote, "Pyrococcus abyssi". J Bacteriol. 2003;185:3888–94.
Godiska R, Patterson M, Schoenfeld T, Mead D. Beyond pUC: vectors for cloning unstable DNA. In: Kieleczawa, editor. DNA sequencing: optimizing the process and analysis. Sudbury: Jones and Bartlett Publishers; 2005.
Gold T. The deep, hot biosphere. Proc Natl Acad Sci USA. 1992;89:6045–9.
Haft DH, Selengut J, Mongodin EF, Nelson KE. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol. 2005;1:e60.
Haring M, Peng X, Brugger K, Rachel R, Stetter KO, et al. Morphology and genome organization of the virus PSV of the hyperthermophilic archaeal genera Pyrobaculum and Thermoproteus: a novel virus family, the Globuloviridae. Virology. 2004;323:233–42.
Hatfull GF, Pedulla ML, Jacobs-Sera D, Cichon PM, Foley A, et al. Exploring the mycobacteriophage metaproteome: phage genomics as an educational platform. PLoS Genet. 2006;2:e92.
Heidelberg JF, Nelson WC, Schoenfeld T, Bhaya D. Germ warfare in a microbial mat community: CRISPRs provide insights into the co-evolution of host and viral genomes. PLoS ONE. 2009;4:e4169.
Hjörleifsdottir SH, Hreggvidsson GO, Fridjonsson OH, Aevarsson A, Kristjansson JK. Bacteriophage RM 378 of a thermophilic host organism. US Patent; 2002.
Horvath P, Romero DA, Coute-Monvoisin AC, Richards M, Deveau H, et al. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J Bacteriol. 2008;190:1401–12.
Use of Viral Metagenomes from Yellowstone Hot Springs 699 U
Hugenholtz P, Pitulle C, Hershberger KL, Pace NR. Novel Noble RT, Fuhrman JA. Use of SYBR Green I for rapid
division level bacterial diversity in a Yellowstone hot epifluorescence counts of marine viruses and bacteria.
spring. J Bacteriol. 1998;180:366–76. Aquat Microb Ecol. 1998;14:113–8.
Jahnke LL, Eder W, Huber R, Hope JM, Hinrichs KU, Paul JH, Williamson SJ, Long A, Authement RN, John D,
et al. Signature lipids and stable carbon isotope et al. Complete genome sequence of phiHSIC,
analyses of Octopus Spring hyperthermophilic com- a pseudotemperate marine phage of Listonella pelagia.
munities compared with those of Aquificales represen- Appl Environ Microbiol. 2005;71:3311–20.
tatives. Appl Environ Microbiol. 2001;67:5179–89. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T,
Kahler M, Antrankian G. Cloning and characterization of Wadsworth C, et al. Origins of highly mosaic
a family B DNA polymerase from the hyperthermo- mycobacteriophage genomes. Cell. 2003;113:171–82.
philic crenarchaeon Pyrobaculum islandicum. Peng X, Blum H, She Q, Mallok S, Brugger K,
J Bacteriol. 2000;182:655–63. et al. Sequences and replication of genomes of the
Kessler A, Brinkman AB, van der Oost J, Prangishvili archaeal rudiviruses SIRV1 and SIRV2: relationships
D. Transcription of the rod-shaped viruses SIRV1 to the archaeal lipothrixvirus SIFV and some eukaryal
and SIRV2 of the hyperthermophilic archaeon viruses. Virology. 2001;291:226–34.
sulfolobus. J Bacteriol. 2004;186:7745–53. Prangishvili D, Garrett RA. Exceptionally diverse
Kwan T, Liu J, DuBow M, Gros P, Pelletier J. The com- morphotypes and genomes of crenarchaeal hyperther-
plete genomes and proteomes of 27 Staphylococcus mophilic viruses. Biochem Soc Trans. 2004;32:204–8.
aureus bacteriophages. Proc Natl Acad Sci Prangishvili D, Garrett RA. Viruses of hyperthermophilic
USA. 2005;102:5174–9. Crenarchaea. Trends Microbiol. 2005;13:535–42.
Kwan T, Liu J, Dubow M, Gros P, Pelletier J. Comparative Prangishvili D, Garrett RA, Koonin EV. Evolutionary
genomic analysis of 18 Pseudomonas aeruginosa bac- genomics of archaeal viruses: unique viral genomes
teriophages. J Bacteriol. 2006;188:1184–7. in the third domain of life. Virus Res.
Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, 2006a;117:52–67.
Rohwer F, et al. Transfer of photosynthesis genes to Prangishvili D, Vestergaard G, Haring M, Aramayo R,
and from Prochlorococcus viruses. Proc Natl Acad Sci Basta T, et al. Structural and genomic properties of
USA. 2004;101:11013–8. the hyperthermophilic archaeal virus ATV with an
Lucchini S, Desiere F, Brussow H. Comparative genomics extracellular stage of the reproductive cycle. J Mol
of Streptococcus thermophilus phage species supports Biol. 2006b;359:1203–16.
a modular evolution theory. J Virol. 1999;73:8647–56. Pride DT, Schoenfeld T. Genome signature analysis of
Lucks JB, Nelson DR, Kudla GR, Plotkin JB. Genome thermal virus metagenomes reveals Archaea and ther-
landscapes and bacteriophage codon usage. PLoS mophilic signatures. BMC Genomics. 2008;9:420.
Comput Biol. 2008;4:e1000001. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence
Lukashin AV, Borodovsky M. GeneMark.hmm: new solu- of host-virus co-evolution in tetranucleotide usage
tions for gene finding. Nucleic Acids Res. patterns of bacteriophages and eukaryotic viruses.
1998;26:1107–15. BMC Genomics. 2006;7:8.
Martin A, Yeats S, Janekovic D, Reiter WD, Aicher W, Rachel R, Bettstetter M, Hedlund BP, Haring M,
et al. SAV 1, a temperate u.v.-inducible DNA Kessler A, et al. Remarkable morphological diversity
virus-like particle from the archaebacterium of viruses and virus-like particles in hot terrestrial
Sulfolobus acidocaldarius isolate B12. EMBO J. environments. Arch Virol. 2002;147:2419–29.
1984;3:2165–8. Reysenbach AL, Wickham GS, Pace NR. Phylogenetic
McCleskey RB, Ball JW, Nordstrom DK, Holloway JM, analysis of the hyperthermophilic pink filament com-
Taylor HE. Water-chemistry data for selected hot munity in Octopus Spring, Yellowstone National Park.
springs, geysers, and streams in Yellowstone National Appl Environ Microbiol. 1994;60:2113–9.
Park, Wyoming, 2001–2002. U.S. Geological Survey Reysenbach AL, Gotz D, Yernool D. Microbial diversity
Open-File Report 2004-1316; 2004. of marine and terrestrial thermal springs. In: Staley JT,
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Reysenbach AL, editors. Biodiversity of microbial
Rigoutsos I. Accurate phylogenetic classification of life. New York: Wiley Liss; 2002. U
variable-length DNA fragments. Nat Methods. Rice G, Stedman K, Snyder J, Wiedenheft B, Willits D,
2007;4:63–72. et al. Viruses from extreme thermal environments.
Moser DP, Gihring TM, Brockman FJ, Fredrickson JK, Proc Natl Acad Sci USA. 2001;98:13341–5.
Balkwill DL, et al. Desulfotomaculum and RobertsJ A, Bell SD, White MF. An archaeal XPF repair
Methanobacterium spp. dominate a 4– to 5-km-deep endonuclease dependent on a heterotrimeric PCNA.
fault. Appl Environ Microbiol. 2005;71:8773–83. Mol Microbiol. 2003;48:361–71.
Naryshkina T, Liu J, Florens L, Swanson SK, Pavlov AR, Sakaki Y, Oshima T. Isolation and characterization of
et al. Thermus thermophilus bacteriophage phiYS40 a bacteriophage infectious to an extreme thermophile,
genome and proteomic characterization of virions. Thermus thermophilus HB8. J Virol. 1975;15:
J Mol Biol. 2006;364:667–77. 1449–53.
U 700 Use of Viral Metagenomes from Yellowstone Hot Springs
Schoenfeld T, Patterson M, Richardson PM, Wommack Vestergaard G, Haring M, Peng X, Rachel R, Garrett RA,
KE, Young M, et al. Assembly of viral metagenomes et al. A novel rudivirus, ARV1, of the hyperthermo-
from Yellowstone hot springs. Appl Environ philic archaeal genus Acidianus. Virology.
Microbiol. 2008;74:4164–74. 2005;336:83–92.
Seguritan V, Feng IW, Rohwer F, Swift M, Segall Villarreal LP, DeFilippis VR. A hypothesis for DNA
AM. Genome sequences of two closely related viruses as the origin of eukaryotic replication proteins.
Vibrio parahaemolyticus phages, VP16T and VP16C. J Virol. 2000;74:7079–84.
J Bacteriol. 2003;185:6434–47. Ward DM, Ferris MJ, Nold SC, Bateson MM. A natural
Shock EL, Holland M, Meyer-Dombard DR, Amend view of microbial biodiversity within hot spring
JP. Geochemical sources of energy for microbial cyanobacterial mat communities. Microbiol Mol Biol
metabolism in hydrothermal ecosystems: Obsidian Rev. 1998;62:1353–70.
Pool, Yellowstone National Park. In: Inskeep WP, Weinbauer MG, Rassoulzadegan F. Are viruses driving
McDermott TR, editors. Geothermal biology and microbial diversification and diversity? Environ
geochemistry in YNP. Bozeman: Thermal Biology Microbiol. 2004;6:1–11.
Institute; 2005. Wen K, Ortmann AC, Suttle CA. Accurate estimation of
Short CM, Suttle CA. Nearly identical bacteriophage viral abundance by epifluorescence microscopy. Appl
structural gene sequences are widely distributed in Environ Microbiol. 2004;70:3862–7.
both marine and freshwater environments. Appl Envi- Wiedenheft B, Stedman K, Roberto F, Willits D, Gleske
ron Microbiol. 2005;71:480–6. AK, et al. Comparative genomic analysis of hyperther-
Snyder JC, Stedman K, Rice G, Wiedenheft B, Spuhler J, mophilic archaeal Fuselloviridae viruses. J Virol.
et al. Viruses of hyperthermophilic Archaea. Res 2004;78:1954–61.
Microbiol. 2003;154:474–82. WommackKEand Colwell RR. Virioplankton: viruses in
Snyder JC, Spuhler J, Wiedenheft B, Roberto FF, aquatic ecosystems. Microbiol Mol Biol Rev.
Douglas T, et al. Effects of culturing on the population 2000;64:69–114.
structure of a hyperthermophilic virus. Microb Ecol. Xiang X, Chen L, Huang X, Luo Y, She Q, et al.
2004;48:561–6. Sulfolobus tengchongensis spindle-shaped virus
Stoner DL, Geary MC, White LJ, Lee RD, Brizzee JA, STSV1: virus-host interactions and genomic features.
et al. Mapping microbial biodiversity. Appl Environ J Virol. 2005;79:8677–86.
Microbiol. 2001;67:4324–8. Yooseph S, Li W, Sutton G. Gene identification and pro-
Suttle CA. Marine viruses–major players in the global tein classification in microbial metagenomic sequence
ecosystem. Nat Rev Microbiol. 2007;5:801–12. data via incremental clustering. BMC Bioinformatics.
Tatusov RL, Koonin EV, Lipman DJ. A genomic perspec- 2008;9:182.
tive on protein families. Science. 1997;278:631–7. Young R. Bacteriophage lysis: mechanism and regulation.
van der Oost J, Jore MM, Westra ER, Lundgren M, Microbiol Rev. 1992;56:430–81.
Brouns SJ. CRISPR-based adaptive and heritable Yu MX, Slater MR, Ackermann HW. Isolation and char-
immunity in prokaryotes. Trends Biochem Sci. acterization of Thermus bacteriophages. Arch Virol.
2009;34:401–7. 2006;151:663–79.
V
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, # Springer Science+Business Media New York 2015
V 702 Variable Selection to Improve Classification of Metagenomes
Variable Selection to Improve Classification of Metagenomes, Table 1 Functional databases mostly used for creating functional profiles

Category                                 Database              Description
Large collection of reference sequences  RefSeq                Around 18 million proteins from 18 k organisms, annotations are available for a subset of the database, well-annotated for human sequences
                                         UniProtKB/Swiss-Prot  Manually curated annotations for 500,000+ sequences, covering 12,930 organisms
Standardized ontologies                  Gene Ontology         Well-controlled vocabulary, primarily for eukaryotes
Gene orthologous groups                  COG                   Gene groups classified into 23 functional categories, inferred from 66 prokaryote and unicellular eukaryote genomes
                                         KOG                   Eukaryote version of COG containing 7 eukaryotic genomes
                                         eggNOG                Automated annotation of orthologs in 1,133 species
Metabolism pathway                       KEGG                  400+ manually drawn pathways, based on reactions from multiple species
                                         BioCyc/Metacyc        2,000+ single-organism, experimentally derived pathways
                                         SEED                  Subsystems that describe metabolic machinery with expert curation
Protein domains and families             Pfam                  A large collection of protein families that share the same domain
                                         FIGfam                Protein families that share domains and pairwise align for their full length sequences, resulting in less sequences per family
do such a task. Furthermore, regression could be used instead of classification in the case of continuous environmental variables; however, for this entry, we assume that phenotypes take on discrete states, and therefore, classification is the primary focus. Previously, feature selection has been shown to be useful for reducing the complexity of metagenome classification (Ditzler et al. 2012); however, in this entry, its use is expanded to determine the relevance of biological features to associated phenotypes, thus aiding researchers in drawing conclusions from metagenomic data.

Feature selection can be applied to a variety of metagenomic data (e.g., 16S rRNA, whole genome shotgun, taxonomic annotations, gene annotations). In addition to selecting species that differentiate microbiomes, many studies aim to map DNA/RNA sequences to functional categories and to identify functions that are enriched or depleted between samples. Depending on the type of question being asked and the nature of the data, there are a variety of functional databases to choose from. Table 1 highlights some of the most widely used databases. Large reference sequence databases with a variety of functional descriptions are preferred because they provide detailed annotation of diverse data sets. This raw labeling of sequences can provide much information; however, it cannot be used to analyze hierarchical functional structure in a data set, such as which high-level functions (e.g., reproduction or cellular transport) are upregulated in a sample. Instead, sequence labeling can answer which genes exist in a sample or which sample is functionally more diverse, because these databases provide better annotation coverage of the sample than higher-level databases. However, if annotation with a well-defined vocabulary is required, as is needed to make biological inferences and associations, then a standardized ontology database should be used. For example, researchers can use Gene Ontology annotation to examine which functions are enriched in one sample compared to others. In some cases, researchers wish to annotate the function of a gene that appears in multiple organisms rather than just one. In other words, the focus is to accurately assign homologous genes associated with multiple species, which is especially important in metagenomics because of the complex mixture of organisms in a sample. Therefore, orthologous group databases are useful for annotating the homologous function of orthologs. For studying a microbiome’s metabolism rather than molecular functions, such as asking the questions what
biological processes are enriched in or missing from a diseased microbiome, or should photosynthesis activity be enhanced in surface soil compared to deeper-layer soil samples, several metabolic pathway databases can be used. Finally, protein family databases search for conserved domains and motifs of protein sequences and are important when considering the origin and evolution of proteins. For example, protein motifs that characterize pathogenicity may be used as potential targets for diagnosis and treatment.

Since the diversity of functional databases serves a variety of research questions, it is important to note that many studies adopt several databases for annotation. Therefore, the optimal feature selection technique may depend on the database choice and the nature of the taxonomic or functional data, such as the dimension of the feature space, data sparsity, and the possible range of fold change between samples.

This entry is organized as follows: section “Feature Selection” highlights the components of a general feature selection algorithm and how to design such an algorithm. Section “A Description of the MetaHit Database” presents the benchmark MetaHit data set, followed by an empirical analysis of feature selection algorithms tested on the MetaHit data set in section “Data Analysis.” Finally, section “Conclusion” draws concluding remarks for feature selection applied to metagenomic data.

Feature Selection

Feature selection can provide unique insight into the variables that provide discriminating information about populations, or phenotypes, typically contained in the metadata. This metadata could be as simple as two populations, such as healthy or unhealthy, or significantly more complex, containing many different populations within a data sample. It is natural during the analysis of a biological data set to ask the question: which variables provide the most differentiation between the populations? There are several items to consider before applying feature selection to a (biological) data set. First, how many features should be selected? Most feature selection algorithms assume that the end-user must select this parameter, and the quality of the results will most likely be highly dependent on its value. In many situations, cross validation can be used to search for an acceptable value. Second, what is the primary objective of feature selection? Is it the goal of the end-user to perform classification, or are they simply looking for the top k features in the data set? The design of the objective function, $J(\cdot)$, for feature selection can be used to emphasize and address these questions.

Let $J(\cdot)$ be a function of the features $X_j$ (for $j \in \{1, \ldots, Q\}$), the label variable $Y$, and the current relevant feature set $F$. Note that the collection of variables (e.g., operational taxonomic units, Pfams, etc.) is denoted by $X$. The objective function can be designed so that it reflects the task at hand. For example, if a biologist is interested in the top-ranking features that carry the most mutual information between $X_j$ and $Y$, then the objective function should reflect this goal. In this situation, a mutual information maximization (MIM) method is sufficient (Lewis 1992). MIM can be implemented as follows: (a) compute $I(X_j; Y)$ for all $j$, where $I(X_j; Y)$ is the mutual information between $X_j$ and $Y$, (b) rank the mutual information values in descending order, and (c) select the top k variables with the largest mutual information and place them in $F$.

However, many times we seek to classify data based on $Y$, and in such situations a more complex objective function is required. For example, it may be more advantageous to select $F$ in such a way that the features contained in $F$ are informative about $Y$ but not redundant (i.e., no two features provide the same information about $Y$). An example of such an objective function is given by

$$J(X_j, Y, F) = I(X_j; Y) - \sum_{X_s \in F} I(X_j; X_s) \qquad (1)$$
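The MIM steps (a)–(c) and the redundancy-penalized objective in equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this entry: it assumes the features have already been discretized (real abundance counts would need binning first), it estimates mutual information from empirical frequencies, and the function names and toy Pfam-like features are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[va] * count_b[vb]))
    return mi


def mim_select(features, labels, k):
    """MIM: rank features by I(X_j; Y) and keep the k highest."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def forward_select(features, labels, k):
    """Greedy search: add, one at a time, the feature maximizing
    I(X_j; Y) - sum over X_s in F of I(X_j; X_s)."""
    selected = []                 # the relevant feature set F
    remaining = dict(features)    # candidates still in X
    for _ in range(min(k, len(features))):
        def objective(name):
            relevance = mutual_information(remaining[name], labels)
            redundancy = sum(
                mutual_information(remaining[name], features[s]) for s in selected
            )
            return relevance - redundancy
        best = max(remaining, key=objective)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical toy data: presence/absence of three features in eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "pfam_a":  [0, 0, 0, 1, 1, 1, 1, 1],   # informative
    "pfam_a2": [0, 0, 1, 1, 1, 1, 1, 1],   # informative but nearly a copy of pfam_a
    "pfam_b":  [0, 1, 0, 0, 0, 1, 1, 1],   # weaker, but carries different information
}

print(mim_select(toy_features, labels, 2))
print(forward_select(toy_features, labels, 2))
```

On this toy data MIM keeps the two near-duplicate features because each is individually informative, whereas the greedy search replaces the second one with the less redundant feature.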
Variable Selection to Improve Classification of Metagenomes, Fig. 1 Generic forward feature selection algorithm for a filter-based method
In equation (1), the first term rewards features that are informative about $Y$, while the second term penalizes $X_j$ for being redundant with the current relevant feature set $F$. The design of the objective function is quite important to the application in which feature selection is used. Several works highlight such results on bioinformatics data (Saeys et al. 2007), information theory methods (Brown et al. 2012), and general feature selection techniques (Guyon and Elisseeff 2003).

A simple algorithm for feature selection is the forward selection search, which is shown in Fig. 1. The method begins by initializing the relevant feature set $F$ to the empty set. Then for k cycles, equation (1) is maximized, and the feature that maximizes the expression is added to the relevant feature set, $F$, and removed from the feature set, $X$. The forward selection search is used with several feature selection objective functions in the section on “Data Analysis.”

A Description of the MetaHit Database

As mentioned in the Introduction, feature selection can allow researchers in metagenomics to interpret the differentiating features in a data set. The interpretation can be insightful and allow the researchers to determine the functional differences between multiple phenotypes. As a case study, let us examine a metagenome data set collected by Qin et al. (2010), which is widely referred to as the MetaHit data set. The data were collected from Illumina-based metagenomic sequencing of 124 fecal samples of 124 European individuals from Spain and Denmark.

The MetaHit data set represents one of the most comprehensive studies of the human gut microbiome. Among the 124 individuals in the database, 25 are patients who have inflammatory bowel disease (IBD), and 42 are obese. It is interesting to note that only three of the individuals who have IBD are also obese. Let us consider two different labeling schemes for the data: IBD and obesity, both of which are binary prediction problems. The sequences from each individual were functionally annotated using the Pfam database (Finn et al. 2010) in a recent study (Lan et al. 2013) that used the MetaHit data set for feature selection on patient age. There are a total of 6,343 unique functional features detected in the data set, and Fig. 2 shows the log10 of the total abundance for each of the 6,343 functional features over the 124 observations in the data set. One way to (loosely) assess the separability of the IBD and no-IBD patients (or obese and not obese) in the data is to examine the principal coordinate analysis (PCoA) plots of the patients’ Pfam data (Gower 1967). Figure 3 shows the PCoA scatter plots of the two sample labeling schemes, with PCoA implemented using the Euclidean distance. From these plots we observe that there is a significant amount of overlap between the classes for both labeling schemes.

Data Analysis

In this section, the classification accuracy and area under the receiver operating characteristic (auROC)
curve for the MetaHit data set are examined when feature selection is applied. The accuracy is measured using the standard 0–1 loss, and the auROC is interpreted as the probability of ranking a randomly selected target data instance higher than a randomly selected nontarget data instance (Fawcett 2006). The IBD/obese class label is identified as the target for the calculation of the auROC.

The joint mutual information (JMI) feature selection algorithm is implemented with a forward selection search, and the naïve Bayes classifier is implemented with a multinomial model. The FEAST feature selection toolbox implements the JMI algorithm (Brown et al. 2012). All statistics are presented as averages from tenfold cross validation using stratified sampling. Stratified sampling ensures that instances from each class will be in each cross-validation data set. Note that completely random cross-validation data set partitions do not guarantee this property.
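The two evaluation ingredients described above, stratified fold assignment and the rank-based reading of the auROC, can be sketched as follows. This is an illustrative sketch in plain Python under stated assumptions: the function names, the round-robin dealing scheme, and the convention that ties count one half are choices made here, not details given in the entry.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to n_folds test folds, dealing each class out
    round-robin so that per-class counts differ by at most one across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    slot = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)          # random order within the class
        for idx in members:
            folds[slot % n_folds].append(idx)
            slot += 1
    return folds


def auroc(scores, labels, target=1):
    """Probability that a randomly chosen target instance is scored higher
    than a randomly chosen nontarget instance (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == target]
    neg = [s for s, y in zip(scores, labels) if y != target]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 124 samples with 25 targets, mirroring the size of the IBD labeling scheme.
labels = [0] * 99 + [1] * 25
folds = stratified_folds(labels, n_folds=10)
targets_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(targets_per_fold)   # every fold contains some target samples
```

The pairwise auROC above is quadratic in the number of samples, which is harmless at this scale. Library routines such as scikit-learn's `StratifiedKFold` and `roc_auc_score` cover the same ground for larger data sets.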
Variable Selection to Improve Classification of Metagenomes, Fig. 2 Logarithm of the total abundance of each feature detected by the Pfam database for the human gut microbiome data set of Qin et al. (2010). The x-axis represents the rank of each feature, with features sorted by number of detections in descending order. The plot shows that there are few Pfams with a large abundance and many Pfams with a very low abundance count. For example, there are 2,572 Pfams with 10 or fewer occurrences across the 124 observations

Variable Selection to Improve Classification of Metagenomes, Fig. 3 (a) IBD (b) Obese. Multidimensional scaling of the MetaHit data set with the IBD and obese labeling of the samples. There appears to be a significant amount of overlap between the controls and targets for both prediction problems

The auROC and loss for the multinomial naïve Bayes classifier are measured using the two labeling schemes described in section “A Description of the MetaHit Database” (i.e., IBD and obese). Table 2 contains the classification assessments for the different labeling schemes as the number of features selected via JMI is varied. From Table 2, it is clear that feature selection can have a significant effect on the classification results. This is best shown in Fig. 4, which shows the number of features selected by the MIM algorithm versus the loss (Fig. 4b) and the auROC (Fig. 4a). Note that these results are generated using the mutual information maximization approach; however, similar results and trends are observed for other feature selection methods.

Figure 5a presents a visualization of the MetaHit data set before and after MIM feature
selection is applied. The features are sorted from high to low in terms of overall abundance, and the patients are represented such that samples 1–99 do not have IBD and samples 100–124 have IBD. Clearly, this shows a large amount of sparsity that is inherent in the data, which would also be evident if taxonomic abundances were used instead of Pfams. Figure 5b shows that most of the features selected by MIM are relatively abundant; however, simply because a feature is abundant does not imply that the feature is relevant. This can be observed near the 44th feature in Fig. 5b. Note that the features in Fig. 5b are ordered by the time they were selected by the forward search.

The top Pfams that maximize the mutual information for the MetaHit data set are shown in Table 3. It is known that in IBD patients the expression of ABC transporter protein (PF00005, the first feature MIM selected for classifying IBD versus no-IBD samples) is decreased, which limits the protection against various luminal threats (Deuring et al. 2011). The feature selection for IBD also identified glycosyltransferase (PF00535), whose alteration is hypothesized to result in recruitment of bacteria to the gut mucosa and increased inflammation (Campbell et al. 2001). In addition, the genotype of acetyltransferase (PF00583) plays an important role in the pathogenesis of IBD, which is useful in the diagnostics and treatment of IBD (Baranska et al. 2011). It is not surprising that the ABC transporter (PF00005), which is known to mediate fatty acid transport that is associated with obesity and insulin-resistant states (Ashrafi 2007), is also selected for obesity, as are ATPases (PF02518), which catalyze dephosphorylation reactions to release energy.

Variable Selection to Improve Classification of Metagenomes, Table 2 Area under the ROC (auROC) curves and classification error for a naïve Bayes classifier tested using tenfold cross validation

Features selected  auROC (IBD)  Error (IBD)  auROC (obese)  Error (obese)
10                 0.706        0.233        0.640          0.395
15                 0.624        0.290        0.672          0.352
25                 0.616        0.292        0.660          0.403
50                 0.750        0.223        0.649          0.422
100                0.660        0.249        0.659          0.397
200                0.654        0.257        0.643          0.389
500                0.635        0.277        0.641          0.378
All                0.665        0.238        0.622          0.240

Variable Selection to Improve Classification of Metagenomes, Fig. 4 (a) Loss of naïve Bayes. (b) auROC of naïve Bayes. The effect of the number of features selected by the MIM algorithm versus the loss (left) and the auROC (right). The number of features being selected has a larger effect on the auROC (i.e., detection of target population examples) than the accuracy of the system. Similar results are observed with JMI and other feature selection methods

Conclusion

This entry has presented a broad overview of how feature selection algorithms can be used to facilitate and interpret data in the field of metagenomics. Recall that metagenomic abundance data can be of very large dimension (e.g., MetaHit), and feature selection reduces the
Variable Selection to Improve Classification of Metagenomes, Fig. 5 (a) No feature selection. (b) Feature selection. Visualization of the abundance matrix (on a log10 scale) (a) before and (b) after MIM feature selection. The x-axis represents features and the y-axis represents samples. Samples 1 through 99 do not have IBD, and samples 100 through 124 have IBD. (b) contains the top 50 features relevant to the 124 data sets. Differences between the two classes cannot be visualized; however, classification auROCs are 10–15 % above chance
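The Data Analysis section relies on a naïve Bayes classifier with a multinomial model applied to count data such as the abundance matrix of Fig. 5. A minimal sketch of such a classifier is given below. It is not the code used for the reported results: the function names and toy counts are hypothetical, and Laplace smoothing with alpha = 1 is an assumption, since the entry does not state how zero counts were handled.

```python
import math
from collections import defaultdict


def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Fit log priors and smoothed log feature probabilities per class."""
    n_features = len(count_rows[0])
    class_totals = defaultdict(lambda: [0.0] * n_features)
    class_sizes = defaultdict(int)
    for row, y in zip(count_rows, labels):
        class_sizes[y] += 1
        totals = class_totals[y]
        for j, c in enumerate(row):
            totals[j] += c
    model = {}
    for y, totals in class_totals.items():
        # Assumed Laplace smoothing: add alpha to every feature count.
        denom = sum(totals) + alpha * n_features
        log_prior = math.log(class_sizes[y] / len(labels))
        log_theta = [math.log((t + alpha) / denom) for t in totals]
        model[y] = (log_prior, log_theta)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one count vector."""
    def log_posterior(y):
        log_prior, log_theta = model[y]
        return log_prior + sum(c * lt for c, lt in zip(row, log_theta))
    return max(model, key=log_posterior)


def zero_one_loss(predictions, truth):
    """Fraction of samples that are misclassified."""
    return sum(p != t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical toy abundance counts for three features in four samples.
train_rows = [[9, 1, 0], [8, 2, 1], [1, 1, 9], [0, 2, 8]]
train_labels = [0, 0, 1, 1]
model = train_multinomial_nb(train_rows, train_labels)

test_rows = [[7, 1, 1], [1, 0, 6]]
test_labels = [0, 1]
predictions = [predict(model, row) for row in test_rows]
print(predictions, zero_one_loss(predictions, test_labels))
```

Working in log space avoids numerical underflow when a sample contains many counts, which matters for abundance profiles with thousands of features.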
Viral Metagenome Annotation Pipeline, Fig. 1 Schematic representation of a viral metagenomic workflow

platforms currently available, Illumina and 454 FLX/titanium pyrosequencing have been the most frequently used for the characterization of VM samples. One Illumina sequencing lane generates approximately 125–150 million reads of up to 150 bp in length, while one full-plate run of 454 titanium produces ~1 million reads of about 350–450 bp. A VM project usually involves two or more Illumina/454 runs, and therefore, the volume of sequence produced by these studies is on the order of several gigabases. This huge volume of sequencing data makes downstream annotation and analysis methods very challenging and computationally expensive. Therefore, it is critical to preprocess sequencing data whenever possible in order to reduce the amount of sequence to be annotated. Pre-annotation processing methods include

Read-Based Annotation

Direct annotation of sequencing reads is frequently used when the goal of the VM study is to investigate and compare the types of species or gene functions that are present in one or more viral communities. In general, read-based annotation assumes that each read encodes a single gene. Before proceeding with any annotation, it is important to preprocess sequencing reads to eliminate regions with low-quality base calls and duplicated reads. This is particularly important when working with next generation sequencing (NGS) data, since pyrosequencing and Illumina technologies have a higher error rate compared with Sanger. In particular, 454 pyrosequencing is prone to the generation of artifactual indels in regions containing homopolymers (Kunin et al. 2010; Gilles et al. 2011), while Illumina reads have a higher substitution error rate than 454 reads but deal better with homopolymeric regions (Minoche et al. 2011). Also, NGS platforms have a tendency to produce a significant number of
Viral Metagenome
Annotation Pipeline,
Fig. 2 Number of articles
in PubMed about viral
metagenomics during the
period 2004–2011
duplicated reads, in particular when sequencing annotation stage. Because viruses have a fast evo-
libraries are constructed from very limited quan- lutionary rate, any comparison at the nucleotide
tities of starting RNA/DNA material (<10 ng). level is not sensitive enough to detect similarities
There are several programs that can be used to between reads from a studied metagenome and
remove exact or near exact duplicated reads or nucleotide databases of characterized viral genes
trimming low-quality bases and vector sequences without requiring a large computer infrastructure. Some examples are BIGpre (Zhang et al. 2011), Trimmomatic (Bolger et al. 2014; http://www.usadellab.org/cms/index.php?page=trimmomatic), PyroCleaner (Jerome et al. 2011), CD-HIT (Huang et al. 2010), NGS QC Toolkit (Patel and Jain 2012), LUCY (Chou and Holmes 2001), and SeqClean (Chen et al. 2007). For example, Trimmomatic is a Java-based program that runs on Linux, Windows, and Mac OS X and has several options for trimming low-quality bases and adaptor sequences from Illumina reads. BIGpre is compatible with both the 454 and Illumina platforms; it detects and removes redundant reads after taking sequencing errors into account and also trims low-quality reads from raw data. BIGpre and NGS QC Toolkit, among other tools, also output a number of quality statistics about NGS reads that are useful for assessing the presence of sequence bias and the correlation between forward and reverse reads.

Once raw sequencing reads have been processed, it is possible to proceed with the

or genomes. In consequence, all searches should be done using translated sequences. The simplest annotation approach is to compare the six-frame translations of each read against a collection of well-annotated protein databases using TBLASTN or equivalent algorithms to identify the types of viral species or functions encoded by the viral metagenome. The main advantage of RBA is that it does not involve previous gene identification or assembly of reads, processes that require some level of user expertise. Another benefit is that translation-based similarity searches are independent of gene structure and therefore may prove more sensitive than GBA when studying viral communities whose genomes are enriched in intron-containing genes. However, RBA has several disadvantages. First, sequence similarity searches using TBLASTN or equivalent programs are computationally demanding and time consuming. Second, many databases of conserved protein domains or motifs cannot be queried using nucleotide sequences or on-the-fly translations.
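The six-frame translation on which RBA relies can be illustrated with a minimal Python sketch. It translates one read in the three forward and three reverse-complement frames using the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translations`), the frame labels (+1 to +3 and -1 to -3), and the use of `X` for codons containing ambiguous bases are assumptions made for this illustration; they are not taken from any of the tools cited in this entry.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon; '*' marks a stop codon.

    Codons that are not in the table (for example, containing N) become 'X'.
    Trailing bases that do not form a full codon are ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(read: str) -> dict:
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to peptide strings."""
    reverse = reverse_complement(read)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(read[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

Each of the six peptide strings would then be compared against the protein databases; splitting them at stop codons before searching is a common refinement that is left out here for brevity.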
Third, when reads code for more than one gene, the molecular functions associated with the most divergent genes on the read are usually masked by the gene with the best (lowest) e-values and hence are difficult to detect. Fourth, further characterization and phylogenetic analysis of protein families are complicated by the fact that it is difficult to generate multiple sequence alignments from evolutionarily related genes that start at different positions on their respective reads. Lastly, the higher indel rate of NGS reads, in particular of 454-derived sequences, creates artifactual translation frameshifts that can lead to an overestimation of gene family diversity and complicate the interpretation of results from protein database searches.

Gene-Based Annotation

A more thorough and efficient way to annotate sequencing datasets from viral communities is to identify protein-coding genes before carrying out any comparison against protein databases. This approach considerably reduces the amount of sequencing data to be queried and hence computing time, expands the spectrum of databases that can be searched, and simplifies the interpretation of results and further evolutionary studies. Although GBA may involve different bioinformatics tools, databases, and cutoffs, it is usually composed of the following consecutive steps: (i) sequence assembly; (ii) protein-coding gene identification; (iii) similarity searches of predicted proteins against generic or specialized databases of characterized proteins, conserved protein domains or motifs; and (iv) functional assignments of predicted proteins following a series of predefined rules. Below, we will discuss each of these steps in more detail.

Assembling Viral Metagenomes
Metagenomic sequence assembly is a fundamental way to improve metagenomic annotation. For example, the sensitivity of both phylogenetic assignment methods based on nucleotide composition and metagenomic ab initio gene finders increases with sequence length (McHardy et al. 2007; Li 2009; Yok and Rosen 2010). Single-genome assemblers usually do not perform well on metagenomic datasets because they are not designed to handle a mixture of reads derived from different strains and species with distinct relative abundances. In this context, sequences of highly abundant species are likely to be misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. There are a number of programs and websites specifically designed for generating de novo contigs and scaffolds of overlapping metagenomic NGS reads. The CAMERA website (Sun et al. 2011) offers a meta-assembly procedure for 454 reads that consists of running a number of single-genome assemblers with carefully optimized parameters on the metagenomic dataset; it then collects all the resulting contigs and assigns quality scores by consensus analysis, and finally, it uses an adaptation of phrap (http://www.phrap.org) to reassemble the contigs based on the computed quality scores. There are also a number of metagenome-specific de novo assemblers, such as MetaVelvet (Namiki et al. 2012), Meta-IDBA (Peng et al. 2011), IDBA-UD (Peng et al. 2012), and Genovo (Laserson et al. 2011). These programs deal better with a mixed population of species with different abundances than single-genome de novo assemblers do (Namiki et al. 2012) and seem to reduce the number of chimeric contigs. Also, depending on the species diversity of the metagenome, some of these programs may perform differently (Namiki et al. 2012), and therefore, it is better to try a variety of assembly programs before starting to work on the annotation of a particular dataset.

Ab Initio Gene Identification
Gene features in viral genomes are strongly dictated by the genetic characteristics of their host. Thus, bacterial viruses, or bacteriophages, are mostly composed of single-exon genes while eukaryotic-infecting viruses may contain genes with more than one exon. In spite of this property, the majority of genes encoded by viral genomes do not have introns. Therefore, there are a number of gene finders that are suitable for the ab initio identification of viral genes on either NGS reads or assembled sequences, although
none of them have been specifically developed for viral metagenomic samples. Two of the most widely used gene finders are MetaGeneAnnotator (Noguchi et al. 2008) and FragGeneScan (Rho et al. 2010). MetaGeneAnnotator integrates statistical models from prophage, bacterial and archaeal genes, and ribosomal-binding sites, and it also uses a self-training model from input sequences for making predictions. FragGeneScan incorporates sequencing error models and codon usage information in a hidden Markov model (HMM) that improves the prediction of protein-coding genes in NGS reads and assemblies. FragGeneScan is able to compensate for artifactual frameshifts in pyrosequencing reads caused by the higher frequency of indels at homopolymeric regions. An alternative strategy to the identification of genes with gene finders is to use naïve six-frame translations (NSFT) that identify each possible ORF of at least 80 nt in length. In this case, the 5′ and 3′ ends of reads can be considered as start and stop codons, respectively, to also incorporate partial genes truncated by read ends. In those cases where there are two or more overlapping ORFs, it is possible to analyze all of them or to select candidate genes based on their properties: length, dN/dS ratio, similarity at the protein level, etc. An alternative to this approach is to combine the results of NSFT with gene predictions from FragGeneScan or MetaGeneAnnotator and pick the longest predicted gene per region.

Viral Metagenome Annotation Pipeline, Fig. 3 Relative number of viral, bacterial, archaeal, and eukaryotic proteins in GenBank, UniProt/Swiss-Prot, and UniProt/TrEMBL. Numbers are relative to the total number of proteins in each database

Functional Annotation of Predicted Genes
Functional predictions of protein sequences are usually done in two consecutive steps: (A) similarity searches against very well-curated protein databases and (B) functional assignments based on database hits. A fundamental problem in functional annotation of viral genes is how to assign functional roles to their encoded proteins when viral sequences are highly divergent from those already present in well-annotated protein databases. To make the situation even more complicated, proteins of viral origin represent a tiny fraction of the proteins deposited in public repositories (Fig. 3). In consequence, in a typical VM project, only a very small proportion of viral peptides give significant hits (e-value ≤ 1 × 10^-5) against protein databases. Therefore, protein database searches have to be complemented with other bioinformatics tools to increase the number of functionally predicted viral proteins. In this section we describe a strategy for functional annotation of viral metagenomic datasets as implemented in the Viral MetaGenomic Annotation Pipeline (VMGAP) at the J. Craig Venter Institute (Lorenzi et al. 2011). This pipeline makes use of databases of conserved protein domains, mobile genetic elements, and environmental peptides to improve the sensitivity and quality of the annotation.
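To make the e-value criterion mentioned above concrete, the following Python sketch computes the fraction of query peptides that have at least one hit at or below a chosen e-value cutoff. It assumes the search results are available in the 12-column tabular format written by NCBI BLAST+ with `-outfmt 6` (query identifier in column 1, e-value in column 11). The function name `fraction_with_significant_hits`, the sample rows, and the peptide identifiers are invented for illustration; the default cutoff of 1e-5 mirrors the threshold quoted in the text.

```python
def fraction_with_significant_hits(blast_lines, query_ids, max_evalue=1e-5):
    """Return the fraction of query_ids with >= 1 hit of e-value <= max_evalue.

    blast_lines: iterable of tab-separated lines in BLAST+ -outfmt 6 order
                 (qseqid, sseqid, pident, length, mismatch, gapopen,
                  qstart, qend, sstart, send, evalue, bitscore).
    query_ids:   identifiers of all peptides that were searched, including
                 those that produced no hit at all.
    """
    significant = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            significant.add(query_id)
    queries = set(query_ids)
    if not queries:
        return 0.0
    return len(significant & queries) / len(queries)


# Hypothetical rows: pep1 has two strong hits, pep2 only a weak one,
# pep3 and pep4 have no hits.
example_hits = [
    "pep1\tsubjA\t62.5\t120\t45\t0\t1\t120\t5\t124\t3e-40\t150",
    "pep2\tsubjB\t31.0\t80\t55\t2\t3\t82\t10\t89\t0.002\t38.5",
    "pep1\tsubjC\t40.1\t110\t66\t1\t4\t113\t7\t116\t1e-12\t70.2",
]
example_queries = ["pep1", "pep2", "pep3", "pep4"]
```

With these made-up rows, only `pep1` passes the default cutoff, so one query in four counts as having a significant hit. Passing the full list of searched peptides matters because peptides without any hit never appear in the tabular output.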
The first step in the VMGAP is to perform several similarity searches between the VM peptides to be annotated and the following databases:

(i) BLASTP searches against public nonredundant protein databases
Several generic nonredundant protein databases can be used for functional assignment of viral proteins: GenBank NR, UniProtKB (UniProt Consortium 2012), and UniRef (Suzek et al. 2007). UniProtKB consists of two databases, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Protein records in UniProtKB/Swiss-Prot are annotated and reviewed by a curator, while entries in UniProtKB/TrEMBL are automatically annotated and classified. UniRef is a group of nonredundant protein databases derived from clustering UniProtKB entries at different percentages of identity. Thus, UniRef100 combines identical complete and fragmented protein sequences from any organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 % or 50 % sequence identity levels. One of the main advantages of using a clustered protein database such as UniRef90 or UniRef50 is that it significantly reduces the time required for similarity searches and improves detection of distant relationships, since all closely related proteins are collapsed into a single representative sequence (Suzek et al. 2007).

(ii) BLASTP searches against the ACLAME database
The ACLAME database is a repository of mobile genetic elements such as bacteriophages, plasmids, and transposons (Leplae et al. 2010). Entries in ACLAME are organized into families based on their sequence similarity and function. Those families with more than three members are manually annotated with functional assignments using gene ontology terms from GO (Shoop et al. 2004) and MeGO (Toussaint et al. 2007). MeGO is an ontology developed by ACLAME to describe biological functions, processes, and components specific to mobile genetic elements that are not present in the GO database (for an example see Fig. 4).

(iii) BLASTP searches against GenBank environmental nonredundant database
An intriguing aspect of VM is the fact that the majority of viral predicted proteins do not share similarity with any known sequence. This collection of unknown proteins, which are usually discarded as “junk” sequences, may represent a formidable source for the discovery of new viral species. One way to exploit these protein sequences is to compare them against the proteins from other metagenomic datasets to gain some insight into the viral entities that are shared between them. The GenBank environmental nonredundant database (env_nr) is a collection of all the protein sequences derived from metagenomic projects deposited in GenBank.

(iv) HMM searches against PFAM database
PFAM is a database of hidden Markov models of conserved protein domains (Punta et al. 2012). Because these domains are usually associated with a particular molecular function or protein family and evolve at a slower pace than other protein regions, they are excellent tools for identifying functional domains in highly divergent protein sequences such as those from viruses. PFAM HMM searches can be run with the HMMER2 suite of programs (Eddy 2011) in two different modes, global or local, allowing for total or partial alignments of the HMMs to the queried protein sequences, respectively. If gene predictions are done on reads, a high proportion of partial (truncated) proteins is expected. In that case, local HMM searches are a more sensitive approach. HMM searches using global alignments are more specific than local ones and perform better on complete proteins. However, even in assembled VM datasets the proportion of truncated genes is very high, since assemblies tend to be very fragmented. Recently, PFAM released a new generation of HMM models compatible with a new development of the HMMER package, HMMER3. These HMMs can only be run in local mode but have similar specificity and
sensitivity to those from the two PFAM HMMER2 models (local and global) combined. HMMER3 uses a faster algorithm and hence is a better choice for performing HMM searches on VM protein datasets.

Viral Metagenome Annotation Pipeline, Fig. 4 Example of MeGO terms as they appear in ACLAME using AmiGO

(v) RPS-BLAST against NCBI-CDD database
The NCBI Conserved Domain Database (CDD) (Marchler-Bauer et al. 2011) is a compendium of position-specific scoring matrices (PSSMs) representing conserved protein domains, protein families, and superfamilies gathered from SMART (Letunic et al. 2012), the COG database (Tatusov et al. 2003), NCBI-curated protein domains (Sayers et al. 2012), and PFAM. In spite of having some overlap with PFAM HMMs, PSSMs derived from PFAM domains do not behave exactly the same as their HMM counterparts, and therefore, they complement each other. CDD-PSSMs are usually associated with a molecular function or represent signatures of protein families and therefore provide useful functional information. Since PSSMs are BLAST scoring matrices specific to conserved protein domains or motifs, their use gives better sensitivity than regular BLASTP when detecting these domains in more divergent proteins.

(vi) Additional bioinformatic tools for functional annotation
Because a significant proportion of the proteins encoded by viral metagenomes are unknown, it is useful to take advantage of in silico protein-signal prediction tools that could provide hints about their putative function. An important first step toward understanding the biological role of unknown viral proteins is determining their subcellular localization while infecting their host. A set of popular protein localization prediction programs has been developed for the identification of protein signals that
dictate the subcellular destination of peptides: SignalP (Petersen et al. 2011), ChloroP (Emanuelsson et al. 1999), TargetP (Emanuelsson et al. 2000), and TMHMM (Krogh et al. 2001; http://www.cbs.dtu.dk/services/TMHMM/). None of these programs are specifically designed for viral genes. However, once in the host, viruses can use prokaryotic/eukaryotic signals to target their own proteins to defined subcellular locations. SignalP 4.0 uses a neural network-based method to predict signal peptides from gram-positive, gram-negative, and eukaryotic peptides, and it has been recently improved to distinguish between signal peptides and transmembrane domains located near the N-terminus of proteins. ChloroP also uses a neural network approach to predict chloroplast transit peptides, and therefore, it might be useful for the functional annotation of viruses that infect plants. TargetP is a program that predicts the subcellular location of eukaryotic proteins. The location assignment is based on the presence of any of the following N-terminal signals: chloroplast transit peptide, mitochondrial targeting peptide, or signal peptide. TMHMM is a program that predicts transmembrane domains based on HMM searches. Each of these programs outputs a p-value that can be used to select highly significant predictions.

Functional Assignments to VM Proteins Based on Annotation Rules
The second stage of functional annotation is the processing of the functional information produced from database searches to generate a file containing a summary of the functional characteristics (product names, GO/MeGO terms, EC numbers, etc.) for each viral peptide. Each piece of evidence generated by the analyses described above is more or less informative or accurate depending on the origin of the VM, the queried databases, and the programs used. Therefore, the best approach is to apply a series of hierarchical rules that prioritize one piece of evidence over another based on how trustworthy and useful that evidence is.
Viral Metagenome Annotation Pipeline, Fig. 5 Hierarchical scheme for functional annotation of viral proteins
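The idea of ranking evidence hierarchically can be sketched in a few lines of Python. The example below is a simplified illustration and not the VMGAP implementation. It covers only the BLASTP-based tiers described for Fig. 5 in this section (tiers 1 to 4 and 8 to 11), and it leaves out the PFAM, CDD, and environmental-database tiers. The cutoffs are the ones given in the text, read here as an e-value of at most 1 × 10^-10 with coverage of at least 80 % and identity of at least 50 % for tiers 1 to 4, and an e-value of at most 1 × 10^-5 with coverage of at least 70 % and identity of at least 30 % for tiers 8 to 11. The `Hit` record, the short database labels, and the function `best_tier` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """One BLASTP hit for a predicted viral peptide (hypothetical record)."""
    database: str    # "ACLAME", "US" (Swiss-Prot), "UT" (TrEMBL) or "GB" (GenBank NR)
    evalue: float
    identity: float  # percent identity
    coverage: float  # percent of the shortest sequence spanned by the alignment
    product: str     # product name carried by the database entry


DATABASE_ORDER = ("ACLAME", "US", "UT", "GB")
STRICT = (1e-10, 80.0, 50.0)     # max e-value, min coverage, min identity (tiers 1-4)
PERMISSIVE = (1e-5, 70.0, 30.0)  # same fields for tiers 8-11

# Each entry: (tier, database, max e-value, min coverage, min identity).
TIERS = (
    [(rank, db) + STRICT for rank, db in enumerate(DATABASE_ORDER, start=1)]
    + [(rank, db) + PERMISSIVE for rank, db in enumerate(DATABASE_ORDER, start=8)]
)


def best_tier(hits: Iterable[Hit]) -> Optional[Tuple[int, Hit]]:
    """Return (tier, hit) for the highest-ranked tier with a passing hit.

    Tiers are tried in rank order; within a tier the hit with the lowest
    e-value is chosen. None means no hit passed any tier covered here.
    """
    hits = list(hits)
    for tier, database, max_evalue, min_coverage, min_identity in TIERS:
        passing = [
            h for h in hits
            if h.database == database
            and h.evalue <= max_evalue
            and h.coverage >= min_coverage
            and h.identity >= min_identity
        ]
        if passing:
            return tier, min(passing, key=lambda h: h.evalue)
    return None
```

A peptide whose best ACLAME hit passes only the permissive cutoffs but which also has a strict Swiss-Prot hit would be assigned tier 2 under this sketch, because tiers are tried in rank order and the first passing tier wins, regardless of which hit has the lowest e-value overall.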
Figure 5 shows a potential hierarchical scheme similar to the one used as part of the VMGAP at the JCVI. Under this scheme, hits against the ACLAME database are the highest ranked supporting evidence for functional assignments. Hence, any viral protein that hits an ACLAME entry with an e-value ≤ 1 × 10^-10, with at least 50 % identity spanning 80 % of the length of the shortest sequence, will automatically inherit the functional annotation associated with that particular ACLAME peptide. The second, third, and fourth tiers of evidence correspond respectively to highly significant BLASTP hits against UniProt/Swiss-Prot (US), UniProt/TrEMBL (UT), and GenBank NR (GB). US ranks higher than UT and GB because entries in US are manually curated. BLASTP hits against UT have a higher priority than GB hits because UT annotation is usually more comprehensive than that of GB. Ranked fifth and sixth are hits against almost complete PFAM HMMs and CDD-PSSMs, respectively. PFAM hits are more reliable than CDD hits because they can be selected based not only on their e-value but also on pre-calibrated domain-specific bit score cutoffs named trusted cutoffs. Any protein that hits a PFAM HMM with a bit score above its trusted cutoff is considered to contain that particular domain with a very high level of confidence. CDD domains, on the other hand, are selected based only on the e-value of the RPS-BLAST hit and the coverage of the CDD domain, and hence, these hits are less reliable than those of tier five. Local hits against PFAM HMM domains with e-values ≤ 1 × 10^-5 are ranked seventh, below CDD hits. Because these hits span just a portion of the HMM model, they are selected solely by their e-value and not by their bit score. Tiers eight to 11 look for less reliable hits against the ACLAME, US, UT, and GB databases, in that order, using more permissive cutoffs (e-value ≤ 1 × 10^-5; coverage ≥ 70 %, identity ≥ 30 %) compared to tiers 1–4 (e-value ≤ 1 × 10^-10; coverage ≥ 80 %, identity ≥ 50 %). Ranked 12th are BLASTP hits against environmental protein databases, such as GenBank env_nr, with e-values of 1 × 10^-5 or lower. Entries in environmental protein databases are likely to lack any functional annotation. However, associated metadata such as geographic location, body site, type of disease, etc., may still provide some clues about the biology of the viruses present in the VM sample. Finally, if the viral protein does not have a database hit that falls within any of the first 12 tiers, then it is considered an unknown protein.

Note that the rules described above can be further improved by, for example, the incorporation of results from subcellular localization predictions (TargetP, SignalP, ChloroP, and TMHMM) between tiers 12 and 13 or of any other functional analysis.

Applying the rules described above, it is possible to assign product names, EC numbers, and GO/MeGO terms to predicted proteins from the VM sample. For example, if a viral predicted protein has a hit against a peptide from the ACLAME database above the cutoffs from tier 1, then it can inherit the product name as well as the GO or MeGO terms associated with that particular ACLAME entry. UniProt entries, in particular those from US, are also a very good source of product names, EC numbers, and GO terms. However, these assignments should be made from high-confidence hits only.

Web Resources for Functional Annotation of VM Datasets

Currently, there are a number of publicly available bioinformatics tools that can be used for the structural (gene identification) and functional annotation of viral metagenomes. MG-RAST (Glass et al. 2010; Meyer et al. 2008) is a popular web resource able to perform structural and functional annotations on both NGS reads and assembled metagenomic data. One main advantage is that all computations are run on the MG-RAST server, and therefore, the user is not required to have a large computer infrastructure. It also handles Illumina and 454 reads and provides several read preprocessing tools such as elimination of duplicated or contaminated reads and deletion of low-quality sequences and short reads. Structural annotation is carried out either on reads or assemblies using FragGeneScan while functional annotation is done by
similarity searches against a protein nonredundant database that compiles the following public protein databases: GenBank NR, KEGG (Tanabe and Kanehisa 2012), IMG (Markowitz et al. 2012), InterPro (Hunter et al. 2012), PATRIC (Gillespie et al. 2011), Phantome (Dwivedi et al. 2012; http://www.phantome.org/), RefSeq (Pruitt et al. 2012), SEED (DeJongh et al. 2007), UniProt/Swiss-Prot, UniProt/TrEMBL, COG (Tatusov et al. 2003), GO, KO (Mao et al. 2005), and eggNOG (Powell et al. 2012). Among these databases is Phantome, a protein database of complete phage genomes manually curated by experts using a subsystem approach (Overbeek et al. 2005). Another nice feature of MG-RAST is that it allows comparison between the annotated VM samples provided by the user and the more than 10,000 metagenomic datasets that are publicly available at the MG-RAST server.

Another useful web resource is CAMERA (Sun et al. 2011), which allows the construction of customized workflows for the analysis of external metagenomic data. Among the many bioinformatic tools available are an assembly pipeline for 454 reads, protein clustering with CD-HIT, clustering of duplicated 454 reads, gene predictions based on different gene finders, and a general pipeline for annotation of metagenomic datasets.

References

Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, Wang K, Wommack KE. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73(23):7629–41. 2168038.
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; doi:10.1093/bioinformatics/btu170
Chen YA, Lin CC, Wang CD, Wu HB, Hwang PI. An optimized procedure greatly improves EST vector contamination removal. BMC Genomics. 2007;8:416. 2194723.
Chou HH, Holmes MH. DNA sequence quality trimming and vector removal. Bioinformatics. 2001;17(12):1093–104.
DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A. Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinforma. 2007;8:139. 1868769.
Dwivedi B, Schmieder R, Goldsmith DB, Edwards RA, Breitbart M. PhiSiGns: an online tool to identify signature genes in phages and design PCR primers for examining phage diversity. BMC Bioinformatics. 2012;13:37.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10):e1002195. 3197634.
Emanuelsson O, Nielsen H, von Heijne G. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 1999;8(5):978–84. 2144330.
Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000;300(4):1005–16.
Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, Martin JF. Accuracy and quality assessment of 454 GS-FLX titanium pyrosequencing. BMC Genomics. 2011;12:245. 3116506.
Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T, et al. PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun. 2011;79(11):4286–98. 3257917.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010;2010(1):pdb prot5368.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2. 2828112.
Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40(Database issue):D306–12. 3245097.
Jerome M, Noirot C, Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Res Notes. 2011;4:149. 3117718.
Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305(3):567–80.
Kunin V, Engelbrektson A, Ochman H, Hugenholtz P. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol. 2010;12(1):118–23.
Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18(3):429–43.
Leplae R, Lima-Mendez G, Toussaint A. ACLAME: a classification of mobile genetic elements, update 2010. Nucleic Acids Res. 2010;38(Database issue):D57–61. 2808911.
Letunic I, Doerks T, Bork P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 2012;40(Database issue):D302–5. 3245027.
Li W. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinforma. 2009;10:359. 2774329.
Lorenzi HA, Hoover J, Inman J, Safford T, Murphy S, Kagan L, Williamson SJ. The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data. Stand Genomic Sci. 2011;4(3):418–29. 3156399.
Mao X, Cai T, Olyarchuk JG, Wei L. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics. 2005;21(19):3787–93.
Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, et al. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2011;39(Database issue):D225–9. 3013737.
Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Grechkin Y, Ratner A, Jacob B, Pati A, Huntemann M, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9. 3245048.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4(1):63–72.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386. 2563014.
Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011;12(11):R112. 3334598.
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008;15(6):387–96. 2608843.
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, De Crecy-Lagard V, Diaz N, Disz T, Edwards R, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702. 1251668.
Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619. 3270013.
Peng Y, Leung HC, Yiu SM, Chin FY. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;27(13):i94–101. 3117360.
Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011;8(10):785–6.
Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, et al. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012;40(Database issue):D284–9. 3245133.
Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40(Database issue):D130–5. 3245008.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301. 3245129.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191. 2978382.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012;40(Database issue):D13–25. 3245031.
Shoop E, Casaes P, Onsongo G, Lesnett L, Petursdottir EO, Donkor EK, Tkach D, Cosimini M. Data exploration tools for the gene ontology database. Bioinformatics. 2004;20(18):3442–54.
Steward GF, Preston CM. Analysis of a viral metagenomic library from 200 m depth in Monterey Bay, California constructed by direct shotgun cloning. Virol J. 2011;8:287. 3128862.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, Stocks K, Allen EE, Ellisman M, Grethe J, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51. 3013694.
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8.
Tanabe M, Kanehisa M. Using the KEGG database resource. Curr Protoc Bioinform. 2012. Chapter 1:Unit1 12.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. 222959.
The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40(Database issue):D71–5. 3245120.
Toussaint A, Lima-Mendez G, Leplae R. PhiGO, a phage ontology associated with the ACLAME database. Res Microbiol. 2007;158(7):567–71.
Yok N, Rosen G. Benchmarking of gene prediction programs for metagenomic data. Conf Proc IEEE Eng Med Biol Soc. 2010;2010:6190–3.
Zhang T, Luo Y, Liu K, Pan L, Zhang B, Yu J, Hu S. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinforma. 2011;9(6):238–44.

Viral Pathogens in Clinical Samples by Use of a Metagenomic Approach

Jian Yang
MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College (CAMS&PUMC), Beijing, People’s Republic of China

Synonyms

the associated diseases. Traditional techniques for virus discovery such as cultivation-, morphology-, serology-, and immunology-based methods contributed significantly to the identification of most important viral pathogens during the last century. In addition, modern molecular methods such as PCR and microarrays have played increasingly important roles in clinical virology practice over the past decade. The newly emerged metagenomic-based method is a particularly powerful approach for virus identification since genetic material can be analyzed directly from clinical samples, bypassing the need for culturing, cloning, or a priori knowledge of what viruses may be present. The recent advent of next-generation sequencing (NGS) technologies, which have dramatically improved the speed and cost-effectiveness of sequencing, has fueled the clinical application of metagenomic methods for viral diagnosis. Herein, we summarize the most recent studies that have successfully identified viral pathogens from clinical samples by using the NGS-based metagenomic approach.
(Towner et al. 2008). In another study of the suspected hemorrhagic fever endemic in northern Uganda, using the same strategy, they not only recognized yellow fever virus but also generated 98 % of the virus genome sequence, which facilitated the follow-up phylogenetic analyses (McMullan et al. 2012). Illumina platforms have also been employed for the detection of viral pathogens in blood samples by using a metagenomic approach (Table 1). During an outbreak of fever, thrombocytopenia, and leukopenia syndrome in China that resembled a tick-transmitted disease, most patients tested negative for the initially suspected human granulocytic anaplasmosis. Hence, a metagenomic approach based on paired-end Illumina sequencing was applied to screen for potential viral agents in the sera of patients, and a novel bunyavirus was successfully identified (Xu et al. 2011). In addition, the novel virus was confirmed to be present in 78 % of the acute-phase serum samples by further RT-PCR testing.

Summary

Since its first introduction in 2008, we have witnessed the emergence and extensive application of the NGS-based metagenomic approach as a powerful tool in diagnostic virology. The intrinsic properties of metagenomics give the method prominent advantages in speed and sensitivity for parallel screening of known viral pathogens as well as detection of new, unexpected viral agents in clinical samples. With the continuous development and improvement of high-throughput sequencing technologies, the metagenomic approach will probably become an essential diagnostic method in clinical routines.

However, at the current stage, several issues should be kept in mind when applying the metagenomic approach in viral diagnostic practice. First, the choice of NGS platform is critical to both the preceding preparation of sample nucleic acids and the subsequent sequence data analyses. Though the majority of published applications used the Roche/454 platform, Illumina technology is increasingly employed in the most recent studies, as its higher throughput offers greater sensitivity than the former. Second, unlike traditional methods, the metagenomic approach relies heavily on subsequent bioinformatics data analyses, which can be very tricky, particularly when detecting potential novel viruses. The lack of standard protocols for metagenomic data analysis has hampered further extensive application of the metagenomic approach. Third, results from the metagenomic approach only indicate the presence of given viruses in the clinical samples screened; they cannot directly establish that the viruses identified are responsible for the human diseases investigated. Hence, the biological and medical interpretation of metagenomic results may require further evidence from epidemiology, morphology, immunology, etc.

Cross-References

▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ Viral MetaGenome Annotation Pipeline

References

Briese T, Paweska JT, McMullan LK, et al. Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog. 2009;5:e1000455.
Cheval J, Sauvage V, Frangeul L, et al. Evaluation of high-throughput sequencing for identifying known and unknown viruses in biological samples. J Clin Microbiol. 2011;49:3268–75.
Feng H, Shuda M, Chang Y, et al. Clonal integration of a polyomavirus in human Merkel cell carcinoma. Science. 2008;319:1096–100.
Finkbeiner SR, Li Y, Ruone S, et al. Identification of a novel astrovirus (astrovirus VA1) associated with an outbreak of acute gastroenteritis. J Virol. 2009;83:10836–9.
Greninger AL, Chen EC, Sittler T, et al. A metagenomic analysis of pandemic influenza A (2009 H1N1) infection in patients from North America. PLoS One. 2010;5:e13381.
Greninger AL, Runckel C, Chiu CY, et al. The complete genome of klassevirus – a novel picornavirus in pediatric stool. Virol J. 2009;6:82.
Lysholm F, Wetterbom A, Lindau C, et al. Characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing. PLoS One. 2012;7:e30875.
MacConaill L, Meyerson M. Adding pathogens by genomic subtraction. Nat Genet. 2008;40:380–2.
McMullan LK, Frace M, Sammons SA, et al. Using next generation sequencing to identify yellow fever virus in Uganda. Virology. 2012;422:1–5.
Nakamura S, Yang CS, Sakon N, et al. Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS One. 2009;4:e4219.
Palacios G, Druce J, Du L, et al. A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med. 2008;358:991–8.
Phan TG, Li L, O’Ryan MG, et al. A third gyrovirus species in human faeces. J Gen Virol. 2012a;93:1356–61.
Phan TG, Vo NP, Bonkoungou IJ, et al. Acute diarrhea in West African children: diverse enteric viruses and a novel parvovirus genus. J Virol. 2012b;86:11024–30.
Quan PL, Wagner TA, Briese T, et al. Astrovirus encephalitis in boy with X-linked agammaglobulinemia. Emerg Infect Dis. 2010;16:918–25.
Towner JS, Sealy TK, Khristova ML, et al. Newly discovered ebola virus associated with hemorrhagic fever outbreak in Uganda. PLoS Pathog. 2008;4:e1000212.
Victoria JG, Kapoor A, Li L, et al. Metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis. J Virol. 2009;83:4642–51.
Xu B, Liu L, Huang X, et al. Metagenomic analysis of fever, thrombocytopenia and leukopenia syndrome (FTLS) in Henan Province, China: discovery of a new bunyavirus. PLoS Pathog. 2011;7:e1002369.
Yang J, Yang F, Ren L, et al. Unbiased parallel detection of viral pathogens in clinical samples by use of a metagenomic approach. J Clin Microbiol. 2011;49:3463–9.
Yozwiak NL, Skewes-Cox P, Gordon A, et al. Human enterovirus 109: a novel interspecies recombinant enterovirus isolated from a case of acute pediatric respiratory illness in Nicaragua. J Virol. 2010;84:9047–58.
Yozwiak NL, Skewes-Cox P, Stenglein MD, et al. Virus identification in unknown tropical febrile illness cases using deep sequencing. PLoS Negl Trop Dis. 2012;6:e1485.
List of Entries
Taxa Counting Using Specific Peptides of Aminoacyl tRNA Synthetases
Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3
The Vaginal Microbiome in Health and Disease
tRNA Gene Database Curated Manually by Experts
Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution
Variable Selection to Improve Classification of Metagenomes
Viral Metagenome Annotation Pipeline
Viral Pathogens in Clinical Samples by Use of a Metagenomic Approach