NGSplatforms_metagenomic_blood_pathogens
Abstract
Background: The introduction of benchtop sequencers has made adoption of whole genome sequencing possible
for a broader community of researchers than ever before. Concurrently, metagenomic sequencing (MGS) is rapidly
emerging as a tool for interrogating complex samples that defy conventional analyses. In addition, next-generation
sequencers are increasingly being used in clinical or related settings, for instance to track outbreaks. However,
information regarding the analytical sensitivity or limit of detection (LoD) of benchtop sequencers is currently
lacking. Furthermore, the specificity of sequence information at or near the LoD is unknown.
Results: In the present study, we assess the ability of three next-generation sequencing platforms to identify a
pathogen (viral or bacterial) present in low titers in a clinically relevant sample (blood). Our results indicate that the
Roche-454 Titanium platform is capable of detecting Dengue virus at titers as low as 1 × 10^2.5 pfu/mL, corresponding
to an estimated 5.4 × 10^4 genome copies/mL maximum. The increased throughput of the benchtop sequencers, the
Ion Torrent PGM and Illumina MiSeq platforms, enabled detection of viral genomes at concentrations as low as
1 × 10^4 genome copies/mL. Platform-specific biases were evident in sequence read distributions as well as viral
genome coverage. For bacterial samples, only the MiSeq platform was able to provide sequencing reads that could
be unambiguously classified as originating from Bacillus anthracis.
Conclusion: The analytical sensitivity of all three platforms approaches that of standard qPCR assays. Although all
platforms were able to detect pathogens at the levels tested, there were several noteworthy differences. The
Roche-454 Titanium platform produced consistently longer reads, even when compared with the latest chemistry
updates for the PGM platform. The MiSeq platform produced consistently greater depth and breadth of coverage,
while the Ion Torrent was unequaled for speed of sequencing. None of the platforms were able to verify a single
nucleotide polymorphism responsible for antiviral resistance in an Influenza A strain isolated from the 2009 H1N1
pandemic. Overall, the benchtop platforms perform well for identification of pathogens from a representative
clinical sample. However, unlike identification, characterization of pathogens is likely to require higher titers,
multiple libraries and/or multiple sequencing runs.
* Correspondence: kim.bishop-lilly@med.navy.mil
1 Naval Medical Research Center, NMRC-Frederick, 8400 Research Plaza, Fort Detrick, Frederick, MD 21702, USA
2 Henry M. Jackson Foundation, 6720-A Rockledge Drive, Suite 100, Bethesda, MD 20817, USA
Full list of author information is available at the end of the article
© 2014 Frey et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication
waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise
stated.
Frey et al. BMC Genomics 2014, 15:96 Page 2 of 14
http://www.biomedcentral.com/1471-2164/15/96
thousands of reads, while low titer infections would require tens of millions. In the current study we use
the latest sequencing technologies with their increased throughput and lower cost to confirm these projections
as well as establish metrics for pathogen identification from clinical samples.

Results
Limits of detection of Roche-454 pyrosequencing
Our previous study suggested that as few as 0.1% of the total reads were sufficient to unambiguously identify a
viral pathogen present in a complex background [6]. However, our experimental model was reflective of a
disseminated infection and high viral load [6,10-12]. Therefore, we wished to determine if the Roche-454 platform
was capable of detecting viruses at reduced titers.
To determine the limits of detection (LoD), mock samples were created by spiking known amounts of
Dengue virus Type 1 (DENV-1) into 1 mL aliquots of whole human blood. Total RNA was extracted from
each sample and used for construction of cDNA libraries. Sequencing of samples representing each titer was
conducted in replicate fashion on the Roche-454 platform and DENV-1 was detected in a range of titers from
1 × 10^3 to 1 × 10^5 pfu/ml. Since no DENV-1 specific reads were detected at the titer of 1 × 10^2, it was determined
that the effective LoD for 454 sequencing using this protocol must lie between 1 × 10^2 and 1 × 10^3, and so
several samples were prepared within this range (1 × 10^2.3, 1 × 10^2.5, and 1 × 10^2.7). Sequencing was initiated
with the 1 × 10^2.5 sample and progressed down to 1 × 10^2.3, from which no reads were detected. Overall, the
decrease in the number of viral reads approximated the reduction in titer, especially at lower titers. At the
lowest titer detectable, 1 × 10^2.5, the breadth of coverage of the reference genome was 31% with an average
depth of coverage of 0.35X upon read mapping (Figure 1). In a parallel approach, de novo assembly of reads
yielded a single contig of 427 bp in length. By comparison, assemblies of samples at 1 × 10^5 pfu/mL and
1 × 10^4 pfu/mL yielded one contig of 9445 bp and two contigs of 4640 and 3566 bp, respectively. In both cases
the largest contig in the de novo assembly belonged to DENV-1. It is striking that, in the case of the low titer
samples, de novo assembly of the reads provides less information than would a taxonomic classification of all
reads. In the case of the 1 × 10^2.5 pfu/mL sample, although 11/604,130 resulting reads map to the reference,
covering 31% of the viral genome, de novo assembly of the reads produces a single contig comprised of 2 reads
that covers only 4% of the viral genome (Table 1). This apparent difference between fraction of the viral
genome covered by the de novo assembly and the reference mapping is due to insufficient breadth of coverage
for the reads to overlap and be used in de novo assembly.

Comparison of Roche-454, Illumina MiSeq and Ion Torrent PGM for detecting viral pathogens
Next, we examined how the performance of benchtop sequencers would compare with that of the Roche-454.
To that end, mock samples containing Influenza virus H1N1 were prepared in a similar fashion to the DENV-1
samples. For this experiment, the DENV-1 results were used as a guide as to what range of titers to investigate.
The effective LoD for the 454 platform was found to be 1 × 10^2.5, which we estimate to correlate with an upper
limit of 31,600-53,720 genome copies/ml, based on work by Houng et al. and Wang et al., who found one DENV
pfu to correspond to 100 and 170 genome copies by qPCR, respectively [13,14]. Therefore, an initial titer of
1 × 10^4 genome copies/mL was chosen for these experiments. At 1 × 10^4 genome copies/mL the Roche-454
platform was capable of detecting the Influenza virus in only one out of two replicates, with only a single hit
covering 3% of the flu genome (Table 2). At the same concentration, two biological replicates sequenced on the
PGM platform scored 11 and 21 hits, covering 7% and 12% of the genome, respectively. By contrast, two biological
replicates sequenced on the MiSeq platform scored 308 and 811 hits, covering 79% and 95% of the genome,
respectively. Interestingly, even when the number of Influenza-specific reads is expressed as reads per 100,000,
the MiSeq and PGM platforms both outperform the Roche-454, indicating that there may be some
platform-specific factors that affect LoD in addition to total output per run.
Examination of the read mappings revealed that the mapped reads were not evenly distributed along the viral
genome for any of the three sequencing platforms. Figure 2A depicts representative read mappings by platform
for segment 5 of Influenza A H1N1. For the PGM, the regions of highest coverage were found most often
toward the 5′ end of segment 5. Specifically, 14/19 mapped reads aligned within the first quarter of the
genome segment. The single mapped read from the
Figure 1 Read mapping against DENV-1 genome at 1 × 10^2.5 pfu/ml. Reads resulting from one 454 Titanium sequencing run of a cDNA library
made from blood spiked with DENV-1 at a titer of 1 × 10^2.5 pfu/ml were aligned to the reference genome NC_001477.1 using CLC Genomics
Workbench version 6.0.4 at default parameters.
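
As a rough cross-check of the titer shown in this figure, plaque-forming units can be converted to genome copies with the two qPCR-derived ratios cited in the text (100 and 170 genome copies per DENV pfu). The Python sketch below is illustrative only: the function name is hypothetical, and truncating 10^2.5 to a whole number of pfu is an assumption made here because it matches the 31,600-53,720 range quoted in the text.

```python
# Ratios of genome copies per DENV pfu reported by the two cited qPCR studies.
GENOME_COPIES_PER_PFU = (100, 170)


def genome_copy_range(log10_pfu_per_ml, ratios=GENOME_COPIES_PER_PFU):
    """Return (low, high) genome copies/mL for a titer given as log10(pfu/mL).

    Assumption: the titer is truncated to a whole number of pfu before
    scaling (10**2.5 = 316.2... becomes 316).
    """
    pfu_per_ml = int(10 ** log10_pfu_per_ml)
    return tuple(pfu_per_ml * ratio for ratio in ratios)


low, high = genome_copy_range(2.5)
print(low, high)  # 31600 53720
```

The upper value, about 5.4 × 10^4 genome copies/mL, is the maximum estimate given in the Abstract.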
Roche-454 was 448 bases in length. By contrast, the longest mapped read for the PGM was 179 bases. As
expected, the mapped reads from the MiSeq platform were all 150 bases in length. Since the PGM offers scalability
in throughput, we sought to compare the sensitivity of the PGM using a 314 chip as well. The throughput
of the 314 chip is similar to that of one-half of a 454 picotitre plate. Although not shown in Table 2, two
independent experiments using a 314 chip yielded a single, 84 base read mapped to Segment 5 (Additional file 1:
Figure S1). This single read mapped within the first quarter of Segment 5. These data strongly suggest that
increased throughput is required for detection of pathogens present at low levels.
A closer examination of the read mappings for both the PGM and MiSeq platforms indicated that some
mapped reads were spurious. In order to eliminate these matches, stringency of read mapping parameters was
increased. At the medium stringency setting, whereby the length of the read required to match was increased from
50% to 70%, the number of mapped reads dropped by roughly half for the PGM and by roughly 10% for the
MiSeq. Furthermore, the number of MiSeq reads that mapped in pairs was roughly 25% (66 out of 269;
Table 2). This is most likely due to polymorphisms in the sequenced strain that are not present in the NCBI
reference genome. As expected, no reads mapped at the highest level of stringency tested, as the requirement for
100% identity would preclude any matches.
For Influenza virus segment 8, the MiSeq yielded 14 mapped reads versus the single mapped read from the
PGM. It is noteworthy that although the MiSeq reads covered more of the segment overall than the single
PGM read did, both platforms produced a single read in an area of high G/C content. Given that the MiSeq
consistently produced the highest depth of coverage, we next examined whether the increased coverage of the
MiSeq was sufficient to determine the presence of a single nucleotide polymorphism (SNP) that confers
resistance to a widely used antiviral drug, oseltamivir (Ose). Previous studies have established that Influenza A
strains from the 2009 epidemic had mutated to an Ose-resistant phenotype [15]. Thus, we examined the reads
resulting from MiSeq sequencing of a 2009 epidemic strain for the presence of the mutation H275Y, and in
neither of two biological replicates were any of the known SNPs detected. In fact, there was zero coverage
of that specific region of the genome. Given that our experimental model, human blood, represents a relatively
low-complexity sample (as compared to clinical samples from other body sites such as oropharyngeal swabs or
fecal samples), spiked with Influenza A genomes at titers higher than would be expected to be encountered in
a clinical sample, these data suggest that pathogen
Figure 2 Read mapping against representative segments of Influenza genome. In A), reads resulting from a representative run of 454
Titanium sequencing, Ion Torrent PGM sequencing, and Illumina MiSeq sequencing were mapped to the reference Influenza A H1N1 segment 5
[NC_002019.1], using CLC Genomics Workbench version 6.0.4 at default parameters, whereas in B), they were mapped against segment 8
[NC_002020.1]. In B), no Roche-454 reads mapped to the reference. In both A) and B), coordinates of reference genome segment are displayed
along the top and G/C content is graphed below reference in pink.
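
Breadth and depth of coverage, the two statistics used to summarize read mappings such as those in this figure, can be computed directly from mapped read coordinates. The following Python sketch is a minimal illustration and not the CLC Genomics Workbench procedure; the function name, the 0-based half-open coordinate convention and the toy read intervals are all hypothetical.

```python
def coverage_stats(intervals, reference_length):
    """Return (breadth, mean_depth) for reads mapped to one reference.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    breadth:    fraction of reference positions covered by at least one read.
    mean_depth: total mapped bases divided by the reference length.
    """
    depth = [0] * reference_length
    for start, end in intervals:
        # Clip each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    return covered / reference_length, sum(depth) / reference_length


# Toy data (hypothetical): three 150-base reads on a 1,000-base reference,
# two of which overlap by 50 bases.
toy_reads = [(0, 150), (100, 250), (600, 750)]
breadth, mean_depth = coverage_stats(toy_reads, 1000)
print(breadth, mean_depth)  # 0.4 0.45
```

A mean depth well below 1X, as in the toy data, means most positions are covered by at most one read, which is the situation in which few reads overlap and de novo assembly recovers less of the genome than reference mapping.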
enrichment may be a necessary step for determination of drug resistance by metagenome sequencing.

de novo assembly and BLASTn analysis of Influenza A reads
In many cases the identity of a pathogen may not be known or the pathogen’s genome may be very divergent
at the nucleic acid level from that of known family members, necessitating analysis by means other than
mapping reads to a reference genome. Therefore, de novo assemblies of the Influenza A metagenomes from
the PGM and MiSeq data were performed, followed by BLASTn analysis. Assembly of the PGM reads using
MIRA produced an average of 28,000 contigs, yet not a single contig was classified as Influenza A (Table 3).
In fact, no contig was classified as a viral taxon. This result is surprising given the observed sequence depth
in several regions of the Influenza A genome (see Figure 2A). We also attempted assembly of the PGM
reads using Velvet with similar assembly parameters. Again, none of the resultant contigs were able to be
assigned to Influenza A. Upon closer inspection, it was observed that the identified reads were either
substrings of each other and therefore absorbed during the merge step of assembly, or they had a stretch
of mismatches long enough (≥ 10 bases) to cause the assembly algorithm to mark these reads as not
overlapping.

Table 3 De novo assembly and BLASTn statistics of Influenza A sequence reads
Platform   Assembler   # total contigs   # specific contigs
PGM 1      MIRA        26,171            0
PGM 2      MIRA        30,719            0
MiSeq 1    Velvet      841,384           18
MiSeq 2    Velvet      768,912           11

By comparison, assembly of the MiSeq reads for two biological replicates using Velvet generated well over
750,000 contigs for both replicates, with 18 and 11 flu-specific contigs, respectively. Taxonomic classification of
the contigs by MEGAN properly assigned the correct strain (Influenza A virus (A/swine/Iowa/15/1930(H1N1)))
in 5 out of 18 and 0 out of 11 contigs. Importantly, only the strain level classifications were mis-assigned. In each
case, the correct genus and species calls were made. Therefore, each of the contigs identified by MEGAN as
Influenza but not correctly identified at strain level was individually subjected to BLASTn analysis and manual
curation of results. This inspection revealed that the proper strain was indeed present in the hit tables (6 and
7 contigs, respectively), but that for each of these contigs, there were several strains hit with the identical
E-value, including A/swine/Iowa/15/1930(H1N1), and therefore, a definitive strain call could not be made
in those cases. Thus, for one of two replicates, a majority of the influenza-specific contigs resulting from de novo
assembly of the MiSeq reads were either properly classified to the strain level, or conserved amongst a subset of
strains including A/swine/Iowa/15/1930(H1N1). It was interesting to note that, in each case, contigs
corresponding to Segments 4 (HA) and 6 (NA) were not assigned to the proper strain. These results demonstrate
that the increased throughput of the MiSeq enables
species level identification of reads present in proportions as low as 0.0008% of the total.

Comparison of Influenza A sequencing replicates on the PGM and MiSeq
We were interested in further examining the observed differences in the number of mapped reads between
biological replicates and between genome segments. In addition, we wished to determine the reproducibility of
our results. The biological replicates (derived from independent spikings and library preparations) sequenced on
the MiSeq platform exhibited considerable variability (Table 2), and we were interested in determining at what
stage this variability would arise: at library construction or sequencing. Thus, we performed replicate sequencing
runs for each of the PGM and MiSeq libraries. The total number of mapped reads for each technical replicate was
quite consistent (Table 2). This observation held true for both platforms. Moreover, each individual library
yielded a number of mapped reads that was less than one standard deviation from the average. Thus, the inter-run
variability seems to be limited and results of sequencing a given library preparation are highly reproducible.
However, we did note several tendencies regarding the number of mapped reads per segment (Additional
file 2: Table S1 and Additional file 3: Table S2). Firstly, segment 5 tended to be overrepresented in each PGM
replicate. Specifically, for comparison, we expressed the proportion of mapped reads for a given genome segment
as a function of that particular segment’s length, as follows:

y = (# mapped reads per given segment / # total reads mapped) / segment length × 1000

This value for segment 5 is more than twice that for any other segment by PGM sequencing (Additional file
3: Table S2). This is an interesting observation given that segment 5, at 1565 bases in length, is the fifth
largest segment in the genome. In every case for the PGM, segment 5 yielded the most mapped reads. This
observation is consistent with our previous result with the 314 chip in which the only segment mapped was
segment 5 (Additional file 1: Figure S1). Secondly, segment 3 tended to be underrepresented, in both the
PGM and MiSeq data. This trend was especially marked in the PGM libraries as compared to the MiSeq
libraries (0.03 and 0.05, respectively). These findings suggest that some sequences are inherently favored for
sequencing by the PGM as well as the MiSeq. Thirdly, regardless of the number of reads mapped, the
coordinates of the mapped reads in relation to the genome were widely variable (Figure 3). This effect was more
pronounced in the PGM libraries. This suggests that the library diversity is greater than that sampled by a
single sequencing run.
Figure 3 Read mapping against segment 5 of Influenza A of replicate sequencing runs. Reads resulting from a replicate run of Ion Torrent
PGM sequencing (top) and Illumina MiSeq sequencing (bottom) were mapped to the reference Influenza A H1N1 segment 5 [NCBI accession:
NC_002019] using CLC Genomics Workbench version 6.0 at default parameters. Coordinates of reference genome segment are displayed along
the top and G/C content is graphed below reference in pink.
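
The per-segment normalization used in this comparison (a segment's share of all mapped reads, divided by the segment length and scaled by 1000) can be sketched in Python as follows. The function name and the read counts are hypothetical and chosen only for illustration; the segment 5 length of 1565 bases is the one value taken from the text.

```python
def normalized_segment_share(segment_reads, total_mapped_reads, segment_length):
    """(reads on segment / all mapped reads) / segment length * 1000."""
    if total_mapped_reads <= 0 or segment_length <= 0:
        raise ValueError("total_mapped_reads and segment_length must be positive")
    return (segment_reads / total_mapped_reads) / segment_length * 1000


SEGMENT_5_LENGTH = 1565  # bases, as stated in the text

# Hypothetical counts: 10 of 40 mapped reads fall on segment 5.
y = normalized_segment_share(segment_reads=10,
                             total_mapped_reads=40,
                             segment_length=SEGMENT_5_LENGTH)
print(round(y, 3))  # 0.16
```

Under this measure, reads distributed strictly in proportion to segment length would give every segment the same value (1000 divided by the total genome length), so a segment scoring well above the others is overrepresented relative to its size.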
Comparison of Roche-454, Illumina MiSeq and Ion Torrent PGM for detecting bacterial pathogens
Although there have been performance comparisons of the benchtop platforms for sequencing bacterial isolates
[16,17], to our knowledge, there has been no published report of comparisons of sensitivity and specificity in
terms of speciation. Therefore, we aimed to establish a LoD for each sequencer using samples spiked with a
known biothreat agent, Bacillus anthracis. Using the Sterne strain as a surrogate of anthrax infection, the
quantity of bacterial cells present in a dilution series was established. Using three complementary quantitative
techniques, mock samples with B. anthracis Sterne present at roughly 30,000 cells/mL of human blood were
prepared.
All three platforms delivered sequence reads matching the chromosome of B. anthracis Sterne by reference
mapping (Table 4). As expected, given the increased throughput, both PGM and MiSeq yielded significantly
higher numbers of reads as well as genome coverage. However, at low stringency, large numbers of reads were
falsely counted as mapped. This trend was especially prominent in the PGM data. Most of the spurious
reads were short (average: 40 bases) and low-complexity (i.e. rich in nucleotide repeats). This tendency was
confirmed in the replicate samples. The same pattern was observed in the Roche-454 sequence reads as well. In
general, the 454 reads were clustered near the 5′-end of the linear reference sequence, while the PGM and MiSeq
reads spanned a much larger breadth. However, the 454 reads were uniformly longer and more information-rich.
The MiSeq replicates demonstrated a similar tendency as that of the PGM; in general, the lower stringency
mapping contained matched reads of short length (avg. 66 bases). For comparison, in replicate two for both
platforms, 80.77% of mapped MiSeq reads mapped as non-perfect matches versus 98.83% of mapped
PGM reads. The average read length of a non-perfectly matched MiSeq read was 53.22 bases, as compared to
39.99 bases for the PGM. Interestingly, the mean number of mismatches per read tended to be higher in the
MiSeq data than that of the PGM (mean of 26 for MiSeq vs. 15 mismatches for PGM). Due to the slightly longer
length of mapped MiSeq reads, there could conceivably be more opportunity for mismatched bases per read.
Therefore, we compared the mismatches within MiSeq reads and PGM reads by expressing mean number of
mismatches per read as a proportion of mean read length of non-perfectly matched reads. When expressed
in this manner, the proportion of mismatches per length of non-perfectly matched reads is 0.5 for the MiSeq
versus 0.4 for the PGM. Additionally, the number of indel-related mismatches was found to be proportionally
higher in the MiSeq reads than in the PGM reads. Conversely, the PGM reads had a higher number of
A → T and G → C transversions than the MiSeq data. This is inconsistent with the known error models for
both of these platforms.
Although de novo assemblies and taxonomic classification were attempted for both PGM and 454 reads, no
dataset produced a single contig that matched Bacillus spp. This is most likely due to the lack of coverage of
the bacterial genome, resulting in a lack of sufficient overlap between reads necessary to form a contig. By
comparison, de novo assembly of one MiSeq run yielded 6,140,965 total contigs. Taxonomic classification of the
assembled sequence reads rendered 1200 contigs that were assigned to the Bacillus cereus superfamily. More
importantly, 70 contigs were properly classified as Bacillus anthracis Sterne. These results echo those of
the viral samples in that assembled sequence reads from the MiSeq were able to identify the spiked-in
microorganism, in this case to the strain level.

Detection of genetic engineering
We were interested in whether sequencing reads near the LoD could detect an instance of genetic engineering.
Therefore, the strain of B. anthracis spiked into human blood for these experiments was a 34F2 strain of
B. anthracis containing a stably integrated genomic copy of red fluorescent protein (RFP). Metagenomic sequence
data was compared with a computer simulation of the likelihood of detecting this genetic manipulation. For
this purpose, we developed a Perl-based algorithm to generate a distribution of the number of sequence reads
necessary to detect a gene of 1000 bps inserted at five locations into a prokaryotic genome of 5 Mb in size. At
1000 organism-specific reads, the probability of detecting the inserted gene is roughly one-third (Figure 4). In
fact, the minimum number of organism-specific hits necessary to ensure detection of the inserted gene is
10,000. Given that 5 of the 6 metagenomes sequenced
Table 4 Cross-platform comparison of read mapping to B. anthracis Sterne
Platform    # Replicates   Reads mapped (low)   Reads mapped (med)   Reads mapped (high)   Fraction of reference covered (a)   Detected by qPCR?
PGM         2              29,534/15,676        7,689/4,286          247/178               .12/.09                             Y
Roche-454   2              384/376              249/240              65/56                 .01/.01                             Y
MiSeq       2              10,415/41,242        3,024/9,985          1,633/7,930           .07/.19                             Y
a: Fraction of reference covered using high stringency mapping parameters as defined in Methods.
Figure 4 Mathematical modeling of the likelihood of detecting a genetic modification in B. anthracis. The expected number of hits to an
inserted gene of size 1 kb, at 5 copies, was simulated as a function of the number of organism-specific reads collected from the metagenomic
sample. The relative size of each rectangle indicates the proportion of samplings for which a specific number of hits is expected.
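
The kind of simulation summarized in this figure can be approximated with a short Monte Carlo sketch. The Python code below is not the authors' Perl algorithm, whose details are not given here. It assumes a deliberately simple model: every organism-specific read starts at a uniformly random genome position, and a read counts as a hit when its start falls inside one of five 1 kb insert copies in a 5 Mb genome, ignoring read length and edge effects. All names and parameters other than those three sizes are hypothetical.

```python
import random
from collections import Counter

GENOME_SIZE = 5_000_000  # bases (5 Mb, from the text)
INSERT_SIZE = 1_000      # bases (1 kb, from the text)
INSERT_COPIES = 5        # insertion sites (from the text)


def hit_probability():
    """Chance that one read starts inside any insert copy (model assumption)."""
    return INSERT_SIZE * INSERT_COPIES / GENOME_SIZE


def simulate_hit_counts(n_reads, n_trials=1000, seed=0):
    """Empirical distribution {number of hits: fraction of trials}."""
    rng = random.Random(seed)
    p = hit_probability()
    counts = Counter()
    for _ in range(n_trials):
        hits = sum(1 for _ in range(n_reads) if rng.random() < p)
        counts[hits] += 1
    return {hits: n / n_trials for hits, n in sorted(counts.items())}


def prob_at_least_one_hit(n_reads):
    """Closed-form probability of one or more hits under the same model."""
    return 1 - (1 - hit_probability()) ** n_reads


distribution = simulate_hit_counts(1000)
print(distribution)
print(round(prob_at_least_one_hit(1000), 3))   # 0.632
print(prob_at_least_one_hit(10_000) > 0.9999)  # True
```

Because the detection criterion and read model behind the figure are not specified here, the probabilities from this sketch need not coincide with the plotted values. It does reproduce the qualitative point that hits to the insert are rare at around 1000 organism-specific reads and near certain at around 10,000.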
here contained far fewer than 1000 bacillus-specific reads (Table 4), one would expect few, if any, sequence
reads to match the RFP reference sequence. Indeed, read mapping to the available NCBI reference (EF606900.1)
yielded zero matches to the RFP gene present in the 34F2 chromosome. These data suggest that increased
titres, targeted enrichment, or both will be necessary for detection of gene-insertion events.

Discussion
Historically, identification of causal agents of disease has relied heavily on one’s ability to culture the organism in
the laboratory and/or the use of organism-specific antibodies or sequence-based probes. However, these
methods can be very limiting. For instance, some microorganisms are refractory to laboratory culture. In some
cases, even microbiological assays for diseases that are manifested by cultivable organisms, such as endocarditis
caused by staphylococci and streptococci, have a high false negative rate [18]. Serological assays are typically
limited to identifying known or closely related organisms, and antigenic drift and shift can result in false negatives.
Even highly sensitive PCR-based assays must be continually updated due to signature degradation [19].
Additionally, for divergent viruses such as HIV-1, many PCR assays are unable to discriminate between closely
related strains. This necessitates design of multiple probe sets, which is often a laborious task [20]. Additionally,
the sensitivity of these assays is often low due to high numbers of false negatives [21]. Thus, there is a need for
assays that are more robust and less pathogen specific.
Prior to the widespread adoption of high-throughput sequencing (HTS), high-density oligonucleotide
microarrays were used to determine the presence of microorganisms. Syndrome-specific panels showed success in
diagnosis of infectious disease [22]. However, sequence features sufficiently different from the array probe will
not hybridize, resulting in false negatives. By comparison, HTS represents a relatively unbiased approach to
detection of causal agents of infectious disease. However, using metagenomic sequencing in a routine
clinical context would require that some basic questions be answered in terms of sensitivity and reproducibility.
In this study, we compared three HTS platforms for their ability to detect pathogens in human blood. As compared
to the traditional Roche-454 sequencer, the benchtop sequencers Ion Torrent PGM and MiSeq were better able
to detect a pathogen in human blood, in part by virtue of their increased throughput.
Our reported LoD for viral samples on the Roche-454 of 1 × 10^2.5 pfu/mL is similar to but slightly lower than a
previously reported value of 1 × 10^3 pfu/mL [5]. This may
be a reflection of an increased number of reads in our study, differences in sequence library preparation or
improvements in sequencing chemistry. Notably, this value is near the LoD for a validated qPCR assay [23]. Our
reported LoD of 1 × 10^2.5 pfu/mL corresponds to an upper limit of 31,600-53,720 genome copies/mL. Indeed,
at that titer, the sequence reads were of sufficient number and length to unequivocally discriminate between
Dengue virus subtypes. However, a strain-level designation was not possible. The inability to make a strain-level
call could conceivably have potential clinical consequences. For instance, a recent phylogenetic study of
circulating strains of Dengue Virus 2 indicated that a single substitution on the prM39 was responsible for fatal
cases of Dengue Hemorrhagic Fever [24].
Using a different virus at a comparable titer, the Roche-454 platform was able to definitively identify the
pathogen in only one of two replicates. Both the PGM and MiSeq surpassed the Roche-454 in terms of number
of mapped reads as well as reproducibility. However, although the Roche-454 sequence data produced only a
single hit, the length and quality of the hit were adequate to make a correct strain call (data not shown).
The apparent difference in the ability of each platform to detect a given pathogen is a function of the total
output per platform. When the values are expressed in a normalized manner (pathogen reads per 100,000), it
becomes apparent that the analytical sensitivity of the two benchtop platforms is roughly the same and surpasses
that of Roche-454 (Table 2). Neither technology seems to have a significant advantage regarding detection or
characterization of a pathogen if the number of reads is held constant. When cost per sequencing run or per
megabase is factored in, the benchtop sequencers are a more economical option.
In this study, we also examined metagenomic sequencing for its ability to detect a genetically modified
pathogen in a clinical sample. A BSL-2 strain of B. anthracis containing an inserted RFP gene was used as a
surrogate for a genetically modified organism (GMO), and spiked into human blood at relatively high
concentration. In this case, although reads were identified as likely originating from B. anthracis, in none of the
samples was evidence of the inserted gene detected. This likely means that, for the time being, even with their
substantial output, the benchtop sequencers are not suitable for detection of GMO from complex samples
or characterization of threat agents from complex samples. Our results indicate, however, that although
characterization of a given pathogen from a clinical sample by metagenome sequencing on a benchtop
sequencer may not be possible without some pathogen-specific enrichment, identification of species and even
strains is possible.
The results of this study suggest that the benchtop sequencers perform well at the task of identifying a
putative pathogen present at low titers. Each of the three platforms tested provided a number of reads that was
sufficient to unambiguously identify the pathogen to species level in the case of virus and to the genus level
in the case of bacteria. The data from each platform was very reproducible for technical replicates within library
preparations. Indeed, the data from each platform was remarkably consistent in terms of quantity and quality
(Table 2; Additional file 2: Table S1). Additionally, there was little variation in the number of reads mapped or in
taxonomic classification of contigs. These results suggest that library protocols and sequencing chemistries are
robust and uniform enough to make a dependable identification of a given pathogen. Our results are in agreement
with a recently published study in which gDNA from Bacillus anthracis was serially titrated into background
DNA collected from air filters and soil. The results of this study demonstrated that, even with whole genome
amplification prior to sequencing, it is difficult to assign a proper species classification to sequence reads from
B. anthracis [25].
We did note several platform-specific variations in our data. For instance, in the case of the Influenza data from
the PGM platform, segment 5 routinely exhibited coverage bias as compared to larger segments. There could be
several possible explanations for this observation. The G/C-content of the Influenza A genome displays wide
intra-segment and inter-segment variations (Additional file 4: Figure S2). A closer examination of the mapped
reads seems to show a slight correlation with areas of low G/C-content, but follow-on experiments would be
required to conclusively elucidate the impact of G/C-content in this context. A number of studies have noted
that NGS data have distinct biases in areas of high G/C-content [17,26,27]. Additionally, template amplification
via emulsion PCR is a potential source of reduced library diversity [28]. Moreover, inefficiencies during reverse
transcription due to RNA secondary structure may be responsible for the observed coverage bias. Given that
the initial step in PGM library construction involves random fragmentation with RNase III at 37°C, it is possible
that some RNA strands did not completely unfold. This may be especially true for those segments with higher
ΔG, such as segments 1, 2 and 5 (Additional file 4: Figure S2). Although the MiSeq replicates also showed a
slight skew towards segment 5, the preference was not as extreme as that for the PGM. This may be due in part
to the method for RNA library construction on the MiSeq platform. The initial step involves a chemical
fragmentation step at high temperature. Thus, the diversity of the libraries may be different from the start. On
the other hand, the PGM replicates demonstrated some
variation in mapped reads (Figure 3) suggesting that increased throughput might produce greater diversity.
Another platform-specific characteristic observed in this study was the proportion of mismatches within a non-perfectly mapped read. This statistic was slightly higher for the MiSeq platform than for the PGM platform. Ideally, given the correct reference sequence, for a given platform this statistic would approach zero as sequencing error decreased. However, as in this case, this statistic may also be affected by nucleotide differences between the sequenced strain and the closest reference sequence available from NCBI, precluding us from making any conclusions as to sequencing error rates from these data. It is possible that, to some extent, the longer length of MiSeq reads allows for more opportunity for mapping of non-perfect matches, and this may contribute to a decreased LoD for MiSeq when mapping to a closely related but not identical reference sequence. However, the extent to which error rate versus strain level differences and read length affect this statistic as well as the LoD cannot be ascertained in the absence of a true reference sequence for the strain in question.
A number of recent reports have attempted to define the limitations of metagenomic sequencing data. One study made use of simulated data sets to compare assemblies from three sequencing technologies (Sanger, pyrosequencing and Illumina). Unsurprisingly, the study concluded that assembly quality decreased rapidly with increasing sample complexity. For low complexity samples (10 genomes) the assemblies were comparable in quality and inclusiveness, while Illumina data produced superior assemblies in a higher complexity sample (100 genomes) [29]. These results mirror those presented here in that no one sequencing chemistry clearly surpassed another in terms of identification of a microorganism present in a low-complexity sample. It should be noted that the Illumina data in the study by Mende et al. were from the HiSeq platform and were extensively trimmed to provide high quality reads as input [29]. A separate report estimated that genome coverage of 20X was required for proper taxonomic classification of species present in a given metagenomic community [30]. This study is in agreement with our inability to make a correct species-level determination from our Bacillus anthracis samples. Additionally, our results complement an important conclusion from the previous report, namely that the efficiency of gene detection is most likely overestimated.
There has been much effort to understand and improve metagenomic data from complex samples comprised mostly of bacterial species. There are fewer published studies examining the effects of different sequence technologies on viral metagenomics. A recent research paper attempted a comparison of Roche-454 and Illumina data for estimation of diversity in viral quasispecies, in this case HIV. The authors of that study noted that the increased throughput and lower error rate of the Illumina platform enabled improved reconstruction of viral haplotypes. However, due to the longer read lengths, the Roche-454 was superior when long range reconstruction was necessary [31].
Our results indicate that library diversity and overall throughput are the two key metrics in determining how a researcher or clinician may use metagenomic data from benchtop sequencers. For instance, a recent survey of Human papillomavirus DNA present in human skin tumors demonstrated that the increased throughput of the PGM enabled identification of seven additional viral subtypes as compared to data from Roche-454 sequencing of the same samples [32]. This result is similar to our observation that PGM data provided steady, reproducible identification of Influenza A virus in comparison to sequence data produced by the Roche-454. Overall, the MiSeq proved superior to both the Ion Torrent PGM and Roche-454 for both detection as well as classification of the pathogen present in our mock samples.
Although there is no published report of this, it remains formally possible to identify a previously unknown agent from a single novel microbial read present in a complex metagenomic sample. Indeed, identification of novel agents has been reported with as few as 14 reads out of over 100,000 [33]. Whereas identification of an agent may require detection of only one or more reads, characterization, the crucial next step, is absolutely dependent on complete (100%) or nearly complete representation of the agent’s entire genome at adequate depth of coverage, especially in the case of RNA viruses or other microorganisms likely to exhibit functionally relevant minority populations or quasispecies, or genetically modified organisms. In this case, it is necessary for follow-on experiments to more fully characterize the genome of the microorganism, such as Sanger sequencing using primers based on the novel fragment(s). It would be optimal if some of the original sample were available for such experiments. However, in many cases, the original sample may be precious or limited in terms of volume. This challenge can be more pronounced when identifying viral agents as opposed to bacterial agents. Viral genomes are orders of magnitude smaller (~1 × 10^4 to 1 × 10^5 bp) than those of an average bacterial agent (~3-5 × 10^6 bp). Thus, the overall amount of viral nucleic acid may be in the picogram range, increasing the likelihood of two technical obstacles: 1) viral nucleic acid is outcompeted during amplification by other nucleic acids in the matrix, such as host ribosomal RNA if the matrix is tissue, or 2) if the overall amount of nucleic acids in the metagenome sample itself is low, then library preparation of the sample may fail as successive losses of genetic material occur in each step. Targeted
amplification of organism-specific regions prior to sequencing has shown some promise. For instance, an assay in which multiplex PCR preceded sequencing was able to fully differentiate Bacillus anthracis, Yersinia pestis and Francisella tularensis [34].
Conclusions
In this study, we sought to determine empirical limits of detection for metagenomic sequencing of clinical samples using three different sequencing platforms. We found that the analytical sensitivity of all three platforms approaches that of standard qPCR assays. Although all platforms were able to detect pathogens at the levels tested, there were several noteworthy differences. The Roche-454 Titanium platform produced consistently longer reads, even when compared with the latest chemistry updates for the PGM platform. The MiSeq platform produced consistently greater depth and breadth of coverage, while the Ion Torrent was unequaled for speed of sequencing. None of the platforms were able to verify a single nucleotide polymorphism responsible for antiviral resistance in an Influenza A strain isolated from the 2009 H1N1 pandemic. Additionally, none of the platforms tested was able to detect evidence of genetic engineering in a bacterial biowarfare agent that was spiked into a clinical-type sample. Overall, the benchtop platforms perform well for identification of pathogens from a representative clinical sample. However, our results indicate that, unlike identification, characterization of pathogens is likely to require higher titers, multiple libraries and/or multiple sequencing runs.
Methods
Cells and viruses
Bacillus anthracis [34F2 (NCBI taxon ID #526966)] used in this study was routinely cultured in Brain Heart Infusion (BHI) broth (KD Medical; Columbia, MD). Cell counts were performed using the track dilution method [35] on BHI agar plates. One mL aliquots of plaque-purified Dengue virus Type 1 and Type 2 were maintained at −80°C until spiking blood. Influenza A virus stocks were purchased from ATCC (Manassas, VA): VR-1683 (NCBI taxon ID # 380342) and VR-1736 (NCBI taxon ID # 710659).
Sample preparation
(i) Viral samples
Frozen aliquots of plaque-purified Dengue virus Type 1, Dengue virus Type 2, or Swine Influenza A were ten-fold serially diluted in sterile, 1X PBS. One mL aliquots of sodium citrate-treated whole human blood (BioReclamation, Liverpool, NY) were spiked with 100 μL of diluted virus at various titers. After thorough mixing with a micropipette, Trizol LS (Life Technologies, Grand Island, NY) was added to the sample at a ratio of 3:1. Samples were processed for total RNA according to the manufacturer’s protocol.
(ii) Bacterial samples
Log-phase cultures of B. anthracis strain 34F2^RFP were ten-fold serially diluted in 1X sterile PBS. Two mL of sodium citrate-treated whole blood were spiked with 200 μL of diluted culture at various titers. After thorough mixing with a micropipette, genomic DNA (gDNA) was extracted with a BioStic® Bacteremia DNA Isolation kit (Mobio; Carlsbad, CA) according to the manufacturer’s protocol.
Nucleic acid quality checks
RNA integrity and purity were assayed using an RNA 6000 Pico chip on the Agilent Bioanalyzer™. RNA mass was determined by fluorescent detection using Qubit® Broad Range RNA kit (Life Technologies). All RNA samples were stored at −80°C until use. DNA integrity and purity were assessed by agarose gel electrophoresis on 0.8% E-Gels® (Life Technologies). DNA mass was determined using Qubit® Broad Range DNA kit (Life Technologies). All DNA samples were stored at −20°C until use.
High-throughput sequencing
(i) Roche-454 pyrosequencing
For RNA samples, libraries were constructed by following the cDNA Rapid protocol (Roche Diagnostics, Mannheim, Germany) using an input of 200 ng of total RNA. For DNA samples, libraries were constructed following the Rapid DNA protocol (Roche) using an input of 500 ng of gDNA. All emPCR reactions were performed using GS FLX Titanium Lib-L-LV kits. Template-to-bead ratios were optimized via titration. All sequencing runs were performed using a two-region gasket for each pico-titre plate. Each run was 200 cycles.
(ii) Ion Torrent PGM® sequencing
For RNA samples, libraries were constructed by following the Whole Transcriptome Library protocol using an input of 500 ng total RNA. For DNA samples, libraries were constructed using the IonXpress™ Plus gDNA Fragment Library protocol using an input of 500 ng gDNA. Quality control of all libraries was performed on the Agilent Bioanalyzer using a High Sensitivity chip. Library templates were clonally amplified using the Ion One Touch 2™, following the manufacturer’s protocol. Template dilutions were calculated by extrapolation from a qPCR standard curve using the Ion Library Quantitation Kit (Life Technologies). Recovered template-positive Ion Sphere particles (ISPs) were subjected to enrichment according to the corresponding template protocol. Ion Sphere quality control was performed on enriched and unenriched template ISPs using the Ion Sphere™ Quality Control Kit (Life Technologies). Samples containing an optimum number of template ISPs and satisfactory enrichment were subjected to the standard Ion PGM™ 200 Sequencing v2 protocol.
(iii) Illumina MiSeq sequencing
Illumina TruSeq cDNA libraries were prepared from 400 ng total RNA, omitting the polyA selection step. Each library was subjected to a full MiSeq run using a 300 cycle kit, paired end sequencing. A quality control tool for high throughput sequence, FASTQC, a java stand-alone program, was downloaded from Babraham Bioinformatics Institute: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ and each fastq file was checked for quality.
Quantitative PCR
(i) RNA samples
Detection and quantification of Dengue virus and Influenza A virus in samples was performed with SYBR® Green-based real time RT-PCR. Equal masses of total RNA from each sample were analyzed in duplicate using Express One SYBR® GreenER™ (Life Technologies) following manufacturer’s instructions. Specificity of primer pairs (Additional file 5: Table S3) for each virus was checked using NCBI Primer BLAST against the nr database. Standard curves for Dengue virus were constructed using ten-fold serial dilutions of viral RNA extracted from a stock aliquot and are expressed in pfu/mL. Standard curves for Influenza A virus were constructed using a 250 base RNA oligo (Bio-Synthesis Inc., Lewisville, TX) representing the region of the gene to be amplified. Influenza A values are expressed in genome copies/mL.
(ii) DNA samples
Detection and quantification of Bacillus anthracis in samples was performed using SYBR® Green-based real time PCR. Equal masses of gDNA were analyzed in duplicate using Power SYBR® Green (Applied Biosystems, Grand Island, NY). Specificity of primer pairs (Additional file 5: Table S3) for each gene was checked using NCBI Primer BLAST against the nr database. Standard curves for each gene were constructed using ten-fold serial dilutions of a plasmid construct containing the complete coding sequence of each gene. Values are expressed in copies/mL.
Bioinformatics
(i) Reference mapping
Sequence data mapping was performed using CLC Genomics Workbench v6.0.4 (CLC Inc, Aarhus, Denmark). CLC Reference Mapper was run with default settings (Insertion cost = 3, Deletion cost = 3, Mismatch cost = 2, Length fraction = 0.5 and similarity fraction = 0.8). CLC default settings were arbitrarily assigned as low stringency settings. Medium stringency settings were arbitrarily defined as Insertion cost = 3, Deletion cost = 3, Mismatch cost = 2, Length fraction = 0.7 and similarity fraction = 0.8. High stringency settings were defined as Insertion cost = 3, Deletion cost = 3, Mismatch cost = 3, Length fraction = 1.0 and similarity fraction = 1.0. GS Reference Mapper software (v2.0.01.14; Roche/454) was used to produce reference-guided assemblies of each of the Roche datasets with respect to the DENV-1 genome (GenBank; DVU88536).
(ii) de novo assembly and BLASTn analysis of contigs
MIRA V3.4.0.1 production version was used for the assembly of PGM data. The SFF file was converted to fastq using the sff_extract.py utility. The job parameter was modified with the keywords “denovo,genome,accurate,iontor” to indicate a de novo assembly for the corresponding platform. In addition, the modifier “-GE:not” was set to 12 processors for parallel computing. The assembly parameters for MIRA for the PGM data were: minimum overlap = 17 bps, minimum contig size = 150 bps, minimum neighbor quality needed for tagging = 20, minimum read length = 80. For the Velvet assembly the k-mer size was 33.
The assembly of the MiSeq data was conducted using velvet version 1.2.1. The velvet assembly was conducted using a k-mer of size 31 and “short” modifier. For the downstream analysis, BLASTN and annotation, only those contigs with size greater than 150 bps were considered.
BLASTN was conducted with the NCBI-BLAST++ version 2.2.28 against the NT BLAST database from 04/22/2013. The taxonomical annotation was obtained with an in-house script and the results were visualized using MEGAN version 4.70.4.
(iii) Mathematical modeling
Given a particular genome (G) of size (N), we were interested in a particular gene (H) of size (M) within G. If a sequencing experiment is conducted and a particular read from the experiment matches the sequence of H, this is labeled a hit. In order to examine the range of read counts that will result in a range from one hit with low probability to one hit with near certainty, we decided to recreate the parameters of a sequencing experiment in silico. The simulation was written in PERL, where N was set to 5.23 Mbps, M was set to 1000 bps, and five copies of H were used in the simulation. The simulation assumes that all parts of G are equally likely to be sequenced. A thousand iterations of the simulation produced the expected number of hits for read counts of 1 k, 2 k, 3 k, 4 k, 5 k, 8 k, 10 k and 15 k.
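The in silico experiment described above can be sketched as follows. The original simulation was written in Perl and its source is not reproduced here, so this Python version is only an illustration under stated assumptions: each read is reduced to a single start position drawn uniformly from G, the five copies of H are placed at evenly spaced, non-overlapping hypothetical positions, and a read counts as a hit when its start position falls inside any copy. All function and variable names are ours, not the authors'.

```python
import random

GENOME_SIZE = 5_230_000   # N, from the text (5.23 Mbp)
GENE_SIZE = 1_000         # M, from the text
GENE_COPIES = 5           # copies of H, from the text
ITERATIONS = 1_000        # repeats per read count, from the text
READ_COUNTS = [1_000, 2_000, 3_000, 4_000, 5_000, 8_000, 10_000, 15_000]


def gene_intervals(genome_size=GENOME_SIZE, gene_size=GENE_SIZE, copies=GENE_COPIES):
    """Place the copies of H at evenly spaced, non-overlapping positions (assumption)."""
    spacing = genome_size // copies
    return [(i * spacing, i * spacing + gene_size) for i in range(copies)]


def count_hits(n_reads, intervals, genome_size, rng):
    """Count reads whose uniformly drawn start position lies inside any copy of H."""
    hits = 0
    for _ in range(n_reads):
        pos = rng.randrange(genome_size)
        if any(start <= pos < end for start, end in intervals):
            hits += 1
    return hits


def simulate(n_reads, iterations=ITERATIONS, seed=0):
    """Return (mean hits per run, fraction of runs with at least one hit)."""
    rng = random.Random(seed)
    intervals = gene_intervals()
    results = [count_hits(n_reads, intervals, GENOME_SIZE, rng) for _ in range(iterations)]
    mean_hits = sum(results) / iterations
    frac_with_hit = sum(1 for h in results if h > 0) / iterations
    return mean_hits, frac_with_hit


def expected_hits(n_reads):
    """Analytic expectation under the same uniform model: reads * copies * M / N."""
    return n_reads * GENE_COPIES * GENE_SIZE / GENOME_SIZE


if __name__ == "__main__":
    # Fewer iterations than the 1,000 used in the text, only to keep the demo quick.
    for n in READ_COUNTS:
        mean_hits, frac = simulate(n, iterations=20)
        print(f"{n:>6} reads: mean hits {mean_hits:.2f} "
              f"(expected {expected_hits(n):.2f}), runs with >=1 hit {frac:.2f}")
```

Under this uniform model the expected number of hits is simply reads × copies × M / N, which is about 0.96 for 1,000 reads and about 14.3 for 15,000 reads, so the simulation mainly adds an estimate of the run-to-run spread around those values.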
20. Gardner SN, Kuczmarski TA, Vitalis EA, Slezak TR: Limitations of TaqMan
PCR for detecting divergent viral pathogens illustrated by hepatitis A, B,
C, and E viruses and human immunodeficiency virus. J Clin Microbiol
2003, 41(6):2417–2427.
21. Lemmon GH, Gardner SN: Predicting the sensitivity and specificity of
published real-time PCR assays. Ann Clin Microbiol Antimicrob 2008, 7:18.
22. Palacios G, Quan PL, Jabado OJ, Conlan S, Hirschberg DL, Liu Y, Zhai J,
Renwick N, Hui J, Hegyi H, et al: Panmicrobial oligonucleotide array for
diagnosis of infectious diseases. Emerg Infect Dis 2007, 13(1):73–81.
23. Callahan JD, Wu SJ, Dion-Schultz A, Mangold BE, Peruski LF, Watts DM,
Porter KR, Murphy GR, Suharyono W, King CC, et al: Development and
evaluation of serotype- and group-specific fluorogenic reverse
transcriptase PCR (TaqMan) assays for dengue virus. J Clin Microbiol
2001, 39(11):4119–4124.
24. Faria NR, Nogueira RM, de Filippis AM, Simoes JB, Nogueira Fde B, da Rocha
Queiroz Lima M, dos Santos FB: Twenty years of DENV-2 activity in Brazil:
molecular characterization and phylogeny of strains isolated from
1990 to 2010. PLoS Negl Trop Dis 2013, 7(3):e2095.
25. Be NA, Thissen JB, Gardner SN, McLoughlin KS, Fofanov VY, Koshinsky H,
Ellingson SR, Brettin TS, Jackson PJ, Jaing CJ: Detection of Bacillus
anthracis DNA in complex soil and air samples using next-generation
sequencing. PLoS One 2013, 8(9):e73455.
26. Benjamini Y, Speed TP: Summarizing and correcting the GC content bias
in high-throughput sequencing. Nucleic Acids Res 2012, 40(10):e72.
27. Ferrarini M, Moretto M, Ward JA, Šurbanovski N, Stevanović V, Giongo L, Viola
R, Cavalieri D, Velasco R, Cestaro A, et al: An evaluation of the PacBio RS
platform for sequencing and de novo assembly of a chloroplast
genome. BMC Genomics 2013, 14(1):670.
28. Merriman B, Ion Torrent R&D Team, Rothberg JM: Progress in ion torrent
semiconductor chip based sequencing. Electrophoresis 2012, 33(23):3397–3417.
29. Mende DR, Waller AS, Sunagawa S, Jarvelin AI, Chan MM, Arumugam M,
Raes J, Bork P: Assessment of metagenomic assembly using simulated
next generation sequencing data. PLoS One 2012, 7(2):e31386.
30. Ni J, Yan Q, Yu Y: How much metagenomic sequencing is enough to
achieve a given goal? Sci Rep 2013, 3:1968.
31. Zagordi O, Daumer M, Beisel C, Beerenwinkel N: Read length versus depth of
coverage for viral quasispecies reconstruction. PLoS One 2012, 7(10):e47046.
32. Arroyo LS, Smelov V, Bzhalava D, Eklund C, Hultin E, Dillner J: Next
Generation sequencing for human papillomavirus genotyping. J Clin Virol
2013, 58(2):437–442.
33. Palacios G, Druce J, Du L, Tran T, Birch C, Briese T, Conlan S, Quan PL,
Hui J, Marshall J, et al: A new arenavirus in a cluster of fatal
transplant-associated diseases. N Engl J Med 2008, 358(10):991–998.
34. Turingan RS, Thomann HU, Zolotova A, Tan E, Selden RF: Rapid focused
sequencing: a multiplexed assay for simultaneous detection and strain
typing of Bacillus anthracis, Francisella tularensis, and Yersinia pestis.
PLoS One 2013, 8(2):e56093.
35. Jett BD, Hatter KL, Huycke MM, Gilmore MS: Simplified agar plate method
for quantifying viable bacteria. Biotechniques 1997, 23(4):648–650.
doi:10.1186/1471-2164-15-96
Cite this article as: Frey et al.: Comparison of three next-generation
sequencing platforms for metagenomic sequencing and identification
of pathogens in blood. BMC Genomics 2014 15:96.