Significance and Impact of The Study
Accepted Article
Comparative Study of Sequence Aligners for Detecting Antibiotic Resistance in Bacterial
Metagenomes
Department of Civil and Environmental Engineering, Michigan State University, East Lansing,
Michigan, USA
treatment plants among other environments, thus methods for detecting these genes have
become increasingly relevant. Next generation sequencing has brought about a host of sequence
challenging since results produced from alignment tools can vary significantly. Our study
provides sequence alignment results of synthetic and authentic bacterial metagenomes mapped
This article has been accepted for publication and undergone full peer review but has not
been through the copyediting, typesetting, pagination and proofreading process, which may
lead to differences between this version and the Version of Record. Please cite this article as
doi: 10.1111/lam.12842
We aim to compare the performance of Bowtie2, BWA-MEM, BLASTN, and BLASTX when
(CARD). Simulated reads were used to evaluate the performance of each aligner under the
following four performance criteria: correctly mapped, false positives, multi-reads, and partials.
The optimal alignment approach was applied to samples from two wastewater treatment plants
to detect ARGs using next generation sequencing. BLASTN mapped with greater accuracy
among the four sequence alignment approaches considered followed by Bowtie2. BLASTX
generated the greatest number of false positives and multi-reads when aligned against the
CARD. The performance of each alignment tool was also investigated using error-free reads.
Although each aligner mapped a greater number of error-free reads as compared to Illumina-
error reads, in general, the introduction of sequencing errors had little effect on alignment
results when aligning against the CARD. Across all performance criteria, BLASTN was found to
be the most favorable alignment tool and was therefore used to assess resistance genes in
sewage samples. Beta-lactam and aminoglycoside were found to be the most abundant classes
It has been reported that within the first few years of introducing a new antibiotic,
pathogens commonly known to persist in hospital settings develop resistance (Zhang et al.
2015a). Resistant pathogens of this nature escape controlled settings, most commonly through
hospital waste streams, and are released into the environment. Additionally, the intake of
growth promoters, and prevent disease. This may further facilitate the dissemination of
resistance genes in the environment through the use of manure and wastewater effluent in land
applications (Manaia et al. 2016; Schmieder and Edwards 2012). Consequently, antibiotic
resistance genes (ARGs) have become an evolving environmental pollutant known to occur in
ecosystems such as soils, surface waters, and wastewater treatment plants (WWTPs) (Manaia
et al. 2016). Several techniques, namely culturing, quantitative polymerase chain reaction
(qPCR), and microarrays, have been used for detecting ARGs in the environment (Luby et al.
2016). Though these techniques have made it possible to detect ARGs, they are insufficient for
identifying novel ARGs, or a broad spectrum of these genes within microbial communities due
metagenomes, which consist of all genetic material contained in a sample. NGS platforms such
as Illumina, PacBio, and SOLiD sequence libraries of genomic fragments that are later processed
for interpretation. A major stage in the post-sequencing process is the alignment of sequence
fragments (or “reads”) against reference sequences (Langmead and Salzberg 2012; Hatem et al.
2013). The Comprehensive Antibiotic Resistance Database (CARD) is a widely used reference
database in recent metagenomic ARG studies (Elbehery et al. 2016; Garner et al. 2016; Subirats
et al. 2016).
databases. The Basic Local Alignment Search Tool (BLAST) is a widely used sequence alignment
tool for detecting antibiotic resistance in metagenomic studies (Yang et al. 2014; Zhang et al.
2015b; Elbehery et al. 2016; Subirats et al. 2016). BLAST takes a heuristic approach, which
allows for rapid alignment of query sequences against massive reference databases like
GenBank (Altschul et al. 1997). In addition to BLAST, two popular sequence aligners, Bowtie2
(Langmead and Salzberg 2012) and BWA-MEM (Li 2013), attempt to find all possible alignments
for each query sequence by way of indexing the reference, finding global, or maximum exact
Several studies have evaluated aligners given either synthetic data or authentic data
extracted from, for example, the human genome (Ruffalo et al. 2011; Langmead and Salzberg
2012; Hatem et al. 2013; Li 2013). Fewer studies have analyzed sequence aligners using
bacterial genomes (Li 2013), and to our knowledge no studies have
assessed the alignment of bacterial metagenomes against ARG reference databases. A
number of factors influence the behavior of sequence aligners, including the parameters set,
genetic data, and sequencing platform (Hatem et al. 2013; Elbehery et al. 2016). Sequence
aligners are challenged with the task of assigning multi-reads (reads that align equally well
to more than one position on a reference genome) (Treangen and Salzberg 2012) to the correct
position, which can result in errors when only the best hits are reported. When dealing with metagenomic
datasets, the situation is aggravated because the volume of data increases dramatically. Errors
generated from the sequencing platform introduce further ambiguity when analyzing
sequencing data.
two simulated ARG bacterial metagenomes (one without error and another with simulated
based on their frequent appearance in metagenomic studies, compatibility with Illumina data,
and capabilities to deal with reads of various lengths. Furthermore, real bacterial DNA extracted
from two WWTPs and sequenced on an Illumina platform, was aligned against the CARD using
the optimal alignment approach determined by the specified performance criteria. We aim to
(1) assess the performance of BWA-MEM, Bowtie2, BLASTN, and BLASTX when aligning a
simulated bacterial metagenome against an ARG database, and (2) apply the optimal alignment
approach on empirical sewage samples and report the relative abundance of ARG classes in
each sample. Results presented here will assist in determining which tool to use in ARG
metagenomic studies.
To validate each alignment approach, simulated reads were initially mapped back to their
whole genomes (Table 1). During alignment of simulated reads against whole genomes, Bowtie2
and BWA-MEM aligned 100% of reads with and without sequencing errors while BLASTN
aligned 99.71% and 99.82% of reads with error and without error, respectively. Without error
BWA-MEM mapped at a greater accuracy followed by Bowtie2 and BLASTN. Although BWA-
MEM mapped error-free reads with greater accuracy than Bowtie2 and BLAST, there was a
significant decrease in alignment accuracy when aligning reads with sequencing errors. Hence,
our findings suggest that BWA-MEM is more sensitive to discrepancies between the reference
Bowtie2 yielded the highest alignment error rate at 3.51% followed by BWA-MEM at
2.60% when aligning error containing reads against whole genomes. Conversely, BWA-MEM
generated an error of 5.17% while Bowtie2 obtained an error rate of 3.98% when aligning
Considering the absence of a protein reference for whole genomes, BLASTX was not considered
in the preliminary analysis. The high accuracy of aligned reads with and without error among all
methods indicates that the alignment approaches taken generally performed well.
metagenomes to an ARG reference, simulated reads were aligned against the CARD. The highest
number of mapped reads was obtained during BLASTX alignment. Each alignment tool obtained
a greater number of mapped reads when aligning reads without sequencing errors. BLASTN
obtained the greatest number of correctly mapped reads followed by Bowtie2 (Table 2). While
BLASTX obtained the lowest percentage of false positives, it generated a substantial number of
multi-reads. BWA-MEM produced the fewest multi-reads and there was no significant
difference in the percent of multi-reads between BLASTN and Bowtie2 (P = 0.08). Bowtie2
alignment generated 6.08% and 8.17% more false positives compared to BLASTN when aligning
reads with and without error, respectively. Bowtie2 yielded the highest error rate when aligning
simulated data against the CARD with and without error at 5.42% and 2.47%, respectively,
have a slightly greater accuracy compared to Bowtie2, BWA-MEM, and BLASTX. Although false
positives are anticipated, each alignment tool generated a relatively large number of false
positives when mapping against the CARD as compared to whole genomes. This could be
attributed to sequencing errors, which can complicate the alignment process resulting in
incorrect alignments, specifically in metagenomic data (Elbehery et al. 2016). BWA-MEM and
BLASTN mapped reads without sequencing errors with marginally greater accuracy than reads
containing sequencing errors. In most cases, the percent of multi-reads decreased when
positives was slightly less for error-free reads only during alignment with BLAST. Repetitive
regions can also have a substantial impact on the number of false alignments and multi-reads
(Treangen and Salzberg 2012; Yu et al. 2012). While this may be the case during BLASTX
alignment, further analysis is needed to draw a definite conclusion. Despite the slight variations
in results between error-free and Illumina-error reads, the performance (i.e. no. of mapped
reads, correct, partials, multi-reads, and false positives) of aligners when mapping against the
CARD remained largely unchanged when introducing sequencing errors (P = 0.05 - 0.57).
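The percentages quoted in this section can be checked against the raw counts. The short sketch below recomputes them from the two Table 2 rows that survive in this copy. It assumes that the column order follows the Table 2 caption (Bowtie2, BWA-MEM, BLASTN, BLASTX), that the total number of simulated reads is the 128,864 reported in Table 1, and that these rows belong to the reads with Illumina errors, since they reproduce the 6.08% difference stated above.

```python
# Counts copied from the surviving Table 2 rows; column order assumed from the caption.
TOTAL_READS = 128864  # assumption: total simulated reads, taken from Table 1

mapped = {"Bowtie2": 456, "BWA-MEM": 802, "BLASTN": 509, "BLASTX": 1090}
false_positives = {"Bowtie2": 292, "BWA-MEM": 679, "BLASTN": 295, "BLASTX": 327}


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Share of all simulated reads that each aligner mapped to the CARD.
pct_mapped = {name: percent(n, TOTAL_READS) for name, n in mapped.items()}

# Share of each aligner's mapped reads that were false positives.
pct_false = {name: percent(false_positives[name], mapped[name]) for name in mapped}

# Difference in false-positive percentage between Bowtie2 and BLASTN.
gap = round(pct_false["Bowtie2"] - pct_false["BLASTN"], 2)

for name in mapped:
    print(name, pct_mapped[name], pct_false[name])
print("Bowtie2 minus BLASTN:", gap)
```

Note that false positives are expressed relative to mapped reads, not to all simulated reads, which is why the percentages are large even though the counts are small.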
Quality analysis results on trimmed sequences revealed a mean quality of 36 for both
samples (Table 4). Wastewater samples were aligned against the CARD using BLASTN with an
E-value of 1e-5; the remaining parameters were maintained at default settings. A gene similarity
threshold of ≥ 90% over 150 bp was applied to mapped reads.
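A threshold of this kind is straightforward to apply to BLAST output. The following is a minimal sketch, not the authors' script: it assumes the hits were written in BLAST+ tabular format (-outfmt 6, with its default twelve columns) and interprets "over 150 bp" as an alignment length of at least 150. The function name passes_threshold and the example hit lines are hypothetical.

```python
def passes_threshold(line, min_identity=90.0, min_length=150, max_evalue=1e-5):
    """Check one tab-separated BLAST hit (outfmt 6 default columns).

    Columns used: 3 = percent identity, 4 = alignment length, 11 = E-value.
    """
    fields = line.rstrip("\n").split("\t")
    percent_identity = float(fields[2])
    alignment_length = int(fields[3])
    evalue = float(fields[10])
    return (
        percent_identity >= min_identity
        and alignment_length >= min_length
        and evalue <= max_evalue
    )


# Hypothetical hits, for illustration only (not data from the study).
example_hits = [
    "read_1\tgene_A\t98.67\t150\t2\t0\t1\t150\t301\t450\t2e-70\t267",
    "read_2\tgene_B\t85.00\t150\t22\t0\t1\t150\t10\t159\t1e-30\t140",
    "read_3\tgene_C\t99.00\t100\t1\t0\t1\t100\t5\t104\t3e-45\t180",
]

kept = [hit.split("\t")[0] for hit in example_hits if passes_threshold(hit)]
print(kept)
```

Only the first hypothetical hit is kept: the second fails the identity cutoff and the third is shorter than 150 bp.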
A total of 256 and 300 different ARGs met threshold conditions, each obtaining an E-value
of less than 1e-50, in the CAS and MBR sewage samples, respectively, using BLASTN. β-lactam
resistance genes were the most abundant in each sample followed by aminoglycosides. Minor
counts of ARGs belonging to elfamycin, glycopeptide, and polymyxin classes of antibiotics were
detected in the MBR sample, but went undetected in the CAS sample (Figure 1). The Streptomyces
cinnamoneus tuf gene (NCBI accession no. X98831), associated with resistance to elfamycins, was detected in the
MBR sample. Elfamycins are a class of naturally occurring antibiotics that inhibit
bacterial protein synthesis (Sottani et al. 1993). To our knowledge, no recent studies on the
prevalence of elfamycin resistance genes in sewage treatment plants have been documented.
difference in the abundance of antibiotic classes found between sewage samples (P = 0.88).
Limitations on Sequence Alignment Results
The results of this analysis reflect only the specific aligner parameters,
references, and samples used in this study. Some may choose to adjust sequence algorithm
Here, we offer results from the alignment of simulated and real bacterial metagenomes mapped
against an ARG reference database with varying aligners under default conditions.
Since the analysis of ARGs using sequencing is most accurate when identifying known
genes (Schmieder and Edwards 2012), ARGs detected in each WWTP only suggest the presence
of these genes in the samples investigated. When possible, traditional biological detection
methods are recommended for verifying the identification of genes detected using sequence
aligners.
This study evaluates the performance of Bowtie2, BWA-MEM, and BLAST sequence
alignment tools in metagenomic ARG analyses. It also highlights sequencing errors as a potential
factor that can interfere with accurately detecting ARGs in bacterial metagenomes using
compared to Bowtie2 and maintained a lower number of false positives versus Bowtie2 and
BWA-MEM. Therefore, BLASTN was selected as the aligner of choice in the study. It is clear that
each tool has its tradeoffs when balancing specificity versus sensitivity. To gain more
with varying sequencing tools and aligner performance parameters are warranted in the future.
Materials and methods
Sewage samples were collected from the East Lansing conventional activated sludge
(CAS) and Traverse City membrane bioreactor (MBR) WWTPs in Michigan (U.S.A.) in 2013.
The characteristics of these WWTPs are shown in Table 3. Samples presented in this study were
taken directly after the disinfection process of each treatment utility. In short, a 2-liter grab
sample of effluent was collected in sterile Nalgene bottles from each WWTP. Samples were mixed
and stored on ice, then transported to the laboratory for further processing.
Bacteria were recovered using a standard filtration technique with 0.45 μm HA filters
(Millipore, Billerica, MA). The volume of sample filtered was 1 liter for each sample. The filters
were collected in sterile 50 ml polypropylene tubes and 50 ml of phosphate buffer (1X PBS) was
added to each tube containing a filter. The tubes were vortexed for 5 min to allow the biomass
layer on the filters to mix with water. Both tubes were centrifuged for 20 min at 2309 g to
concentrate the sample down to 2 ml. Supernatant was discarded and the concentrates were
DNA extraction was performed using a MagNA Pure Compact DNA extractor (Roche
Applied Science, Indianapolis, IN, USA) following the protocol in the manufacturer’s manual. The
MagNA Pure Compact utilizes magnetic-bead technology for the isolation process. A sample
volume of 400 μl was loaded into the system and the elution volume was 100 μl. The purified
DNA was stored in a freezer at -20°C. DNA concentration was determined using the NanoDrop
DNA samples were isolated and approximately 1 μg of DNA (per sample) was sent to the
Research Technology Support Facility (RTSF) at Michigan State University. The NuGEN Ovation
Ultralow Library System, with an input requirement of 1-100 ng of DNA, was used for both
samples to accommodate any sample containing little genetic material. After preparation,
Sequences were returned as R1.FASTQ and R2.FASTQ files for each sample, where R1 and
R2 constitute a read pair. Each FASTQ file was processed using a Unix/Linux system offered
through the MSU High Performance Computing Center (HPCC). Raw sequences were analyzed
for quality using FastQC, a quality control tool for sequencing data (Andrews 2010). Based on
the quality control check, Illumina adapters and reads with an average quality score below 15
were removed using Trimmomatic (Bolger et al. 2014). Finally, FastQC was run once
more on the quality trimmed reads to ensure the integrity of the sequence reads and accuracy of
nucleotide database. Sequences associated with the following NCBI accession numbers were
used and consist of ARGs contained in the CARD: NC_003197.2, NC_010410.1, X58272.1,
genera and comprise ARGs with lengths ranging between 439 - 3205 nucleotides (Table S1).
Read length and insert size for paired-end synthetic reads were assigned based on the
characteristics of sequenced wastewater samples. Read length distribution for both samples
was validated using a custom awk command on trimmed reads. BBmerge (extended to 100
bases with a kmer size of 62) in the BBTools package (BBMap – Bushnell B. –
deviation of both overlapping and non-overlapping reads in the trimmed CAS and MBR paired-end
FASTQ files. Synthetic reads were then simulated using Grinder, version 0.5.4 (Angly et al.
2012). Grinder generated 150 bp paired-end reads with 1x coverage and a mean insert size and
standard deviation of 218 and 54 bp, respectively. The remaining parameters were run at default
conditions.
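The read-length check on trimmed reads mentioned above does not depend on awk. The sketch below is a Python stand-in, not the command used in the study. It assumes standard four-line FASTQ records (header, sequence, "+", qualities); the function name read_length_distribution and the three example records are hypothetical.

```python
import io
from collections import Counter


def read_length_distribution(handle):
    """Count how many reads of each length appear in a four-line-per-record FASTQ."""
    lengths = Counter()
    for index, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if index % 4 == 1:
            lengths[len(line.strip())] += 1
    return lengths


# Tiny in-memory FASTQ used only to illustrate the call.
example_fastq = io.StringIO(
    "@read_1\nACGTACGT\n+\nIIIIIIII\n"
    "@read_2\nACGT\n+\nIIII\n"
    "@read_3\nTTGGCCAA\n+\nIIIIIIII\n"
)

distribution = read_length_distribution(example_fastq)
for length, count in sorted(distribution.items()):
    print(length, count)
```

For a real file, the same function can be passed an open file handle in place of the in-memory example.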
To better assess the effects of sequencing error on performance, synthetic reads were
generated with and without error. Illumina-error profiles were generated as recommended in
the instruction manual. Quality scores between 15 and 36 were assigned to each read in the
resulting FASTQ file according to the error profile. The simulated metagenome reads without error
were generated with default quality scores. Synthetic reads were aligned with Bowtie2, BWA-
Reference genes (both nucleotide and protein sequences) from non-mutated CARD
version 1.18 (Jia et al. 2017) were downloaded and used for alignment. The nucleotide CARD is
antibiotic classes. It is 2,027,840 nucleotide bases in length and contains 2165 ARG sequences
imported from NCBI GenBank and peer-reviewed publications. The protein CARD is 671,057
amino acids in length and consists of 2165 protein sequences in FASTA format. Reference genes
are classified based on the CARD’s Antibiotic Resistance Ontology (ARO) (Jia et al. 2017).
Simulated metagenomes were analyzed using Bowtie2, BWA-MEM, and BLAST, tools for
aligning reads to reference sequences. Bowtie2 was operated using default settings (i.e. end-to-
end alignment, and a minimum threshold alignment score of -90) for each metagenome
(Langmead and Salzberg 2012). BWA-MEM was operated using default settings (i.e. local
alignment) (Li 2013). BLASTN and BLASTX in the BLAST+ package version 2.6 were used for
aligning reads. BLAST tools were run using default settings (i.e. local alignment) with an E-value
of 1e-5.
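For orientation, the aligner settings described above correspond roughly to the command lines assembled below. This is a sketch under stated assumptions, not the commands used in the study: paired-end input for Bowtie2 and BWA-MEM, FASTA-converted reads for BLAST, tabular BLAST output (-outfmt 6), and all file and index names are hypothetical. Only the E-value of 1e-5 and the use of default settings come from the text. The function only builds argument lists; it does not run anything.

```python
def build_commands(reads_1, reads_2, reads_fasta, bowtie2_index, bwa_reference,
                   card_nucleotide_db, card_protein_db):
    """Return argument lists for the four aligners (nothing is executed)."""
    evalue = "1e-5"  # E-value stated in the text; other BLAST settings stay at default
    return {
        # Bowtie2 runs end-to-end alignment by default.
        "bowtie2": ["bowtie2", "-x", bowtie2_index, "-1", reads_1, "-2", reads_2,
                    "-S", "bowtie2.sam"],
        # BWA-MEM performs local alignment and writes SAM to standard output.
        "bwa-mem": ["bwa", "mem", bwa_reference, reads_1, reads_2],
        # BLAST takes FASTA queries, so reads must be converted from FASTQ first.
        "blastn": ["blastn", "-query", reads_fasta, "-db", card_nucleotide_db,
                   "-evalue", evalue, "-outfmt", "6", "-out", "blastn.tsv"],
        "blastx": ["blastx", "-query", reads_fasta, "-db", card_protein_db,
                   "-evalue", evalue, "-outfmt", "6", "-out", "blastx.tsv"],
    }


# Hypothetical file names, for illustration only.
commands = build_commands(
    "sim_R1.fastq", "sim_R2.fastq", "sim_reads.fasta",
    "card_bt2", "card_nucleotide.fasta", "card_nt_db", "card_prot_db",
)
for name, argv in commands.items():
    print(name + ":", " ".join(argv))
```

Each list could be passed to subprocess.run once the corresponding index or BLAST database has been built.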
CARD, the alignment methods used in this study were verified by first aligning simulated reads
back to their whole genomes. The performance of each alignment tool was then evaluated when
mapping simulated reads against the CARD. Since the position of each read is known, a custom
Python script was used to evaluate the following four performance criteria: correctly mapped
reads (reads aligned to the correct position on the genome, or gene), partially mapped reads
(reads offset from their true position while obtaining at least 20 true positive bases), false
parameters were evaluated for Bowtie2, BWA-MEM, and BLASTN alignments. Reads mapped to
their respective nucleotide accession number when translated with a percent identity of ≥ 90%
over at least 25 amino acids (aa) (Elbehery et al. 2016) were considered correctly mapped
during alignment with BLASTX. Reads mapped with the abovementioned criteria to the
incorrect target accession number were considered false positives. Multi-reads obtained during
BLASTX follow the same criteria as previously mentioned. Partially mapped reads were not
considered during BLASTX alignment. Since multi-reads introduce uncertainty when analyzing
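The nucleotide-level criteria can be illustrated with a small classifier. This is a minimal sketch, not the custom script used in the study. It assumes equal-length reads of 150 bp, treats "true positive bases" as the overlap between the reported and true positions, and, because the definition of false positives is cut off in this copy, assumes that any mapped read that is neither correct nor partial counts as a false positive. The multi-read check follows the definition given earlier (a read aligning equally well to more than one position). All function and variable names are hypothetical.

```python
def classify_read(true_ref, true_start, hit_ref, hit_start,
                  read_length=150, min_true_bases=20):
    """Classify one alignment against the known origin of a simulated read."""
    if hit_ref is None:
        return "unmapped"
    if hit_ref == true_ref:
        if hit_start == true_start:
            return "correct"
        # Overlap between two equal-length intervals shifted by the offset.
        true_bases = max(0, read_length - abs(hit_start - true_start))
        if true_bases >= min_true_bases:
            return "partial"
    # Assumption: every other mapped read is counted as a false positive.
    return "false_positive"


def is_multi_read(alignment_scores):
    """A read is a multi-read if its best score is shared by two or more hits."""
    if not alignment_scores:
        return False
    return alignment_scores.count(max(alignment_scores)) > 1


print(classify_read("gene_A", 100, "gene_A", 100))  # same position
print(classify_read("gene_A", 100, "gene_A", 180))  # offset by 80, 70 bases overlap
print(classify_read("gene_A", 100, "gene_A", 240))  # offset by 140, 10 bases overlap
print(classify_read("gene_A", 100, "gene_B", 100))  # wrong reference
print(is_multi_read([60, 60, 42]))
```

Under these assumptions the four calls yield one correct, one partial, and two false-positive classifications, and the score list is flagged as a multi-read.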
The optimal alignment approach given the performance criteria was used for aligning the
wastewater samples against the CARD. Genes meeting a threshold value of ≥ 90.00% nucleotide
gene similarity (Kristiansson et al. 2011; Zhang et al. 2015b) or ≥ 90.00% over at least 25 aa
Statistical Analysis
classes between sewage samples was determined by a one-way analysis of variance (ANOVA)
test in SPSS 24.0, where P < 0.05 was considered statistically significant. Error rates were
retrieved directly from the sample’s Binary Alignment/Mapping (BAM) output file using
SAMtools. All quality scores reported follow a Phred scale (Ruffalo et al. 2011).
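For readers unfamiliar with the Phred scale, a quality score Q corresponds to a base-call error probability P through Q = -10 log10(P). The short sketch below converts in both directions, using the scores that appear in this study (the trimming cutoff of 15 and the mean quality of 36) as examples. The function names are hypothetical.

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(probability)


for quality in (15, 20, 36):
    print(quality, phred_to_error_probability(quality))
```

A score of 15 corresponds to roughly a 3% chance of error per base, while a score of 36 corresponds to roughly 0.025%.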
We would like to thank the managers at East Lansing and Traverse City WWTPs for
providing samples. We also thank Bioinformatics Research Specialists Dr. Tracy Teal and
Dharanya Sampath for technical support. Special thanks to Dr. Shin-Han Shiu for his guidance,
Conflict of Interest
References
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.H., Zhang, Z., Miller, W. and Lipman, D.J.
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Andrews S. (2010) FastQC: A quality control tool for high throughput sequence data.
Angly, F.E., Willner, D., Rohwer, F., Hugenholtz, P. and Tyson, G.W. (2012) Grinder: a versatile
amplicon and shotgun sequence simulator. Nucleic Acids Res 40, e94-e94.
Bolger, A.M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina
Elbehery, A.H.A., Aziz, R.K. and Siam, R. (2016) Antibiotic Resistome: Improving Detection and
M., Aga, D.S. and Pruden, A. (2016) Metagenomic profiling of historic Colorado Front Range
flood impact on distribution of riverine antibiotic resistance genes. Sci Rep 6.
Gupta, S.K., Padmanabhan, B.R., Diene, S.M., Lopez-Rojas, R., Kempf, M., Landraud, L. and Rolain,
J.M. (2014) ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in
Hatem, A., Bozdag, D., Toland, A.E. and Catalyurek, U.V. (2013) Benchmarking short sequence
Jia, B., Raphenya, A.R., Alcock, B., Waglechner, N., Guo, P., Tsang, K.K., Lago, B.A., Dave, B.M.,
Pereira, S., Sharma, A.N., Doshi, S., Courtot, M., Lo, R., Williams, L.E., Frye, J.G., Elsayegh, T.,
Sardar, D., Westman, E.L., Pawlowski, A.C., Johnson, T.A., Brinkman, F.S.L., Wright, G.D. and
McArthur, A.G. (2017) CARD 2017: expansion and model-centric curation of the comprehensive
Kristiansson, E., Fick, J., Janzon, A., Grabic, R., Rutgersson, C., Weijdegard, B., Soderstrom, H. and
Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods
9, 357-U354.
Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv:1303.3997v1 [q-bio.GN].
Luby, E., Ibekwe, A.M., Zilles, J. and Pruden, A. (2016) Molecular Methods for Assessment of
45, 441-453.
urban aquatic environments: can it be controlled? Appl Microbiol Biotechnol 100, 1543-1557.
Ruffalo, M., LaFramboise, T. and Koyuturk, M. (2011) Comparative analysis of algorithms for
Schmieder, R. and Edwards, R. (2012) Insights into antibiotic resistance through metagenomic
Sottani, C., Islam, K., Soffientini, A., Zerilli, L.F. and Seraglia, R. (1993) Studies on the Interaction
Subirats, J., Sanchez-Melsio, A., Borrego, C.M., Balcazar, J.L. and Simonet, P. (2016) Metagenomic
analysis reveals that bacteriophages are reservoirs of antibiotic resistance genes. Int J of
Treangen, T.J. and Salzberg, S.L. (2012) Repetitive DNA and next-generation sequencing:
Yang, Y., Li, B., Zou, S., Fang, H.H.P. and Zhang, T. (2014) Fate of antibiotic resistance genes in
sewage treatment plant revealed by metagenomic approach. Water Res 62, 97-106.
Ye, H., Meehan, J., Tong, W. and Hong, H. (2015) Alignment of Short Reads: A Crucial Step for
541.
Yu, X., Guda, K., Willis, J., Veigl, M., Wang, Z., Markowitz, S., Adams, M.D. and Sun, S. (2012) How
do alignment programs perform on sequencing data with varying qualities and from repetitive
Zhang, T., Yang, Y. and Pruden, A. (2015b) Effect of temperature on removal of antibiotic
Supporting Information
Table S1 Summary of genomes and ARGs contained in the simulated bacterial metagenome.
Percent identity represents the BLASTn hit against the NCBI nucleotide database (nr/nt) with an E-
value of 0 over 100% of the query sequence. Each ARG sequence was extracted directly from the
CARD. All genome and gene annotations were extracted from NCBI and the CARD’s Antibiotic
Resistance Ontology.
Figure 1 Relative abundance of antibiotic resistance classes in the (a) CAS and (b) MBR
Illumina-errors when aligned with Bowtie2, BWA-MEM, and BLASTN against their whole
genomes.
No. of Mapped reads (%) 128864 (100) 128864 (100) 128484 (99.71)
No Error 2 (~ 0) 4 (~ 0) 0
Illumina-errors when aligned with Bowtie2, BWA-MEM, BLASTN, and BLASTX against the
CARD. BLASTX was evaluated based on the following criteria: target genes obtaining a percent
identity of ≥ 90% over at least 25 amino acids with the read mapped to the correct target NCBI
accession number.
No. of Mapped reads (%) 456 (0.35) 802 (0.62) 509 (0.39) 1090 (0.85)
No. of False Positives (%) 292 (64.04) 679 (84.66) 295 (57.96) 327 (30)
plants.
East Lansing WWTP Traverse City WWTP
Table 4 Quality control analysis results on raw and post quality trimmed sequence reads using
FastQC.
Conventional Activated Sludge    CAS    13115140    52    10747794    52
Membrane Bioreactor              MBR    12478608    58    9779513     58