Nature Reviews Genetics | AOP, published online 10 December 2013; doi:10.1038/nrg3655

PERSPECTIVES

APPLICATIONS OF NEXT-GENERATION SEQUENCING: OPINION

The role of replicates for error mitigation in next-generation sequencing

Kimberly Robasky, Nathan E. Lewis and George M. Church

Abstract | Advances in next-generation sequencing (NGS) technologies have rapidly improved sequencing fidelity and substantially decreased sequencing error rates. However, given that there are billions of nucleotides in a human genome, even low experimental error rates yield many errors in variant calls. Erroneous variants can mimic true somatic and rare variants, thus requiring costly confirmatory experiments to minimize the number of false positives. Here, we discuss sources of experimental errors in NGS and how replicates can be used to abate such errors.

The emergence of next-generation sequencing (NGS) has revolutionized the study of genetics and provided valuable resources for other scientific disciplines. As NGS becomes more widely accessible, its use has extended beyond basic research and into broader clinical contexts. It is therefore increasingly important to account for the errors that arise in the sequencing process. These errors can stem from the bioinformatic analysis1 and from experimental steps2,3 (which can often be mitigated through the use of replicate experiments).

The use of replicates permeates almost all scientific disciplines. However, in NGS, many researchers use sequencing read depth and bioinformatic filters to address errors in lieu of biological replication. This practice is understandable, given that replicates can substantially increase study costs. However, sequencing costs have decreased markedly4, and now is the time to re-evaluate the value of replication in sequencing studies.

In this Perspective article, we discuss sources of errors in sequencing and the nascent use of replication in published high-throughput sequencing efforts. In addition, we show how biological replicates can be used to reduce sequencing errors. In particular, we demonstrate that replicates can be used to assess the specificity and the sensitivity of sequence variant-calling methods in a manner that is independent of the algorithms and the chemistry that are used to call variants, thereby guiding the appropriate selection of quality score thresholds.

Experimental errors in NGS
Technological advances and the digital nature of DNA are helping to achieve highly accurate genome sequences. However, sequencing methods are imperfect. NGS applications such as whole-genome sequencing, targeted capture, high-throughput RNA sequencing (RNA-seq) and chromatin immunoprecipitation followed by sequencing (ChIP–seq) are prone to errors that result in miscalled bases, thus causing misalignment of short reads and mistakes in genome assembly. Reported claims of sequencing base call accuracy for leading NGS technologies vary greatly, ranging from one error in one thousand nucleotides (99.9%)5 to one error in ten million nucleotides (99.9999%)6. Even for methods that have the lowest reported error rates, the absolute numbers of miscalled genomic variants remain unwieldy: there might be thousands of false-positive variants in a fully sequenced human genome. Furthermore, false-positive errors can be mistaken for rare and somatic variants, thereby obfuscating true variants of clinical interest. Known sources of experimental errors can be grouped by their occurrence in the sequencing workflow; that is, during sample preparation, library preparation, or sequencing and imaging (FIG. 1a; BOX 1).

Sample preparation. Sequencing errors and biases can arise from sample degradation and contamination during sample isolation and preservation. For example, during sample preservation, formalin fixation causes degradation and nucleotide changes7,8. Moreover, inadequate amounts of high-quality genomic material can increase amplification errors and decrease sequencing read depth9. Finally, contamination poses a challenge when non-tumour cells mask oncogenic somatic variants10 or when exogenous DNA interferes with calls of homozygosity or heterozygosity11.

Library preparation. Errors also arise during sequencing library preparation, which leads to uneven coverage, sequence changes and interruption of sequence tags. DNA fragmentation can produce length biases, which subsequently cause preferential amplification12. Library amplification is subject to unmeasured primer biases, such as primer bias in multiple displacement amplification (MDA)13, mispriming in PCR target enrichment14 and incorporation of sequence errors during both clonal amplification and PCR cycling15. When barcodes, adaptors and other pre-defined sequence tags are added to the fragments being sequenced, disruption and inadequate tag design can result in cross-contamination of data sets, read loss and decreased read quality2,16. Chimeric reads can also arise in long-insert paired-end libraries17 and potentially confound variant calls and assembly efforts.

Sequencing and imaging. Current NGS platforms3 have sequencing and imaging error types that are specific to the platforms18. For example, substitution errors can arise in platforms such as Illumina and SOLiD when incorrect bases are introduced during clonal amplification of templates. Furthermore, Illumina has shown a sequence-specific error profile19 that possibly arises from either single-strand DNA folding or sequence-specific alterations in enzyme preference. The single-molecule, real-time (SMRT) platform of Pacific Biosciences yields long single-molecule reads that are subject to false insertions and deletions (indels) from non-fluorescing nucleotides20,21. Pyrosequencing (for example, Roche 454 platforms) and semiconductor sequencing (for example, Ion Torrent) have difficulty in counting homopolymer stretches, which results in carry-forward insertion and deletion errors22.

Experimental errors pose challenges in applications for which accuracy is crucial, such as in detection of somatic mosaicism23,24 and in other clinical applications. Errors are often addressed by increasing sequencing read depth but can also be mitigated by careful barcoding strategies25, replicates, orthogonal sequencing technologies26 and prior knowledge of variants27. Together, these approaches can help to overcome variations in experimental conditions, stochastic fluctuations and systematic biases.

Replicates and experimental errors
Many applications of NGS (for example, the detection of rare causal variants and somatic variants, and clinical applications) require high fidelity in sequencing, which necessitates confirmatory experiments, such as Sanger sequencing. The standard validation methods that are used for confirmation tend to be costly and labour intensive, and lower-cost alternatives are therefore needed.

[Figure 1 schematic. Panel a, Experimental sources of sequence variation: somatic mosaicism, sample degradation, chimeric reads, incorrect base incorporation and overlapping signal, shown for replicates R1, R2 and R3 across sample preparation, library preparation (amplification and adapter), and sequencing and imaging. Panel b, Post-processing mechanisms to identify unexpected variation: quality scores, read clipping, optimized sequence filters, GC alignment filtering, sequence coverage, variant clustering and pedigree analysis, arranged under primary analysis (e.g. base calling and quality control), secondary analysis (e.g. alignment and variant calling) and tertiary analysis.]

Figure 1 | Sources of and tools to cope with unexpected or erroneous variants. Sequencing experiments involve many steps from sample acquisition to final data analysis, and a major challenge in the process stems from the emergence of unexpected or erroneous variants. Sequencing pipeline and sources of errors are shown; R represents a replicate. a | These variants can include legitimate somatic mosaicism and rare oncogenic variants. Additionally, many erroneous sequence variants arise during experimental steps, for example, through sample degradation, PCR amplification errors and base-calling errors. b | Several analytical tools and post-processing mechanisms are often used for separating true variation from false sequence variants. These include indicators of data quality (for example, base call and mapping quality scores) and the choice of filters that is informed by these indicators. Additional tertiary analyses can also highlight systematic biases through clustering methods and possible false-positive variants by accounting for Mendelian inheritance patterns57. Throughout the sequencing and post-processing pipeline, the use of replicated sequencing experiments can help to mitigate the effect of erroneous variants from the experimental steps and to inform the choice of post-processing filters. Thus, greater accuracy of germline variant detection can be attained, and improved sensitivity can be achieved for true somatic variation.


Box 1 | Experimental sources of errors in sequencing
The importance and the relative effect of each error source on downstream applications depend on many factors, such as sample acquisition, reagents, tissue type, protocol, instrumentation, experimental conditions, analytical application and the ultimate goal of the study. Sequencing errors can stem from any time point throughout the experimental workflow, including initial sequence preparation, library preparation and sequencing. Some examples are listed below.

Sample preparation
• User errors; for example, mislabelling
• Degradation of DNA and/or RNA from preservation methods; for example, tissue autolysis, nucleic acid degradation and crosslinking during the preparation of formalin-fixed, paraffin-embedded (FFPE) tissues8,87,88
• Alien sequence contamination; for example, those of mycoplasma and xenograft hosts89
• Low DNA input9

Library preparation
• User errors; for example, carry-over of DNA from one sample to the next and contamination from previous reactions90
• PCR amplification errors9
• Primer biases; for example, binding bias, methylation bias, biases that result from mispriming, nonspecific binding and the formation of primer dimers, hairpins and interfering pairs, and biases that are introduced by having a melting temperature that is too high or too low91,92
• 3ʹ‑end capture bias that is introduced during poly(A) enrichment in high-throughput RNA sequencing
• Private mutations; for example, those introduced by repeat regions and mispriming over private variation94
• Machine failure; for example, incorrect PCR cycling temperatures15
• Chimeric reads2,17
• Barcode and/or adaptor errors; for example, adaptor contamination, lack of barcode diversity and incompatible barcodes16,95

Sequencing and imaging
• User errors; for example, cluster crosstalk caused by overloading the flow cell96
• Dephasing; for example, incomplete extension and addition of multiple nucleotides instead of a single nucleotide3
• ‘Dead’ fluorophores, damaged nucleotides and overlapping signals20
• Sequence context; for example, GC richness, homologous and low-complexity regions, and homopolymers19,97,98
• Machine failure; for example, failure of laser, hard drive, software and fluidics
• Strand biases97

An approach that holds promise uses the tried-and-true scientific method of replication to mitigate user errors, stochastic differences and other sources of experimental errors. Different types of replication are described below, including sequencing read depth, and technical, biological and cross-platform replication.

Sequencing read depth. The most straightforward approach to improve sensitivity and accuracy in sequence variant calls is to increase sequencing read depth28,29. By increasing the number of short reads, one can improve variant calling on easily sequenced regions. Consequently, one can reduce the number of missed true variants (that is, false negatives) and sometimes the number of true non-variants that are incorrectly detected as variants (that is, false positives). However, merely increasing sequencing read depth cannot ameliorate issues that arise from the widespread batch effect phenomenon30 and many other error types that are introduced in the experimental process. Thus, increased sequencing read depth is not necessarily an adequate proxy for biological replication and is limited in its ability to mitigate errors.

Technical replicates. The frequency of certain error types can be reduced through technical replication. We define technical replication as the repeat analysis of the exact same sample. For example, technical replicates were used with monozygotic twins, and the data showed higher intra-individual correlations than inter-individual correlations31. In another example6, many technical replicate pools were sequenced and each contained dilute DNA. Pools containing haplotypes with incongruent base calls that were suspected to be amplification errors were discarded, and the sequence quality was significantly improved.

Biological replicates. We define biological replication as the preparation and the analysis of multiple biological samples under the same conditions from the same host. Biological replicates in genome sequencing can be used to assess the efficacy of various bioinformatic filters32. Additional benefits over technical replicates include the identification of rare somatic mosaicism and of differences in transcript abundance. Somatic mosaicism can arise from mutations that occur from mutagens and other causes24. Biological replicates can indirectly help to uncover somatic mutations in complex and heterogeneous tumours when they are used to achieve the ‘normal’ baseline sequence in tumour–normal pairs.

Cross-platform replicates. Each sequencing platform introduces unique biases and error types. Thus, integrating sequencing data from different technologies can further mitigate errors. For example, sequencing DNA samples that were taken from both the blood and saliva on two different platforms (Illumina and Complete Genomics) resulted in 88.1% concordance of single-nucleotide variants (SNVs) across replicates33. Validation rates for variants that were called on both platforms were higher than those for variants that were not. In another study, sequencing on three platforms (Illumina, Roche 454 and SOLiD) showed 64.7% concordance5. This disparity could result from multiple experimental error sources and from differences in downstream bioinformatic processing. Cross-platform replicates greatly reduce the number of false-positive variants, but the different biases from each sequencing platform may cause many true variants to be overlooked when cross-platform replicates are compared.

Reducing errors and replicates
As sequencing further permeates science and medicine, replicates will be invaluable to researchers and clinicians alike. Current efforts in sequencing error mitigation mainly rely on filtering strategies, including filtering for sequencing read depth, base call quality, short-read alignment quality, variant call quality, known variants, strand bias, allelic imbalance and sequence context10,21,25,27,34–36. All of these post-processing techniques help to reduce uncertainty in the final genotyping variant call (FIG. 1b).
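
The filtering strategies listed in this section can be pictured as a chain of threshold tests applied to each candidate variant. The following Python sketch is illustrative only: the record fields, the function name `passes_filters` and every threshold value are assumptions made for this example, not values taken from the article or from any particular variant caller.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-variant summary; all field names are assumptions."""
    locus: str
    depth: int              # reads covering the locus
    call_quality: float     # variant call quality score
    mapping_quality: float  # short-read alignment quality
    alt_forward: int        # variant-supporting reads on the forward strand
    alt_reverse: int        # variant-supporting reads on the reverse strand


def passes_filters(v: VariantCall,
                   min_depth: int = 10,
                   min_call_quality: float = 30.0,
                   min_mapping_quality: float = 20.0,
                   max_strand_fraction: float = 0.9) -> bool:
    """Return True if the call clears every (illustrative) threshold."""
    alt_reads = v.alt_forward + v.alt_reverse
    if alt_reads == 0:
        return False  # no read supports the variant
    # Strand bias: share of supporting reads that come from the dominant strand.
    strand_fraction = max(v.alt_forward, v.alt_reverse) / alt_reads
    return (v.depth >= min_depth
            and v.call_quality >= min_call_quality
            and v.mapping_quality >= min_mapping_quality
            and strand_fraction <= max_strand_fraction)


calls = [
    VariantCall("chr1:1000", depth=35, call_quality=60.0, mapping_quality=50.0,
                alt_forward=9, alt_reverse=8),
    VariantCall("chr1:2000", depth=6, call_quality=45.0, mapping_quality=55.0,
                alt_forward=2, alt_reverse=2),    # fails: low read depth
    VariantCall("chr2:500", depth=40, call_quality=50.0, mapping_quality=40.0,
                alt_forward=12, alt_reverse=0),   # fails: support on one strand only
]

kept = [v.locus for v in calls if passes_filters(v)]
print(kept)
```

Hand-picked cut-offs like these are exactly what the article argues can be tuned more objectively when replicate data are available.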


Bioinformatic filtering techniques can be optimized using technical, biological and cross-platform replicates to improve their specificity and sensitivity32. For example, optimal quality score thresholds for each filter may be selected using replicate genome sequences. An individual human genotype has ~3 million variants36; however, variant callers can predict >20 million variants of differing quality per genome, which mainly result from mismapped short reads37, mosaicism and sequencing errors. Consequently, thresholds are chosen to limit the variants called in the individual’s genotype. Ideally, these thresholds are chosen with experimental confirmation38, but this can be costly. We assert that replicates can abet bioinformatic filtering and reduce the number of variants that require validation, thereby improving the quality of the sequence that is being mapped or assembled.

To illustrate this, we use biological replicates to carry out a simple analysis for assessing the reliability of single-nucleotide substitution calls (FIG. 2). For genotyping, the number of replicates should be chosen to attain adequate statistical power at the loci in question. However, in this case, we seek a set of probable false positives that stem from experimental errors, which requires only three replicates for a voting majority. For the replicates, we obtained sequence data from three distinct tissue samples of participant PGP1 in the Personal Genome Project39 (see Supplementary information S1 (box)). Loci in which one or more replicates contained an SNV were identified. Briefly, SNV loci are known as concordant when all replicate variant calls agree40 and discordant when other replicates differ from the target replicate. Thus, concordant loci represent true-positive variants, and discordant loci signal false-positive variants. See Supplementary information S1 (box) for precise definitions of concordance and discordance, for details on choosing a target replicate and for implementation details.

[Figure 2 schematic. Three steps: sequence and call variants for a target genome (T) and replicate genomes (R1 and R2) against a reference (Ref.); classify variants as concordant or discordant; rank variants to assess sensitivity or specificity and to select an error metric threshold. Plot: fraction of discordant SNVs (x axis, 0 to 1.0) versus fraction of concordant SNVs (y axis), traced with decreasing score stringency, with a random baseline and a threshold marked.]

Figure 2 | Platform-independent method for choosing quality score thresholds. Single-nucleotide variants (SNVs) are called for all replicates and then classified either as concordant if the variant calls agree among the replicates or as discordant if they differ. Variants are then ranked in order by the desired metric (for example, quality scores) and plotted in a graph that is similar to a receiver–operator characteristic curve; that is, the cumulative distributions of concordant and discordant variants are plotted from left to right as the stringency of the confidence score of interest decreases. Ref, reference sequence.

Once discordant variants (potential false positives) and concordant variants (potential true positives) have been separated from each other, metrics of variant call confidence (for example, quality scores and read depth) are used to rank-order the target variants. Using the ranked sets, one can plot the accumulation rate for both concordant and discordant variants with decreasing score stringency in a representation that is similar to a receiver–operator characteristic (ROC) curve (see Supplementary information S2 (box) for methods and source code). Thus, thresholds for variant call quality scores can be chosen to maximize the proportion of all concordant variants that are seen either at or below a particular threshold relative to the proportion of all discordant variants. This analysis (FIG. 3) suggests that, although adequate sequencing read depth across the genome is essential28,29, it is not the best measure of reliability of a specific variant call at a particular locus. Indeed, sequencing read depth at a particular locus is an inferior filter when it is compared with error-model-based quality scores. We found that this holds true for quality scores that are computed by software packages which process genomic35 and expression27 data. Even after removing regions that have abnormally high read depths (that is, regions that are enriched for misalignment errors in low-complexity sequences37), the quality scores that are considered here still outperform read depth as a filter for sequencing errors.

In addition to comparing disparate error-model-based quality scores, this approach can be used to evaluate the effect of varying quality score thresholds for a specific data set of interest. For example, sensitivity of a particular threshold can be evaluated by considering the false-negative rate, as estimated by the number of concordant variants that are lost as a result of applying the threshold.

Post-processing errors in NGS
Even with the use of replicates, some types of errors cannot be addressed without further technological advances and improvements in bioinformatic processing. For example, indels41, paralogues and other repetitive sequences42 often confound NGS short-read alignment43,44, which results in mismapped reads and, ultimately, variant call errors. Other sources of errors can arise from limitations in software and configuration during secondary analysis, including read clipping and filtering45, allelic bias measurement46 and variant call confidence
calculation47. These cannot be addressed with replicates alone.

[Figure 3 plot: fraction of all discordant SNVs (x axis, 0 to 1.0) versus fraction of all concordant SNVs (y axis, 0 to 1.0). Curves: genomic quality scores; expression quality scores; read depth; read depth (depth >120× removed). A random baseline and example thresholds are marked.]

Figure 3 | An example application of plotting replicate scores to assess filter efficiency. The efficiency of different variant call filter metrics can be evaluated by plotting replicate-based single-nucleotide variant (SNV) concordance and discordance in a manner that is similar to a receiver–operator characteristic curve. As one goes from left to right on the plot, the quality score that has been ranked in order is reduced in stringency, and the fractions of retained concordant and discordant variants increase. Thus, this curve quantifies the proportion of reliable data (that is, concordant SNVs) that are retained and the proportion of low-confidence data (that is, discordant SNVs) that are discarded as a consequence of variable quality score cutoffs. For the genomes used in our analysis, this graph indicates that filtering variants solely on the basis of locus read depth is inferior to filtering by genomic35 and expression27 quality scores35. Furthermore, filtering by expression data quality scores is also inferior to filtering by genomic quality scores (which are obtained from Complete Genomics); nevertheless, both of these filters are better than filtering loci by read depth. The read depth curve that excludes outliers (that is, read depth that is higher than the 99.5th-percentile) outperforms the all-inclusive read depth curve. As an example of how to understand the value of a threshold, note that choosing a threshold score of 120 as a measure for the highest quality for the genomic data will include the same fraction of total predicted errors as choosing a threshold quality score of 23,800 for the expression data. Meanwhile, when a similar threshold is chosen for read depth, the efficiency at retaining true variants is worse than that at random. See Supplementary information S2 (box) for a full description of the method.

Erroneous variant calls also arise from incomplete reference data. This error type arises when reads are mapped to unfinished reference genomes and transcriptomes, and to drafts that contain misassembled regions48. These errors will steadily decrease in frequency as reference genome assemblies and annotations such as GRCh37 (REF. 49) and RefSeq50 are completed and corrected with each new build release.

Finally, advances in haplotype phasing hold promise not only for reducing amplification errors6 but also for reducing the causal variation search space. For example, only through accurate haplotype phasing can we begin to discern the difference between two dysfunctional gene copies (that is, a double mutant) and a single normal copy51. This difference can have important implications with regard to phenotype and to clinical applications of sequencing. Unfortunately, current mainstream NGS methods do not consistently discern between these two cases. Thus, ad hoc experimental6,52,53 and computational procedures54,55 are required to distinguish the haplotypes of diploid cells.

Concluding remarks
In the past decades, scientific and technological advances have provided molecular-level resolution for the inner workings of life. NGS technologies are providing insights into genetic disease associations56–62, differences in human gut microbiota63, amino acid essentiality in proteins64, experimental evolution65–67, biotherapeutic development68–72, protein–DNA interactions73, epigenetics74, cancer genomics38,75 and clinical diagnosis76. Efforts to find biologically and clinically relevant variants are steadily improving, as algorithmic advances more intelligently filter the large amount of sequence data. For example, priority can be assigned to variants by considering either heritability or variant association in populations60,77, correcting for gene-specific mutation rates10, accounting for evolutionary conservation78–80 and providing network context through systems biology approaches81–83. Beyond strictly biological applications, sequencing is also becoming an analytical tool for more esoteric questions, such as recording fluctuations in ion concentrations84 and even potentially detecting dark matter in astrophysics85. However, all these sequencing studies rely on the accuracy of the underlying sequencing experiments.

Here, we have identified sources of sequencing errors and presented a method for addressing the stochastic effects. Additional approaches to address other sources of errors, such as experimental bias and software limitations, are also essential. These approaches include identifying erroneous single-nucleotide polymorphisms that show Hardy–Weinberg disequilibrium11, masking poor-quality bases86, phasing and imputing variants in regions that are difficult to sequence or in uncalled regions54, as well as improved methods for calling of structural variants, copy number variations and indels. Together with these computational approaches, the wise use of replicate genome sequencing will have an increasingly important role in reducing the noise in data processing and in downstream analyses.

Kimberly Robasky was previously at the Program in Bioinformatics, Boston University, Massachusetts 02115, USA; the Department of Genetics, Harvard Medical School, and the Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, Massachusetts 02115, USA. Present address: Expression Analysis, a Quintiles Company, Durham, North Carolina 27713, USA.

Nathan E. Lewis was previously at the Department of Genetics, Harvard Medical School, and the Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, Massachusetts 02115, USA; and the Department of Biology, Brigham Young University, Provo, Utah 84602, USA. Present address: Division of Pediatric Pharmacology and Drug Discovery, University of California, San Diego School of Medicine, La Jolla, California 92093, USA.

George M. Church is at the Department of Genetics, Harvard Medical School, and the Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, Massachusetts 02115, USA.

K.R. and N.E.L. contributed equally to this work.
Correspondence to N.E.L.
e‑mail:
doi:10.1038/nrg3655
Published online 10 December 2013
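
The replicate-based threshold analysis described in the main text (FIG. 2 and FIG. 3) can be sketched in a few lines of Python. This is not the authors' implementation, which is given in their Supplementary information and is not reproduced here. It is a minimal sketch under stated assumptions: each replicate is a mapping from locus to (alternate allele, score); a target call counts as concordant only if every other replicate reports the same allele at that locus; and the threshold is chosen, as one possible reading of the text, where the gap between the cumulative concordant and discordant fractions is largest. All names and the toy data are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Locus = Tuple[str, int]   # (chromosome, position)
Call = Tuple[str, float]  # (alternate allele, confidence score)


def classify(target: Dict[Locus, Call],
             replicates: List[Dict[Locus, Call]]) -> List[Tuple[float, bool]]:
    """Pair each target SNV score with True (concordant) or False (discordant).

    Assumption: a missing or different call in any replicate makes the
    target call discordant.
    """
    labelled = []
    for locus, (allele, score) in target.items():
        agrees = all(locus in rep and rep[locus][0] == allele for rep in replicates)
        labelled.append((score, agrees))
    return labelled


def accumulation_curve(labelled: List[Tuple[float, bool]]):
    """Return (score, discordant fraction, concordant fraction) points.

    Scores are visited from most to least stringent, so each point describes
    what is retained when all calls with score >= that value are kept.
    """
    n_conc = sum(1 for _, ok in labelled if ok)
    n_disc = len(labelled) - n_conc
    conc = disc = 0
    points = []
    ranked = sorted(labelled, key=lambda item: item[0], reverse=True)
    for score, group in groupby(ranked, key=lambda item: item[0]):
        for _, ok in group:  # tied scores are added together
            if ok:
                conc += 1
            else:
                disc += 1
        points.append((score,
                       disc / n_disc if n_disc else 0.0,
                       conc / n_conc if n_conc else 0.0))
    return points


def pick_threshold(points) -> float:
    """Score at which concordant fraction minus discordant fraction peaks."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# Toy data: a target replicate and two further replicates (hypothetical).
target = {("chr1", 100): ("G", 90.0), ("chr1", 200): ("T", 80.0),
          ("chr1", 300): ("A", 70.0), ("chr2", 50): ("C", 30.0),
          ("chr2", 75): ("T", 20.0), ("chr3", 10): ("G", 10.0)}
r1 = {("chr1", 100): ("G", 85.0), ("chr1", 200): ("T", 60.0),
      ("chr1", 300): ("A", 75.0), ("chr2", 75): ("T", 25.0)}
r2 = {("chr1", 100): ("G", 88.0), ("chr1", 200): ("T", 70.0),
      ("chr1", 300): ("A", 65.0), ("chr2", 50): ("G", 15.0),
      ("chr2", 75): ("T", 22.0)}

labelled = classify(target, [r1, r2])
points = accumulation_curve(labelled)
for score, frac_disc, frac_conc in points:
    print(f"score >= {score:5.1f}: discordant {frac_disc:.2f}, concordant {frac_conc:.2f}")
print("chosen threshold:", pick_threshold(points))
```

On this toy data the sketch selects a cutoff of 70, which retains three of the four concordant calls and neither discordant call. Following the article's suggestion, the one concordant call lost at that cutoff gives a rough estimate of the false-negative cost of the threshold. Swapping the score for read depth in the same functions would allow different filter metrics to be compared on the same axes.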


