Professional Documents
Culture Documents
Amyris - Low Cost High Thruput Sequencing of DNA Assemblies Using A Highly Multiplexed Nextera Process
Amyris - Low Cost High Thruput Sequencing of DNA Assemblies Using A Highly Multiplexed Nextera Process
pubs.acs.org/synthbio
demonstrate the simultaneous sequencing of over 4000 mononucleotide runs longer than two bases or GC content
plasmids with an average size of 8 kb (largest ∼20 kb) at a outside the range of 35−65% were removed. Sequences
consumables cost of less than $3 per plasmid. Here, we focus differing by at least three bases from all other barcodes in the
on a description of the method and briefly discuss how such set or from sequences complementary to all 8-base sequences
data is being used to optimize the DNA assembly process. A present within the constant regions of the i5 and i7 adapter
comprehensive analysis of the data vis-à-vis different assembly primers (see Illumina Customer Sequence Letter18) were then
methods will be the topic of a separate publication. selected. These ∼800 sequences were placed into the context of
B DOI: 10.1021/sb500362n
ACS Synth. Biol. XXXX, XXX, XXX−XXX
ACS Synthetic Biology Technical Note
Figure 4. Example sequence data plots for samples from the run of 4078 samples described here. The top two show samples with differences
between the reads and the reference, while the bottom two show samples that match the reference perfectly (not counting the vector). The green
region shows the depth of coverage. Red and blue bars indicate a SNP in the forward and reverse reads, respectively. Purple and yellow bars indicate
an indel in the forward and reverse reads, respectively. Note that even with less than 15× average coverage (bottom right) it is sometimes possible to
obtain reliable QC data. At the bottom of each plot are the DNA parts in green and the vector in yellow.
replicates of each are sequenced), achieving only 2× coverage Given the small size of the plasmids that we were sequencing
by Sanger sequencing of this same set of DNA assemblies (compared to genomes), we decided to implement our own
would be about 20-fold more expensive and would include the simple feature detection method based on the pileup file.
need to order and track many primers to distribute the reads Software was written in F# (http://fsharp.org) to call
across the assemblies appropriately. mutations and assign severity scores to features based on
Analyzing the NGS QC Data. Aligning reads to a digital their sequence context (e.g., part type and the probability that
reference and choosing the best replicate of an assembly is they could impair function). The software ranks the replicates
conceptually simple but requires rapid, parallel analysis of many of each assembly based on the number of mutations and their
data sets. W e initially tested SAMTOOLS and severity and reports which replicate best matches the digital
BCFTOOLS29,30 to identify features (SNPs and indels), but template. In addition, the software stores all sequence variants
we were unable to find appropriate settings to reliably call all found, along with other relevant information, in a postgreSQL
mutations found in the plasmids. A possible cause for this could database. Finally, the software generates a graphic for each
be the high read coverage seen in some samples (approaching sample (Figure 4) showing coverage and variant calls, which
1000×), which may hinder some part of the mutation calling facilitates the investigation of specific cases when the
algorithm. We were reluctant to subsample the sequencing data algorithmic decision is in question. The uneven coverage in
in these cases, as this reduces resolution of SNP frequency and these examples is mostly due to Poisson sampling during the
complicates base calling in regions of low coverage. Another sequencing process. Some of the uneven coverage might also be
possible cause is that our DNA samples may be mixed due to bias for or against certain sequence motifs by either the
populations that do not resemble the diploid genomic samples transposome23 or the polymerase used for the PCR.24 On the
against which these algorithms and tool sets were developed. other hand, it might also be an indication of sequence
For example, a SNP at 10% frequency does not match a discrepancies that should be more closely investigated.
heterozygous or homozygous situation. Interestingly, we found In the run with 4078 samples described here, 4056 were four
that the features were identified correctly at the level of read replicates of 1014 constructs assembled by yeast homologous
alignment but sometimes missed by the calling algorithms. recombination. The remaining 22 samples were internal
D DOI: 10.1021/sb500362n
ACS Synth. Biol. XXXX, XXX, XXX−XXX
ACS Synthetic Biology Technical Note
process controls, which were not used for data analysis. Table 1 CA). Fluorescence was read on an M5 plate reader (Molecular
shows the statistics for the sequence differences between the Devices, LLC, Sunnyvale, CA). DNA fragment size profiles
were determined using either a Bioanalyzer 2100 (Agilent
Table 1. Sequence Difference Statistics for the Four Technologies, Inc., Santa Clara, CA) or a Fragment Analyzer
Replicates of 1014 Assemblies Assembled by Yeast (Advanced Analytical Technologies, Inc., Ames, IA).
Homologous Recombination DNA Assembly and Quantitation. DNA parts with
specific linker sequences at each end were assembled in a
percent of 4056 samples or
statistic 1014 constructs shuttle vector using yeast homologous recombination, followed
by shuttling into Escherichia coli for isolation of DNA, as
samples exactly matching the reference 54%
previously described.8 DNA assemblies built using the ligase
samples with only one SNP or one indel 23%
cycling reaction (LCR)27 were also used in some experiments.
samples with more than one SNP or indel 16%
Plasmid DNA was prepared by alkaline lysis and silica gel
samples misassembled (zero coverage for 5.8%
>200 bp) binding8 or was amplified using an Illustra Templiphi kit (GE
constructs having at least one replicate 73% Healthcare Life Sciences, Piscataway, NJ). DNA concentration
matching reference was measured using Quant-iT PicoGreen reagent (Life
constructs having at least one replicate 99% Technologies, Foster City, CA) in Costar 3658 or 3677 black
correctly assembled 384-well plates (Corning, Inc., Corning, NY). The PicoGreen
reagent was diluted with TE (10 mM Tris-HCl, pH 8, 0.5 mM
samples and the digital reference sequences. The importance of EDTA) containing 0.05% Tween 20.
replicates is highlighted by the fact that, although 5.8% of the Preparing Nextera Libraries for Sequencing. Figure 5
samples were misassembled, only 1% of the constructs had no depicts the chronological workflow for the highly multiplexed
correctly assembled replicate.
When a SNP or indel is present in only one replicate of a
construct, this is likely due to errors in the primers or errors by
the polymerase during PCR amplification of parts. Alter-
natively, errors may arise during RCA for MiSeq sample prep.
The frequency of this type of mutation appears to be consistent
with the known fidelity of the polymerases25 or with the
reported frequency of errors in oligonucleotide primers.26
Many indels were located at homopolymers, which are known
to be susceptible to contraction during replication and are also
prone to sequencing artifacts even on the Illumina platform.
When the same SNPs or indels are present in all four replicates,
or in the same part in different constructs, they are most likely Figure 5. Diagram of the workflow for the plasmid QC process. The
due to errors in either the digital reference sequence (i.e., data type of robot used at each step is indicated in parentheses.
entry) or the template used for PCR amplification of the part.
Several errors were due to the use of a physical part for the plasmid sequencing protocol described here. Using the reagents
PCR template that was not the same as the part specified in the in an Illumina Nextera kit (FC-121-1031), the tagmentation
digital request. The frequency of this type of mutation was reaction volume was reduced from 50 μL, as specified in the kit
higher than anticipated, and we can reduce it. Since the run protocol, to 5 μL for the Biomek robots (2 μL of DNA solution
with 4078 samples described here, we have used this NGS QC and 3 μL of tagmentation master mix containing 0.5 μL
process in more than 10 assembly cycles, thus accumulating a tagmentation enzyme and 25 μL tagmentation buffer) or 0.5 μL
large amount of NGS QC data. We intend to publish a (200 nL DNA and 300 nL of tagmentation master mix) for the
comprehensive analysis of this data in the future and to identify Echo. Rolling circle amplified (RCA) DNA or plasmid DNA
how the assembly process generates the different types of prepared by alkaline lysis was diluted with TE to achieve the
mutations. desired concentration (2.5−10 ng/μL; see Results and
Applications of This Process. Plasmid DNA has been the Discussion). The transposase was dissociated from the
workhorse of molecular biology for 4 decades. NGS protocols, tagmented DNA by adding SDS (sodium dodecyl sulfate) to
however, have not yet enabled the cost-effective sequencing of a final concentration of 0.1% (e.g., 125 nL of 0.5% SDS added
these small bits of DNA. The method described herein bridges to 0.5 μL tagmented DNA).
the power of NGS to the plasmid libraries used in gene Adapters for the Illumina sequencing process, including 8-
synthesis, DNA assembly, enzyme engineering, amplicon base barcodes, were attached to each tagmented DNA sample
sequencing, and library deconvolution, to name a few. These using 12 cycles of PCR. All primers were obtained from IDT
are common research areas in synthetic biology, and we expect (Integrated DNA Technologies, Inc., Coralville, IA) with
the community to benefit from adoption of the NGS QC standard desalting. The barcodes inserted into the Illumina i5
process described here.
■
and i7 adapter primer sequences (see Illumina Customer
Sequence Letter18) are listed in Table S1. Using the Echo, each
METHODS sample well received 125 nL of a forward barcode primer and
Instrumentation. Liquid transfers were carried out on 125 nL of a reverse barcode primer (each at 100 μM). A PCR
Biomek FX or NX robots (Beckman Coulter, Brea, CA) for master mix (24.5 μL) was then added using a Biomek robot.
volumes greater than 2 μL or on an Echo 550 plus Access robot The master mix contained 0.2 units/μL of Vent DNA
(Labcyte, Sunnyvale, CA) for volumes less than 2 μL. polymerase (New England Biolabs, Ipswich, MA), 1×
Sequencing was done on a MiSeq (Illumina, Inc., San Diego, Thermopol buffer (NEB), 2 mM MgSO4, 200 μM of each
E DOI: 10.1021/sb500362n
ACS Synth. Biol. XXXX, XXX, XXX−XXX
ACS Synthetic Biology Technical Note
■
Millipore, Billerica, MA). The DNA concentration and average
fragment size of the pool were determined by Picogreen
fluorescence and a high-sensitivity DNA chip on a Bioanalyzer ACKNOWLEDGMENTS
2100, respectively. After diluting the filtered pool to 1.11 nM This work was funded by Defense Advanced Research Projects
with water, 18 μL was denatured by adding 2 μL of 1 N NaOH. Agency (DARPA) Living Foundries grant HR001-12-3-0006.
After 5 min at room temperature, 980 μL of ice-cold Illumina We wish to thank Don Misumi for maintaining the Biomek
hybridization buffer was added, followed by 2 μL of 1 N HCl. robots and modifying the scripts for our needs. We also thank
The denatured pool was loaded on the MiSeq at 12 pM, which Daniel Hollis, Nalien Vongtharangsy, and other members of the
was empirically determined to give the optimum cluster density Amyris Automated Strain Engineering (ASE) team for helping
when following this protocol. with automation and getting the NGS QC process into the ASE
workflow.
■
Sequence Data Processing. A web-based sequencing
tracking system was created to manage the many samples and
the large amounts of data generated. It facilitates the creation of REFERENCES
runs, generation of sample sheets required by the MiSeq, and (1) Weenink, T., and Ellis, T. (2013) Creation and characterization
analysis of multiple data types, including the NGS QC data of component libraries for synthetic biology. Methods Mol. Biol. 1073,
51−60.
described here. Reads were demultiplexed using the embedded (2) Polizzi, K. M. (2013) What is synthetic biology? Methods Mol.
MiSeq Reporter software. For large numbers of multiplexed Biol. 1073, 3−6.
samples (>1000), it was necessary to increase the File Copy (3) Munnelly, K. (2013) Engineering for the 21st century: synthetic
Timeout setting to avoid premature interruption of the biology. ACS Synth. Biol. 2, 213−215.
demultiplexing process, which takes several extra hours after a (4) Stephanopoulos, G. (2012) Synthetic biology and metabolic
highly multiplexed run appears to have completed. When a engineering. ACS Synth. Biol. 1, 514−525.
(5) Gardner, T. S., Hawkins, K. M., Meadows, A. L., Tsong, A. E., and
sequencing run completes, the system automatically retrieves Tsegaye, Y. (2014) Production of Acetyl-Coenzyme a Derived Isoprenoids,
the FASTQ files from the MiSeqOutput folder. Read mapping Patent US 8415136 B1.
to the intended assembly sequences uses BWA v0.6.232 and the (6) Du, Y. L., Williams, D. E., Patrick, B. O., Andersen, R. J., and
sampe method with default settings.28 Alignments are stored in Ryan, K. S. (2014) Reconstruction of cladoniamide biosynthesis
BAM file format using SAMTOOLS v0.1.19.29,30 Mapping reveals nonenzymatic routes to bisindole diversity. ACS Chem. Biol. 9,
statistics are obtained using the SAMTOOLS flagstat utility. A 2748−2754.
(7) Ajikumar, P. K., Xiao, W. H., Tyo, K. E., Wang, Y., Simeon, F.,
pileup file is generated using SAMTOOLS mpileup with default Leonard, E., Mucha, O., Phon, T. H., Pfeifer, B., and Stephanopoulos,
options to obtain read coverage along the reference sequence. G. (2010) Isoprenoid pathway optimization for Taxol precursor
■
overproduction in Escherichia coli. Science 330, 70−74.
ASSOCIATED CONTENT (8) Dharmadi, Y., Patel, K., Shapland, E., Hollis, D., Slaby, T.,
Klinkner, N., Dean, J., and Chandran, S. S. (2014) High-throughput,
*
S Supporting Information cost-effective verification of structural DNA assembly. Nucleic Acids
Figure S1: Identifying the optimum SDS and Triton X-100 Res. 42, e22.
concentrations for removal of the transposase after tagmenta- (9) Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G.
P., Milton, J., Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L.,
tion. Figure S2: PCR efficiency using Vent polymerase and
Bignell, H. R., et al. (2008) Accurate whole human genome
primers ordered from IDT or the Nextera kit reagents NPM sequencing using reversible terminator chemistry. Nature 456, 53−59.
and PPC. Figure S3: Transfer of RCA DNA by the Echo. (10) Perkins, T. T., Tay, C. Y., Thirriot, F., and Marshall, B. (2013)
Figure S4: Correlation of read coverage comparing two Choosing a benchtop sequencing machine to characterise Helicobacter
separate MiSeq runs of the same plasmids prepared for pylori genomes. PLoS One 8, e67539.
sequencing by the protocol described here. Table S1: Barcodes (11) Loman, N. J., Misra, R. V., Dallman, T. J., Constantinidou, C.,
Gharbia, S. E., Wain, J., and Pallen, M. J. (2012) Performance
used in this work. Table S2: Consumables costs per sample
comparison of benchtop high-throughput sequencing platforms. Nat.
when 4000 samples are sequenced in parallel. The Supporting Biotechnol. 30, 434−439.
Information is available free of charge on the ACS Publications (12) Caruccio, N. (2011) Preparation of next-generation sequencing
website at DOI: 10.1021/sb500362n. libraries using Nextera technology: simultaneous DNA fragmentation
F DOI: 10.1021/sb500362n
ACS Synth. Biol. XXXX, XXX, XXX−XXX
ACS Synthetic Biology Technical Note
G DOI: 10.1021/sb500362n
ACS Synth. Biol. XXXX, XXX, XXX−XXX