Carmean Douglas Dna Data Storage and Hybrid Molecular PDF

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

DNA Data Storage and Hybrid


Molecular–Electronic
Computing

By D OUGLAS C ARMEAN , L UIS C EZE , Senior Member IEEE, G EORG S EELIG , K ENDALL S TEWART,
K ARIN S TRAUSS , Senior Member IEEE, AND M AX W ILLSEY

ABSTRACT | Moore’s law may be slowing, but our abil- I. I N T R O D U C T I O N


ity to manipulate molecules is improving faster than ever.
DNA could provide alternative substrates for computing and Exponentially growing data pose a significant challenge to
storage as existing ones approach physical limits. In this the landscape of current storage technologies. If we are to
paper, we explore the implications of this trend in computer store and make use of the world’s information, we need
architecture. We present a computer systems perspective on fundamentally denser and cheaper storage technologies.
molecular processing and storage, positing a hybrid molecular– We believe going to the molecular level is inevitable, as
electronic architecture that plays to the strengths of both also observed by Zhirnov et al. [1].
domains. We cover the design and implementation of all stages Synthetic DNA is an attractive storage medium for
of the pipeline: encoding, DNA synthesis, system integration many reasons: its theoretical information density of about
with digital microfluidics, DNA sequencing (including emerging 1018 B/mm3 is 107 times denser than magnetic tape
technologies such as nanopores), and decoding. We first draw (Fig. 1), it can potentially last for thousands of years, and
on our experience designing a DNA-based archival storage it will never go obsolete since we will always be interested
system, which includes the largest demonstration to date in reading DNA for health purposes. The biotechnology
of DNA digital data storage of over three billion nucleotides industry has developed the basic tools to manipulate DNA,
encoding over 400 MB of data. We then propose a more including writing and reading DNA, which can now be
ambitious hybrid–electronic design that uses a molecular form leveraged and improved for digital data storage. Impor-
of near-data processing for massive parallelism. We present a tantly, there is rapid exponential progress in DNA reading
model that demonstrates the feasibility of these systems in the and writing, arguably surpassing Moore’s law [2] (though
near future. We think the time is ripe to consider molecular in the analysis provided in this paper, we chose to model
storage seriously and explore system designs and architectural sequencing and synthesis rates that are achievable today).
implications. Given the current trends in data production and the rapid
progress of DNA manipulation technologies, we believe the
KEYWORDS | Future computer architectures; molecular data
time is ripe to make DNA-based storage and computing
storage and computing.
systems a reality.
In this paper, we articulate a vision toward an
Manuscript received March 19, 2018; revised July 18, 2018; accepted end-to-end system for archival and retrieval, discuss the
September 18, 2018. This work was supported in part by Microsoft, the National
Science Foundation (NSF), and the Defense Advanced Research Projects Agency
challenges in building it, and consider additional applica-
(DARPA) under the Molecular Informatics Program. (Corresponding author: tions once it is built. The key challenge is scaling through-
Luis Ceze.)
put and cost of DNA synthesis and sequencing orders of
D. Carmean and K. Strauss are with Microsoft, Seattle, WA, USA.
L. Ceze, G. Seelig, K. Stewart, and M. Willsey are with the University of magnitude beyond the needs of the life sciences indus-
Washington, Seattle, WA 98195 USA (e-mail: luisceze@cs.washington.edu). try. Others challenges include system integration, fluidics
Digital Object Identifier 10.1109/JPROC.2018.2875386 automation, reliable interfaces between electronics and

0018-9219 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

P ROCEEDINGS OF THE IEEE 1


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

A. DNA Structure
DNA molecules are biopolymers consisting of a sequence
of nucleotides. Each nucleotide can have one of four bases:
adenine (A), cytosine (C), guanine (G), or thymine (T).
A single DNA molecule (also called an oligonucleotide or
oligo for short) consists of a sequence of bases, written
with initials, e.g., AGTATC. The direction is significant: as
normally written, the left end is called the 5 (“five prime”)
end, and the right end is called the 3 (“three prime”)
Fig. 1. Comparing DNA with mainstream storage media. end.
Two oligos can come together to form a double-stranded
duplex, where bases are paired with their complement:
A with T, and C with G. The two oligos in a duplex
wet system components, stable preservation, and random run in opposite directions, therefore a sequence will be
access of data stored in molecular form. fully complementary with its reverse complement. This is
Molecular data storage creates opportunities for illustrated in Fig. 2(a): each base on the 5 end of the upper
near-data processing; for example, pattern matching and strand is paired with a complementary base on the 3 end
search could be performed directly on the molecular of the lower strand.
representation. Adleman [3] noted that DNA’s stable The process of duplex formation is called hybridization.
double-stranded structure comes with a simple computa- Two complementary strands suspended in solution will
tional primitive: matching single-stranded molecules will eventually form a stable structure that requires energy
stochastically “bump into each other” in solution. Adle- to break. Fully complementary sequences will form the
man used this property to compute a solution to the famous double-helix structure [Fig. 2(a)].
Hamiltonian path problem, pioneering the field of DNA Two sequences do not have to be fully complementary
computing. While this area of work has advanced rapidly to hybridize [Fig. 2(b)]. Partially hybridized structures are
over the last two decades, the path to large-scale systems less thermodynamically stable, and occur less frequently
remains unclear. at higher solution temperatures. Given two strands, the
Motivated by progress in DNA data storage, we envision solution temperature at which 50% of the strands have
a hybrid molecular–electronic architecture that combines formed a duplex at equilibrium is called the duplex’s melt-
the strengths of molecular and conventional electronics. ing temperature. A higher melting temperature indicates
This approach takes advantage of DNA as both a storage a more stable duplex. The number of unpaired bases is
medium and computing substrate. It promises to achieve not necessarily related to melting temperature [5]. For
nearly unlimited bandwidth: data and processing units instance, changing the mismatched A on the lower strand
float free in solution, so computation can diffuse through of Fig. 2(b) to a mismatched G raises the melting tem-
data and effectively occur everywhere simultaneously. We perature to 44.6 ◦ C, despite the fact that the number of
call this phenomenon “near-molecule processing.” This unpaired bases remains the same. Melting temperature for
property effectively breaks the fixed capacity/bandwidth a pair of sequences can be calculated precisely by ther-
ratio on typical storage devices in traditional systems, mak- modynamic simulation software such as NUPACK [6]. In
ing it especially promising for data-intensive applications addition to temperature, one can also control hybridization
such as content-based media search. via pH or ionic strength of the solution.
In the remainder of this paper, we provide background
in Section II, discuss hybrid molecular–electronic systems
in general in Section III, and then present our work
on DNA data storage in detail in Section IV. In Section
V, we propose a new hybrid molecular–electronic sys-
tem for image similarity search and model its feasibility.
Finally, we conclude with a discussion of future technology
trends that impact the design space of these systems in
Section VI.

II. B A C K G R O U N D Fig. 2. DNA molecules can form double-stranded duplexes even


when their sequences are not fully complementary, but these
DNA’s potential as a substrate for molecular computation
structures are less stable and thus have lower melting
and storage has been the subject of research for over two
temperatures. (a) A fully hybridized duplex with complementary
decades, dating back to Adleman’s exploration of combi- sequences. Melting temperature 66.5 ◦ C. (b) A partially hybridized
natorial problems [3] and Baum’s proposal for a massive duplex with mismatched sequences (indicated with red arrows).
DNA-based database with associative search capability [4]. Melting temperature 42.5 ◦ C.

2 P ROCEEDINGS OF THE IEEE


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

B. DNA Writing (Synthesis) and Reading


(Sequencing)

DNA synthesis is the process of making arbitrary DNA


molecules from a specification. One of the most established
methods is based on phosphoramidite chemistry due to
Caruthers [7]. The method uses “protected” monomers
(individual nucleotides) to prevent the formation of a long
homopolymer chain. Removing the protecting group is Fig. 3. Hybrid eletronic–molecular architecture. Benefits of
done with an acid solution. The synthesis cycles works by: electronic and molecular components. Different applications may
1) incorporating a chosen nucleotide into an existing poly- better fit the strengths of either domain. The arrows show ways of
mer; 2) strengthening the bond via oxidation; 3) washing getting data from electronic to molecular components and vice
versa.
out excess monomers; 4) deprotecting the last added base;
and 5) repeat. DNA synthesis can be made very parallel via
require heavy signal processing and more precise sensors,
an array-based control of either deposition of the next base
and increasing the density of nanopores on a physical
or localized removal of the protecting group.
substrate, as well as solving problems with clogging and
There are several technologies for DNA synthesis [26].
endurance of pores.
Enzymatic synthesis is a potential alternative to phospho-
ramidite chemistry. In this process, engineered enzymes
C. Brief History of DNA Data Storage
incorporate bases in a controllable fashion without a tem-
plate. This method promises to be cheaper, faster, and The general idea of using DNA as storage of synthetic
cleaner (water based, as opposed to needing to use sol- information has been around since at least the mid-1960s,
vents). Making synthesis scale requires a control mecha- when Norbert Wiener suggested the idea of “genetic”
nism to select which bases to add to which sequences. This memory for computers. In the past six years, work from
is often called array-based synthesis: sequences are seeded Harvard [8] and the European Bioinformatics Institute [9]
on a surface and reagents flow in succession to add bases in showed that progress in modern DNA manipulation meth-
a cyclic fashion. There are several basic technologies, from ods could make it both possible and practical soon. Many
which three are most commonly used: electrochemical and research groups, including groups at ETH Zurich, Uni-
light-based arrays, which selectively deblock sequences versity of Illinois at Urbana-Champaign, and Columbia
and adds the same base to all deblocked sequences simul- University are working on this problem. Our own group at
taneously; and deposition-based arrays that use inkjet to the University of Washington and Microsoft, in collabora-
selectively deposit bases where they are to be added. tion with Twist Bioscience, holds the world record for the
The most commercially adopted DNA sequencing plat- amount of data successfully stored in and retrieved from
form today is based on image processing and the con- DNA: over 400 MB as of June 2018.
cept called sequencing by synthesis. Single-stranded DNA
sequences are attached to a substrate and complementary D. DNA-Based Computation
bases with fluorescent markers are attached one by one The kinetics of DNA hybridization enable more than just
to individual sequences (yet, in parallel for all sequences). a lookup operation. For instance, partial hybridization can
The spatial fluorescence pattern created by the fluorescent implement “fuzzy matching,” where the query and target
markers is captured in an image, which is then processed do not have to be entirely complementary, and the “fuzzi-
and fluorescent spots correlated to individual bases in ness” can be controlled by varying the temperature [5].
the sequences. The fluorescent markers are then chemi- This property can be leveraged to perform distance com-
cally removed, leaving complementary bases behind and putations, which we discuss further in Section III.
setting up the next base in the sequence to be recog- More recently, researchers have shown that
nized. Scaling such technology to higher throughputs will hybridization reactions can form complex cascades
depend on more precise optical setups and improvements called strand displacement reactions, which can be used
in image processing, and once optical resolution limits are to implement general purpose computations, including
reached, this style of sequencing will probably no longer boolean circuits [10] and neural networks [11].
be appropriate. Beyond hybridization, evolution has led to a variety of
Another DNA sequencing solution that has been gaining enzymes for processing DNA, including cutting, joining,
momentum is nanopore technology. The cornerstone of replication, and editing. These enzymes can be used to
nanopore technology is to capture DNA molecules and create even more complex circuits.
force them through a nanoscale pore which causes small
fluctuations in electrical current depending on the passing III. H Y B R I D M O L E C U L A R–E L E C T R O N I C
DNA. The main challenges in using nanopore devices for SYSTEMS
DNA storage are controlling the high error rates resulting A hybrid molecular–electronic system aims to leverage
from sensing these minute current fluctuations, which may the best properties of each domain (Fig. 3). As with any

P ROCEEDINGS OF THE IEEE 3


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

heterogeneous system, the strengths and weaknesses of The cost of getting data into and out of molecular
each domain raise a series of critical design questions. In components is a crucial consideration. The extreme den-
this section, we discuss challenges and tradeoffs pertaining sity and parallelism afforded by the molecular domain
to physical constraints, communication, storage, and com- is of limited use if the interface is a bottleneck. An
putation. We discuss these dimensions in general, and we efficient hybrid system would send a relatively small
provide examples of how they guided design decisions in amount of information to the molecular domain, where
the systems presented in Sections IV and V. lots of work would be done in parallel, and return a rela-
tively small amount of information again to the electronic
A. Physical Constraints domain. In this respect, hybrid molecular–electronic sys-
tems are similar to heterogeneous systems with hardware
Molecular systems are unique because they require the
accelerators.
storage and manipulation of various solutions, including
mixing, splitting, diluting, and incubating them. Architects
must take care to ensure that a system is physically realiz- C. Storage Considerations
able. Adleman’s famous DNA-based algorithm for solving Molecular computation is based on strand interaction,
the Hamiltonian path problem [3] provides a cautionary so having some information already in the molecular
tale: the amount of DNA required grows exponentially with domain would reduce the amount of data that crosses
the graph size. A system is not feasible if modestly sized the interface. Bandwidth into and out of the molecular
problems require swimming pools or oceans of DNA. The domain is limited, so an existing database in molecular
systems presented in this paper, however, demonstrate that form could greatly improve performance at execution time.
some applications require only a small reaction volume Information stored in DNA is dense and long-lived, so this
and are thus feasible. molecular preprocessing could be done out of band with
Since we are trying to build computer systems, physical actual execution.
manipulation also necessitates automation. The various The nature of molecular interactions may lead to
steps of preparing, operating on, and analyzing samples destructive reads of edits of molecular information. We
are typically done by humans in a wetlab. Microfluidic envision getting around this potential issue via periodic
technology could provide the needed automation, but it is molecular amplification like polymerase chain reaction
not yet advanced enough to support a practical computer (PCR) or resynthesis. Reamplification of the entire mole-
system. Some instances of the technology are not flexible cular database could lead to errors accumulating over
enough, and those that are remain error prone and dif- time (e.g., polymerase errors are estimated to be 10−6 to
ficult to program [12]. Furthermore, programming these 10−5 per base). If resorting to resynthesis, it is important
hybrid molecular–electronic systems will require inter- to include only data that were read out and not the
twined control code, sample manipulation, data analysis, entire database, which would possibly oversubscribe the
and conventional computation; these challenges remain to electronic domain with excessive data volumes.
be explored.
D. Computation Considerations
B. Communication Considerations When it comes to computation, our goal is to harness
How to move information between domains is a primary the best of both the electronic and molecular systems.
concern for any heterogeneous system, and it is especially Electronic platforms can be highly general and precise;
important for hybrid molecular–electronic systems, where they can perform a wide variety of operations exactly as
communication can be expensive. specified. No molecular platforms exist today that match
There are many ways to communicate from the elec- the generality and precision of electronic systems, but they
tronic domain to the molecular domain. DNA synthesis may offer orders of magnitude improvements in perfor-
adds new molecules representing data to the system. mance and/or energy efficiency.
Physical manipulation also adds data: the choice of which Computationally, the main benefit of molecular systems
samples to combine determines the behavior of the system. is that certain computations can be performed in a mas-
Changes to the environment (e.g., temperature, humidity) sively parallel fashion. For example, the systems presented
can also control the system by influencing chemical prop- below use hybridization to search for exact and approxi-
erties. mate matches. Since both query and data are in solution
Getting data from the molecular domain back into the and there are multiple copies of both (DNA synthesis
electronic domain varies as well. Some operations may naturally creates multiple copies of the molecules), the
obtain enough data from a simple sensor reading: for search operation is entirely parallel. We refer to this mole-
example, fluorescent markers can indicate the presence cular version of near-data processing as near-molecule
of a particular substance or the occurrence of a reac- processing.
tion. DNA sequencing provides even more information Interestingly, the latencies of these parallel operations
by reconstructing the exact sequence of bases from a do not change with the size of the data set: performing
sample. an operation on a few items takes as long as doing it

4 P ROCEEDINGS OF THE IEEE


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

Fig. 4. Overview of DNA-based data storage.

Fig. 5. Layout of a DNA strand for data storage. The primer


regions in both extremities are used to both enable molecular
amplification and to map molecules to the object [13], [14]. The
over trillions. This “constant-time” performance is offset index region is necessary when reassembling the right order of
by a large overhead; operations could take on the order payloads, since molecular storage does not have fixed 3-D physical

to hours to complete. As such, it may only be profitable to structure across data items.

perform such operations in molecular form when the data


set is above a certain offload break-even size. As with elec-
tronic systems, this break-even point also determines the synthesis technology effectively manufactures molecules
granularity of communication between the two domains. one nucleotide at a time, so it cannot synthesize mole-
cules of arbitrary length without error. Based on current
IV. D N A D ATA S T O R A G E efficiency of synthesis methods and technologies, a rea-
sonably efficient strand length for DNA synthesis is about
A DNA storage system (Fig. 4) takes digital data as input,
150 nucleotides (a couple hundred bits of information).
synthesizes DNA molecules to represent these data, and
The write process therefore splits input data into small
stores them in a physical container or pool. To read data
blocks which correspond to separate DNA sequences.
back, the system selects molecules from the pool, amplifies
Because DNA molecules do not offer spatial organization
them using polymerase chain reaction (a standard molec-
like traditional storage media, we must explicitly include
ular biology protocol), and sequences them back to digital
addressing information in the DNA molecule. Fig. 5 shows
data. One can think of a DNA data storage system as a key-
the layout of an individual DNA strand in our system.
value store, in which input data are associated with a key,
Each strand contains a payload, which is a substring of
and read operations identify the key they wish to recover.
the input data to encode. An address includes both a key
identifier and an index into the input data (to allow data
A. Requirements for End-to-End DNA-Based longer than one strand). At each end of the strand, special
Archival Storage primer sequences [13]–[15]—which correspond to the key
The requirements of a storage system are low read/write identifier according to a hash function—allow for efficient
latency, high throughput (bits per second), random access sequencing during read operations.
and reliability. DNA manipulation latency is significantly Splitting data into smaller strands requires a coding
higher than electronics. However, write and read through- method that provisions information for later reassem-
put (bits per second) can be competitive. This makes bly. Previous work [9] overlapped multiple small blocks,
DNA-based storage a good fit for archival purposes, where but our experimental and simulation results show this
latency is not critical if throughput is high enough. For approach to sacrifice too much density for little gain. Our
example, current archival storage services quote access coding scheme embeds indexing information within each
times in minutes to hours and sometimes service-level block and uses a Reed–Solomon (RS)-based outer coding
agreements (SLAs) specify times in the order of a day. But scheme [16]. Such coding methods provision what we
to be competitive with other commercial systems, a DNA refer to as logical redundancy. Note that DNA synthesis
archival storage system will need to offer throughputs of
about 1 GB/s in a few years.

B. Encoding and Synthesis


Writing to DNA storage involves encoding binary data
as DNA nucleotides and synthesizing the corresponding
molecules. Synthesizing and sequencing DNA is far from
perfect (errors on the order 1% per position), hence
we need a robust error correction scheme. This process
involves two nontrivial steps. First, the trivial encoding
from binary into the four DNA nucleotides (A, C, T, G)
produces problematic sequences such as long stretches of
repeated letters. We avoid that with a rotating code [9] and
randomization using one-time pads [13]. Second, DNA Fig. 6. Overheads as function of strand length.

P ROCEEDINGS OF THE IEEE 5


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

makes many copies of each sequence, and hence also nat- will then have to use this physical redundancy to cope with
urally offers physical redundancy, in the form of multiple errors introduced by synthesis and sequencing.
copies of each sequence (on the order of hundreds of The decoder operates in three basic stages: The first
millions). Overheads in addressing and error correction step is to cluster noisy reads by similarity to collect all
can be amortized with longer strands, but because of available reads that likely correspond to a unique orig-
diminishing returns and higher errors in longer synthesis inally stored DNA sequence. To do so, we employ an
processes, it is not advantageous to go beyond 500–1000 algorithm that leverages the input randomization done
nucleotides (Fig. 6). during encoding. The next step is to process each cluster
to recover the original sequence using a variant of the
C. Random Access bitwise majority alignment (BMA) algorithm [17] adapted
to support insertions, deletions, and substitutions. Finally,
Random access is fundamental because it is not practical
the bits are recovered by using an RS code to correct errors
to have to sift through a vast data archive to retrieve a
and erasures.
desired data item. Our design allows for random access by
We have used an Illumina NextSeq instrument to read
using polymerase chain reaction (PCR). The read process
over 400 MB of encoded data so far. We have re-sequenced
first determines the primers for the given key (analogous
the data several times, which brings the total of digital
to a hash function) and synthesizes them into new DNA
data read from DNA to the equivalent of well over 1 GB.
molecules. Then, rather than applying sequencing to the
Sequencing error rates have been reasonably low, typically
entire pool of stored molecules, we first apply PCR to
below 1%, and has not prevented us from decoding any
the pool using these primers. PCR amplifies the strands
files. The largest commercial nanopore DNA sequencing
in the pool whose primer target sites match the given
device to which we have access contains about 2000
primers, creating many copies of those strands. To recover
nanopores and delivers error rates of about 10%, after
the file, we now take a sample of the resulting pool, which
recent improvements in its chemistry. Despite this high
contains a large number of copies of all the relevant strands
error rate, we have been able to decode a file read with
but only a few other irrelevant strands. Sequencing this
this platform.
sample therefore returns the data for the relevant key
rather than all data in the system. As a side note, while E. Our Results So Far
PCR amplification is not even (i.e., there may be bias) and
Our work so far demonstrates an end-to-end approach
may amplify undesired strands, it is not a problem for DNA
toward the viability of DNA data storage with large-scale
data storage because of underlying error tolerance of the
random access. Although we have only reported on the
encoding/decoding schemes.
initial 35 files and 200 MB of data [13], we have so
While PCR-based random access [13], [14], [16] is a
far encoded, stored, retrieved, and successfully recovered
viable implementation, we do not believe it is practical to
about 40 distinct files totaling over 400 MB of data in
put all data in a single pool. We instead envision a “library”
more than unique 25 million DNA oligonucleotides synthe-
of pools offering spatial isolation. We estimate each pool to
sized by Twist Bioscience (over three billion nucleotides in
contain about 1 TB of data. An address then maps to both
total). Our results represent an advance of more than an
a pool location and a PCR primer. This design is analogous
order of magnitude over prior work. Our data set focused
to a magnetic tape storage library, where robotic arms are
on technologically advanced data formats and historical
used to retrieve tapes. A production DNA-based storage
relevance, including the Universal Declaration of Human
system would require the use of microfluidic automation
Rights in over 100 languages, a high-definition music video
to perform the necessary reactions. Tape libraries offer
of the band OK Go, and a CropTrust database listing seeds
random access by robotic movement of cartridges and
stored in the Svalbard Global Seed Vault.
fast-forwarding to specific tape segments. The equivalent
We demonstrated our random access methodology
in DNA would be physically isolated “containers” with
based on selective PCR amplification, for which we
DNA, along with some form of molecular selection prior
designed and validated a large library of primers, and
to sequencing and decoding. While PCR is the mechanism
randomly accessed arbitrarily chosen items from our whole
we have focused on so far, one can also use magnetic-bead
pool with zero-byte error. Moreover, we developed a novel
based and other DNA random access methods.
coding scheme that dramatically reduces the sequencing
reads per DNA sequence required for error-free decoding
D. Reading and Decoding to about 6x, while maintaining levels of logical redundancy
Reading back the data involves selecting the appropriate comparable to the best prior codes. Finally, we further
pool where the data of interest are stored, retrieving a stress-tested our coding approach by successfully decoding
sample, and sequencing the DNA. No matter the DNA a file using the more error-prone nanopore sequencing.
sequencing method, the result is a large number of reads.
Recall that each unique strand is replicated many times V. N E A R-M O L E C U L E P R O C E S S I N G
in the sequenced sample, so the result will contain many Most computer systems consist of a few processors sur-
reads for each unique DNA sequence. The decode process rounded by memory. To perform computation, the proces-

6 P ROCEEDINGS OF THE IEEE


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

sor must load data from memory, operate on it, and write it
back. Even parallel processors and GPUs still have to load
all of the relevant data before doing computation. As appli-
cations become bandwidth-bound, instead of compute-
bound, researchers have sought to bring compute closer
to the data [18].
In the molecular setting, we can take advantage of
nature to perform massively parallel near-molecule com-
putation. If we can formulate the operation and data
such that the result we want is thermodynamically favor-
able, the operation will diffuse through solution and hap-
pen everywhere simultaneously. Random access through
hybridization and PCR as discussed in the previous section
is an example of this. The query strand “searches” for Fig. 7. Stack diagram for a hybrid and a purely electronic

targets in the entire data set, all at once. However, random content-based image retrieval system. Electronic components are
green; molecular components are pink.
access in electronic systems does not scan the entire data
set, so molecular retrieval does not offer any performance
gain.
phenomenon is popularly known as “the curse of dimen-
Here we explore a more compelling case for near-
sionality.” Real systems overcome this limitation by using
molecular processing. We describe a DNA-based hybrid
approximation schemes that reduce the amount of data to
system for content-based image retrieval which we call
be sifted through, at the cost of potentially missing similar
molecular accelerated similarity search (MASS). MASS
images [21]–[23].
relies on a biomolecular mechanism for “fuzzy matching”
Ultimately, a high-recall CBIR system must examine a
and we have demonstrated it at a small scale [27]. We are
large part of the data set. This provides an opportunity
currently working on scaling it to larger databases. The
for MASS to outperform its purely electronic counterparts
purpose of this section is to, given a new molecular mech-
by using the near-molecular compute afforded by the
anism, discuss how to design a feasible hybrid molecular–
molecular domain.
electronic system. We also explore the practicality of such
a system with a model of its latency, necessary reaction
volume, and scalability. B. Architecture
Much like our DNA data storage system described in
A. Molecular Accelerated Similarity Search Section IV, the database of the MASS system consists
Similarity search is a mechanism for searching a large of DNA strands that associate a primer with some data.
data set for objects similar to some given query. We focus Instead of mapping an address (the primer) to some data,
on a particular instance of this problem, content-based the strands in the MASS database map encoded feature
image retrieval (CBIR), the task of finding images that vectors to the address of the image in some other database.
appear visually similar to a given query image. CBIR So the MASS system deals with the image feature vectors
systems power real-world applications such as Google’s and the addresses of the images instead of the actual image
reverse image search. Fig. 7 shows a stack diagram for data.
our proposed CBIR implementation and a purely electronic Fig. 7 shows the lifetime of a single query in the MASS
one. system. The input to the system is a query feature vector
The first step in building a CBIR system is to extract which is encoded into a string of bases and then synthe-
visual features from each image in the database. Visual sized into (many copies of) a query strand. The query
features are usually real-valued numbers that represent strands are then combined with a small sample of the
the activity of some filter applied to the image. These can database in the reaction vessel. In the reaction, the query
be hand-engineered features like scale-invariant feature strands will partially hybridize with matching targets, per-
transform (SIFT) [19], or learned features such as interme- forming the similarity search with massive parallelism.
diate layer activations from a deep neural network [20]. The matching targets can then be sequenced, yielding the
Pairs of feature vectors can be compared using familiar addresses of the similar images. The images themselves
functions such as Euclidean or cosine distance. To find can then be retrieved from a different database down-
images that are visually similar to a query, the system stream, potentially a DNA data store like ours described
searches for image feature vectors within some distance in Section IV.
of the query’s feature vector. The encoder is a critical part of the system. We discuss
Ordinarily, such searches could be accelerated by parti- some of the challenges and tradeoffs in designing it, and
tioning the data into a tree-like data structure. However, show that it is possible to encode feature vectors into DNA
when feature vectors are high dimensional, partitioning such that their similarity correlates with partial hybridiza-
schemes become no better than a linear search. This tion efficiency elsewhere [26].

P ROCEEDINGS OF THE IEEE 7


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

Because the targets come from a sample of the database,


the reaction has the same concentration (ρrxn ) and number
of unique targets (nrxn ) as the database. The number of
unique targets determines the capacity of the system; the
reaction is effectively searching over nrxn unique image
feature vectors in parallel.
Once the query strands are added to the database sam-
ple, partial hybridization binds the query to similar targets.
These targets can be retrieved with a procedure similar
to the one used in DNA data storage (Section IV). For
example, PCR can amplify strands that hybridized, leaving
Fig. 8. Parameters for the content-based retrieval. Equations the reaction vessel dominated by target image features that
(1)–(3) describe their relationships. were similar (and bound to) the query.
3) Sequencing: Once the reaction vessel is dominated
The following section will describe the protocol for the by similar target strands, we take a sample of the vessel
molecular search in detail. We will also introduce a sim- to avoid unnecessary sequencing. We sample such that we
ple analytical model that relates the important quantities only sequence cseq of each unique strand.
describing the protocol. The model will let us predict the The number of unique strands (nseq ) to be sequenced,
systems latency and physical feasibility. their copy number (cseq ), and their length (lseq ) together
determine the number of bases to be sequenced. This and
C. Modeling a Hybrid Molecular–Electronic the sequencing rate rseq determine the latency, as shown in
System
Fig. 8 shows the parameters that comprise the model. cseq lseq nseq
tseq = . (3)
Parameters marked with syn refer to the synthesized query, rseq
rxn refers to the reaction vessel, and seq refers to the solu-
tion that actually gets sequenced. They will be introduced Note that the amount to be sequenced is not dependent
as they become relevant, but Fig. 8 shows a summary of all on the size of the data set nrxn . Unlike electronic systems,
model parameters and their values in a potential design. whose time-to-solution is proportional to the size of the
data set, the molecular system instead depends on the size
1) Query Synthesis: The protocol starts by synthesizing
of the result. This is the fundamental benefit provided by
many copies of the query strand, which represents the
the near-molecule computation.
encoded feature vector of the query image. DNA synthesis
For massive data sets on the order of trillions of images
makes many copies of a strand at once, so the latency is
or more, the number of images similar to a given query
proportional to the length of the strand, not the number of
could be quite large, so controlling the number of desired
copies. Synthesizing many copies helps ensure that partial
results (those that end up getting sequenced, nseq ) inde-
hybridization happens quickly and allows us to perform
pendently of the data set size nrxn is crucial to maintain
PCR.
good performance. To that end, the temperature of the
To get the latency of synthesis, we model the length of
reaction vessel can be raised or lowered to get more or
the synthesized strand lsyn and the rate of synthesis rsyn .
fewer similar results.
The following equation shows how to calculate the latency
of query synthesis:
D. Model Instantiation
tsyn = lsyn /rsyn . (1) Using the model in Fig. 8, we can derive the latencies
(tsyn and tseq ) and capacity of the system (nrxn ). The
2) Reaction: The reaction vessel initially contains a sam- remaining model parameters are constrained by either the
ple from the database (see Fig. 7). The rxn parameters biomolecular protocol or technology limits.
describe this sample of target strands in the reaction vessel,
1) Protocol Constraints: We choose the reaction con-
not the synthesized query strands.
centration ρrxn = 100 μM, a common concentration for
The number of unique targets in the reaction vessel is
synthetic DNA [24]. We choose the reaction copy number
nrxn , and each unique strand is replicated crxn times. This
crxn = 10.
replication factor crxn is also called the copy number. There
We chose length of the synthesized query lsyn to be
are nrxn crxn total target strands in the reaction. These are
100 bases. We believe that this is sufficient to encode fea-
stored at some concentration ρrxn , which determines the
ture vectors given a dimensionality reduction. The length
volume vrxn . The following equation shows this relation:
of the target strands that get sequenced lseq is 160 bases.
This allocates 100 bases for the encoded feature vector
vrxn ρrxn = nrxn crxn . (2) and 60 bases for the address. At a density of one bit per

8 P ROCEEDINGS OF THE IEEE


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

base [13], 60 bases is sufficient to uniquely address nrxn = 1016 strands represents 1 exabyte of data, so one instance
1e16 images. of MASS has an effective search bandwidth on the order of
100 TB/s. Modeling other systems is outside the scope of
2) Technological Constraints: Sequencing and synthesis
this paper, but we believe that MASS would be competitive
are expected to get exponentially faster, improving at a rate
with or outperform purely electronic systems at this scale.
exceeding Moore’s law [2]. However, we chose to model
sequencing and synthesis rates that are achievable today.
We draw the synthesis rate for our model rsyn from VI. D I S C U S S I O N
recent literature proposing a method to synthesize a Both synthesis and sequencing need to be lower cost and
base every 50 s [25]. Recall that synthesis time in (1) higher throughput than they are today for DNA data stor-
is proportional only to the length of the strand, not the age and computing to succeed. The gap in both dimensions
copy number. Synthesis of a single unique strand is already is orders of magnitude, but it is important to note that,
commercially available on the scale of millimoles, which is when used for data storage, DNA synthesis and sequencing
well above the amount we require. have different requirements than for life sciences. First,
Note that we are assuming the existence of a large when storing data, control over the sequences to be syn-
database of potentially up to ten quadrillion of unique thesized allows for the use of smart error correction to
targets. This is beyond the capability of DNA synthesis tolerate error rates orders of magnitude higher than those
today. Today’s technology can synthesize many unique required for life sciences applications. Second, storage
strands of DNA at once, on the order of millions [13], but applications can tolerate completely missing sequences as
making a database that references ten quadrillion images well as contamination. Third, data storage needs very few
would only become feasible with further advancements. copies of each sequence, compared to the much higher life
sciences requirements. Higher synthesis and sequencing
3) System Capability: Plugging the above constraints
scale implies simultaneously higher throughput and lower
into the model yields a synthesis latency of tsyn of 83 min
costs, so it will be key to a practical, large-scale end-to-end
and a sequencing latency tseq of 2 min. These are, of course,
DNA storage system.
rough estimations due to the coarse granularity of our
model. The partial hybridization and PCR reactions would
take on the order of hours. The bottlenecks are clearly DNA Acknowledgments
synthesis and the reactions, not sequencing. The authors would like to thank the anonymous review-
If we plug in a data set size (equal to the number of ers for their helpful feedback on the manuscript, and the
unique strands in the reaction nrxn ) of 1016 , the model Molecular Information Systems Lab (MISL) members for
shows we only require a reaction volume of 1.7 mL. input on the research and feedback on how to present this
Assuming a feature vector size of 100 bytes, a data set of work.

REFERENCES
[1] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, nanotechnology using strand-displacement scale-invariant keypoints,” Int. J. Comput. Vis.,
G. M. Church, and W. L. Hughes, “Nucleic acid reactions,” Nature Chem., vol. 3, no. 2, vol. 60, no. 2, pp. 91–110, Nov. 2004.
memory,” Nature Mater., vol. 15, no. 4, pp. 103–113, Feb. 2011. [20] J. Wan et al., “Deep learning for content-based
pp. 366–370, 2016. [11] L. Qian, E. Winfree, and J. Bruck, “Neural network image retrieval: A comprehensive study,” in Proc.
[2] R. Carlson. (2014). “Time for new DNA synthesis computation with DNA strand displacement 22nd ACM Int. Conf. Multimedia (MM), 2014,
and sequencing cost curves.” [Online]. Available: cascades,” Nature, vol. 475, no. 7356, pp. 368–372, pp. 157–166, doi: 10.1145/2647868.2654948.
https://synbiobeta.com/time-new-dna-synthesis- Jul. 2011. [21] P. Indyk and R. Motwani, “Approximate nearest
sequencing-cost-curves-rob-carlson [12] N. Blow, “Microfluidics: The great divide,” Nature neighbors: Towards removing the curse of
[3] L. M. Adleman, “Molecular computation of Methods, vol. 6, no. 9, pp. 683–686, 2009. dimensionality,” in Proc. 13th Annu. ACM Symp.
solutions to combinatorial problems,” Science, [13] L. Organick et al., “Random access in large-scale Theory Comput. (STOC), 1998, pp. 604–613, doi:
vol. 266, no. 5187, pp. 1021–1024, Nov. 1994. DNA data storage,” Nature Biotechnol., vol. 36, doic 10.1145/276698.276876.
[4] E. B. Baum, “Building an associative memory vastly pp. 242–248, Jul. 2018. [22] A. Andoni and P. Indyk, “Near-optimal hashing
larger than the brain,” Science, vol. 268, no. 5210, [14] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, algorithms for approximate nearest neighbor in
pp. 583–585, Apr. 1995. G. Seelig, and K. Strauss, “A DNA-based archival high dimensions,” Commun. ACM, vol. 51, no. 1,
[5] S. A. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, and storage system,” in Proc. 21st Int. Conf. Archit. pp. 117–122, Jan. 2008.
T. E. Papoutsakis, “DNA-based matching of digital Support Program. Lang. Oper. Syst. (ASPLOS), [23] M. Muja and D. G. Lowe, “Fast approximate nearest
signals,” in Proc. IEEE Int. Conf. Acoust. Speech Mar. 2016, pp. 637–649. neighbors with automatic algorithm configuration,”
Signal Process., May 2004, pp. V-581–V-584. [15] S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and in Proc. VISAPP Int. Conf. Comput. Vis. Theory Appl.,
[6] J. N. Zadeh et al., “NUPACK: Analysis and design of O. Milenkovic, “A rewritable, random-access 2009, p. 2.
nucleic acid systems,” J. Comput. Chem., vol. 32, DNA-based storage system,” Sci. Rep., vol. 5, no. 1, [24] Integrated DNA Technologies. Accessed: Nov. 8,
no. 1, pp. 170–173, Jan. 2011. p. 1763, Sep. 2015. 2017. [Online]. Available: https://www.idtdna.com
[7] M. H. Caruthers, “The chemical synthesis of [16] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and [25] M. Sack et al., “Express photolithographic DNA
DNA/RNA: Our gift to science,” J. Biol. Chem., W. J. Stark, “Robust chemical preservation of digital microarray synthesis with optimized chemistry and
vol. 288, no. 2, pp. 1420–1427, 2013. [Online]. information on DNA in silica with error-correcting high-efficiency photolabile groups,”
Available: http://www.jbc.org/content/ codes,” Angew. Chem. Int. Ed., vol. 54, no. 8, J. Nanobiotechnol., vol. 14, no. 1, p. 14, Mar. 2016,
288/2/1420.full.pdf pp. 2552–2555, Feb. 2015. doi: 10.1186/s12951-016-0166-0.
[8] G. M. Church, Y. Gao, and S. Kosuri, [17] A. Batu, T. Kannan, S. Khanna, and S. McGregor, [26] S. Kosuri and G. M. Church, “Large-scale de novo
“Next-generation digital information storage in “Reconstructing strings from random traces,” in DNA synthesis: Technologies and applications,”
DNA,” Science, vol. 337, no. 6102, p. 1628, Proc. 15th Annu. ACM-SIAM Symp. Discrete Nat. Methods, vol. 11, pp. 499–507,
Sep. 2012. Algorithms (SODA), 2004, pp. 910–918. 2014.
[9] N. Goldman et al., “Towards practical, [18] R. Balasubramonian and B. Grot, “Near-data [27] K. Stewart, Y.-J. Chen, D. Ward, X. Liu, G. Seelig,
high-capacity, low-maintenance information processing [guest editors’ introduction],” K. Strauss, and L. Ceze, “A content-addressable
storage in synthesized DNA,” Nature, vol. 494, IEEE Micro, vol. 36, no. 1, pp. 4–5, DNA database with learned sequence encodings,”
no. 7435, pp. 77–80, Jan. 2013. Jan. 2016. in Proc. 24th Int. Conf. DNA Comput. Mol. Program.,
[10] D. Y. Zhang and G. Seelig, “Dynamic DNA [19] D. G. Lowe, “Distinctive image features from Oct. 2018.

P ROCEEDINGS OF THE IEEE 9


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Carmean et al.: DNA Data Storage and Hybrid Molecular–Electronic Computing

ABOUT THE AUTHORS

Douglas Carmean received the B.S. Kendall Stewart is currently working


degree in electrical and electronics engi- toward the Ph.D. degree at the University of
neering from Oregon State University, Cor- Washington, Seattle, WA, USA.
vallis, OR, USA. Her focus is on designing practical
Currently, he is a Distinguished Engineer systems that leverage unique properties
at Microsoft, Seattle, WA, USA. His current of unconventional devices. She is cur-
work explores new architectures on futures rently working on architectures for high-
device technology. throughput parallel computing using syn-
thetic DNA.

Luis Ceze (Senior Member, IEEE) received


Karin Strauss (Senior Member, IEEE)
the B.Eng. and M.Eng. degrees from the
received the Ph.D. degree from the Depart-
Universidade de São Paulo (USP), São Paulo,
ment of Computer Science, University of Illi-
Brazil and the Ph.D. degree in computer
nois at Urbana-Champaign, Urbana, IL, USA,
science from the University of Illinois at
in 2007.
Urbana-Champaign (UIUC), Urbana, IL, USA.
Currently, she is a Principal Researcher at
Currently, he is a Professor at the Paul
Microsoft, Seattle, WA, USA and an Affili-
G. Allen School of Computer Science and
ate Professor in the Allen School for Com-
Engineering, University of Washington, Seat-
puter Science and Engineering, University of
tle, WA, USA. His research focuses on the intersection between
Washington, Seattle, WA, USA. Her research lies at the intersection
computer architecture, programming languages, machine learning,
of computer architecture, systems, and biology. Lately, her focus
and biology. His current focus is on approximate computing for effi-
has been on DNA data storage. In the past, she has studied
cient machine learning and DNA-based data storage. He codirects
other emerging memory technologies and hardware accelerators
the Molecular Information Systems Lab (MISL) and the Systems
for machine learning, among others. Previously, she worked for
Architectures and Programming Languages for Machine Learning
AMD Research.
lab (SAMPL).
Dr. Strauss is a Senior Member of the Association for Computing
Prof. Ceze is a Senior Member of the Association for Computing
Machinery (ACM).
Machinery (ACM).

Georg Seelig received the Ph.D. degree Max Willsey received the B.S. degree
in physics from the University of Geneva, in computer science from Carnegie Mellon
Geneva, Switzerland. University, Pittsburgh, PA, USA, in 2016.
He did his postdoctoral work in synthetic Currently, he is working toward the Ph.D.
biology and DNA nanotechnology at Cal- degree at the Paul G. Allen School for Com-
ifornia Institute of Technology (Caltech), puter Science & Engineering, University of
Pasadena, CA, USA. Currently, he is an Asso- Washington, Seattle, WA, USA.
ciate Professor in the Paul G. Allen School He is a recipient of a Qualcomm Innovation
of Computer Science & Engineering and the Fellowship and a National Science Founda-
Department of Electrical and Computer Engineering, University of tion (NSF) Graduate Research Fellowship honorable mention. His
Washington, Seattle, WA, USA. He is also an Adjunct Associate research interests are in programming languages and computer
Professor in Bioengineering. architecture with applications in biology.

10 P ROCEEDINGS OF THE IEEE

You might also like