Professional Documents
Culture Documents
Carmean Douglas Dna Data Storage and Hybrid Molecular PDF
Carmean Douglas Dna Data Storage and Hybrid Molecular PDF
Carmean Douglas Dna Data Storage and Hybrid Molecular PDF
By D OUGLAS C ARMEAN , L UIS C EZE , Senior Member IEEE, G EORG S EELIG , K ENDALL S TEWART,
K ARIN S TRAUSS , Senior Member IEEE, AND M AX W ILLSEY
0018-9219 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A. DNA Structure
DNA molecules are biopolymers consisting of a sequence
of nucleotides. Each nucleotide can have one of four bases:
adenine (A), cytosine (C), guanine (G), or thymine (T).
A single DNA molecule (also called an oligonucleotide or
oligo for short) consists of a sequence of bases, written
with initials, e.g., AGTATC. The direction is significant: as
normally written, the left end is called the 5 (“five prime”)
end, and the right end is called the 3 (“three prime”)
Fig. 1. Comparing DNA with mainstream storage media. end.
Two oligos can come together to form a double-stranded
duplex, where bases are paired with their complement:
A with T, and C with G. The two oligos in a duplex
wet system components, stable preservation, and random run in opposite directions, therefore a sequence will be
access of data stored in molecular form. fully complementary with its reverse complement. This is
Molecular data storage creates opportunities for illustrated in Fig. 2(a): each base on the 5 end of the upper
near-data processing; for example, pattern matching and strand is paired with a complementary base on the 3 end
search could be performed directly on the molecular of the lower strand.
representation. Adleman [3] noted that DNA’s stable The process of duplex formation is called hybridization.
double-stranded structure comes with a simple computa- Two complementary strands suspended in solution will
tional primitive: matching single-stranded molecules will eventually form a stable structure that requires energy
stochastically “bump into each other” in solution. Adle- to break. Fully complementary sequences will form the
man used this property to compute a solution to the famous double-helix structure [Fig. 2(a)].
Hamiltonian path problem, pioneering the field of DNA Two sequences do not have to be fully complementary
computing. While this area of work has advanced rapidly to hybridize [Fig. 2(b)]. Partially hybridized structures are
over the last two decades, the path to large-scale systems less thermodynamically stable, and occur less frequently
remains unclear. at higher solution temperatures. Given two strands, the
Motivated by progress in DNA data storage, we envision solution temperature at which 50% of the strands have
a hybrid molecular–electronic architecture that combines formed a duplex at equilibrium is called the duplex’s melt-
the strengths of molecular and conventional electronics. ing temperature. A higher melting temperature indicates
This approach takes advantage of DNA as both a storage a more stable duplex. The number of unpaired bases is
medium and computing substrate. It promises to achieve not necessarily related to melting temperature [5]. For
nearly unlimited bandwidth: data and processing units instance, changing the mismatched A on the lower strand
float free in solution, so computation can diffuse through of Fig. 2(b) to a mismatched G raises the melting tem-
data and effectively occur everywhere simultaneously. We perature to 44.6 ◦ C, despite the fact that the number of
call this phenomenon “near-molecule processing.” This unpaired bases remains the same. Melting temperature for
property effectively breaks the fixed capacity/bandwidth a pair of sequences can be calculated precisely by ther-
ratio on typical storage devices in traditional systems, mak- modynamic simulation software such as NUPACK [6]. In
ing it especially promising for data-intensive applications addition to temperature, one can also control hybridization
such as content-based media search. via pH or ionic strength of the solution.
In the remainder of this paper, we provide background
in Section II, discuss hybrid molecular–electronic systems
in general in Section III, and then present our work
on DNA data storage in detail in Section IV. In Section
V, we propose a new hybrid molecular–electronic sys-
tem for image similarity search and model its feasibility.
Finally, we conclude with a discussion of future technology
trends that impact the design space of these systems in
Section VI.
heterogeneous system, the strengths and weaknesses of The cost of getting data into and out of molecular
each domain raise a series of critical design questions. In components is a crucial consideration. The extreme den-
this section, we discuss challenges and tradeoffs pertaining sity and parallelism afforded by the molecular domain
to physical constraints, communication, storage, and com- is of limited use if the interface is a bottleneck. An
putation. We discuss these dimensions in general, and we efficient hybrid system would send a relatively small
provide examples of how they guided design decisions in amount of information to the molecular domain, where
the systems presented in Sections IV and V. lots of work would be done in parallel, and return a rela-
tively small amount of information again to the electronic
A. Physical Constraints domain. In this respect, hybrid molecular–electronic sys-
tems are similar to heterogeneous systems with hardware
Molecular systems are unique because they require the
accelerators.
storage and manipulation of various solutions, including
mixing, splitting, diluting, and incubating them. Architects
must take care to ensure that a system is physically realiz- C. Storage Considerations
able. Adleman’s famous DNA-based algorithm for solving Molecular computation is based on strand interaction,
the Hamiltonian path problem [3] provides a cautionary so having some information already in the molecular
tale: the amount of DNA required grows exponentially with domain would reduce the amount of data that crosses
the graph size. A system is not feasible if modestly sized the interface. Bandwidth into and out of the molecular
problems require swimming pools or oceans of DNA. The domain is limited, so an existing database in molecular
systems presented in this paper, however, demonstrate that form could greatly improve performance at execution time.
some applications require only a small reaction volume Information stored in DNA is dense and long-lived, so this
and are thus feasible. molecular preprocessing could be done out of band with
Since we are trying to build computer systems, physical actual execution.
manipulation also necessitates automation. The various The nature of molecular interactions may lead to
steps of preparing, operating on, and analyzing samples destructive reads of edits of molecular information. We
are typically done by humans in a wetlab. Microfluidic envision getting around this potential issue via periodic
technology could provide the needed automation, but it is molecular amplification like polymerase chain reaction
not yet advanced enough to support a practical computer (PCR) or resynthesis. Reamplification of the entire mole-
system. Some instances of the technology are not flexible cular database could lead to errors accumulating over
enough, and those that are remain error prone and dif- time (e.g., polymerase errors are estimated to be 10−6 to
ficult to program [12]. Furthermore, programming these 10−5 per base). If resorting to resynthesis, it is important
hybrid molecular–electronic systems will require inter- to include only data that were read out and not the
twined control code, sample manipulation, data analysis, entire database, which would possibly oversubscribe the
and conventional computation; these challenges remain to electronic domain with excessive data volumes.
be explored.
D. Computation Considerations
B. Communication Considerations When it comes to computation, our goal is to harness
How to move information between domains is a primary the best of both the electronic and molecular systems.
concern for any heterogeneous system, and it is especially Electronic platforms can be highly general and precise;
important for hybrid molecular–electronic systems, where they can perform a wide variety of operations exactly as
communication can be expensive. specified. No molecular platforms exist today that match
There are many ways to communicate from the elec- the generality and precision of electronic systems, but they
tronic domain to the molecular domain. DNA synthesis may offer orders of magnitude improvements in perfor-
adds new molecules representing data to the system. mance and/or energy efficiency.
Physical manipulation also adds data: the choice of which Computationally, the main benefit of molecular systems
samples to combine determines the behavior of the system. is that certain computations can be performed in a mas-
Changes to the environment (e.g., temperature, humidity) sively parallel fashion. For example, the systems presented
can also control the system by influencing chemical prop- below use hybridization to search for exact and approxi-
erties. mate matches. Since both query and data are in solution
Getting data from the molecular domain back into the and there are multiple copies of both (DNA synthesis
electronic domain varies as well. Some operations may naturally creates multiple copies of the molecules), the
obtain enough data from a simple sensor reading: for search operation is entirely parallel. We refer to this mole-
example, fluorescent markers can indicate the presence cular version of near-data processing as near-molecule
of a particular substance or the occurrence of a reac- processing.
tion. DNA sequencing provides even more information Interestingly, the latencies of these parallel operations
by reconstructing the exact sequence of bases from a do not change with the size of the data set: performing
sample. an operation on a few items takes as long as doing it
to hours to complete. As such, it may only be profitable to structure across data items.
makes many copies of each sequence, and hence also nat- will then have to use this physical redundancy to cope with
urally offers physical redundancy, in the form of multiple errors introduced by synthesis and sequencing.
copies of each sequence (on the order of hundreds of The decoder operates in three basic stages: The first
millions). Overheads in addressing and error correction step is to cluster noisy reads by similarity to collect all
can be amortized with longer strands, but because of available reads that likely correspond to a unique orig-
diminishing returns and higher errors in longer synthesis inally stored DNA sequence. To do so, we employ an
processes, it is not advantageous to go beyond 500–1000 algorithm that leverages the input randomization done
nucleotides (Fig. 6). during encoding. The next step is to process each cluster
to recover the original sequence using a variant of the
C. Random Access bitwise majority alignment (BMA) algorithm [17] adapted
to support insertions, deletions, and substitutions. Finally,
Random access is fundamental because it is not practical
the bits are recovered by using an RS code to correct errors
to have to sift through a vast data archive to retrieve a
and erasures.
desired data item. Our design allows for random access by
We have used an Illumina NextSeq instrument to read
using polymerase chain reaction (PCR). The read process
over 400 MB of encoded data so far. We have re-sequenced
first determines the primers for the given key (analogous
the data several times, which brings the total of digital
to a hash function) and synthesizes them into new DNA
data read from DNA to the equivalent of well over 1 GB.
molecules. Then, rather than applying sequencing to the
Sequencing error rates have been reasonably low, typically
entire pool of stored molecules, we first apply PCR to
below 1%, and has not prevented us from decoding any
the pool using these primers. PCR amplifies the strands
files. The largest commercial nanopore DNA sequencing
in the pool whose primer target sites match the given
device to which we have access contains about 2000
primers, creating many copies of those strands. To recover
nanopores and delivers error rates of about 10%, after
the file, we now take a sample of the resulting pool, which
recent improvements in its chemistry. Despite this high
contains a large number of copies of all the relevant strands
error rate, we have been able to decode a file read with
but only a few other irrelevant strands. Sequencing this
this platform.
sample therefore returns the data for the relevant key
rather than all data in the system. As a side note, while E. Our Results So Far
PCR amplification is not even (i.e., there may be bias) and
Our work so far demonstrates an end-to-end approach
may amplify undesired strands, it is not a problem for DNA
toward the viability of DNA data storage with large-scale
data storage because of underlying error tolerance of the
random access. Although we have only reported on the
encoding/decoding schemes.
initial 35 files and 200 MB of data [13], we have so
While PCR-based random access [13], [14], [16] is a
far encoded, stored, retrieved, and successfully recovered
viable implementation, we do not believe it is practical to
about 40 distinct files totaling over 400 MB of data in
put all data in a single pool. We instead envision a “library”
more than unique 25 million DNA oligonucleotides synthe-
of pools offering spatial isolation. We estimate each pool to
sized by Twist Bioscience (over three billion nucleotides in
contain about 1 TB of data. An address then maps to both
total). Our results represent an advance of more than an
a pool location and a PCR primer. This design is analogous
order of magnitude over prior work. Our data set focused
to a magnetic tape storage library, where robotic arms are
on technologically advanced data formats and historical
used to retrieve tapes. A production DNA-based storage
relevance, including the Universal Declaration of Human
system would require the use of microfluidic automation
Rights in over 100 languages, a high-definition music video
to perform the necessary reactions. Tape libraries offer
of the band OK Go, and a CropTrust database listing seeds
random access by robotic movement of cartridges and
stored in the Svalbard Global Seed Vault.
fast-forwarding to specific tape segments. The equivalent
We demonstrated our random access methodology
in DNA would be physically isolated “containers” with
based on selective PCR amplification, for which we
DNA, along with some form of molecular selection prior
designed and validated a large library of primers, and
to sequencing and decoding. While PCR is the mechanism
randomly accessed arbitrarily chosen items from our whole
we have focused on so far, one can also use magnetic-bead
pool with zero-byte error. Moreover, we developed a novel
based and other DNA random access methods.
coding scheme that dramatically reduces the sequencing
reads per DNA sequence required for error-free decoding
D. Reading and Decoding to about 6x, while maintaining levels of logical redundancy
Reading back the data involves selecting the appropriate comparable to the best prior codes. Finally, we further
pool where the data of interest are stored, retrieving a stress-tested our coding approach by successfully decoding
sample, and sequencing the DNA. No matter the DNA a file using the more error-prone nanopore sequencing.
sequencing method, the result is a large number of reads.
Recall that each unique strand is replicated many times V. N E A R-M O L E C U L E P R O C E S S I N G
in the sequenced sample, so the result will contain many Most computer systems consist of a few processors sur-
reads for each unique DNA sequence. The decode process rounded by memory. To perform computation, the proces-
sor must load data from memory, operate on it, and write it
back. Even parallel processors and GPUs still have to load
all of the relevant data before doing computation. As appli-
cations become bandwidth-bound, instead of compute-
bound, researchers have sought to bring compute closer
to the data [18].
In the molecular setting, we can take advantage of
nature to perform massively parallel near-molecule com-
putation. If we can formulate the operation and data
such that the result we want is thermodynamically favor-
able, the operation will diffuse through solution and hap-
pen everywhere simultaneously. Random access through
hybridization and PCR as discussed in the previous section
is an example of this. The query strand “searches” for Fig. 7. Stack diagram for a hybrid and a purely electronic
targets in the entire data set, all at once. However, random content-based image retrieval system. Electronic components are
green; molecular components are pink.
access in electronic systems does not scan the entire data
set, so molecular retrieval does not offer any performance
gain.
phenomenon is popularly known as “the curse of dimen-
Here we explore a more compelling case for near-
sionality.” Real systems overcome this limitation by using
molecular processing. We describe a DNA-based hybrid
approximation schemes that reduce the amount of data to
system for content-based image retrieval which we call
be sifted through, at the cost of potentially missing similar
molecular accelerated similarity search (MASS). MASS
images [21]–[23].
relies on a biomolecular mechanism for “fuzzy matching”
Ultimately, a high-recall CBIR system must examine a
and we have demonstrated it at a small scale [27]. We are
large part of the data set. This provides an opportunity
currently working on scaling it to larger databases. The
for MASS to outperform its purely electronic counterparts
purpose of this section is to, given a new molecular mech-
by using the near-molecular compute afforded by the
anism, discuss how to design a feasible hybrid molecular–
molecular domain.
electronic system. We also explore the practicality of such
a system with a model of its latency, necessary reaction
volume, and scalability. B. Architecture
Much like our DNA data storage system described in
A. Molecular Accelerated Similarity Search Section IV, the database of the MASS system consists
Similarity search is a mechanism for searching a large of DNA strands that associate a primer with some data.
data set for objects similar to some given query. We focus Instead of mapping an address (the primer) to some data,
on a particular instance of this problem, content-based the strands in the MASS database map encoded feature
image retrieval (CBIR), the task of finding images that vectors to the address of the image in some other database.
appear visually similar to a given query image. CBIR So the MASS system deals with the image feature vectors
systems power real-world applications such as Google’s and the addresses of the images instead of the actual image
reverse image search. Fig. 7 shows a stack diagram for data.
our proposed CBIR implementation and a purely electronic Fig. 7 shows the lifetime of a single query in the MASS
one. system. The input to the system is a query feature vector
The first step in building a CBIR system is to extract which is encoded into a string of bases and then synthe-
visual features from each image in the database. Visual sized into (many copies of) a query strand. The query
features are usually real-valued numbers that represent strands are then combined with a small sample of the
the activity of some filter applied to the image. These can database in the reaction vessel. In the reaction, the query
be hand-engineered features like scale-invariant feature strands will partially hybridize with matching targets, per-
transform (SIFT) [19], or learned features such as interme- forming the similarity search with massive parallelism.
diate layer activations from a deep neural network [20]. The matching targets can then be sequenced, yielding the
Pairs of feature vectors can be compared using familiar addresses of the similar images. The images themselves
functions such as Euclidean or cosine distance. To find can then be retrieved from a different database down-
images that are visually similar to a query, the system stream, potentially a DNA data store like ours described
searches for image feature vectors within some distance in Section IV.
of the query’s feature vector. The encoder is a critical part of the system. We discuss
Ordinarily, such searches could be accelerated by parti- some of the challenges and tradeoffs in designing it, and
tioning the data into a tree-like data structure. However, show that it is possible to encode feature vectors into DNA
when feature vectors are high dimensional, partitioning such that their similarity correlates with partial hybridiza-
schemes become no better than a linear search. This tion efficiency elsewhere [26].
base [13], 60 bases is sufficient to uniquely address nrxn = 1016 strands represents 1 exabyte of data, so one instance
1e16 images. of MASS has an effective search bandwidth on the order of
100 TB/s. Modeling other systems is outside the scope of
2) Technological Constraints: Sequencing and synthesis
this paper, but we believe that MASS would be competitive
are expected to get exponentially faster, improving at a rate
with or outperform purely electronic systems at this scale.
exceeding Moore’s law [2]. However, we chose to model
sequencing and synthesis rates that are achievable today.
We draw the synthesis rate for our model rsyn from VI. D I S C U S S I O N
recent literature proposing a method to synthesize a Both synthesis and sequencing need to be lower cost and
base every 50 s [25]. Recall that synthesis time in (1) higher throughput than they are today for DNA data stor-
is proportional only to the length of the strand, not the age and computing to succeed. The gap in both dimensions
copy number. Synthesis of a single unique strand is already is orders of magnitude, but it is important to note that,
commercially available on the scale of millimoles, which is when used for data storage, DNA synthesis and sequencing
well above the amount we require. have different requirements than for life sciences. First,
Note that we are assuming the existence of a large when storing data, control over the sequences to be syn-
database of potentially up to ten quadrillion of unique thesized allows for the use of smart error correction to
targets. This is beyond the capability of DNA synthesis tolerate error rates orders of magnitude higher than those
today. Today’s technology can synthesize many unique required for life sciences applications. Second, storage
strands of DNA at once, on the order of millions [13], but applications can tolerate completely missing sequences as
making a database that references ten quadrillion images well as contamination. Third, data storage needs very few
would only become feasible with further advancements. copies of each sequence, compared to the much higher life
sciences requirements. Higher synthesis and sequencing
3) System Capability: Plugging the above constraints
scale implies simultaneously higher throughput and lower
into the model yields a synthesis latency of tsyn of 83 min
costs, so it will be key to a practical, large-scale end-to-end
and a sequencing latency tseq of 2 min. These are, of course,
DNA storage system.
rough estimations due to the coarse granularity of our
model. The partial hybridization and PCR reactions would
take on the order of hours. The bottlenecks are clearly DNA Acknowledgments
synthesis and the reactions, not sequencing. The authors would like to thank the anonymous review-
If we plug in a data set size (equal to the number of ers for their helpful feedback on the manuscript, and the
unique strands in the reaction nrxn ) of 1016 , the model Molecular Information Systems Lab (MISL) members for
shows we only require a reaction volume of 1.7 mL. input on the research and feedback on how to present this
Assuming a feature vector size of 100 bytes, a data set of work.
REFERENCES
[1] V. Zhirnov, R. M. Zadegan, G. S. Sandhu, nanotechnology using strand-displacement scale-invariant keypoints,” Int. J. Comput. Vis.,
G. M. Church, and W. L. Hughes, “Nucleic acid reactions,” Nature Chem., vol. 3, no. 2, vol. 60, no. 2, pp. 91–110, Nov. 2004.
memory,” Nature Mater., vol. 15, no. 4, pp. 103–113, Feb. 2011. [20] J. Wan et al., “Deep learning for content-based
pp. 366–370, 2016. [11] L. Qian, E. Winfree, and J. Bruck, “Neural network image retrieval: A comprehensive study,” in Proc.
[2] R. Carlson. (2014). “Time for new DNA synthesis computation with DNA strand displacement 22nd ACM Int. Conf. Multimedia (MM), 2014,
and sequencing cost curves.” [Online]. Available: cascades,” Nature, vol. 475, no. 7356, pp. 368–372, pp. 157–166, doi: 10.1145/2647868.2654948.
https://synbiobeta.com/time-new-dna-synthesis- Jul. 2011. [21] P. Indyk and R. Motwani, “Approximate nearest
sequencing-cost-curves-rob-carlson [12] N. Blow, “Microfluidics: The great divide,” Nature neighbors: Towards removing the curse of
[3] L. M. Adleman, “Molecular computation of Methods, vol. 6, no. 9, pp. 683–686, 2009. dimensionality,” in Proc. 13th Annu. ACM Symp.
solutions to combinatorial problems,” Science, [13] L. Organick et al., “Random access in large-scale Theory Comput. (STOC), 1998, pp. 604–613, doi:
vol. 266, no. 5187, pp. 1021–1024, Nov. 1994. DNA data storage,” Nature Biotechnol., vol. 36, doic 10.1145/276698.276876.
[4] E. B. Baum, “Building an associative memory vastly pp. 242–248, Jul. 2018. [22] A. Andoni and P. Indyk, “Near-optimal hashing
larger than the brain,” Science, vol. 268, no. 5210, [14] J. Bornholt, R. Lopez, D. M. Carmean, L. Ceze, algorithms for approximate nearest neighbor in
pp. 583–585, Apr. 1995. G. Seelig, and K. Strauss, “A DNA-based archival high dimensions,” Commun. ACM, vol. 51, no. 1,
[5] S. A. Tsaftaris, A. K. Katsaggelos, T. N. Pappas, and storage system,” in Proc. 21st Int. Conf. Archit. pp. 117–122, Jan. 2008.
T. E. Papoutsakis, “DNA-based matching of digital Support Program. Lang. Oper. Syst. (ASPLOS), [23] M. Muja and D. G. Lowe, “Fast approximate nearest
signals,” in Proc. IEEE Int. Conf. Acoust. Speech Mar. 2016, pp. 637–649. neighbors with automatic algorithm configuration,”
Signal Process., May 2004, pp. V-581–V-584. [15] S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and in Proc. VISAPP Int. Conf. Comput. Vis. Theory Appl.,
[6] J. N. Zadeh et al., “NUPACK: Analysis and design of O. Milenkovic, “A rewritable, random-access 2009, p. 2.
nucleic acid systems,” J. Comput. Chem., vol. 32, DNA-based storage system,” Sci. Rep., vol. 5, no. 1, [24] Integrated DNA Technologies. Accessed: Nov. 8,
no. 1, pp. 170–173, Jan. 2011. p. 1763, Sep. 2015. 2017. [Online]. Available: https://www.idtdna.com
[7] M. H. Caruthers, “The chemical synthesis of [16] R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, and [25] M. Sack et al., “Express photolithographic DNA
DNA/RNA: Our gift to science,” J. Biol. Chem., W. J. Stark, “Robust chemical preservation of digital microarray synthesis with optimized chemistry and
vol. 288, no. 2, pp. 1420–1427, 2013. [Online]. information on DNA in silica with error-correcting high-efficiency photolabile groups,”
Available: http://www.jbc.org/content/ codes,” Angew. Chem. Int. Ed., vol. 54, no. 8, J. Nanobiotechnol., vol. 14, no. 1, p. 14, Mar. 2016,
288/2/1420.full.pdf pp. 2552–2555, Feb. 2015. doi: 10.1186/s12951-016-0166-0.
[8] G. M. Church, Y. Gao, and S. Kosuri, [17] A. Batu, T. Kannan, S. Khanna, and S. McGregor, [26] S. Kosuri and G. M. Church, “Large-scale de novo
“Next-generation digital information storage in “Reconstructing strings from random traces,” in DNA synthesis: Technologies and applications,”
DNA,” Science, vol. 337, no. 6102, p. 1628, Proc. 15th Annu. ACM-SIAM Symp. Discrete Nat. Methods, vol. 11, pp. 499–507,
Sep. 2012. Algorithms (SODA), 2004, pp. 910–918. 2014.
[9] N. Goldman et al., “Towards practical, [18] R. Balasubramonian and B. Grot, “Near-data [27] K. Stewart, Y.-J. Chen, D. Ward, X. Liu, G. Seelig,
high-capacity, low-maintenance information processing [guest editors’ introduction],” K. Strauss, and L. Ceze, “A content-addressable
storage in synthesized DNA,” Nature, vol. 494, IEEE Micro, vol. 36, no. 1, pp. 4–5, DNA database with learned sequence encodings,”
no. 7435, pp. 77–80, Jan. 2013. Jan. 2016. in Proc. 24th Int. Conf. DNA Comput. Mol. Program.,
[10] D. Y. Zhang and G. Seelig, “Dynamic DNA [19] D. G. Lowe, “Distinctive image features from Oct. 2018.
Georg Seelig received the Ph.D. degree Max Willsey received the B.S. degree
in physics from the University of Geneva, in computer science from Carnegie Mellon
Geneva, Switzerland. University, Pittsburgh, PA, USA, in 2016.
He did his postdoctoral work in synthetic Currently, he is working toward the Ph.D.
biology and DNA nanotechnology at Cal- degree at the Paul G. Allen School for Com-
ifornia Institute of Technology (Caltech), puter Science & Engineering, University of
Pasadena, CA, USA. Currently, he is an Asso- Washington, Seattle, WA, USA.
ciate Professor in the Paul G. Allen School He is a recipient of a Qualcomm Innovation
of Computer Science & Engineering and the Fellowship and a National Science Founda-
Department of Electrical and Computer Engineering, University of tion (NSF) Graduate Research Fellowship honorable mention. His
Washington, Seattle, WA, USA. He is also an Adjunct Associate research interests are in programming languages and computer
Professor in Bioengineering. architecture with applications in biology.