Genome Assembly White Paper
Advantages of long
reads for genome
assembly
nanoporetech.com/publications nanoporetech.com
Contents
De novo assembly
Case studies
1. Assembling the human genome with PromethION long reads
2. Genomic variation detection in disease
3. 100 tomato genomes in 100 days
4. Assembling the genome of an extremely drug-resistant pathogen
5. Assembling metagenomes
Summary
References
Introduction
Over the last decade, improvements in next generation DNA sequencing
technology have transformed the field of genomics, making it an
essential tool in modern genetic and clinical research laboratories. The
facility to sequence whole genomes or specific genomic regions of
interest is delivering new insights into a variety of applications such as
human health and disease, metagenomics, antimicrobial resistance,
evolutionary biology, and crop breeding.
One of the key steps of whole-genome sequencing (WGS) is the accurate assembly of the vast
amount of data generated into a contiguous stretch of DNA sequence.
This review provides a background to the DNA assembly process and
the associated advantages of long or ultra-long DNA reads, as provided
by nanopore sequencing technology.
Traditional technologies have required users to sequence short lengths of DNA, which must
then be reassembled back into their original order as accurately as possible. Such short-read
sequencing technologies, however, present a number of challenges, particularly the difficulty
of accurately analysing repetitive regions and large structural variants1.
This means that many reference genomes that were created using short-read sequencing
are highly fragmented, which in turn introduces bias into any alignments made against that
reference2. This review shows how these challenges are now being met by the emergence
of nanopore sequencing, which supports the generation of sequencing reads of any length,
from short to ultra-long.
Figure 1
A schematic showing how long sequencing reads can deliver simplified, less ambiguous genome
assembly. Long reads (solid arrows) have greater overlap with other reads than is provided by
short reads (dashed arrows), allowing more accurate assemblies, especially in repeat regions
(R). Image adapted from Schatz (2014)3.
Advantages of long sequencing
reads for genome assembly
Until recently, cost-effective generation of large amounts of sequence data could only be
performed using short-read (<300 bp) sequencing technologies; however, nanopore-based
sequencing can process very long DNA fragments (current record >4 Mb)4 to create long
reads, which deliver a number of significant benefits.
Ease of assembly
Most genomes contain significant amounts of repetitive DNA (e.g. transposons, satellites,
gene duplications). As the short reads produced by traditional next generation sequencing
(NGS) technology may not span each given repetitive region, the resulting genome assemblies
can be highly fragmented2,5. Long-read capable sequencing technologies have a significant
advantage here as the reads generated are more likely to span the full repetitive region,
allowing the creation of accurate genome assemblies with minimal gaps (Figures 2 and 3).
Figure 2
Like a jigsaw puzzle with
large pieces, long-read DNA
is much easier to assemble
than short-read DNA. The
Escherichia coli genome
comprises 4.6 million bases,
which would equate to
92,000 fragments of 50 bp
or just 9 fragments of 500 kb
in length.
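The arithmetic behind the jigsaw analogy can be checked with a short sketch. It assumes an idealised case in which the genome is cut into equal-length, non-overlapping fragments; real sequencing reads overlap and vary in length, so this is an illustration only. The function name `fragments_to_tile` is hypothetical.

```python
def fragments_to_tile(genome_length_bp: int, fragment_length_bp: int) -> float:
    """Number of equal-length, non-overlapping fragments that add up to one genome length."""
    return genome_length_bp / fragment_length_bp


# Approximate E. coli genome size quoted in Figure 2.
E_COLI_GENOME_BP = 4_600_000

short_fragments = fragments_to_tile(E_COLI_GENOME_BP, 50)       # 50 bp fragments
long_fragments = fragments_to_tile(E_COLI_GENOME_BP, 500_000)   # 500 kb fragments

print(f"50 bp fragments:  {short_fragments:,.0f}")   # 92,000
print(f"500 kb fragments: {long_fragments:.1f}")     # 9.2
```

The 500 kb case works out to 9.2, which the figure rounds to 9.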
Figure 3
A schematic highlighting the advantages of long reads in de novo assembly of repetitive regions.
Long read lengths are more likely to incorporate the whole repetitive region (blue boxes),
providing much simpler analysis and more accurate genome assemblies with fewer gaps.
Panel labels: short reads and short-read consensus; long reads and long-read consensus.
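The ambiguity that repeats create for short reads can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: the sequences are invented, matching is exact (no sequencing errors), and `count_placements` is a hypothetical helper, not part of any assembler. It counts how many positions in a small genome a read could have come from.

```python
def count_placements(genome: str, read: str) -> int:
    """Count every position (overlapping matches included) where the read matches exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


# Invented toy sequences: two identical repeat copies separated by unique sequence.
flank_left = "TTGACCAT"
repeat = "GATTACAGGCTTAACC"
middle = "CGCGTATA"
flank_right = "AGTCCGGA"
genome = flank_left + repeat + middle + repeat + flank_right

# A short read that lies entirely inside the first repeat copy.
short_read = genome[12:20]
# A long read that spans the first repeat copy plus unique sequence on both sides.
long_read = genome[4:28]

print("short read placements:", count_placements(genome, short_read))  # 2 (ambiguous)
print("long read placements: ", count_placements(genome, long_read))   # 1 (unique)
```

Because the long read carries unique flanking sequence on both sides of the repeat, it has a single possible placement, whereas the short read fits equally well in either repeat copy.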
Figure 4
Structural variation a) classes and b) variant size and frequency in the human genome.
Image adapted from Huddleston (2017)11.
Assembly quality
The quality of genome assemblies is often assessed by the number of contigs required to
represent the genome, the length of these contigs relative to the size of the genome, and
the proportion of reads that can be assigned to the contigs. One of the most commonly used
assembly metrics is N50.
The contig N50 value is calculated by sorting all contigs by length, calculating the
cumulative assembly length (i.e. adding the length of all contigs together), and identifying
the length of the contig which is situated in the middle of the cumulative length (Figure 5).
The higher the contig N50 value, the more contiguous the assembly.
The much higher N50 values generated using long nanopore sequencing reads underline the
superior quality and completeness of genome assembly when compared with short-read
sequencing technology.
Figure 5
The N50 metric is one measure of the quality of a genome assembly. Other metrics such as
N75 and N25 are also sometimes used to further assess assembly quality. Image adapted from
Jansen (2016)12.
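The N50 calculation described above can be sketched in a few lines. This is a minimal illustration that assumes the common convention of sorting contigs from longest to shortest and reporting the length of the contig at which the running total first reaches the chosen fraction of the assembly length. The function name `nx` and the contig lengths are invented for the example.

```python
from typing import Iterable


def nx(contig_lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Length of the contig at which the cumulative length, counted from the
    longest contig downwards, first reaches `fraction` of the total assembly length.
    fraction=0.5 gives N50; 0.25 and 0.75 give N25 and N75."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")

    threshold = sum(lengths) * fraction
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= threshold:
            return length
    return lengths[-1]  # guard against floating-point rounding at fraction=1


# Invented contig lengths (arbitrary units); total assembly length is 300.
contigs = [80, 70, 50, 40, 30, 20, 10]

print("N25:", nx(contigs, 0.25))  # 80
print("N50:", nx(contigs))        # 70
print("N75:", nx(contigs, 0.75))  # 40
```

In this example the two longest contigs (80 + 70 = 150) already account for half of the assembly, so the N50 is 70.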
A study by Cherukuri and Janga (2016) comparing a number of assembly algorithms found the
OLC approach to be most favourable for nanopore long sequencing reads, delivering higher
genome coverage combined with significantly larger average contig lengths, and lower overall
contig numbers15. However, other studies applying combined de Bruijn and OLC approaches also
report accurate genome assemblies16,17. Researchers have also compared long-read assembly
tools for different applications, such as in the context of human genome assembly8 and
metagenomic analysis2.
Assembler     Reference-guided/de novo   Github link                                 Reference
Shasta        De novo                    https://github.com/chanzuckerberg/shasta    Shafin et al. (2019)8
SMARTdenovo   De novo                    https://github.com/ruanjue/smartdenovo      Ruan22
Table 1
Selection of assembly tools designed or adapted for Oxford Nanopore long reads.
For the latest genome assemblers, visit the Tools section of the Resource Centre at
www.nanoporetech.com/resource-centre.
CASE STUDY 1
Assembling the human genome
with PromethION long reads
Despite continual advancements in sequencing technology, full characterisation of the
human genome has yet to be achieved. Combining high output with the facility to generate
long and ultra-long reads, the PromethION platform from Oxford Nanopore Technologies
now makes de novo assembly of highly contiguous human genomes possible, both at scale
and with unprecedented efficiency.
Shafin and colleagues used an optimised PromethION sequencing method and assembly workflow to
sequence 11 human cell lines in 9 days on a single PromethION8. They obtained 2.3 Tb of
sequence with an average depth of coverage per sample of 63x and read N50 of 42 kb, including
6.5x of “ultra-long” 100 kb+ reads (Figure 6).
‘…in terms of contemporary long-read sequencing platforms, this throughput is unmatched’8
The team introduced three new computational tools for genome assembly and polishing: Shasta,
a de novo long-read assembler, and MarginPolish and HELEN, for assembly polishing. Shasta was
capable of producing a draft human genome assembly in under 6 hours for only $70, a
significant time and cost reduction compared to the frequently-used Canu assembler. Using
MarginPolish/HELEN for polishing cost $108 and took 29 hours per sample, on average, and
achieved 99.9% sequence identity for haploid samples. The authors stated that this could be
even further improved by taking advantage of the real-time capabilities of nanopore
sequencing: ‘With real-time base calling, a DNA-to-de novo assembly could be achieved in less
than 96 hours with little difficulty.’
The impact of comprehensive, rapid nanopore sequencing workflows for human genetics is clear:
‘Such speed could make these techniques practical for screening human genomes for
abnormalities in difficult-to-sequence regions’8.
Figure 6
Nanopore sequencing results: a) read N50s for each flow cell, and b) genome depth of coverage.
Panel b) axis label: coverage of a human genome (x).
CASE STUDY 2
Genomic variation detection in disease
Approximately 20–30% of lung cancer patients remain undiagnosed with respect to their
cancerous mutations; variant detection is particularly pressing in cases where an effective
treatment remains elusive25. Using PromethION whole-genome sequencing of cancer cell lines
and lung cancer specimens, Sakamoto et al. identified a unique class of complex structural
variant, consisting of copy number changes, inversions, and deletions, which could not be
characterised using short-read sequencing25. These variants were detected in six of the nine
clinical research specimens analysed, and generally resulted in reduced gene expression
levels and disruption of genes in the region. The authors concluded that these mutations may
point to the causal mechanism of disease in individuals for whom this is unknown.
Neuronal intranuclear inclusion disease (NIID) is a progressive neurodegenerative disease
which is difficult to identify ante-mortem due to its wide range of clinical manifestations;
its genetic basis remains unsolved. Through PromethION whole-genome sequencing of NIID
patients and their families, Jun Sone and colleagues identified a repeat expansion within the
gene NOTCH2NLC which was consistently present in familial and sporadic cases of the disease
but absent from all unaffected family members (Figure 7)26. Using the Oxford Nanopore
protocol for Cas9-mediated enrichment, the team subsequently enriched for the NOTCH2NLC
repeat sequence to obtain higher read coverage at this locus and form a consensus; the
results from this consensus analysis agreed well with their initial analysis of the
whole-genome sequencing data.
Figure 7
NOTCH2NLC repeat
consensus sequence of
multiple alignments. Note
that there are polymorphisms
between normal alleles. F1-1
and F2-2 are familial patients.
Expansion: allele with repeat
expansion; normal: normal
allele. Adapted from Sone
(2019)26.
CASE STUDY 3
100 tomato genomes
in 100 days
The tomato is one of the most valuable agricultural crops in the world, with an annual
production of over 175 million tonnes and a value of $85 billion. Tomatoes are also important
from a model system perspective due to their extensive phenotypic variation with over
15,000 known varieties27.
Figure 8
Nanopore sequencing results from 12 tomato genomes revealed (a) approximately 25,000–45,000
SVs per genome, the majority of which were deletions and insertions, and (b) most SVs were
specific to each sample. Figures courtesy of Prof. Michael Schatz, Johns Hopkins University,
USA.
CASE STUDY 4
Assembling the genome of an extremely drug-resistant pathogen
Short-read sequencing has been used for Mycobacterium tuberculosis (MTB) assembly previously,
yet it has limited ability to resolve long, repetitive regions and structural variants, and
is hindered by the relatively high GC content (65%) of the MTB genome.
Using the Oxford Nanopore MinION, Bainomugisa et al. performed whole-genome sequencing of the
modern Beijing lineage strain of TB responsible for drug resistance outbreaks in the Western
Province of Papua New Guinea30. The team sequenced and assembled a high-quality, complete
genome, including all PE/PPE (proline-glutamate/proline-proline-glutamate) gene families,
which are thought to contribute to virulence but are often excluded from short-read
sequencing analysis due to their repetitive and GC-rich nature.
With a sequencing depth of 238x obtained from a single MinION Flow Cell, a final assembly of
4,404,064 bp was obtained in a single contig with >99.9% accuracy. Furthermore, 166/168
PE/PPE genes had 100% breadth of coverage at 299.87x depth, compared to a short-read dataset
of this strain where only 92/168 of these genes obtained 100% coverage at an average depth of
46.3x. Finally, genome-wide variant calling identified three deleted genic regions and a
4,490 bp insertion that were both absent from the H37Rv reference strain genome (Figure 9).
Figure 9
Integrative Genomic Viewer
(IGV) alignments of short-
read data from different
M. tuberculosis lineages
(A-G) mapped against the
nanopore draft genome. Left
image: a 4,490 bp insertion
spanning 7 annotated genes.
Right image: a 390 bp
insertion. Image courtesy
of Dr. Lachlan Coin, The
University of Queensland,
Australia.
CASE STUDY 5
Assembling metagenomes
Analysis of complex metagenomic samples has wide-reaching applications, with the
potential to recover genomes from previously unexplored microbial species, such as those
comprising gut microbiomes, as well as in the context of outbreak surveillance. Short-read
sequencing and assembly of metagenomes is challenging due to the difficulty in assigning
short reads to the correct genome of origin.
Figure 10
Increase in assembly
contiguity as a function
of read coverage for a
representative short-read
assembler (a), long-read
assembler (b), and the hybrid
OPERA-MS assembler (c).
Unassembled genomes
are shown as circles with
black borders. Adapted from
Bertrand (2019)31.
Summary
It is clear that the growing usage of Oxford Nanopore sequencing technology and
associated improvements in genome assemblies will bring additional and rapid insight
into genomics, further refining the relationship of phenotype to genotype.
About Oxford
Nanopore Technologies
Oxford Nanopore Technologies introduced the world’s first nanopore DNA sequencer, the MinION,
a portable, real-time, low-cost device, followed by the larger GridION, PromethION P24 and
P48 devices, and smaller Flongle (Figure 11). The latest addition to the range, MinION Mk1C,
combines the portability and power of the MinION with high-performance compute and an
integrated touchscreen, providing a complete, go-anywhere solution for nanopore sequencing.
Enabling the generation of long sequencing reads, Oxford Nanopore platforms deliver
high-quality whole-genome sequencing with the facility to span repetitive regions and
structural variants, offering significant advantages over traditional short-read sequencing
technology.
A number of protocols are available for nanopore sequencing, enabling optimised whole-genome
analysis for a range of sample types and DNA input amounts. Our dedicated whole-genome
sequencing Getting Started guides, available on the Resource centre
(nanoporetech.com/resource-centre), have been developed to provide further details and
sample-to-answer guidance on genome assembly with Oxford Nanopore long reads.
Find out more at: www.nanoporetech.com
Figure 11
Oxford Nanopore sequencing platforms (from left to right): Flongle, a flow cell adapter for MinION and GridION; the portable
MinION and the latest addition, MinION Mk1C; GridION, with capacity for five Flongle or MinION Flow Cells; and the high-
throughput PromethION (P24 or P48) platform.
References
1. Mak, A. C. et al. Genome-Wide structural variation detection by genome mapping on nanochannel arrays. Genetics 202(1):351–62 (2016).
2. Latorre-Perez, A. et al. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. BioRxiv 722405 (2019).
3. Schatz, M. De novo assembly of complex genomes using single molecule sequencing. Presentation. Available at: http://schatzlab.cshl.edu/presentations/2014-01-14.PAG.Single%20Molecule%20Assembly.pdf [Accessed: 3 October 2019]
4. Oxford Nanopore Technologies. Ultra-Long DNA Sequencing Kit. Available at: https://store.nanoporetech.com/ultra-long-dna-sequencing-kit.html [Accessed: 23 August 2021]
5. Chaisson, M. J. P. et al. Genetic variation and the de novo assembly of human genomes. Nat. Rev. Genet. 16(11):627–40 (2015).
6. Eichler, E. Genetic Variation, Comparative Genomics, and the Diagnosis of Disease. N. Engl. J. Med. 381:64-74 (2019).
7. Miga, K. Telomere-to-telomere assembly of a complete human X chromosome. Presentation. Available at: https://nanoporetech.com/resource-centre/telomere-telomere-assembly-complete-human-x-chromosome [Accessed: 9 September 2019]
8. Shafin, K. et al. Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. BioRxiv 715722 (2019).
9. Zarrei, M. et al. A copy number variation map of the human genome. Nat. Rev. Genet. 16(3):172–83 (2015).
10. Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20(1):97 (2019).
11. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27(5):677-685 (2017).
12. Jansen, H. De novo genome assembly with MinION long reads. Presentation. Available at: https://community.nanoporetech.com/posts/136 [Accessed: 30 September 2019]
13. Sovic, I. et al. Github: Racon [Online]. Available at: https://github.com/lbcb-sci/racon [Accessed: 9 September 2019]
14. Simpson, J. Github: Nanopolish [Online]. Available at: https://github.com/jts/nanopolish [Accessed: 9 September 2019]
15. Cherukuri, Y. and Janga, S. C. Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches. BMC Genomics 22:17 Suppl 7:507 (2016).
16. Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA 113(52):E8396-E8405 (2016).
17. Judge, K. et al. Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology. Microbial Genomics doi: 10.1099/mgen.0.000085 (2016).
18. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27:722-736 (2017).
19. Kolmogorov, M. et al. Assembly of long, error-prone reads using repeat graphs. Nat. Biotech. 37:540-546 (2019).
20. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103–10 (2016).
21. Alonge, M. et al. Fast and accurate reference-guided scaffolding of draft genomes. BioRxiv 519637 (2019).
22. Ruan, J. Available at: https://github.com/ruanjue/smartdenovo [Accessed: 5 January 2017].
23. Wick, R. R. et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13(6):e1005595 (2017).
24. Ruan, J. and Li, H. Fast and accurate long-read assembly with wtdbg2. BioRxiv 530972 (2019).
25. Sakamoto, Y. et al. Long read sequencing reveals a novel class of structural aberrations in cancers: identification and characterization of cancerous local amplifications. BioRxiv 620047 (2019).
26. Sone, J. et al. Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat. Genet. 51(8):1215-1221 (2019).
27. Schatz, M. 100 genomes in 100 days: The structural variant landscape in tomato genomes. Available at: https://nanoporetech.com/resourcecentre/michael-schatz-100-genomes-100-days-structural-variantlandscape-tomato-genomes [Accessed: 7 May 2019]
28. Soyk, S. et al. Duplication of a domestication locus neutralized a cryptic variant that caused a breeding barrier in tomato. Nat. Plants. 5(5):471-479 (2019).
29. WHO (2018) Tuberculosis [Online]. Available at: https://www.who.int/en/news-room/fact-sheets/detail/tuberculosis [Accessed: 17 September 2019]
30. Bainomugisa, A. et al. A complete high-quality MinION nanopore assembly of an extensively drug-resistant Mycobacterium tuberculosis Beijing lineage strain identifies novel variation in repetitive PE/PPE gene regions. Microb. Genom. doi: 10.1099/mgen.0.000188 (2018).
31. Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37(8):937-944 (2019).
32. Kafetzopoulou, L. E. et al. Metagenomic sequencing at the epicenter of the Nigeria 2018 Lassa fever outbreak. Science 363(6422):74-77 (2019).
Oxford Nanopore Technologies, the Wheel icon, Flongle, GridION, MinION, PromethION, and VolTRAX are registered
trademarks of Oxford Nanopore Technologies in various countries. All other brands and names contained are the property of
their respective owners. © 2021 Oxford Nanopore Technologies. All rights reserved. Flongle, GridION, MinION, PromethION,
and VolTRAX are for research use only.
WP_1043(EN)_V3_24Aug2021