Genome Assembly White Paper

GENOME ASSEMBLY
Advantages of long reads for genome assembly
WHITE PAPER October 2019
nanoporetech.com/publications nanoporetech.com
Contents
De novo assembly
Advantages of long sequencing reads for genome assembly
  Ease of assembly
  Facility to span repetitive genomic regions
  Identification of large structural variation
  Assembly quality
Genome assembly tools
Case studies
  1. Assembling the human genome with PromethION long reads
  2. Genomic variation detection in disease
  3. 100 tomato genomes in 100 days
  4. Assembling the genome of an extremely drug-resistant pathogen
  5. Assembling metagenomes
Summary
About Oxford Nanopore Technologies
References
Introduction
Over the last decade, improvements in next generation DNA sequencing
technology have transformed the field of genomics, making it an
essential tool in modern genetic and clinical research laboratories. The
facility to sequence whole genomes or specific genomic regions of
interest is delivering new insights into a variety of applications such as
human health and disease, metagenomics, antimicrobial resistance,
evolutionary biology, and crop breeding.
One of the key steps of whole-genome sequencing (WGS) is the accurate assembly of the vast
amount of data generated into a contiguous stretch of DNA sequence.
This review provides a background to the DNA assembly process and
the associated advantages of long or ultra-long DNA reads, as provided
by nanopore sequencing technology.
For applications such as the analysis of larger structural variation, or de novo assembly,
whole-genome sequencing (WGS) is typically the technique of choice.
Whole-genome assembly — solving the puzzle
Traditional technologies have required users to sequence short lengths of DNA, which must
then be reassembled back into their original order as accurately as possible. Such short-read
sequencing technologies, however, present a number of challenges, particularly the difficulty
of accurately analysing repetitive regions and large structural variants1.
This means that many reference genomes that were created using short-read sequencing
are highly fragmented, which in turn introduces bias into any alignments made against that
reference2. This review shows how these challenges are now being met by the emergence
of nanopore sequencing, which supports the generation of any length of sequencing read,
from short to ultra-long.
OXFORD NANOPORE TECHNOLOGIES | GENOME ASSEMBLY 3

De novo assembly
De novo assembly aims to reconstruct the original genome sequence from the set of reads,
for example when there is no reference genome available or a researcher prefers to avoid
potential bias that could arise from using an imperfect reference. Generally, the first step is to
find overlaps between the reads and build a graph to describe their relationship.
Several approaches have been used for this purpose, including Greedy extension, de Bruijn
graphs, and Overlap Layout Consensus (OLC) (Figure 1). The graph is then simplified and a
consensus is extracted. A key step in this process is the assembly of contiguous,
uninterrupted stretches of overlapping DNA, so-called contigs. As detailed in the next
section, long-read capable sequencing technology offers many advantages for both de novo
and alignment-based genome assembly.
Long sequencing reads support simplified and less ambiguous genome assembly.
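As a rough illustration of the overlap-finding step described above, the following Python sketch computes exact suffix-prefix overlaps between reads and merges them greedily. It is a toy under stated assumptions: the reads are error-free and come from one strand, and the function names (longest_overlap, overlap_graph, greedy_merge) and the min_len threshold are hypothetical. The assemblers listed in Table 1 use error-tolerant and far more scalable methods.

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map each ordered read pair (i, j) to its overlap length, if any."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = longest_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


def greedy_merge(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        edges = overlap_graph(seqs, min_len)
        if not edges:
            break  # no remaining overlaps: the sequences stay as separate contigs
        (i, j), k = max(edges.items(), key=lambda item: item[1])
        merged = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]
    return seqs


# Made-up reads drawn from the toy sequence ATGCGTACGTTAGC
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
contigs = greedy_merge(reads)
```

With longer reads each overlap is longer and there are fewer pairs to consider, which is the intuition behind Figure 1.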
Figure 1
A schematic showing how long sequencing reads can deliver simplified, less ambiguous genome
assembly. Long reads (solid arrows) have greater overlap with other reads than is provided by
short reads (dashed arrows), allowing more accurate assemblies, especially in repeat regions
(R). Image adapted from Schatz (2014)3.
Panel labels: long reads; short reads.
Advantages of long sequencing
reads for genome assembly
Until recently, cost-effective generation of large amounts of sequence data could only be
performed using short-read (<300 bp) sequencing technologies; however, nanopore-based
sequencing can process very long DNA fragments (current record >4 Mb)4 to create long
reads, which deliver a number of significant benefits.
Ease of assembly
The longer a sequencing read, the more overlap it will have with other reads. As such, it is
much easier to assemble the DNA fragments back into the correct order. This can be visualised
much like a jigsaw; the larger the pieces, the easier the puzzle (Figure 2).
Long nanopore sequencing reads offer easier assembly and the ability to span repetitive
genomic regions.
Facility to span repetitive genomic regions
Most genomes contain significant amounts of repetitive DNA (e.g. transposons, satellites,
gene duplications). As the short reads produced by traditional next generation sequencing
(NGS) technology may not span each given repetitive region, the resulting genome assemblies
can be highly fragmented2,5. Long-read capable sequencing technologies have a significant
advantage here as the reads generated are more likely to span the full repetitive region,
allowing the creation of accurate genome assemblies with minimal gaps (Figures 2 and 3).
Figure 2
Like a jigsaw puzzle with large pieces, long-read DNA is much easier to assemble than
short-read DNA. The Escherichia coli genome comprises 4.6 million bases, which would equate
to 92,000 fragments of 50 bp or just 9 fragments of 500 kb in length.
Panel labels: ~50 base-pair read, ~92,000 “pieces”; ~500,000 base-pair read, ~9 “pieces”.
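The piece counts quoted in Figure 2 can be checked with simple division. The short Python sketch below assumes fragments tile the genome end to end with no overlap, which is a simplification made only for the jigsaw analogy; the function name pieces_needed is hypothetical.

```python
GENOME_LENGTH_BP = 4_600_000  # E. coli genome size quoted in Figure 2


def pieces_needed(genome_bp, fragment_bp):
    """Return (whole fragments, leftover bases) when tiling without overlap."""
    return divmod(genome_bp, fragment_bp)


short_pieces = pieces_needed(GENOME_LENGTH_BP, 50)       # 50 bp fragments
long_pieces = pieces_needed(GENOME_LENGTH_BP, 500_000)   # 500 kb fragments
```

For 50 bp fragments the division is exact at 92,000 pieces; for 500 kb fragments it gives 9 whole pieces with 100 kb left over, which the figure rounds to 9.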

Figure 3
A schematic highlighting the advantages of long reads in de novo assembly of repetitive regions.
Long read lengths are more likely to incorporate the whole repetitive region (blue boxes),
providing much simpler analysis and more accurate genome assemblies with fewer gaps.
Panel labels: genome sequence; short reads; short-read consensus; long reads; long-read consensus.
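To put a rough number on "more likely to span", the sketch below estimates how many reads are expected to cover a repeat completely, with some unique flanking sequence on each side. It is a back-of-envelope model, not taken from the studies cited here: it assumes reads of one fixed length starting at uniformly random positions, and the function name, the anchor parameter, and all example values are hypothetical. Under those assumptions the expected count is approximately coverage × (read length − repeat length − 2 × anchor) / read length.

```python
def expected_spanning_reads(read_len, repeat_len, coverage, anchor=0):
    """Approximate number of reads that span a repeat plus `anchor` bp each side.

    Assumes fixed-length reads with uniformly random start positions.
    """
    usable = read_len - repeat_len - 2 * anchor
    if usable <= 0:
        return 0.0  # the read is too short to span the repeat at all
    return coverage * usable / read_len


# Illustrative values only: a 5 kb repeat sequenced at 30x depth
short_read_estimate = expected_spanning_reads(300, 5_000, 30)
long_read_estimate = expected_spanning_reads(20_000, 5_000, 30, anchor=500)
```

In this made-up example no 300 bp read can span the repeat, whereas about 21 of the 20 kb reads would be expected to.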
Identification of large structural variation
The term structural variation (SV) covers a range of genetic alterations, including copy
number variation (CNV), duplications, translocations, and inversions (Figure 4).
‘Structural variants…have been the most difficult to characterize with the use of short-read
DNA sequencing methods, and so pathogenic alleles have gone undetected’6
Traditionally, structural variants were defined as spanning over 1,000 base pairs; however,
with the advent of higher-resolution techniques for genome analysis, this definition has now
been revised to include variants from 50 bp in length6. Structural variants have been
associated with a number of diseases, such as autism, schizophrenia, and cancer; moreover,
compared to single nucleotide variants, they are 3 times as likely to be associated with a
genome-wide association signal6. It has also been suggested that unknown SV may contribute
to the “missing heritability” of human disease6. For these reasons, SV has become an
increasingly important area for research.
Most existing genome assemblies have been created using short-read sequencing technology,
which is limited in its ability to capture large structural variation. Even the well-studied
human genome has large gaps in its assembly7,8. As CNV alone is estimated to comprise
4.8–9.5% of the human genome9, it is clear that accurate analysis of such regions is of
significant importance.
Unlike short-read technology, longer nanopore sequencing reads can cover the whole structural
variant in one read. This results in more accurate genome assemblies, facilitating better
understanding of genome architecture in genetic diseases10.
For further details and sample-to-answer guidance on genome assembly using Oxford Nanopore
long reads, view our Getting Started guides at nanoporetech.com/resource-centre.
Figure 4
Structural variation a) classes and b) variant size and frequency in the human genome.
Image adapted from Huddleston (2017)11.
Assembly quality
The quality of genome assemblies is often assessed by the number of contigs required to
represent the genome, the length of these contigs relative to the size of the genome, and the
proportion of reads that can be assigned to the contigs. One of the most commonly used
assembly metrics is N50.
The contig N50 value is calculated by sorting all contigs by length, calculating the
cumulative assembly length (i.e. adding the length of all contigs together), and identifying
the length of the contig which is situated in the middle of the cumulative length (Figure 5).
The higher the contig N50 value, the more contiguous the assembly.
The much higher N50 values generated using long nanopore sequencing reads underline the
superior quality and completeness of genome assembly when compared with short-read
sequencing technology.
Figure 5
The N50 metric is one measure of the quality of a genome assembly. Other metrics such as
N75 and N25 are also sometimes used to further assess assembly quality. Markers shown in the
figure: N75, N50, N25. Image adapted from Jansen (2016)12.
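The N50 calculation described above can be sketched in a few lines of Python. This is a minimal illustration using the common convention that N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly length; the function name nx and the contig lengths are made up for the example.

```python
def nx(contig_lengths, fraction=0.5):
    """Return the Nx value (N50 when fraction=0.5) for a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    threshold = sum(contig_lengths) * fraction
    running_total = 0
    # Walk from the longest contig to the shortest, accumulating length
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length


# Hypothetical assembly of seven contigs totalling 300 kb (lengths in kb)
contigs_kb = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(contigs_kb)
n75 = nx(contigs_kb, fraction=0.75)
n25 = nx(contigs_kb, fraction=0.25)
```

For these lengths the N25, N50, and N75 values are 80, 70, and 40 kb respectively, which shows why N25 is never smaller than N50, and N50 never smaller than N75, for the same assembly.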

Genome assembly tools
A range of tools are available for genome assembly utilising long-read sequencing data
(Table 1). Tools differ in what they offer, and the extent of their use, and should be assessed
according to the specific project requirements. Assembly tools often include, or are coupled
with, error-correction methods (e.g. Racon13 and nanopolish14), which utilise information in the
assembled reads, or take information from other reads, to produce high consensus
accuracy.
A study by Cherukuri and Janga (2016) comparing a number of assembly algorithms found the
OLC approach to be most favourable for nanopore long sequencing reads, delivering higher
genome coverage combined with significantly larger average contig lengths, and lower overall
contig numbers15. However, other studies applying combined de Bruijn and OLC approaches also
report accurate genome assemblies16,17. Researchers have also compared long-read assembly
tools for different applications, such as in the context of human genome assembly8 and
metagenomic analysis2.
Assembler | Reference-guided/de novo | GitHub link | Reference
Canu | De novo | https://github.com/marbl/canu | Koren et al. (2017)18
Flye | De novo | http://github.com/fenderglass/Flye | Kolmogorov et al. (2019)19
Miniasm | De novo | https://github.com/lh3/miniasm | Li (2016)20
RaGOO | Reference-guided | https://github.com/malonge/RaGOO | Alonge et al. (2019)21
Shasta | De novo | https://github.com/chanzuckerberg/shasta | Shafin et al. (2019)8
SMARTdenovo | De novo | https://github.com/ruanjue/smartdenovo | Ruan22
Unicycler | De novo | https://github.com/rrwick/Unicycler | Wick et al. (2017)23
Wtdbg2 | De novo | https://github.com/ruanjue/wtdbg2 | Ruan and Li (2019)24
Table 1
Selection of assembly tools designed or adapted for Oxford Nanopore long reads.
For the latest genome assemblers, visit the Tools section of the Resource Centre at
www.nanoporetech.com/resource-centre.
CASE STUDY 1
Assembling the human genome
with PromethION long reads
Despite continual advancements in sequencing technology, full characterisation of the
human genome has yet to be achieved. Combining high output with the facility to generate
long and ultra-long reads, the PromethION platform from Oxford Nanopore Technologies
now makes de novo assembly of highly contiguous human genomes possible, both at scale
and with unprecedented efficiency.
Shafin and colleagues used an optimised PromethION sequencing method and assembly workflow
to sequence 11 human cell lines in 9 days on a single PromethION8. They obtained 2.3 Tb of
sequence with an average depth of coverage per sample of 63x and read N50 of 42 kb, including
6.5x of “ultra-long” 100 kb+ reads (Figure 6).
‘…in terms of contemporary long-read sequencing platforms, this throughput is unmatched’8
The team introduced three new computational tools for genome assembly and polishing: Shasta,
a de novo long-read assembler, and MarginPolish and HELEN, for assembly polishing. Shasta was
capable of producing a draft human genome assembly in under 6 hours for only $70, a
significant time and cost reduction compared to the frequently-used Canu assembler. Using
MarginPolish/HELEN for polishing cost $108 and took 29 hours per sample, on average, and
achieved 99.9% sequence identity for haploid samples. The authors stated that this could be
even further improved by taking advantage of the real-time capabilities of nanopore
sequencing: ‘With real-time base calling, a DNA-to-de novo assembly could be achieved in less
than 96 hours with little difficulty.’
The impact of comprehensive, rapid nanopore sequencing workflows for human genetics is clear:
‘Such speed could make these techniques practical for screening human genomes for
abnormalities in difficult-to-sequence regions’8.
Figure 6
Nanopore sequencing results: a) Read N50s for each flow cell, and b) genome depth of coverage
(x) as a function of read length. Dashed lines indicate coverage at 10 kb and 100 kb. HG00733
is presented in bold as an example. Adapted from Shafin (2019)8.
Axis labels: read N50 (kb); individual genomes; coverage of a human genome (x).

CASE STUDY 2
Genomic variation
detection in disease
Obtaining highly contiguous genome assemblies with long nanopore sequencing reads
enables the comprehensive investigation of disease-associated variants.
Approximately 20–30% of lung cancer patients remain undiagnosed with respect to their
cancerous mutations; variant detection is particularly pressing in cases where an effective
treatment remains elusive25. Using PromethION whole-genome sequencing of cancer cell lines
and lung cancer specimens, Sakamoto et al. identified a unique class of complex structural
variant, consisting of copy number changes, inversions, and deletions, which could not be
characterised using short-read sequencing25. These variants were detected in six of the nine
clinical research specimens analysed, and generally resulted in reduced gene expression
levels and disruption of genes in the region. The authors concluded that these mutations may
point to the causal mechanism of disease in individuals for whom this is unknown.
Neuronal intranuclear inclusion disease (NIID) is a progressive neurodegenerative disease
which is difficult to identify ante-mortem due to its wide range of clinical manifestations;
its genetic basis had remained unsolved. Through PromethION whole-genome sequencing of NIID
patients and their families, Jun Sone and colleagues identified a repeat expansion within the
gene NOTCH2NLC which was consistently present in familial and sporadic cases of the disease
but absent from all unaffected family members (Figure 7)26. Using the Oxford Nanopore
protocol for Cas9-mediated enrichment, the team subsequently enriched for the NOTCH2NLC
repeat sequence to obtain higher read coverage at this locus and form a consensus; the
results from this consensus analysis agreed well with their initial analysis of the
whole-genome sequencing data.
Figure 7
NOTCH2NLC repeat
consensus sequence of
multiple alignments. Note
that there are polymorphisms
between normal alleles. F1-1
and F2-2 are familial patients.
Expansion: allele with repeat
expansion; normal: normal
allele. Adapted from Sone
(2019)26.
CASE STUDY 3
100 tomato genomes
in 100 days
The tomato is one of the most valuable agricultural crops in the world, with an annual
production of over 175 million tonnes and a value of $85 billion. Tomatoes are also important
from a model system perspective due to their extensive phenotypic variation with over
15,000 known varieties27.
Michael Schatz and colleagues are sequencing 12–16 samples per week at 40x depth of coverage,
using the PromethION platform, to achieve a target of 100 genome sequences in 100 days27.
Initial results from the first 12 sequenced genomes revealed substantial variation between
samples, with 25,000–45,000 structural variants per sample (Figure 8). For example, an 83 kb
tandem duplication was identified in the gene EJ2; this was found to be associated with the
higher fruit yields observed in some plants28. The majority of structural variants identified
were sample-specific and many had been missed using short-read sequencing, which struggles to
resolve large-scale genomic differences.
‘Short-read sequencing has proven valuable for single nucleotide polymorphism discovery, but
lacks power for more complex structural variants’27
Such large-scale sequencing efforts will enable the rapid and high-throughput genome-wide
characterisation of a wide variety of tomato species, and potentially provide the genetic
information needed to enrich for desired genome traits and increase the efficiency of crop
breeding in the future.
Figure 8
Nanopore sequencing results from 12 tomato genomes revealed (a) approximately 25,000–45,000
SVs per genome, the majority of which were deletions and insertions, and (b) most SVs were
specific to each sample. Figures courtesy of Prof. Michael Schatz, Johns Hopkins University,
USA.

CASE STUDY 4
Assembling the genome
of an extremely
drug-resistant pathogen
Globally, with over 10 million new cases and 1.6 million deaths each year, and a growing
number of multi- and extremely-drug resistant strains, tuberculosis (TB) is a major threat to
human health29. Gaining a comprehensive understanding of genetic mechanisms
contributing to its transmission and resistance is therefore paramount for combatting drug-
resistant TB in endemic settings.
Short-read sequencing has been used for Mycobacterium tuberculosis (MTB) assembly previously,
yet it has limited ability to resolve long, repetitive regions and structural variants, and is
hindered by the relatively high GC content (65%) of the MTB genome.
Using the Oxford Nanopore MinION, Bainomugisa et al. performed whole-genome sequencing of the
modern Beijing lineage strain of TB responsible for drug resistance outbreaks in the Western
Province of Papua New Guinea30. The team sequenced and assembled a high-quality, complete
genome, including all PE/PPE (proline-glutamate/proline-proline-glutamate) gene families,
which are thought to contribute to virulence but are often excluded from short-read
sequencing analysis due to their repetitive and GC-rich nature.
With a sequencing depth of 238x obtained from a single MinION Flow Cell, a final assembly of
4,404,064 bp was obtained in a single contig with >99.9% accuracy. Furthermore, 166/168
PE/PPE genes had 100% breadth of coverage at 299.87x depth, compared to a short-read dataset
of this strain where only 92/168 of these genes obtained 100% coverage at an average depth of
46.3x. Finally, genome-wide variant calling identified three deleted genic regions and a
4,490 bp insertion that were both absent from the H37Rv reference strain genome (Figure 9).
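The two coverage measures used above, breadth and depth, can be illustrated with a small Python sketch. It assumes a list of per-base read depths is already available for a gene (for example, exported from an alignment viewer); the function name breadth_and_mean_depth, the min_depth parameter, and the depth values are hypothetical and are not data from the study.

```python
def breadth_and_mean_depth(per_base_depth, min_depth=1):
    """Return (breadth, mean depth) for a region.

    breadth: fraction of positions covered by at least `min_depth` reads
    mean depth: average number of reads covering each position
    """
    n_positions = len(per_base_depth)
    if n_positions == 0:
        raise ValueError("the region must contain at least one position")
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / n_positions, sum(per_base_depth) / n_positions


# Made-up depths for two four-base regions
fully_covered = breadth_and_mean_depth([5, 5, 5, 5])
partly_covered = breadth_and_mean_depth([10, 12, 0, 8])
```

A gene counts towards the "100% breadth" totals quoted above only when every position is covered, as in the first example; the second region has a breadth of 75% despite a mean depth of 7.5x.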
Figure 9
Integrative Genomic Viewer
(IGV) alignments of short-
read data from different
M. tuberculosis lineages
(A-G) mapped against the
nanopore draft genome. Left
image: a 4,490 bp insertion
spanning 7 annotated genes.
Right image: a 390 bp
insertion. Image courtesy
of Dr. Lachlan Coin, The
University of Queensland,
Australia.
CASE STUDY 5
Assembling metagenomes
Analysis of complex metagenomic samples has wide-reaching applications, with the
potential to recover genomes from previously unexplored microbial species, such as those
comprising gut microbiomes, as well as in the context of outbreak surveillance. Short-read
sequencing and assembly of metagenomes is challenging due to the difficulty in assigning
short reads to the correct genome of origin.
Bertrand and colleagues introduced the first hybrid metagenome assembler, OPERA-MS31.
Applying this tool to the gut metagenomes from 28 antibiotic-treated individuals, the team
demonstrated that the integration of long nanopore sequencing reads with short-read data
provided a 200-fold improvement in assembly contiguity (Figure 10), and enabled completion of
over 80 plasmid or phage sequences. Such high-quality assemblies are likely to provide a
comprehensive insight into the gut resistome.
‘Portable metagenomic sequencing of genetically diverse RNA viruses on the MinION…and with no
pathogen-specific enrichment, is shown to be a feasible methodology enabling a real-time
characterization of potential outbreaks in the field’32
In the context of outbreak surveillance, Liana Kafetzopoulou and her team performed in-field
metagenomic sequencing and analysis during a major Lassa fever virus (LASV) outbreak32. In
2018, the Nigerian Lassa fever season experienced the largest ever upsurge in cases, raising
fears of an emergent strain with increased transmission rate. The LASV genome is highly
variable and therefore, for genome analysis, an unbiased metagenomic approach is preferable
to targeted amplicon or capture-based whole-genome sequencing methods.
Benefiting from the speed and portability of real-time sequencing with the MinION, the team
sequenced 120 infected human samples over 7 weeks at the epicentre of the outbreak. An
average of 4.26% of the sequencing reads were LASV, which was sufficient for phylogenetic
comparison of at least one genomic segment in 91/120 samples tested. Hepatitis A virus
co-infection was also detected in one sample, comprising 0.1% of reads which provided 74%
genome coverage and 20x depth, using the Centrifuge metagenomic classification software. This
demonstrated the potential of their simple approach for identifying multiple RNA viruses,
even within the same sample. They identified that rodent hosts were the main source of the
upsurge in cases, as opposed to person-to-person transmission, with no strong evidence of a
new emerging strain. The results were immediately reported to the WHO and Nigerian
authorities, supporting a rapid public health response to the outbreak.
Figure 10
Increase in assembly
contiguity as a function
of read coverage for a
representative short-read
assembler (a), long-read
assembler (b), and the hybrid
OPERA-MS assembler (c).
Unassembled genomes
are shown as circles with
black borders. Adapted from
Bertrand (2019)31.

Summary
Oxford Nanopore sequencing technology, which is capable of generating ultra-long reads
in excess of 4 Mb4, offers a real solution to the challenges associated with genome
assembly, such as accurately sequencing and mapping repetitive sequences and
structural variants, improving the completeness of genome assemblies where short-read
sequencing has failed6,7,10,30. Long nanopore sequencing reads offer unique potential for
genome assembly across a range of applications, from human and plant disease research
to bacterial metagenomics.
It is clear that the growing usage of Oxford Nanopore sequencing technology and
associated improvements in genome assemblies will bring additional and rapid insight
into genomics, further refining the relationship of phenotype to genotype.
About Oxford Nanopore Technologies
Oxford Nanopore Technologies introduced the world’s first nanopore DNA sequencer, the MinION
(a portable, real-time, low-cost device), followed by the larger GridION, PromethION P24 and
P48 devices, and smaller Flongle (Figure 11). The latest addition to the range, MinION Mk1C,
combines the portability and power of the MinION with high-performance compute and an
integrated touchscreen, providing a complete, go-anywhere solution for nanopore sequencing.
Enabling the generation of long sequencing reads, Oxford Nanopore platforms deliver
high-quality whole-genome sequencing with the facility to span repetitive regions and
structural variants, offering significant advantages over traditional short-read sequencing
technology.
A number of protocols are available for nanopore sequencing, enabling optimised whole-genome
analysis for a range of sample types and DNA input amounts. Our dedicated whole-genome
sequencing Getting Started guides, available on the Resource centre
(nanoporetech.com/resource-centre), have been developed to provide further details and
sample-to-answer guidance on genome assembly with Oxford Nanopore long reads.
Find out more at: www.nanoporetech.com
Figure 11
Oxford Nanopore sequencing platforms (from left to right): Flongle, a flow cell adapter for MinION and GridION; the portable
MinION and the latest addition, MinION Mk1C; GridION, with capacity for five Flongle or MinION Flow Cells; and the high-
throughput PromethION (P24 or P48) platform.
References
1. Mak, A. C. et al. Genome-Wide structural variation detection by genome mapping on nanochannel arrays. Genetics 202(1):351–62 (2016).
2. Latorre-Perez, A. et al. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. BioRxiv 722405 (2019).
3. Schatz, M. De novo assembly of complex genomes using single molecule sequencing. Presentation. Available at: http://schatzlab.cshl.edu/presentations/2014-01-14.PAG.Single%20Molecule%20Assembly.pdf [Accessed: 3 October 2019]
4. Oxford Nanopore Technologies. Ultra-Long DNA Sequencing Kit. Available at: https://store.nanoporetech.com/ultra-long-dna-sequencing-kit.html [Accessed: 23 August 2021]
5. Chaisson, M. J. P. et al. Genetic variation and the de novo assembly of human genomes. Nat. Rev. Genet. 16(11):627–40 (2015).
6. Eichler, E. Genetic Variation, Comparative Genomics, and the Diagnosis of Disease. N. Engl. J. Med. 381:64-74 (2019).
7. Miga, K. Telomere-to-telomere assembly of a complete human X chromosome. Presentation. Available at: https://nanoporetech.com/resource-centre/telomere-telomere-assembly-complete-human-x-chromosome [Accessed: 9 September 2019]
8. Shafin, K. et al. Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit. BioRxiv 715722 (2019).
9. Zarrei, M. et al. A copy number variation map of the human genome. Nat. Rev. Genet. 16(3):172–83 (2015).
10. Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20(1):97 (2019).
11. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27(5):677-685 (2017).
12. Jansen, H. De novo genome assembly with MinION long reads. Presentation. Available at: https://community.nanoporetech.com/posts/136 [Accessed: 30 September 2019]
13. Sovic, I. et al. Github: Racon [Online]. Available at: https://github.com/lbcb-sci/racon [Accessed: 9 September 2019]
14. Simpson, J. Github: Nanopolish [Online]. Available at: https://github.com/jts/nanopolish [Accessed: 9 September 2019]
15. Cherukuri, Y. and Janga, S. C. Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches. BMC Genomics 22:17 Suppl 7:507 (2016).
16. Lin, Y. et al. Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA 113(52):E8396-E8405 (2016).
17. Judge, K. et al. Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology. Microbial Genomics doi: 10.1099/mgen.0.000085 (2016).
18. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27:722-736 (2017).
19. Kolmogorov, M. et al. Assembly of long, error-prone reads using repeat graphs. Nat. Biotech. 37:540-546 (2019).
20. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103–10 (2016).
21. Alonge, M. et al. Fast and accurate reference-guided scaffolding of draft genomes. BioRxiv 519637 (2019).
22. Ruan, J. Available at: https://github.com/ruanjue/smartdenovo [Accessed: 5 January 2017].
23. Wick, R. R. et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 13(6):e1005595 (2017).
24. Ruan, J. and Li, H. Fast and accurate long-read assembly with wtdbg2. BioRxiv 530972 (2019).
25. Sakamoto, Y. et al. Long read sequencing reveals a novel class of structural aberrations in cancers: identification and characterization of cancerous local amplifications. BioRxiv 620047 (2019).
26. Sone, J. et al. Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat. Genet. 51(8):1215-1221 (2019).
27. Schatz, M. 100 genomes in 100 days: The structural variant landscape in tomato genomes. Available at: https://nanoporetech.com/resourcecentre/michael-schatz-100-genomes-100-days-structural-variantlandscape-tomato-genomes [Accessed: 7 May 2019]
28. Soyk, S. et al. Duplication of a domestication locus neutralized a cryptic variant that caused a breeding barrier in tomato. Nat. Plants 5(5):471-479 (2019).
29. WHO (2018) Tuberculosis [Online]. Available at: https://www.who.int/en/news-room/fact-sheets/detail/tuberculosis [Accessed: 17 September 2019]
30. Bainomugisa, A. et al. A complete high-quality MinION nanopore assembly of an extensively drug-resistant Mycobacterium tuberculosis Beijing lineage strain identifies novel variation in repetitive PE/PPE gene regions. Microb. Genom. doi: 10.1099/mgen.0.000188 (2018).
31. Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37(8):937-944 (2019).
32. Kafetzopoulou, L. E. et al. Metagenomic sequencing at the epicenter of the Nigeria 2018 Lassa fever outbreak. Science 363(6422):74-77 (2019).

Oxford Nanopore Technologies
phone +44 (0)845 034 7900
email sales@nanoporetech.com
twitter @nanopore
www.nanoporetech.com
Oxford Nanopore Technologies, the Wheel icon, Flongle, GridION, MinION, PromethION, and VolTRAX are registered
trademarks of Oxford Nanopore Technologies in various countries. All other brands and names contained are the property of
their respective owners. © 2021 Oxford Nanopore Technologies. All rights reserved. Flongle, GridION, MinION, PromethION,
and VolTRAX are for research use only.
WP_1043(EN)_V3_24Aug2021
