Professional Documents
Culture Documents
Li - 2018 - Minimap2 Pairwise Alignment For Nucleotide Sequen
Li - 2018 - Minimap2 Pairwise Alignment For Nucleotide Sequen
doi: 10.1093/bioinformatics/bty191
Advance Access Publication Date: 10 May 2018
Original Paper
Sequence analysis
Abstract
Motivation: Recent advances in sequencing technologies promise ultra-long reads of 100 kb in
average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb
in length. Existing alignment programs are unable or inefficient to process such data at scale,
which presses for the development of new alignment algorithms.
Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA
sequences against a large reference database. It works with accurate short reads of 100 bp in
length, 1 kb genomic reads at error rate 15%, full-length noisy Direct RNA or cDNA reads
and assembly contigs or closely related full chromosomes of hundreds of megabases in length.
Minimap2 does split-read alignment, employs concave gap cost for long insertions and dele-
tions and introduces new heuristics to reduce spurious alignments. It is 3–4 times as fast as
mainstream short-read mappers at comparable accuracy, and is 30 times faster than long-
read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in
one type of alignment.
Availability and implementation: https://github.com/lh3/minimap2
Contact: hengli@broadinstitute.org
Supplementary information: Supplementary data are available at Bioinformatics online.
C The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
V 3094
Aligning nucleotide sequences with minimap2 3095
2 Materials and methods found at the previous step may have significant or complete overlaps
due to repeats in the reference (Li and Durbin, 2010). Minimap2
Minimap2 follows a typical seed-chain-align procedure as is used
used the following procedure to identify primary chains that do not
by most full-genome aligners. It collects minimizers (Roberts et al.,
greatly overlap on the query.
2004) of the reference sequences and indexes them in a hash table,
Let Q be an empty set initially. For each chain from the best to the
with the key being the hash of a minimizer and the value being a
worst according to their chaining scores: if on the query, the chain
list of locations of the minimizer copies. Then for each query
overlaps with a chain in Q by 50% or higher percentage of the shorter
sequence, minimap2 takes query minimizers as seeds, finds exact
chain, mark the chain as secondary to the chain in Q; otherwise, add
matches (i.e. anchors) to the reference, and identifies sets of coli-
the chain to Q. In the end, Q contains all the primary chains. We did
near anchors as chains. If base-level alignment is requested, mini-
not choose a more sophisticated data structure (e.g. range tree or k-d
map2 applies dynamic programming (DP) to extend from the ends
tree) because this step is not the performance bottleneck.
of chains and to close regions between adjacent anchors in chains.
For each primary chain, minimap2 estimates its mapping quality
Minimap2 uses indexing and seeding algorithms similar to mini-
with an empirical formula:
map (Li, 2016), and furthers the predecessor with more accurate
chaining, the ability to produce base-level alignment and the support mapQ ¼ 40 ð1 f2 =f1 Þ minf1; m=10g log f1
It achieves this higher sensitivity by indexing 1/3 fewer minimizers, With Eij Hij , we have
which further helps performance. HPC-based indexing reduces the
q e xij maxfq; 0g e ¼ e
sensitivity for current ONT reads, though.
and similar inequations for yij, x~ij and y~ij . In addition,
2.2 Aligning genomic DNA uij ¼ zij vi1;j maxfxi1;j ; x~i1;j g q e
2.2.1 Alignment with 2-piece affine gap cost
Minimap2 performs DP-based global alignment between adjacent As the maximum value of zij ¼ Hij Hi1;j1 is M, the maximal
anchors in a chain. It uses a 2-piece affine gap cost (Gotoh, 1990): matching score, we can derive
Without losing generality, we always assume q þ e < q~ þ e~. On the In conclusion, in Equation (5), x and y are bounded by ½q e; e,
condition that e > e~, it applies cost q þ jlj e to gaps shorter than d x~ and y~ by ½~q e~; ~
e, and u and v by ½q e; M þ q þ e. When
ðq~ qÞ=ðe e~Þe and applies q~ þ jlj e~ to longer gaps. This scheme 128 q e < M þ q þ e 127, each of them can be stored
helps to recover longer insertions and deletions (INDELs). as a 8-bit integer. This enables 16-way SSE vectorization regardless
(Daily, 2016). Without banding, our implementation is slower than Let T be the reference sequence. d(i) is computed as
Edlib (Sosic and
Sikic, 2017), but with a 1000 bp band, it is consid- 8
>
> 0 if T ½i þ 1; i þ 3 is GTA or GTG
erably faster. When performing global alignment between anchors, <
we expect the alignment to stay close to the diagonal of the DP dðiÞ ¼ p=2 if T ½i þ 1; i þ 3 is GTC or GTT
>
>
:
matrix. Banding is applicable most of the time. p otherwise
et al., 2011)]. Minimap2 is 3–4 times as fast as Bowtie2 and BWA- helps to reduce downstream base-level alignment of candidate
MEM, but is 1.3 times slower than SNAP. Minimap2 is more chains, which is still several times slower than chaining even with
accurate on this simulated dataset than Bowtie2 and SNAP but less the Suzuki–Kasahara improvement. In addition, taking a general
accurate than BWA-MEM (Fig. 1b). Closer investigation reveals form, minimap2 chaining can be adapted to non-typical data types
that BWA-MEM achieves a higher accuracy partly because it tries to such as spliced reads and multiple reads per fragment. This gives us
locally align a read in a small region close to its mate. If we disable the opportunity to extend the same base algorithm to a variety of
this feature, BWA-MEM becomes slightly less accurate than mini- use cases.
map2. We might implement a similar heuristic in minimap2 in Modern mainstream aligners often use a full-text index, such as
future. suffix array or FM-index, to index reference sequences. An advant-
To evaluate the accuracy of minimap2 on real data, we aligned age of this approach is that we can use exact seeds of arbitrary
human reads (AC: ERR1341796) with BWA-MEM and minimap2, lengths, which helps to increase seed uniqueness and reduce unsuc-
and called SNPs and small INDELs with GATK HaplotypeCaller cessful extensions. Minimap2 indexes reference k-mers with a hash
v3.5 (Depristo et al., 2011). This run was sequenced from experi- table instead. Such fixed-length seeds are inferior to variable-length
mentally mixed CHM1 and CHM13 cell lines. Both of them are seeds in theory, but can be computed much more efficiently in prac-
tice. When a query sequence has multiple seed hits, we can afford to
Irimia,M. and Roy,S.W. (2008) Evolutionary convergence on highly-conserved 3’ Marçais,G. et al. (2018) MUMmer4: a fast and versatile genome alignment
intron structures in intron-poor eukaryotes and insights into the ancestral system. PLoS Comput. Biol., 14, e1005944.
eukaryotic genome. PLoS Genet., 4, e1000148. Ono,Y. et al. (2013) PBSIM: pacBio reads simulator—toward accurate
Iwata,H. and Gotoh,O. (2012) Benchmarking spliced alignment programs genome assembly. Bioinformatics, 29, 119–121.
including Spaln2, an extended version of Spaln that incorporates additional Roberts,M. et al. (2004) Reducing storage requirements for biological
species-specific features. Nucleic Acids Res., 40, e161. sequence comparison. Bioinformatics, 20, 3363–3369.
Jain,M. et al. (2018) Nanopore sequencing and assembly of a human genome Robinson,J.T. et al. (2011) Integrative genomics viewer. Nat. Biotechnol., 29,
with ultra-long reads. Nat. Biotechnol., 36, 338–345. 24–26.
Langmead,B., and Salzberg,S.L. (2012) Fast gapped-read alignment with Sedlazeck,F.J. et al. (2018) Accurate detection of complex structural variations
Bowtie 2. Nat. Methods, 9, 357–359. using single-molecule sequencing. Nat. Methods, doi:10.1038/s41592-
Li,H. (2013) Aligning sequence reads, clone sequences and assembly contigs 018-0001-7.
with BWA-MEM. arXiv, 1303.3997.
Sosic,M., and Sikic,M. (2017) Edlib: a C/Cþþ library for fast, exact sequence
Li,H. (2016) Minimap and miniasm: fast mapping and de novo assembly for alignment using edit distance. Bioinformatics, 33, 1394–1395.
noisy long sequences. Bioinformatics, 32, 2103–2110. Sovic,I. et al. (2016) Fast and sensitive mapping of nanopore sequencing reads
Li,H., and Durbin,R. (2010) Fast and accurate long-read alignment with with GraphMap. Nat. Commun., 7, 11307.
Burrows-Wheeler transform. Bioinformatics, 26, 589–595. Suzuki,H., and Kasahara,M. (2018) Introducing difference recurrence rela-