Sequence Alignment Algorithm Overview

Ruprecht Karl University of Heidelberg
Implementation of the Smith-Waterman algorithm in OpenCL for GPUs
Dzmitry Razmyslovich (1), Guillermo Marcus (1), Markus Gipp (1), Marc Zapatka (2) and Andreas Szillus (2)
(1) Institute of Computer Engineering (ZITI), University of Heidelberg, Mannheim, Germany
Main contact: dzmitry.razmyslovich@ziti.uni-heidelberg.de
Other contacts: see http://li5.ziti.uni-heidelberg.de
(2) German Cancer Research Center, Heidelberg, Germany
Email: m.zapatka, a.szillus@dkfz-heidelberg.de
Abstract
We present an implementation of the Smith-Waterman algorithm in OpenCL. The implementation computes similarity indexes between query sequences and a reference sequence, with or without sequence alignment paths. In accordance with the requirements of the target application in cancer research, it can process very long reference sequences (on the order of millions of nucleotides).
Biological Problem Description
All cancers result from changes (aberrations) that have occurred in the DNA sequence of the genomes of cancer cells. Even the most complex aberrations can be identified with the Smith-Waterman algorithm by processing a long reference sequence and short query sequences. The former is a genome sequence, which can be rather long, while the latter are the product of second-generation sequencing technology. The basic problem solved by the implementation presented here is the alignment of short query sequences along a long reference sequence.
Smith-Waterman Algorithm
The idea of alignment lies in filling the n × m matrix H, the similarity matrix, where n is the number of elements in a query sequence and m is the number of elements in a reference sequence. The values of the matrix are computed using dynamic programming according to formula 1. Each value H[i, j] is the measure of similarity of two subsequences: the query sequence up to the i-th element and the reference sequence up to the j-th element.
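The matrix fill described above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's OpenCL kernel: the scoring values (`match`, `mismatch`) and the single `gap` value used for both gap terms are assumptions chosen for the example, and the function names are hypothetical.

```python
def similarity_matrix(query, reference, match=2, mismatch=-1, gap=-1):
    """Fill the (n+1) x (m+1) similarity matrix H by dynamic programming.

    The scoring values are illustrative assumptions, not the poster's.
    """
    n, m = len(query), len(reference)
    # Row 0 and column 0 stay zero: these are the boundary conditions.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == reference[j - 1] else mismatch
            H[i][j] = max(
                0,                    # a local alignment may restart anywhere
                H[i - 1][j - 1] + s,  # pair query element i with reference element j
                H[i - 1][j] + gap,    # gap term (assumed single penalty)
                H[i][j - 1] + gap,    # gap term (assumed single penalty)
            )
    return H


def best_score(H):
    """Return the largest value in H, i.e. the best local similarity."""
    return max(max(row) for row in H)
```

For example, `best_score(similarity_matrix("ACGT", "TTACGTTT"))` gives 8 under these assumed scores, because the whole query matches a piece of the reference exactly.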
The size of the data limits the possibility of storing the whole matrix, so the paths must be computed online: the paths are calculated for the already calculated part of the matrix, and the matrix is truncated concurrently with the computation of a new piece of the matrix. Choosing a nucleotide of a query sequence as the parallelization grain makes online computation possible.
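The reason a long reference can be processed piece by piece is that a new piece of the matrix depends only on the last column of the part already calculated. The sketch below illustrates this for similarity values only; it does not reproduce the path calculation or the truncation scheme of the implementation. The function names, the `window` parameter and the scoring values are assumptions made for the example.

```python
def advance(query, piece, last_col, match=2, mismatch=-1, gap=-1):
    """Extend the matrix over one piece of the reference.

    Only the previous column is kept, so memory does not grow with the
    reference length. Returns the new last column and the best value
    seen inside this piece.
    """
    n = len(query)
    best = 0
    prev = last_col
    for r in piece:
        col = [0] * (n + 1)  # col[0] is the zero boundary row
        for i in range(1, n + 1):
            s = match if query[i - 1] == r else mismatch
            col[i] = max(0, prev[i - 1] + s, prev[i] + gap, col[i - 1] + gap)
        best = max(best, max(col))
        prev = col
    return prev, best


def streaming_score(query, reference, window=4, **scores):
    """Best similarity value, computed window by window along the reference."""
    last_col = [0] * (len(query) + 1)  # zero boundary column
    best = 0
    for start in range(0, len(reference), window):
        piece = reference[start:start + window]
        last_col, piece_best = advance(query, piece, last_col, **scores)
        best = max(best, piece_best)
    return best
```

The result does not depend on the window size: each piece continues exactly where the previous one stopped.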
Step 1. The long reference sequence processing and online computation
Step 2. The diamond calculation shape
Step 3. The multiquery processing
Step 4. The concurrent data transfer and kernel execution
The no path calculating model
Conclusion
Implementation strengths:
Principal advantage: calculation of alignment paths;
Efficient processing of long reference sequences (up to 28 million nucleotides in the tests);
High performance characteristics:
  on the GTX 480 (Fermi): competitive with the CUDASW++ v2.0.1 implementation and 4.5x as fast as Farrar's implementation;
  on the GTX 260: competitive with Farrar's implementation and 3x as fast as the CUDASW++ v2.0 implementation;
  the acceleration in comparison with our CPU implementation is 14.5x for the path calculating version and 610x for the no path calculating version;
Heterogeneous platform independence.
Results
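Formula 1:

$$
\begin{aligned}
H[i, 0] &= 0, \quad 0 \le i \le n, \\
H[0, j] &= 0, \quad 0 \le j \le m, \\
H[i, j] &= \max \left\{ \begin{array}{l}
0 \\
H[i-1, j] + \mathit{IF} \\
H[i, j-1] + \mathit{RF} \\
H[i-1, j-1] + S(i, j)
\end{array} \right. , \quad 1 \le i \le n, \; 1 \le j \le m.
\end{aligned}
$$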
In most cases, a vast amount of sequencer data must be filtered at the beginning. This task requires only the similarity values.
Comparison of the OpenCL implementation with other implementations
The possibility of faster computation of solely similarity values
No path calculating version
Initialization performs the calculations needed to make the wavefront technique usable;
Calculating computes the whole matrix excluding the heading and the ending;
Finishing computes the ending and saves the results.
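The wavefront technique relies on the fact that all cells on one anti-diagonal of the matrix (equal i + j) depend only on the two preceding anti-diagonals, so they can be computed independently of each other. The Python sketch below shows this ordering on a single thread; it does not reproduce the split into Initialization, Calculating and Finishing kernels, and the function name and scoring values are assumptions made for the example.

```python
def wavefront_fill(query, reference, match=2, mismatch=-1, gap=-1):
    """Fill the similarity matrix one anti-diagonal (wavefront) at a time.

    Returns the matrix and the list of wavefronts in processing order.
    """
    n, m = len(query), len(reference)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    order = []
    for d in range(2, n + m + 1):  # d = i + j for the cells of one wavefront
        i_lo, i_hi = max(1, d - m), min(n, d - 1)
        front = [(i, d - i) for i in range(i_lo, i_hi + 1)]
        # Every cell in `front` reads only wavefronts d-1 and d-2, so on a
        # GPU these cells could be assigned to parallel work-items.
        for i, j in front:
            s = match if query[i - 1] == reference[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
        order.append(front)
    return H, order
```

With a query of length n and a reference of length m there are n + m - 1 wavefronts; the first ones grow in size, the last ones shrink, and the ones in between are full-sized.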
[Figure: the Initialization, Calculating and Finishing phases.]
[Figure: process timeline with GPU and CPU rows. Legend: calculation of a new piece of the matrix; calculation of a new optimal path together with truncating the current matrix.]
[Figure: Computation time. Average computation time per query (ms) versus reference sequence length (the number of nucleotides: 7840, 14980, 49980, 62930). Series: Farrar (GTX 260), OpenCL without paths (GTX 260), CudaSW++ v2.0 (GTX 260).]
[Figure: the diamond calculation shape over the matrix, with Query and Reference axes and numbered cells. Legend: the values calculated with the preprocessing kernel function; the values calculated with the main kernel function.]
Wavefront:
0  0    0    0    0
0  h11  h12  h13  h14
0  h21  h22  h23
0  h31  h32
0  h41
[Figure: average computation time per query (ms) versus reference sequence length (the number of nucleotides, 0 to 3.00E+07). Series: GeForce GTX 260 (with paths), GeForce GTX 260 (without paths), GeForce GTX 480 Fermi (with paths), GeForce GTX 480 Fermi (without paths).]
[Figure: average computation time per query (ms) versus reference sequence length (the number of nucleotides: 7840, 14980, 49980, 62930). Series: Farrar (GTX 480), OpenCL without paths (GTX 480), CudaSW++ v2.0.1 (GTX 480).]
0 0 0 + | | | | | | ( ) i q sign j i H j i H = , , =
[Figure: the multiquery processing. Query 1, Query 2, ..., Query k each have their own similarity matrix with the reference sequence; the figure shows these matrices together with rows of zeros.]
Test space includes:
a reference sequence: the sequence of chromosome 21, circa 28 million nucleotides in length (from NCBI Build 36 of the human reference assembly);
query sequences: a set of reads of equal length (36 nucleotides) from an Illumina Genome Analyzer.
We use the computation time to measure performance. The computation time includes:
the kernel execution scheduling time,
the kernel execution time,
the device-to-host data transfer time (for the version with path calculation),
the path calculation time (for the version with path calculation),
the device-to-host result transfer time (for the version without path calculation).
Test platform:
the NVIDIA GeForce GTX 260 GPU with 1.75 GB of RAM, 30 multiprocessors and 216 cores;
the NVIDIA GeForce GTX 480 GPU with 1.5 GB of RAM, 15 multiprocessors and 480 cores;
the Intel i7-920 CPU;
6 GB of RAM;
Linux with the NVIDIA GPU Computing SDK 3.1 installed.
A bandwidth of 5.9 GB/s was reached in the tests. This figure does not depend on the test data.
[Figure: online computation of the paths, with labels R1, R2, P1, P2, P3 and J.]
Legend:
the already calculated similarity matrix for a reference sequence R1 and the query sequence Q;
a newly calculated piece of the matrix, which together with the already calculated matrix forms the similarity matrix for a reference sequence R2 and the query sequence Q;
the useless part of the matrix (it can be truncated just before the calculation of the new optimal path);
P1: the optimal path for the already calculated matrix;
P2: the optimal path for the similarity matrix for the reference sequence R2 and the query sequence Q;
P3: another optimal path for the renewed matrix; P3 is assembled by merging the part of P2 from the end of the renewed matrix to the junction J with the part of P1 from J to the top row of the old matrix.
Calculating a window of the matrix while the previous window is transferred and processed is possible because the GPU DMA controller and the GPU multiprocessors work independently. Since path calculation and matrix block calculation are performed on different devices, these tasks can also be handled concurrently. To overlap data transfer and kernel execution, a ring buffer is allocated in device memory.
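The overlap of window calculation with transfer and path processing can be illustrated with a producer and a consumer that share a bounded buffer. The sketch below is only a host-side analogy in Python: a thread stands in for the GPU, a bounded `queue.Queue` stands in for the ring buffer in device memory, and `run_pipeline`, `compute`, `consume` and `slots` are hypothetical names introduced for the example. It is not the OpenCL mechanism used in the implementation.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(windows, compute, consume, slots=2):
    """Compute windows in one thread while another consumes finished ones.

    The bounded queue plays the role of a ring buffer with `slots` entries:
    the producer blocks when every slot is occupied, and the consumer blocks
    when no finished window is available.
    """
    ring = queue.Queue(maxsize=slots)
    results = []

    def producer():
        for w in windows:
            ring.put(compute(w))  # stands in for "kernel execution"
        ring.put(_DONE)

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        block = ring.get()        # stands in for "transfer to the host"
        if block is _DONE:
            break
        results.append(consume(block))  # stands in for "path calculation"
    worker.join()
    return results
```

With a single producer and a single consumer the windows are consumed in the order in which they were computed, so the results stay aligned with the input windows.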
Process stages:
Host initialization
Input data transfer
Scheduling running
Precalculation kernel
Calculation kernel
Matrix transfer to host memory
Path calculations
Printing results
[Figure: two timelines (t) with rows labelled GPU, CPU, Paths and Memory transfer; one of them shows Transfer and Calculation blocks.]
