Ji et al. - 2015 - One-dimensional pairwise CNN for the global alignm
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
ARTICLE INFO

Article history:
Received 17 September 2013
Received in revised form 30 July 2014
Accepted 7 August 2014
Communicated by R.W. Newcomb
Available online 23 August 2014

Keywords:
Cellular neural network
DNA sequences
State selection array
Global alignment
Similarity computation

ABSTRACT

The cellular neural network (CNN) is one of the classic artificial neural networks. Over the past decades, however, one-dimensional CNN models and their applications have received little attention. For this reason, this paper proposes a simplified one-dimensional CNN model and then designs a pairwise network using this model to demonstrate its applicability. This pairwise CNN consists of two parallel one-dimensional CNNs, a fixed master and a movable slave. Using this pairwise CNN, an algorithm is developed to perform the global alignment of two DNA sequences. In this algorithm, the slave moves forward step by step while the cell states of the master are computed. From the states obtained at all time steps, a state selection array is generated, and a global alignment path is then determined from this array. Under the guidance of the alignment path, the two DNA sequences are globally aligned by inserting blank spaces at the appropriate positions. Experiments on aligning DNA sequences from the publicly available NCBI databases are carried out with this method and compared with two other methods. In terms of computation time and similarity, these experiments indicate that the proposed one-dimensional CNN model is effective and that the alignment algorithm based on a pairwise CNN of the model is efficient, obtaining higher similarity with less computation time than the other two.
© 2014 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2014.08.023
0925-2312/© 2014 Elsevier B.V. All rights reserved.
506 L. Ji et al. / Neurocomputing 149 (2015) 505–514
function f(x) for the cell output is defined in the following form:

\[
f(x_{ij}(t)) = \frac{1}{2}\bigl(|x_{ij}(t)+1| - |x_{ij}(t)-1|\bigr) =
\begin{cases}
1, & x_{ij}(t) \ge 1,\\
x_{ij}(t), & |x_{ij}(t)| < 1,\\
-1, & x_{ij}(t) \le -1.
\end{cases}
\tag{2}
\]

Besides the classic two-dimensional CNNs, some one-dimensional CNN models have also been proposed, such as the discrete-time model of Manganaro et al. [14] and the two-layer model of Takahashi et al. [27]. Moreover, Takahashi et al. [28] analyzed the sufficient conditions under which a one-dimensional CNN detects connected components. These one-dimensional models exhibit many novel characteristics and could potentially be applied to engineering problems such as sequence alignment. However, almost no further research has been carried out since then, and work on one-dimensional CNNs has largely fallen silent.

During the past decades, as one of the most important research problems in bioinformatics [29], the similarity computation of DNA sequences has attracted much attention from researchers around the world. As is well known, DNA, which carries the genetic information of living organisms, consists of only four different nucleotide bases: A, C, G and T. It is widely believed that, by comparing the similarity of DNA sequences obtained from different biological samples, their biological structure characteristics can be predicted in advance, their evolutionary paths can be traced, and their biological functions can be accurately identified.

In general, a global alignment process is a critical and necessary step in computing the similarity of two DNA sequences. So far, many classic algorithms have been developed for the global alignment of two DNA sequences, such as the dot plot [30], dynamic programming [29,31], and heuristic algorithms such as FASTA [32] and BLAST [33]. Moreover, as another classic problem, the alignment of multiple DNA sequences is still a very difficult NP problem, and it has not been well solved yet, although some efficient algorithms have been proposed for it, for instance the iteration algorithm in [34], the progressive alignment in [35], and the graph theory approach in [36]. This paper does not consider the alignment of multiple DNA sequences and addresses only the global alignment of two DNA sequences.

During the past decades, although CNNs have been widely used in application fields such as image processing [19], they have not yet been applied to the similarity computation of DNA sequences. This paper proposes a new one-dimensional CNN model that differs from the traditional two-dimensional models, and then designs a one-dimensional pairwise CNN using the proposed model. Based on the one-dimensional pairwise CNN, this paper also develops a global alignment and similarity computation algorithm for two DNA sequences.

In contrast to the conventional two-dimensional CNN models with a row-column cell structure, the proposed one-dimensional CNN model consists of only one cell chain, in which each cell has at most two neighbor cells at any time: a right one and a left one. Furthermore, the pairwise CNN consists of two parallel one-dimensional CNNs, a master sub-network and a slave sub-network. The master is fixed, while the slave is movable. While the pairwise CNN runs, the master sub-network stays in place and the slave moves forward a fixed distance (the critical distance) at each step along the direction parallel to the master.

Using the pairwise one-dimensional CNN, an alignment method is developed for two DNA sequences. First, it generates a state selection array according to the dynamic states and outputs of the cells in the master at all time steps. Then, an alignment path is obtained by back-tracing the states in the array. Finally, under the guidance of the alignment path, the two sequences are globally aligned by inserting blank spaces at the appropriate positions. Moreover, in order to evaluate this algorithm, numerical experiments on DNA sequences from the publicly available NCBI databases have been conducted. Two typical criteria, similarity and computation time, are adopted to evaluate the performance of the algorithm in these experiments.

The remainder of this paper is organized as follows. In Section 2, a simplified one-dimensional CNN structure model is proposed. Section 3 designs a one-dimensional pairwise CNN and develops an algorithm for the global alignment and similarity computation of two DNA sequences using the designed pairwise CNN. In order to evaluate the model and the algorithms, Section 4 presents numerical experiments on the NCBI databases and analyzes the results. Finally, Section 5 concludes this paper.

2. The proposed one-dimensional CNN model

The conventional two-dimensional CNN models usually consist of a cell array with n rows and m columns, in which each cell is mutually connected to its neighbor cells [1,2]. In contrast, the proposed one-dimensional model is made up of only a linear chain of m cells arranged in a single row. Thus, any cell C(i), i ∈ {0, 1, …, m−1}, in the one-dimensional network has only one direct link to each of its two neighbor cells, C(i−1) and C(i+1). Fig. 1(a) and (b) shows the network structure of the model with a linear cell arrangement and the single cell diagram, respectively.

Fig. 1. Illustration of the proposed one-dimensional CNN model. (a) Linear chain network structure; (b) Single cell diagram.

As shown in Fig. 1, $x_i$, $u_i$ and $y_i$ represent the cell state signal, the link input signal, and the cell output signal of the given cell C(i), respectively. $f(x_i)$ is a modulation function for the cell output, which is usually defined as a piecewise-linear function. Moreover, A is a feedback template, which modulates the feedback inputs from the output signals of cell C(i) and its neighbors. Similarly, B is a control template, which modulates the link-input signals from cell C(i) and its neighbors. Furthermore, both C and $R_x$ are constant circuit parameters, which are usually determined empirically. $I_i$ is the cell bias constant (or threshold), and $x_i(0)$ indicates the initial cell state of C(i) at the time step t = 0.

Similar to the two-dimensional models, Eq. (1), the dynamic state of cell C(i) in the one-dimensional CNN can be briefly
described in the following forms:

\[
\begin{cases}
C\,\dfrac{dx_i(t)}{dt} = -\dfrac{1}{R_x}\,x_i(t) + \sum\limits_{k \in N_i(r)} A_k\,y_k(t) + \sum\limits_{k \in N_i(r)} B_k\,u_k + I_i,\\[2ex]
y_i(t) = f(x_i(t), I_i),
\end{cases}
\tag{3}
\]

where t indicates the time step, and k represents the position coordinate of a cell in the one-dimensional CNN. $N_i(r)$ represents the coordinate set of the neighbor cells of C(i), and r is defined as the critical distance, beyond which a cell cannot have any link to its neighbors. $A_k$ and $B_k$ are coefficients from the templates A and B, respectively. $y_k$ and $u_k$ are the output from the neighbor cell C(k) and the link input to C(k), respectively. Moreover, at time t, as shown in Eq. (4), the output modulation function f(·) is redefined as

\[
y_i(t) = f(x_i(t), I_i) =
\begin{cases}
1 & \text{if } x_i(t) = I_i,\\
0 & \text{else.}
\end{cases}
\tag{4}
\]

Unlike Eq. (2), this function is not piecewise-linear. In terms of its output characteristic, Eq. (4) is a two-valued function that gives the cell a pulse output: if $x_i(t)$ reaches the threshold $I_i$ at time t, the cell C(i) outputs a pulse, $y_i(t) = 1$; otherwise it outputs no pulse, $y_i(t) = 0$. Moreover, Eq. (3) should still satisfy the two constraints $|x_i(0)| \le 1$ and $|u_i| \le 1$.

As shown in Fig. 1, the proposed one-dimensional CNN consists of a simple cell chain, so it has a much simpler cell neighbor domain than a two-dimensional CNN. Fig. 2 gives a typical structure of the cell neighbor domain in the proposed one-dimensional model.

Fig. 2. The typical neighbor domain structure of the cell C(i).

In the one-dimensional CNN above, cell C(i) has only two links, one to each of its neighbor cells C(i−1) and C(i+1). So the neighbor domain of C(i) consists of only the three cells C(i−1), C(i+1) and itself, namely $N_i(r) = \{i-1, i, i+1\}$. Furthermore, C(i) receives the two feedback inputs $A_{i-1} y_{i-1}$ and $A_{i+1} y_{i+1}$ from the outputs of its two neighbor cells C(i−1) and C(i+1), respectively. These two feedback inputs are modulated by the feedback template A. At the same time, C(i) also receives the two cell link inputs $B_{i-1} u_{i-1}$ and $B_{i+1} u_{i+1}$ from its neighbors C(i−1) and C(i+1), respectively. These two cell link inputs are modulated by the control template B. In general, cell C(i) also receives two extra signals from itself, the self-feedback input $A_i y_i$ and the self-link input $B_i u_i$.

Eq. (3) is usually used to describe the state dynamics of the cells in a continuous-time CNN. However, in some applications such as digital image processing, a discrete-time CNN is more suitable. From Eq. (3), the discrete dynamics of the one-dimensional model can be approximately deduced as follows:

\[
C\,\frac{x_i(t+\Delta t) - x_i(t)}{\Delta t} = -\frac{1}{R_x}\,x_i(t) + \sum_{k \in N_i(r)} A_k\,y_k(t) + \sum_{k \in N_i(r)} B_k\,u_k + I_i.
\tag{5}
\]

Letting Δt = 1, Eq. (5) is transformed into Eq. (6), which is a discrete model of the one-dimensional CNN:

\[
x_i(t+1) = x_i(t) + \frac{1}{C}\Bigl(-\frac{1}{R_x}\,x_i(t) + \sum_{k \in N_i(r)} A_k\,y_k(t) + \sum_{k \in N_i(r)} B_k\,u_k + I_i\Bigr).
\tag{6}
\]

In Eq. (6), t represents the discrete time step, t = 0, 1, 2, 3, …. All the other symbols and expressions in Eq. (6) have the same meanings as those defined in Eq. (3).

3. The global alignment of two DNA sequences using the CNN

A DNA sequence consists of only four nucleotide bases: adenine, guanine, cytosine and thymine. In bioinformatics, these four nucleotide bases are usually abbreviated as the characters ‘A’, ‘G’, ‘C’ and ‘T’, respectively. In nature, a species has its own distinctive DNA sequences with a unique arrangement of nucleotide bases. It is the differences in the category, quantity and arrangement order of the nucleotide bases that make DNA sequences different from each other. Moreover, the DNA sequence is also regarded as the most direct and essential cause of biological diversity in nature.

Since a DNA sequence consists of only the four fundamental nucleotide bases, any given sequence can be equivalently transformed into a pure character sequence over {A, G, C, T} by replacing each nucleotide base with its corresponding character. So, the global alignment of two DNA sequences can be equivalently transformed into the global alignment of two character sequences. The following content shows how to globally align a pair of DNA sequences using the one-dimensional CNN model, and how to evaluate the alignment performance.

3.1. The designed one-dimensional pairwise CNN

As described in Section 2, the one-dimensional CNN model has a linear cell arrangement structure, as shown in Fig. 1(a). Each cell has at most two linked cells, as illustrated in Fig. 1(b). These characteristics of the model can be used to develop an alignment algorithm for two DNA sequences. To develop such an algorithm, a pairwise CNN is first constructed using the one-dimensional model proposed in Section 2. The designed pairwise CNN differs from traditional CNNs because it consists of only two separate one-dimensional CNNs, as shown in Fig. 3.

Fig. 3. The structure of one-dimensional pairwise CNN, at t = 0.

Fig. 3 shows the pairwise network at the initial state, t = 0. Structurally, this pairwise CNN consists of two separate one-dimensional CNNs, the master sub-network CNN1 with m cells and the slave sub-network CNN2 with n cells. Furthermore, all cells of CNN1 are fixed and represented as $C_1(i)$, i = 0, 1, …, m−1, while the cells of CNN2 are movable and represented as $C_2(j)$, j = 0, 1, …, n−1. As shown in Fig. 3, CNN2 is separate from CNN1 and parallel to it. Moreover, the distance between two adjacent cells in the same CNN and the distance between CNN1 and CNN2 are both fixed as the critical distance r, the same as that in Eq. (6). In addition, CNN1
always keeps fixed at any time step t, t = 0, 1, …, while CNN2 moves forward step by step along the direction parallel to CNN1, advancing a fixed distance r at each step. Moreover, CNN1 is defined as the work center (the master), and the dynamics of CNN2 are ignored in the pairwise CNN. As a result, CNN2 only plays the role of link supplier (the slave) to CNN1 while the pairwise CNN runs.

As shown in Fig. 3, at time t = 0, no cell in CNN1 can link to CNN2 because the distance between $C_1(i)$ and $C_2(j)$ is larger than the critical distance r. Since CNN2 moves forward r at each time step, at t = 1 cell $C_2(0)$ moves to the location exactly under $C_1(0)$. From then on, CNN2 continues to move forward r at each time step t. So at time t = m, $C_2(0)$ moves to the location exactly under $C_1(m-1)$, and at t = m+n−1, $C_2(n-1)$ arrives at the location right under $C_1(m-1)$. Finally, at t = m+n, CNN2 completely separates from CNN1 and no cell in CNN1 can link to CNN2, so the pairwise CNN stops at that time step.

In order to clearly show the movement of CNN2 while the pairwise CNN runs, Fig. 4 illustrates the three critical transition positions of CNN2 at the time steps t = 1, t = m and t = m+n−1. In this figure, the arrows represent the links of a cell in CNN1 to its neighbor cells.

Moreover, a cell of CNN1 can have three typical neighbor domains, as shown in Fig. 5. In this figure, (a) represents the neighbor domain of $C_1(0)$, which consists of three cells, $C_1(0)$, $C_1(1)$ and $C_2(0)$, and occurs at t = 1; (b) represents the neighbor domain of $C_1(i)$, 1 < i < m−1, which consists of four cells, $C_1(i)$, $C_1(i-1)$, $C_1(i+1)$ and $C_2(j)$, and occurs at 1 < t < m+n−1; and (c) represents the neighbor domain of $C_1(m-1)$, which consists of three cells, $C_1(m-1)$, $C_1(m-2)$ and $C_2(n-1)$, and occurs at t = m+n−1.

Furthermore, in the neighbor domain shown in Fig. 5(a), $C_1(0)$ receives the two link inputs and the two feedback inputs from itself and $C_1(1)$, and the link input from $C_2(0)$, ignoring the feedback from the output of $C_2(0)$. Similarly, as shown in Fig. 5(b), $C_1(i)$ simultaneously receives the three link inputs and the three feedback inputs from itself, $C_1(i-1)$ and $C_1(i+1)$, and the link input from $C_2(j)$. As shown in Fig. 5(c), $C_1(m-1)$ receives the two link inputs and the two feedback inputs from itself and $C_1(m-2)$, and the link input from $C_2(n-1)$.

For the pairwise CNN, since CNN1 works as the work center and CNN2 is regarded as the link supplier to CNN1, the state and output of CNN1 at any time step t > 0 can be derived from Eqs. (3) and (4) as

\[
\begin{cases}
x_{1,i}(t+1) = \Bigl(1 - \dfrac{1}{C R_x}\Bigr) x_{1,i}(t) + \dfrac{1}{C}\Bigl(\sum\limits_{C_l(k) \in N_{1,i}(r,t)} A_k\,y_{l,k}(t) + \sum\limits_{C_l(k) \in N_{1,i}(r,t)} B_k\,u_{l,k} + I_{1,i}\Bigr),\\[2ex]
y_{1,i}(t) = f(x_{1,i}(t), I_{1,i}) =
\begin{cases}
1 & \text{if } x_{1,i}(t) = I_{1,i},\\
0 & \text{else,}
\end{cases}
\end{cases}
\tag{7}
\]

where t is the time step, $C_l(k)$ is a neighbor cell of $C_1(i)$, and $x_{1,i}(t)$ and $y_{1,i}(t)$ represent the state and the output of $C_1(i)$ at time t, respectively. $y_{l,k}(t)$ and $u_{l,k}$ (independent of t) are the output of and the link input to $C_l(k)$ at time t, respectively. Moreover, as in Eq. (4), $f(x_{1,i}(t), I_{1,i})$ is still a two-valued function with the pulse output characteristic, in which $I_{1,i}$ serves as the pulse threshold that decides the output of $C_1(i)$. In applications, $u_{l,k}$ usually comes from the initialization of the CNN and keeps its value unchanged afterwards. For instance, in image processing it is initialized using the grey-scale values of the pixels [23]. Similarly, in the alignment algorithm of this paper it is initialized using the numerical codes of the DNA sequences.

The remaining symbols in Eq. (7) have the same meanings as in Eq. (3). Because CNN2 moves forward a fixed distance r at each step, the neighbor domain of a cell in CNN1 changes with time t. So in Eq. (7), $N_{1,i}(r,t)$ is used to indicate the present neighbor domain of $C_1(i)$ at time t. Moreover, as derived from Eq. (3), besides $|x_{1,i}(0)| \le 1$ and $|u_{1,i}| \le 1$, Eq. (7) should also satisfy $|u_{2,j}| \le 1$, considering the link input that CNN2 supplies to CNN1.
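The movement of the slave and the update rule of Eq. (7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (slave_under, neighbor_domain, next_state, pulse_output) are hypothetical, and the index relation j = t − 1 − i is inferred from the statements that C2(0) lies under C1(0) at t = 1 and that CNN2 advances by r at each step.

```python
from typing import List, Optional, Tuple


def slave_under(i: int, t: int, n: int) -> Optional[int]:
    """Index j of the slave cell C2(j) directly under master cell C1(i) at step t.

    Assumption: C2(0) is under C1(0) at t = 1 and the slave advances one
    cell spacing per step, hence j = t - 1 - i. Returns None if no slave
    cell is under C1(i) at that step.
    """
    j = t - 1 - i
    return j if 0 <= j < n else None


def neighbor_domain(i: int, t: int, m: int, n: int) -> List[Tuple[int, int]]:
    """Neighbor domain of C1(i) at step t as (network, index) pairs."""
    cells = [(1, k) for k in (i - 1, i, i + 1) if 0 <= k < m]
    j = slave_under(i, t, n)
    if j is not None:
        cells.append((2, j))  # link input supplied by the slave
    return cells


def next_state(x: float, feedback_sum: float, link_sum: float, bias: float,
               C: float = 1.0, Rx: float = 1.0) -> float:
    """One step of Eq. (7): x(t+1) from x(t) and the two template sums."""
    return (1.0 - 1.0 / (C * Rx)) * x + (feedback_sum + link_sum + bias) / C


def pulse_output(x: float, threshold: float) -> int:
    """Two-valued output of Eq. (4): 1 only when the state equals the threshold."""
    return 1 if x == threshold else 0


if __name__ == "__main__":
    m, n = 9, 7
    for t in (1, m, m + n - 1, m + n):
        pairs = [(i, slave_under(i, t, n)) for i in range(m)
                 if slave_under(i, t, n) is not None]
        print(t, pairs)
```

With m = 9 and n = 7, the sketch gives a single master-slave pair at t = 1 and at t = m + n − 1, and no pair at t = m + n, which agrees with the stopping condition described above.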
3.2. The alignment using the one-dimensional pairwise CNN
correspondingly quantified as five numerical codes with the values 0, ±0.5 and ±1 (which meet the constraints $|u_{1,i}| \le 1$ and $|u_{2,j}| \le 1$), where ‘*’ represents a blank space.

3.2.1. The initialization of the CNN

Using these five numerical codes, the two DNA sequences S1 and S2 are numerically transformed into $S'_1 = \{-1, -1, 0.5, -0.5, 1, -0.5, 1, 0.5\}$ and $S'_2 = \{-0.5, -1, 0.5, -0.5, -1, 1\}$.

First, CNN1 is initialized as the master network with m = Len(S1) + 1 = 9 cells, and CNN2 is initialized as the slave network with n = Len(S2) + 1 = 7 cells. The link input $u_{1,i}$ of cell $C_1(i)$, i = 0, 1, …, 8, and the link input $u_{2,j}$ of cell $C_2(j)$, j = 0, 1, …, 6, are initialized using the numerical codes of S1 and S2, respectively, as follows:

\[
\begin{cases}
u_{1,i} =
\begin{cases}
S'_1(i-1) & \text{if } 1 \le i \le m-1,\\
0 & \text{if } i = 0,
\end{cases}\\[3ex]
u_{2,j} =
\begin{cases}
S'_2(n-2-j) & \text{if } 0 \le j \le n-2,\\
0 & \text{if } j = n-1,
\end{cases}
\end{cases}
\tag{8}
\]

where $u_{1,i}$ and $u_{2,j}$ represent the link inputs of $C_1(i)$ and $C_2(j)$, respectively. After the initialization, the pairwise CNN in its initial state (namely at t = 0) is obtained, as shown in Fig. 7.

[Fig. 7: CNN1, cells 0 to 8, u1 = 0, −1, −1, 0.5, −0.5, 1, −0.5, 1, 0.5; CNN2, cells 0 to 6, u2 = 1, −1, −0.5, 0.5, −1, −0.5, 0.]

Then, the parameters are set as C = 1, $R_x$ = 1, $I_{1,i}$ = 0, the feedback template A = {0, 0, 0, 0}, the control template B = {0, 1, 0, −1}, $x_{1,i}(0) = 0$ and $x_{2,j}(0) = 0$, and the selection factor template is set as $F = \{F_0, F_1, F_2\} = \{5, 3, 2\}$, where the three prime numbers must be different from each other and satisfy $F_0 > F_1 > F_2$.

3.2.2. The generation of the global alignment path

First, the state selection function φ(·) is defined for the state of cell $C_1(i)$ at time t, 1 ≤ t ≤ 15, where φ(·) follows:

\[
x_{1,i}(t) = \varphi(x_{1,i}(t)) =
\begin{cases}
x_{1,i}(t) & \text{if } i = 0 \text{ and } t = 1,\\
\max(\delta_0, \delta_1, \delta_2) & \text{else,}
\end{cases}
\tag{9}
\]

where the three parameters $\delta_0$, $\delta_1$ and $\delta_2$ follow

\[
\begin{cases}
\delta_0 = x_{1,i-1}(t-1) - F_2,\\
\delta_1 = x_{1,i-1}(t-2) + y_{1,i}(t)\,F_0 - (1 - y_{1,i}(t))\,F_1,\\
\delta_2 = x_{1,i}(t-1) - F_2.
\end{cases}
\tag{10}
\]
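The initialization of Eq. (8) and the selection rule of Eqs. (9) and (10) can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the output y is taken to be 1 exactly when the two link inputs being compared are equal (that is, the control template is assumed to form their difference and the threshold is 0), any δ whose source state does not exist is ignored, the entry R[r][i] is assumed to hold the selected state of C1(i) at step t = r + i + 1, and row r is paired with the slave input u2[n − 1 − r] so that the blank cell leads. The names link_inputs and selection_array are hypothetical.

```python
from typing import List, Tuple

F0, F1, F2 = 5, 3, 2  # selection factor template F = {F0, F1, F2}


def link_inputs(s1: List[float], s2: List[float]) -> Tuple[List[float], List[float]]:
    """Eq. (8): u1 starts with a blank (0); u2 is s2 reversed, ending with a blank."""
    u1 = [0.0] + list(s1)
    u2 = list(reversed(s2)) + [0.0]
    return u1, u2


def selection_array(u1: List[float], u2: List[float]) -> List[List[int]]:
    """Fill the n-by-m selection array with the max rule of Eqs. (9) and (10)."""
    m, n = len(u1), len(u2)
    R = [[0] * m for _ in range(n)]
    for r in range(n):
        for i in range(m):
            if r == 0 and i == 0:
                continue  # first state x_{1,0}(1) is kept unchanged (0 here)
            # Assumption: pulse output is 1 when the paired codes are equal.
            y = 1 if u1[i] == u2[n - 1 - r] else 0
            candidates = []
            if i > 0:
                candidates.append(R[r][i - 1] - F2)  # delta_0
            if i > 0 and r > 0:
                candidates.append(R[r - 1][i - 1] + y * F0 - (1 - y) * F1)  # delta_1
            if r > 0:
                candidates.append(R[r - 1][i] - F2)  # delta_2
            R[r][i] = max(candidates)
    return R


if __name__ == "__main__":
    s1 = [-1, -1, 0.5, -0.5, 1, -0.5, 1, 0.5]  # numerical codes of S1
    s2 = [-0.5, -1, 0.5, -0.5, -1, 1]          # numerical codes of S2
    u1, u2 = link_inputs(s1, s2)
    for row in selection_array(u1, u2):
        print(row)
```

Under these assumptions the rule has the same form as a dynamic-programming alignment score with reward F0 for a matched pair, penalty F1 for a mismatched pair, and penalty F2 for a blank.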
Fig. 8. The cell state and output computation steps of the pairwise CNN, at t = 1, 2, …, 15.
have been obtained step by step. After that, the selection array $R^{n \times m}$ with n rows and m columns is constructed from all $x_{1,i}(t)$ as

\[
R^{n \times m} =
\begin{bmatrix}
x_{1,0}(1) & x_{1,1}(2) & x_{1,2}(3) & \cdots & x_{1,m-2}(m-1) & x_{1,m-1}(m)\\
x_{1,0}(2) & x_{1,1}(3) & x_{1,2}(4) & \cdots & x_{1,m-2}(m) & x_{1,m-1}(m+1)\\
x_{1,0}(3) & x_{1,1}(4) & x_{1,2}(5) & \cdots & x_{1,m-2}(m+1) & x_{1,m-1}(m+2)\\
x_{1,0}(4) & x_{1,1}(5) & x_{1,2}(6) & \cdots & x_{1,m-2}(m+2) & x_{1,m-1}(m+3)\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
x_{1,0}(n-1) & x_{1,1}(n) & x_{1,2}(n+1) & \cdots & x_{1,m-2}(m+n-3) & x_{1,m-1}(m+n-2)\\
x_{1,0}(n) & x_{1,1}(n+1) & x_{1,2}(n+2) & \cdots & x_{1,m-2}(m+n-2) & x_{1,m-1}(m+n-1)
\end{bmatrix}
\tag{11}
\]

Therefore, based on Eq. (11), all the cell selection states $x_{1,i}(t)$, where i = 0, 1, …, 8 and t = 1, 2, …, 15, obtained in the previous step are used to construct the selection array $R^{7 \times 9}$ for CNN1 as

\[
R^{7 \times 9} =
\begin{bmatrix}
0 & -2 & -4 & -6 & -8 & -10 & -12 & -14 & -16\\
-2 & -3 & -5 & -7 & -3 & -5 & -5 & -7 & -9\\
-4 & 3 & 2 & 0 & -2 & -4 & -6 & -8 & -10\\
-6 & 1 & 0 & 7 & 5 & 3 & 1 & -1 & -3\\
-8 & -1 & -2 & 5 & 12 & 10 & 8 & 6 & 4\\
-10 & -3 & 4 & 3 & 10 & 9 & 7 & 5 & 3\\
-12 & -5 & 2 & 1 & 8 & 15 & 13 & 12 & 10
\end{bmatrix}
\]

After the selection array has been obtained, the global alignment path is determined. Define a node set P as the alignment path, where initially P = ∅. Moreover, let R(n′, m′) represent the element at coordinate (n′, m′) of the array $R^{n \times m}$. By back-tracing step by step from the last state $x_{1,m-1}(m+n-1)$ to the first one $x_{1,0}(1)$ of the array $R^{n \times m}$, the global alignment path is determined by the following steps:

1. Let P = {(n−1, m−1)}, and N = n−1, M = m−1.
2. While N ≠ 0 or M ≠ 0 (that is, until node (0, 0) is reached), do the following loop:
   Find the maximum of R(N, M−1), R(N−1, M−1) and R(N−1, M). If any one of the three does not exist, ignore it.
   Save the coordinates of the maximum into (N′, M′).
   P = {(N′, M′)} ∪ P.
   Update N with N′, and M with M′.
3. Obtain the path node set P.
4. The generation of the alignment path is completed.

As shown above, by back-tracing in the selection array $R^{n \times m}$, we obtain the global alignment path:

P = {(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 5), (6, 6), (6, 7), (6, 8)}.

The key global alignment path P for the two DNA sequences has now been obtained.

3.2.3. The global alignment of two DNA sequences

This part shows how to conduct the global alignment of the two original DNA sequences under the guidance of the global alignment path node set P.

First, define P(i), i = 0, 1, 2, …, to represent the ith element of the coordinate set P (namely, the path node set), and use $l_P$ to represent the size of P. The full alignment process is described as follows:

1. Let integer I = 0, and let $P(I) = (V_I, H_I)$.
2. While $I < l_P - 1$, do the following loop:
   Compare P(I+1) with P(I).
   case (i): if $V_{I+1} - V_I = 1$ and $H_{I+1} - H_I = 0$, a ‘*’ is inserted into S1 at the location of S1(I);
   case (ii): if $V_{I+1} - V_I = 1$ and $H_{I+1} - H_I = 1$, no action;
   case (iii): if $V_{I+1} - V_I = 0$ and $H_{I+1} - H_I = 1$, a ‘*’ is inserted into S2 at the location of S2(I);
   case (iv): else, no action.
   Update I with I = I + 1.
3. The global alignment is completed.

Using the alignment process described above, the two DNA sequences, S1 and S2, in this example have been globally aligned. Fig. 9 shows the final alignment results. In this figure, the vertical line ‘|’ between two nucleotide bases marks a pair of nucleotide bases that have been aligned exactly. Moreover, ‘*’ still represents a blank space that has been inserted into these two sequences.

Fig. 9. An example of the global alignment: (a) two original sequences; (b) two sequences with alignment labels.

Table 1
Time consumption of the CNN algorithm to complete the alignment (ms).

t = m + n + 1 because of the parallel computation of the neural networks. So the time complexity O(T) of the algorithm in this paper is O(T) = O(m, n) ≈ O(m + n + 1), which is approximately proportional to the total length of the two DNA sequences to be aligned.
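The back-tracing steps of Section 3.2.2 and the insertion rules of Section 3.2.3 can be sketched in Python as follows. The sketch is illustrative only and rests on stated assumptions: the function names backtrace and insert_gaps are hypothetical, the loop runs until node (0, 0) is reached so that the path contains the first state, ties between equal neighbors are broken in the listed order (the text does not specify a rule), the array R is the 7 × 9 example with its negative entries written out explicitly, and the letter strings are placeholders of the right lengths because the original characters of S1 and S2 are not reproduced in this section.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def backtrace(R: List[List[float]]) -> List[Node]:
    """Walk from the last entry of R to (0, 0), always moving to the largest
    existing neighbor among left, upper-left and upper."""
    N, M = len(R) - 1, len(R[0]) - 1
    path = [(N, M)]
    while (N, M) != (0, 0):
        candidates = [(N, M - 1), (N - 1, M - 1), (N - 1, M)]
        existing = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        # max() keeps the first maximal candidate, i.e. the listed order on ties.
        N, M = max(existing, key=lambda c: R[c[0]][c[1]])
        path.insert(0, (N, M))
    return path


def insert_gaps(s1: str, s2: str, path: List[Node], gap: str = "*") -> Tuple[str, str]:
    """Apply cases (i) to (iv): a vertical move puts a blank into s1,
    a horizontal move puts a blank into s2, a diagonal move does nothing."""
    a, b = list(s1), list(s2)
    for I in range(len(path) - 1):
        dv = path[I + 1][0] - path[I][0]
        dh = path[I + 1][1] - path[I][1]
        if dv == 1 and dh == 0:
            a.insert(I, gap)
        elif dv == 0 and dh == 1:
            b.insert(I, gap)
    return "".join(a), "".join(b)


# Example array, signs written out explicitly (assumption, see lead-in).
R = [
    [0, -2, -4, -6, -8, -10, -12, -14, -16],
    [-2, -3, -5, -7, -3, -5, -5, -7, -9],
    [-4, 3, 2, 0, -2, -4, -6, -8, -10],
    [-6, 1, 0, 7, 5, 3, 1, -1, -3],
    [-8, -1, -2, 5, 12, 10, 8, 6, 4],
    [-10, -3, 4, 3, 10, 9, 7, 5, 3],
    [-12, -5, 2, 1, 8, 15, 13, 12, 10],
]

if __name__ == "__main__":
    P = backtrace(R)
    print(P)
    # Placeholder letters: 8 characters for S1 and 6 for S2 (hypothetical).
    print(insert_gaps("AACGTGTC", "GACGAT", P))
```

Each step of the path produces one aligned column, so a path with $l_P$ nodes yields two aligned sequences of length $l_P - 1$.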
Table 2
Global similarity obtained with the CNN algorithm (%).

Table 3
Computation time comparisons of the three algorithms (ms).

CNN algorithm failed to align the sequences. This does not mean that this CNN algorithm does not work for the long sequences but just implies that a computer with a sufficient memory can

[Fig. 10: vertical axis, the algorithm computation time (ms); the legend includes MILP.]

Table 4
The global similarity comparisons (%).
vs. NM_000405, the proposed algorithm took 998.58 ms to complete the global alignment, which is approximately 2.2 times faster than MILP and approximately 3.0 times faster than SPA, as shown in Table 3.

Table 3 also shows that the new algorithm based on the pairwise CNN is the fastest of the three in terms of computation time. Although it is only slightly better than the other two for short DNA sequences, for longer ones (for example, more than 10 000 nucleotides) it is significantly better. When the total length of the two sequences is very large (e.g. NG_009301 vs. NM_000405), the proposed algorithm takes approximately 55% less time than MILP and approximately 68% less time than SPA.

These experiments and comparisons of computation time show that the proposed algorithm performs much better than both MILP and SPA. In particular, its advantage over the other two becomes more significant when aligning long DNA sequences.

Moreover, in order to visually demonstrate the advantage of the proposed algorithm in terms of time performance, another group of experiments has been done to relate the computation time to the total length of the two sequences. In these experiments, a group of sequence pairs is aligned using the three algorithms described above. The results are plotted in a Cartesian coordinate system, as shown in Fig. 10. The three curves illustrate the relationship between the computation time and the total length of the sequence pair for the three algorithms.

In Fig. 10, the horizontal axis corresponds to the total lengths of the sequence pairs, and the vertical axis corresponds to the computation time taken by the three algorithms to align them. The figure shows that the three algorithms perform quite similarly in terms of time when the total length m + n is not very large. As the total length increases, both the SPA curve and the MILP curve rise steeply. Although the curve of the proposed algorithm also rises with the total length, it rises much more slowly than the other two. Like Table 3, Fig. 10 indicates that the proposed algorithm needs less computation time to align these sequence pairs, and that the larger m + n is, the greater its advantage.

As mentioned before, besides the computation time, the global similarity is also important for distinguishing these three algorithms. Experiments on the global similarity of the following DNA sequence pairs are also conducted, where the numerical global similarity is calculated using Eq. (12), as shown in Table 4.

Consistent with Table 3, Table 4 shows that each of the three algorithms can achieve a good similarity for the DNA sequence pairs. For a short pair such as S62051 vs. NM_008134, the three algorithms obtain an identical similarity of 39.95%. However, for longer DNA sequence pairs such as NM_010422 vs. NG_009301, the global similarity values of the three algorithms are slightly different. Usually, the proposed CNN algorithm results in a slightly larger similarity value than the other two. The reason is that this CNN algorithm always generates slightly more aligned pairs of nucleotide bases than the other two. Table 4 also shows that the proposed CNN algorithm often obtains a higher similarity than the other two, especially on long DNA sequence pairs.

5. Conclusions

The cellular neural network is one of the classic neural networks. During the past decades, two-dimensional CNNs have attracted most of the interest from researchers, while one-dimensional CNN models have received little attention. In order to explore a new model for the CNN and develop its applications, this paper proposes a simplified one-dimensional CNN model, which consists of a linear cell chain. Then a pairwise CNN is constructed using two one-dimensional CNNs, an immovable one and a movable one. Based on the pairwise CNN, a global alignment algorithm for two DNA sequences is developed. Simulation and comparison experiments are carried out and evaluated with DNA sequences from the publicly available NCBI databases. Compared with the other two algorithms, the developed one performs better in the global alignment with significantly less execution time. These experiments indicate that the proposed model is practically applicable and that the algorithm with a pairwise one-dimensional CNN is efficient.

Acknowledgments

This work is supported by the National Science Foundation of China (NSFC) under Grants 61175061 and 61273308, and the Fundamental Research Funds for the Central Universities under Grant ZYGX2013J076. The authors would also like to thank NCBI for the experiment data of DNA sequences, and thank all editors and reviewers for their comments and help in improving the manuscript.

References

[1] L.O. Chua, L. Yang, Cellular neural networks: theory, IEEE Trans. Circuits Syst. 35 (1988) 1257–1272.
[2] L.O. Chua, L. Yang, Cellular neural networks: applications, IEEE Trans. Circuits Syst. 35 (1988) 1273–1290.
[3] Z. Liu, H. Zhang, Z. Wang, Novel stability criterions of a new fuzzy cellular neural networks with time-varying delays, Neurocomputing 72 (2009) 1056–1064.
[4] S. Long, D. Xu, Stability analysis of stochastic fuzzy cellular neural networks with time-varying delays, Neurocomputing 74 (2011) 2385–2391.
[5] L. Wang, T. Chen, Complete stability of cellular neural networks with unbounded time-varying delays, Neural Netw. 36 (2012) 11–17.
[6] W.H. Chen, W.X. Zheng, A new method for complete stability analysis of cellular neural networks with time delay, IEEE Trans. Neural Netw. 7 (2010) 1126–1139.
[7] X. Song, X. Xin, W. Huang, Exponential stability of delayed and impulsive cellular neural networks with partially Lipschitz continuous activation functions, Neural Netw. 29–30 (2012) 80–90.
[8] M. Di Marco, M. Forti, M. Grazzini, L. Pancioni, Convergence of a class of cooperative standard cellular neural network arrays, IEEE Trans. Circuits Syst. I: Regul. Pap. 4 (2012) 772–783.
[9] X. Huang, Z. Zhao, Z. Wang, Y. Li, Chaos and hyperchaos in fractional-order cellular neural networks, Neurocomputing 94 (2012) 13–21.
[10] L. Wang, W. Liu, H. Shi, J.M. Zurada, Cellular neural networks with transient chaos, IEEE Trans. Circuits Syst. II: Express Briefs 5 (2007) 440–444.
[11] C.L. Chang, K.W. Fan, I.F. Chung, C.T. Lin, A recurrent fuzzy coupled cellular neural network system with automatic structure and template learning, IEEE Trans. Circuits Syst. II: Express Briefs 8 (2004) 602–606.
[12] C.H. Huang, C.T. Lin, Bio-inspired computer fovea model based on hexagonal-type cellular neural network, IEEE Trans. Circuits Syst. I: Regul. Pap. 1 (2007) 35–47.
[13] L. Zhang, Z. Yi, S.L. Zhang, P.A. Heng, Activity invariant sets and exponentially stable attractors of linear threshold discrete-time recurrent neural networks, IEEE Trans. Autom. Control 6 (2009) 1341–1347.
[14] G. Manganaro, J.P. de Gyvez, One-dimensional discrete-time CNN with multiplexed template-hardware, IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 5 (2000) 764–769.
[15] S.N.T. Polat, O. Yavuz, V. Tavsanoglu, Efficient simulation of time-derivative cellular neural networks, IEEE Trans. Circuits Syst. I: Regul. Pap. 11 (2012) 2638–2645.
[16] J. Martínez, J. Garrigós, J. Toledo, J. Ferrández, An efficient and expandable hardware implementation of multilayer cellular neural networks, Neurocomputing 114 (2013) 54–62.
[17] C.Y. Wu, S.H. Chen, The design and analysis of a CMOS low-power large-neighborhood CNN with propagating connections, IEEE Trans. Circuits Syst. I: Regul. Pap. 2 (2009) 440–452.
[18] E. Cesur, N. Yildiz, V. Tavsanoglu, On an improved FPGA implementation of CNN-based Gabor-type filters, IEEE Trans. Circuits Syst. II: Express Briefs 11 (2012) 815–819.
[19] D. Amanatidis, D. Tsaptsinos, P. Giaccone, G. Jones, Optimizing motion and colour segmented images with neural networks, Neurocomputing 62 (2004) 197–223.
[20] Y.W. Shou, C.T. Lin, Image descreening by GA-CNN-based texture classification, IEEE Trans. Circuits Syst. I: Regul. Pap. 11 (2004) 2287–2299.
[21] C.T. Lin, C.H. Huang, S.A. Chen, CNN-based hybrid-order texture segregation as early vision processing and its implementation on CNN-UM, IEEE Trans. Circuits Syst. I: Regul. Pap. 10 (2007) 2277–2287.
[22] Q. Gao, G.S. Moschytz, Fingerprint feature matching using CNNs, in: Proceedings of the 2004 International Symposium on Circuits and Systems, vol. 3, 2004, pp. 73–76.
[23] M. Milanova, U. Büker, Object recognition in image sequences with cellular neural networks, Neurocomputing 31 (2000) 125–141.
[24] A. Delbem, L. Correa, L. Zhao, Design of associative memories using cellular neural networks, Neurocomputing 72 (2009) 2180–2188.
[25] Q. Han, X. Liao, T. Huang, et al., Analysis and design of associative memories based on stability of cellular neural networks, Neurocomputing 97 (2012) 192–200.
[26] Z. Zeng, J. Wang, Analysis and design of associative memories based on recurrent neural networks with linear saturation activation functions and time-varying delays, Neural Comput. 8 (2007) 2149–2182.
[27] N. Takahashi, M. Nagayoshi, S. Kawabata, T. Nishi, Stable patterns realized by a class of one-dimensional two-layer CNNs, IEEE Trans. Circuits Syst. II: Regul. Pap. 11 (2008) 3607–3620.
[28] N. Takahashi, et al., Sufficient conditions for one-dimensional cellular neural networks to perform connected component detection, Nonlinear Anal.: Real World Appl. 11 (2010) 4202–4213.
[29] S. Needleman, C. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 3 (1970) 443–453.
[30] A.J. Gibbs, G.A. McIntyre, A. George, The diagram, a method for comparing sequences, its use with amino acid and nucleotide sequences, Eur. J. Biochem. 1 (1970) 1–11.
[31] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981) 196–197.
[32] D.J. Lipman, W.R. Pearson, Rapid and sensitive protein similarity searches, Science 4693 (1985) 1435–1441.
[33] S.F. Altschul, T.L. Madden, A.A. Schaffer, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 7 (1997) 3389–3402.
[34] F. Corpet, Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res. 22 (1988) 10881–10890.
[35] D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25 (1987) 351–360.
[36] S.A.M.A. Junid, N.M. Tahir, Z.A. Majid, M.F.M. Idros, Potential of graph theory algorithm approach for DNA sequence alignment and comparison, in: Proceedings of the Third International Conference on Intelligent Systems, Modelling and Simulation (ISMS 2012), vol. 2, 2012, pp. 187–190.
[37] S.R. McAllister, R. Rajgaria, C.A. Floudas, Global pairwise sequence alignment through mixed-integer linear programming: a template-free approach, Optim. Methods Softw. 1 (2007) 127–144.
[38] S. Shen, J. Yang, A. Yao, P. Hwang, Super pairwise alignment (SPA): an efficient approach to global alignment for homologous sequences, J. Comput. Biol. 3 (2002) 477–486.

Luping Ji received his B.S. degree in Mechanical & Electronic Engineering from Beijing Institute of Technology, Beijing, PR China, in 1999. He then received his M.S. degree in Computer Application & Technology in 2005 and his Ph.D. degree in Computer Software & Theory in 2008, both from the University of Electronic Science and Technology of China, Chengdu, PR China. He is currently an associate professor in the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, PR China. His current research interests include neural networks and pattern recognition.

Xiaorong Pu is a professor at the University of Electronic Science and Technology of China, PR China. She was a visiting scholar at the Illinois Institute of Technology (IIT), USA, in 2008, and at the University of Manchester Institute of Science and Technology (UMIST), Britain, in 2004. Her research interests include computer vision, biometrics, neural networks and operating systems. She has published 8 books and more than 30 academic papers so far.

Hong Qu received the B.S., M.S. and Ph.D. degrees in Computer Science and Engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2000, 2003 and 2006, respectively. From March 2007 to February 2008, he was a postdoctoral fellow at the Advanced Robotics and Intelligent Systems Lab, School of Engineering, University of Guelph, Guelph, ON, Canada. He is currently a professor in the School of Computer Science and Engineering, University of Electronic Science and Technology of China. His current research interests include neural networks, robotics, neurodynamics, intelligent computation, and optimization.

Guisong Liu received his B.S. degree in Mechanics from Xi'an Jiao Tong University, Xi'an, China, in 1995, and the M.S. degree in Automatics and the Ph.D. degree in Computer Science, both from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2000 and 2007, respectively. He is now an associate professor in the Computational Intelligence Laboratory, School of Computer Science and Engineering, UESTC. His research interests include computational intelligence, pattern recognition and machine learning.