Professional Documents
Culture Documents
2004 - A Simple Algorithm For The Constrained Sequence Problems PDF
2004 - A Simple Algorithm For The Constrained Sequence Problems PDF
www.elsevier.com/locate/ipl
Abstract
In this paper we address the constrained longest common subsequence problem. Given two sequences X, Y and a constrained
sequence P , a sequence Z is a constrained longest common subsequence for X and Y with respect to P if Z is the longest
subsequence of X and Y such that P is a subsequence of Z.
Recently, Tsai [Inform. Process. Lett. 88 (2003) 173–176] proposed an O(n2 · m2 · r) time algorithm to solve this problem
using dynamic programming technique, where n, m and r are the lengths of X, Y and P , respectively.
In this paper, we present a simple algorithm to solve the constrained longest common subsequence problem in O(n · m · r)
time and show that the constrained longest common subsequence problem is equivalent to a special case of the constrained
multiple sequence alignment problem which can also be solved with the same time complexity.
2004 Published by Elsevier B.V.
Keywords: Constrained longest common subsequence; Algorithms; Dynamic programming; Sequence alignment
on the dynamic programming technique was proposed The following theorem characterizes the structure
for this problem by Tsai [7], where n, m and r are the of an optimal solution based on optimal solutions to
lengths of X, Y and P , respectively. subproblems, for the constrained LCS problem.
In this paper, we present a simple algorithm to
solve the CLCS problem in O(n · m · r) time. We Theorem 2.1. If Z = z1 , z2 , . . . , z is the con-
further show that this CLCS problem is equivalent to strained LCS of X = x1 , x2 , . . . , xn and Y = y1 , y2 ,
a special case of the constrained sequence alignment . . . , ym with respect to P = p1 , p2 , . . . , pr , the fol-
(CSA) problem which can be solved with the same lowing conditions hold:
time complexity.
The rest of this paper is organized as follows. In 1. If xn = ym = pr then z = xn = ym = pr and
Section 2, we define formally the CLCS problem and Z−1 is the constrained LCS of Xn−1 and Ym−1
characterize the structure in computing the CLCS. In with respect to Pr−1 .
Section 3, we present a simple dynamic programming 2. If xn = ym and xn = pr then z = xn = ym and
algorithm that computes the CLCS of two sequences Z−1 is the constrained LCS of Xn−1 and Ym−1
with respect to a constrained sequence. In Section 4, with respect to P .
we show that the CLCS problem is equivalent to a 3. If xn = ym then z = xn implies that Z is a
special case of the constrained sequence alignment constrained LCS of Xn−1 and Y with respect to P .
problem [1,6]. 4. If xn = ym then z = ym implies that Z is a
constrained LCS of X and Ym−1 with respect
to P .
2. Characterization of the constrained LCS
Proof. As Z is the constrained LCS of X and Y with
problem
respect to P , xn , ym and z are the last characters of
X, Y and Z, respectively, we have z = xn = ym if
A sequence is a string of characters over a set of xn = ym . Assume by contradiction that z = xn , we
alphabets Σ. A subsequence Z of a sequence X is may append xn = ym to Z to obtain a constrained
obtained by deleting some characters from X (not common subsequence of X and Y of length + 1,
necessarily contiguous); we also say that X contains contradicting the hypothesis that Z, of length , is
Z if Z is a subsequence of X. Given two sequences a constrained LCS of X and Y with respect to P .
X and Y , Z is a common subsequence of X and Y Therefore, if xn = ym then zl = xn = ym . This will be
if Z is a subsequence of both X and Y . Z is the used in the proofs of 1 and 2. Now, we prove the four
longest common subsequence (LCS) of X and Y if properties.
Z is the longest among all common subsequences Proof of 1. Since xn = ym = pr , we have xn =
of X and Y . For example, “lm” and “rm” are both ym = pr = z . The prefix Z−1 is a common subse-
the longest common subsequence of “problem” and quence of Xn−1 and Ym−1 with respect to Pr−1 of
“algorithm”. Let P be a constrained sequence. We length − 1. Now, we show that Z−1 is a constrained
say that Z is the constrained LCS of X and Y with LCS of Xn−1 and Ym−1 with respect to Pr−1 . Assume
respect to P if Z is the longest subsequence of X by contradiction that there exists a constrained com-
and Y and Z contains P (i.e., P is a subsequence mon subsequence S of Xn−1 and Ym−1 with respect to
of Z). For example, “lm” is the longest common Pr−1 whose length is greater than − 1. If we append
subsequence of “problem” and “algorithm” with xn = ym = pr to S we obtain a constrained common
respect to “l”. subsequence of X and Y with respect to P of length
Given a sequence X = x1 , x2 , . . . , xn , where char- greater than , contradicting the hypothesis that Z is a
acter xi ∈ Σ for any i = 1, . . . , n, we denote the constrained LCS of X and Y with respect to P .
ith prefix of X by Xi = x1 , x2 , . . . , xi for any Proof of 2. Since xn = ym and xn = pr , then
i = 1, . . . , n. In particular, X0 denotes the empty se- xn = ym = z and z = pr . As z = pr , the prefix
quence. For example, if X = “algorithm” then Z−1 is a common subsequence of Xn−1 and Ym−1
X3 = “alg”. with respect to P of length − 1. Now, we show
F.Y.L. Chin et al. / Information Processing Letters 90 (2004) 175–179 177
that Z−1 is a constrained LCS of Xn−1 and Ym−1 Proof of 2. Assume by contradiction that there is
with respect to P . Assume by contradiction that there no constrained common subsequence of X and Y with
exists a constrained common subsequence S of Xn−1 respect to P but there exists a constrained common
and Ym−1 with respect to P whose length is greater subsequence Z of X and Y with respect to P . This
than − 1. If we append xn = ym to S, we obtain a is a contradiction because Z is also a constrained
constrained common subsequence of X and Y with common subsequence of X and Y with respect to P .
respect to P of length greater than , contradicting the
hypothesis that Z is a constrained LCS of X and Y
with respect to P . 3. A simple algorithm
Proof of 3. Since z = xn , Z is a constrained com-
mon subsequence of Xn−1 and Y with respect to P .
Now, we show that Z is a constrained LCS of Xn−1 Given two sequences X, Y and a constrained se-
and Y with respect to P . Assume by contradiction that quence P , whose lengths are n, m and r, respectively,
there exists a constrained common subsequence S of we define L(i, j, k) as the constrained LCS length
Xn−1 and Y with respect to P whose length is greater of Xi and Yj with respect to Pk , for any 0 i n,
than , then S also is a constrained common subse- 0 j m and 0 k r. In particular, L(n, m, r)
quence of X and Y with respect to P whose length is gives the length of the constrained LCS of X and Y
greater than . This contradicts the assumption that Z with respect to P . We design an algorithm that com-
is a constrained LCS of X and Y with respect to P . putes the CLCS of X and Y with respect to P in
Proof of 4. The proof is similar to proof of 3. 2 O(n · m · r) time.
If either i < k or j < k, there is no constrained
common subsequence for Xi and Yj with respect
The next theorem shows a characterization of
to Pk . We represent this condition by denoting L(i, j,
the constrained LCS problem when no constrained
k) = −∞, where ∞ represents a large number, greater
common subsequence exists.
than the maximum value of n and m. If i = 0 or j = 0
and k = 0, the CLCS for Xi and Yj with respect to P0
Theorem 2.2. If there is no constrained common sub- has length 0. The characterization of the structure of a
sequence of X = x1 , x2 , . . . , xn and Y = y1 , y2 , . . . , solution for the CLCS problem based on solutions to
ym with respect to P = p1 , p2 , . . . , pr , the follow- subproblems shown in Section 2, yields the following
ing conditions hold: recursive relation, for any 0 i n, 0 j m and
0 k r,
1. If xn = ym = pr then there is no constrained
1 + L(i − 1, j − 1, k − 1)
common subsequence of Xn−1 and Ym−1 with
if i, j, k > 0 and xi = yj = pk ,
respect to Pr−1 .
1 + L(i − 1, j − 1, k)
2. There is no constrained common subsequence of
the two sequences X and Y with respect to P , L(i, j, k) = if i, j > 0, xi = yj and (1)
= =
for each of the following three cases:
(k 0 or x i pk ),
• X = Xn−1 and Y = Ym−1 ; max(L(i − 1, j, k), L(i, j − 1, k))
• X = Xn−1 and Y = Y ; if i, j > 0 and xi = yj
• X = X and Y = Ym−1 .
with boundary conditions,
Proof of 1. Assume by contradiction that there is no
L(i, 0, 0) = L(0, j, 0) = 0
constrained common subsequence of X and Y with re-
spect to P but there exists a constrained common sub- and
sequence Z of Xn−1 and Ym−1 with respect to Pr−1 .
Since xn = ym = pr then the concatenation of xn to Z L(0, j, k) = L(i, 0, k) = −∞,
is a constrained common subsequence of X and Y with
respect to P . Contradiction. for i = 0, . . . , n, j = 0, . . . , m and k = 1, . . . , r.
178 F.Y.L. Chin et al. / Information Processing Letters 90 (2004) 175–179
In this section, we show that the CLCS problem Theorem 4.1. Given two sequences X, Y and a
is in fact a special case of the constrained sequence constrained sequence P , the CLCS of problem is
alignment (CSA) problem [1,6]. equivalent to the CSA problem when the distance
Let X = x1 , x2 , . . . , xn and Y = y1 , y2 , . . . , ym function δ(x , y ) given in Eq. (3) is used.
be two sequences over Σ, with lengths n and m, re-
spectively. We define the sequence alignment of X and Proof. Let X Y be the CSA solution with the mini-
Y as two equal-length sequences X = x1 , x2 , . . . , xn mum alignment score, n = |X | = |Y |. By the de-
and Y = y1 , y2 , . . . , yn such that |X | = |Y | = n , finition of CSA with the distance function δ(x , y ),
X
where n n, m, and removing all space characters
Y has the minimum alignment score only if X and
“-” from X and Y gives X and Y , respectively, with Y have the most number of matches (i.e., xi = yi )
construct X and Y by inserting spaces into X and Y max L n/2, j, k
0j m,0kr
respectively, such that every character in Z appears in
the same position in X and Y . Using the distance + Lr n/2 + 1, j + 1, k + 1
function δ(x , y ), the alignment score of X
Y is −, which can be found in O(n · m · r) time (say Knmr
X
i.e., the alignment Y has matching columns. Since time, where K is a constant) and O(m · r) space. As-
P is a subsequence of Z, there is a column matching in sume the maximum value occurs when j = j and
X
k = k , then we can further solve the two subprob-
Y for every character in P . As Z is a CLCS of X and
Y with respect to P , the optimal CSA solution for X lems L( 12 n, j , k ) and Lr ( 12 n + 1, j + 1, k + 1) in
2 Kmnr time and O(m · r) space. Continuing this re-
1
and Y with respect to P has at most matches, i.e.,
with the minimum alignment score −. Hence, X cursively, we can solve the CLCS problem in the same
Y is
the optimal CSA solution for X and Y with respect O(n · m · r) time and O(m · r) space complexities.
to P . 2
References
5. Conclusions [1] F.Y.L. Chin, N.L. Ho, T.W. Lam, P.W.H. Wong, M.Y. Chan,
Efficient constrained multiple sequence alignment with perfor-
In this paper, we have addressed the constrained mance guarantee, in: Proceedings of the IEEE Computer Society
Conference on Bioinformatics, IEEE Computer Society, 2003,
longest common subsequence problem proposed by
pp. 337–346.
Tsai [7]. An O(n2 · m2 · r) time algorithm based on [2] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to
the dynamic programming technique was proposed to Algorithms, MIT Press, Cambridge, MA, 1990.
compute a CLCS for X and Y with respect to P , where [3] D.S. Hirschberg, Algorithms for the longest common subse-
n, m and r are the lengths of X, Y and P , respectively. quence problem, J. ACM 24 (1977) 664–675.
[4] D. Mayer, The complexity of some problems on subsequences
We have described a simple O(n · m · r) time algorithm and supersequences, J. ACM 25 (1978) 322–336.
to solve this problem and have also showed that the [5] W.J. Masek, M.S. Paterson, A faster algorithm computing string
CLCS problem is indeed a special case of the CSA edit distances, J. Comput. System Sci. 20 (1980) 18–31.
problem. [6] C.Y. Tang, C.L. Lu, M.D.T. Chang, Y.T. Tsai, Y.J. Sun, K.M.
Chao, J.M. Chang, Y.H. Chiou, C.M. Wu, H.T. Chang, W.I.
It is not difficult to show that this problem can
Chou, Constrained multiple sequence alignment tool develop-
also be solved in O(min(n, m) · r) space based on ment and its application to RNase family alignment, J. Bioin-
Hirschberg’s Algorithm [3]. We define Lr (i, j, k) form. Comput. Biol. 1 (2003) 267–287.
as the length of CLCS of the suffices of X, Y [7] Y.T. Tsai, The constrained longest common subsequence prob-
and P starting from the ith, j th and kth positions, lem, Inform. Process. Lett. 88 (2003) 173–176.
[8] R.A. Wagner, M.J. Fischer, The string-to-string correction
respectively. WLOG, assume n m, then Z, the problem, J. ACM 21 (1974) 168–173.
solution for the CLCS problem, can be defined as