Professional Documents
Culture Documents
DNA Fragment Assembly: An Ant Colony System Approach
DNA Fragment Assembly: An Ant Colony System Approach
System Approach
Abstract. This paper presents the use of an ant colony system (ACS)
algorithm in DNA fragment assembly. The assembly problem generally
arises during the sequencing of large strands of DNA where the strands
are needed to be shotgun-replicated and broken into fragments that are
small enough for sequencing. The assembly problem can thus be classi-
fied as a combinatorial optimisation problem where the aim is to find
the right order of each fragment in the ordering sequence that leads
to the formation of a consensus sequence that truly reflects the origi-
nal DNA strands. The assembly procedure proposed is composed of two
stages: fragment assembly and contiguous sequence (contig) assembly.
In the fragment assembly stage, a possible alignment between fragments
is determined with the use of a Smith-Waterman algorithm where the
fragment ordering sequence is created using the ACS algorithm. The re-
sulting contigs are then assembled together using a nearest neighbour
heuristic (NNH) rule. The results indicate that in overall the perfor-
mance of the combined ACS/NNH technique is superior to that of the
NNH search and a CAP3 program. The results also reveal that the solu-
tions produced by the CAP3 program contain a higher number of contigs
than the solutions produced by the proposed technique. In addition, the
quality of the combined ACS/NNH solutions is higher than that of the
CAP3 solutions when the problem size is large.
1 Introduction
F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 231–242, 2006.
c Springer-Verlag Berlin Heidelberg 2006
232 W. Wetcharaporn, N. Chaiyaratana, and S. Tongsima
covering over three billion genetic codes, is investigated. To achieve the goal, the
project has to be divided into many components. One of major components is
DNA fragment assembly. DNA is a double helix comprised of two complementary
strands of polynucleotides. Each strand of DNA can be viewed as a character
string over an alphabet of four letters: A, G, C and T. The four letters represent
four bases, which are adenine (A), guanine (G), cytosine (C) and thymine (T).
The two strands are complementary in the sense that at corresponding positions
A’s are always paired with T’s and C’s with G’s. These pairs of complementary
bases are referred to as “base pairs”. At present, strands of DNA that are longer
than 600 base pairs cannot routinely be sequenced accurately [2]. The sequenc-
ing technique that involves fragmentation of a DNA strand is called a shotgun
sequencing technique. Basically, DNA is first replicated many times and then in-
dividual strands of the double helix are broken randomly into smaller fragments.
This produces a set of out of order fragments short enough for sequencing.
A DNA fragment assembly problem involves finding the right order of each
fragment in the fragment ordering sequence, which leads to the formation of a
consensus sequence that truly reflects the parent DNA strands. In other words,
the DNA fragment assembly problem can be treated as a combinatorial opti-
misation problem. A number of deterministic and stochastic search techniques
have been used to solve DNA fragment assembly problems [3]. For instance,
Huang and Madan [4] and Green [5] have used a greedy search algorithm to
solve the problem. However, a manual manipulation on the computer-generated
result is required to obtain a biologically plausible final result. Other determin-
istic search algorithms that have been investigated include a branch-and-cut
algorithm [6] and a graph-theoretic algorithm where DNA fragments are either
represented by graph nodes [7, 8] or graph edges [9]. The capability of stochas-
tic search algorithms such as a simulated annealing algorithm [10], a genetic
algorithm [11, 12, 13] and a neural network based prediction technique [14] has
also been investigated. The best DNA fragment assembly results obtained from
stochastic searches have been reported in Parsons and Johnson [12], and Kim and
Mohan [13] where genetic algorithms have proven to outperform greedy search
techniques in relatively small-sized problems. In addition, the need for manual
intervention is also eliminated in this case. Although a significant improvement
over the greedy search result has been achieved, the search efficiency could be
further improved if the redundancy in the solution representation could be elim-
inated from the search algorithms [11]. Similar to a number of combinatorial
optimisation techniques, the use of a permutation representation is required to
represent a DNA fragment ordering solution in a genetic algorithm search. With
such representation, different ordering solutions can produce the same DNA con-
sensus sequence. Due to the nature of a genetic algorithm as a parallel search
technique, the representation redundancy mentioned would inevitably reduce
the algorithm efficiency. A stochastic search algorithm that does not suffer from
such effect is an ant colony system (ACS) algorithm [15].
The natural metaphor on which ant algorithms are based is that of ant
colonies. Real ants are capable of finding the shortest path between a food
DNA Fragment Assembly: An Ant Colony System Approach 233
source and their nest without using visual clues by exploiting pheromone in-
formation. While walking, ants deposit pheromone on the ground, and proba-
bilistically follow pheromone previously deposited by other ants. The way ants
exploit pheromone to find the shortest path between two points can be described
as follows. Consider a situation where ants arrive at a decision point in which
they have to decide between two possible paths for both getting to and returning
from their destination. Since they have no clue about which is the best choice,
they have to pick the path randomly. It can be expected that on average half
of the ants will decide to go on one path and the rest choose to travel on the
other path. Suppose that all ants walk at approximately the same speed, the
pheromone deposited will accumulate faster on the shorter path. After a short
transitory period the difference between the amounts of pheromone on the two
paths is sufficiently large so as to influence the decision of other ants arriving at
the decision point. New ants will thus prefer to choose the shorter path since at
the decision point they perceive more pheromone. At the end, all ants will use
the shorter path. If the ants have to complete a circular tour covering n different
destinations without visiting order preference, the emerged shortest path will be
a solution to the n-city travelling salesman problem (TSP). Although the ACS
algorithm also exploits stochastic parallel search mechanisms, the algorithm per-
formance does not depend upon the solution representation. This is because the
optimal solution found by the ACS algorithm will emerge as a single “shortest
path”. In other words, the problem regarding the redundancy in the solution
representation mentioned early would be completely irrelevant to the context of
the ACS algorithm search.
The organisation of this paper is as follows. In section 2, the overview of a
DNA fragment assembly problem will be given. In section 3, the background on
the ACS algorithm will be discussed. The application of the ACS algorithm on
the DNA fragment assembly problem will be explained in section 4. Next, the
case studies are explained in section 5. The results obtained after applying the
ACS algorithm to the problem and the result discussions are given in section 6.
Finally, the conclusions are drawn in section 7.
Parent Strands
TTAGCACAGGAACTCTA
|||||||||||||||||
AATCGTGTCCTTGAGAT
sion that leads to the construction of the shortest global tour using pheromone
deposition, an optimal solution to the TSP would be co-operatively created by
all ants in the colony.
Based on the overview of the algorithm, three main components that make up
the algorithm are a state transition rule, a local pheromone-updating rule and
a global pheromone-updating rule. In addition, Dorigo and Gambardella [15]
have also introduced the use of a data set called a candidate list that creates a
limitation on the city chosen by an ant for a state transition. The explanation
on these three rules and the candidate list can be found in Dorigo and Gam-
bardella [15]. Although the explanation of the ACS algorithm is given in the
context of a TSP, the ACS algorithm can be easily applied to a DNA fragment
assembly problem. This is because the overlap score, which provides information
regarding how well two fragments can fit together, can be directly viewed as the
inverse of the distance between two cities.
ordering sequence are the locations where the overlap score between two consec-
utive fragments is lower than a threshold value, which in this investigation is set
to 95. However, more than one solution generated may have the same number
of contigs. If this is the case, the solution that is the better solution is the one
where the difference between the length of the longest and the shortest contigs
is minimal. This part of the objective function is derived from the desire that
the ultimate goal solution is the one with either only one contig or the fewest
possible number of contigs where each contig is reasonably long.
5 Case Studies
The capability of the ACS algorithm in DNA fragment assembly will be tested
using data sets obtained from a GenBank database at the National Center
for Biotechnology Information or NCBI (http://www.ncbi.nlm.nih.gov). The
parent DNA strands in this case are extracted from the human chromosome 3
where the strands with the sequence length ranging from 21K to 83K base pairs
are utilised. It is noted that each complementary pair of parent DNA strands can
be referred from the database using its accession number. The fragments used
to construct the consensus sequence are also obtained from the same database
where each fragment is unclipped (low quality base reads are retained) and has
the total number of bases between 700 and 900. This means that the fragments
used in the case studies contain sequencing errors generally found in any ex-
periments. In addition, no base quality information is used during the assembly
investigation. In order for a consensus sequence to accurately represent the par-
ent DNA strands, there must be more than one fragment covering any base pairs
238 W. Wetcharaporn, N. Chaiyaratana, and S. Tongsima
of the parent strands. The average number of fragments covering each base pair
on the parent strands is generally referred to as coverage. In addition, for a con-
sensus sequence to be made up from one contig, there must be no gaps in the
fragment ordering sequence. In this paper, the data set is prepared such that the
consensus sequence contains either one contig or multiple contigs. The summary
of the data set descriptions is given in Table 2.
In order to benchmark the performance of the ACS algorithm, its search per-
formance will be compared with that of the NNH rule and a CAP3 program [4],
which is one of the most widely used programs in bioinformatics research com-
munity. Since the base quality information is not used during the assembly, the
solutions produced by the CAP3 program would be similar to the results from
a PHRAP program [5], which is also a standard program [13]. The parameter
setting for the ACS algorithm is the recommended setting for solving symmetric
travelling salesman problems given in Dorigo and Gambardella [15].
Table 3. Number of contigs from the solutions produced by the NNH+NNH approach,
the ACS+NNH approach and the CAP3 program
7 Conclusions
In this paper, a DNA fragment assembly problem, which is a complex combina-
torial optimisation problem that can be treated as a travelling salesman problem
(TSP), has been discussed. In the context of a TSP, a fragment ordering sequence
would represent a tour that covers all cities while the overlap score between two
aligned fragments in the ordering sequence can be viewed as the inverse of the
DNA Fragment Assembly: An Ant Colony System Approach 241
Table 4. Assembly errors expressed in terms of the sum of substitution and inser-
tion/deletion errors, and the coverage error
distance between two cities. The assembly procedure proposed consists of two
stages: fragment assembly and contig assembly stages. In the fragment assembly
stage, a search for the best alignment between fragments is carried out using an
ant colony system (ACS) algorithm. The resulting contigs are then assembled
together using a nearest neighbour heuristic (NNH) rule in the contig assembly
stage. The assembly procedure proposed has been benchmarked against a CAP3
program [4]. The results suggest that the solutions produced by the CAP3 pro-
gram contain a higher number of contigs than the solutions generated by the
proposed technique. In addition, the quality of the combined ACS/NNH solu-
tions is higher than that of the CAP3 solutions when the problem size is large.
Since the core algorithm of the CAP3 program is a greedy search algorithm,
a replacement of the core algorithm with the ACS algorithm may yield an im-
provement on the final assembly solution.
Acknowledgements
This work was supported by the Thailand Toray Science Foundation (TTSF)
and National Science and Technology Development Agency (NSTDA) under the
Thailand Graduate Institute of Science and Technology (TGIST) programme.
The authors acknowledge Prof. Xiaogiu Huang at the Iowa State University for
providing an access to the SIM and CAP3 programs.
References
1. International Human Genome Sequencing Consortium: Initial sequencing and anal-
ysis of the human genome. Nature 409(6822) (2001) 860–921
2. Applewhite, A.: Mining the genome. IEEE Spectrum 39(4) (2002) 69–71
242 W. Wetcharaporn, N. Chaiyaratana, and S. Tongsima
3. Pop, M., Salzberg, S.L., Shumway, M.: Genome sequence assembly: Algorithms
and issues. Computer 35(7) (2002) 47–54
4. Huang, X., Madan, A.: CAP3: A DNA sequence assembly program. Genome Re-
search 9(9) (1999) 868–877
5. Green, P.: Phrap documentation. Phred, Phrap, and Consed www.phrap.org
(2004)
6. Ferreira, C.E., de Souza, C.C., Wakabayashi, Y.: Rearrangement of DNA frag-
ments: A branch-and-cut algorithm. Discrete Applied Mathematics 116(1-2)
(2002) 161–177
7. Batzoglou, S., Jaffe, D., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger,
B., Mesirov, J.P., Lander, E.S.: ARACHNE: A whole-genome shotgun assembler.
Genome Research 12(1) (2002) 177–189
8. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence as-
sembly. Algorithmica 13(1-2) (1995) 7–51
9. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA
fragment assembly. Proceedings of the National Academy of Sciences of the United
States of America 98(17) (2001) 9748–9753
10. Burks, C., Engle, M., Forrest, S., Parsons, R., Soderlund, C., Stolorz, P.: Stochastic
optimization tools for genomic sequence assembly. In: Adams, M.D., Fields, C.,
Venter, J.C. (eds.): Automated DNA Sequencing and Analysis. Academic Press,
London, UK (1994) 249–259
11. Parsons, R.J., Forrest, S., Burks, C.: Genetic algorithms, operators, and DNA
fragment assembly. Machine Learning 21(1-2) (1995) 11–33
12. Parsons, R.J., Johnson, M.E.: A case study in experimental design applied to
genetic algorithms with applications to DNA sequence assembly. American Journal
of Mathematical and Management Sciences 17(3-4) (1997) 369–396
13. Kim, K., Mohan, C.K.: Parallel hierarchical adaptive genetic algorithm for frag-
ment assembly. In: Proceedings of the 2003 Congress on Evolutionary Computa-
tion, Canberra, Australia (2003) 600–607
14. Angeleri, E., Apolloni, B., de Falco, D., Grandi, L.: DNA fragment assembly using
neural prediction techniques. International Journal of Neural Systems 9(6) (1999)
523–544
15. Dorigo, M., Gambardella, L.M.: Ant colony system: A cooperative learning ap-
proach to the traveling salesman problem. IEEE Transactions on Evolutionary
Computation 1(1) (1997) 53–66
16. Allex, C.F., Baldwin, S.F., Shavlik, J.W., Blattner, F.R.: Improving the quality
of automatic DNA sequence assembly using fluorescent trace-data classifications.
In: Proceedings of the Fourth International Conference on Intelligent Systems for
Molecular Biology, St. Louis, MO (1996) 3–14
17. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences.
Journal of Molecular Biology 147(1) (1981) 195–197
18. Huang, X., Miller, W.: A time-efficient, linear-space local similarity algorithm.
Advances in Applied Mathematics 12(3) (1991) 337–357