
Sorting Unsigned Permutations by Reversals using Multi-Objective Evolutionary Algorithms with Variable Size Individuals
Ahmadreza Ghaffarizadeh, Kamilia Ahmadi and Nicholas S. Flann
Computer Science Department, Utah State University, Logan Utah 84322-4205
Email: {ghaffarizadeh, k.ahmadi}@aggiemail.usu.edu, nick.flann@usu.edu

Abstract—Sorting by reversals is a simplified version of the genome rearrangement problem that seeks to discover the evolutionary relationship between different genomes, and is one of the many challenging problems in bioinformatics. Solving the problem optimally has been proved to be NP-hard, and so a selection of approximation algorithms has been developed. In this paper a new mapping order is introduced to solve the problem of sorting unsigned permutations using a specialized multi-objective genetic algorithm. Our modified genetic algorithm uses a population with variable-length individuals to maintain a worst-case running time complexity of $O(n^4 \log^2 n)$, where $n$ is the problem size. The results show that this approach is more effective than the 3/2 heuristic method and previous genetic algorithm approaches.

Index Terms—Sorting by reversals, genome rearrangement, sorting unsigned permutations, variable size individuals, multi-objective genetic algorithm

I. INTRODUCTION

Analysis of genome rearrangements in molecular evolution was pioneered by Dobzhansky and Sturtevant in 1938, who published a milestone paper with an evolutionary tree presenting a rearrangement scenario with 17 inversions linking the species D. pseudoobscura and D. miranda [1] [2]. Every genome rearrangement study involves solving a combinatorial puzzle to find a series of genome rearrangements that transform one genome into another [3]. Reversal is the most commonly observed mechanism in the rearrangement of genes, which has made sorting by reversals one of the most challenging problems in bioinformatics in the past decade.

In their simplest form, rearrangement events can be modeled by a series of reversals that transform one genome into another [3]. The order of genes in a genome can be represented by a permutation $\Pi = \langle \pi_1, \pi_2, \pi_3, \ldots, \pi_n \rangle$, where $n$ is the number of genes and $\pi_i$ is the gene id in position $i$. This problem reduces to the sorting by reversals problem, which can be described as: given a permutation $\Pi$, find a shortest series of reversals $\langle \rho^1, \rho^2, \ldots, \rho^m \rangle$ that transforms it into the identity permutation. To solve this problem, we introduce the breakpoint distance of a permutation to assess the progress of the algorithm towards the solution. We call a pair of neighboring elements $\pi_i$ and $\pi_{i+1} \in \Pi$, with $0 \le i \le n$, an adjacency if $\pi_i$ and $\pi_{i+1}$ are consecutive numbers; otherwise we call the pair a breakpoint. The breakpoint distance is then the number of breakpoints in a permutation. Note that we add $\pi_0 = 0$ to the start and $\pi_{n+1} = n + 1$ to the end of each permutation to count the breakpoints at the start and end of the permutation, respectively. As the goal identity permutation has no breakpoints, sorting by reversals corresponds to finding a series of reversals that eliminates all breakpoints.

There are two types of permutations, signed and unsigned. For signed permutations each $\pi_i$ has a positive or negative sign reflecting the orientation of that block of genes in the genome. The problem of sorting a signed permutation by reversals can be solved sub-optimally in $O(n^2)$ time [4], with the number of reversal steps no greater than 3× the optimal solution [2]. However, obtaining signed permutations is not always possible due to limitations in equipment and costs, so at this time the application of sorting by reversals to the unsigned permutation problem has wider applicability.

From a combinatorial mathematics point of view, identifying the optimal sequence of reversals for sorting unsigned permutations is an NP-hard problem [5]; therefore error-bounded heuristic solutions have been proposed [2] [6] [7]. The lowest guaranteed error bound thus far is that of the 1.375 algorithm proposed by Berman et al. [8], meaning that the length of the sequence found will be within 1.375× the length of the optimal sequence. Auyeung and Abraham [9] suggested a genetic algorithm (GA) approach that solves the problem by mapping the unsigned reversal problem into $2^n$ possible signed reversal problems, then using the GA to heuristically search this combinatorial space. Results showed that their method was effective in many cases, but the estimated time complexity of their method was $O(n^5)$. Here we use a modified version of the standard GA that employs individuals of different sizes to decrease the running time of the algorithm, such that it takes $O(n^4 \log^2 n)$ in the worst case, and empirical studies demonstrate that the running time is considerably lower in the average case. It is difficult to determine a guaranteed bound for our algorithm since it is stochastic; however, as our empirical results show, the method works well for smaller permutation problems.

The rest of this paper is organized as follows: in Section 2 our proposed method is explained and our modified genetic algorithm is described in detail by defining specific crossover and mutation operators. Results are presented in Section 3, and finally we conclude the paper with some discussion of our proposed method and a future open research problem.
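As an illustration of the breakpoint distance defined in the introduction, the following Python sketch frames an unsigned permutation with 0 and n + 1 and counts the neighboring pairs that are not consecutive numbers. The function name `breakpoint_distance` is an assumption made for this example and is not taken from the paper.

```python
def breakpoint_distance(perm):
    """Count breakpoints of an unsigned permutation of 1..n.

    The permutation is framed with 0 at the start and n + 1 at the end,
    as described in the text. A neighboring pair is an adjacency when the
    two values differ by exactly 1 (in either order, because the
    permutation is unsigned); every other pair is a breakpoint.
    """
    n = len(perm)
    framed = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)


identity = [1, 2, 3, 4, 5, 6]
example = [5, 1, 3, 6, 4, 2]
print(breakpoint_distance(identity))  # 0
print(breakpoint_distance(example))   # 7
```

The identity permutation yields zero, which matches the statement that sorting by reversals amounts to eliminating all breakpoints.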

978-1-4244-7835-4/11/$26.00 ©2011 IEEE
Fig. 1. Different crossover operators used in the method. (a) Regular 2-point crossover; (b) Absorption crossover; (c) Adjunction crossover.

II. METHOD

The method utilizes a genetic algorithm approach that searches the space of all possible sequences of reversals. Search is guided by two complementary objectives: minimizing the number of reversals in the current solution and minimizing the estimated number of reversals needed to complete the solution. Since each solution is a particular sequence of reversals, the first objective is simply the length of the solution. The second objective is more computationally intensive because all the reversals in the solution have to be applied before the number of breakpoints in the resulting permutation can be calculated. Before more details of the search method are given, we first define a reversal and a solution.

Given an example permutation $\Pi_t$ at step $t$ of the algorithm:

$$\begin{array}{cccccc} \pi_1 & \pi_2 & \pi_3 & \pi_4 & \pi_5 & \pi_6 \\ 5 & 1 & 3 & 6 & 4 & 2 \end{array}$$

then if a single reversal $\rho_{5,2}$ is applied, the permutation $\Pi_{t+1}$ would be:

$$\begin{array}{cccccc} \pi_1 & \pi_2 & \pi_3 & \pi_4 & \pi_5 & \pi_6 \\ 5 & 4 & 6 & 3 & 1 & 2 \end{array}$$

In general, a reversal $\rho^t_{i,j}$ at index $t$ in a solution designates two place indexes in the permutation, where $1 \le i, j \le n$, specifying that the segment of $\Pi_t$ between positions $i$ and $j$ is to be reversed to form $\Pi_{t+1}$. Let solution $k$ of length $m$ be a sequence of reversals $\chi^m_k = \langle \rho^1, \rho^2, \rho^3, \ldots, \rho^m \rangle$, so if the initial permutation is $\Pi_1$, then the permutation generated by solution $k$ is $\Pi_k = \Pi_1 \cdot \rho^1 \cdot \rho^2 \cdot \ldots \cdot \rho^m$. Let the number of breakpoints in solution $k$'s permutation $\Pi_k$ be $B(\Pi_k)$. The two objectives of the search are therefore to identify a solution $\chi^m_k$ where $m$ is minimized and $B(\Pi_k) = 0$.

The genetic algorithm attempts to solve this problem by balancing these two competing objectives in its fitness function. Given two individual solutions $\chi^x_i$ and $\chi^y_j$, $i$ is selected over $j$ probabilistically when $i$ dominates $j$, or in other words, when $x < y$ ($i$ is shorter than $j$) and $B(\Pi_i) < B(\Pi_j)$. The method follows the framework of the standard genetic algorithm, but uses special mutation and crossover operators that extend and shrink the number and arrangement of reversals in each solution:
• Create an initial random population of size $n \log n$ containing a diversity of short individuals and set it as the current population. The maximum length of each reversal sequence is set to 10% of the input permutation length, $n$.
• Evaluate the current population with the two objective functions; the result will be the breakpoint distance of each individual and its length.
• Select the best individuals (individuals with the smallest breakpoint distance and length based on domination) from the current population. These individuals can be of different sizes.
• Apply the modified crossover (see below) to the selected individuals to create new offspring.
• Apply the modified mutation (see below) to these offspring.
• Select between the newly generated offspring and the parents in the population to create a new population and set it as the current population.

The algorithm terminates with success when an individual is found where $B = 0$; the algorithm terminates with failure if the maximum number of allowable steps is exceeded. In this case the algorithm can be re-run with a new random population.

A. Genetic Search Operators

This work utilizes a distinct set of genetic operators that are tailored to this problem. Specifically, the length of the reversal sequences must be allowed to grow so that a solution has the potential to reduce the breakpoints to zero; conversely, the solution must be allowed to shrink to optimize the length of the reversal sequence. The crossover operators applied to two solutions $\chi^x_i$ and $\chi^y_j$ and the mutation operator applied to $\chi^z_k$ are briefly described below:

1) Regular Crossover: This crossover is the standard crossover operator, which can be a single-point or a multiple-point crossover. The operator is applicable only if $x = y$. An example of this crossover is depicted in Figure 1(a).

2) Absorption Crossover: This crossover operator is applicable only if $x \neq y$. Without loss of generality let $x < y$; then in this process, $i$ exchanges a subsequence with a random subsequence of $j$ of equal length. An example of this crossover is shown in Figure 1(b).

3) Adjunction Crossover: This crossover operator generates offspring that can be longer or shorter than their parents. It selects two random indexes in $i$ and $j$, then replaces a shorter sequence in one with a longer sequence from the other.

The selected index is chosen in a manner that enables the algorithm to generate offspring longer than their parents. An example of this crossover is depicted in Figure 1(c).

B. Modified Mutation

Our modified mutation operator probabilistically decreases the length of an individual or changes some values of that individual by applying the standard mutation operator. Given an individual solution $\chi^z_k = \langle \rho^1, \rho^2, \rho^3, \ldots, \rho^z \rangle$, the specialized operator first chooses two random reversals $\rho^i$ and $\rho^j$ within that individual, with $1 \le i < j \le z$. Let the indexes defining the reversal $\rho^i$ be $\langle a, b \rangle$ and the indexes defining the reversal $\rho^j$ be $\langle c, d \rangle$; then the new solution is formed by removing $b$ and $c$ and shifting all intervening reversal indexes. This operator effectively reduces the length of the solution by one.

C. Complexity of Method

The complexity of the method can easily be determined. The size of the population is maintained at $n \log n$ and the number of algorithm iterations is fixed at $n \log n$. It requires on average $O(m^2)$ steps to evaluate each solution and $\log n$ to select the next population, so the complexity will be $O(n^2 m^2 \log^2 n)$. Given that $m$ is bounded by $n$, since this is the maximum number of breakpoints a solution can contain, the final complexity is $O(n^4 \log^2 n)$.

III. EXPERIMENTS AND RESULTS

We conducted experiments on different permutations to compare the quality of solutions among competing heuristic methods: the 3/2 algorithm introduced in [7] and a GA approach described in [9]. For permutations of size up to approximately 110, our algorithm is effective and finds a short series of reversals needed for sorting the permutation. In this range, it outperforms both alternative algorithms. For longer permutations, the algorithm can terminate due to the imposed limit on iterations rather than by identifying a solution with no breakpoints. These solutions can easily be repaired to reduce the number of breakpoints to zero by using the naïve algorithm. In our experiments, the size of the population and the number of generations are both $n \log n$. The settings for the competing algorithms are the same as stated in their references. A sample run for a permutation of length 35 is depicted in Figure 2; this figure shows the process of minimizing the first objective. The performance of our proposed method for permutations of length less than 110 is compared with the two other methods in Figure 3.

Fig. 2. A sample run of the algorithm for a permutation with length 35, showing the progression of the best individual of the population.

Fig. 3. Quality of solutions found by the three methods for increasing size of problem. Note that the method described here produces on average solutions that are shorter by 20% than the 3/2 algorithm and on average 4% better than the previous GA method.

IV. DISCUSSION

Genome rearrangement is a challenging and important problem in bioinformatics. Inversions (reversals) are a common occurrence in genome rearrangement and therefore pose an important question for biologists who want to understand how species evolve from common ancestors, both under natural and artificial selection. Biologists seek the solution that involves the minimum number of genetic reversals, but there is no intrinsic reason why this sequence would be the one that actually occurred during evolution. However, the shortest solution can provide the clearest interpretation of the process and offers the chance for scientists to engineer mutations that lead to specific traits in organisms. Since finding this optimal solution is computationally intractable in general, heuristic solutions have been developed that find sub-optimal solutions whose length is close to the minimum.

In this paper, we proposed a modified multi-objective genetic algorithm to solve this problem. The key idea in the method is that the population contains individuals of diverse lengths and genetic operators manipulate both the length and the arrangement of reversal steps in the solution. Two complementary objectives are employed during the search process that measure both the actual length of the solution and the estimated number of steps needed to complete the solution, reminiscent of the A* method [10] used to find satisfactory solutions to many NP-hard problems.

A potential drawback of our method is its limited ability to solve long permutation problems. This deficiency could be explained by the high correlation between the reversals within individuals, which is a common problem in GA methods; the longer the individual, the higher the correlation between reversals due to accumulated crossover operations. This weakness may be overcome by specializing the operators further to exploit this correlation so that common sub-sequences become schema in the solutions and therefore accelerate problem solving. This issue is left as an open research problem.
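To make the two objectives recalled in this discussion concrete, the following Python sketch applies a sequence of reversals to a starting permutation, reports the pair (solution length, remaining breakpoints), and tests dominance as it is stated in Section II, that is, strictly better in both objectives. The names `apply_reversal`, `breakpoints`, `objectives`, and `dominates`, as well as the padded second solution, are assumptions made for illustration and are not the authors' implementation.

```python
def apply_reversal(perm, i, j):
    """Reverse the segment between 1-based positions i and j (inclusive)."""
    lo, hi = min(i, j), max(i, j)
    return perm[:lo - 1] + perm[lo - 1:hi][::-1] + perm[hi:]


def breakpoints(perm):
    """Breakpoints of an unsigned permutation framed by 0 and n + 1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)


def objectives(start, solution):
    """Return (length, remaining breakpoints) for a sequence of reversals."""
    perm = list(start)
    for i, j in solution:
        perm = apply_reversal(perm, i, j)
    return len(solution), breakpoints(perm)


def dominates(obj_a, obj_b):
    """Dominance as stated in Section II: strictly smaller in both objectives."""
    return obj_a[0] < obj_b[0] and obj_a[1] < obj_b[1]


start = [5, 1, 3, 6, 4, 2]
short = [(5, 2)]                    # the single reversal from the Section II example
padded = [(5, 2), (5, 2), (1, 1)]   # hypothetical: undoes itself, then a no-op

print(apply_reversal(start, 5, 2))  # [5, 4, 6, 3, 1, 2]
print(objectives(start, short))     # (1, 5)
print(objectives(start, padded))    # (3, 7)
print(dominates(objectives(start, short), objectives(start, padded)))  # True
```

Evaluating the second objective requires replaying every reversal in the solution, which is why the text describes it as the more computationally intensive of the two.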
ACKNOWLEDGMENT
The authors thank Dr. Minghui Jiang at Utah State University for his help in this research.
REFERENCES
[1] T. Dobzhansky and A. H. Sturtevant, “Inversions in the Chromosomes of Drosophila Pseudoobscura,” Genetics, vol. 23, no. 1, pp. 28–64, Jan. 1938. [Online]. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1209001/
[2] V. Bafna and P. A. Pevzner, “Genome Rearrangements and Sorting by Reversals,” SIAM J. Comput., vol. 25, no. 2, pp. 272–289, Feb. 1996. [Online]. Available: http://dx.doi.org/10.1137/S0097539793250627
[3] N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms (Computational Molecular Biology). The MIT Press, Aug. 2004. [Online]. Available: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0262101068
[4] H. Kaplan, R. Shamir, and R. E. Tarjan, “Faster and simpler algorithm for sorting signed permutations by reversals,” in Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, ser. SODA ’97. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997, pp. 344–351. [Online]. Available: http://portal.acm.org/citation.cfm?id=314318
[5] A. Caprara, “Sorting by reversals is difficult,” in Proceedings of the first annual international conference on Computational molecular biology, ser. RECOMB ’97. New York, NY, USA: ACM, 1997, pp. 75–83. [Online]. Available: http://dx.doi.org/10.1145/267521.267531
[6] J. Kececioglu and D. Sankoff. (1995) Exact and Approximation Algorithms for Sorting By Reversals, With Application to Genome Rearrangement. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.9970
[7] D. A. Christie, “A 3/2-approximation algorithm for sorting by reversals,” in Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, ser. SODA ’98. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1998, pp. 244–252. [Online]. Available: http://portal.acm.org/citation.cfm?id=314711
[8] P. Berman, S. Hannenhalli, and M. Karpinski. (2001) 1.375-Approximation Algorithm for Sorting by Reversals. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5673
[9] A. Auyeung, “Estimating Genome Reversal Distance by Genetic Algorithm,” The IEEE Congress on Evolutionary Computation, vol. 2003, pp. 1157–1161. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.75.6866
[10] P. Hart, N. Nilsson, and B. Raphael, “A Formal Basis for the Heuristic Determination of Minimum Cost Paths,” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, Feb. 1968. [Online]. Available: http://dx.doi.org/10.1109/TSSC.1968.300136
