Comparison and Selection of Exact and Heuristic Algorithms^1

Joaquín Pérez O.^1, Rodolfo A. Pazos R.^1, Juan Frausto S.^2, Guillermo Rodríguez O.^3, Laura Cruz R.^4, Héctor Fraire H.^4

^1 Centro Nacional de Investigación y Desarrollo Tecnológico (CENIDET)
AP 5-164, Cuernavaca, Mor. 62490, México
{jperez, pazos}@sd-cenidet.com.mx
^2 ITESM, Campus Cuernavaca, México
AP C-99, Cuernavaca, Mor. 62589, México
juan.frausto@itesm.mx
^3 Instituto de Investigaciones Eléctricas (IIE)
gro@iie.org.mx
^4 Instituto Tecnológico de Ciudad Madero, México
{hfraire, lcruzreyes}@prodigy.net.mx

^1 This research was supported in part by CONACYT and COSNET.

Abstract. The traditional approach for comparing heuristic algorithms uses well-known statistical tests to meaningfully relate the empirical performance of the algorithms and concludes that one outperforms the other. In contrast, the method presented in this paper builds a predictive model of the algorithms' behavior using functions that relate performance to problem size, in order to define dominance regions. This method first generates a representative sample of the algorithms' performance, then determines performance functions through a common and simplified regression analysis, and finally incorporates these functions into an algorithm selection mechanism. For testing purposes, a set of same-class instances of the database distribution problem was solved using an exact algorithm (Branch & Bound) and a heuristic algorithm (Simulated Annealing). Experimental results show that problem size affects the two algorithms differently, in such a way that there exist regions where one algorithm is more efficient than the other.

Keywords: Heuristic optimization, statistical tests.

1 Introduction

In the solution of many difficult combinatorial problems (such as the data distribution problem), both exact and heuristic algorithms have been used. Exact algorithms have been extensively studied and are considered adequate for moderately sized instances, whereas heuristic algorithms are considered promising for very large instances [1, 2, 3, 4]. To get the best of both, it is necessary to determine analytically for which problem sizes it is convenient to use an exact algorithm and when it is better to use a heuristic algorithm. However, the lack of mathematical methods for predicting the performance of these algorithms, especially the heuristic ones, hinders their evaluation and the selection of the best one for a given instance [5].
The theoretical study of heuristic algorithm performance, based on the average and worst cases, is generally difficult to carry out. Furthermore, since it describes algorithm performance in the limit, it does not help determine how well the algorithms perform on specific instances [6]. On the other hand, experimental analysis is certainly adequate for specific instances and is widely reported in scientific papers. However, it is often conducted very informally and does not satisfy minimal reproducibility standards; furthermore, results obtained through rigorous statistical methods are seldom reported [7].
Experimental studies that compare exact and heuristic algorithms are scarce and of limited practical utility. The most complete and up-to-date work is the one presented by Hoos and Stützle in [8], which focuses on the study of some of the best-known algorithms for the satisfiability problem, contrasting the performance of exact and heuristic algorithms on a wide range of problem instances. A similar study presented in [9] proposes a very general technique to model the growth of the search cost with respect to the problem size. These works, and those presented in [4, 10, 11], have contributed to improving the knowledge of the performance of specific algorithms on some classes of instances. However, for practical purposes it is still necessary to have a mechanism that allows selecting, from several algorithms, the one that solves the problem within the smallest available time.
In this paper a new method is proposed for (1) statistically evaluating and characterizing exact and heuristic algorithms using performance functions, which relate solution quality and processing time to problem size; and (2) choosing the most adequate algorithm for a specific instance. Experimental results corroborate that problem size is an important factor affecting algorithm performance, since there exist regions where each algorithm is more efficient than the other.

This paper is organized as follows. The application problem used to test our hypothesis is presented in Section 2, where the solution algorithms are also described. Section 3 describes a general method for the comparison and selection of exact and heuristic algorithms. The experimental results of applying the proposed selection method to the solution of the distribution design problem are described in Section 4.

2 The Distribution Design Problem


In this section the mathematical formulation of the data distribution problem, modeled by DFAR (distribution, fragmentation, allocation and reallocation), and the solution algorithms are described.
2.1 DFAR Mathematical Model
DFAR is an integer (binary) programming model. In this model, the decision about storing an attribute m in site j is represented by a binary variable x_mj. Thus x_mj = 1 if m is stored in j, and x_mj = 0 otherwise.

The objective function below models costs using four terms: transmission, access to several fragments, fragment storage, and fragment migration.
\min z = \sum_{k}\sum_{j} f_{kj} \sum_{m}\sum_{t} q_{km} l_{km} c_{jt} x_{mt}
       + c_1 \sum_{k}\sum_{j} f_{kj} \sum_{t} y_{kt}
       + c_2 \sum_{t} w_t
       + \sum_{m}\sum_{j}\sum_{t} a_{mj} c_{jt} d_m x_{mt}        (1)

where
f_{ki}  = emission frequency of query k from site i;
q_{km}  = usage parameter: q_{km} = 1 if query k uses attribute m, otherwise q_{km} = 0;
l_{km}  = number of packets for transporting attribute m for query k;
c_{ij}  = communication cost between sites i and j;
c_1     = cost for accessing several fragments to satisfy a query;
y_{kj}  = indicates whether query k accesses one or more attributes located at site j;
c_2     = cost for allocating a fragment to a site;
w_j     = indicates whether there exist attributes at site j;
a_{mi}  = indicates whether attribute m was previously located at site i;
d_m     = number of packets for moving attribute m to another site if necessary.

The model solutions are subject to five constraints: each attribute must be stored in one site only; each attribute must be stored in a site that executes at least one query using it; variables w_j and y_kj are forced to adopt values compatible with those of x_mj; and the storage capacity of each site must not be exceeded by the attributes stored there. The detailed description of this model can be found in [12, 13].
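To make the cost structure of objective (1) concrete, the following is a minimal sketch of how it could be evaluated for a candidate assignment. It assumes a dense NumPy representation; the function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dfar_cost(x, y, w, f, q, l, c, a, d, c1, c2):
    """Evaluate objective (1) for a candidate assignment (illustrative sketch).

    x[m, t]: 1 if attribute m is stored at site t        (decision variable)
    y[k, t]: 1 if query k accesses attributes at site t  (induced variable)
    w[t]:    1 if site t stores at least one attribute   (induced variable)
    f, q, l, c, a, d, c1, c2: parameters as defined for model (1).
    """
    transmission   = np.einsum('kj,km,km,jt,mt->', f, q, l, c, x)
    multi_fragment = c1 * np.einsum('kj,kt->', f, y)
    storage        = c2 * w.sum()
    migration      = np.einsum('mj,jt,m,mt->', a, c, d, x)
    return transmission + multi_fragment + storage + migration
```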
2.2 Solution Algorithms

Since the distribution problem modeled by DFAR is NP-complete [14], a heuristic method is needed. As the exact solution method, the Branch and Bound algorithm implemented in the Lindo 6.01 commercial software was used. As the heuristic method, a variation of the Simulated Annealing algorithm, known as Threshold Accepting, was implemented. In the cases reported in the specialized literature, this version consumes less computing time and generates solutions of better quality [15]. More details of the implementations are reported in [16].
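For reference, a generic sketch of Threshold Accepting is given below. The neighborhood, cost function, threshold schedule, and trial count are assumptions for illustration; they are not the tuned implementation reported in [16].

```python
def threshold_accepting(x0, neighbor, cost, thresholds, trials_per_level=100):
    """Generic Threshold Accepting, a deterministic-acceptance variant of
    Simulated Annealing: accept any move whose cost increase is below the
    current threshold; thresholds decrease toward zero."""
    x, best = x0, x0
    for T in thresholds:                 # e.g. a geometrically decreasing list
        for _ in range(trials_per_level):
            y = neighbor(x)              # random neighboring solution
            if cost(y) - cost(x) < T:    # also accepts slightly worse moves
                x = y
                if cost(x) < cost(best):
                    best = x
    return best
```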

3 Evaluation of Algorithms

In this section a statistical method is presented for comparing exact and heuristic algorithms. Additionally, the steps for estimating algorithm performance and selecting the best one are detailed.

3.1 Method for Comparison of Exact and Heuristic Algorithms

The following method was devised considering the notions presented in [17, 18]:

Step 1. Sampling. Obtain, through experimentation, a tabular description of each algorithm's behavior on instances of different sizes. Behavior examples are the deviation from the optimum and the processing time.
Step 2. Estimation. Find the estimation functions for the algorithm performance by applying to the tabular results a statistical treatment based on approximation techniques (for example, regression analysis).
Step 3. Algorithm Selection. Choose the best algorithm according to the problem size, using a selection algorithm based on the performance functions and the user requirements.
3.2 Measuring Performance

For the different solution methods we use two comparison criteria: CPU time and error percentage. We use CPU time to measure the processing cost, while for the quality criterion we use the percentage of deviation from the optimum. Both quantities were determined for instances of the same class, each of a different size. Unlike many works, we measure the size of the problem by counting the bytes occupied by all the problem parameters. We consider this measure more exact because it fully represents the data structure.
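As an illustration, if an instance is held as a collection of arrays, this size measure could be computed as follows (a sketch assuming a NumPy representation; the paper does not detail the actual byte-counting procedure):

```python
import numpy as np

def instance_size_bytes(parameters):
    """Problem size n: total bytes occupied by all problem parameters."""
    return sum(np.asarray(p).nbytes for p in parameters)

# Example: n = instance_size_bytes([f, q, l, c, a, d])  # arrays of model (1)
```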
Other research [8], oriented toward improving algorithms, used the following comparison strategies: a) run the algorithm until obtaining a solution very near the optimum; b) run the algorithm for a predetermined time. However, our work focuses on the characterization of algorithms for selection purposes, without seeking to modify them (they are black boxes). Under these circumstances, we let the algorithms run until their own termination criteria are satisfied.
3.4 Finding Performance Estimation Functions

As mentioned before, only two aspects will be considered for the performance characterization: the quality of the results and the computational effort required. In the sequel, the performance functions will be denoted by T(n) and E(n). The first is the efficiency function, which represents the relationship between problem size and the processing time of a specific algorithm, whereas the second is the efficacy function, which represents the relationship between problem size and the quality of the solution found by the algorithm. Quality is determined by the deviation from the optimum.

For test purposes a common and simplified method based on regression analysis was used. In this method n is the independent variable that represents the problem size, t is the random variable that represents the algorithm processing time, and r is the sample size.
Step 1. Generate a set of feasible polynomials, i.e. those whose degree g is small and smaller than the sample size:

E(n) = a_0 + a_1 n + a_2 n^2 + ... + a_g n^g,   1 ≤ g ≤ r-1
T(n) = b_0 + b_1 n + b_2 n^2 + ... + b_g n^g,   1 ≤ g ≤ r-1

Step 2. Calculate the coefficients a and b of the set of feasible polynomials using a fitting method such as least squares.
Step 3. Select the most adequate polynomial using statistical tests, which quantify its goodness in representing the relationship between performance and problem size. In order to increase the confidence level of the chosen function, three fit tests are recommended: estimation of the error variance, the global F test, and the Student t test. The first provides a preliminary assessment of the function's confidence; with the second, a subset of useful functions is obtained; and with the third, the usefulness of the candidate function's coefficients is determined. Table 1 presents the equations and conditions used to determine the goodness of fit of the efficiency polynomials; those for the efficacy polynomials are similar.
Table 1. Goodness Tests

Error Variance:
    s_e^2 = \sum_{i=1}^{r} (t_i - T(n_i))^2 / (r - (g+1))
    The polynomial is adequate if it has the smallest s_e value.

Global F Test:
    R^2 = 1 - \sum_i (t_i - T(n_i))^2 / \sum_i (t_i - \bar{t})^2
    F = (R^2 / g) / ((1 - R^2) / (r - (g+1)))
    The polynomial is useful if F > F_\alpha(g, r-(g+1))*.

Student t Test:
    t = b_i / se_{b_i},   0 ≤ i ≤ g,
    where se_{b_i} is the standard error of b_i calculated by least squares.
    Coefficient b_i is useful if t < -t_\alpha(r-(g+1))* or t > t_\alpha(r-(g+1))*.

* Tabular value of the corresponding distribution.
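The following sketch shows how Steps 1-3 and the tests of Table 1 could be carried out with NumPy/SciPy instead of the Matlab and Origin tools used in Section 4.2; the function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def fit_and_test(n, t, g, alpha=0.05):
    """Least-squares fit of a degree-g polynomial T(n), plus the goodness
    tests of Table 1 (error variance, global F test, Student t test)."""
    n, t = np.asarray(n, float), np.asarray(t, float)
    r = len(n)
    X = np.vander(n, g + 1, increasing=True)   # columns 1, n, ..., n^g
    b, *_ = np.linalg.lstsq(X, t, rcond=None)  # coefficients b_0 .. b_g
    resid = t - X @ b
    dof = r - (g + 1)
    se2 = resid @ resid / dof                  # error variance s_e^2
    R2 = 1.0 - (resid @ resid) / np.sum((t - t.mean()) ** 2)
    F = (R2 / g) / ((1.0 - R2) / dof)          # global F statistic
    F_ok = F > stats.f.ppf(1.0 - alpha, g, dof)
    se_b = np.sqrt(se2 * np.diag(np.linalg.inv(X.T @ X)))
    t_ok = np.abs(b / se_b) > stats.t.ppf(1.0 - alpha / 2.0, dof)
    return b, se2, F_ok, t_ok

# Usage idea: fit every degree 1 <= g <= r-2 (keeping the degrees of freedom
# positive), discard fits failing the F and t tests, and select the surviving
# polynomial with the smallest s_e^2.
```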

3.5 Selection Algorithm

In this algorithm, ε_t and ε_e are the allowed tolerances for the available processing time and the algorithm error. Additionally, n is the problem size, T(n) is the efficiency function, E(n) is the efficacy function, and f(I) is the solution algorithm expressed as a function that maps a problem instance I to a solution x. Consequently, with f_E(I) the exact solution is found, and with f_A(I) the approximate solution is obtained.

The first part of the selection algorithm is for problems in which suboptimal solutions are unacceptable. In this case it is necessary to check whether the processing time predicted by T_E(n) lies within the tolerance interval defined by ε_t. If so, the solution x of an instance I is found using the exact method; otherwise the algorithm ends without a result.

The second part is for problems where suboptimal solutions are permitted. The first action is to verify whether the exact method is adequate; if not, the heuristic method is evaluated. The exact method is adequate if the processing time predicted by T_E(n) lies within the tolerance given by ε_t, in which case the solution x is found using it. Otherwise, the heuristic method is adequate when both the predicted processing time and the predicted error lie within their tolerance intervals; in this case an initial solution x is obtained using the heuristic method, and from it new, possibly better solutions are generated through successive runs. The number of runs is determined by dividing the tolerance time by the estimated processing time. Finally, if neither method is adequate, the algorithm ends without a result. The algorithm just described is the following:

Algorithm
Begin
    real ε_t, ε_e;                      // tolerances
    integer n;                          // problem size
    if only an optimal solution is acceptable then
        if T_E(n) ≤ ε_t then
            x = f_E(I);
        else finish without solution
        end_if
    else
        if T_E(n) ≤ ε_t then
            x = f_E(I);
        else
            if T_A(n) ≤ ε_t and E_A(n) ≤ ε_e then
                x = f_A(I);
                for i = 2 to ε_t / T_A(n)
                    y = f_A(I);
                    if z(y) < z(x) then
                        x = y
                    end_if
                end_for
            else finish without solution
            end_if
        end_if
    end_if
End
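An equivalent executable sketch is given below; the function names, the calling convention, and the optimal_only flag are illustrative assumptions.

```python
def select_and_solve(I, n, eps_t, eps_e, T_E, T_A, E_A, f_E, f_A, z,
                     optimal_only=False):
    """Select and run the most adequate algorithm for instance I of size n.
    T_E, T_A: fitted efficiency functions; E_A: fitted efficacy function;
    f_E, f_A: exact and approximate solvers; z: objective value of a solution."""
    if T_E(n) <= eps_t:              # exact method fits the time tolerance
        return f_E(I)
    if optimal_only:
        return None                  # finish without solution
    if T_A(n) <= eps_t and E_A(n) <= eps_e:
        x = f_A(I)                   # initial heuristic solution
        for _ in range(2, int(eps_t / T_A(n)) + 1):
            y = f_A(I)               # successive independent runs
            if z(y) < z(x):          # keep the best solution found so far
                x = y
        return x
    return None                      # finish without solution
```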

4 Experimental Results

4.1 Results of Algorithm Behavior

In order to obtain the tabular description of algorithm behavior, 40 experiments were conducted for each instance. Seventeen instances, covering a wide size range and with known optimal solutions, were generated. They belong to the same class and were mathematically obtained using the Uncoupled Components Method [19].

Each test instance was solved using the Branch&Bound algorithm and the Threshold Accepting algorithm. Tables 2 and 3 show a subset of the results of these tests. The second and third columns of Table 2 show the problem sizes for the test cases, while the last two columns show the performance results. Table 3 shows the results obtained using the Threshold Accepting algorithm. The percentage difference with respect to the optimum is shown in columns two through four. The last column shows the execution time of the algorithm.
Table 2. Exact Solution Using Branch&Bound

Instance   Sites   Queries   Optimal Value   Time (sec.)
I1             2         2             302          0.05
I2            18        18            2719          1.15
I3            20        20            3022          3.29
I4            32        32           *4835            **
I5            64        64           *9670            **
I6           128       128          *19340            **
I7           256       256          *38681            **
I8           512       512          *77363            **

* Solution value determined using the Uncoupled Components Method.
** Problem size exceeds algorithm implementation capacity.

Table 3. Approximate Solution Using Threshold Accepting

              % Difference (deviation from optimal)
Instance      Best     Worst   Average        Time (sec.)
I1               0         0         0              0.03
I2               0       141        10              0.3
I3               0         0         0              0.4
I4               0        78         4              1.2
I5               0       100        20              6.1
I6               0       140        36             43.6
I7               0       405        88            381.2
I8              66       383       215           3063.4

4.2 Performance Functions

The procedure for obtaining the performance functions was implemented with the mathematical package Matlab and the statistical software Origin. Tables 4 and 5 show the best polynomial functions found by applying the regression method to the Threshold Accepting and Branch&Bound algorithms.

Table 4. Polynomial Functions for Efficiency

Threshold Accepting:
    T(n) = -0.31458651 + 6.7247624E-5 n + 4.3424044E-10 n^2 - 6.1504908E-17 n^3
Branch&Bound:
    T(n) = 0.0036190847 + 4.4856655E-4 n - 4.4942872E-7 n^2 + 2.5914131E-10 n^3
           - 5.4339889E-14 n^4 + 2.5641303E-18 n^5 + 2.4019059E-22 n^6
Table 5. Polynomial Functions for Efficacy (Threshold Accepting)

Large instances (best case):
    E(n) = -0.45443448 + 2.2522053E-5 n
Large instances (average case):
    E(n) = 2.0494162 + 1.1588697E-4 n - 1.8006418E-11 n^2 + 6.3173862E-19 n^3
Large instances (worst case):
    E(n) = 56.62439 + 0.00011 n
Random problems (average case):
    E(n) = -0.23663 + 0.00049 n

Figure 1 shows the efficiency functions for both algorithms. For Branch&Bound a sixth-degree polynomial was obtained, whereas a third-degree polynomial was found for Threshold Accepting. Notice that for small instances the first outperforms the second, whereas for large instances the situation is just the opposite; there is a crossing point between the two functions.
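Using the coefficients of Table 4, such a crossing point can be located numerically as a positive real root of the difference between the two fitted polynomials. A sketch follows (the sign of the Threshold Accepting cubic term is taken from Table 4 as reconstructed above):

```python
import numpy as np

# Coefficients from Table 4, listed lowest degree first.
ta = [-0.31458651, 6.7247624e-5, 4.3424044e-10, -6.1504908e-17]
bb = [0.0036190847, 4.4856655e-4, -4.4942872e-7, 2.5914131e-10,
      -5.4339889e-14, 2.5641303e-18, 2.4019059e-22]

# np.poly* routines expect highest-degree-first coefficient order.
diff = np.polysub(np.array(bb[::-1]), np.array(ta[::-1]))
roots = np.roots(diff)
crossings = sorted(r.real for r in roots
                   if abs(r.imag) < 1e-9 and r.real > 0)
print(crossings)  # candidate dominance-region boundaries (problem size in bytes)
```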

[Figure: fitted efficiency curves for Branch&Bound (experimental results from a B&B run) and Threshold Accepting (experimental results from a set of TA runs), plotted against problem size n in bytes.]

Fig. 1. Graph of the Efficiency Functions

[Figure: fitted efficacy curves for the worst, average, and best cases, with experimental results from one run and from a set of runs, plotted against problem size n in bytes.]

Fig. 2. Graph of the Efficacy Functions for the Threshold Accepting Algorithm
Due to the large spread of the efficacy results for the Threshold Accepting algorithm, three polynomials were determined (Figure 2). For the best, average, and worst cases, the resulting polynomials were of first, third, and first degree, respectively.

5 Final Remarks

This paper shows that by finding the performance functions that characterize exact and heuristic algorithms, it is possible to automatically determine the most adequate algorithm for a given problem size. The characterization also helps us better understand their behavior. For example, it defines regions in which one algorithm outperforms the other, as opposed to the traditional approaches, which oversimplify algorithm evaluation; i.e., they claim that one algorithm outperforms the other in all cases, which is not always true.

For demonstration purposes, the performance functions for Branch&Bound and Simulated Annealing were obtained when applied to the solution of the database distribution problem modeled by DFAR. The experimental results show that Branch&Bound is satisfactory for small problems, Simulated Annealing is promising for large problems, and there exists a crossing point that divides the two regions.

Future plans for research include the following: exploring probability distributions for characterizing the behavior of the two types of algorithms, exact and heuristic; and integrating our work with another model, developed by us, for selecting the best among different heuristic algorithms.

References

1. G. Murty: Operations Research: Deterministic Optimization Models. New Jersey: Prentice Hall (1995) 581.
2. R.K. Ahuja, A. Kumar, K. Jha: Exact and Heuristic Algorithms for the Weapon Target Assignment Problem. Working paper (2003).
3. J. Gu: Efficient Local Search for Very Large-Scale Satisfiability Problems. SIGART Bulletin (1992) 3:8-12.
4. B. Selman, H.A. Kautz, B. Cohen: Noise Strategies for Improving Local Search. Proceedings of AAAI-94, MIT Press (1994) 337-343.
5. C. Papadimitriou, K. Steiglitz: Combinatorial Optimization: Algorithms and Complexity. New Jersey: Prentice-Hall (1982) 496.
6. B.J. Borghetti: Inference Algorithm Performance and Selection Under Constrained Resources. MS Thesis, AFIT/GCS/ENG/96D-05 (1996).
7. J.N. Hooker: Testing Heuristics: We Have It All Wrong. Journal of Heuristics (1996).
8. H.H. Hoos, T. Stützle: Systematic vs. Local Search for SAT. Journal of Automated Reasoning, Vol. 24 (2000) 421-481.
9. I.P. Gent, E. MacIntyre, P. Prosser, T. Walsh: The Scaling of Search Cost. Proceedings of AAAI-97, MIT Press (1997) 315-320.
10. D.S. Johnson, M.A. Trick (eds.): Clique, Coloring, and Satisfiability. DIMACS Series on Discrete Mathematics and Theoretical Computer Science, AMS, Vol. 16 (1996).
11. A. Davenport: A Comparison of Complete and Incomplete Algorithms in the Easy and Hard Regions. Workshop on Studying and Solving Really Hard Problems, CP-95 (1995).
12. Pérez, J., Pazos, R.A., Frausto, J., Romero, D., Cruz, L.: Vertical Fragmentation and Allocation in Distributed Databases with Site Capacity Restrictions Using the Threshold Accepting Algorithm. Lecture Notes in Computer Science, Vol. 1793. Springer-Verlag (2000) 75-81.
13. Pérez, J., Pazos, R.A., Romero, D., Santaolaya, R., Rodríguez, G., Sosa, V.: Adaptive and Scalable Allocation of Data-Objects in the Web. Lecture Notes in Computer Science, Vol. 2667. Springer-Verlag (2003) 134-143.
14. J. Pérez, R. Pazos, D. Romero, L. Cruz: Análisis de Complejidad del Problema de la Fragmentación Vertical y Reubicación Dinámica en Bases de Datos Distribuidas [Complexity Analysis of the Vertical Fragmentation and Dynamic Reallocation Problem in Distributed Databases]. 7th International Congress on Computer Science Research, Cd. Madero (2000) 63-70.
15. L. Morales, R. Garduño, D. Romero: The Multiple-Minima Problem in Small Peptides Revisited: The Threshold Accepting Approach. Journal of Biomolecular Structure & Dynamics, Vol. 9, No. 5 (1992) 951-957.
16. Pérez, J., Pazos, R.A., Velez, L., Rodriguez, G.: Automatic Generation of Control Parameters for the Threshold Accepting Algorithm. Lecture Notes in Computer Science, Vol. 2313. Springer-Verlag (2002) 119-127.
17. R. Scheaffer, J. McClave: Probabilidad y Estadística para Ingeniería [Probability and Statistics for Engineering]. Tr. V. González, Grupo Editorial Iberoamérica (1990) 690.
18. R. Walpole, R. Myers: Probabilidad y Estadística [Probability and Statistics]. Tr. G. Maldonado, McGraw-Hill (1990) 797.
19. L. Cruz: Automatización del Diseño de la Fragmentación Vertical y Ubicación en Bases de Datos Distribuidas Usando Métodos Heurísticos y Exactos [Automating the Design of Vertical Fragmentation and Allocation in Distributed Databases Using Heuristic and Exact Methods]. M.S. thesis, Instituto Tecnológico y de Estudios Superiores de Monterrey (1999) 116.
