Enhancing data parallelism on Ant Colony Optimization on GPUs
José M. Cecilia, José M. García
Computer Architecture Department
University of Murcia
30100 Murcia, Spain

Manuel Ujaldón
Computer Architecture Department
University of Málaga
29071 Málaga, Spain
Abstract
Ant Colony Optimisation (ACO) is an effective population-based metaheuristic for the solution of a wide variety of problems. As a population-based algorithm, its computation is intrinsically massively parallel, and it is therefore theoretically well-suited for implementation on Graphics Processing Units (GPUs). The ACO algorithm comprises two main stages: Tour construction and Pheromone update. The former has been previously implemented on the GPU, using a task-based parallelism approach. However, up until now, the latter has always been implemented on the CPU. In this paper, we discuss several parallelisation strategies for both stages of the ACO algorithm on the GPU. We propose an alternative data-based parallelism scheme and selection procedure, namely I-Roulette, for the Tour construction, which fits the GPU architecture better. We also describe novel GPU programming strategies for the Pheromone update stage. Our results show a total speed-up exceeding 21x for the Tour construction stage, and 20x for the Pheromone update stage, compared to their sequential counterparts, and suggest that ACO is a potentially fruitful area for future research in the GPU domain.
Keywords: Metaheuristics, GPU programming, Ant Colony Optimization,
TSP, Performance analysis
1. Introduction
Ant Colony Optimisation (ACO) [? ] is a population-based search
method inspired by the behaviour of real ants. It may be applied to a wide
range of hard problems [? ? ], many of which are graph-theoretic in nature.
It was first applied to the Travelling Salesman Problem (TSP) [? ] by Dorigo
and colleagues, in 1991 [? ? ].
In essence, simulated ants construct solutions to the TSP in the form of
tours. The artificial ants are simple agents which construct tours in a parallel,
probabilistic fashion. They are guided in this task by simulated pheromone
trails and heuristic information. Pheromone trails are a fundamental component of the algorithm, since they facilitate indirect communication between
agents via their environment, a process known as stigmergy [? ]. A detailed
discussion of ant colony optimization and stigmergy is beyond the scope of
this paper, but the reader is directed to [? ] for a comprehensive overview.
ACO algorithms are population-based, in that a collection of agents collaborates to find an optimal (or even satisfactory) solution. Such approaches
are naturally suited to parallel processing, but their success strongly depends
on both the nature of the particular problem and the underlying hardware
available. Several parallelisation strategies have been proposed for the ACO
algorithm, on both shared and distributed memory architectures [? ? ? ].
The Graphics Processing Unit (GPU) is a major current theme of interest in the field of high performance computing. GPUs follow in the footsteps of earlier throughput-oriented processor designs but have achieved far broader use in commodity machines, mainly motivated by the needs of real-time computer graphics, which has made it possible for GPUs to become mass-market devices [? ]. For instance, since late 2006, NVIDIA has shipped almost 220 million CUDA-capable GPUs, several orders of magnitude more than historical massively parallel architectures such as the CM-2, MasPar and Goodyear MPP (Massively Parallel Processor) machines [? ].
5. We validate the quality of the solution obtained by our GPU algorithms, comparing it to the quality of the solution obtained by the
sequential code given in [? ].
A preliminary and partial version of this work was presented in [? ]. Here, we significantly extend that work with a more formal description of the algorithm design on GPUs, an extensive analysis of our proposals on NVIDIA's Fermi architecture for both algorithmic stages, and a validation of the solution quality of our proposals. Moreover, we extend the evaluation process by adding several benchmarks from the TSPLIB library.
The paper is organised as follows. We briefly introduce Ant Colony Optimisation for the TSP and Compute Unified Device Architecture (CUDA)
from NVIDIA in Section ??. In Section ?? we present GPU designs for
both main stages of the ACO algorithm. Experimental methodology is introduced in Section ?? before we show the performance evaluation of our
algorithm in Section ??. Finally, parallelization strategies for the ACO algorithm previously presented in the literature are described in Section ??,
before concluding with a brief discussion and consideration of future work.
2. Background
2.1. Ant Colony Optimisation for the Traveling Salesman Problem
The Traveling Salesman Problem (TSP) [? ] involves finding the shortest
(or cheapest) round-trip route that visits each of a number of cities exactly once. The symmetric TSP on n cities may be represented as a complete weighted graph, G, with n nodes, with each weighted edge, e_{i,j}, representing the inter-city distance d_{i,j} = d_{j,i} between cities i and j. The TSP is a well-known NP-hard optimisation problem, and is used as a standard benchmark for many heuristic algorithms [? ].
The TSP was the first problem solved by Ant Colony Optimisation (ACO)
[? ? ]. This method uses a number of simulated ants (or agents), which
perform a distributed search on a graph. Each ant moves through the graph until it completes a tour, and then offers this tour as its suggested solution. In order to do this, each ant may drop pheromone on the edges contained in its proposed solution. The amount of pheromone dropped, if any, is determined by the quality of the ant's solution relative to those obtained by the other ants. The ants probabilistically choose the next city to visit, based on heuristic information obtained from inter-city distances and
the net pheromone trail. Although such heuristic information drives the ants
towards an optimal solution, a process of evaporation is also applied in
order to prevent the process stalling in a local minimum.
The Ant System (AS) is an early variant of ACO, first proposed by Dorigo
[? ]. The AS algorithm is divided into two main stages: Tour construction
and Pheromone update. Tour construction is based on m ants building tours
in parallel. Initially, ants are randomly placed. At each construction step,
each ant applies a probabilistic action choice rule, called the random proportional rule, in order to decide which city to visit next. The probability for
ant k, placed at city i, of visiting city j is given by the equation ??
pki,j = P
[i,j ] [i,j ]
lNik
[i,l ] [i,l ]
if j Nik ,
(1)
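In the usual Ant System notation, τ_{i,j} is the pheromone level on edge (i, j), η_{i,j} = 1/d_{i,j} is the heuristic value, α and β weight their relative influence, and N_i^k is the set of cities ant k may still visit. Purely as an illustration (not the authors' code), a minimal C sketch of one construction step with classic roulette-wheel selection could look as follows:

#include <stdlib.h>
#include <math.h>

/* Minimal sketch of one construction step for an ant standing at city i,
 * following Equation (1). tau and eta are flattened n x n matrices;
 * visited[] marks the cities already in the ant's tour, and at least one
 * city is assumed to remain unvisited. */
int next_city(int i, int n, const double *tau, const double *eta,
              const int *visited, double alpha, double beta)
{
    double *num = (double *) malloc(n * sizeof(double));
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        /* Numerator of Equation (1); zero for cities outside N_i^k. */
        num[j] = visited[j] ? 0.0
                            : pow(tau[i * n + j], alpha) * pow(eta[i * n + j], beta);
        sum += num[j];
    }
    /* Classic roulette-wheel draw on (0, sum]. */
    double r = ((double) rand() + 1.0) / ((double) RAND_MAX + 1.0) * sum;
    int j = -1;
    double acc = 0.0;
    do {
        j++;
        acc += num[j];
    } while (acc < r && j < n - 1);
    free(num);
    return j;
}

The accumulation loop over num[] is inherently sequential, which is precisely the property that motivates the alternative I-Roulette selection procedure discussed later for the GPU.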
After all ants have constructed their tours, the pheromone trails are updated. This is achieved by first lowering the pheromone value on all edges by a constant factor, and then adding pheromone on edges that ants have crossed in their tours. Pheromone evaporation is implemented by
\tau_{i,j} \leftarrow (1 - \rho)\,\tau_{i,j}, \quad \forall (i,j) \in L,     (2)

where ρ is the pheromone evaporation rate and L is the set of edges. After evaporation, all ants deposit pheromone on the edges they have crossed in their tours:

\tau_{i,j} \leftarrow \tau_{i,j} + \sum_{k=1}^{m} \Delta\tau_{i,j}^{k}, \quad \forall (i,j) \in L,     (3)
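For reference, the following sequential C sketch implements Equations (2) and (3); the array names are our assumption, and the deposit Δτ^k_{i,j} = 1/len[k] (the inverse of ant k's tour length) is the usual Ant System choice:

/* Sequential sketch of Equations (2) and (3): tau is a flattened n x n
 * pheromone matrix, tours holds m tours of n + 1 cities each (closing the
 * cycle), and len[k] is the length of tour k. */
void pheromone_update(double *tau, int n, int m,
                      const int *tours, const double *len, double rho)
{
    /* Evaporation on every edge, Equation (2). */
    for (int e = 0; e < n * n; e++)
        tau[e] *= (1.0 - rho);

    /* Deposit, Equation (3): each ant adds 1/len[k] on the edges of its tour. */
    for (int k = 0; k < m; k++) {
        double delta = 1.0 / len[k];
        for (int s = 0; s < n; s++) {
            int a = tours[k * (n + 1) + s];
            int b = tours[k * (n + 1) + s + 1];
            tau[a * n + b] += delta;
            tau[b * n + a] += delta;   /* symmetric TSP instances */
        }
    }
}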
list are stored once per thread block (i.e., queen ant) instead of once per thread (i.e., worker ant).
The number of threads per thread-block on CUDA is a hardware limiting
factor. Thus, the cities should be distributed among threads to allow for
a flexible implementation. A tiling technique is proposed to deal with this
issue. Cities are divided into blocks (i.e. tiles). For each tile, a city is selected
stochastically, from the set of unvisited cities on that tile. When this process
has completed, we have a set of partial best cities. Finally, the city with
the best absolute heuristic value is selected from this partial best set.
The tabu list information can be placed in the register file (since it represents information private to each thread). However, the tabu list cannot
be represented by a single integer register per thread in the tiling version,
because, in that case, a thread represents more than one city. Instead, the 32-bit registers may be used on a bitwise basis for managing the list. The first city represented by each thread (i.e., on the first tile) is managed by bit 0 of the register that holds the tabu list, the second city is managed by bit 1, and so on.
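A device-side sketch of this bitwise tabu list is shown below; the helper names and the mapping of cities to tiles are our assumptions, but the idea follows the text: worker-ant thread t of a queen-ant block handles city t + tile * blockDim.x on every tile, so bit `tile` of its private register records whether that city has been visited (at most 32 tiles per thread).

/* Bitwise tabu list kept in one 32-bit register per worker-ant thread. */
__device__ __forceinline__ int is_visited(unsigned int tabu, int tile)
{
    return (tabu >> tile) & 1u;
}

__device__ __forceinline__ unsigned int mark_visited(unsigned int tabu, int tile)
{
    return tabu | (1u << tile);
}

/* Schematic use inside the tour-construction kernel:
 *
 *   unsigned int tabu = 0;                              // all cities unvisited
 *   for (int tile = 0; tile < numTiles; ++tile) {
 *       int city = tile * blockDim.x + threadIdx.x;
 *       if (city < n && !is_visited(tabu, tile)) {
 *           // consider `city` as a candidate for this construction step
 *       }
 *   }
 *   // once this thread's city is chosen:  tabu = mark_visited(tabu, tile);
 */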
3.2.3. I-Roulette: An alternative selection method
The roulette wheel is a fully sequential stochastic selection process, and it is hard to parallelise. To implement this on CUDA, a particular thread is
Figure 3: An alternative method for increasing the parallelism on the selection process.
orders of magnitude.
We note that an ant's tour length may be larger than the maximum number of threads that each thread block can support (see Table ??). Our algorithm prevents this situation by setting our empirically determined optimum thread block layout, and dividing the tour into tiles of this length. This raises another issue when n + 1 is not divisible by this tile length. We solve it by padding the ant's tour array to avoid warp divergence (see Figure ??).
Unnecessary loads to device memory can be avoided by taking advantage of the problem's nature. We focus on the symmetric version of the TSP, so the number of threads can be reduced by half, thus halving the number of device memory accesses. This so-called Reduction version actually reduces the overall number of accesses to either shared or device memory by having half the number of threads compared to the previous version. This is also combined with tiling, as previously explained. The number of accesses per thread remains the same, so the total number of device memory accesses is halved as well.
4. Experimental methodology
4.1. Hardware features
Table 1: CUDA and hardware features for the Tesla C2050.

GPU element                   Feature               Tesla C2050
Streaming processors          Cores per SM          32
(GPU cores)                   Number of SMs         14
                              Total SPs             448
                              Clock frequency       1 147 MHz
Maximum number of threads     Per multiprocessor    1 536
                              Per block             1 024
                              Per warp              32
SRAM memory available         32-bit registers      32 K
per multiprocessor            Shared memory         16/48 KB
                              L1 cache              48/16 KB
                              (Shared + L1)         64 KB
Global (video) memory         Size                  3 GB
                              Speed                 2x1500 MHz
                              Width                 384 bits
                              Bandwidth             144 GB/s
                              Technology            GDDR5
The hardware evaluation platforms are: (1) a dual-socket 2.40 GHz quad-core Intel Xeon E5620 (Westmere) processor, and (2) an NVIDIA Tesla C2050 based on the Fermi architecture released in November 2010 (see main features in Table ??) [? ]. In addition to a larger number of CUDA cores in each Streaming Multiprocessor (SM), a fourfold increase over prior SM designs, Fermi improves GPU capabilities with additional features such as enhanced single-precision floating-point accuracy (implementing the IEEE 754-2008 floating-point standard), improved double-precision floating-point performance, new general-purpose L1 and L2 caches, faster context switching, a unified 64-bit virtual address space, a new instruction set, and Error Correction Code (ECC) memory support to enhance data integrity in high performance computing. The same on-chip memory is shared between the L1 cache and shared memory, and the split is configurable per kernel through the cudaFuncSetCacheConfig() call of the CUDA runtime.
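For illustration, the on-chip split can be requested per kernel as follows (the kernel symbol is a hypothetical placeholder):

#include <cuda_runtime.h>

/* Hypothetical kernel used only to illustrate the call; body omitted. */
__global__ void pheromone_update_kernel(float *tau, int n) { }

void configure_on_chip_memory(void)
{
    /* Request the 48 KB L1 / 16 KB shared split for this kernel;
     * cudaFuncCachePreferShared would request the opposite 16/48 split. */
    cudaFuncSetCacheConfig(pheromone_update_kernel, cudaFuncCachePreferL1);
}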
Finally, we use gcc 4.3.4 with the -O3 flag to compile our CPU implementations, and CUDA compilation tools, release 3.2, for our GPU implementations.
4.2. Our benchmark
We test our designs using a set of benchmark instances from the well-known TSPLIB library [? ] [? ]. All benchmark instances are defined on a complete graph, and all distances are defined to be integer numbers. Table ?? shows the list of all targeted benchmark instances, with information on the number of cities, the type of distance and the length of the optimal tour. ACO parameters such as the number of ants m, α, β, ρ, and so on are set according to the values recommended in [? ]. The most important parameter for the scope of this study is the number of ants, which is set to m = n (n being the number of cities), with α = 1, β = 2 and ρ = 0.5.
Table 2: Description of the benchmark instances from the TSPLIB library (EUC 2D: 2-dimensional Euclidean distances).

Name       Cities
d198       198
a280       280
lin318     318
pcb442     442
rat783     783
pr1002     1002
pcb1173    1173
d1291      1291
pr2392     2392
5. Performance Evaluation
In this section we analyse in depth both main stages of the ACO algorithm: Tour construction and Pheromone update. We compare our implementations with the sequential code, written in ANSI C, provided by Stützle in [? ]. The performance figures are recorded for a single iteration and averaged over 100 iterations. In this work we focus on the computational characteristics of the
Ant System and how it can be efficiently implemented on the GPU. However, to guarantee the correctness of our algorithms, we also provide a quality comparison between the sequential and GPU codes, normalised to the optimum solution, for a variety of benchmarks; all algorithms are run for a fixed number of iterations (1000) and averaged over 5 independent runs. The experiments use single-precision arithmetic.
5.1. Evaluation of tour construction stage
We now evaluate the tour construction stage on the Tesla C2050 under different aspects: the performance impact of using costly arithmetic instructions in the choice_info kernel, a comparison versus a CPU counterpart, the degree of improvement attained through the data-based approach, the speed-up factor obtained when using the different on-chip GPU memories available on the Fermi architecture, and the performance benefits of changing the selection process. We address each of these issues separately.
5.1.1. Choice info kernel evaluation
Table 3: Execution times (milliseconds) on the GPU system for the choice_info kernel, using both CUDA instructions: powf() and __powf(). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB benchmark    powf()    __powf()
d198                0.038     0.013 (2.88x)
a280                0.065     0.020 (3.18x)
lin318              0.082     0.024 (3.33x)
pcb442              0.147     0.042 (3.47x)
rat783              0.441     0.117 (3.77x)
pr1002              0.719     0.188 (3.82x)
pcb1173             0.981     0.254 (3.86x)
d1291               1.185     0.307 (3.86x)
pr2392              4.042     1.039 (3.88x)
We first evaluate the choice_info kernel, before assessing the impact of including various modifications to the tour construction. Table ?? shows the performance of the choice_info kernel. It is drastically affected by the use of costly math functions like powf(). However, there are analogous CUDA functions that map directly to the hardware, such as __powf(), although they may provide somewhat lower accuracy [? ]. After PTX inspection, this kernel presents four accesses to global memory. The empirical streaming bandwidth obtained by this kernel on the Tesla C2050 is up to 90 GB/s using __powf() and up to 23 GB/s using powf(). These numbers translate into up to 67 GFLOPS and 17 GFLOPS respectively, counting both instructions, powf() and __powf(), as a single floating point operation. We use 256-thread blocks, with one thread per entry of the choice_info data structure, to reach the best performance. These particular values minimise non-coalesced memory accesses and yield high occupancy.
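A sketch of such a kernel is given below; the data layout and parameter names are our assumptions, but it illustrates the one-thread-per-entry mapping and the powf()/__powf() trade-off discussed above.

/* Each thread fills one entry of the flattened n x n choice_info matrix with
 * [tau]^alpha * [eta]^beta, where eta = 1/distance. __powf() is the
 * hardware-mapped, lower-accuracy alternative to powf(). */
__global__ void choice_info_kernel(const float *tau, const float *dist,
                                   float *choice, int n,
                                   float alpha, float beta)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* one entry per thread */
    if (idx < n * n) {
        float eta = 1.0f / dist[idx];   /* heuristic value; diagonal entries are
                                           assumed to be handled elsewhere */
        choice[idx] = __powf(tau[idx], alpha) * __powf(eta, beta);
    }
}

Launched with 256-thread blocks and enough blocks to cover the n x n entries, this matches the configuration described above; replacing __powf() with powf() corresponds to the slower column of the table.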
5.1.2. Tuning our data-based approach
For the data-based approach, the number of worker ants (threads) per queen ant (thread block) is a degree of freedom, which is analysed in Table ??. The 128-thread block configuration reaches our best performance in all benchmark instances. This particular value is well suited for developing high-throughput applications on GPUs. Notice that some configurations are not allowed (n.a.) because either the number of worker ants is larger than the number of cities, or the number of cities divided by the number of worker ants is larger than 32, which is the maximum number of cities that each worker ant is able to manage. The tabu list of each queen ant is divided among its worker ants and placed, on a bit basis, in a single register per thread.
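Schematically, the resulting launch configuration looks as follows (tour_kernel and its argument list are hypothetical placeholders):

/* Hypothetical construction kernel; its body (tiling, selection, tabu
 * bookkeeping) is omitted here. */
__global__ void tour_kernel(const float *choice, int *tours, int n,
                            int tiles, unsigned int *seeds) { }

void launch_tour_construction(const float *choice, int *tours, int n,
                              unsigned int *seeds)
{
    int m = n;                                 /* one ant per city (Section 4.2)   */
    int workers = 128;                         /* best-performing worker-ant count */
    int tiles = (n + workers - 1) / workers;   /* tiles each worker ant iterates   */
    tour_kernel<<<m, workers>>>(choice, tours, n, tiles, seeds);
}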
Table 4: Execution times (milliseconds) on the Tesla C2050. We vary the number of worker ants and the benchmark instance (n.a. means not available due to register constraints).

Worker ants    d198     a280     lin318    pcb442    rat783    pr1002    pr2392
16             10.39    30.06    38.59     101.09    n.a.      n.a.      n.a.
32             6.85     18.68    23.89     62.77     357.24    749.04    n.a.
64             5.06     12.78    15.51     41.17     235.86    474.79    6083.96
128            4.32     11.94    15.18     38.73     207.08    391.59    5092.27
256            n.a.     15.94    20.89     41.85     245.33    412.20    5680.81
512            n.a.     n.a.     n.a.      n.a.      296.90    498.55    6680.39
1024           n.a.     n.a.     n.a.      n.a.      n.a.      n.a.      10037.1
TSPLIB benchmark    Roulette Wheel (ms)    I-Roulette (ms)
d198                7.76                   4.32 (1.79x)
a280                20.33                  11.94 (1.70x)
lin318              26.73                  15.18 (1.76x)
pcb442              69.86                  38.73 (1.80x)
rat783              376.76                 207.08 (1.81x)
pr1002              747.85                 391.56 (1.90x)
pcb1173             1227.52                644.05 (1.90x)
d1291               1606.87                840.02 (1.91x)
pr2392              12017.2                5092.27 (2.36x)
TSPLIB       CPU Xeon      GPU task-based               GPU data-based
benchmark    E5620 (ms)    Ex. time (ms)   vs CPU       Ex. time (ms)   vs CPU    vs Task
d198         43.01         29.37           1.46x        4.32            9.95x     6.79x
a280         151.99        70.52           2.15x        11.94           12.72x    5.9x
lin318       223.78        153.33          1.45x        15.18           14.69x    10.1x
pcb442       618.07        301.66          2.04x        38.73           15.95x    7.78x
rat783       3539.01       1375.38         2.57x        207.08          17.08x    6.64x
pr1002       7965.17       2437.34         3.26x        392.56          20.33x    6.2x
pcb1173      12839.26      3392.74         3.78x        652.05          19.69x    5.2x
d1291        17450.53      4102.3          4.25x        869.96          20.05x    4.71x
pr2392       110573.22     29792.03        3.71x        5092.27         21.71x    5.85x
for the targeted benchmarks. This particular value produces very poor GPU resource usage per SM, and it is not well suited for developing high-throughput applications in the GPU computing arena. The heavy-weight threads presented by this design need resources to execute their independent tasks, thus avoiding large serialization phases. In CUDA, this is obtained by distributing those threads among SMs, which is possible by increasing the number of thread blocks in the execution.
The task-based approach is rewarded with a maximum speed-up factor of only 4.25x, compared with the 21.71x speed-up factor reached by data-based parallelism.
5.2. Evaluation of pheromone update kernel
This section evaluates the pheromone update kernel on the Tesla C2050 under different aspects: the performance impact of using different GPU algorithmic strategies, and a comparison versus a CPU counterpart. We address each of these issues separately.
5.2.1. Evaluation of different GPU algorithmic strategies
In this case, the baseline version is our best-performing kernel version,
which uses atomic instructions and shared memory. From there, we show
the slow-downs incurred by each technique. As previously explained, this
Table 7: Execution times (in milliseconds) for various pheromone update implementations (Tesla C2050).

Benchmark   1. Atomic Ins.   2. Atomic   3. Ins. & Thread   4. Tiled Scatter   5. Scatter     Slow-down
            + Tiling         Ins.        Reduction          to Gather          to Gather      (5 vs 1)
d198        0.18             0.26        25.47              66.29              66.37          368x
a280        0.41             0.45        93.93              211.81             260.82         636x
lin318      0.49             0.60        144.63             368.9              424.1          865x
pcb442      0.54             0.9         516.6              1321.3             1534.2         2841x
rat783      2.42             2.49        4669.58            12331.2            14649.9        6053x
pr1002      3.52             4.45        12256.4            32343.6            39299.1        11164x
pcb1173     4.68             5.33        22651.3            58740.7            73384.8        15680x
d1291       5.85             6.01        33682              86445.2            107926         18448x
pr2392      18.57            19.04       390301             1018150            1313744.4      70745x
on the benchmark size and the shared memory used by the targeted benchmark. Better performance, of the same order as before, is obtained by devoting more on-chip memory to L1, as long as the shared memory requirements are less than 16 KB; otherwise the nvcc compiler automatically devotes more room to shared memory in order to execute the application.
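For reference, a minimal sketch of the baseline deposit step (atomic instructions, without the shared-memory tiling refinement) is given below; the kernel and array names are our own, and the usual 1/length deposit per ant is assumed.

/* One thread block per ant: each thread walks part of the ant's tour and
 * accumulates its deposit on the pheromone matrix with atomicAdd(). */
__global__ void pheromone_deposit_kernel(float *tau, const int *tours,
                                         const float *len, int n)
{
    int k = blockIdx.x;                        /* ant (tour) handled by this block */
    float delta = 1.0f / len[k];
    for (int s = threadIdx.x; s < n; s += blockDim.x) {
        int a = tours[k * (n + 1) + s];
        int b = tours[k * (n + 1) + s + 1];
        atomicAdd(&tau[a * n + b], delta);     /* ants sharing an edge race here,  */
        atomicAdd(&tau[b * n + a], delta);     /* hence the atomic accumulation    */
    }
}

Evaporation (Equation (2)) is embarrassingly parallel and needs no atomics; the atomics here are what let ants that share an edge accumulate their deposits safely, and Table ?? shows that the scatter-to-gather alternatives devised to avoid them end up far more expensive.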
5.2.2. GPU Vs CPU
Table 8: Execution times (milliseconds) on different hardware platforms (CPU vs. GPU). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB benchmark   CPU Intel Xeon E5620   GPU NVIDIA Tesla C2050
d198               1.02                   0.18 (5.67x)
a280               2.6                    0.41 (6.34x)
lin318             4.1                    0.49 (8.37x)
pcb442             6.6                    0.54 (12.22x)
rat783             33.08                  2.42 (13.67x)
pr1002             54.89                  3.52 (15.59x)
pcb1173            77.59                  4.68 (16.58x)
d1291              102.43                 5.85 (17.51x)
pr2392             371.05                 18.57 (19.98x)
Table ?? shows the speed-up factor for the best version of the pheromone update kernel compared to the sequential code. The pattern of computation for this kernel is based on data parallelism, showing a speed-up that grows with the problem size and reaches up to a 20x factor on the Tesla C2050.
5.3. Quality of the Solution
Figure 6: The quality of the solution averaged over 1000 iterations. The 95% confidence interval is shown on top of the bars.
Finally, Figure ?? shows a quality comparison of the solutions obtained by the main algorithms targeted in this work. They are normalised with respect to the optimum solution for each benchmark previously reported (see Table ??). The solution quality shown is the result of running all algorithms for a fixed number of iterations (1000) and averaging over 5 independent runs. Moreover, a 95% confidence interval is also provided. It is worth noticing that the tour quality obtained by the GPU codes is similar to that of the verified sequential code, and even better in some cases.
6. Related Work
6.1. Parallel implementations
Stützle [? ] describes the simplest case of ACO parallelisation, in which independent instances of the ACO algorithm are run on different processors. Parallel runs have no communication overhead, and the final solution is taken as the best solution over all independent executions. Improvements over non-communicating parallel runs may be obtained by exchanging information among processors. Michel and Middendorf [? ] present a solution based on this principle, whereby separate colonies exchange pheromone information. In more recent work, Chen et al. [? ] divide the ant population into equally-sized sub-colonies, each assigned to a different processor. Each sub-colony
searches for an optimal local solution, and information is exchanged between
processors periodically. Lin et al. [? ] propose dividing up the problem
into subcomponents, with each subgraph assigned to a different processing
unit. To explore a graph and find a complete solution, an ant moves from
one processing unit to another, and messages are sent to update pheromone
levels. The authors demonstrate that this approach reduces local complexity
and memory requirements, thus improving overall efficiency.
Previous efforts for parallelising ACO on the GPU focused on the tour construction stage, using task-based parallelism. We have demonstrated that this approach does not fit well on the GPU architecture, and have provided an alternative approach based on data parallelism. This enhances GPU performance by both increasing the parallelism and avoiding warp divergence. Moreover, we
have proposed an alternative selection procedure that is better suited to the idiosyncrasies of the GPU architecture.
In addition, we have provided the first known implementation of the pheromone update stage on the GPU. Some GPU computing techniques were discussed in order to avoid atomic instructions. However, we have shown that those techniques are even more costly than applying atomic operations directly.
Possible future directions will include investigating the effectiveness of
GPU-based ACO algorithms on other NP-complete optimisation problems.
We will also implement other ACO algorithms, such as the Ant Colony System, which can also be efficiently implemented on the GPU. The conjunction
of ACO and GPU is still at a relatively early stage; we emphasize that we
have only so far tested a relatively simple variant of the algorithm. There
are many other types of ACO algorithm still to explore, and as such, it is
a potentially fruitful area of research. We hope that this paper stimulates
further discussion and work.
Acknowledgements
This work was partially supported by a travel grant from the EU FP7
NoE HiPEAC IST-217068, the European Network of Excellence on High
Performance and Embedded Architecture and Compilation. The first two
authors acknowledge the support of the project from the Spanish MEC and
European Commission FEDER funds under grants Consolider Ingenio-2010
CSD2006-00046 and TIN2006-15516-C04-03, and also from the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 00001/CS/2007.
References
[1] M. Dorigo, T. Stützle, Ant Colony Optimization, Bradford Company, Scituate, MA, USA, 2004.
[2] M. Dorigo, M. Birattari, T. Stützle, Ant colony optimization, Computational Intelligence Magazine, IEEE 1 (4) (2006) 28-39.
[3] C. Blum, Ant colony optimization: Introduction and recent trends, Physics of Life Reviews 2 (4) (2005) 353-373.
[4] E. Lawler, J. Lenstra, A. Kan, D. Shmoys, The Traveling Salesman Problem, Wiley, New York, 1987.
[5] M. Dorigo, A. Colorni, V. Maniezzo, Positive feedback as a search strategy, Tech. Rep. 91-016, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy (1991).
[6] M. Dorigo, V. Maniezzo, A. Colorni, The ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics-Part B 26 (1996) 29-41.
[7] M. Dorigo, E. Bonabeau, G. Theraulaz, Ant algorithms and stigmergy, Future Gener. Comput. Syst. 16 (2000) 851-871.
[8] T. Stützle, Parallelization strategies for ant colony optimization, in: PPSN V: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, Springer-Verlag, London, UK, 1998, pp. 722-731.
[9] X. JunYong, H. Xiang, L. CaiYun, C. Zhong, A novel parallel ant colony optimization algorithm with dynamic transition probability, International Forum on Computer Science-Technology and Applications 2 (2009) 191-194.
[10] Y. Lin, H. Cai, J. Xiao, J. Zhang, Pseudo parallel ant colony optimization for continuous functions, International Conference on Natural Computation 4 (2007) 494-500.
[11] M. Garland, D. B. Kirk, Understanding throughput-oriented architectures, Commun. ACM 53 (2010) 58-66.
[12] J. R. Fischer, Frontiers of massively parallel scientific computation, National Aeronautics and Space Administration, Scientific and Technical Information Office, Washington, D.C., 1987.
[13] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (2008) 13-27.
[14] J. M. Cecilia, J. M. García, M. Ujaldón, A. Nisbet, M. Amos, Parallelization strategies for ant colony optimisation on GPUs, in: NIDISC 2011: 14th International Workshop on Nature Inspired Distributed Computing, Proc. 25th International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage, Alaska, USA, 2011.
[15] D. S. Johnson, L. A. McGeoch, The Traveling Salesman Problem: A Case Study in Local Optimization, 1997.
[16] M. Dorigo, Optimization, learning and natural algorithms, Ph.D. thesis,
Politecnico di Milano, Italy (1992).
[17] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st Edition, Addison-Wesley Longman Publishing Co.,
Inc., Boston, MA, USA, 1989.
[18] NVIDIA, NVIDIA CUDA C Programming Guide 3.1.1, 2010.
[19] NVIDIA, NVIDIA CUDA C Best Practices Guide 3.2, 2010.
[20] NVIDIA, NVIDIA CUDA CURAND Library., 2010.
[21] T. Scavo, Scatter-to-gather transformation for scalability (Aug 2010).
[22] NVIDIA, Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.
[23] G. Reinelt, TSPLIB: a traveling salesman problem library, ORSA Journal on Computing 3 (4) (1991) 376-384.
[24] TSPLIB Webpage,
http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/
(February 2011).
[25] R. Michel, M. Middendorf, An island model based ant system with lookahead for the shortest supersequence problem, in: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, PPSN V, Springer-Verlag, London, UK, 1998, pp. 692-701.