
Parallelization Strategies for Ant Colony Optimisation on GPUs
José M. Cecilia, José M. García
Computer Architecture Department
University of Murcia
30100 Murcia, Spain

Andy Nisbet, Martyn Amos


Novel Computation Group
Division of Computing and IS
Manchester Metropolitan University
Manchester M1 5GD, UK

Manuel Ujaldón
Computer Architecture Department
University of Málaga
29071 Málaga, Spain

Abstract
Ant Colony Optimisation (ACO) is an effective population-based metaheuristic for the solution of a wide variety of problems. As a population-based algorithm, its computation is intrinsically massively parallel, and it is therefore theoretically well-suited to implementation on Graphics Processing Units (GPUs). The ACO algorithm comprises two main stages: Tour construction and Pheromone update. The former has previously been implemented on the GPU, using a task-based parallelism approach. However, up until now, the latter has always been implemented on the CPU. In this paper, we discuss several parallelisation strategies for both stages of the ACO algorithm on the GPU. We propose an alternative data-based parallelism scheme and selection procedure, namely I-Roulette, for the Tour construction, which fits better on the GPU architecture. We also describe novel GPU programming strategies for the Pheromone update stage. Our results show a total speed-up exceeding 21x for the Tour construction stage, and 20x for Pheromone update, compared to the sequential counterpart, and suggest that ACO is a potentially fruitful area for future research in the GPU domain.

Corresponding author

Preprint submitted to Journal of Parallel and Distributed Computing, March 31, 2011
Keywords: Metaheuristics, GPU programming, Ant Colony Optimization,
TSP, Performance analysis
1. Introduction
Ant Colony Optimisation (ACO) [? ] is a population-based search
method inspired by the behaviour of real ants. It may be applied to a wide
range of hard problems [? ? ], many of which are graph-theoretic in nature.
It was first applied to the Travelling Salesman Problem (TSP) [? ] by Dorigo
and colleagues, in 1991 [? ? ].
In essence, simulated ants construct solutions to the TSP in the form of
tours. The artificial ants are simple agents which construct tours in a parallel,
probabilistic fashion. They are guided in this task by simulated pheromone
trails and heuristic information. Pheromone trails are a fundamental component of the algorithm, since they facilitate indirect communication between
agents via their environment, a process known as stigmergy [? ]. A detailed
discussion of ant colony optimization and stigmergy is beyond the scope of
this paper, but the reader is directed to [? ] for a comprehensive overview.
ACO algorithms are population-based, in that a collection of agents collaborates to find an optimal (or even satisfactory) solution. Such approaches
are naturally suited to parallel processing, but their success strongly depends
on both the nature of the particular problem and the underlying hardware
available. Several parallelisation strategies have been proposed for the ACO
algorithm, on both shared and distributed memory architectures [? ? ? ].
The Graphics Processing Unit (GPU) is a major current theme of interest in the field of high performance computing. GPUs follow in the footsteps of earlier throughput-oriented processor designs, but have achieved far broader use in commodity machines, mainly motivated by the needs of real-time computer graphics, which has made it possible for GPUs to become mass-market devices [? ]. For instance, since late 2006, NVIDIA has shipped almost 220 million CUDA-capable GPUs, several orders of magnitude more than historical massively parallel architectures like the CM-2, MasPar and Goodyear MPP (Massively Parallel Processor) machines [? ].

For applications with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs, and thus offer tremendous potential performance on massively parallel problems [? ], even though sacrificing the serial performance of a single task may be required [? ]. Therefore, some massively parallel workloads may be redefined to suit throughput-oriented architectures instead of their latency-oriented counterparts.
Of particular interest to us are attempts to parallelise the ACO algorithm on GPUs. Until now, these approaches have focused on accelerating the tour construction step performed by each ant, taking a task-based parallelism approach, with pheromone deposition calculated on the CPU.
In this paper, we present the first fully developed ACO algorithm for the
Travelling Salesman Problem (TSP) on GPUs, so that both main phases are
parallelised. This is the main technical contribution of the paper. We clearly
identify two main algorithmic stages: Tour construction and Pheromone
update. A data-parallelism approach (which is better-suited to the GPU
parallelism model than task-based parallelism) is described to enhance tour
construction performance. Additionally, we describe various GPU design
patterns for the parallelisation of the pheromone update, which has not been
previously described in the literature.
The main contributions of this work are the following:
1. To the best of our knowledge, this is the first time that a data-parallelism approach for the tour construction stage is introduced on GPUs. We do so using two different types of artificial ants: queen ants (associated with CUDA thread-blocks), and worker ants (associated with CUDA threads).
2. We introduce I-Roulette (Independent Roulette) as an alternative selection method to the classic Roulette Wheel, one better suited to GPU execution.
3. We also discuss the implementation of the pheromone update stage on GPUs, either using atomic operations or using alternative GPU computing techniques to avoid them.
4. We offer an in-depth analysis of both stages of the ACO algorithm for different instances of the TSP. We tune different GPU parameters, reaching up to a 21x speed-up factor for the tour construction stage and a 20x speed-up factor for the pheromone update stage.

5. We validate the quality of the solution obtained by our GPU algorithms, comparing it to the quality of the solution obtained by the
sequential code given in [? ].
A preliminary and partial version of this work was presented in [? ]. Here, we significantly extend that work with a more formal description of the algorithm design on GPUs, an extensive analysis of our proposals on NVIDIA's Fermi architecture for both algorithmic stages, and a validation of the solution quality of our proposals. Moreover, we extend the evaluation process, adding several benchmarks from the TSPLIB library.
The paper is organised as follows. We briefly introduce Ant Colony Optimisation for the TSP and Compute Unified Device Architecture (CUDA)
from NVIDIA in Section ??. In Section ?? we present GPU designs for
both main stages of the ACO algorithm. Experimental methodology is introduced in Section ?? before we show the performance evaluation of our
algorithm in Section ??. Finally, parallelization strategies for the ACO algorithm previously presented in the literature are described in Section ??,
before concluding with a brief discussion and consideration of future work.
2. Background
2.1. Ant Colony Optimisation for the Traveling Salesman Problem
The Traveling Salesman Problem (TSP) [? ] involves finding the shortest (or cheapest) round-trip route that visits each of a number of cities exactly once. The symmetric TSP on n cities may be represented as a complete weighted graph, G, with n nodes, with each weighted edge, $e_{i,j}$, representing the inter-city distance $d_{i,j} = d_{j,i}$ between cities i and j. The TSP is a well-known NP-hard optimisation problem, and is used as a standard benchmark for many heuristic algorithms [? ].
The TSP was the first problem solved by Ant Colony Optimisation (ACO) [? ? ]. This method uses a number of simulated ants (or agents), which perform a distributed search on a graph. Each ant moves through the graph until it completes a tour, and then offers this tour as its suggested solution. In order to do this, each ant may drop pheromone on the edges contained in its proposed solution. The amount of pheromone dropped, if any, is determined by the quality of the ant's solution relative to those obtained by the other ants. The ants probabilistically choose the next city to visit, based on heuristic information obtained from inter-city distances and the net pheromone trail. Although such heuristic information drives the ants towards an optimal solution, a process of evaporation is also applied in order to prevent the process stalling in a local minimum.
The Ant System (AS) is an early variant of ACO, first proposed by Dorigo [? ]. The AS algorithm is divided into two main stages: Tour construction and Pheromone update. Tour construction is based on m ants building tours in parallel. Initially, ants are randomly placed. At each construction step, each ant applies a probabilistic action choice rule, called the random proportional rule, in order to decide which city to visit next. The probability for ant k, placed at city i, of visiting city j is given by Equation ??:
p^k_{i,j} = \frac{[\tau_{i,j}]^\alpha \, [\eta_{i,j}]^\beta}{\sum_{l \in N^k_i} [\tau_{i,l}]^\alpha \, [\eta_{i,l}]^\beta} \quad \text{if } j \in N^k_i,    (1)

where $\eta_{i,j} = 1/d_{i,j}$ is a heuristic value that is available a priori, $\alpha$ and $\beta$ are two parameters which determine the relative influence of the pheromone trail and the heuristic information respectively, and $N^k_i$ is the feasible neighbourhood of ant k when at city i. This latter set represents the set of cities that ant k has not yet visited; the probability of choosing a city outside $N^k_i$ is zero (this prevents an ant returning to a city, which is not allowed in the TSP). By this probabilistic rule, the probability of choosing a particular edge (i, j) increases with the value of the associated pheromone trail $\tau_{i,j}$ and of the heuristic information value $\eta_{i,j}$. The numerator of Equation ?? is the same for every ant in a single iteration; computation time can therefore be saved by storing this information in an additional matrix, called the choice_info matrix, as shown in [? ]. The random proportional rule ends with a selection procedure, which is performed analogously to the roulette wheel selection procedure of evolutionary computation (for more detail see [? ], [? ]). Each value choice_info[current_city][j] of a city j that ant k has not visited yet determines a slice on a circular roulette wheel, the size of the slice being proportional to the weight of the associated choice. Next, the wheel is spun and the city to which the marker points is chosen as the next city for ant k. Furthermore, each ant k maintains a memory, $M^k$, called the tabu list, which contains the cities already visited, in the order they were visited. This memory is used to define the feasible neighbourhood, and also allows an ant both to compute the length of the tour $T^k$ it generated, and to retrace the path to deposit pheromone.
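To make the precomputation of the choice_info matrix concrete, the following C sketch fills it from the numerator of Equation ??. The names tau, dist and choice_info for the pheromone matrix, the distance matrix and the resulting structure are our own; this is an illustration of the idea, not the exact code evaluated later in the paper.

#include <math.h>

/* Illustrative sketch: precompute choice_info[i][j] = tau[i][j]^alpha * eta[i][j]^beta.
   All matrices are assumed to be n x n, stored in row-major order. */
void compute_choice_info(int n, const double *tau, const double *dist,
                         double *choice_info, double alpha, double beta)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            /* heuristic value eta = 1/d(i,j); a city has no edge to itself */
            double eta = (i == j) ? 0.0 : 1.0 / dist[i * n + j];
            choice_info[i * n + j] = pow(tau[i * n + j], alpha) * pow(eta, beta);
        }
}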
After all ants have constructed their tours, the pheromone trails are updated. This is achieved by first lowering the pheromone value on all edges by a constant factor, and then adding pheromone on edges that ants have crossed in their tours. Pheromone evaporation is implemented by
\tau_{i,j} \leftarrow (1 - \rho)\,\tau_{i,j}, \quad \forall (i, j) \in L,    (2)

where $0 < \rho \le 1$ is the pheromone evaporation rate. After evaporation, all ants deposit pheromone on their visited edges:
\tau_{i,j} \leftarrow \tau_{i,j} + \sum_{k=1}^{m} \Delta\tau^k_{i,j}, \quad \forall (i, j) \in L,    (3)

where $\Delta\tau^k_{i,j}$ is the amount of pheromone ant k deposits. This is defined as follows:

\Delta\tau^k_{i,j} = \begin{cases} 1/C^k & \text{if edge } (i, j) \text{ belongs to } T^k \\ 0 & \text{otherwise} \end{cases}    (4)
where $C^k$, the length of the tour $T^k$ built by the k-th ant, is computed as the sum of the lengths of the edges belonging to $T^k$. According to Equation ??, the better an ant's tour, the more pheromone the edges belonging to this tour receive. In general, edges that are used by many ants (and which are part of short tours) receive more pheromone, and are therefore more likely to be chosen by ants in future iterations of the algorithm.
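The following sequential C sketch illustrates Equations ??-?? together, under assumed data structures: tour[k] holds ant k's closed tour of n + 1 cities (the first city repeated at the end) and tour_len[k] holds $C^k$. It is illustrative only, not the reference implementation compared against later.

/* Sketch of the sequential pheromone update (Equations 2-4). */
void pheromone_update(int n, int m, double rho, double *tau,
                      int **tour, const double *tour_len)
{
    /* Evaporation: tau <- (1 - rho) * tau on every edge (Equation 2) */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            tau[i * n + j] *= (1.0 - rho);

    /* Deposit: each ant adds 1/C^k on every edge of its tour (Equations 3-4);
       both symmetric entries are updated for the symmetric TSP. */
    for (int k = 0; k < m; k++) {
        double delta = 1.0 / tour_len[k];
        for (int step = 0; step < n; step++) {
            int i = tour[k][step], j = tour[k][step + 1];
            tau[i * n + j] += delta;
            tau[j * n + i] += delta;
        }
    }
}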
2.2. CUDA Programming model
All NVIDIA GPU platforms from the G80 architecture onwards can be programmed using the Compute Unified Device Architecture (CUDA) programming model, which makes the GPU operate as a highly parallel computing device. Each GPU device is a scalable processor array consisting of a set of SIMT (Single Instruction Multiple Threads) Streaming Multiprocessors (SMs), each of them containing several stream processors (SPs). Different memory spaces are available on each GPU in the system. The global memory (also called device or video memory) is the only space accessible by all multiprocessors. It is the largest and slowest memory space, and it is private to each GPU in the system. Moreover, each multiprocessor has its own private memory space, called shared memory. The shared memory is smaller but has lower access latency than global memory. Finally, there are other addressing spaces for specific purposes, such as texture and constant memory [? ].

The CUDA programming model is based on a hierarchy of abstraction layers. The thread is the basic execution unit, and is mapped to a single SP. A thread-block is a batch of threads which can cooperate, as they are assigned to the same multiprocessor and therefore share all the resources of that multiprocessor, such as the register file and shared memory. A grid is composed of several thread-blocks, which are equally distributed and scheduled among all multiprocessors. Thread-blocks are executed in no particular order, in a Multiple Instruction Multiple Data (MIMD) fashion. Finally, the threads of a thread-block are divided into batches of 32 threads called warps. The warp is the scheduling unit, so the threads of a thread-block are scheduled on a given multiprocessor warp by warp. The 32 threads in a warp execute the same instruction over multiple data (SIMD). The programmer declares the number of thread-blocks, the number of threads per thread-block and their distribution to arrange parallelism, given the program constraints (i.e., data and control dependencies).
3. Code Design and Tuning Techniques
Algorithm 1 The sequential AS version for the TSP problem:
1: InitializeData()
2: while not Convergence() do
3:   TourConstruction()
4:   PheromoneUpdate()
5: end while
In this Section, we present several different GPU designs for the Ant System (AS), as applied to the TSP. Algorithm ?? shows single-program multiple-data (SPMD) style pseudocode for the AS. Firstly, all AS structures for the TSP, such as the distance matrix, the number of cities, etc., are initialized, before both main stages of the AS algorithm, i.e. Tour construction and Pheromone update, are performed. These stages are computed until the convergence criterion is reached. For the tour construction, we begin by analysing CPU baselines and traditional task-based implementations on the GPU, which motivates our alternative approach of increasing the data-parallelism. For the pheromone update, we describe several GPU techniques that are potentially useful in increasing application bandwidth.

3.1. Previous tour construction proposals


3.1.1. CPU baseline
The tour construction stage is divided into two main phases: Initialization and ASDecisionRule. In the former, all data structures, such as the tabu list (visited) and the initial random city, are initialized by each ant. Algorithm ?? shows the latter phase, which is in turn divided into two sub-phases. First, each ant calculates the heuristic information for visiting city j from city i according to Equation ?? (lines 1-11). As previously explained, it is computationally expensive to repeatedly calculate those values at each computational step of each ant, k; this can be avoided by using an additional data structure, namely choice_info, in which those heuristic values are stored as an adjacency matrix, and are therefore calculated only once per call [? ]. Notice that each entry of this structure can be calculated independently of the others (see Equation ??). Second, the probabilistic choice of the next city by each ant is performed using roulette wheel selection [? ? ], as shown in Algorithm ??, lines 12-18.
3.1.2. Task-based approach on GPUs

Figure 1: Task-based parallelism on the tour construction kernel.

The traditional task-based parallelism approach to tour construction is based on the observation that ants run in parallel looking for the best tour they can find. Therefore, the inherent parallelism exists at the level of individual ants. To implement this idea on CUDA, each ant is identified with a CUDA thread, and threads are equally distributed among CUDA thread-blocks. Each CUDA thread deals with the task assigned to

Algorithm 2 ASDecisionRule for the Tour construction stage. m is the number of ants, and n is the number of cities of the TSP instance.
1: sum_probs ← 0.0;
2: current_city ← ant[k].tour[step − 1];
3: for j = 1 to n do
4:   if ant[k].visited[j] then
5:     selection_prob[j] ← 0.0;
6:   else
7:     current_probability ← choice_info[current_city][j];
8:     selection_prob[j] ← current_probability;
9:     sum_probs ← sum_probs + current_probability;
10:  end if
11: end for
{Roulette Wheel Selection Process}
12: r ← random(0..sum_probs);
13: j ← 1;
14: p ← selection_prob[j];
15: while p < r do
16:   j ← j + 1;
17:   p ← p + selection_prob[j];
18: end while
19: ant[k].tour[step] ← j;
20: ant[k].visited[j] ← true;
each ant; i.e., maintenance of an ant's memory (tabu list, list of all visited cities, and so on) and movement (see the core of this computation in Algorithm ??). Figure ?? summarizes the process sequentially performed by each ant.
To improve the kernel's bandwidth, some of the structures presented above are placed in on-chip shared memory. Among them, the visited and selection_prob lists are good candidates for shared memory, as they are accessed many times during the computation, in an irregular access pattern. However, shared memory is a scarce resource in CUDA (see Table ??), and thus the size of these structures becomes limited. Moreover, in the CUDA programming model, shared memory is allocated at CUDA thread-block level.

3.2. Our tour construction approach based on data-parallelism


The task-based parallelism just described presents several issues for GPU implementation. Firstly, this approach requires a relatively low number of threads on the GPU, since the recommended number of ants for solving the TSP is the same as the number of cities [? ]. In addition, this version presents an unpredictable memory access pattern, because execution is guided by a stochastic process. Finally, checking the list of visited cities causes many warp divergences (different threads in a warp take different paths), leading to serialisation [? ].
3.2.1. The choice_info matrix calculation on the GPU
To increase the application parallelism on CUDA, the choice_info computation is performed apart from the tour construction kernel, in a separate kernel which is executed right before the tour construction at each iteration of the ACO algorithm. We assign a CUDA thread to each entry of the choice_info structure, and threads are equally grouped into CUDA thread-blocks. The performance of this kernel may be drastically affected by the use of a costly math function like powf() (see Equation ??). However, there are analogous CUDA functions, such as __powf(), which map directly to the hardware, although they may provide somewhat lower accuracy [? ].
3.2.2. Data-based parallelism proposal
Figure ?? shows an alternative design, which increases the data-parallelism in the tour construction kernel and also avoids warp divergences. In this design, a thread-block is associated with each ant (i.e. a queen ant), and each thread in a thread-block represents a city (or cities) the ant may visit (i.e. worker ants). All worker ants fully cooperate to obtain a solution, increasing the data-parallelism by a factor of 1:w, where w is the number of worker ants per queen ant.
A thread loads the heuristic value associated with its city (or cities), and checks whether the city has been visited or not. To avoid conditional statements (and, thus, warp divergences), the tabu list is represented in shared memory with one integer value per city. A city's value is 0 if it has been visited, and 1 otherwise. These values are multiplied and stored in a shared-memory array, which is then used by the roulette wheel selection process. Notice that the shared memory requirements are drastically reduced compared to the previous version: the tabu list and the probability list are now stored once per thread-block (i.e. per queen ant) instead of once per thread (i.e. per worker ant).

Figure 2: Data-based parallelism on the tour construction kernel.
The number of threads per thread-block is a hardware-limited factor on CUDA. Thus, the cities should be distributed among threads to allow for a flexible implementation. A tiling technique is proposed to deal with this issue. Cities are divided into blocks (i.e. tiles). For each tile, a city is selected stochastically from the set of unvisited cities on that tile. When this process has completed, we have a set of partial best cities. Finally, the city with the best absolute heuristic value is selected from this partial-best set.
The tabu list information can be placed in the register file (since it represents information private to each thread). However, the tabu list cannot be represented by a single integer register per thread in the tiling version, because in that case a thread represents more than one city. Instead, a 32-bit register may be used on a bitwise basis to manage the list: the first city represented by each thread, i.e. the city on the first tile, is managed by bit 0 of the register that represents the tabu list, the second city is managed by bit 1, and so on. A sketch of this bitwise management is given below.
3.2.3. I-Roulette: An alternative selection method
The roulette wheel is a fully sequential stochastic selection process, and it is hard to parallelise. To implement it on CUDA, a particular thread is designated to proceed sequentially with the selection, doing so exactly n − 1 times, where n is the number of cities. Moreover, the kernel needs to generate costly pseudorandom numbers on the GPU. We use NVIDIA's CURAND library [? ].

Figure 3: An alternative method for increasing the parallelism of the selection process.
Figure ?? shows an alternative method which removes the sequential parts of the previous kernel design. We call this method I-Roulette (Independent Roulette). I-Roulette generates a random number in the interval [0, 1] for each city to feed the stochastic simulation. Thus, three values are multiplied and stored in the shared-memory array for each city: the heuristic value associated with that city, a value indicating whether the city has been visited or not, and the random number associated with that city.
3.3. Pheromone update stage
The last stage of the ACO algorithm is the pheromone update, which comprises two main tasks: pheromone evaporation and pheromone deposit, as explained in Section ??. Pheromone evaporation is quite straightforward to implement on CUDA: a single thread can independently apply Equation ?? to each entry of the pheromone matrix, thus lowering the pheromone value on all edges by a constant factor.
Then, ants deposit different quantities of pheromone on the edges that they have crossed in their tours. The quantity of pheromone deposited by each ant depends on the quality of the tour found by that ant (see Equations ?? and ??). Figure ?? shows the design of the pheromone kernel; it launches n CUDA threads per CUDA thread-block (where n is the number of cities), one less than the tour length (i.e. n + 1). Each ant generates its own private tour in parallel, and two ants may well visit the same edge. This fact forces us to use atomic instructions for accessing the pheromone matrix, which diminishes the application performance. A sketch of this design is shown below.
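The sketch below illustrates the atomic design, under assumed tour and tour-length layouts (tours stores m closed tours of n + 1 cities each); atomicAdd() on floats requires a Fermi-class device, as used here.

// One block per ant, one thread per tour step; two ants may hit the same
// edge, hence the atomic additions.
__global__ void deposit_atomic(float *tau, const int *tours,
                               const float *tour_len, int n)
{
    int k = blockIdx.x;                  // ant index
    int s = threadIdx.x;                 // step in ant k's tour (n steps)
    if (s < n) {
        int i = tours[k * (n + 1) + s];
        int j = tours[k * (n + 1) + s + 1];
        float delta = 1.0f / tour_len[k];
        atomicAdd(&tau[i * n + j], delta);
        atomicAdd(&tau[j * n + i], delta);   // symmetric TSP
    }
}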
Therefore, a key objective is to avoid using atomic operations. An alternative approach which does so is shown in Figure ??, where we use a scatter-to-gather transformation [? ]. The configuration launch routine for the pheromone update kernel now sets as many threads as there are cells in the pheromone matrix ($c = n^2$), and equally distributes these threads among thread-blocks. Thus, each thread

Figure 4: Pheromone deposit with atomic instructions.

represents the coordinates of a single entry of the pheromone matrix, and is in charge of checking whether its cell has been visited by any ant; i.e., each thread accesses device memory to check that information. This means that each thread performs $2n^2$ memory loads, for a total of $l = 2n^4$ accesses to device memory ($n^2$ threads).
Notice that there is a tradeoff between the pressure on device memory incurred by avoiding a design based on atomic operations and the number of atomic operations avoided (the loads:atomic ratio from now on). For the scatter-to-gather design, the loads:atomic ratio is l:c. Therefore, this approach allows us to perform the computation without using atomic operations, but at the cost of drastically increasing the number of accesses to device memory. A sketch of this design follows.
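An illustrative (untiled) scatter-to-gather kernel, under the same assumed layouts as before, is shown here; each thread scans every ant's tour looking for its own edge, so no atomics are needed.

// One thread per pheromone-matrix cell; the two loads per tour step give
// the 2n^2 loads/thread (2n^4 total) discussed in the text.
__global__ void deposit_gather(float *tau, const int *tours,
                               const float *tour_len, int n, int m)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= n * n) return;
    int i = cell / n, j = cell % n;

    float acc = 0.0f;
    for (int k = 0; k < m; k++)                 // every ant
        for (int s = 0; s < n; s++) {           // every edge in its tour
            int a = tours[k * (n + 1) + s];
            int b = tours[k * (n + 1) + s + 1];
            if ((a == i && b == j) || (a == j && b == i))
                acc += 1.0f / tour_len[k];
        }
    tau[cell] += acc;
}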
A tiling technique is proposed to increase the application bandwidth. Now, all threads cooperate to load data from global memory to shared memory, but they still access the edges of the ants' tours. Each thread accesses global memory $2n^2/\theta$ times, where $\theta$ is the tile size. The rest of the accesses are performed on shared memory. Therefore, the total number of global memory accesses is $l' = 2n^4/\theta$. The loads:atomic ratio, $l':c$, is lower, but maintains the same order of magnitude.

Figure 5: Scatter to Gather transformation for the pheromone deposit.

We note that an ant's tour length may be bigger than the maximum number of threads that a thread-block can support (see Table ??). Our algorithm prevents this situation by setting our empirically determined optimum thread-block layout, and dividing the tour into tiles of this length. This raises another issue, which arises when n + 1 is not divisible by the tile size. We solve this by padding the ant's tour array to avoid warp divergence (see Figure ??).
Unnecessary loads from device memory can be avoided by taking advantage of the problem's nature. We focus on the symmetric version of the TSP, so the number of threads can be halved, thus halving the number of device memory accesses. This so-called Reduction version actually reduces the overall number of accesses to both shared and device memory by having half the number of threads of the previous version. It is also combined with tiling, as previously explained. The number of accesses per thread remains the same, giving a total of $l'' = n^4/\theta$ device memory accesses. A sketch of this variant follows.
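The following untiled sketch conveys the symmetry idea (again illustrative rather than the exact implementation): only cells above the diagonal get a thread, and each thread writes both symmetric entries after a single scan.

// One thread per upper-triangular cell (i < j): half the threads, and a
// single computation feeding two symmetric writes.
__global__ void deposit_gather_sym(float *tau, const int *tours,
                                   const float *tour_len, int n, int m)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n * (n - 1) / 2) return;

    // Map the linear index t to an upper-triangular pair (i, j), i < j.
    int i = 0, rem = t;
    while (rem >= n - 1 - i) { rem -= n - 1 - i; i++; }
    int j = i + 1 + rem;

    float acc = 0.0f;
    for (int k = 0; k < m; k++)
        for (int s = 0; s < n; s++) {
            int a = tours[k * (n + 1) + s];
            int b = tours[k * (n + 1) + s + 1];
            if ((a == i && b == j) || (a == j && b == i))
                acc += 1.0f / tour_len[k];
        }
    tau[i * n + j] += acc;
    tau[j * n + i] += acc;   // symmetry: compute once, write twice
}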


4. Experimental methodology
4.1. Hardware features
Table 1: CUDA and hardware features for Tesla C2050.

GPU element                   Feature              Tesla C2050
Streaming processors          Cores per SM         32
(GPU cores)                   Number of SMs        14
                              Total SPs            448
                              Clock frequency      1 147 MHz
Maximum number of threads     Per multiprocessor   1 536
                              Per block            1 024
                              Per warp             32
SRAM memory available         32-bit registers     32 K
per multiprocessor            Shared memory        16/48 KB
                              L1 cache             48/16 KB
                              (Shared + L1)        64 KB
Global (video) memory         Size                 3 GB
                              Speed                2x1500 MHz
                              Width                384 bits
                              Bandwidth            144 GB/s
                              Technology           GDDR5

The hardware evaluation platforms are: (1) a dual-socket 2.40 GHz quad-core Intel Xeon E5620 Westmere processor, and (2) an NVIDIA Tesla C2050, based on the Fermi architecture released in November 2010 (see its main features in Table ??) [? ]. In addition to a larger number of CUDA processors in each Streaming Multiprocessor (SM) (a fourfold increase over prior SM designs), Fermi improves the GPU capabilities with additional features: enhanced single-precision floating-point accuracy (implementing the IEEE 754-2008 floating-point standard) and double-precision floating-point performance, new general-purpose L1 and L2 caches, faster context switching, a unified 64-bit virtual address space, a new instruction set, and Error Correction Code (ECC) memory support to enhance data integrity in high performance computing. The same on-chip memory is devoted to both L1 cache and shared memory, and the partition is configurable per kernel through the cudaFuncSetCacheConfig() CUDA call.
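For example, a kernel that needs little shared memory can request the larger L1 partition as follows (my_kernel is a placeholder of ours):

// cudaFuncCachePreferL1     -> 48 KB L1 cache, 16 KB shared memory
// cudaFuncCachePreferShared -> 16 KB L1 cache, 48 KB shared memory
__global__ void my_kernel(float *data) { /* ... */ }

int main(void)
{
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    // ... allocate device memory, launch my_kernel, etc.
    return 0;
}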

Finally, we use gcc 4.3.4 with the -O3 flag to compile our CPU implementations, and CUDA compilation tools, release 3.2, for our GPU implementations.
4.2. Our benchmark
We test our designs using a set of benchmark instances from the well-known TSPLIB library [? ] [? ]. All benchmark instances are defined on a complete graph, and all distances are integer numbers. Table ?? lists all targeted benchmark instances, with information on the number of cities, the type of distance and the length of the optimal tour. ACO parameters such as the number of ants m, $\alpha$, $\beta$, $\rho$, and so on are set according to the values recommended in [? ]. The most important parameter for the scope of this study is the number of ants, which is set to m = n (where n is the number of cities), with $\alpha = 1$, $\beta = 2$, $\rho = 0.5$.
Table 2: Description of the benchmark instances from the TSPLIB library (EUC_2D: 2-dimensional Euclidean distances).

Name      Cities   Type      Best tour length
d198      198      EUC_2D    15780
a280      280      EUC_2D    2579
lin318    318      EUC_2D    42029
pcb442    442      EUC_2D    50778
rat783    783      EUC_2D    8806
pr1002    1002     EUC_2D    259045
pcb1173   1173     EUC_2D    56892
d1291     1291     EUC_2D    50801
pr2392    2392     EUC_2D    378032

5. Performance Evaluation
In this Section we analyse in depth both main stages of the ACO algorithm: Tour construction and Pheromone update. We compare our implementations with the sequential code, written in ANSI C, provided by Stützle in [? ]. The performance figures are recorded for a single iteration, and averaged over 100 iterations. In this work we focus on the computational characteristics of the Ant System and how it can be efficiently implemented on the GPU. However, to guarantee the correctness of our algorithms, we also provide a quality comparison between the sequential and GPU codes, normalized to the optimum solution, for a variety of benchmarks; it is obtained by running all algorithms for a fixed number of iterations (1000), averaged over 5 independent runs. The experiments use single-precision arithmetic.
5.1. Evaluation of tour construction stage
We now evaluate the tour construction stage on the Tesla C2050 from several angles: the performance impact of using costly arithmetic instructions in the choice_info kernel, a comparison against the CPU counterpart, the degree of improvement attained by the data-based approach, the speed-up obtained when using the different on-chip GPU memories available on the Fermi architecture, and the performance benefit of changing the selection process. We address each of these issues separately.
5.1.1. Choice info kernel evaluation
Table 3: Execution times (milliseconds) on the GPU system for the choice_info kernel, using both CUDA instructions: powf() and __powf(). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB       CUDA      CUDA
Benchmark    powf()    __powf()
d198         0.038     0.013 (2.88x)
a280         0.065     0.020 (3.18x)
lin318       0.082     0.024 (3.33x)
pcb442       0.147     0.042 (3.47x)
rat783       0.441     0.117 (3.77x)
pr1002       0.719     0.188 (3.82x)
pcb1173      0.981     0.254 (3.86x)
d1291        1.185     0.307 (3.86x)
pr2392       4.042     1.039 (3.88x)

We first evaluate the choice_info kernel, before assessing the impact of various modifications to the tour construction. Table ?? shows the performance of the choice_info kernel. It is drastically affected by the use of a costly math function like powf(). However, the analogous CUDA function __powf() maps directly to the hardware, although it may provide somewhat lower accuracy [? ]. PTX inspection shows that this kernel performs four memory accesses to global memory. The empirical streaming bandwidth obtained by this kernel on the Tesla C2050 is up to 90 GB/s using __powf(), and up to 23 GB/s using powf(). These numbers translate into up to 67 GFLOPS and 17 GFLOPS respectively, counting each __powf()/powf() invocation as a single floating-point operation. We use 256-thread blocks, with one thread per entry of the choice_info data structure, to reach the best performance. These particular values minimize non-coalesced memory accesses and yield high occupancy.
5.1.2. Tuning our data-based approach
For the data-based approach, the number of worker ants (threads) per queen ant (thread-block) is a degree of freedom, which is analysed in Table ??. The 128-thread configuration reaches the best performance in all benchmark instances. This particular value is well suited to developing high-throughput applications on GPUs. Notice that some configurations are not available (n.a.), because either the number of worker ants is bigger than the number of cities, or the number of cities divided by the number of worker ants is bigger than 32, which is the maximum number of cities that each worker ant can manage. The tabu list of each queen ant is divided among its worker ants and placed, on a bitwise basis, in a single register.
Table 4: Execution times (milliseconds) on Tesla C2050. We vary the number of worker ants and benchmark instances (n.a. means not available due to register constraints).

Worker ants   d198    a280    lin318   pcb442   rat783   pr1002   pr2392
16            10.39   30.06   38.59    101.09   n.a      n.a      n.a
32            6.85    18.68   23.89    62.77    357.24   749.04   n.a
64            5.06    12.78   15.51    41.17    235.86   474.79   6083.96
128           4.32    11.94   15.18    38.73    207.08   391.59   5092.27
256           n.a     15.94   20.89    41.85    245.33   412.20   5680.81
512           n.a     n.a     n.a      n.a      296.90   498.55   6680.39
1024          n.a     n.a     n.a      n.a      n.a      n.a      10037.1


5.1.3. I-Roulette Versus Roulette Wheel


Table 5: Execution times (milliseconds) for both selection procedures: Roulette Wheel, and our approach, Independent Roulette (I-Roulette).

TSPLIB       Roulette Wheel   I-Roulette
Benchmark
d198         7.76             4.32 (1.79x)
a280         20.33            11.94 (1.70x)
lin318       26.73            15.18 (1.76x)
pcb442       69.86            38.73 (1.80x)
rat783       376.76           207.08 (1.81x)
pr1002       747.85           391.56 (1.90x)
pcb1173      1227.52          644.05 (1.90x)
d1291        1606.87          840.02 (1.91x)
pr2392       12017.2          5092.27 (2.36x)

Table ?? shows the improvement attained by increasing the parallelism on the GPU through our selection method, I-Roulette. This method reaches up to a 2.36x speed-up factor compared to the classic roulette wheel, even though it generates many costly random numbers. The roulette wheel compromises the GPU parallelism; thus, there is a tradeoff between improving total throughput and increasing latency on individual tasks, with the former option favoured by a wide margin.
5.1.4. GPU Versus CPU
Table ?? presents execution times on the high-end CPU and the Tesla C2050 GPU already introduced, for the set of simulations included in our benchmarks. It is worth mentioning that, for a fair comparison, we have selected hardware platforms with a similar cost (the investment ranges between 1,500 and 2,000 euros for each single processor). We see that the GPU obtains better performance than its CPU counterpart, reaching up to a 21x speed-up factor.
Table ?? also shows the benefit of having many light-weight parallel threads through a data-parallelism approach, instead of the heavy-weight threads of the task-based approach on GPUs.
For the task-based parallelism versions, we use 16 CUDA threads, with 16 ants running in parallel per thread-block, to reach the best performance

Table 6: Execution times (milliseconds) on different hardware platforms (CPU vs GPU), enabling the data-based approach on the GPU.

TSPLIB       CPU Xeon    GPU Task-based         GPU Data-based
Benchmark    E5620       Ex. time   Vs CPU      Ex. time   Vs CPU   Vs Task
d198         43.01       29.37      1.46x       4.32       9.95x    6.79x
a280         151.99      70.52      2.15x       11.94      12.72x   5.9x
lin318       223.78      153.33     1.45x       15.18      14.69x   10.1x
pcb442       618.07      301.66     2.04x       38.73      15.95x   7.78x
rat783       3539.01     1375.38    2.57x       207.08     17.08x   6.64x
pr1002       7965.17     2437.34    3.26x       392.56     20.33x   6.2x
pcb1173      12839.26    3392.74    3.78x       652.05     19.69x   5.2x
d1291        17450.53    4102.3     4.25x       869.96     20.05x   4.71x
pr2392       110573.22   29792.03   3.71x       5092.27    21.71x   5.85x

for the targeted benchmarks. This particular value produces very poor GPU resource usage per SM, and it is not well suited to developing high-throughput applications in the GPU computing arena. The heavy-weight threads presented by this design need resources to execute their independent tasks, thus avoiding long serialization phases. In CUDA, this is obtained by distributing those threads among SMs, which is possible by increasing the number of thread-blocks in the execution.
The task-based approach is only rewarded with a maximum 4.25x speed-up factor, versus the 21.71x speed-up factor reached by data-based parallelism.
5.2. Evaluation of pheromone update kernel
This section evaluates the pheromone update kernel on the Tesla C2050 from two angles: the performance impact of different GPU algorithmic strategies, and a comparison against the CPU counterpart. We address each of these issues separately.
5.2.1. Evaluation of different GPU algorithmic strategies
In this case, the baseline version is our best-performing kernel version,
which uses atomic instructions and shared memory. From there, we show
the slow-downs incurred by each technique. As previously explained, this

Table 7: Execution times (in milliseconds) for various pheromone update implementations (Tesla C2050).

Code version              d198    a280     lin318   pcb442   rat783    pr1002    pcb1173   d1291     pr2392
1. Atomic Ins. + Tiling   0.18    0.41     0.49     0.54     2.42      3.52      4.68      5.85      18.57
2. Atomic Ins.            0.26    0.45     0.60     0.9      2.49      4.45      5.33      6.01      19.04
3. Ins. & Thread Red.     25.47   93.93    144.63   516.6    4669.58   12256.4   22651.3   33682     390301
4. Tiled Scatter-Gather   66.29   211.81   368.9    1321.3   12331.2   32343.6   58740.7   86445.2   1018150
5. Scatter to Gather      66.37   260.82   424.1    1534.2   14649.9   39299.1   73384.8   107926    1313744.4
Slow-down (5 vs 1)        368x    636x     865x     2841x    6053x     11164x    15680x    18448x    70745x
kernel presents a tradeoff between the number of accesses to global memory needed to avoid costly atomic operations and the number of those atomic operations (the loads:atomic ratio). The scatter-to-gather pattern (5) presents the largest difference between both parameters. This imbalance is reflected in the performance degradation shown in the bottom row of Table ??. The slow-down grows steeply with the benchmark size, as expected.
The tiling technique (4) improves the application bandwidth of the scatter-to-gather approach. The Reduction technique (3) actually reduces the overall number of accesses to both shared and device memory by having half the number of threads of versions 4 or 5. It also uses tiling to alleviate the pressure on device memory. Even though the number of loads per thread remains the same, the overall number of loads in the application is reduced.
Finally, the atomic-instruction version (2) is also slightly improved through tiling. Moreover, the Tesla C2050 provides an improved memory hierarchy which alleviates the pressure on device memory. As previously mentioned, the on-chip memory is statically configurable, setting different partition sizes for L1 and shared memory. It is noteworthy that version (2) performs slightly better with a bigger L1 partition than with mostly unused shared memory (around 0.5% improvement on average). For version (1), the best on-chip configuration depends on the benchmark size and the shared memory used by the targeted benchmark. Better performance, of the same order as before, is obtained by devoting more space to L1, as long as the shared memory requirements are below 16 KB; otherwise the nvcc compiler automatically devotes more room to shared memory in order to execute the application.
5.2.2. GPU Vs CPU
Table 8: Execution times (milliseconds) on different hardware platforms (CPU vs GPU). We vary the TSPLIB benchmark instance to increase the number of cities.

TSPLIB       CPU Xeon      GPU NVIDIA
Benchmark    Intel E5620   Tesla C2050
d198         1.02          0.18 (5.67x)
a280         2.6           0.41 (6.34x)
lin318       4.1           0.49 (8.37x)
pcb442       6.6           0.54 (12.22x)
rat783       33.08         2.42 (13.67x)
pr1002       54.89         3.52 (15.59x)
pcb1173      77.59         4.68 (16.58x)
d1291        102.43        5.85 (17.51x)
pr2392       371.05        18.57 (19.98x)

Table ?? shows the speed-up factor of the best version of the pheromone update kernel compared to the sequential code. The computation pattern of this kernel is based on data-parallelism, and it shows a speed-up that grows with the problem size, reaching up to a 20x speed-up factor on the Tesla C2050.
5.3. Quality of the Solution
Figure 6: The quality of the solution, averaged over 1000 iterations. The 95% confidence interval is shown on top of the bars.

Finally, Figure ?? shows a quality comparison of the solutions obtained by the main algorithms targeted in this work. They are normalized with respect to the optimum solution of each benchmark, reported previously (see Table ??). The solution quality shown is the result of running all algorithms for a fixed number of iterations (1000), averaged over 5 independent runs. Moreover, a 95% confidence interval is also provided. It is worth noticing that the quality of the tours obtained by the GPU codes is similar to that of the verified sequential code, and even better in some cases.
6. Related Work
6.1. Parallel implementations
Stützle [? ] describes the simplest case of ACO parallelisation, in which independent instances of the ACO algorithm are run on different processors. Parallel runs have no communication overhead, and the final solution is taken as the best solution over all independent executions. Improvements over non-communicating parallel runs may be obtained by exchanging information among processors. Michel and Middendorf [? ] present a solution based on this principle, whereby separate colonies exchange pheromone information. In more recent work, Chen et al. [? ] divide the ant population into equally-sized sub-colonies, each assigned to a different processor. Each sub-colony searches for an optimal local solution, and information is exchanged between processors periodically. Lin et al. [? ] propose dividing the problem into subcomponents, with each subgraph assigned to a different processing unit. To explore a graph and find a complete solution, an ant moves from one processing unit to another, and messages are sent to update pheromone levels. The authors demonstrate that this approach reduces local complexity and memory requirements, thus improving overall efficiency.

6.2. GPU implementations


In terms of GPU-specific designs for the ACO algorithm, Jiening et al. [? ] propose an implementation of the Max-Min Ant System (one of many ACO variants) for the TSP, using C++ and NVIDIA Cg. They focus their attention on the tour construction stage, and compute the shortest path on the CPU. Catala et al. [? ] propose two ACO implementations on GPUs, applying them to the Orienteering Problem, using vertex and shader processors. In [? ], You discusses a CUDA implementation of the Ant System for the TSP. The tour construction stage is identified as a CUDA kernel, launched with as many threads as there are artificial ants in the simulation. The tabu list of each ant is stored in shared memory, and the pheromone and distance matrices are stored in texture memory. The pheromone update stage is calculated on the CPU. Li et al. [? ] propose a method based on a fine-grained model for GPU-acceleration, which maps a parallel ACO algorithm to the GPU through CUDA. Ants are assigned to single processors, and they are connected by a population structure [? ].
Fu et al. [? ] design the MAX-MIN Ant System for the TSP with MATLAB and the Jacket toolbox, accelerating some parts of the algorithm on the GPU. They note the low performance obtained by the traditional roulette wheel as a selection process on GPUs, and propose an alternative selection process, called All-In-Roulette, which generates an m x n matrix of pseudorandom numbers, where m is the number of ants and n the number of cities. In his PhD thesis, Robin M. Weiss applies the ACO algorithm to a data-mining problem [? ]. He analyses several GPU ACO designs, showing the low performance of previous designs based on task-parallelism.
Although these proposals offer a useful starting point when considering
GPU-based parallelisation of ACO, they are deficient in two main regards.
Firstly, they fail to offer any systematic analysis of how best to implement this
particular algorithm. Secondly, they fail to consider an important component
of the ACO algorithm; namely, the pheromone update.
7. Conclusions and Future Work
Ant Colony Optimisation (ACO) belongs to the family of population-based meta-heuristics, and has been successfully applied to many NP-complete problems. As a population-based algorithm, it is intrinsically parallel, and thus well-suited to implementation on parallel architectures. The ACO algorithm comprises two main stages: tour construction and pheromone update.

Previous efforts to parallelize ACO on the GPU focused on the former stage, using task-based parallelism. We have demonstrated that this approach does not fit well on the GPU architecture, and provided an alternative approach based on data parallelism. This enhances the GPU performance both by increasing the parallelism and by avoiding warp divergence. Moreover, we have proposed an alternative selection procedure that fits better with the idiosyncrasies of the GPU architecture.
In addition, we have provided the first known implementation of the pheromone update stage on the GPU. Some GPU computing techniques were discussed for avoiding atomic instructions. However, we have shown that those techniques are even more costly than applying atomic operations directly.
Possible future directions include investigating the effectiveness of GPU-based ACO algorithms on other NP-complete optimisation problems. We will also implement other ACO algorithms, such as the Ant Colony System, which can also be efficiently implemented on the GPU. The conjunction of ACO and GPUs is still at a relatively early stage; we emphasize that we have so far tested only a relatively simple variant of the algorithm. There are many other types of ACO algorithm still to explore, and as such, it is a potentially fruitful area of research. We hope that this paper stimulates further discussion and work.
Acknowledgements
This work was partially supported by a travel grant from the EU FP7 NoE HiPEAC IST-217068, the European Network of Excellence on High Performance and Embedded Architecture and Compilation. The first two authors acknowledge the support of the Spanish MEC and the European Commission FEDER funds under grants Consolider Ingenio-2010 CSD2006-00046 and TIN2006-15516-C04-03, and also of the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 00001/CS/2007.
References
[1] M. Dorigo, T. Stützle, Ant Colony Optimization, Bradford Company, Scituate, MA, USA, 2004.
[2] M. Dorigo, M. Birattari, T. Stützle, Ant colony optimization, IEEE Computational Intelligence Magazine 1 (4) (2006) 28-39.
[3] C. Blum, Ant colony optimization: Introduction and recent trends, Physics of Life Reviews 2 (4) (2005) 353-373.
[4] E. Lawler, J. Lenstra, A. Kan, D. Shmoys, The Traveling Salesman Problem, Wiley, New York, 1987.
[5] M. Dorigo, A. Colorni, V. Maniezzo, Positive feedback as a search strategy, Tech. Rep. 91-016, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy (1991).
[6] M. Dorigo, V. Maniezzo, A. Colorni, The ant system: Optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man, and Cybernetics-Part B 26 (1996) 29-41.
[7] M. Dorigo, E. Bonabeau, G. Theraulaz, Ant algorithms and stigmergy, Future Gener. Comput. Syst. 16 (2000) 851-871.
[8] T. Stützle, Parallelization strategies for ant colony optimization, in: PPSN V: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, Springer-Verlag, London, UK, 1998, pp. 722-731.
[9] X. JunYong, H. Xiang, L. CaiYun, C. Zhong, A novel parallel ant colony optimization algorithm with dynamic transition probability, International Forum on Computer Science-Technology and Applications 2 (2009) 191-194.
[10] Y. Lin, H. Cai, J. Xiao, J. Zhang, Pseudo parallel ant colony optimization for continuous functions, International Conference on Natural Computation 4 (2007) 494-500.
[11] M. Garland, D. B. Kirk, Understanding throughput-oriented architectures, Commun. ACM 53 (2010) 58-66.
[12] J. R. Fischer, United States, Frontiers of massively parallel scientific computation, National Aeronautics and Space Administration, Scientific and Technical Information Office, Washington, D.C., 1987.
[13] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (2008) 13-27.
[14] J. M. Cecilia, J. M. García, M. Ujaldón, A. Nisbet, M. Amos, Parallelization strategies for ant colony optimisation on GPUs, in: NIDISC 2011: 14th International Workshop on Nature Inspired Distributed Computing. Proc. 25th International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage (Alaska), USA, 2011.
[15] D. S. Johnson, L. A. McGeoch, The Traveling Salesman Problem: A Case Study in Local Optimization, 1997.
[16] M. Dorigo, Optimization, learning and natural algorithms, Ph.D. thesis, Politecnico di Milano, Italy (1992).
[17] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st Edition, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[18] NVIDIA, NVIDIA CUDA C Programming Guide 3.1.1, 2010.
[19] NVIDIA, NVIDIA CUDA C Best Practices Guide 3.2, 2010.
[20] NVIDIA, NVIDIA CUDA CURAND Library, 2010.
[21] T. Scavo, Scatter-to-gather transformation for scalability (Aug 2010).
[22] NVIDIA, Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.
[23] G. Reinelt, TSPLIB: a traveling salesman problem library, ORSA Journal on Computing 3 (4) (1991) 376-384.
[24] TSPLIB Webpage, http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ (February 2011).
[25] R. Michel, M. Middendorf, An island model based ant system with lookahead for the shortest supersequence problem, in: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature, PPSN V, Springer-Verlag, London, UK, 1998, pp. 692-701.
[26] L. Chen, H.-Y. Sun, S. Wang, Parallel implementation of ant colony optimization on MPP, in: Machine Learning and Cybernetics, 2008 International Conference on, Vol. 2, 2008, pp. 981-986.
[27] W. Jiening, D. Jiankang, Z. Chunfeng, Implementation of ant colony algorithm based on GPU, in: CGIV '09: Proceedings of the 2009 Sixth International Conference on Computer Graphics, Imaging and Visualization, IEEE Computer Society, Washington, DC, USA, 2009, pp. 50-53.
[28] A. Catala, J. Jaen, J. Modioli, Strategies for accelerating ant colony optimization algorithms on graphical processing units, in: IEEE Congress on Evolutionary Computation, 2007, pp. 492-500.
[29] Y.-S. You, Parallel ant system for traveling salesman problem on GPUs, in: GECCO 2009 - GPUs for Genetic and Evolutionary Computation, 2009, pp. 1-2.
[30] J. Li, X. Hu, Z. Pang, K. Qian, A parallel ant colony optimization algorithm based on fine-grained model with GPU-acceleration, International Journal of Innovative Computing, Information and Control 5 (2009) 3707-3716.
[31] J. Fu, L. Lei, G. Zhou, A parallel ant colony optimization algorithm with GPU-acceleration based on all-in-roulette selection, in: 2010 Third International Workshop on Advanced Computational Intelligence (IWACI), 2010, pp. 260-264.
[32] R. M. Weiss, GPU-accelerated data mining with swarm intelligence, Ph.D. thesis, Department of Computer Science, Macalester College (2010).
