
CS 420 Final Project

Parallel Breadth First Search


Agrima Bansal
Ryan Chui
Krishnan Swaminanthan-Gopalan

Problem

Implement a parallel Breadth-First Search to compute the distances to all the nodes in a
graph starting from node zero. Use a level-synchronous algorithm that synchronizes
after computing distances to all nodes at depth d before visiting nodes at depth d+1.
The graph can be generated using a random graph generator, such as the one available
for Graph500.

Introduction

The breadth-first search (BFS) algorithm is a systematic traversal algorithm for
complete graph exploration. It starts at a root node and explores every node at the
same depth, relative to the root, before moving on to nodes at a greater depth.

Challenges

Performing BFS in parallel faces significant challenges from both extensive branching
and the blocking communication needed to maintain level synchronization across all
processes. The level-synchronization problem arises because graphs contain cycles, so a
parallel BFS must avoid visiting the same node twice (by different threads or cores).
This also limits the effectiveness of using GPUs for the computation, since they are
throughput-oriented and handle both communication and branching poorly.
We divided these challenges into two specific areas when working on optimizations:
1. Graph traversal problems such as BFS are predominantly memory access-bound.
These accesses also depend on the structure of the input graph, causing
unpredictable branching behavior and an increasing computational requirement
with depth.

2. The difficulty of allocating tasks across all nodes so that their workloads and
compute times are balanced at each level.

Implementation

We implemented three different versions of the BFS algorithm. They were:

1. Sequential BFS:
The sequential implementation involved no parallelism and used a single thread
on a single processing core to traverse the entire graph. It served as the basis of
comparison for the parallel codes. This code was compiled with different
optimization flags to observe their effects on the BFS running time.
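
As an illustration of the level-synchronous structure described above, the following is a minimal sketch of a sequential BFS (not the exact project code), assuming the graph is stored in compressed sparse row (CSR) form with row_ptr and adj arrays:

```c
/* Minimal sketch of a level-synchronous sequential BFS.  The graph is
 * assumed to be in CSR form: row_ptr[v]..row_ptr[v+1] indexes the
 * neighbors of vertex v inside the adj[] array. */
#include <stdlib.h>

void bfs_sequential(int n, const int *row_ptr, const int *adj, int *depth)
{
    int *frontier = malloc(n * sizeof(int));
    int *next     = malloc(n * sizeof(int));
    int front_size = 0, next_size = 0, level = 0;

    for (int v = 0; v < n; v++) depth[v] = -1;   /* -1 == unvisited */
    depth[0] = 0;                                 /* start from node zero */
    frontier[front_size++] = 0;

    while (front_size > 0) {                      /* one iteration per level */
        next_size = 0;
        for (int i = 0; i < front_size; i++) {
            int v = frontier[i];
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int w = adj[e];
                if (depth[w] == -1) {             /* first time w is seen */
                    depth[w] = level + 1;
                    next[next_size++] = w;
                }
            }
        }
        /* swap frontier and next for the following level */
        int *tmp = frontier; frontier = next; next = tmp;
        front_size = next_size;
        level++;
    }
    free(frontier);
    free(next);
}
```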

2. OpenMP BFS:
The sequential code was parallelized to utilize the multiple threads of a single
computation node. OpenMP was used to implement shared memory parallelism.
This code involved no message passing, so the main performance concern of the
parallel BFS was the data races arising from multi-threaded parallelism. The
effect of varying the number of threads, the scheduling clause, and the proc_bind
clause on the running time was investigated to understand the scalability of the
code.
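
A minimal sketch of how a single level of this BFS can be parallelized with OpenMP is shown below. The schedule clause, the GCC-style compare-and-swap builtin used to claim a vertex exactly once, and the atomic capture used to append to the next frontier are illustrative assumptions, not necessarily the mechanisms used in our code:

```c
/* Sketch of one BFS level parallelized over the frontier with OpenMP. */
#include <omp.h>

static void bfs_level_omp(const int *row_ptr, const int *adj,
                          int *depth, int level,
                          const int *frontier, int front_size,
                          int *next, int *next_size)
{
    *next_size = 0;
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < front_size; i++) {
        int v = frontier[i];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            int w = adj[e];
            /* claim w exactly once across all threads (GCC builtin) */
            if (depth[w] == -1 &&
                __sync_bool_compare_and_swap(&depth[w], -1, level + 1)) {
                int idx;
                #pragma omp atomic capture
                idx = (*next_size)++;
                next[idx] = w;        /* append w to the next frontier */
            }
        }
    }
}
```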

3. MPI + OpenMP (hybrid) BFS:
Next, the code was extended to multiple nodes, using distributed memory
parallelism between the nodes with MPI and shared memory parallelism within
each node using OpenMP. Three different methodologies were implemented to
achieve load balancing between the compute nodes. They are detailed below:

a. Load balancing by splitting the vertices of the graph between each node:
The vertices are split equally between the different processors, and each
processor stores the neighbor lists of its own vertices. At each level, the
neighboring vertices that need to be traversed at the next level must therefore
be passed to the processors that own them, which requires each processor to
communicate with all others. This was implemented in three ways: Isend/Irecv,
Alltoallv, and Ialltoallv. Both strong scaling and weak scaling studies were
performed for each of them.
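
The following is a hedged sketch of what the Alltoallv-based frontier exchange for this vertex partitioning might look like; the buffer layout and helper names are illustrative assumptions. The newly discovered remote vertices are assumed to be already grouped by destination rank, with their counts in send_counts:

```c
/* Per-level frontier exchange with MPI_Alltoallv under a block
 * distribution of the vertices across ranks. */
#include <mpi.h>

static void exchange_frontier(int nprocs,
                              const int *send_buf, const int *send_counts,
                              int *recv_buf, int *recv_counts,
                              int *send_displs, int *recv_displs)
{
    /* every rank first learns how much it will receive from everyone else */
    MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT,
                 MPI_COMM_WORLD);

    send_displs[0] = recv_displs[0] = 0;
    for (int p = 1; p < nprocs; p++) {
        send_displs[p] = send_displs[p - 1] + send_counts[p - 1];
        recv_displs[p] = recv_displs[p - 1] + recv_counts[p - 1];
    }

    /* then the frontier vertices themselves are exchanged in one collective */
    MPI_Alltoallv(send_buf, send_counts, send_displs, MPI_INT,
                  recv_buf, recv_counts, recv_displs, MPI_INT,
                  MPI_COMM_WORLD);
}
```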

b. Load balancing by splitting the neighbor lists (edges) between nodes:
Instead of splitting the vertices equally among the processors, the vertices are
split so that each processor holds roughly the same number of neighbor entries.
The rationale behind this splitting is that the overall work a processor performs
is proportional to the number of neighbor entries it holds. The same three
message passing methods as in the previous case were implemented, and again
both strong scaling and weak scaling studies were performed for each of them.
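
A minimal sketch of this degree-balanced (neighbor-list) partitioning is given below; the function and variable names are illustrative, and the graph is again assumed to be in CSR form:

```c
/* Assign contiguous vertex ranges to ranks so that each rank owns roughly
 * total_edges / nprocs adjacency entries.  part[p] holds the first vertex
 * owned by rank p; part[nprocs] is a sentinel one past the last vertex. */
static void partition_by_degree(int n, const int *row_ptr,
                                int nprocs, int *part)
{
    long total_edges = row_ptr[n];
    long target = (total_edges + nprocs - 1) / nprocs;  /* edges per rank */

    int p = 0;
    long edges_so_far = 0;
    part[0] = 0;
    for (int v = 0; v < n && p < nprocs - 1; v++) {
        edges_so_far += row_ptr[v + 1] - row_ptr[v];     /* degree of v */
        if (edges_so_far >= (long)(p + 1) * target)
            part[++p] = v + 1;       /* rank p starts at vertex v + 1 */
    }
    while (p < nprocs - 1) part[++p] = n;  /* remaining ranks get nothing */
    part[nprocs] = n;
}
```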

c. Local list of nodes on each node with synchronization at each level:
Instead of partitioning based on the vertices, the graph is partitioned based on
its edges. Each processor holds the list of edges that it will process during the
BFS traversal. At the end of each level, the nodes use Allgatherv to generate the
list of vertices to be traversed in the next iteration; they also use Allgatherv to
synchronize the vertices just visited and their corresponding depths. Only a
strong scaling study could be performed for this method because of its memory
requirements.
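
A hedged sketch of the Allgatherv-based level synchronization is shown below; the helper name and arguments are illustrative assumptions:

```c
/* Each rank contributes the vertices it discovered this level, and
 * MPI_Allgatherv gives every rank the complete next frontier. */
#include <mpi.h>

static void gather_next_frontier(int nprocs,
                                 const int *local_next, int local_count,
                                 int *global_next, int *counts, int *displs,
                                 int *global_count)
{
    /* share how many vertices each rank discovered this level */
    MPI_Allgather(&local_count, 1, MPI_INT, counts, 1, MPI_INT,
                  MPI_COMM_WORLD);

    displs[0] = 0;
    for (int p = 1; p < nprocs; p++)
        displs[p] = displs[p - 1] + counts[p - 1];
    *global_count = displs[nprocs - 1] + counts[nprocs - 1];

    /* concatenate every rank's local frontier into the global list */
    MPI_Allgatherv(local_next, local_count, MPI_INT,
                   global_next, counts, displs, MPI_INT, MPI_COMM_WORLD);
}
```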

Results

1. Sequential BFS
Our results from the sequential implementation of the BFS graph traversal were as
expected and in line with results from previous homework assignments. The -O1
optimization flag produces an optimized image in a short amount of time, -O2 performs
all supported optimizations for the given architecture that do not involve a
space-speed tradeoff, and -O3 performs even more optimizations, prioritizing speed
over size. From -O1 to -O2 we observed an 18% speed increase, and from -O2 to -O3 we
observed a 4% speed increase. As the optimization level increases, so does the speed
of the BFS, which is the expected result.

2. OpenMP BFS

In our OpenMP BFS results we observed a strong relationship between the number of
threads and the execution time: as the number of threads increased, the time taken to
traverse the graph with BFS decreased. This trend was present with no scheduling
clause as well as with static and dynamic scheduling.

When the scheduling policy within the BFS code was varied, dynamic scheduling tended
to perform faster than static scheduling. This is because the code was parallelized
over vertices, and each vertex has a variable amount of work. In this case, the load
balancing from dynamic scheduling improves the speed of our implementation even with
the increased scheduling overhead.
Chunk size was defined as (N / t * num_threads), where N is the number of vertices to
be processed and t is a varying constant that controls the chunk size. When varying
the chunk size with a large number of threads under dynamic scheduling, we observed a
strong trend toward larger chunk sizes producing faster runtimes. Conversely, the
opposite occurred with a small number of threads, as is to be expected: with fewer
threads, it is less beneficial to pay the higher scheduling overhead of dynamic load
balancing.
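
As an illustration, the chunk size study can be expressed with the schedule clause roughly as follows. The grouping N / (t * num_threads) is an assumption about the intended reading of the formula above, and the surrounding function is illustrative only:

```c
#include <omp.h>

void process_frontier(int N /* vertices to process this level */)
{
    int t = 4;                                    /* the varying constant from the study */
    int chunk = N / (t * omp_get_max_threads());  /* assumed grouping of the formula */
    if (chunk < 1) chunk = 1;                     /* chunk size must be positive */

    #pragma omp parallel for schedule(dynamic, chunk)
    for (int v = 0; v < N; v++) {
        /* ... expand the neighbor list of frontier vertex v ... */
    }
}
```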
No significant differences in execution time were observed between spread and master
for the proc_bind clause, but in general spread was faster than master and benefited
more from a smaller chunk size.
3. MPI + OpenMP BFS
A large variation in the BFS time is observed between the different message passing
methods: Alltoallv (and Ialltoallv) versus Isend/Irecv. The time difference between
the codes employing Alltoallv and Ialltoallv is minimal, owing to the fact that there
is little computational work to overlap with the message passing (due to the nature of
the BFS algorithm). Surprisingly, the non-blocking functions seem to be less efficient
than their blocking counterparts, and this result is consistent for both vertex and
neighbor-list splitting. A more in-depth understanding of the exact workings of
non-blocking communication would be required to explain the observed trend. At this
point, we can hypothesize that non-blocking communication requires more work than the
blocking version, and that the gain from communication-computation overlap is not
large enough to offset it.

On the other hand, the BFS code employing Isend/Irecv took around two orders of
magnitude more time than the Alltoallv communication methodology. Since each process
has to communicate with all the other processes during the BFS traversal, there is
heavy congestion when a large number of processors is involved. In this scenario, it
is far more efficient to use MPI's built-in collective communication functions such as
Alltoallv.
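
For comparison, a hedged sketch of the point-to-point pattern is shown below; every rank posts one Isend and one Irecv per peer, which is the traffic pattern that causes the congestion described above. The receive counts are assumed to have been exchanged beforehand (for example with MPI_Alltoall), and the buffer layout is illustrative:

```c
#include <mpi.h>

static void exchange_frontier_p2p(int rank, int nprocs,
                                  int *const *send_bufs, const int *send_counts,
                                  int *const *recv_bufs, const int *recv_counts,
                                  MPI_Request *reqs /* length 2*nprocs */)
{
    int nreq = 0;
    for (int p = 0; p < nprocs; p++) {
        if (p == rank) continue;
        /* one receive and one send posted per remote rank */
        MPI_Irecv(recv_bufs[p], recv_counts[p], MPI_INT, p, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(send_bufs[p], send_counts[p], MPI_INT, p, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}
```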

Although we observe a large variation between the different message passing functions,
the times taken by the BFS codes employing load balancing by vertex splitting and by
neighbor-list splitting are similar. This is a direct result of the graph properties
in the Graph500 benchmark: the neighbors are not concentrated around a few particular
vertices but are spread evenly over all vertices, so the load is almost balanced even
when plain vertex splitting is employed. This is also evident from the small number of
vertices that are moved between processors during load balancing, and from the fact
that these transfers occur only between adjacent processors.
The local edge-division method is also much slower than the Alltoallv and Ialltoallv
versions (with both vertex and neighbor-list splitting), although it is faster than
the Isend/Irecv code. In addition, this kind of splitting has another significant
disadvantage: since each processor has to maintain the entire vertex list for
updating, the method does not scale in memory as the number of processors increases.
This is why it is absent from the weak scaling study.

Based on our results for the MPI and OpenMP BFS implementations, level-by-level BFS
does not scale well with multi-node message passing. We believe this is because the
time needed for the message passing that synchronizes all nodes outweighs the actual
computation time on each node: assigning and reporting the depth of each vertex is a
trivial computation compared to the message passing, especially as the number of
vertices and edges in the graph grows. This line of thought, together with the
branching in the code, is also what led us to exclude GPUs when first investigating
the problem.
Conclusion

The findings from the different implementations of our project can be summarized as follows:


1. In general, we found that for BFS, using the highest optimization level
generated the code with the best runtime. This is because the higher flag allows
the compiler to perform additional optimizations that improve code efficiency.
2. When using OpenMP for shared memory parallelization, it is best to use dynamic
scheduling, since the amount of work per task was observed to vary enough to
warrant the additional scheduling overhead that dynamic scheduling introduces.
3. We found that BFS graph traversals do not scale well with multi-node message
passing schemes, as the communication time far outweighs the additional
throughput gained from using multiple nodes.
