Europar2024workshops Paper 106
1 Introduction
Collective operations are commonly used in high-performance computing to simplify communication in distributed-memory environments. Implementations of the Message Passing Interface (MPI), e.g., OpenMPI [3], provide several highly optimized collectives. However, pure MPI adds overhead on modern many-core processors.
Recently, several asynchronous many-task runtimes have emerged; see [7] for a detailed comparison. In this work, we focus on HPX, an actively developed C++ Standard Library for Concurrency and Parallelism [5]. While HPX realizes asynchronous parallelism via futurized tasks, its Active Global Address Space (AGAS) provides distributed functionality. Instead of plain messages, HPX uses an abstraction called a parcel. Parcels are transferred over a so-called parcelport. HPX supports multiple parcelports, allowing the communication backend to be switched via command-line arguments.
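As a minimal illustration of futurized tasks, consider the following sketch. It assumes a standard HPX installation; headers and exact APIs may differ slightly between HPX versions.

// A minimal sketch of futurized task parallelism in HPX.
// <hpx/hpx_main.hpp> lets main() run on the HPX runtime.
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int main()
{
    // Launch a task asynchronously; the future represents its eventual result.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });

    // Attach a continuation instead of blocking; it runs once the task is done.
    hpx::future<void> done = f.then([](hpx::future<int> r) {
        std::cout << "result: " << r.get() << '\n';
    });

    done.get();  // wait before the runtime shuts down
    return 0;
}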
We use collectives to implement a distributed Fast Fourier Transform (FFT)
and benchmark the parcelports of HPX against the MPI+X parallelization of
the popular FFTW3 library [2].
2 Application
The FFT is one of the most common algorithms in modern computing. The one-dimensional transform extends naturally to multiple dimensions via dimension-wise FFT computations. In order to compute a distributed FFT in multiple dimensions, four steps are necessary (see Figure 1).
Fig. 1. The four steps of the distributed multi-dimensional FFT, illustrated for two domains (Domain 1 and Domain 2).
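The dimension-wise expansion follows from the separability of the multi-dimensional DFT; for a two-dimensional input $x_{n_1,n_2}$ of size $N_1 \times N_2$,
\[
X_{k_1,k_2} \;=\; \sum_{n_1=0}^{N_1-1} e^{-2\pi i\, k_1 n_1 / N_1}
\left( \sum_{n_2=0}^{N_2-1} x_{n_1,n_2}\, e^{-2\pi i\, k_2 n_2 / N_2} \right),
\]
so the inner sums are $N_1$ one-dimensional FFTs along the second dimension and the outer sums are $N_2$ one-dimensional FFTs along the first dimension.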
3 Methods
While there exist various collectives, we focus on those relevant to the FFT communication pattern. The easiest way to implement this pattern is the all-to-all collective. While all-to-all exactly matches the FFT communication pattern, it has one major drawback: the collective is synchronizing, i.e., no participant can continue before all data has been exchanged. However, the arriving data chunks can be transposed as soon as they are received. Thus, we propose to replace the all-to-all collective with a combination of N scatter collectives. This allows us to hide some of the computation behind the long communication time.
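A minimal sketch of this overlap is given below. The helpers receive_chunk_from() (standing in for one of the N scatter collectives) and transpose_into() (the local transpose of one received chunk) are hypothetical placeholders, not HPX API; only hpx::future, future::then, hpx::make_ready_future, and hpx::wait_all are actual HPX facilities.

#include <hpx/hpx.hpp>
#include <cstddef>
#include <vector>

// Hypothetical placeholder: in the real code this wraps one scatter
// collective; here it simply returns a ready dummy chunk.
hpx::future<std::vector<double>> receive_chunk_from(std::size_t src)
{
    return hpx::make_ready_future(std::vector<double>(1024, double(src)));
}

// Hypothetical placeholder for the local transpose of the chunk from locality src.
void transpose_into(std::vector<double>& local, std::vector<double> chunk, std::size_t src)
{
    (void) local; (void) chunk; (void) src;
}

void exchange_and_transpose(std::vector<double>& local, std::size_t num_localities)
{
    std::vector<hpx::future<void>> done;
    done.reserve(num_localities);

    for (std::size_t src = 0; src != num_localities; ++src)
    {
        // Transpose each chunk as soon as it arrives instead of waiting
        // for a synchronized all-to-all to finish.
        done.push_back(receive_chunk_from(src).then(
            [&local, src](hpx::future<std::vector<double>> chunk) {
                transpose_into(local, chunk.get(), src);
            }));
    }

    hpx::wait_all(done);  // all chunks received and transposed
}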
At the time of writing, HPX ships with three parcelports:
• TCP Parcelport: Requires no external dependencies.
• MPI Parcelport: Requires external MPI installation.
• LCI Parcelport: Can be fetched alongside HPX or requires an external LCI installation.
Both TCP [1] and MPI [6] parcelports were provided by Heller [4]. The TCP
parcelport is mainly intended to act as a fallback in case something goes wrong.
Yan et al. recently added support for a new parcelport to HPX based on the
Lightweight Communication Interface (LCI) [8].
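For illustration only: parcelports are typically enabled when HPX is built and one of them is then selected at program startup. The CMake options, configuration keys, and the fft_benchmark binary sketched below follow common HPX naming but are assumptions; they should be checked against the documentation of the installed HPX version.

# build time (CMake, assumed option names): enable the desired parcelports
cmake -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_LCI=ON ..
# run time: pick the backend for a hypothetical fft_benchmark binary
./fft_benchmark --hpx:ini=hpx.parcel.mpi.enable=1
./fft_benchmark --hpx:ini=hpx.parcel.lci.enable=1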
4 Results
The results in this section were obtained on a 16-node cluster. For more detailed
hardware specifications, see Figure 2. All runtimes are averaged over 50 runs and
are visualized with 95% confidence bars.
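For reference, a 95% confidence bar over $n = 50$ runs is commonly computed from the sample mean $\bar{t}$ and sample standard deviation $s$ as
\[
\bar{t} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \;\approx\; \bar{t} \;\pm\; 2.01\,\frac{s}{\sqrt{50}},
\]
assuming the usual Student-t construction; the exact estimator behind the bars is an assumption here.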
Fig. 2. Hardware specification of cluster buran: Nodes 16; Connection InfiniBand HDR; Speed 200 Gb/s; Sockets 2; CPU AMD EPYC 7352; Cores 24; Clock rate 2.3 GHz; L3 cache 128 MB; RAM 256 GB.
Fig. 3. Communication time in s over the data chunk size (in double-precision floating-point numbers, from $2^{5}$ to $2^{31}$) for the TCP, MPI, and LCI parcelports.
In our chunk size benchmark (see Figure 3), we use the scatter collective
to simulate two separate one-way communication channels between two nodes.
The LCI parcelport outperforms the other parcelports across all chunk sizes.
In particular, the TCP parcelport incurs a large overhead for small data chunks.
In our strong scaling benchmarks, we compare the parcelports for both the all-to-all (see Figure 4) and scatter (see Figure 5) collectives against an FFTW3 implementation parallelized with MPI+pthreads. We choose a two-dimensional FFT of size $2^{14} \times 2^{14}$. Generally, the scatter-based approach is faster than a single all-to-all collective. Interestingly, on the benchmark cluster the TCP parcelport is faster than the MPI parcelport for this problem size. However, the runtimes
of the TCP parcelport skyrocket for the scatter approach. Again, the LCI parcelport dominates and even beats the FFTW3 reference by up to a factor of 3.
Supplementary materials
The source code and benchmark scripts are available at DaRUS1. The respective software and compiler versions are stated in the README.md of the source code.
References
1. Cerf, V.G., Dalal, Y.K.: Specification of Internet Transmission Control Program.
RFC 675 (1974)
2. Frigo, M., Johnson, S.: The Design and Implementation of FFTW3. Proceedings of
the IEEE 93(2), 216–231 (2005)
3. Graham, R.L., Woodall, T.S., Squyres, J.M.: Open MPI: A Flexible High Perfor-
mance MPI. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.)
Parallel Processing and Applied Mathematics. pp. 228–239. Springer Berlin Heidel-
berg, Berlin, Heidelberg (2006)
4. Heller, T.: Extending the C++ Asynchronous Programming Model with the HPX Runtime System for Distributed Memory Computing. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (2019)
5. Kaiser, H., et al.: HPX – the C++ standard library for parallelism and concurrency.
Journal of Open Source Software 5(53), 2352 (2020)
6. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard Ver-
sion 3.0 (2021)
7. Thoman, P., et al.: A taxonomy of task-based parallel programming technologies for high-performance computing. J. Supercomput. 74(4), 1422–1434 (2018)
8. Yan, J., Kaiser, H., Snir, M.: Design and Analysis of the Network Software Stack of an Asynchronous Many-task System – The LCI parcelport of HPX. In: Proceedings of the SC '23 Workshops. pp. 1151–1161. ACM, New York, NY, USA (2023)
1 Available in camera-ready version (visited on XX/0X/2024)