Europar2024workshops Paper 106
1 Introduction
Collective operations are commonly used in high-performance computing to simplify communication in distributed-memory environments. Implementations of the Message Passing Interface (MPI), e.g., OpenMPI [3], provide several highly optimized collectives. However, pure MPI adds overhead on modern many-core processors.
Recently, several asynchronous many-task runtimes have emerged; see [7] for a detailed comparison. In this work, we focus on HPX, an actively developed C++ Standard Library for Concurrency and Parallelism [5]. While HPX realizes asynchronous parallelism via futurized tasks, its Active Global Address Space (AGAS) provides distributed functionality. Instead of plain messages, HPX uses an abstraction called a parcel. Parcels are transferred over a so-called parcelport. HPX supports multiple parcelports, allowing the communication backend to be switched via command-line arguments.
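As a minimal illustration of futurized tasks, consider the following sketch. It assumes a standard HPX installation; headers and exact APIs may differ slightly between HPX versions.

// A minimal sketch of futurized task parallelism in HPX.
// <hpx/hpx_main.hpp> lets main() run on the HPX runtime.
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <iostream>

int main()
{
    // Launch a task asynchronously; the future represents its eventual result.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });

    // Attach a continuation instead of blocking; it runs once the task is done.
    hpx::future<void> done = f.then([](hpx::future<int> r) {
        std::cout << "result: " << r.get() << '\n';
    });

    done.get();  // wait before the runtime shuts down
    return 0;
}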
We use collectives to implement a distributed Fast Fourier Transform (FFT)
and benchmark the parcelports of HPX against the MPI+X parallelization of
the popular FFTW3 library [2].
2 Application
The FFT is one of the most common algorithms in modern computing. The one-dimensional transform extends naturally to multiple dimensions via dimension-wise FFT computations. In order to compute a distributed FFT in multiple dimensions, four steps are necessary (see Figure 1).
Fig. 1. The four steps of the distributed multi-dimensional FFT, illustrated for two domains (Domain 1 and Domain 2).
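The dimension-wise expansion follows from the separability of the multi-dimensional DFT; for a two-dimensional input $x_{n_1,n_2}$ of size $N_1 \times N_2$,
\[
X_{k_1,k_2} \;=\; \sum_{n_1=0}^{N_1-1} e^{-2\pi i\, k_1 n_1 / N_1}
\left( \sum_{n_2=0}^{N_2-1} x_{n_1,n_2}\, e^{-2\pi i\, k_2 n_2 / N_2} \right),
\]
so the inner sums are $N_1$ one-dimensional FFTs along the second dimension and the outer sums are $N_2$ one-dimensional FFTs along the first dimension.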
3 Methods
While there exist various collectives, we focus on those relevant to the FFT communication pattern. The easiest way to implement this pattern is the all-to-all collective. While all-to-all exactly matches the FFT communication pattern, it has one major drawback: the collective is synchronizing, i.e., no participant can continue before all data has been exchanged. However, the arriving data chunks can be transposed as soon as they are received. Thus, we propose to replace the all-to-all collective with a combination of N scatter collectives. This allows us to hide some of the computation behind the long communication time.
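A minimal sketch of this overlap is given below. The helpers receive_chunk_from() (standing in for one of the N scatter collectives) and transpose_into() (the local transpose of one received chunk) are hypothetical placeholders, not HPX API; only hpx::future, future::then, hpx::make_ready_future, and hpx::wait_all are actual HPX facilities.

#include <hpx/hpx.hpp>
#include <cstddef>
#include <vector>

// Hypothetical placeholder: in the real code this wraps one scatter
// collective; here it simply returns a ready dummy chunk.
hpx::future<std::vector<double>> receive_chunk_from(std::size_t src)
{
    return hpx::make_ready_future(std::vector<double>(1024, double(src)));
}

// Hypothetical placeholder for the local transpose of the chunk from locality src.
void transpose_into(std::vector<double>& local, std::vector<double> chunk, std::size_t src)
{
    (void) local; (void) chunk; (void) src;
}

void exchange_and_transpose(std::vector<double>& local, std::size_t num_localities)
{
    std::vector<hpx::future<void>> done;
    done.reserve(num_localities);

    for (std::size_t src = 0; src != num_localities; ++src)
    {
        // Transpose each chunk as soon as it arrives instead of waiting
        // for a synchronized all-to-all to finish.
        done.push_back(receive_chunk_from(src).then(
            [&local, src](hpx::future<std::vector<double>> chunk) {
                transpose_into(local, chunk.get(), src);
            }));
    }

    hpx::wait_all(done);  // all chunks received and transposed
}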
At the time of writing, HPX ships with three parcelports:
• TCP Parcelport: Requires no external dependencies.
• MPI Parcelport: Requires external MPI installation.
• LCI Parcelport: Can be fetched alongside HPX or requires an external LCI installation.
Both TCP [1] and MPI [6] parcelports were provided by Heller [4]. The TCP
parcelport is mainly intended to act as a fallback in case something goes wrong.
Yan et al. recently added support for a new parcelport to HPX based on the
Lightweight Communication Interface (LCI) [8].
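For illustration only: parcelports are typically enabled when HPX is built and one of them is then selected at program startup. The CMake options, configuration keys, and the fft_benchmark binary sketched below follow common HPX naming but are assumptions; they should be checked against the documentation of the installed HPX version.

# build time (CMake, assumed option names): enable the desired parcelports
cmake -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_LCI=ON ..
# run time: pick the backend for a hypothetical fft_benchmark binary
./fft_benchmark --hpx:ini=hpx.parcel.mpi.enable=1
./fft_benchmark --hpx:ini=hpx.parcel.lci.enable=1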
4 Results
The results in this section were obtained on a 16-node cluster. For more detailed
hardware specifications, see Figure 2. All runtimes are averaged over 50 runs and
are visualized with 95% confidence bars.
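For reference, a 95% confidence bar over $n = 50$ runs is commonly computed from the sample mean $\bar{t}$ and sample standard deviation $s$ as
\[
\bar{t} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \;\approx\; \bar{t} \;\pm\; 2.01\,\frac{s}{\sqrt{50}},
\]
assuming the usual Student-t construction; the exact estimator behind the bars is an assumption here.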
Fig. 2. Hardware specification of cluster buran: Nodes 16; Connection InfiniBand HDR; Speed 200 Gb/s; Sockets 2; CPU AMD EPYC 7352; Cores 24; Clock rate 2.3 GHz; L3 cache 128 MB; RAM 256 GB.
Fig. 3. Communication time in s over the data chunk size (in double-precision floating-point numbers, from $2^{5}$ to $2^{31}$) for the TCP, MPI, and LCI parcelports.
In our chunk size benchmark (see Figure 3), we use the scatter collective
to simulate two separate one-way communication channels between two nodes.
The LCI parcelport outperforms the other parcelports across all chunk sizes.
In particular, the TCP parcelport incurs a large overhead for small data chunks.
In our strong scaling benchmarks, we compare the parcelports for both the all-to-all (see Figure 4) and scatter (see Figure 5) collectives against an FFTW3 implementation parallelized with MPI+pthreads. We choose a two-dimensional FFT of size $2^{14} \times 2^{14}$. Generally, the scatter-based approach is faster than a single all-to-all collective. Interestingly, on the benchmark cluster the TCP parcelport is faster than the MPI parcelport for this problem size. However, the runtimes
of the TCP parcelport skyrocket for the scatter approach. Again, the LCI parcelport dominates and even beats the FFTW3 reference by up to a factor of 3.
Supplementary materials
The source code and benchmark scripts are available at DaRUS1. The respective software and compiler versions are stated in the README.md of the source code.
References
1. Cerf, V.G., Dalal, Y.K.: Specification of Internet Transmission Control Program.
RFC 675 (1974)
2. Frigo, M., Johnson, S.: The Design and Implementation of FFTW3. Proceedings of
the IEEE 93(2), 216–231 (2005)
3. Graham, R.L., Woodall, T.S., Squyres, J.M.: Open MPI: A Flexible High Perfor-
mance MPI. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.)
Parallel Processing and Applied Mathematics. pp. 228–239. Springer Berlin Heidel-
berg, Berlin, Heidelberg (2006)
4. Heller, T.: Extending the C++ Asynchronous Programming Model with the HPX Runtime System for Distributed Memory Computing. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (2019)
5. Kaiser, H., et al.: HPX – the C++ standard library for parallelism and concurrency.
Journal of Open Source Software 5(53), 2352 (2020)
6. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard Ver-
sion 3.0 (2021)
7. Thoman, P., et al.: A taxonomy of task-based parallel programming technologies for high-performance computing. J. Supercomput. 74(4), 1422–1434 (2018)
8. Yan, J., Kaiser, H., Snir, M.: Design and Analysis of the Network Software Stack of an Asynchronous Many-task System – The LCI parcelport of HPX. In: Proceedings of the SC '23 Workshops. pp. 1151–1161. ACM, New York, NY, USA (2023)
1 Available in camera-ready version (visited on XX/0X/2024)