
POLITECNICO DI TORINO

Master's Degree Course in Electronic Engineering

Master's Thesis Summary

FPGA-based acceleration of a particle simulation
High Performance Computing application

Supervisors: Prof. Luciano Lavagno, Dr. Giuseppe Piero Brandino
Candidate: Aldo Conte

Key words: FPGA, HPC, Molecular Dynamics, Xilinx Vivado, High-Level Synthesis

Today and in the foreseeable future, high-performance computers will be fundamental and widely
used thanks to their capability to analyze and process huge volumes of data. Such systems currently
use hybrid platforms that exploit the massive parallel computing power of Graphics Processing Units
(GPUs). The basic idea in such heterogeneous systems is to "off-load" part of the workload from
the main processor to a co-processor in order to speed up execution. GPUs are optimal candidates
for this role thanks to their very high floating-point throughput, an architecture well suited to data
parallelism, and a higher memory bandwidth than conventional processors.
However, GPU-based systems are power-hungry: running and maintaining such HPC systems at the
exa-flop scale could be technologically and economically too expensive. In other words, reaching
exaflop performance with state-of-the-art technology would have energy and cost requirements so
high as to make an exa-scale system unfeasible.
In this scenario the European ExaNeSt H2020 project, within which this thesis work was performed
in cooperation with eXactLab, explores the idea of greatly lowering power consumption by
employing Field Programmable Gate Arrays (FPGAs), while at the same time increasing overall
performance.
The platform used in ExaNeSt, the Zynq UltraScale+, combines a multi-core processor with FPGA
fabric: the accelerator is implemented on a single System on Chip (SoC) that includes both a
multi-core ARM Cortex processor and UltraScale+ FPGA logic.
Specifically, this thesis investigated the possibility of using FPGA accelerators in the field of
computational science to offload the compute-intensive parts (or kernels) of a Molecular Dynamics
code, originally written in C++ and then translated into the OpenCL parallel language.
miniMD is a simple, parallel molecular dynamics (MD) code, designed for studying the physical
movements of particles such as atoms or molecules, composed of five OpenCL kernels
(neighbor_bin, neighbor_build, force_compute, integrate_initial, integrate_final); a sketch of how
they are sequenced is shown below.
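
The following is a minimal sketch of how the five kernels fit into a velocity-Verlet timestep loop with periodic neighbor-list rebuilds, the scheme typically used by miniMD. run_kernel is a hypothetical stand-in for the real OpenCL host calls, and the loop constants are assumed values, not miniMD defaults.

    #include <stdio.h>

    /* Hypothetical stand-in for clEnqueueNDRangeKernel + synchronization. */
    static void run_kernel(const char *name) { printf("launch %s\n", name); }

    int main(void)
    {
        const int nsteps = 100;           /* assumed values for illustration */
        const int reneighbor_every = 20;

        for (int step = 0; step < nsteps; step++) {
            run_kernel("integrate_initial");   /* first half-step of velocity Verlet */
            if (step % reneighbor_every == 0) {
                run_kernel("neighbor_bin");    /* sort particles into spatial bins */
                run_kernel("neighbor_build");  /* rebuild per-particle neighbor lists */
            }
            run_kernel("force_compute");       /* pairwise force evaluation */
            run_kernel("integrate_final");     /* second half-step velocity update */
        }
        return 0;
    }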
In the first part of this thesis, each kernel was studied in order to understand how it could be
accelerated using the FPGA portion of the Multiprocessor SoC. Profiling the full miniMD
application showed that the task that builds, for each particle, the list of its neighboring particles
(the neighbor_build kernel) and the one that computes the forces (the force_compute kernel) are
the most compute-intensive ones and therefore dominate the total execution time of the
application.

Using High-Level Synthesis (HLS) it was possible to directly generate several RTL
implementations for each of the above kernels. Specifically, for the neighbor_build and
force_compute kernels, different optimizations were implemented starting from the original
OpenCL code in order to accelerate their execution on the FPGA. The multiprocessor system-on-
chip architecture employed is shown in the following figure.

The Programmable Logic (PL, the FPGA fabric) and the Processing System (PS, the ARM Cortex
cores) share the same DDR memory space. This off-chip global memory gathers all the data
prepared by the Processing System so that it can be transferred to and processed by the
Programmable Logic; once the FPGA has finished processing, the results are written back to the
DRAM. The off-chip DRAM is accessed through four 128-bit-wide buses, S_AXI_HP[0-3]_FPD. In
the FPGA fabric, one or more Compute Units are instantiated, and each of them can host many
work-groups, with many work-items executing the kernel's code.
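
In this flow, each __global pointer argument of an OpenCL kernel is mapped by the Xilinx tools onto an AXI master interface through which the compute unit reaches the shared DDR. A minimal sketch of such a signature follows; the argument names are illustrative, not the actual miniMD declarations.

    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void force_compute(__global const float4 *pos,       /* particle positions (DDR) */
                       __global float4       *force,     /* computed forces (DDR) */
                       __global const int    *neighbors, /* flattened neighbor lists */
                       __global const int    *numneigh,  /* neighbor count per particle */
                       const int              nlocal)    /* number of local particles */
    {
        /* body omitted in this sketch */
    }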
The Xilinx OpenCL high-level synthesis tool, Vivado HLS, was used to optimize the code for
execution on the Xilinx FPGA. Several manual code annotations were applied, using both
standard and Xilinx-specific OpenCL attributes, in order to drive the HLS tool towards a good level
of performance. The optimization workflow required two main steps. The first concerned the
kernels' reading and writing of the off-chip DRAM, moving data in and out of the internal FPGA
SRAM (local memory) through the AXI interfaces. These optimizations were meant to exploit burst
memory transactions and to improve the access bandwidth between the external DDR memory
and the Programmable Logic (PL), thereby reducing memory access time. These optimizations
were not trivial, for two main reasons: (1) not all DRAM arrays were accessed sequentially by the
kernels, and (2) not all the loops that accessed arrays were perfectly nested, which prevented the
HLS tool from inferring some bursts. A sketch of this staging pattern follows.
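
The sketch below shows the staging pattern under stated assumptions: a single-work-item kernel, n a multiple of TILE, and illustrative names. async_work_group_copy moves a contiguous tile from DDR into on-chip local memory, which the HLS tool can implement as a single wide AXI burst instead of many scattered accesses.

    #define TILE 256    /* particles staged per pass (assumes n % TILE == 0) */

    __kernel __attribute__((reqd_work_group_size(1, 1, 1)))
    void stage_and_compute(__global const float4 *pos,
                           __global float4 *out,
                           const int n)
    {
        __local float4 pos_l[TILE];            /* on-chip (BRAM) staging buffer */

        for (int base = 0; base < n; base += TILE) {
            /* burst read: one contiguous AXI transaction from DDR */
            event_t e = async_work_group_copy(pos_l, pos + base, TILE, 0);
            wait_group_events(1, &e);

            __attribute__((xcl_pipeline_loop))
            for (int i = 0; i < TILE; i++)
                out[base + i] = pos_l[i] * 2.0f;   /* placeholder computation */
        }
    }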
The second optimization step concerned speeding up the execution of the kernels' code by
applying two main techniques: pipelining and unrolling. Thanks to the previous optimization step,
all the needed variables have already been copied into the local memories. This enables fast
access by all the work-items, so that they can execute the kernel's code in a pipelined fashion (as
depicted in the figure above and sketched below).
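A minimal sketch of the two techniques, using the Xilinx xcl_pipeline_workitems attribute and the standard opencl_unroll_hint; MAX_NEIGH and the kernel itself are illustrative, not the real force_compute code.

    #define MAX_NEIGH 64   /* assumed fixed bound, which enables pipelining */

    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void accumulate(__global const float *dist2,
                    __global float *energy)
    {
        __attribute__((xcl_pipeline_workitems)) {   /* pipeline across work-items */
            int gid = get_global_id(0);
            float acc = 0.0f;

            __attribute__((opencl_unroll_hint(4)))  /* 4 partial sums per cycle */
            for (int k = 0; k < MAX_NEIGH; k++)
                acc += dist2[gid * MAX_NEIGH + k];

            energy[gid] = acc;
        }
    }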
At the end of all the optimizations, the following results have been obtained.

Kernel name      Original version   Optimized version   N atoms   Frequency
neighbor_build   4.856 s            4.3 s               10976     250 MHz
force_compute    395.67 ms          111 ms              10976     250 MHz

Since the neighbor_build kernel is made up of two nested loops whose inner bound is not fixed, it
was not possible to obtain further optimizations; in this case the optimizations concerned only the
burst data transfers, achieving a decrease of about 0.5 s in the overall execution time. For the
force_compute kernel, thanks to a partial rewrite of the code, better optimizations were obtained
both in data transfer and in processing: execution on the FPGA showed a decrease of
284.7 ms from the initial 395.67 ms. The loop shape that limited neighbor_build is sketched below.
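
The following sketch shows why this loop shape resists HLS optimization: the inner bound is data-dependent and work happens between the two loop headers, so the tool can neither flatten the nest nor pipeline it with a known depth. All names are hypothetical.

    void neighbor_build_shape(int nlocal, const int *candidates_per_particle)
    {
        for (int i = 0; i < nlocal; i++) {            /* fixed, known bound */
            int ncand = candidates_per_particle[i];   /* data-dependent bound: */
            for (int c = 0; c < ncand; c++) {         /* variable trip count and */
                /* distance check and neighbor-list append */
            }                                         /* imperfect nesting */
        }
    }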
Up to this point, the effort was aimed at optimizing the execution time of the most compute-
intensive kernels, which consequently reduced the time to complete the full miniMD simulation.
Further optimizations were then carried out by creating on the FPGA a customized area called the
OpenCL region (OCL region), which allowed us to further exploit the OpenCL parallel model. This
means that multiple compute units (instances) of the same kernel could be created in order to
further parallelize the work, beyond the pipelining of the work-items. This technique resulted in a
much more efficient use of the bandwidth between external memory and programmable logic,
enabling parallelism at a coarse-grained level; a sketch of how work can be split between two
compute units follows.
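
A hedged host-side sketch, assuming an SDAccel-style flow where two compute units of the same kernel have been instantiated (e.g. via the xocc --nk option) and an out-of-order command queue is used: the NDRange is split in half so the runtime can dispatch each half to a different compute unit, each with its own AXI interface. The function and variable names are illustrative.

    #include <CL/cl.h>

    void launch_on_two_cus(cl_command_queue queue, cl_kernel krnl,
                           size_t n_atoms, size_t local_size)
    {
        size_t half = n_atoms / 2;            /* assumes n_atoms is even */

        for (int cu = 0; cu < 2; cu++) {
            size_t offset = cu * half;
            /* with an out-of-order queue, the two halves can run concurrently,
             * one on each compute unit */
            clEnqueueNDRangeKernel(queue, krnl, 1, &offset, &half,
                                   &local_size, 0, NULL, NULL);
        }
        clFinish(queue);
    }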

Finally, we implemented a version of the application with two compute units instantiated for
each of the aforementioned kernels. As can be seen from the figure above, the two instances
are linked to the external memory through two separate AXI interfaces. This doubles the
bandwidth available for data transfer and therefore further increases performance, as the
following results demonstrate.

Kernel name      Single CU   Two CUs
neighbor_build   4.3 s       2.1 s
force_compute    111 ms      64 ms

Some further optimizations of the current architecture would also be possible, both at the memory
and at the system level, in order to further increase performance, particularly on the data-transfer
side. The synthesized CUs could then be mapped onto each node of a massively parallel
FPGA-accelerated supercomputer to process more realistic Large-scale Atomic/Molecular
simulations (the miniMD code used in this thesis is a simplified proxy that exposes the key issues
to be faced by the real application).
