Journal of Computational Physics 352 (2018) 246–264


A GPU-accelerated semi-implicit fractional-step method for numerical solutions of incompressible Navier–Stokes equations
Sanghyun Ha, Junshin Park, Donghyun You ∗
Department of Mechanical Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang, Gyeongbuk 37673,
Republic of Korea

a r t i c l e   i n f o

Article history:
Received 6 February 2017
Received in revised form 25 September 2017
Accepted 26 September 2017
Available online 29 September 2017

Keywords:
GPU (Graphics Processing Unit)
CUDA (Compute Unified Device Architecture)
Navier–Stokes equations
Semi-implicit method
Tridiagonal matrix
Fast Fourier transform

a b s t r a c t

Utility of the computational power of Graphics Processing Units (GPUs) is elaborated for solutions of incompressible Navier–Stokes equations which are integrated using a semi-implicit fractional-step method. The Alternating Direction Implicit (ADI) and the Fourier-transform-based direct solution methods used in the semi-implicit fractional-step method take advantage of multiple tridiagonal matrices whose inversion is known as the major bottleneck for acceleration on a typical multi-core machine. A novel implementation of the semi-implicit fractional-step method designed for GPU acceleration of the incompressible Navier–Stokes equations is presented. Aspects of the programming model of Compute Unified Device Architecture (CUDA), which are critical to the bandwidth-bound nature of the present method, are discussed in detail. A data layout for efficient use of CUDA libraries is proposed for acceleration of tridiagonal matrix inversion and fast Fourier transform. OpenMP is employed for concurrent collection of turbulence statistics on a CPU while the Navier–Stokes equations are computed on a GPU. Performance of the present method using CUDA is assessed by comparing the speed of solving three tridiagonal matrices using ADI with the speed of solving one heptadiagonal matrix using a conjugate gradient method. An overall speedup of 20 times is achieved using a Tesla K40 GPU in comparison with a single-core Xeon E5-2660 v3 CPU in simulations of turbulent boundary-layer flow over a flat plate conducted on over 134 million grids. Enhanced performance of 48 times speedup is reached for the same problem using a Tesla P100 GPU.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Reduction of computational time is a major challenge in numerical simulations of fluid flow. At high Reynolds numbers,
the three-dimensional Navier–Stokes equations require a very large number of grid points to resolve broadband scales of
interest. Particularly in a direct numerical simulation (DNS) of turbulent fluid flow, the computational grid is required to be
dense enough to resolve the entire spectrum of turbulent scales in space and time. As a result, parallel computing based on
multi-core Central Processing Units (CPUs) has long been used to overcome such computational cost and still remains the mainstream methodology for solving large-scale problems [2,17]. Nevertheless, even for moderate Reynolds numbers, DNS

* Corresponding author.
E-mail address: dhyou@postech.ac.kr (D. You).

https://doi.org/10.1016/j.jcp.2017.09.055

Table 1
Comparison of CPU and GPU used in the present study.

                         Tesla K40c (GPU)    Xeon E5-2660 v3 (CPU)
Cache                    1.5 MB (a)          25 MB (b)
Core frequency           745 MHz             2.6 GHz
Memory bandwidth         288 GB/s            68 GB/s
DP peak throughput       1430 GFlops         416 GFlops
Computation units        15 SMX              10 cores
Performance per Watt     6.09 GFlops/W       3.96 GFlops/W

(a) GPU L2 cache shared by multiprocessors.
(b) CPU L3 cache shared by cores.

requires a few hundred million grid points and thereby calls for compute nodes occupying a considerable footprint in terms of space and energy consumption.
Graphics Processing Units (GPUs), on the other hand, offer new opportunities for accelerating computational solutions of the Navier–Stokes equations. As shown in Table 1, in which distinct characteristics of the two hardware platforms are compared, GPUs are generally characterized by energy efficiency and an emphasis on throughput and memory bandwidth rather than latency reduction. In other words, a GPU is better suited to handling large amounts of data in parallel than to fast processing of a single operation. Thus GPU computing can be an attractive candidate for large-scale problems, many of which can be formulated as massively parallel tasks. A comprehensive review by Niemeyer et al. [18] introduces a number of recent studies which have used GPUs to successfully accelerate flow solvers.
Although a GPU is known to have high compute throughput and memory bandwidth, it may not always deliver impressive performance gains, depending on the numerical scheme used for temporal integration of the Navier–Stokes equations. GPUs in general become effective when the given problem can be decomposed into tasks operating on independent data sets. Regarding methods of time advancement, fully explicit schemes are examples with such data independence. For this reason, researchers have employed GPUs to accelerate flow solvers based on fully explicit temporal integration of compressible as well as incompressible Navier–Stokes equations [1,25,27].
In contrast to fully explicit schemes which are usually used for compressible flows, semi-implicit fractional-step methods
with second-order finite-volume or finite-difference spatial discretization schemes [13] have been commonly used for so-
lutions of wall-bounded incompressible flows. In this method, Navier–Stokes equations are integrated using a combination
of explicit and implicit schemes for convective and viscous terms, respectively. The main advantage of the method for wall-
bounded flows is that the implicit treatment of viscous terms allows a stable solution even at a larger time-step size, which
is limited in fully explicit schemes due to the small grid size near the wall. Owing to stability and savings in computation
time, this method has widely been used for solving incompressible Navier–Stokes equations [9,12,16,28].
Unfortunately, the semi-implicit fractional-step method has limited scalability due to the serial nature of its algorithms.
Fractional-step methods in general divide the original three-dimensional incompressible Navier–Stokes equations into mo-
mentum equations for solving intermediate velocities, and a Poisson equation for correcting the velocities. When the viscous
terms are discretized using a commonly used second-order central-difference scheme and are integrated implicitly, the re-
sulting momentum equations require inversion of multiple tridiagonal matrices (TDMAs). A classic way of inverting TDMAs
is the Thomas algorithm, which performs O (n) operations; yet it is inherently difficult to parallelize. Similar reasoning ap-
plies to the Poisson equation; despite the efficiency of direct solution of matrices after Fourier transform in homogeneous
directions and finite-difference discretization in the wall-normal inhomogeneous flow direction, this method has not re-
ceived much attention from prior studies using GPUs because of its additional complexity coming from inverting TDMAs.
Alfonsi et al. [1] used a modified version of Thomas algorithm on GPUs to invert multiple TDMAs in the Poisson equation,
but parallelism was limited because each thread solved one linear system at a time. Deng et al. [6] implemented a TDMA
solver for accelerating the Alternating Direction Implicit (ADI) method on GPUs, but parallelism was likewise limited in that
each thread swept each line in the z-direction. For this reason, recent studies have simplified fractional-step methods by
using fully explicit time integration when GPUs were adopted for acceleration [1,10].
A few studies have assessed the performance of semi-implicit fractional-step methods on parallel machines. Borrell et
al. [2] developed a high-resolution DNS code for a solution of the flat-plate boundary layer under a zero pressure gradient up
to Re θ = 6800 using a semi-implicit fractional-step method, which treated only the wall-normal viscous terms implicitly. In
addition to Message Passing Interface (MPI), OpenMP was used for an additional level of parallelism, achieving weakly linear
scalability up to 32768 CPU cores. The authors mention the difficulty of parallelizing the TDMA solver and the fast
Fourier transform (FFT) when using a hybrid MPI-OpenMP approach. Another semi-implicit fractional-step method using
ADI for the Helmholtz-type momentum equations and partial diagonalization for the Poisson equation was implemented
using MPI [29]. The implementation showed that the discretized momentum equations which consist of nine tridiagonal
matrices (three for each velocity component) were the major source of reduced scalability. The work was further extended
to a hybrid CPU–GPU environment [30] and, for the same method, the Thomas algorithm was ported onto GPUs. However, the approach exhibits coarse-grain parallelism, providing an insufficient amount of work for effective GPU acceleration.
Fig. 1. Flow configuration used for a simulation of boundary-layer flow over a flat plate.

The present study proposes a new implementation of a semi-implicit fractional-step method coupled with ADI and Fourier-transform methods, designed particularly for GPU-accelerated computation of incompressible Navier–Stokes equations. Due to its highly serial and bandwidth-bound nature, the present choice of numerical methods is considered to be a good candidate for evaluating the potential of GPUs for solving Navier–Stokes equations using non-explicit
integration in time. Aspects of the explicit memory model of Compute Unified Device Architecture (CUDA) which are critical
to the present implementation of numerical methods are discussed in detail. Data layouts for efficient use of the cuSPARSE
library are suggested for accelerating TDMA inversion to overcome the most important bottleneck of this method. CUDA
streams and OpenMP are employed so that the computation time required for collecting turbulence statistics on CPU is
completely hidden by the main computation of Navier–Stokes equations on GPU. The potential of a single GPU in terms of
performance is demonstrated in a simulation of turbulent boundary-layer flow under a zero pressure gradient on a range of
computational grids. Performance tests are conducted on two different GPUs built on distinct architectures.
The present paper is organized as follows: in Section 2, numerical methods used in the flow solver are described. In
Section 3, strategies for GPU implementation of the present method are explained with detailed descriptions of the major
parts of the solver. Results from numerical experiments of flow over a flat plate with performance analyses are reported in
Section 4. Concluding remarks follow in Section 5.

2. Numerical methods

The non-dimensionalized form of the incompressible Navier–Stokes equations is given as

\frac{\partial u_i}{\partial x_i} = 0, \qquad (1)

\frac{\partial u_i}{\partial t} + \frac{\partial}{\partial x_j} (u_i u_j) = -\frac{\partial p}{\partial x_i} + \frac{1}{Re_\delta} \frac{\partial}{\partial x_j} \frac{\partial u_i}{\partial x_j}, \qquad (2)
where Re δ is the Reynolds number based on a characteristic length scale δ and a reference velocity U . For the present
study, δ is chosen to be the inlet displacement thickness and U to be the free-stream velocity. Non-dimensional variables u i
and p represent velocity in the i-direction and pressure, respectively. The present solver uses a three-dimensional staggered
structured grid topology in which the velocity components are stored at cell faces, and the pressure values at the center of
each cell.
The flow configuration of interest corresponds to the one for simulation of flow over a flat plate and is modeled on a
rectangular box having dimensions L x × L y × L z (Fig. 1). Uniform grid spacings are employed in the streamwise x1 - and
spanwise x3 -directions, and clustered grids near the wall in the wall-normal x2 -direction. The domain is discretized into
N X × N Y × N Z cells along x1 , x2 and x3 directions, respectively.
The above equations are solved by a semi-implicit fractional-step method in which the convection terms of the momentum equation (Eq. (2)) are integrated explicitly in time using a third-order Runge–Kutta scheme, while the viscous terms are integrated implicitly using the Crank–Nicolson scheme [8]. Spatial discretization is performed using second-order central-difference schemes. The implicit coupling between Eqs. (1) and (2) results in a Poisson equation whose source term originates from the intermediate velocity of the momentum equation. As a result, the fractional-step method requires the solution of the following discretized equations:
\frac{\hat{u}_i^m - u_i^{m-1}}{\Delta t} = \alpha_m (\hat{L}_i^m + L_i^{m-1}) - \gamma_m N_i^{m-1} - \rho_m N_i^{m-2} - 2\alpha_m \frac{\delta p^{m-1}}{\delta x_i}, \qquad (3)

\frac{\delta}{\delta x_i} \frac{\delta \phi^m}{\delta x_i} = \frac{1}{2\alpha_m \Delta t} \frac{\delta \hat{u}_i^m}{\delta x_i}, \qquad (4)

u_i^m = \hat{u}_i^m - 2\alpha_m \Delta t \frac{\delta \phi^m}{\delta x_i}, \qquad (5)

p^m = p^{m-1} + \phi^m - \frac{\alpha_m \Delta t}{Re_\delta} \frac{\delta}{\delta x_i} \frac{\delta \phi^m}{\delta x_i} \qquad (6)

where the superscript m indicates the sub-step index, and terms with a hat (ˆ) represent variables at the intermediate sub-step before projection. Δt is the sub-step size; α_m, γ_m, ρ_m are integration coefficients [15]; L_i and N_i represent discretizations of the linear viscous term and the nonlinear convection term, respectively: L_i = \frac{1}{Re_\delta}\frac{\delta}{\delta x_j}\frac{\delta u_i}{\delta x_j} and N_i = \frac{\delta}{\delta x_j}(u_i u_j), where δ/δx is a discrete finite-difference operator. Writing out Eq. (3) for the intermediate velocity \hat{u}_i^m gives:

\left[ 1 - \Delta t\, \alpha_m \frac{1}{Re_\delta} \left( \frac{\delta^2}{\delta x_1^2} + \frac{\delta^2}{\delta x_2^2} + \frac{\delta^2}{\delta x_3^2} \right) \right] \hat{u}_i^m
= \left[ 1 + \Delta t\, \alpha_m \frac{1}{Re_\delta} \left( \frac{\delta^2}{\delta x_1^2} + \frac{\delta^2}{\delta x_2^2} + \frac{\delta^2}{\delta x_3^2} \right) \right] u_i^{m-1} - \Delta t\, \gamma_m N_i^{m-1} - \Delta t\, \rho_m N_i^{m-2} - 2\alpha_m \Delta t \frac{\delta p^{m-1}}{\delta x_i}, \qquad (7)

which would produce a hepta-diagonal matrix on the left-hand side.
Alternatively, one efficient scheme for solving Eq. (7) for the intermediate velocity \hat{u}_i^m is the ADI method, which approximates the equation in a factorized form:

(1 - \delta_1)(1 - \delta_2)(1 - \delta_3)\, \hat{u}_i^m = (1 + \delta_1 + \delta_2 + \delta_3)\, u_i^{m-1} - \Delta t\, \gamma_m N_i^{m-1} - \Delta t\, \rho_m N_i^{m-2} - 2\alpha_m \Delta t \frac{\delta p^{m-1}}{\delta x_i}, \qquad (8)

where \delta_1 = \Delta t\, \alpha_m \frac{1}{Re_\delta} \frac{\delta^2}{\delta x_1^2}, \; \delta_2 = \Delta t\, \alpha_m \frac{1}{Re_\delta} \frac{\delta^2}{\delta x_2^2}, \; \delta_3 = \Delta t\, \alpha_m \frac{1}{Re_\delta} \frac{\delta^2}{\delta x_3^2}. This can be written in matrix form as follows:

(I - A_3)\, B = R_i^m,
(I - A_1)\, C = B^{T_{xz}}, \qquad (9)
(I - A_2)\, \hat{u}_i^m = C^{T_{xy}},

where R_i^m is an N_Z × N_Y × N_X matrix corresponding to the right-hand side of Eq. (8), and B, C indicate N_Z × N_X × N_Y and N_X × N_Y × N_Z matrices, respectively, for storing intermediate solutions. (·)^{T_{xz}} and (·)^{T_{xy}} each represent a three-dimensional transpose with respect to the x, z and x, y directions. Note that for each velocity component \hat{u}_i^m (i = 1, 2, 3), N_X × N_X-sized, N_Y × N_Y-sized and N_Z × N_Z-sized matrices are inverted at N_Y N_Z, N_X N_Z and N_X N_Y points, respectively. Note further that the periodic boundary condition along the spanwise x_3-direction produces periodic tridiagonal matrices whose elements at the upper-right and lower-left corners are non-zero.
Tridiagonal matrices are advantageous over other matrices in that they can easily be inverted using the Thomas algorithm, a Gaussian elimination requiring O(n) operations. However, the algorithm is inherently serial and difficult to parallelize, especially when one is aiming for fine-grain parallelism, which is the case for GPUs. Details regarding tridiagonal matrices will be discussed further in the next section.
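For reference, a minimal serial Thomas solver for one tridiagonal system is sketched below (plain Fortran; array names are illustrative and this is only a sketch, not the production routine of the present solver). Each forward-elimination step needs the result of the previous row and each back-substitution step needs the next row, which is exactly the dependence that blocks fine-grain parallelization within a single system.

subroutine thomas(n, a, b, c, d)
  ! a: sub-diagonal, b: main diagonal, c: super-diagonal,
  ! d: right-hand side on entry, solution on exit (b and d are overwritten).
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a(n), c(n)
  real(8), intent(inout) :: b(n), d(n)
  integer :: i
  real(8) :: w
  do i = 2, n                       ! forward elimination: row i uses row i-1
     w    = a(i) / b(i-1)
     b(i) = b(i) - w * c(i-1)
     d(i) = d(i) - w * d(i-1)
  end do
  d(n) = d(n) / b(n)
  do i = n-1, 1, -1                 ! back substitution: row i uses row i+1
     d(i) = (d(i) - c(i) * d(i+1)) / b(i)
  end do
end subroutine thomas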
The Poisson equation (Eq. (4)) is solved using a direct solution method after Fourier transform to take advantage of
uniform grid spacings in the x1 -direction and in the periodic x3 -direction. The equation is solved by first performing a half-
range cosine transform in the x1 -direction, followed by a Fourier transform in the x3 -direction. Finally, multiple tridiagonal
matrices are inverted along the x2 direction, after which the scalar φ is obtained via inverse transforms in the x3 - and
x1 -directions. Along with the resulting tridiagonal matrices, the main advantage of this method is that one can utilize the
FFT algorithm for which optimized libraries exist (e.g. FFTW library [7]).
A no-slip condition is imposed at the bottom wall at x2 = 0, and a stress-free condition at the top boundary (Fig. 1).
Convective boundary conditions are applied at the outlet. Turbulent inflow which has previously been computed from a
separate code using a recycling method [16] is imposed at each time-step.

3. GPU implementation

The main objective of the present study is to achieve fine-grain parallelism in the semi-implicit fractional-step method, which is not feasible in codes based on multi-core CPUs. To this end, the Navier–Stokes solver is expressed in a form that suits the GPU architecture well, using a programming model named CUDA. Unlike CPU programming, in which hardware details are hidden from the user, the CUDA programming model allows the programmer to have explicit control of the GPU memory. Thus, for an efficient implementation, the programmer must be aware of the GPU memory model in order to identify the right performance bottlenecks and fully exploit the available resources.

3.1. GPU architecture and CUDA programming

The memory model of an NVIDIA GPU built on the Kepler architecture is depicted in Fig. 2. It consists of 15 multipro-
cessors (SMX), each of which contains 192 single-precision (SP) CUDA cores, 64 double-precision (DP) units and 32 special
function units, all of which perform integer or floating-point instructions. For the present study, all floating-point data are

Fig. 2. Simplified diagram illustrating the memory hierarchy of a Kepler GPU. Note that only double-precision units are displayed for brevity.

Table 2
Comparison of two different GPUs used in the present study.

                         Tesla K40c       Tesla P100
Architecture             Kepler           Pascal
SP cores                 2880             3584
DP cores                 960              1792
Memory size              12 GB            16 GB
Core frequency           875 MHz (a)      1480 MHz (a)
Memory bandwidth         288 GB/s         732 GB/s
DP peak throughput       1.68 TFlops      5.30 TFlops
Multiprocessors (MP)     15               56
Performance per Watt     7.16 GFlops/W    17.7 GFlops/W

(a) Boost clock frequencies.

computed at DP accuracy, so only DP cores are shown in Fig. 2. Below the boxes denoted as DP are various boxes indi-
cating different types of memories, which are drawn in the order of access speed. Memories farther away from DP cores
have slower access speeds, the fastest resources being registers, and the slowest being the global memory. In addition, each
multiprocessor contains a fast on-chip memory consisting of shared memory (SMEM) and L1 cache, a read-only space, and
an L2 cache shared by all multiprocessors.
Note that Kepler GPUs have a 3:1 ratio of SP to DP cores. Since most operations in computational fluid dynamics (CFD)
require DP accuracy, this ratio indicates that only part of the chip would be utilized. This aspect has been improved in
modern GPUs built on the Pascal architecture which features a 2:1 ratio of SP to DP cores (Table 2). In addition, GPUs of the
next-generation architecture codenamed Volta are also reported to have a 2:1 ratio of SP to DP cores [22]. Thus although
GPU underutilization is inevitable in computational physics, this problem has been alleviated in modern GPUs.
There are two major reasons why managing different types of memories is a critical part of CUDA programming. Firstly,
there is generally a large gap between memory bandwidth and computational throughput. For example, Table 2 shows the gap between these two parameters for the Tesla K40c: 288 GB/s versus 1.68 TFlops. Although the modern Tesla P100 GPU has a much higher memory bandwidth of 732 GB/s, it still exhibits a significant discrepancy compared to its peak throughput of 5.3 TFlops. To compensate for this difference, it is therefore important to maximize the use of the faster on-chip memories.
Secondly, there is a conflicting relationship between the amount of on-chip memory used and the amount of parallel
threads executed. Ideally speaking, each multiprocessor may run up to 2048 parallel threads, so 15 × 2048 = 30720 threads
can be run in parallel. However, parallelism may be limited to fewer threads when shared memories or registers are over-
utilized. Finding the balance between these two aspects is particularly important in the present semi-implicit fractional-step
method. Note that the performance of the present method is bound by memory-bandwidth; it entails frequent loads/stores
of data without repetitive computation on the same data set. The use of on-chip memory may resolve this issue but it must
be used with care, so as not to harm parallelism.
In the following, several aspects of CUDA programming related to implementing the present method are introduced.

Fig. 3. Example of a coalesced access to a column-major matrix of the size 32 × 32. The arrow indicates the order of access of a thread. For example while
the first thread accesses element 1, the second thread accesses element 2 in parallel, the third thread accessing element 3 and so on, which satisfies the
condition that all threads in a warp access a contiguous chunk of memory.

Coalesced global memory access: Global memory is the starting point from which every kernel fetches data for computation. Although it is the most commonly used space on a GPU, it is characterized by high latency, which is why it must be used with care. Reads from and writes to global memory should therefore follow an efficient pattern called coalesced access. Memory access is said to be coalesced when a group of 32 threads, called a warp, accesses a contiguous chunk of memory. For example, memory access to a Fortran matrix of size 32 × 32 is coalesced if thread indices are mapped onto the first index of the matrix, owing to the column-major order used in Fortran (Fig. 3). The importance of coalesced global memory access has been stressed in several CFD papers [3,11,25], and its effect is discussed in textbooks on CUDA [5,14,24]. In particular, note that the present code is bandwidth-bound, which makes this aspect essential.
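As a minimal sketch (illustrative names, not the kernels of the present solver), a CUDA Fortran kernel that maps threadIdx%x onto the first, fastest-varying index of a column-major array lets each warp touch a contiguous chunk of global memory:

module coalesced_example
  use cudafor
contains
  attributes(global) subroutine scale_field(u, nx, ny, nz, factor)
    ! Consecutive threads of a warp (consecutive threadIdx%x) access
    ! u(i,j,k), u(i+1,j,k), ... which are contiguous in memory because
    ! i is the fastest-varying index in Fortran (coalesced access).
    implicit none
    integer, value :: nx, ny, nz
    real(8), value :: factor
    real(8) :: u(nx, ny, nz)
    integer :: i, j, k
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
    k = blockIdx%z
    if (i <= nx .and. j <= ny .and. k <= nz) u(i, j, k) = factor * u(i, j, k)
  end subroutine scale_field
end module coalesced_example

Launched with, for example, thread blocks of dim3(32, 8, 1), every warp of this kernel stays within one contiguous 32-element segment of the array.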
Shared memory: Due to the slow access speed of global memory, it is often necessary to make use of the shared memory,
which is an on-chip memory having higher access speed. Shared memory is a scratch-pad memory, similar to cache in
CPUs but different in that it is manageable by a programmer. It is shared by all threads in a block, so it is often used for
cooperation among threads. Recall that the present implementation utilizes an ADI method, which requires access of data
in three different directions. In order to access them in a coalesced manner, matrices must be transposed for which shared
memory plays a significant role. Furthermore when computing the Courant–Friedrichs–Lewy (CFL) number or the sum of
flow rate for the convective outlet, the tree-reduction algorithm is used for which shared memory becomes particularly
useful for thread-cooperation. Details on implementation regarding matrix transpose and tree-reduction can be found in
Ruetsch et al. [24].
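A sketch of the standard tiled transpose (placed in a module that uses cudafor, as in the previous sketch, and assuming 32 × 32 thread blocks; names are illustrative) shows how shared memory keeps both the global read and the global write coalesced while only the on-chip tile is accessed with a stride:

attributes(global) subroutine transpose_xy(odata, idata, nx, ny)
  ! Stage a 32x32 tile of idata in shared memory (padded to 33 columns to
  ! avoid bank conflicts), then write the transposed tile to odata. Both
  ! global accesses are coalesced along threadIdx%x.
  implicit none
  integer, value :: nx, ny
  real(8) :: idata(nx, ny), odata(ny, nx)
  real(8), shared :: tile(32, 33)
  integer :: i, j
  i = (blockIdx%x - 1) * 32 + threadIdx%x
  j = (blockIdx%y - 1) * 32 + threadIdx%y
  if (i <= nx .and. j <= ny) tile(threadIdx%x, threadIdx%y) = idata(i, j)
  call syncthreads()
  i = (blockIdx%y - 1) * 32 + threadIdx%x      ! block indices are swapped
  j = (blockIdx%x - 1) * 32 + threadIdx%y
  if (i <= ny .and. j <= nx) odata(i, j) = tile(threadIdx%y, threadIdx%x)
end subroutine transpose_xy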
Register usage: Controlling the amount of registers is also important when a kernel is bound by register usage. In order
to minimize access to the global memory, each kernel should perform as many calculations as possible before sending
the result back to the global memory. Storing many intermediate results may sometimes lead to writing large kernels,
which may inevitably use a lot of registers. However, each multiprocessor has 65536 registers available which are evenly
distributed among threads. If for example a kernel were executed with each block containing 512 threads, each of which
used 32 registers, then each multiprocessor could hold 65536/(512 × 32) = 4 blocks at a time, but if a kernel used 48
registers, then the number of thread-blocks in each multiprocessor would be limited to 65536/(512 × 48) = 2 blocks.
Therefore, abusing registers restricts the number of concurrently executing threads. The number of registers can manually
be restricted, but in that case the excess amount of registers is spilled into a space called local memory, which resides in
the slower DRAM. Thus one needs to find the right balance between two conflicting factors – the number of concurrent
blocks versus the amount of spilled registers per thread – when designing a kernel that is bound by register usage.
CUDA streams: In CUDA programming, it is possible to execute multiple kernels and data transfers concurrently by using
CUDA streams. A CUDA stream can be thought of as a queue to which tasks are added and kept in order before execution.
Operations in a stream are executed in the order they were first added. The key advantage is that when multiple streams are
generated, asynchronous operations in different streams can be run in parallel. Using the feature, one can overlap CPU–GPU
data transfer with computations running on CPU and GPU. An example of the use of CUDA streams to achieve concurrency
is shown in Fig. 8 which will be discussed in later sections.
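The fragment below sketches this overlap pattern in CUDA Fortran (kernel and variable names are placeholders rather than the solver's actual routines, and the host array is assumed to have been allocated with the pinned attribute so that the copy can truly proceed asynchronously):

subroutine overlap_copy_and_compute(u_dev, u_host, n)
  use cudafor
  implicit none
  integer, intent(in) :: n
  real(8), device, intent(in)    :: u_dev(n)
  real(8),         intent(inout) :: u_host(n)     ! should be pinned host memory
  integer(kind=cuda_stream_kind) :: NSstream, D2Hstream
  integer :: istat
  istat = cudaStreamCreate(NSstream)
  istat = cudaStreamCreate(D2Hstream)
  ! Queue the device-to-host copy of the previous step on one stream ...
  istat = cudaMemcpyAsync(u_host, u_dev, n, cudaMemcpyDeviceToHost, D2Hstream)
  ! ... while kernels of the current step are launched on the other stream:
  ! call rhs_kernel<<<grid, block, 0, NSstream>>>(...)     (placeholder)
  istat = cudaStreamSynchronize(D2Hstream)  ! copy done; the CPU may now use u_host
  istat = cudaStreamDestroy(NSstream)
  istat = cudaStreamDestroy(D2Hstream)
end subroutine overlap_copy_and_compute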

3.2. Structure of the code

A brief summary of the code structure is given to provide a general idea of the program flow. It is illustrated in the
flow-chart, Fig. 4. The upper and lower portion of the figure show computation running on the host (CPU) and the device
(GPU), respectively. Data transfers between the two are positioned in between. At the initialization step, variables related
to the geometry, mesh and the initial field are configured on the host. Such computations prior to time advancement are
conducted on the host, because not only do they occur once throughout the code, but most of them have small workloads
which cannot provide enough parallelism to keep GPU resources busy. These data are then copied to GPU, and the solver is
ready to start time advancement, which is marked by a box surrounding its components.
Before advancing the main computation on GPU, inflow data is read and copied to the GPU. This operation is done asyn-
chronously, meaning that after executing data transfer, control is immediately returned to the host so that other operations
can be done by the host in the meantime. Here, the CFL number and the sub-step size Δt are computed while inflow data
is transferred from the host. Asynchronous data transfer is introduced again for copying flow variables u i of the previous
time-step. These variables are required for calculating quantities of turbulence statistics such as mean velocity or Reynolds

Fig. 4. Flow-chart that summarizes the code structure.

Fig. 5. Average computation time taken to process three major sections of a sub-step using a single-core Xeon E5-2660 v3 @2.6 GHz CPU, displayed in (a)
bars given in seconds and (b) a pie chart corresponding to 134 million grids showing their relative importance.

stress. While flow variables are being copied, the main Navier–Stokes solver starts to compute three sub-steps, each consist-
ing of (i) the right-hand side (RHS) of momentum equations, (ii) the ADI solver and (iii) the Poisson equation. Note also that
the present code is designed to perform major computations on GPU while turbulence statistics are calculated concurrently
on CPU to hide any latency arising from computing the statistics. Reasons as to why statistics are computed on CPU will be
discussed shortly.
To identify the major bottleneck of the code, computation time taken for the three major parts of the Navier–Stokes
solver (i)–(iii) is measured on a single-core CPU and shown in Fig. 5. It is clear that the ADI solver is taking up the majority
of the computation time. Ways to implement the three parts on GPU are presented next.

3.3. Implementing the RHS of momentum equations

It can be observed that computation of the RHS of Eq. (8) involves many arithmetic operations arising from finite dif-
ferences in space at time-steps m − 1 and m − 2. The computation can be expressed as a triply nested loop which exposes
ample parallelism. Such a loop can easily be coded as CUDA kernels. The simplest way to map the RHS into CUDA ker-
nels is to use CUDA Fortran compiler directives which resemble OpenMP programming. This method instructs the compiler
to automatically generate asynchronous kernels from the host code containing tightly nested loops. The advantage of this
method is that it saves considerable amount of work needed to write out trivial kernels by simply writing a clause in
front of a nested loop. For example, writing !$cuf kernel do(3) <<< ∗, ∗ >>> in front of a triply nested loop will
automatically generate a kernel with block/thread sizes computed at compile-time.
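For illustration, such a directive-generated kernel could look as follows (the loop body is a schematic placeholder, not the actual right-hand side of Eq. (8)):

subroutine build_rhs(rhs, conv, visc, nx, ny, nz, dt)
  use cudafor
  implicit none
  integer, intent(in) :: nx, ny, nz
  real(8), intent(in) :: dt
  real(8), device :: rhs(nx, ny, nz), conv(nx, ny, nz), visc(nx, ny, nz)
  integer :: i, j, k
  ! The directive turns the tightly nested loop into an asynchronous GPU
  ! kernel; <<< *, * >>> lets the compiler choose the launch configuration.
  !$cuf kernel do(3) <<< *, * >>>
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           rhs(i, j, k) = -dt * conv(i, j, k) + visc(i, j, k)
        end do
     end do
  end do
end subroutine build_rhs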
Automatic kernel generation methods are indeed efficient tools, yet their performance has been shown to fall behind that of
native implementations [19,23]. Instead of using compiler directives, one may transform the triply nested loop into a kernel
by mapping loop indices onto thread and block indices, so as to satisfy coalesced global memory access. However, this
alone does not produce a noticeable difference from automatically generated kernels. This is because the kernel uses a large
amount of registers which restricts the multiprocessor from holding sufficient amount of thread-blocks. In CUDA Fortran,
the maximum number of registers per thread can be controlled by using a compiler flag -Mcuda=maxregcount:N R . If

Fig. 6. Effect of register count per thread on occupancy. The figure is obtained from the NVIDIA CUDA GPU Occupancy Calculator.

more than N R registers are about to be used, excess amount is spilled into the local memory as explained before. Choosing
the value N R is a balance between two conflicting factors: high N R may restrict the number of concurrent threads (lower
occupancy), while low N R may result in excessive spilling, which forces the kernel to use too much local memory. By profiling
the program using nvprof, the optimal value of N R in Tesla K40c has been determined as N R = 64 registers per thread.
The two conflicting factors when choosing the upper bound of registers are shown in Fig. 6.
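For reference, the cap used here would be requested at compile time with a command of the form pgfortran -O3 -Mcuda=maxregcount:64 solver.f90 (the source file name is illustrative).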
The RHS kernel can further be optimized by managing the 64 kB on-chip memory, SMEM and L1 cache, which share the
space together. A programmer can flexibly specify the size of each memory to either 48 kB SMEM with 16 kB L1, equally
32 kB, or 16 kB SMEM with 48 kB L1. Local memory stores are cached in L1, so increasing the size of L1 to 48 kB increases
the likelihood of cache hits. Thus before executing the RHS kernel, it is advised to use a cache configuration preferring L1.
The amount of local memory usage can also be reduced by moving locally defined temporary variables of the kernel into shared memory. Since the on-chip memory is now configured to be L1-oriented, only a few registers can be relieved in this manner. Nevertheless, a reduction of one or two registers via shared memory can further reduce the amount of register
spills. Improvements in spill loads/stores can be identified by adding a compiler flag: -Mcuda=ptxinfo.
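In CUDA Fortran, the L1-preferring configuration can be requested through the runtime API before launching the kernel; a minimal sketch is shown below (cudaFuncSetCacheConfig offers a per-kernel alternative to this device-wide call):

integer :: istat
! Prefer the 48 kB L1 / 16 kB shared-memory split so that local-memory
! (register-spill) traffic is more likely to hit in the L1 cache.
istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)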

3.4. ADI solver using multi-level parallelism

The ADI method for solving Eq. (9) is the most important target of optimization for the present method. At each sub-step, the ADI method requires a total of six inversions of general TDMAs and three inversions of periodic TDMAs, all of which consume a significant amount of time in the Thomas algorithm. Application of this algorithm to GPUs is difficult, because parallelism is exposed only at the coarse-grain matrix level.
As an alternative, a hybrid algorithm proposed by Zhang et al. [33], which combines cyclic reduction (CR) with parallel cyclic reduction (PCR), is used. The advantage of this algorithm is that not only can we utilize the memory hierarchy of a GPU by mapping equations to threads and systems to blocks, but the hybrid method also allows PCR to complement the lack of parallelism arising in CR [4,33]. Furthermore, there is no need to implement the algorithm, since it is provided in
the cuSPARSE library. The present study adopts a function named cusparseDgtsvStridedBatch which uses the hybrid
method to solve multiple systems of double-precision, real-valued TDMAs without pivoting. The complexity of the algorithm
is hidden at the low-level, exposing only a straightforward interface for users:

cusparseDgtsvStridedBatch (handle,m,a,b,c,d,batch,stride)

where handle is the cuSPARSE context; m is the size of the matrix; a,b,c,d are the four vectors corresponding to the
tridiagonal linear system; batch is the number of matrices to be inverted and stride is the length of separation between
each system.
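A call through the cuSPARSE Fortran interface might look as follows (a sketch assuming the cusparse module shipped with the PGI compiler; in the actual solver the handle would be created once during initialization rather than per call, and dl, dm, du, x are device arrays laid out with the given stride):

subroutine batched_tdma_solve(dl, dm, du, x, m, batch, stride)
  use cudafor
  use cusparse
  implicit none
  integer, intent(in) :: m, batch, stride
  real(8), device :: dl(batch*stride), dm(batch*stride), du(batch*stride)
  real(8), device :: x(batch*stride)    ! right-hand sides, overwritten by the solutions
  type(cusparseHandle) :: handle
  integer :: istat
  istat = cusparseCreate(handle)
  istat = cusparseDgtsvStridedBatch(handle, m, dl, dm, du, x, batch, stride)
  istat = cudaDeviceSynchronize()
  istat = cusparseDestroy(handle)
end subroutine batched_tdma_solve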
At this point, two questions may arise regarding its usage:

1. How to invert periodic TDMAs, and
2. How to increase parallelism with the given high-level interface.

The first question asks how the cuSPARSE function, originally intended for inversion of the usual TDMAs can be utilized to
invert TDMAs with slight perturbations, which arise from the periodic boundary condition. The second question asks how
one can assign maximum amount of work to make GPU resources as busy as possible. Strategies for each question are given
next.

Inversion of periodic TDMAs: To answer the first question, consider a periodic TDMA which has additional elements at the upper-right and lower-left corners along with the three bands. These elements are stored in the first and the last index of the lower (a) and upper (c) diagonals, respectively:

P = \begin{pmatrix}
b_1     & c_1    &        &           &           & a_1       \\
a_2     & b_2    & c_2    &           &           &           \\
        & \ddots & \ddots & \ddots    &           &           \\
        &        & \ddots & \ddots    & \ddots    &           \\
        &        &        & a_{N_Z-1} & b_{N_Z-1} & c_{N_Z-1} \\
c_{N_Z} &        &        &           & a_{N_Z}   & b_{N_Z}
\end{pmatrix}
Instead of using expensive Gaussian elimination, which requires O(n^3) operations, the Sherman–Morrison algorithm is used to convert the problem into inverting two TDMAs [32]. In essence, the problem of solving a periodic tridiagonal linear system P X = d can be converted into
(i) inverting two TDMAs of the form

A X_1 = d,
A X_2 = f,

(ii) calculating a scalar σ, and thereby X:

\sigma = \frac{g^T X_1}{1 + g^T X_2},
X = X_1 - \sigma X_2,

where A is an invertible TDMA without the additional corner elements, and f, g are n × 1 vectors of the form

f^T = [a_1, 0, 0, \ldots, 0, c_{N_Z}],
g^T = [1, 0, 0, \ldots, 0, 1].

Now that there are two TDMAs for X_1 and X_2, they can similarly be inverted using cusparseDgtsvStridedBatch. In order to make this operation more efficient, the a, b, c, d diagonals of the two linear systems are stored together in ã, b̃, c̃, d̃ to form the block matrix shown in Eq. (10). This method (Algorithm 2) not only allows the function cusparseDgtsvStridedBatch to be executed only once, which saves memory transfers, but also solves a larger system, providing more work for the GPU.
\begin{pmatrix}
\beta_1 & c_1     &         &             &         &         &         &             \\
a_2     & \beta_2 & c_2     &             &         &         &         &             \\
        & \ddots  & \ddots  & \ddots      &         &         &         &             \\
        &         & a_{N_Z} & \beta_{N_Z} & 0       &         &         &             \\
        &         &         & 0           & \beta_1 & c_1     &         &             \\
        &         &         &             & a_2     & \beta_2 & c_2     &             \\
        &         &         &             &         & \ddots  & \ddots  & \ddots      \\
        &         &         &             &         &         & a_{N_Z} & \beta_{N_Z}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_{N_Z-1} \\ d_{N_Z} \\ a_1 \\ 0 \\ \vdots \\ 0 \\ c_{N_Z} \end{pmatrix} \qquad (10)
Multi-level parallelism approach: To answer the second question, consider Fig. 7 which shows the possible level of par-
allelism achievable in the ADI solver. 1-level parallelism corresponds to the Thomas algorithm which has a single level of
parallelism at the matrix-level. 2-level parallelism can be observed in the hybrid CR+PCR algorithm. In 2-level parallelism,
the first level is the equation-level at which equations of a matrix are eliminated in parallel using reduction operations,
while the second level is the matrix-level at which multiple TDMAs of one coordinate direction are inverted in parallel. In
Algorithm 1, 2-level parallelism is achieved via cusparseDgtsvStridedBatch to invert multiple matrices concurrently
using higher number of threads compared to the Thomas algorithm.
This algorithm can be improved by increasing parallelism to three levels. In 3-level parallelism, the first and second
levels are identical to those in 2-level parallelism. The third level is the velocity-level at which multiple TDMAs of each u i
are inverted together in parallel. For example when N Y number of tridiagonal matrices ( I − A 1 ) in Eq. (9) are to be inverted,
the 3-level parallelism inverts N Y × 3 number of matrices in parallel for all u 1 -, u 2 -, u 3 -momentum equations. As a result,
the method of forming block matrices as in Eq. (10) is used again to generate a larger linear system of the size N X × N Y × 3.

Fig. 7. Illustration of multiple levels of parallelism achievable in the ADI solver. (a) 1-level parallelism corresponds to the Thomas algorithm; (b) replacing
Thomas algorithm into the hybrid CR+PCR algorithm facilitates a 2-level parallelism; (c) and (d) following the methods described in section 3.4 can lead up
to 4-level parallelism.

Algorithm 1: ADI solver with 2-level parallelism.


input : u, rhsu
temp : a, b, c, d, d T
for j = 1 to N Y do
d = transpose(rhsu (:, j , :))
ConfigureDiagonals(a, b, c )
d = InvertPeriodicTDMAs(a, b, c , d ) ! refer to Algorithm 2.
rhsu (:, j , :) = transpose(d )
end
for k = 1 to N Z do
ConfigureDiagonals(a, b, c )
d = rhsu (:, :, k)
ApplyBC(d, k )
d = cusparseDgtsvStridedBatch(a, b, c , d ): size N X , batch N Y
ConfigureDiagonals(a, b, c )
d T = transpose(d )
ApplyBC(d T , k )
d T = cusparseDgtsvStridedBatch(a, b, c , d T ): size N Y , batch N X
u (:, :, k) = u (:, :, k) + d T
end
output: u

Parallelism can further be increased up to four levels. On top of the three levels above, an additional level can be used to maximize the workload for the GPU and minimize dynamic allocation of the temporary storage used in cusparseDgtsvStridedBatch. At the fourth level, chunks of N_Y × 3 matrices are solved for multiple values of k (Fig. 7d), where k indicates the index of the homogeneous x_3-direction ranging from 1 to N_Z. Up to 3-level parallelism, chunks of N_Y × 3 matrices are configured for each k = 1, · · · , N_Z. In 4-level parallelism, however, N_X × 3 × N_Z matrices are inverted in parallel, leading to much higher performance.
However, 4-level parallelism has issues regarding memory capacity. As the level of parallelism increases, larger diagonals
are required, which leads to higher memory usage. In addition, the cusparseDgtsvStridedBatch function uses a
significant amount of temporary storage [20], which is written in bytes as

batch × (4 × m + 2048) × 8, (11)



Algorithm 2: Inversion of periodic TDMAs with 2-level parallelism.


input : a, b, c, d
temp : β , ã, b̃, c̃, d̃
for i = 1 to N X do
f (1, i ) = a(1, i ) ! upper-right corner element
f (N Z , i) = c(N Z , i) ! lower-left corner element
a(1, i ) = 0.
c ( N Z , i ) = 0.
for k = 2 to N Z do
f (k, i ) = 0.
end
end
Compute β = b − f
for i = 1 to N X do
for k = 1 to N Z do
ã(k, i ) = a(k, i ); ã(k + N Z , i ) = a(k, i )
b̃(k, i ) = β(k, i ); b̃(k + N Z , i ) = β(k, i )
c̃ (k, i ) = c (k, i ); c̃ (k + N Z , i ) = c (k, i )
d̃(k, i ) = d(k, i ); d̃(k + N Z , i ) = f (k, i )
end
end
d̃ = cusparseDgtsvStridedBatch(ã, b̃, c̃ , d̃ ): size N Z , batch 2N X
for i = 1 to N X do
σ = ( d̃(1, i) + d̃(N_Z, i) ) / ( 1 + d̃(1 + N_Z, i) + d̃(2N_Z, i) )
for k = 1 to N Z do
d(k, i ) = d̃(k, i ) − σ d̃(k + N Z , i )
end
end
output: d

where m indicates the matrix size and batch the number of matrices. Memory size is usually restricted to 12 GB in Kepler GPUs and 16 GB in Pascal GPUs. Therefore, the fourth level would not be able to exploit the full N_Z levels of additional parallelism if the problem size were large. For example, in a problem having 1024 × 256 × 128 grid points, using the entire N_Z = 128 amount of parallelism in the fourth level would require about 13 GB of memory, which exceeds the capacity of Kepler GPUs.
Thus a parameter κ is introduced, which determines how many k’s out of N Z can be used for the fourth level within the
memory limit of a GPU. In the following, κ will be determined based on the knowledge about memory usage of the present
algorithm. The k-loop (in Algorithm 1) of the ADI solver requires a, b, c , d, d T diagonals, each having the size N X × N Y . Each
of these diagonals exists for each u 1 , u 2 , u 3 , so the total amount of space required for the diagonals in bytes is

( N X × N Y ) × 5 × 3 × κ × 8 = 120κ N X N Y . (12)
On the other hand, the j-loop (in Algorithm 1) requires four diagonals of the size N Z and a d diagonal of the size N X × N Z .
The Sherman–Morrison algorithm requires additional temporary space for diagonals ã , b̃, c̃ , d̃ each of the size N X × N Z × 2.
Each of these diagonals exists for each u 1 , u 2 and u 3 . Thus the required memory for periodic matrix inversion in bytes is

(4N Z + N X N Z + (2N X N Z × 4)) × 3 × 8 ≈ 216N X N Z . (13)


Let m′ be the number of matrices to be solved in the second level of parallelism when the diagonals are of the size m. For
example, if we were to invert matrices in the x2 -direction i.e., invert ( I − A 2 ), then m = N Y and m′ = N X . Then according to
equation (11), memory required for the cuSPARSE function in bytes is

(m′ × 3 × κ ) × (4 × m + 2048) × 8 = (96N X N Y + 49152m′ )κ (14)


Finally, let the amount of unused space left in global memory before starting the ADI solver be free, which can be
obtained at run-time using a CUDA API named cudaMemGetInfo. Then adding (12), (13), (14) gives the total amount of
space required for the ADI solver, and this value must not exceed free:

(120N X N Y + 96N X N Y + 49152m′ )κ + 216N X N Z ≤ free


Hence, the maximum value of κ permitted within the available space in global memory can be computed as
free − 216N X N Z
κ = f loor ( ) (15)
216N X N Y + 49152m′
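As a concrete check of Eq. (15), take the 1024 × 256 × 128 example quoted earlier with m′ = N_X = 1024 (the x_2-direction sweep): the per-κ cost 216 N_X N_Y + 49152 m′ ≈ 5.7 × 10^7 + 5.0 × 10^7 ≈ 1.1 × 10^8 bytes, so the full κ = N_Z = 128 would demand roughly 1.4 × 10^10 bytes, consistent with the ∼13 GB figure given above and beyond the 12 GB of a Kepler GPU; Eq. (15) instead caps κ at whatever fits into the reported free memory.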
Using this simple formula calculated beforehand, 4-level parallelism can be achieved regardless of the size of global memory
provided by the hardware. A pseudo-code for 4-level parallelism is written in Algorithm 3. Note that when compared to
Algorithm 1, the value of batch has increased from N Y to 3κ N Y and N X to 3κ N X , showing 3-level parallelism.

Algorithm 3: ADI solver with 4-level parallelism.


! Compute κ before starting the ADI solver.
status = cudaMemGetInfo(free,total)
κ X = (free − 216 N X N Z )/(216 N X N Y + 49152 N Y )
κ Y = (free − 216 N X N Z )/(216 N X N Y + 49152 N X )
κ = min(κ X , κY )
! Next call the ADI solver.
input : u 1 , u 2 , u 3 , rhsu, rhsv, rhsw, κ
temp : a, b, c, d, d T each of the size 3κ N X N Y
.
.
.
kf in = N Z /κ
for k = 1 to kf in do
ConfigureDiagonals(a, b, c )
d(1 : κ N X N Y ) = rhsu (1 + κ N X N Y (k − 1) : κ N X N Y k)
d(1 + κ N X N Y : 2κ N X N Y ) = rhsv (1 + κ N X N Y (k − 1) : κ N X N Y k)
d(1 + 2κ N X N Y : 3κ N X N Y ) = rhsw (1 + κ N X N Y (k − 1) : κ N X N Y k)
ApplyBC(d, k ) for each u 1 , u 2 , u 3
d = cusparseDgtsvStridedBatch(a, b, c , d ): size N X , batch 3κ N Y
ConfigureDiagonals(a, b, c )
d T = transpose(d )
ApplyBC(d T , k ) for each u 1 , u 2 , u 3
d T = cusparseDgtsvStridedBatch(a, b, c , d T ): size N Y , batch 3κ N X
u i (:, :, 1 + κ (k − 1) : kκ ) = u i (:, :, 1 + κ (k − 1) : kκ ) + d T
end
output: u 1 , u 2 , u 3

3.5. Fourier-transform-based direct method

The Poisson equation (Eq. (4)) requires a half-range cosine transform in the x1 -direction and a Fourier transform in
the x3 -direction, both of which can be computed using FFT (fast Fourier transform). After applying Fourier transform in
the two directions, we are left with the second derivative operator in the x2 -direction. The second-order central-difference
discretization results in multiple TDMAs whose inversion is similarly the bottleneck for GPU acceleration.
For computing the FFT, functions from the cuFFT library are utilized. As shown in Algorithm 4, complex-to-complex FFT
is used for x1 -directional half-cosine transforms, while real-to-complex or complex-to-real FFTs are used for x3 -directional
Fourier transforms. Similar to the ADI solver, TDMAs along the x2 -direction are inverted using the cuSPARSE library. Note
however, that TDMAs produced from the Poisson equation differ from those of the ADI solver in two aspects.
Firstly, this linear system has real-valued diagonals on the left-hand side and a complex-valued right-hand side from
Fourier transforms. For this reason TDMAs must be inverted once for the real part of the right-hand side, and another for
the imaginary part. Therefore, the required data layout is a block matrix resembling that of Sherman–Morrison algorithm in
the ADI solver (Eq. (10)). Here instead, a block matrix is formed by storing the real part and the imaginary part together.
Secondly, inversion of the TDMAs in the Poisson equation is unstable and therefore requires pivoting. Unlike the momentum equations, in which the TDMAs are diagonally dominant with the major diagonals given by 1 - \Delta t\, \alpha_m \frac{1}{Re_\delta} \frac{\delta^2}{\delta x_2^2}, the Poisson equation produces TDMAs whose major diagonals contain the sum of the modified wave-numbers k_l and k_m arising from the Fourier transforms:

k_l = \frac{2(1 - \cos(\pi l / N_X))}{\Delta x_1^2}, \qquad k_m = \frac{2(1 - \cos(\pi m / N_Z))}{\Delta x_3^2}, \qquad (16)

where \Delta x_1 and \Delta x_3 are the grid spacings in the x_1- and x_3-directions, respectively. For certain combinations of l and
m, the sum of kl and km may approach zero. In such a case, TDMAs may lose diagonal dominance and become unstable.
For this reason, it is advised to invert TDMAs of the Poisson equation using a function named cusparseDgtsv, which is
based on a diagonal pivoting algorithm [4]. Due to the cost of pivoting, this function requires additional computation time
compared to cusparseDgtsvStridedBatch used in the ADI solver.
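For the batched transforms of Algorithm 4, a cuFFT plan would typically be created once at start-up and reused at every sub-step; the self-contained sketch below (assuming the cufft module shipped with the PGI compiler; array and routine names are illustrative) sets up and executes the batched real-to-complex transform along x_3, one transform of length N_Z per streamwise index:

subroutine fft_x3_forward(w_r, w_c, nz, nx)
  use cudafor
  use cufft
  implicit none
  integer, intent(in) :: nz, nx
  real(8),    device :: w_r(nz, nx)          ! nx real signals of length nz
  complex(8), device :: w_c(nz/2 + 1, nx)    ! their half-spectra
  integer :: plan, istat
  ! Batched 1-D D2Z plan: contiguous real input of length nz,
  ! contiguous complex output of length nz/2 + 1, nx transforms.
  istat = cufftPlanMany(plan, 1, [nz], [nz], 1, nz, &
                        [nz/2 + 1], 1, nz/2 + 1, CUFFT_D2Z, nx)
  istat = cufftExecD2Z(plan, w_r, w_c)
  istat = cudaDeviceSynchronize()
  istat = cufftDestroy(plan)
end subroutine fft_x3_forward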

3.6. Collecting turbulence statistics

Collection of statistics is essential for analyses of turbulent flows. However, as the number of grid points grows, it requires additional computation time and a significant amount of memory. For example, collecting second-order statistics would require 9 additional three-dimensional arrays for time-averaged quantities (u_i and u'_i u'_j), which would require 4.5 Gbytes of additional memory in a domain having 134 million grid points. Considering the limited amount of memory on a GPU, storing statistics in global memory is certainly unrealistic.
To solve this issue, the present code utilizes OpenMP and CUDA streams to let the CPU take charge of computing statistics. As detailed in Algorithm 5, the solver starts with two or more OpenMP threads consisting of the master thread 0

Algorithm 4: Poisson equation: from forward FFT to TDMA inversion.

input : φ, ∇ of size N_Y × N_X × N_Z
temp  : DIV, CCAP, CTMP,
        W_R of size N_Z × N_X,
        CA of size 2N_X × N_Z,
        W_C of size (N_Z/2 + 1) × N_X
allocate(DIV): size N_X × N_Y × N_Z
allocate(CTMP): size (N_Z/2 + 1) × N_X × N_Y
DIV = transpose_yxz(∇)
for j = 1 to N_Y do
    for k = 1 to N_Z do
        for i = 1 to N_X do
            CA(i, k) = DIV(i, j, k)
            CA(i + N_X, k) = 0.
        end
    end
    CA = cufftExecZ2Z(CA, CUFFT_FORWARD): size 2N_X, batch N_Z
    W_R = transpose( (2/N_X) ( real(CA) cos θ + aimag(CA) sin θ ) ): θ = π(i − 1)/(2N_X) for i = 1 to 2N_X
    W_R(:, 1) = 0.5 W_R(:, 1)
    W_C = cufftExecD2Z(W_R): size N_Z, batch N_X
    CTMP(:, :, j) = (1/N_Z) W_C
end
allocate(CCAP): size N_Y × N_X × (N_Z/2 + 1)
CCAP = transpose_zyx(CTMP)
deallocate(DIV, CTMP)
for k = 1 to N_Z/2 + 1 do
    ! configure D so that real and imaginary parts are aligned (similar to Algorithm 2)
    ConfigureDiagonals(a, b, c, D)
    D_1, D_2 ← cusparseDgtsv(a, b, c, D): size N_X N_Y
    CCAP(:, :, k) = cmplx(D_1, D_2)
end
Inverse Fourier transform done likewise.
⋮
output: φ

and slave threads. The master thread takes charge of all GPU operations, and the slave threads do not interfere with the master thread. Two different non-default CUDA streams are created at the initialization step, named here for convenience NSstream and D2Hstream. Using these two streams, flow variables u_i and p computed at the previous time-step are asynchronously copied to the CPU via D2Hstream, while the main computation is processed on NSstream. After synchronizing the data transfer using cudaStreamSynchronize, the slave threads start to calculate statistics. This synchronization is done by the slave threads, so the GPU is not aware of what the slave threads are doing. Note in Algorithm 5 that copying flow variables and calculating statistics are not done at the current time-step of the Navier–Stokes equations; instead, a variable startMemcpyAsync is set to .true., which commences the statistics routine in the following time-step. This ensures that the Navier–Stokes computation and the statistics routine are completely overlapped.

4. Numerical experiments

4.1. Environment for experiments

Numerical experiments are conducted to compare the GPU code of the semi-implicit fractional-step method with a highly
optimized single-core CPU counterpart. Both codes are written to the Fortran 90 standard. The CPU code is run on a CentOS 6.5
Linux server with two deca-core Xeon E5-2660 v3 @2.6 GHz CPUs, and is compiled with an Intel Fortran Compiler version
16.0.3. The GPU code is run on a CentOS 6.8 workstation equipped with an octa-core Xeon E5-2630 v3 @2.4 GHz CPU along
with an NVIDIA Tesla K40c GPU, and is compiled with a PGI Fortran Compiler version 16.10. Both are compiled with an -O3
optimization, and all floating-point data have double-precision accuracy. Additional performance tests of the GPU solver are
conducted on a modern GPU server, IBM Power System S822LC for High Performance Computing. This server is equipped
with two octa-core Power8 CPUs and four Tesla P100 GPUs, but only a single GPU is utilized for this study. The GPU code
is run on Ubuntu 16.04 and is compiled with a PGI Fortran Compiler version 17.4.

4.2. Performance results: memory and speed

Performance of the present GPU solver is evaluated in terms of its speedup and its memory usage. Simulation results are
demonstrated in a DNS of the three-dimensional flat-plate boundary layer (Fig. 1) whose detailed configuration is addressed
in section 2.

Algorithm 5: Use of OpenMP and CUDA streams to collect statistics.


startMemcpyAsync = .false.
!$OMP PARALLEL PRIVATE(tid, n)
tid = OMP_GET_THREAD_NUM()
for n = 1 to ntimesteps do
!$OMP MASTER
.
.
.
if startMemcpyAsync then
Copy u i and p using cudaMemcpyAsync via D2Hstream
end
!$OMP END MASTER
!$OMP BARRIER
if startMemcpyAsync and tid ̸= 0 then
cudaStreamSynchronize(D2Hstream)
Calculate statistics.
end
!$OMP MASTER
for Runge–Kutta m = 1 to 3 do
.
.
.
end
!$OMP END MASTER
startMemcpyAsync = .false.
!$OMP BARRIER
!$OMP MASTER
if this step needs to write files for post-processing then
startMemcpyAsync = .false.
Copy u i and p using cudaMemcpyAsync via D2Hstream
else
startMemcpyAsync = .true.
end
!$OMP END MASTER
!$OMP BARRIER
if tid ̸= 0 then
cudaStreamSynchronize(D2Hstream)
Calculate statistics.
end
end

Table 3
Maximum grid size supported by each GPU and typical Reynolds numbers studied at this scale. Note that the Reynolds number studied by [31] was restricted due to the computational cost of their method, even though a higher number of grid points was used than in [26].

             Maximum grid size    Re θ       Required grid size
Tesla K40c   134M                 950 (a)    128M [26]
Tesla P100   190M                 940 (b)    210M [31]

(a) Reynolds number achieved by using a recycling method for the inflow condition.
(b) Reynolds number achieved by triggering turbulence from a laminar boundary layer.

Memory usage: The present solver uses 10 three-dimensional arrays in order to maximize the size of the problem within the limit of GPU memory (3 variables for the velocity in each direction, 1 variable for the pressure, 3 variables for the RHS of the momentum equation in each direction, and 3 variables for storing the RHS of the previous sub-step). In addition to these major variables, simulation results show that the minimum amount of memory required for constructing TDMAs, along with other minor variables, takes up space equivalent to about 1.2 three-dimensional variables. Hence the approximate total memory usage is found to be:
(10 N_X N_Y N_Z + 1.2 N_X N_Y N_Z) × 8 bytes = 89.6 N_X N_Y N_Z bytes.
Using this expression, the maximum grid size supported by a GPU can be estimated. Table 3 shows that up to 134 million
and 190 million number of grid points can be computed in Tesla K40c and Tesla P100, respectively. This suggests that a
single GPU is capable of DNS of boundary-layer flow at around Re θ = 950 such as those considered by Simens et al. [26] or
Wu and Moin [31].
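As a quick check of this estimate against Table 3: 89.6 × 1.34 × 10^8 ≈ 1.2 × 10^10 bytes and 89.6 × 1.9 × 10^8 ≈ 1.7 × 10^10 bytes, which sit just within the 12 GB of the Tesla K40c and the 16 GB of the Tesla P100, respectively (counting 1 GB as 2^30 bytes).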
Speedup: Computation time taken for processing the three major parts within a sub-step is investigated in a test simu-
lation for three different grid sizes (Fig. 9). Compared to the results of Fig. 5a, both Tesla K40 and Tesla P100 have reduced
the computation time by an order of magnitude. Total elapsed time for computing a time-step in various mesh sizes is
summarized in Table 4. For 134 million cells, overall computation has been accelerated by factors of 20 and 48 in Tesla
K40c and Tesla P100, respectively. Note however that such speedups fall short of the usual speedup offered by GPUs mainly
because of the bandwidth-bound nature of the semi-implicit fractional-step method.

Table 4
Speedup results of the present code running on different machines. For each grid size, average computation time
for one time-step (three sub-steps) is provided for a single-core Xeon E5-2660 v3 CPU and two GPUs, Tesla K40c
and Tesla P100 running at boost clock frequencies. Both speedups are calculated against the single-core CPU
result.
                          Computation time (s)                       Speedup
Number of grid cells      E5-2660 v3   Tesla K40c   Tesla P100       Tesla K40c   Tesla P100
512 × 256 × 128           13.4         2.35         2.08             5.7          6.4
768 × 256 × 128           22.8         3.21         2.34             7.1          9.7
1024 × 256 × 128          35.5         3.64         2.47             9.8          14.4
1536 × 256 × 128          76.6         5.39         2.89             14.2         26.5
2048 × 256 × 128          120.8        6.74         3.25             17.9         37.2
2560 × 256 × 128          144.6        8.73         3.83             16.6         37.8
3072 × 256 × 128          177.9        10.3         4.27             17.2         41.7
3456 × 256 × 128          205.1        11.3         4.62             18.2         44.4
4096 × 256 × 128          259.5        13.0         5.39             20.0         48.1
4608 × 256 × 128          279.5        –            6.19             –            45.2
5120 × 256 × 128          315.6        –            6.79             –            46.5
5760 × 256 × 128          375.3        –            8.26             –            45.4

Fig. 8. Time-line obtained from NVIDIA Visual Profiler (NVVP). Markers are obtained using the NVIDIA Tools Extension Library (NVTX).

Although these numbers are significant improvements when each of them is compared with the CPU result, comparison
between the two GPUs requires further explanation. According to the throughput values in Table 2, Tesla P100 has more
than 3 times higher computational power than Tesla K40c due to an increase in the number of DP cores and core frequency.
Nevertheless, the present solver runs only 2.4 times faster on Tesla P100 than on Tesla K40c. Considering that memory
bandwidth has increased by 2.5, this result shows how the speedup of this solver is affected by memory bandwidth. It
also suggests that the present solver performs much better in modern NVIDIA GPUs which have adopted High Bandwidth
Memory 2 (HBM2) to boost memory bandwidth.
To elucidate the advantage of using a GPU in moderate Reynolds number flows, the speedup results of the present
solver can be compared with those of Wang et al. [29]. They present MPI parallelization methods of a similar semi-implicit
fractional-step method for incompressible flows, and show graphically that 24 CPU cores are required for a 20-times
speedup. Thus for incompressible Navier–Stokes equations at moderate Reynolds numbers, the performance of a Tesla K40c
GPU is comparable to that of a 24-core CPU node. Since Wang et al.’s scalability result is limited to 48 cores which deliver
33 times speedup, it is difficult to directly compare the Tesla P100 GPU with CPU nodes; however it can be inferred that its
performance may well be comparable to more than 60 cores of CPU.
Fig. 9. Average computation time taken to process the three major parts of a sub-step using different GPUs, (a) Tesla K40c and (b) Tesla P100.

Fig. 10. Effect of grid size on speedup. Green ◦, Tesla K40c; blue ×, Tesla P100. Both GPUs have been run at boost clock frequencies. Note that the present code performs best when the problem size is a power of 2. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Statistics: The time-line in Fig. 8 shows how the various operations overlap with each other. The top figure shows the detailed time-line of one sub-step. The bottom figure is a magnification of the part in which the RHS computation overlaps with asynchronous data transfers from GPU to CPU. Here, two CPU threads are used, one for calculating statistics (indicated as Thread 307664640) and another for advancing the Navier–Stokes equations (Thread 1075850912). The figure clearly shows that the computation of statistics is completely hidden behind the Navier–Stokes computation on the GPU.
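As a rough illustration of this overlap strategy (a sketch under assumptions, not the authors' implementation), the listing below uses two OpenMP threads and two CUDA streams: one thread advances the solution on the GPU and stages an asynchronous device-to-host copy into a pinned, double-buffered host array, while the other thread accumulates statistics on the CPU from the snapshot taken during the previous step. The kernel advance_substep and the routine accumulate_statistics are hypothetical placeholders for the actual solver and statistics routines.

```cpp
// Sketch of hiding CPU-side statistics behind GPU-side time advancement with
// two OpenMP threads and two CUDA streams (placeholders, not the paper's code).
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

__global__ void advance_substep(double* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0e-3;              // stands in for RHS/ADI/Poisson work
}

void accumulate_statistics(const double* u, int n, double& mean) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += u[i];  // stands in for turbulence statistics
    mean = s / n;
}

int main() {
    const int n = 1 << 20;
    double* d_u;
    double* h_u[2];                          // pinned double buffer for async copies
    cudaMalloc(&d_u, n * sizeof(double));
    cudaMallocHost(&h_u[0], n * sizeof(double));
    cudaMallocHost(&h_u[1], n * sizeof(double));
    cudaMemset(d_u, 0, n * sizeof(double));

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    double mean = 0.0;
    for (int step = 0; step < 10; ++step) {
        const int cur = step % 2, prev = 1 - cur;
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                // GPU thread: advance the solution, then stage a snapshot copy.
                advance_substep<<<(n + 255) / 256, 256, 0, compute>>>(d_u, n);
                cudaStreamSynchronize(compute);
                cudaMemcpyAsync(h_u[cur], d_u, n * sizeof(double),
                                cudaMemcpyDeviceToHost, copy);
            } else if (step > 0) {
                // CPU thread: statistics on the previous snapshot, overlapped
                // with the GPU work enqueued by thread 0.
                accumulate_statistics(h_u[prev], n, mean);
            }
        }
        cudaStreamSynchronize(copy);         // snapshot ready for the next step
    }
    std::printf("running mean = %g\n", mean);
    cudaFreeHost(h_u[0]); cudaFreeHost(h_u[1]); cudaFree(d_u);
    cudaStreamDestroy(compute); cudaStreamDestroy(copy);
    return 0;
}
```

Double buffering of the host array avoids a race between the copy being staged for the current step and the snapshot still being read by the statistics thread.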

4.3. Effect of grid size on speedup

Typical scalability curves, in which speedup is plotted against the number of processors, cannot be drawn for a single GPU, since the user has no control over the number of cores used. Instead, Fig. 10 shows speedup plotted against the number of grid points. Essentially, this curve illustrates how well the GPU resources are utilized as the problem size is increased. GPUs generally perform better on larger problems, because otherwise idle resources are put to use. Hence one can anticipate that the speedup will at first rise as the grid size is increased. The curve then shows at which problem size the GPU starts to saturate its resources. Ideally, the curve should flatten for grid sizes large enough to keep the available resources fully occupied.
Fig. 10 shows that the Tesla K40c receives a sufficient workload for grid sizes larger than 67 million, while the Tesla P100 does so for grid sizes larger than 134 million. Note that beyond this point there are certain grid sizes (observable as the “bending points”) whose speedup falls below the maximum value. This is because the majority of the computation in this solver is spent on CR+PCR and FFT, both of which are reduction algorithms. These algorithms are generally known to perform best when the problem size is a power of 2, so some performance is naturally lost when the grid size contains factors of 3 or 5, while the best performance is achieved for the grid size 4096 × 256 × 128 = 2^27. It can therefore be concluded from Fig. 10 that once the grid size is large enough, the speedup is determined not by the grid size but by the grid configuration in each coordinate direction.
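For readers choosing a grid configuration, the small host-side helper below (an illustration only, not part of the solver) checks whether a given dimension is a pure power of two, which is the fastest case for the reduction-type CR+PCR and FFT kernels, or whether it merely factors into 2, 3, and 5, which is handled but at reduced performance.

```cpp
// Illustrative check of how a grid dimension factors into small radices.
#include <cstdio>
#include <initializer_list>

// Returns true if n = 2^a * 3^b * 5^c; sets power_of_two if n is a pure power of 2.
bool is_fft_friendly(int n, bool& power_of_two) {
    power_of_two = (n > 0) && ((n & (n - 1)) == 0);
    for (int f : {2, 3, 5})
        while (n % f == 0) n /= f;
    return n == 1;
}

int main() {
    for (int n : {4096, 3456, 5760, 1001}) {
        bool p2;
        const bool ok = is_fft_friendly(n, p2);
        std::printf("%5d : %s\n", n,
                    p2 ? "power of two (fastest case)"
                       : ok ? "factors of 2, 3, 5 only (slower)"
                            : "contains large prime factors (avoid)");
    }
    return 0;
}
```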

Fig. 11. An example of sparse matrices to be solved when comparing the present ADI method with the conjugate gradient method on a 4 × 4 × 4 grid: (a) three tridiagonal matrices and (b) one hepta-diagonal matrix. Blue markers indicate non-zero elements and squares with dotted lines represent different levels of available parallelism. The periodic boundary condition is omitted here. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.4. Present ADI method versus CUDA-based PCG method

For more general boundary conditions, systems of linear equations such as Eq. (7) are often solved instead with the preconditioned conjugate gradient (PCG) method. From the viewpoint of operation counts, the ADI method is generally faster than the PCG method, since the former factorizes the matrix into TDMAs while the latter processes a hepta-diagonal matrix. From the viewpoint of parallel computing, however, the PCG method is favored, because the ADI method suffers from the difficulty of parallelizing TDMA inversion and from multiple data transfers. Despite such difficulties, the present study has shown that the factorized systems map well onto CUDA when multiple levels of parallelism are exploited (section 3.4), so that the present ADI method exposes three additional levels of parallelism (Fig. 11a). Note, however, that the ADI method solves three TDMAs, whereas the PCG method solves one hepta-diagonal matrix (Fig. 11b). The objective of this experiment is therefore to compare the speed of solving three TDMAs with the speed of solving one hepta-diagonal matrix in CUDA.
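As a concrete sketch of what one of the three TDMA solves involves in CUDA, the routine below solves a batch of independent tridiagonal systems with the strided-batch solver of the cuSPARSE library [20]; one such call would be issued per coordinate direction. The interface shown here is an assumption for illustration, not the paper's actual code, and the diagonals are assumed to be stored system after system with a stride equal to the system size m.

```cpp
// Sketch: batched tridiagonal solve with cuSPARSE (illustration only).
// dl, d, du hold the sub-, main- and super-diagonals of nsys systems of size m,
// stored contiguously; x holds the right-hand sides and receives the solutions.
#include <cusparse.h>
#include <cuda_runtime.h>

void solve_batched_tdma(cusparseHandle_t handle, int m, int nsys,
                        double* dl, double* d, double* du, double* x)
{
    size_t bytes = 0;
    // Query the workspace size required by the strided-batch tridiagonal solver.
    cusparseDgtsv2StridedBatch_bufferSizeExt(handle, m, dl, d, du, x,
                                             nsys, /*batchStride=*/m, &bytes);
    void* buffer = nullptr;
    cudaMalloc(&buffer, bytes);
    // Solve all nsys systems in one call; the solutions overwrite x.
    cusparseDgtsv2StridedBatch(handle, m, dl, d, du, x,
                               nsys, /*batchStride=*/m, buffer);
    cudaFree(buffer);
}
```

Batching all systems of one sweep into a single call is what allows the per-system and within-system levels of parallelism of Fig. 11a to be exploited simultaneously.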
For the experiment, a conjugate gradient (CG) method is implemented using built-in functions from the cuBLAS and cuSPARSE libraries. Two cases are considered – one without a preconditioner and another with an incomplete-LU preconditioner with zero fill-in, which reduces the number of iterations. The implementation follows the code provided in the white paper [21]. For the ADI method, the left-hand-side matrix is stored in three vectors, whereas for the PCG method it is stored in Compressed Sparse Row (CSR) format. The computation time for transposing the B and C matrices produced by the ADI method in Eq. (9) is included in the comparison. The solution error of the two CG methods is calculated using the solution obtained from the ADI method as the reference. Accordingly, the residual tolerance is set to 10^-12 so as to drop the solution error below O(10^-8).
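The unpreconditioned CG variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the white paper [21]: the vector updates use cuBLAS, while the hepta-diagonal matrix–vector product is written here as a plain CSR kernel instead of a cuSPARSE call to keep the listing short. All array arguments are device pointers, r is assumed to be initialized to b − Ax (equal to b for a zero initial guess), and p is initialized to r.

```cpp
// Sketch of an unpreconditioned conjugate gradient solve (illustration only).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cmath>

// Hepta-diagonal matrix in CSR format applied to p, result written to Ap.
__global__ void csr_spmv(int n, const int* rowPtr, const int* colInd,
                         const double* val, const double* p, double* Ap) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * p[colInd[k]];
        Ap[row] = sum;
    }
}

// Returns the number of iterations performed; tol is the residual tolerance
// (the text uses 10^-12).
int cg_solve(cublasHandle_t h, int n, const int* rowPtr, const int* colInd,
             const double* val, double* x, double* r, double* p, double* Ap,
             double tol, int maxIter) {
    double rs_old, rs_new, pAp;
    cublasDdot(h, n, r, 1, r, 1, &rs_old);
    for (int it = 0; it < maxIter; ++it) {
        csr_spmv<<<(n + 255) / 256, 256>>>(n, rowPtr, colInd, val, p, Ap);
        cublasDdot(h, n, p, 1, Ap, 1, &pAp);
        const double alpha = rs_old / pAp, neg_alpha = -alpha;
        cublasDaxpy(h, n, &alpha, p, 1, x, 1);       // x <- x + alpha p
        cublasDaxpy(h, n, &neg_alpha, Ap, 1, r, 1);  // r <- r - alpha A p
        cublasDdot(h, n, r, 1, r, 1, &rs_new);
        if (std::sqrt(rs_new) < tol) return it + 1;
        const double beta = rs_new / rs_old, one = 1.0;
        cublasDscal(h, n, &beta, p, 1);              // p <- r + beta p
        cublasDaxpy(h, n, &one, r, 1, p, 1);
        rs_old = rs_new;
    }
    return maxIter;
}
```

The ILU(0)-preconditioned variant additionally performs a sparse triangular forward solve and backward substitution in every iteration, which is the step found below to outweigh the savings in iteration count.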
Fig. 12 compares the time taken by the three methods to solve the left-hand side of the momentum equations (Eq. (7)). Although the use of a preconditioner reduces the number of iterations needed, the time required for the forward solve and backward substitution [21] exceeds the time taken by the additional iterations of the CG solver without a preconditioner. On the other hand, for grid sizes ranging from 4M to 50M cells, the CG method without a preconditioner runs faster than the present ADI method. This experiment nevertheless suggests two major advantages of the present ADI method. First, as can be seen from the slopes of the two curves in Fig. 12, the present ADI method is expected to perform better for grid sizes larger than 50M. Experiments on larger grids could not be conducted because of the heavy memory requirement of the conjugate gradient method. This leads to the second advantage of the present ADI method: it requires less memory and therefore allows larger simulations. Apart from the obvious saving that small TDMAs require far less storage than a large hepta-diagonal matrix (three diagonal vectors versus roughly seven stored values and seven column indices per row in CSR format, plus the CG work vectors), the present method can flexibly adjust the size of the systems it solves at once by taking both the grid size and the memory capacity into account.

Fig. 12. Performance of the present ADI method and the conjugate gradient methods with respect to the grid size. Green ⋄, PCG; blue ×, ADI; red ◦, CG. Computation time is measured on a Tesla P100 GPU. Due to the heavy memory requirements of the conjugate gradient methods, the curves with green ⋄ and red ◦ are plotted only up to 50M grid cells. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5. Conclusions

An efficient numerical solver using a semi-implicit fractional-step method for incompressible flows has been developed for GPU acceleration, demonstrating the promising potential of GPUs for solving the Navier–Stokes equations with non-explicit numerical methods. By efficiently utilizing the memory model of CUDA and its built-in libraries, the major difficulties in the ADI method for solving the momentum equations and in the Fourier-transform-based direct solution method for solving the Poisson equation have been overcome. Ways to leverage fine-grain parallelism in the inversion of tridiagonal matrices are proposed. OpenMP and CUDA streams are adopted to collect turbulence statistics efficiently on the CPU while the main computation is carried out on the GPU without interruption. Despite the difficulties of the semi-implicit scheme on parallel machines, the present method efficiently uses the restricted memory space of a GPU and achieves significant speedups of 20× and 48× on 134 million cells using a Tesla K40c and a Tesla P100, respectively, in comparison with a single-core Xeon E5-2660 v3 @2.6 GHz CPU. This study suggests that DNS at around Re_θ = 950 can be performed efficiently on a single GPU with speedups comparable to those of CPU nodes with 24 to 60 cores. In order to tackle simulations at larger Reynolds numbers, the present method will be extended to multiple GPUs in a future study.

Acknowledgements

This research was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2015R1A2A1A15056086 and NRF-2014R1A2A1A11049599).

References

[1] G. Alfonsi, S.A. Ciliberti, M. Mancini, L. Primavera, GPGPU implementation of mixed spectral-finite difference computational code for the numerical
integration of the three-dimensional time-dependent incompressible Navier–Stokes equations, Comput. Fluids 102 (2014) 237–249.
[2] G. Borrell, J.A. Sillero, J. Jiménez, A code for direct numerical simulation of turbulent boundary layers at high Reynolds numbers in BG/P supercomputers, Comput. Fluids 80 (2013) 37–43.
[3] A.R. Brodtkorb, M.L. Sætra, M. Altinakar, Efficient shallow water simulations on GPUs: implementation, visualization, verification, and validation, Comput. Fluids 55 (2012) 1–12.
[4] L.-W. Chang, J.A. Stratton, H.-S. Kim, W.-M.W. Hwu, A scalable, numerically stable, high-performance tridiagonal solver using GPUs, in: Proceedings of
the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society Press, 2012, p. 27.
[5] J. Cheng, M. Grossman, T. McKercher, Professional CUDA C Programming, John Wiley & Sons, 2014.
[6] L. Deng, H. Bai, F. Wang, Q. Xu, CPU/GPU computing for an implicit multi-block compressible Navier–Stokes solver on heterogeneous platform, International Journal of Modern Physics: Conference Series 42 (2016) 1660163, World Scientific.
[7] M. Frigo, A fast Fourier transform compiler, ACM SIGPLAN Notices 34 (5) (1999) 169–180, ACM.
[8] S. Hahn, J. Je, H. Choi, Direct numerical simulation of turbulent channel flow with permeable walls, J. Fluid Mech. 450 (2002) 259–285.
[9] R. Jacobs, P. Durbin, Simulations of bypass transition, J. Fluid Mech. 428 (2001) 185–212.
[10] T. Kempe, A. Aguilera, W. Nagel, J. Froelich, Performance of a projection method for incompressible flows on heterogeneous hardware, Comput. Fluids
121 (2015) 37–43.
[11] A. Khajeh-Saeed, J.B. Perot, Direct numerical simulation of turbulence using GPU accelerated supercomputers, J. Comput. Phys. 235 (2013) 241–257.
[12] J. Kim, D. Kim, H. Choi, An immersed-boundary finite-volume method for simulations of flow in complex geometries, J. Comput. Phys. 171 (1) (2001)
132–150.
[13] J. Kim, P. Moin, Application of a fractional-step method to incompressible Navier–Stokes equations, J. Comput. Phys. 59 (2) (1985) 308–323.
[14] D.B. Kirk, W.H. Wen-mei, Programming Massively Parallel Processors: A Hands-on Approach, Newnes, 2012.
[15] H. Le, P. Moin, An improvement of fractional step methods for the incompressible Navier–Stokes equations, J. Comput. Phys. 92 (2) (1991) 369–379.
[16] T.S. Lund, X. Wu, K.D. Squires, Generation of turbulent inflow data for spatially-developing boundary layer simulations, J. Comput. Phys. 140 (2) (1998)
233–258.
[17] M. Mirzadeh, A. Guittet, C. Burstedde, F. Gibou, Parallel level-set methods on adaptive tree-based grids, J. Comput. Phys. 322 (2016) 345–364.
[18] K. Niemeyer, C. Sung, Recent progress and challenges in exploiting graphics processors in computational fluid dynamics, J. Supercomput. 67 (2) (2014)
528–564.
[19] M. Norman, J. Larkin, A. Vose, K. Evans, A case study of CUDA Fortran and OpenACC for an atmospheric climate kernel, J. Comput. Sci. 9 (2015) 1–6.

[20] NVIDIA Corporation, CUDA Toolkit Documentation: cuSPARSE, http://docs.nvidia.com/cuda/cusparse, 2007–2016.
[21] NVIDIA Corporation, CUDA Toolkit Documentation: Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS, http://docs.nvidia.com/cuda/incomplete-lu-cholesky, 2007–2016.
[22] NVIDIA Corporation, NVIDIA Tesla V100 GPU Architecture, http://www.nvidia.com/object/volta-architecture-whitepaper.html, 2007–2016.
[23] A.J. Rueda, J.M. Noguera, A. Luque, A comparison of native GPU computing versus OpenACC for implementing flow-routing algorithms in hydrological
applications, Comput. Geosci. 87 (2016) 91–100.
[24] G. Ruetsch, M. Fatica, CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming, 2nd edition, Elsevier, 2013.
[25] F. Salvadore, M. Bernardini, M. Botti, GPU accelerated flow solver for direct numerical simulation of turbulent flows, J. Comput. Phys. 235 (2013)
129–142.
[26] M.P. Simens, J. Jiménez, S. Hoyas, Y. Mizuno, A high-resolution code for turbulent boundary layers, J. Comput. Phys. 228 (11) (2009) 4218–4231.
[27] S. Vanka, A. Shinn, K. Sahu, Computational fluid dynamics using graphics processing units: challenges and opportunities, in: ASME 2011 International
Mechanical Engineering Congress and Exposition, vol. 6, ASME, Nov. 2011, pp. 429–437.
[28] Q. Wang, K.D. Squires, Large eddy simulation of particle-laden turbulent channel flow, Phys. Fluids 8 (5) (1996) 1207–1223.
[29] Y. Wang, M. Baboulin, J. Dongarra, J. Falcou, Y. Fraigneau, O. Le Maitre, A parallel solver for incompressible fluid flows, Proc. Comput. Sci. 18 (2013)
439–448.
[30] Y. Wang, M. Baboulin, K. Rupp, O. Le Maître, Y. Fraigneau, Solving 3D incompressible Navier–Stokes equations on hybrid CPU/GPU systems, in: Proceedings of the High Performance Computing Symposium, Society for Computer Simulation International, 2014, p. 12.
[31] X. Wu, P. Moin, Direct numerical simulation of turbulence in a nominally zero-pressure-gradient flat-plate boundary layer, J. Fluid Mech. 630 (2009)
5–41.
[32] M. Yarrow, Solving periodic block tridiagonal systems using the Sherman–Morrison–Woodbury formula, in: AIAA Computational Fluid Dynamics Conference, 9th, Buffalo, Washington, DC, 1989, pp. 188–196, Technical Papers (A89-41776 18-02).
[33] Y. Zhang, J. Cohen, J.D. Owens, Fast tridiagonal solvers on the GPU, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, ACM Press, May 2010, pp. 127–136.
