
Journal of Quantitative Spectroscopy & Radiative Transfer 269 (2021) 107680


A fast GPU Monte Carlo implementation for radiative heat transfer in graded-index media

Jiang Shao, Keyong Zhu, Yong Huang∗
School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
∗ Corresponding author. E-mail address: huangy@buaa.edu.cn (Y. Huang).
https://doi.org/10.1016/j.jqsrt.2021.107680

Article history: Received 14 October 2020; Revised 27 March 2021; Accepted 29 March 2021; Available online 27 April 2021

Keywords: Radiative heat transfer; Graded-index (GRIN) medium; Monte Carlo method; Runge-Kutta ray tracing method; Graphics processing unit (GPU)

Abstract

Simulating radiative heat transfer in a graded-index (GRIN) medium is particularly challenging because of the curved ray propagation trajectories. As an effective method, the Monte Carlo method is easy to implement with high precision. However, the Monte Carlo method is time consuming, and the computing time increases substantially when it is combined with the Runge-Kutta ray tracing technique to obtain the ray trajectories in the GRIN medium. Because the Monte Carlo method is ideally suited for parallel processing architectures and acceleration with graphics processing units (GPUs), we have developed a fast GPU Monte Carlo implementation for radiative heat transfer in GRIN media. The performance of the GPU implementation has been improved by combining the ray tracing process with a binary search and by optimizing the code based on the architecture of GPUs. In particular, the utilization of the GPU hardware has been maximized, and the warp inactivity has been substantially reduced. Two- and three-dimensional GRIN medium models were evaluated to assess the accuracy and performance of the GPU implementations. Compared with the equivalent central processing unit (CPU) implementations, the GPU implementations provided in this paper show a great capability for producing physically accurate results with substantial speedups. The speedup of the GPU implementation on a single GPU for the two-dimensional case reaches 43.13× against the equivalent CPU implementation using a single CPU core and 5.65× against the equivalent CPU implementation using 6 CPU cores (12 threads). The speedup of the GPU implementation on a single GPU for the three-dimensional case reaches 35.61× against the equivalent CPU implementation using a single CPU core and 2.07× against the equivalent CPU implementation using 14 CPU cores (28 threads).

© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

The graded-index (GRIN) phenomenon is ubiquitous in nature. For instance, the refractive index of many biological optical systems is inhomogeneous. Spherical GRIN eye lenses are found in aquatic creatures, such as fish, octopus, squid, and jellyfish [1-3], while aspherical GRIN lenses are found in air dwellers, including humans, lions, and cows [4-6]. Compared with the homogeneous refractive index singlet lens, GRIN lenses have the advantages of correcting optical aberrations, enhancing focusing power, and increasing the field of view [7]. They are widely applied in many areas, including camera lenses, optical fibers [8] and sensors [9]. In addition, the GRIN phenomenon occurs in some high-temperature situations in the industrial field, such as combustion and plasma heating. This brings extra challenges to the analysis of radiative heat transfer in these cases.

Traditional analytical methods are usually unsuitable for a GRIN medium because rays propagate along curved trajectories instead of straight lines. To analyze radiative heat transfer in a GRIN medium, a variety of numerical methods have been developed, such as the meshless method [10], the finite element method [11], the finite volume method [12], and the lattice Boltzmann method [13]. As another effective method for analyzing radiative heat transfer in a GRIN medium, the Monte Carlo (MC) ray tracing method has been widely used. It is easy to implement and highly flexible, and its computing time increases only moderately with the complexity of the problem. Thus, cases with complex geometries or boundary conditions, which are usually challenging for other methods, can easily be analyzed. Moreover, the simulated motion of the photons in the MC ray tracing method is close to their natural propagation, which brings high precision to the result [14,15]; the MC solution therefore often acts as a benchmark when an exact analytical solution cannot be obtained.

A key point of using the MC ray tracing method to analyze radiative heat transfer in a GRIN medium is to obtain the curved

propagation trajectories of rays. Generally, the propagation trajectory function of rays in a GRIN medium is a complex implicit function that is difficult to solve. Only for some cases with special refractive index distributions can the propagation trajectory function of rays be simplified and expressed as a simple explicit function. For a two-dimensional GRIN medium with layered and radial graded index distributions, Liu [16] deduced the propagation trajectory function of rays and then solved radiative heat transfer problems using the MC ray tracing method. For cases in which the refractive index distribution function of a GRIN medium is already given, Anurag Sharma et al. [17,18] proposed a numerical method for tracing rays that consists of transforming the ray equation into a convenient form and then solving the equation with the Runge-Kutta algorithm. By combining this method with the backward MC method, Shi et al. [19] investigated the emission characteristics of a two-dimensional linear refractive index medium. Huang et al. [20] obtained the numerical solution of the ray trajectory in a two-dimensional GRIN medium using the Runge-Kutta algorithm. Then, based on the numerical solution of the ray trajectory, the radiative heat transfer of the medium was simulated with the MC ray tracing method, and the accuracy of the results was validated against the benchmark solutions calculated by Liu [16]. Qian et al. [21] further extended the Runge-Kutta ray tracing (RKRT) technique to a three-dimensional GRIN medium. By combining the MC method with the RKRT technique, the radiative heat transfer in the three-dimensional GRIN medium was analyzed. Furthermore, a backward and forward MC method was proposed for simulating vector radiative transfer in a GRIN medium. On the basis of the numerical solution of the ray trajectory obtained via the RKRT technique, the vector radiative transfer in a two-dimensional GRIN medium was simulated [22].

Another key point of using the MC ray tracing method to analyze radiative heat transfer in a GRIN medium is to satisfy its demanding requirements in terms of hardware computing capability and computing time. As a statistical simulation method, the MC ray tracing method relies on repeated random sampling of large numbers of rays propagating in the medium to obtain numerical results. Therefore, a common problem is that the accuracy of the MC ray tracing method highly depends on the number of samples, which here is the number of propagating rays. Researchers have proposed modified Monte Carlo methods with improved efficiency for specific problems of radiative transfer. For the problem of radiation onto a small spot and/or into a small direction cone, the backward Monte Carlo method [23] is more efficient if the source of radiation is large. For transient radiative transfer, Wang et al. [24,25] developed a modified Monte Carlo method by introducing a time shift and superposition principle. For polarized radiative transfer, Huang et al. [26] developed a backward and forward Monte Carlo method to study thermal emission considering polarization.

Simulating radiative heat transfer with the MC ray tracing method in a medium that has complicated geometries, complex boundary conditions or a graded index was once considered too expensive because of the high requirements for the computing capability of the hardware and the unbearably long computing time. When combined with the RKRT technique for obtaining the ray trajectories in a GRIN medium, the computing expenses are further multiplied because of the iterations of the Runge-Kutta algorithm for solving the curved ray trajectory function in the ray tracing process. However, the repeated random samples of the MC method, such as those of the ray tracing process in this paper, are mutually independent and highly parallelizable. The method therefore greatly benefits from computing devices with parallel architectures. The multicore central processing unit (CPU) is one of the popular choices for parallel computing. On the basis of the single-instruction, multiple-data architecture [27] or the multiple-instruction, multiple-data architecture [28], a substantial performance boost can be achieved. However, except for multi-CPU systems dedicated to professional computing, the core number of most desktop CPUs is less than 20. For example, the newest Intel CPU with hyperthreading technology, the Intel Core I9-10980XE, has 18 cores and can process up to 36 threads concurrently. By contrast, a graphics processing unit (GPU), which is designed for parallel computing, usually has thousands of cores and can process up to thousands of threads concurrently. When handling a highly parallel computing task, a GPU can usually provide an equivalent performance at a much lower price than a CPU. Thus, it has recently become increasingly popular among researchers in studies with massively parallel computations.

Many studies have demonstrated the excellent efficiency of GPUs in implementing parallelizable tasks, such as MC simulation and ray tracing. Efremenko et al. [29] developed a GPU implementation of a radiative transfer model based on the discrete ordinate solution method and achieved a 50× speedup for the two-stream radiative transfer model against the original single-threaded CPU codes. Sweezy [30] designed a MC fluence estimator to exploit the computational power of a GPU to estimate global fluence and achieved a 23× speedup for the track-length estimator using a single-core CPU paired with a GPU. Silvestri et al. [31] implemented a fast reciprocal MC algorithm on a GPU to accurately solve radiative heat transfer in turbulent flows of non-gray participating media that can be coupled to fully resolved turbulent flows and achieved a speedup of up to 3 orders of magnitude compared to a classical CPU implementation. James Tickner [32] developed a GPU-based general-purpose X-ray modeling MC simulation code that computes the transport of high-energy (>1 keV) photons through arbitrary three-dimensional geometry models, simulates their physical interactions and performs tallying and variance reduction. With the particle-per-block approach introduced, a speedup of up to 35× was achieved compared to an equivalent CPU-based code. Ren et al. [33] presented a parallel implementation for MC simulation of light propagation in heterogeneous tissues. In addition, the feasibility and efficiency of parallel MC implementation on a GPU were validated. Horiuchi et al. [34] implemented ray tracing in a GRIN lens with a GPU and achieved a computing speed approximately 19-fold (on average) higher than that of a CPU.

In this paper, we develop a GPU implementation of the MC method combined with the RKRT technique for simulating radiative heat transfer in a GRIN medium using the compute unified device architecture (CUDA) developed by NVIDIA [35]. Two- and three-dimensional GRIN medium models were evaluated to assess the accuracy and performance of the GPU implementation. The physical models of the cases and the simulation method are presented in Section 2. The GPU implementation and the hardware used in this paper are introduced in Section 3. Algorithm accelerations, including improving the ray tracing process by combining it with a binary search and improving the performance of the GPU implementation by optimizing the code based on the GPU architecture, are proposed in Section 4. The accuracy and speedups of the GPU implementations compared to the equivalent CPU implementations are presented in Section 5. Finally, the conclusions are provided in Section 6.

2. Physical models and simulation methods

2.1. Physical models

To compare the accuracy and performance of the GPU implementation and the equivalent CPU implementation of the MC method combined with the RKRT technique for analyzing radiative heat transfer in a GRIN medium, the temperature fields of two- and three-dimensional GRIN media are simulated.


The two- and three-dimensional GRIN media are assumed to be gray and at radiative equilibrium, with the optical thickness based on the side length of the square/cubic field τ = 1.0, the single scattering albedo ω, and the scattering phase function Φ(Ω′, Ω) = 1 + Ω′ · Ω.

The physical model of the two-dimensional GRIN medium is shown in Fig. 1(a). The cross-sectional area of the square field is H × H, divided evenly into 20 × 20 cells. The special radial refractive index distribution of the two-dimensional GRIN medium is

n(x, y) = 5\left[1 - 0.4356\left(x^2 + y^2\right)/H^2\right]^{0.5}.    (1)

The four boundaries of the medium are black, with temperatures of 1000 K for the bottom boundary and 0 K for the others.

The physical model of the three-dimensional GRIN medium is shown in Fig. 1(b). The size of the cubic field is H × H × H, divided evenly into 31 × 31 × 31 cells. The side length H here is 0.1 m. The special radial refractive-index distribution of the three-dimensional GRIN medium is

n(x, y, z) = \frac{2}{1 + x^2 + y^2 + z^2}.    (2)

The six boundaries of the medium are also black, with temperatures of 1000 K for the bottom boundary and 0 K for the others.

Fig. 1. Physical model of (a) the two-dimensional GRIN medium and (b) the three-dimensional GRIN medium.

Thermal radiation in a GRIN medium propagates along curved lines, and the corresponding ray trajectory equation is a second-order ordinary differential equation. Therefore, the Runge-Kutta algorithm is adopted here to obtain the numerical solution of the ray trajectory.

2.2. The Runge-Kutta algorithm

According to Fermat's principle, the ray trajectory equation in a GRIN medium is

\frac{d}{ds}\left(n(\mathbf{r})\,\frac{d\mathbf{r}}{ds}\right) = \nabla n(\mathbf{r})    (3)

where r is the position vector of a point on the ray, n is the refractive index, and ds is the geometric increment along the ray path [36]. By defining a new variable t [20] through

dt = \frac{ds}{n(\mathbf{r})},    (4)

Eq. (3) can be transformed into

\frac{d^2\mathbf{r}}{dt^2} = n(\mathbf{r})\nabla n(\mathbf{r}) = \frac{1}{2}\nabla n(\mathbf{r})^2.    (5)

Introduce the position matrix R as

R = \begin{bmatrix} x & y & z \end{bmatrix}^T    (6)

and the ray vector T as

T = \frac{d\mathbf{r}}{dt} = n(\mathbf{r})\frac{d\mathbf{r}}{ds} = n(\mathbf{r})\begin{bmatrix} \cos\alpha & \cos\beta & \cos\gamma \end{bmatrix}^T    (7)

where α, β, and γ are the angles between the ray vector and the x-axis, y-axis, and z-axis, respectively. Then, define the matrix D as

D = n(\mathbf{r})\nabla n(\mathbf{r}) = \frac{1}{2}\nabla n(\mathbf{r})^2 = \frac{1}{2}\begin{bmatrix} \partial n^2/\partial x & \partial n^2/\partial y & \partial n^2/\partial z \end{bmatrix}^T.    (8)

We can obtain

\frac{d^2\mathbf{r}}{dt^2} = \frac{dT}{dt} = D.    (9)

Eq. (9) is a second-order ordinary differential equation. By adopting the classical fourth-order Runge-Kutta formulas [37], the iterative equations of the position matrix R and the ray vector T are derived as

\begin{cases}
k_1 = \Delta t\, D(R_n) \\
k_2 = \Delta t\, D\!\left(R_n + \frac{\Delta t}{2} T_n\right) \\
k_3 = \Delta t\, D\!\left(R_n + \frac{\Delta t}{2} T_n + \frac{\Delta t}{4} k_1\right) \\
k_4 = \Delta t\, D\!\left(R_n + \Delta t\, T_n + \frac{\Delta t}{2} k_2\right) \\
R_{n+1} = R_n + \Delta t\, T_n + \frac{\Delta t}{6}\left(k_1 + k_2 + k_3\right) \\
T_{n+1} = T_n + \frac{1}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)
\end{cases}    (10)

where Δt is the step size, which represents the increment of t between two iterations.
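For illustration, one iteration of Eq. (10) maps onto a short device routine. The following CUDA sketch is not the code used in this work: rk4Step(), dOf(), add() and mul() are illustrative names, and dOf() evaluates D for the three-dimensional distribution of Eq. (2), for which (1/2)∂n²/∂x = −8x/(1 + x² + y² + z²)³ and similarly for y and z.

#include <cuda_runtime.h>

// Component-wise helpers for 3-vectors, defined so the sketch is self-contained.
__device__ float3 add(float3 a, float3 b) { return make_float3(a.x + b.x, a.y + b.y, a.z + b.z); }
__device__ float3 mul(float s, float3 a) { return make_float3(s * a.x, s * a.y, s * a.z); }

// D(R) = n∇n = (1/2)∇n^2, evaluated here for the refractive index of Eq. (2).
__device__ float3 dOf(float3 R)
{
    float s = 1.0f + R.x * R.x + R.y * R.y + R.z * R.z;
    float c = -8.0f / (s * s * s);
    return make_float3(c * R.x, c * R.y, c * R.z);
}

// Advance the position R and the ray vector T by one step dt, following Eq. (10).
__device__ void rk4Step(float3 &R, float3 &T, float dt)
{
    float3 k1 = mul(dt, dOf(R));
    float3 k2 = mul(dt, dOf(add(R, mul(0.5f * dt, T))));
    float3 k3 = mul(dt, dOf(add(R, add(mul(0.5f * dt, T), mul(0.25f * dt, k1)))));
    float3 k4 = mul(dt, dOf(add(R, add(mul(dt, T), mul(0.5f * dt, k2)))));
    R = add(R, add(mul(dt, T), mul(dt / 6.0f, add(add(k1, k2), k3))));
    T = add(T, mul(1.0f / 6.0f, add(add(k1, mul(2.0f, k2)), add(mul(2.0f, k3), k4))));
}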
2.3. The MC ray tracing method

On the basis of the iterative equations of the ray trajectory obtained above, the MC ray tracing method can be adopted to simulate the radiative heat transfer in a GRIN medium. The entire process can be divided into three parts: photon initialization, ray tracing and post-processing.

Each cell of the GRIN medium emits a large number of photons. The random emitting position, random direction and probabilistic propagation length of each photon are first calculated in the photon initialization part. Then, the propagation of each photon is traced step by step using Eq. (10) in the ray tracing part. For a domain with black boundaries, the ray tracing process of each photon stops only when the photon is absorbed by the medium or by the black boundaries.


Fig. 2. Flowchart of the MC ray tracing method.

Fig. 3. Flowchart of the GPU implementation.


The indexes of the cells in which each photon is emitted and the indexes of the cells/boundaries in which each photon is absorbed are recorded. Thus, the radiation transfer factor RD_{ij} can be obtained, which is defined as the fraction of the energy emitted by cell i that is absorbed by cell j [20]. Here, it equals the proportion of the number of photons emitted by cell i and absorbed by cell j to the total number of photons emitted by cell i. Under the concept of the radiation transfer factor RD, the energy conservation equation can be expressed as

4 n_i^2 \kappa_i V_i \sigma T_i^4 = \sum_{j=1}^{N_v} 4 n_j^2 \kappa_j V_j \sigma T_j^4 RD_{ji} + \sum_{k=1}^{N_s} n_k^2 \varepsilon_k S_k \sigma T_k^4 RD_{ki}    (11)

where T is the temperature of a cell, σ is the Stefan-Boltzmann constant, 5.67 × 10⁻⁸ W/(m²·K⁴), κ is the absorption coefficient, V_j is the volume of cell j, ε is the emissivity, S_k is the area of boundary k, and N_v and N_s are the numbers of cells and boundaries, respectively.

According to its interchangeability, the radiation transfer factor RD satisfies the relationships

\varepsilon_i S_i RD_{ij} = \varepsilon_j S_j RD_{ji}, \qquad \varepsilon_i S_i RD_{ij} = 4 \kappa_j V_j RD_{ji}, \qquad 4 \kappa_i V_i RD_{ij} = 4 \kappa_j V_j RD_{ji}.    (12)

Then, Eq. (11) can be transformed into

T_i^4 = \sum_{j=1}^{N_v} T_j^4 RD_{ij} + \sum_{k=1}^{N_s} T_k^4 RD_{ik}.    (13)

Note that Eq. (13) is a set of linear equations in the fourth power of each cell's temperature. Thus, the temperature of each cell can be obtained by solving Eq. (13), which constitutes the post-processing part. The flowchart of the simulation is shown in Fig. 2.
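For illustration, the post-processing step can be sketched as a simple fixed-point iteration on Eq. (13). This is a minimal host-side sketch, not the solver used in this work; RD, b and solveT4() are illustrative names, and the boundary terms are assumed to have been folded into b during ray tracing.

#include <algorithm>
#include <cmath>
#include <vector>

// Solve Eq. (13) for x_i = T_i^4 by fixed-point iteration x <- RD*x + b.
// RD holds the cell-to-cell factors RD_ij (row-major, Nv x Nv); b_i holds
// the known boundary contribution sum_k T_k^4 RD_ik. The iteration
// converges because each row sum of RD is below 1 when part of the energy
// escapes to the black boundaries.
std::vector<double> solveT4(const std::vector<double> &RD,
                            const std::vector<double> &b, int Nv)
{
    std::vector<double> x(b), xn(Nv);
    for (int it = 0; it < 10000; ++it) {
        double diff = 0.0;
        for (int i = 0; i < Nv; ++i) {
            double s = b[i];
            for (int j = 0; j < Nv; ++j) s += RD[i * Nv + j] * x[j];
            xn[i] = s;
            diff = std::max(diff, std::fabs(s - x[i]));
        }
        x.swap(xn);
        if (diff < 1e-10) break;   // converged; the cell temperature is T_i = pow(x[i], 0.25)
    }
    return x;
}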

3. GPU implementation

Because of the different design objectives, the architecture of the GPU is quite different from that of the CPU. CPUs need strong
versatility to handle data of multiple types and tasks contain-
ing massive logical statements, branches, switches and interrupts.
In contrast, GPUs mostly handle mutually independent large-scale
data of highly uniform types and the tasks of simple arithmetic
operations. They are designed such that more transistors are de-
voted to data processing rather than data caching and flow control.
Therefore, the GPU is especially well suited for addressing prob-
lems that can be expressed as data-parallel computations.
An NVIDIA GPU architecture is built around a scalable array
of multithreaded streaming multiprocessors. Each multiprocessor
manages, schedules and executes threads, which are the small-
est parallel processing units of the GPU, in groups of 32 parallel
threads called warps. Individual threads composing a warp start
together at the same program address, but they are free to branch
and execute independently. When a CUDA program invokes a ker-
nel grid, the blocks of the grid are distributed to multiprocessors.
The threads of a block execute concurrently on a multiprocessor,
and multiple blocks can execute concurrently on a multiprocessor
as well. The multiprocessor partitions the blocks into warps, and
each warp gets scheduled by a warp scheduler for execution. This
unique architecture for executing hundreds of threads concurrently
on a multiprocessor is called a single-instruction, multiple-thread
(SIMT) architecture [38]. Thus, to port an application from a CPU
to a GPU, the computing task should first be partly or totally parti-
tioned into multiple independent parallel tasks to match the SIMT
architecture. Then, these parallel tasks can be mapped to individual threads for parallel execution on a GPU.

Fig. 4. Number of photons absorbed by each cell for the cases in which all the photons are emitted by (a) cell (0, 0), (b) cell (9, 9), and (c) cell (19, 19).

As mentioned in Section 2.3, the entire process of the simulation can be divided into three parts: photon initialization, ray tracing and post-processing. For every single ray in the process, the first two parts are totally independent; no data exchange or synchronization is involved between any two rays. For this reason, as shown in Fig. 3, the initialization and the following tracing process of every photon are mapped to multiple threads in


Fig. 5. Flowchart of the improved GPU implementation.

a GPU kernel grid. However, the post-processing part involves solving a set of linear equations that must be executed serially. This part is performed by the CPU, which has a better serial computing capability.

The GPU implementations are executed on an NVIDIA GeForce GTX 1080Ti, which uses the NVIDIA Pascal architecture. It has a compute capability of 6.1 and 28 multiprocessors in total. Each multiprocessor has 4 warp schedulers, 128 CUDA cores (FP32 cores) for single-precision arithmetic operations, 4 FP64 cores for double-precision arithmetic operations and 32 special function units (SFUs) for single-precision floating-point transcendental functions. The equivalent CPU implementations for comparison are executed on an Intel Core I7-8750h running at 2.20 GHz, which has 6 cores in total, and an Intel Xeon Gold 5120 with turbo boost technology enabled running at 3.20 GHz, which has 14 cores in total.

4. Algorithm accelerations

4.1. Adopt binary search

When the Runge-Kutta algorithm is used to obtain the ray trajectory solutions in a GRIN medium, the number of iterations for tracing rays is directly related to the step size of the Runge-Kutta algorithm. Choosing a smaller step size may improve the accuracy but increases the computing time. The effects of the step size on the accuracy manifest in two aspects. One is that when updating the propagation length, the length of the curved ray trajectory is approximated as the distance between the positions of two adjacent iterations. A smaller step size may help to improve the computational accuracy of the total propagation length, thereby giving a more accurate ending position of the photon. However, it also increases the error accumulated over the larger number of iterations, which may ultimately decrease the accuracy of the method. The other aspect is that when the photon hits a boundary or is absorbed/scattered by the medium during an iteration, the position obtained by the Runge-Kutta algorithm is probably not the exact position. A smaller step size reduces the deviation between the exact and calculated positions.

In this paper, we first use a constant step size for the iterations. When the photon hits a boundary or is absorbed/scattered by the medium during an iteration, we return to the last iteration and adopt a binary search to recalculate the step size that corresponds to the exact position. Compared with adopting a smaller step size throughout the tracing process, the Runge-Kutta algorithm combined with the binary search can achieve the same accuracy but requires much less computing time owing to the reduction in the number of iterations. The relationship between the step size and the accuracy of the method is shown in Section 5.3.
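For illustration, the boundary-hit refinement can be sketched as follows. This is a minimal sketch, not the authors' code: rk4Step() is the routine sketched in Section 2.2, insideDomain() stands for an assumed problem-specific predicate (a path-length test would replace it for absorption and scattering events), and the 24 bisections are an illustrative tolerance.

__device__ void rk4Step(float3 &R, float3 &T, float dt);   // sketched in Section 2.2
__device__ bool insideDomain(float3 R);                    // assumed boundary predicate

// After a step of size dt carries the photon across a boundary, return to
// the state (R, T) of the last iteration and bisect the step size until the
// computed position matches the crossing point.
__device__ void refineBoundaryHit(float3 &R, float3 &T, float dt)
{
    float lo = 0.0f, hi = dt;               // the crossing lies in (0, dt]
    for (int i = 0; i < 24; ++i) {          // bisection: interval shrinks to ~dt/2^24
        float mid = 0.5f * (lo + hi);
        float3 Rm = R, Tm = T;
        rk4Step(Rm, Tm, mid);               // retry from the last state with a partial step
        if (insideDomain(Rm)) lo = mid;     // still inside: the crossing is later
        else                  hi = mid;     // outside: the crossing is earlier
    }
    rk4Step(R, T, hi);                      // advance to the refined crossing point
}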


Fig. 6. Computing time and theoretical occupancy of the GPU using (a) a double-precision floating-point type and (b) a single-precision floating-point type with different numbers of registers per thread for the three-dimensional case: Nr = 4 × 10⁵, Δt = 0.01 and ω = 0.

Fig. 7. Computing time and theoretical occupancy of the GPU using (a) a double-precision floating-point type and (b) a single-precision floating-point type with different numbers of registers per thread for the three-dimensional case: Nr = 4 × 10⁵, Δt = 0.01 and ω = 0.5.

Table 1
Occupancy experiment details of the GPU computing application using a single kernel.

Variable             | Achieved | Theoretical | Device Limit
Occupancy per SM
  Active Blocks      | -        | 8           | 32
  Active Warps       | 7.30     | 8           | 64
  Active Threads     | -        | 256         | 2048
  Occupancy          | 11.41%   | 12.50%      | 100.00%
Registers
  Registers/Thread   | -        | 176         | 255
  Registers/Block    | -        | 5632        | 65536
  Registers/SM       | -        | 45056       | 65536

4.2. Maximize utilization

Compared with a CPU, a GPU has a larger memory access delay. Thus, a warp that is being executed always gets stalled by accessing data. To hide latencies and keep the hardware busy, the multiprocessor of the GPU has a unique mechanism. The execution context, especially the registers for each warp processed by a multiprocessor, is maintained on-chip during the entire lifetime of the warp. When a warp is stalled, the warp scheduler selects another warp that has threads ready to execute their next instruction and issues the instruction to those threads. As thread blocks terminate, new blocks can then be launched. The number of registers and the amount of shared memory available on the multiprocessor are limited. Thus, the number of blocks and warps that can reside and be processed concurrently on a multiprocessor for a given kernel depends on the number of registers and the amount of shared memory used by the kernel per thread. For the GPU implementation in this paper, no data exchange is involved between any two threads. Therefore, the only limiting factor is the number of registers occupied by the kernel per thread.

Fewer resident warps always result in poor instruction issue efficiency and performance degradation because there are not enough eligible warps for the multiprocessors to hide the latency between independent instructions. There is also a device limit on the number of resident warps and resident blocks per multiprocessor. The metric determining how effectively the GPU is kept busy is the occupancy [39]. It is defined as

\mathrm{Occupancy} = \frac{N_{rw}}{N_{DL}},    (14)

where N_rw is the number of resident warps per multiprocessor, and N_DL is the device limit on the number of resident warps per multiprocessor. The detailed performance information of a GPU application can be inspected with CUDA analysis tools.

As shown in Fig. 3, the GPU kernel contains the initialization and tracing process of a photon. Every thread of the kernel grid needs a large number of registers to launch, and thus, the occupancy of the device is quite low. In this paper, the NVIDIA Nsight Visual Studio Edition Analysis Tools [40] are adopted to monitor the detailed performance information of the GPU computing application. The occupancy and other relevant information of the kernel for simulating the temperature field of the three-dimensional non-scattering GRIN medium are listed in Table 1. The detailed explanations of the variables in Table 1 are given in Ref. [40]. Note that the achieved occupancy in this case is only 11.41%, and there are

Fig. 8. Temperature fields of the three-dimensional GRIN medium for (a) ω = 0 and (b) ω = 0.5.

only 8 theoretical active warps per SM, which is far fewer than the maximum number per SM supported by the device, which is 64.

4.2.1. Reduce the granularity of the kernel

As mentioned above, the only factor limiting the number of resident warps per multiprocessor here is the number of registers occupied by the kernel per thread. Therefore, when the computing process cannot be simplified further, partitioning the kernel into several consecutive kernels and launching them in sequence is one of the effective ways to reduce the number of registers per thread allocated by the compiler, thereby increasing the occupancy of the device.


Fig. 9. Temperature field profiles of the three-dimensional GRIN medium at y = 0.5H for (a) ω = 0 and (b) ω = 0.5.

The random numbers in CUDA on a GPU can be generated by the CUDA random number generation library cuRAND [41]. Similar to the random numbers generated by a CPU, the random numbers generated by cuRAND are a pseudorandom sequence of numbers, and the function curand(curandState *state) for generating the random numbers needs an input argument of type curandState. The argument of type curandState acts as the initial state of the pseudorandom sequence of numbers. This means that if two threads use the same initial state and the state is not modified between the calls to curand(), the calls to curand() in the two threads will generate the same sequence of random numbers. In this case, the photons initialized by these two threads are identical. They will have the same ray trajectories and ending positions, which leads to an incorrect result. Thus, each thread must have its own curandState variable and set it up individually using the function curand_init() with different seeds.

For this reason, although the code of the initialization part is much shorter than the code of the ray tracing part, the initialization part contains the calls to curand() for computing the random emitting position, random direction and probabilistic propagation length of the photon. More registers are needed by the initialization part than by the ray tracing part.
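For illustration, a per-thread setup of this kind can be sketched as follows. This is a minimal sketch, not the authors' kernel: initPhotons() and its parameters are illustrative names, beta stands for the extinction coefficient used to sample the propagation length, and each thread receives a distinct subsequence of the generator (equivalently, a distinct seed).

#include <curand_kernel.h>

__global__ void initPhotons(unsigned long long seed, float beta,
                            float3 cellOrigin, float cellSize,
                            float3 *pos, float3 *dir, float *pathLen, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandState state;
    curand_init(seed, tid, 0, &state);   // a distinct random stream per thread

    // Uniform random emitting position within the emitting cell.
    pos[tid] = make_float3(cellOrigin.x + cellSize * curand_uniform(&state),
                           cellOrigin.y + cellSize * curand_uniform(&state),
                           cellOrigin.z + cellSize * curand_uniform(&state));

    // Isotropic random emission direction.
    float cosTheta = 1.0f - 2.0f * curand_uniform(&state);
    float sinTheta = sqrtf(1.0f - cosTheta * cosTheta);
    float phi = 6.2831853f * curand_uniform(&state);
    dir[tid] = make_float3(sinTheta * cosf(phi), sinTheta * sinf(phi), cosTheta);

    // Probabilistic propagation length sampled from Beer's law.
    pathLen[tid] = -logf(curand_uniform(&state)) / beta;
}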


Table 2
Occupancy experiment details of the GPU computing application using two partitioned kernels: (a) the kernel for photon initialization and (b) the kernel for tracing rays.

Variable             | Achieved(a) | Achieved(b) | Theoretical(a) | Theoretical(b) | Device Limit
Occupancy per SM
  Active Blocks      | -           | -           | 12             | 16             | 32
  Active Warps       | 11.89       | 14.62       | 12             | 16             | 64
  Active Threads     | -           | -           | 384            | 512            | 2048
  Occupancy          | 18.58%      | 22.84%      | 18.75%         | 25.00%         | 100.00%
Registers
  Registers/Thread   | -           | -           | 150            | 104            | 255
  Registers/Block    | -           | -           | 4864           | 3328           | 65536
  Registers/SM       | -           | -           | 58368          | 53248          | 65536

According to the flowchart of the MC ray tracing method in Fig. 2, the kernel for the photon initialization and ray tracing of a single ray can be partitioned into two consecutive kernels: one kernel for photon initialization, and the other for tracing the ray. The propagation process of each ray is continuous, and the propagation trajectories of the rays differ from each other. Thus, the kernel for tracing rays executed by each thread is a continuous series of computations and cannot be partitioned further. In contrast, the initialization processes of the photons in each thread are identical, and the kernel for photon initialization could be further partitioned into several kernels, which may require fewer registers to launch. However, continuing to reduce the granularity of the kernel can result in poor performance. First, every step in the initialization part needs the generation of random numbers. Every partitioned kernel needs to hold a local variable of type curandState and set it up individually using the function curand_init(). In this case, the number of registers saved by further partitioning is quite low and sometimes zero. Second, partitioning one kernel into several also adds the overhead of extra kernel invocations and global memory traffic, because threads belonging to different blocks must share data through global memory. Thus, the total performance deteriorates. Third, the computing time of the kernel for photon initialization, which is negligible in our simulation, is much shorter than that of the kernel for tracing rays.

For all these reasons, the kernel is only partitioned into two consecutive kernels: the kernel for photon initialization and the kernel for tracing the ray. The occupancy, the registers needed per thread and other relevant information of the two partitioned kernels for the same case as Table 1 are listed in Table 2. Compared with the performance information of the GPU computing application using a single kernel listed in Table 1, the number of registers needed per thread of the two partitioned kernels decreases from 176 to 150 and 104, respectively. The number of theoretical active warps per SM of the two partitioned kernels increases from 8 to 12 and 16, respectively. The achieved occupancy of the two partitioned kernels increases from 11.41% to 18.58% and 22.84%, respectively. The corresponding computing times are shown in Section 5.1.

4.2.2. Balance the workload of each thread

As shown in Tables 1 and 2, the achieved occupancy is lower than the theoretical occupancy. This is because the theoretical number of active warps is not maintained for the full active time of the multiprocessor. Under the SIMT architecture of a GPU, a warp executes one common instruction at a time. If the threads of a warp diverge to different execution paths via a data-dependent conditional branch, the warp executes each branch path taken and disables the threads that are not on the executing path. This is called "branch divergence" and occurs only within a warp [38]. Different warps execute independently regardless of whether they are executing common or disjoint code paths. If the execution times of the warps within a block differ, there will be fewer warps still executing while the others have already exited. This problem, which happens to the blocks within a kernel grid as well, is known as the "tail effect" and indicates an imbalance of the workload [40]. Both the "branch divergence" and the "tail effect" lead to a decrease in the achieved occupancy, and thus, the GPU performance deteriorates.

When implementing the MC ray tracing method to simulate the temperature field of a GRIN medium on a GPU, the initialization and tracing process of a single ray is computed by a single thread of the GPU. The initialization process of each photon is identical. However, the initial position, initial direction, probabilistic propagation length and trajectory differ between photons. Thus, the tracing process of each photon and the corresponding computing time of each thread differ. As shown in Fig. 2, there are many data-dependent conditional branches in the kernel for tracing rays that judge whether the photon is beyond the boundaries, absorbed or scattered. "Branch divergence" occurs within each warp, which makes the computing times of the threads differ further. Thus, the "tail effect" occurs within each block and within the kernel grid. The workload is greatly unbalanced because of these differences between the propagation of the photons. Partitioning the single kernel into two consecutive kernels, as mentioned in Section 4.2.1, helps to reduce the differences between the computing times of the threads, but there is still room for improvement.

The total number of emitted rays of the MC simulation is enormous, and the global memory of the GPU is usually insufficient for tracing all the rays with a single kernel launch. Therefore, the rays need to be sorted first and then traced in several kernel launches. On the basis of the statistics on the number of photons absorbed by each cell, a high degree of similarity is found in the trajectories of the photons emitted by the same cell. Fig. 4 shows the number of photons absorbed by each cell in the two-dimensional GRIN medium for the cases in which all the photons are emitted by (a) cell(0, 0), (b) cell(9, 9) and (c) cell(19, 19), respectively. Here, cell(i, j) represents the cell with x ∈ [i·l_c, (i + 1)·l_c) and y ∈ [j·l_c, (j + 1)·l_c), whose cell index is j·(H/l_c) + i + 1, where l_c is the side length of the square cells. It can be seen that the ending positions of the photons emitted by the same cell have significant similarity, which also indicates a high similarity in the corresponding trajectories.

On the basis of the above results, the threads of a warp have a lower probability of diverging to different execution paths if they trace photons emitted by the same cell. To further reduce the differences between threads, the rays are therefore divided according to their emitting cells, and each kernel launch traces the rays emitted by one cell at a time, as sketched below. The flowchart of the improved GPU implementation is shown in Fig. 5.
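For illustration, the per-cell launch strategy can be sketched on the host side as follows. This is a minimal sketch, not the authors' driver code: initPhotonsForCell() and traceRays() are hypothetical kernels whose bodies would chain the initialization and tracing sketches above, and the buffer parameters are illustrative.

#include <cuda_runtime.h>

// Hypothetical kernels: one initializes the photons of a single emitting
// cell, the other traces them and tallies absorption events.
__global__ void initPhotonsForCell(int cell, unsigned long long seed,
                                   float3 *pos, float3 *dir, float *len, int n);
__global__ void traceRays(const float3 *pos, const float3 *dir,
                          const float *len, unsigned int *tally, int n);

// Each pair of launches handles only the rays emitted by one cell, so the
// threads of a warp follow similar trajectories and diverge less often.
void traceAllCells(int numCells, int raysPerCell, unsigned long long seed,
                   float3 *pos, float3 *dir, float *len, unsigned int *tally)
{
    const int block = 256;
    const int grid = (raysPerCell + block - 1) / block;
    for (int cell = 0; cell < numCells; ++cell) {
        initPhotonsForCell<<<grid, block>>>(cell, seed, pos, dir, len, raysPerCell);
        traceRays<<<grid, block>>>(pos, dir, len, tally, raysPerCell);
    }
    cudaDeviceSynchronize();   // all tallies ready for post-processing
}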


4.2.3. Utilize launch bounds

Another way to further reduce the number of registers per thread allocated by the compiler and to increase the occupancy of the GPU is to provide additional information to the compiler using the launch bounds kernel definition qualifier before the definition of a kernel function [38]:

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)

maxThreadsPerBlock specifies the maximum number of threads per block, and minBlocksPerMultiprocessor specifies the desired minimum number of resident blocks per multiprocessor. By specifying minBlocksPerMultiprocessor, the number of registers per thread and the occupancy can be controlled manually. Note that the total number of registers per multiprocessor is limited. Continuing to increase the number of resident blocks per multiprocessor when all the registers are allocated inevitably leads to register spilling. In this case, local memory accesses occur, which have a higher latency and a lower bandwidth than register accesses. In addition, the compiler then automatically reduces the number of registers per thread by executing a higher number of instructions instead. Therefore, the actual performance gain achieved by increasing the occupancy using the launch bounds qualifier is unpredictable. The best-performing balance between the occupancy and the number of registers per thread needs to be determined experimentally by specifying the launch bounds qualifier with different arguments to the kernel.
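For illustration, the qualifier is applied to a tracing kernel as follows. The pair (256, 4), at most 256 threads per block and at least 4 resident blocks per multiprocessor, is purely illustrative and, as noted above, must be tuned experimentally.

__global__ void __launch_bounds__(256, 4)
traceRaysBounded(const float3 *pos, const float3 *dir,
                 const float *len, unsigned int *tally, int n)
{
    // kernel body as in the tracing kernel; the compiler now caps register
    // usage so that 4 blocks of 256 threads can be resident per multiprocessor
}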
per thread of the kernel for photon initialization falls from 150
4.3. Intrinsic functions

The GPU supports all the C/C++ standard library mathematical functions. In addition, there is another type of function that can only be used in device code executing on a GPU. These functions, termed intrinsic functions, are less accurate but faster versions of some of the C/C++ standard library mathematical functions. In contrast to the C/C++ standard library mathematical functions, the intrinsic functions are executed by the SFUs of the multiprocessors [42]. The numbers of intrinsic functions for single-precision and double-precision floating-point numbers are different. Users have more choices of single-precision floating-point intrinsic functions, including all the necessary functions used in our code, while there are only 7 double-precision floating-point intrinsic functions, which are insufficient for the simulation in this paper. Therefore, only the single-precision floating-point cases are accelerated using intrinsic functions.

Proper use of intrinsic functions can reduce the computing time at the cost of losing precision. The precision loss of the corresponding functions can sometimes be fatal to the program. In the actual test, we found that if intrinsic functions are used in the ray tracing part, the program occasionally gets stuck in an endless loop. This is because the calculation error of the binary search using intrinsic functions for obtaining the exact ending positions is too large, and thus, the end conditions of some data-dependent conditional branches sometimes cannot be satisfied. Therefore, the intrinsic functions are used only in the iterations of the Runge-Kutta algorithm, which are frequently executed throughout the simulation.
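For illustration, swapping an accurate operation for an intrinsic inside the Runge-Kutta iteration only can be sketched as follows; the function name dOfFast() is illustrative, and the standard-library calls used by the binary search are deliberately left untouched, for the reason given above.

// Variant of the dOf() sketch from Section 2.2: the accurate division is
// replaced by the faster, less accurate __fdividef() intrinsic.
__device__ float3 dOfFast(float3 R)
{
    float s = 1.0f + R.x * R.x + R.y * R.y + R.z * R.z;
    float c = __fdividef(-8.0f, s * s * s);   // intrinsic, reduced accuracy
    return make_float3(c * R.x, c * R.y, c * R.z);
}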
4.4. Single-precision vs. double-precision

When programming with C/C++, the computational costs of floating-point arithmetic operations on a CPU using a single-precision floating-point (float) type and a double-precision floating-point (double) type are almost identical. The only major difference between float and double arithmetic operations on a CPU is the size of the memory space required by the variables, which is almost always sufficient on modern desktop computers. However, when calling some single-precision floating-point transcendental functions defined in the C++ library math.h, such as sinf() and cosf(), the arguments of float type are converted to double type first and then computed by the corresponding double-precision floating-point transcendental functions, such as sin() and cos(), instead. Finally, the result returned by the double transcendental function is converted back to float type as the return value of the corresponding float transcendental function. These data type conversions greatly increase the computing time of single-precision floating-point operations on a CPU. Thus, for a program that frequently calls the transcendental functions, the computing time is slightly shorter using a double type rather than a float type.

However, when programming with CUDA C on a GPU, the computing time of the program can be much shorter using a single-precision floating-point type rather than a double-precision floating-point type. One reason is that local variables of float type need less memory space than local variables of double type. Thus, if a kernel uses a single-precision floating-point type instead of a double-precision floating-point type, the number of registers needed by each thread is much smaller, and the application can have a higher occupancy when the launch bounds qualifier is not involved. The occupancy experiment details of the two partitioned kernels using a single-precision floating-point type are listed in Table 3. Compared with Table 2, the number of registers per thread of the kernel for photon initialization falls from 150 to 135, and the number of registers per thread of the kernel for tracing rays falls from 104 to 64. The number of theoretical active warps and the theoretical occupancy of the kernel for tracing rays increase from 16 and 25% to 32 and 50%, respectively. Although the occupancy of the kernel for photon initialization remains the same, the number of registers used per thread decreases, and thus, the performance of the kernel can benefit more from the use of the launch bounds qualifier.

The other reason is related to the features and technical specifications of the GPU used. The single-precision and double-precision floating-point computing capabilities vary between GPUs. For example, the GPU used in this paper, an NVIDIA GeForce GTX 1080Ti, has 128 CUDA cores (FP32 cores) per multiprocessor for single-precision arithmetic operations but only 4 FP64 cores per multiprocessor for double-precision arithmetic operations, a 32-fold decrease. If there is a high requirement for computing accuracy, an NVIDIA Tesla series GPU, which has a higher double-precision floating-point computing capability, is more suitable. In addition, the use of intrinsic functions, which are executed by the SFUs of the multiprocessors, can provide an additional performance boost for single-precision floating-point GPU applications. Therefore, a single-precision floating-point type can be used instead of a double-precision floating-point type to greatly reduce the computing time of GPU applications if the associated decrease in accuracy is tolerable.

5. Results and discussions

5.1. Speedups of the three-dimensional GRIN medium cases

To validate the effectiveness of the algorithm accelerations, a case of simulating the temperature field of a three-dimensional GRIN medium is tested. Here, the number of rays emitted by each cell Nr is 4 × 10⁵, the step size of the Runge-Kutta algorithm Δt is 0.01, and the single scattering albedo ω is 0. The speedup is defined as the ratio of the computing time of the CPU implementation to that of the GPU implementation. The computing time of the CPU implementation is slightly larger using a single-precision floating-point type rather than a double-precision floating-point type because there are many transcendental functions in the kernel, so all the speedups are calculated with the computing times of the CPU implementations using a double-precision floating-point type.


Table 3
Occupancy experiment details of the GPU computing application using two partitioned kernels with a single-precision floating-point type: (a) the kernel for photon initialization and (b) the kernel for tracing rays.

Variable             | Achieved(a) | Achieved(b) | Theoretical(a) | Theoretical(b) | Device Limit
Occupancy per SM
  Active Blocks      | -           | -           | 6              | 16             | 32
  Active Warps       | 11.89       | 28.28       | 12             | 32             | 64
  Active Threads     | -           | -           | 384            | 1024           | 2048
  Occupancy          | 18.58%      | 44.19%      | 18.75%         | 50.00%         | 100.00%
Registers
  Registers/Thread   | -           | -           | 135            | 64             | 255
  Registers/Block    | -           | -           | 8704           | 4096           | 65536
  Registers/SM       | -           | -           | 52224          | 65536          | 65536

In this case, the computing time of the CPU (Intel Xeon Gold 5120) implementation for comparison using a double-precision floating-point type is 51348.82 s. The computing time of the GPU implementation that uses a single kernel with a double-precision floating-point type is 32905.27 s, and the speedup is only 1.56×. After partitioning the single kernel into two consecutive kernels, the computing time decreases to 21394.58 s, and the speedup increases to 2.40×. The computing time of the GPU implementation that uses a single kernel with a single-precision floating-point type is 7134.63 s, and the speedup is 7.20×. After partitioning the single kernel into two consecutive kernels, the computing time decreases to 4053.50 s, and the speedup increases to 12.67×.

The computing time can be further reduced by specifying the launch bounds qualifier to control the number of registers per thread allocated by the compiler and the occupancy of the GPU. As shown in Fig. 6(a), the computing time of the GPU implementation with a double-precision floating-point type is minimized when the number of registers per thread is 48. The corresponding computing time is 13523.2 s, and the speedup reaches 3.80×. As shown in Fig. 6(b), the computing time of the GPU implementation with a single-precision floating-point type is minimized when the number of registers per thread is 40. The corresponding computing time is 3521.01 s, and the speedup reaches 14.58×. The use of intrinsic functions has no effect on the number of registers per thread allocated by the compiler or on the occupancy of the GPU. With the use of intrinsic functions, the computing time of the GPU implementation with a single-precision floating-point type can be further decreased to 1667.29 s, and the corresponding speedup is 30.80×.

The case of the three-dimensional GRIN medium in which Nr = 4 × 10⁵, Δt = 0.001 and ω = 0 is also tested as a comparison. The computing time is 99101.3 s for the CPU (Intel Xeon Gold 5120) implementation that uses a double-precision floating-point type and 22059.3 s for the improved GPU implementation that uses a double-precision floating-point type, corresponding to a speedup of 4.49×. The computing time of the improved GPU implementation that uses a single-precision floating-point type is 5832.68 s, and the corresponding speedup is 16.99×. With the use of intrinsic functions, the computing time of the GPU implementation with a single-precision floating-point type can be further decreased to 2782.64 s, and the corresponding speedup is 35.61×.

For the case of the three-dimensional GRIN medium in which Nr = 4 × 10⁵, Δt = 0.01 and ω = 0.5, the computing time of the CPU (Intel Xeon Gold 5120) implementation that uses a double-precision floating-point type is 91876.1 s. As shown in Fig. 7(a), the computing time of the GPU implementation with a double-precision floating-point type is minimized when the number of registers per thread is 48. The corresponding computing time is 17957.5 s, and the speedup reaches 5.12×. As shown in Fig. 7(b), the computing time of the GPU implementation with a single-precision floating-point type is minimized when the number of registers per thread is 40. The corresponding computing time is 5026.25 s, and the speedup reaches 18.28×. With the use of intrinsic functions, the computing time of the GPU implementation with a single-precision floating-point type can be further decreased to 2623.19 s, and the corresponding speedup is 35.02×.

The temperature fields of the three-dimensional GRIN medium obtained by the GPU implementations for ω = 0 and ω = 0.5 are shown in Fig. 8. The corresponding temperature field profiles at y = 0.5H are shown in Fig. 9.

5.2. Speedups of the two-dimensional GRIN medium cases

The temperature field of a two-dimensional GRIN medium is simulated to compare the performances of the CPU and GPU implementations. To obtain the best performance of the GPU implementations, the same experiments as in the three-dimensional cases are conducted to seek the best-performing balance between the occupancy and the number of registers per thread by specifying the launch bounds qualifier with different arguments to the kernel. For ω = 0 and ω = 0.5, the computing times of the GPU implementations using double-precision and single-precision floating-point types are minimized when the number of registers per thread is 40. The use of intrinsic functions has no effect on the number of registers per thread allocated by the compiler or on the occupancy of the GPU. These findings are similar to the results of the launch bounds experiments for the three-dimensional medium cases in Section 5.1.

The scaling of the GPU implementation with problem size is studied by simulating with different numbers of rays per cell. As shown in Fig. 10(a), the speedup of the GPU implementation remains almost unchanged as the number of rays per cell increases, demonstrating that the speedup plateau of the GPU implementation is reached and the utilization of the GPU hardware is maximized. This is because the number of rays that can be traced concurrently by the GPU is limited. As thread blocks terminate, new blocks can then be launched. Therefore, the large number of rays per cell are actually traced block by block, which is similar to the execution on the CPU. For Δt = 0.01 and ω = 0, the speedup reaches 40.71× using a single-precision floating-point type, 43.13× using a single-precision floating-point type with intrinsic functions and 21.69× using a double-precision floating-point type compared with an Intel Core I7-8750h running at 2.20 GHz. For another CPU, an Intel Xeon Gold 5120 running at 3.20 GHz, the speedup of the GPU implementations reaches 23.17× using a single-precision floating-point type, 24.60× using a single-precision floating-point type with intrinsic functions and 12.35× using a double-precision floating-point type.

When decreasing the step size of the Runge-Kutta algorithm, as shown in Fig. 10(b), the speedups decrease first and then increase. Compared with the Intel Core I7-8750h running at 2.20 GHz, the speedups reach a maximum at a step size of 0.01. For Nr = 4 × 10⁵, Δt = 0.01 and ω = 0, the speedup reaches 37.61× using a single-precision floating-point type, 39.93× using a single-precision
precision floating-point type, 39.93 × using a single-precision


Fig. 10. Speedups of the GPU implementations with (a) different numbers of rays per cell for the two-dimensional cases: Δt = 0.01 and ω = 0, (b) different step sizes of the Runge-Kutta algorithm for the two-dimensional cases: Nr = 4 × 10⁵ and ω = 0, and (c) different single scattering albedos for the two-dimensional cases: Nr = 4 × 10⁵ and Δt = 0.01. Solid lines: compared with Intel Xeon Gold 5120 @ 3.20 GHz; dashed lines: compared with Intel Core I7-8750h @ 2.20 GHz; black: double-precision floating-point type; blue: single-precision floating-point type; red: single-precision floating-point type with the use of intrinsic functions.

floating-point type with intrinsic functions and 20.59× using a double-precision floating-point type. Compared with the Intel Xeon Gold 5120 running at 3.20 GHz, the speedups reach a maximum at a step size of 0.00025. For Nr = 4 × 10⁵, Δt = 0.00025 and ω = 0, the speedup reaches 23.08× using a single-precision floating-point type, 24.37× using a single-precision floating-point type with intrinsic functions and 12.20× using a double-precision floating-point type.

When ω > 0, the tracing process contains extra data-dependent conditional branches for scattering. The larger the single scattering albedo is, the more likely the photon is to be scattered by the medium. The threads of the same warp have a higher probability of diverging to different execution paths via the data-dependent conditional branches, and the difference in the execution times of the warps within a block increases. The "branch divergence" and the "tail effect" are more severe on the GPU, which leads to a larger gap between the achieved occupancy and the theoretical occupancy. Moreover, the kernel for tracing rays needs more registers to calculate the scattering direction and the new probabilistic propagation length. Thus, the number of blocks and warps that can reside and be processed concurrently on a multiprocessor decreases. The theoretical occupancy is therefore reduced. Although we can manually increase the occupancy by using the launch bounds qualifier, it comes at the cost of more local memory accesses and a higher number of executed instructions. Thus, the performance of the GPU implementations degenerates as the single scattering albedo increases, which matches the results shown in Fig. 10(c).

For Nr = 4 × 10⁵ and Δt = 0.01, the speedups reach a minimum when the single scattering albedo is 0.8. Compared with the Intel Core I7-8750h running at 2.20 GHz, the corresponding speedup is


25.25× using a single-precision floating-point type, 26.74× using a single-precision floating-point type with intrinsic functions and 14.49× using a double-precision floating-point type. Compared with the Intel Xeon Gold 5120 running at 3.20 GHz, the corresponding speedup is 14.29× using a single-precision floating-point type, 15.13× using a single-precision floating-point type with intrinsic functions and 8.20× using a double-precision floating-point type. Clearly, even the lowest speedup provided by the GPU is considerable.

5.3. Errors of the two-dimensional GRIN medium cases

The correctness of the results is validated by comparison with the benchmark numerical solutions for radiative heat transfer in a two-dimensional GRIN medium provided in Ref. [16]. For the physical model of the two-dimensional GRIN medium used in this paper, Ref. [16] provides the benchmark temperatures along the y-axis at the position x/H = 0.325. Thus, to indicate the overall accuracy of the implementations, we define the integrated mean relative error δ_int of temperature as

\delta_{\mathrm{int}} = \frac{\int_0^H |T - T_B|\, dy}{\int_0^H |T_B|\, dy} \times 100\%    (15)

where T is the temperature of the medium along the y-axis at the position x/H = 0.325 obtained by the implementations, and T_B is the corresponding benchmark temperature.
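For illustration, Eq. (15) can be evaluated with the trapezoidal rule from sampled data. This is a minimal sketch with illustrative names: T[i] and TB[i] are Ny temperatures sampled at a uniform spacing dy along x/H = 0.325.

#include <cmath>

double integratedMeanRelativeError(const double *T, const double *TB,
                                   int Ny, double dy)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i + 1 < Ny; ++i) {
        // trapezoid contributions to the numerator and denominator of Eq. (15)
        num += 0.5 * dy * (std::fabs(T[i] - TB[i]) + std::fabs(T[i + 1] - TB[i + 1]));
        den += 0.5 * dy * (std::fabs(TB[i]) + std::fabs(TB[i + 1]));
    }
    return 100.0 * num / den;   // percent, as in Eq. (15)
}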
Fig. 11 shows the integrated mean relative errors of the CPU/GPU implementations with (a) different numbers of rays per cell for the two-dimensional cases: Δt = 0.01 and ω = 0 and (b) different step sizes of the Runge-Kutta algorithm for the two-dimensional cases: Nr = 4 × 10⁵ and ω = 0. As shown in Fig. 11(a), the integrated mean relative errors of the CPU and GPU implementations remain almost unchanged as the number of rays per cell increases, which indicates that the integrated mean relative error of the implementations is independent of the number of rays per cell. Only for the single-precision floating-point GPU implementation with the use of intrinsic functions does the integrated mean relative error slightly decrease at Nr = 4 × 10⁵. As shown in Fig. 11(b), the integrated mean relative errors of the CPU and GPU implementations slightly decrease as the step size of the Runge-Kutta algorithm decreases. However, because of the large increase in the computing time, the slight decrease in the integrated mean relative error brought by decreasing the step size of the Runge-Kutta algorithm is unprofitable. For this reason, the implementation with Nr = 4 × 10⁵ and Δt = 0.01 is sufficient for obtaining an accurate result while keeping a relatively short computing time. In addition, Fig. 11(a) and (b) show that the integrated mean relative errors of the GPU implementations and the equivalent CPU implementations are identical. For the single-precision floating-point GPU implementations, the use of intrinsic functions has almost no effect on the integrated mean relative errors: the errors with and without the use of intrinsic functions are almost identical.

Fig. 11. Integrated mean relative errors of the CPU/GPU implementations with (a) different numbers of rays per cell for the two-dimensional cases: Δt = 0.01 and ω = 0 and (b) different step sizes of the Runge-Kutta algorithm for the two-dimensional cases: Nr = 4 × 10⁵ and ω = 0.

For the two-dimensional case of Nr = 4 × 10⁵, Δt = 0.01 and ω = 0, the integrated mean relative errors of the double-precision floating-point implementations are 0.062 for the Intel Xeon Gold 5120, 0.059 for the Intel Core I7-8750h and 0.075 for the GPU. The integrated mean relative errors of the single-precision floating-point GPU implementation without and with the use of intrinsic functions are 0.206 and 0.188, respectively.

For the two-dimensional case of Nr = 4 × 10⁵, Δt = 0.01 and ω = 0.5, the integrated mean relative errors of the double-precision floating-point implementations are 0.235 for the Intel Xeon Gold 5120, 0.243 for the Intel Core I7-8750h and 0.245 for the GPU. The integrated mean relative errors of the single-precision floating-point GPU implementation without and with the use of intrinsic functions are 0.273 and 0.263, respectively. Compared with the case of Nr = 4 × 10⁵, Δt = 0.01 and ω = 0, the integrated mean relative errors increase because of the longer execution paths and the extra calculations for obtaining the scattering information. In addition, the same conclusions as mentioned above can be drawn.

The temperature fields of the two-dimensional GRIN medium obtained by the GPU implementations for ω = 0 and ω = 0.5 are shown in Fig. 12.

5.4. Speedups compared with multi-core CPUs

Parallel computing can also be implemented on computers with multi-core CPUs. Because of the different design objectives, CPUs

14
J. Shao, K. Zhu and Y. Huang Journal of Quantitative Spectroscopy & Radiative Transfer 269 (2021) 107680

Fig. 12. Temperature fields of the two-dimensional GRIN medium for (a) ω = 0 and (b) ω = 0.5.

need strong versatility to handle data of multiple types and tasks containing logical statements, branches, switches and interrupts. Thus, they are composed of a few high-performance cores. Many fewer threads can execute concurrently on a CPU than on a GPU. Even with hyper-threading technology, which allows the operating system to address two virtual or logical cores for each processor core that is physically present, most of the CPUs only have up to dozens of threads that can execute concurrently. One of the CPUs used in this paper, Intel Core I7-8750h running at 2.20 GHz, has 6 cores and allows up to 12 threads to execute concurrently. The other CPU used, Intel Xeon Gold 5120 with turbo boost technology enabled running at 3.20 GHz, has 14 cores and allows up to 28 threads to execute concurrently.
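These thread counts can be checked programmatically. The short host program below is an illustrative sketch added here, not part of the paper's code; it queries the CPU's logical thread count and the maximum number of threads that can be resident on the GPU at one time.

```cpp
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

int main()
{
    // Logical CPU threads (cores x hyper-threads), e.g., 12 on a Core i7-8750h.
    unsigned cpuThreads = std::thread::hardware_concurrency();

    // Threads that can be resident on the GPU at once, across multiprocessors.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    long gpuThreads = static_cast<long>(prop.multiProcessorCount) *
                      prop.maxThreadsPerMultiProcessor;

    std::printf("CPU hardware threads: %u\n", cpuThreads);
    std::printf("GPU resident threads: %ld (%d SMs x %d threads/SM)\n", gpuThreads,
                prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```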
For the two-dimensional case of Nr = 4 × 10^5, Δt = 0.01 and ω = 0, the speedups of the GPU implementations compared with the CPU implementations using different numbers of threads are shown in Fig. 13(a) and (b). The single-precision floating-point GPU implementations achieve a severalfold speedup even compared with the CPU implementations using all the CPU cores. Compared with Intel Xeon Gold 5120 @ 3.20 GHz using all 14 cores (28 threads) for computing, the speedups of the single-precision floating-point GPU implementations without and with the use of intrinsic functions are 1.65× and 1.75×, respectively. Compared with Intel Core I7-8750h @ 2.20 GHz using all 6 cores (12 threads) for computing, the speedups of the single-precision floating-point GPU implementations without and with the use of intrinsic functions are 5.32× and 5.65×, respectively. However, the speedups of the double-precision floating-point GPU implementations compared with the multi-core CPU implementations are unsatisfactory: only 0.90× and 2.92× against the CPU implementations using the above two CPUs, respectively. This is because the GPU used in this paper, NVIDIA GeForce GTX 1080Ti, has 128 CUDA cores (FP32 cores) per multiprocessor for single-precision arithmetic operations but only 4 FP64 cores per multiprocessor for double-precision arithmetic operations, a 32-fold decrease. If double-precision arithmetic operations are necessary for the study, a better choice would be a GPU with a higher double-precision floating-point arithmetic capability, such as the NVIDIA Tesla series.

Fig. 13. Speedups of the GPU implementations compared with the multi-thread CPU implementations using (a) Intel Xeon Gold 5120 @ 3.20 GHz and (b) Intel Core I7-8750h @ 2.20 GHz for the two-dimensional case: Nr = 4 × 10^5, Δt = 0.01 and ω = 0.
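The 32-fold difference in arithmetic units can be observed directly with a micro-benchmark. The sketch below is our illustration, not the benchmark used in the paper: it times the same FMA-bound kernel instantiated for float and for double using CUDA events, with arbitrary assumed launch and iteration counts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// FMA-bound kernel: the loop compiles to dependent fused multiply-adds, so
// its runtime tracks the FP32 or FP64 arithmetic throughput of the device.
template <typename T>
__global__ void fmaLoop(T* out, int iters)
{
    T x = static_cast<T>(threadIdx.x) * static_cast<T>(1e-4);
    T y = static_cast<T>(1);
    for (int i = 0; i < iters; ++i)
        y = y * x + y;
    out[blockIdx.x * blockDim.x + threadIdx.x] = y;  // keep the result live
}

template <typename T>
float timeMs(T* buf, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fmaLoop<<<1024, 256>>>(buf, iters);  // assumed launch configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1024 * 256, iters = 1 << 14;
    float* bufF;
    double* bufD;
    cudaMalloc(&bufF, n * sizeof(float));
    cudaMalloc(&bufD, n * sizeof(double));
    std::printf("FP32: %.2f ms, FP64: %.2f ms\n",
                timeMs(bufF, iters), timeMs(bufD, iters));
    cudaFree(bufF);
    cudaFree(bufD);
    return 0;
}
```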
For the three-dimensional case of Nr = 4 × 10^5, Δt = 0.01 and ω = 0, the speedups of the GPU implementations compared with the CPU implementations using different numbers of threads are shown in Fig. 14. Because of the greater workload imbalance between threads for the three-dimensional case, the corresponding speedups are slightly lower than those for the two-dimensional case. However, the use of intrinsic functions, which are unique to the GPU, turns the tide: the corresponding GPU implementation achieves a larger speedup. Compared with Intel Xeon Gold 5120 @ 3.20 GHz using all 14 cores (28 threads) for computing, the speedups of the single-precision floating-point GPU implementations without and with the use of intrinsic functions are 0.98× and 2.07×, respectively. The corresponding speedup of the double-precision floating-point GPU implementation is only 0.26×, as explained above.

Fig. 14. Speedups of the GPU implementations compared with the multi-thread CPU implementations using Intel Xeon Gold 5120 @ 3.20 GHz for the three-dimensional case: Nr = 4 × 10^5, Δt = 0.01 and ω = 0.

6. Conclusions

We have developed a fast GPU Monte Carlo implementation for simulating radiative heat transfer in GRIN media with CUDA, a general-purpose parallel computing platform and programming model for NVIDIA GPUs. The method is first modified by combining it with the binary search algorithm to obtain the ending positions of the rays instead of using a smaller step size of the Runge-Kutta algorithm. Then, the efforts are focused on improving the performance of the GPU implementation by optimizing the code based on the GPU architecture. The utilization of the GPU hardware is maximized by reducing the granularity of the kernel and specifying the launch bounds qualifier with the best-performing arguments to the kernel. Furthermore, the warp inactivity caused by
the “branch divergence” and the “tail effect” has been substan- [9] White IM, Fan X. On the performance quantification of resonant refractive in-
tially reduced by grouping the rays according to their emitting dex sensors. Opt Express 2008;16:1020–8.
[10] Liu LH. Meshless method for radiation heat transfer in graded index medium.
cell and tracing the rays emitted by the same cell at each kernel Int J Heat Mass Transf 2006;49:219–29.
launch. The speedups of the GPU implementations using a single- [11] Liu LH, Zhang L, Tan HP. Finite element method for radiation heat transfer
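A minimal sketch of this per-cell launch scheme follows; the kernel and the names nCells, raysPerCell, and the block size are hypothetical placeholders for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: traces only the rays emitted by one cell of the GRIN
// medium, so all threads of a launch start from the same cell and follow
// execution paths of similar length.
__global__ void traceRaysOfCell(int cell, int raysPerCell)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= raysPerCell) return;
    // ... Runge-Kutta tracing of ray `ray` emitted from cell `cell` ...
}

// Host side: one kernel launch per emitting cell. Uniform workloads within a
// launch limit branch divergence and the tail effect.
void traceAllCells(int nCells, int raysPerCell)
{
    const int threadsPerBlock = 128;  // assumed tuning value
    const int blocks = (raysPerCell + threadsPerBlock - 1) / threadsPerBlock;
    for (int cell = 0; cell < nCells; ++cell)
        traceRaysOfCell<<<blocks, threadsPerBlock>>>(cell, raysPerCell);
    cudaDeviceSynchronize();  // wait for all per-cell launches to finish
}
```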
The speedups of the GPU implementations using a single-precision floating-point type and a double-precision floating-point type are compared. For the GPU used in this paper, NVIDIA GeForce GTX 1080Ti, the speedups of the GPU implementations are much higher using a single-precision floating-point type than a double-precision floating-point type, while the accuracy decreases slightly. The speedups of the single-precision floating-point GPU implementations are further increased by using intrinsic functions, which can only be used in the device code, in the iterations of the Runge-Kutta algorithm.
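As an example of the kind of device-only intrinsic that can replace a standard math call inside the Runge-Kutta iterations, the helper below uses __fdividef, CUDA's fast single-precision division intrinsic with slightly reduced accuracy; the helper itself is an illustrative sketch, not a function taken from the implementation.

```cuda
// Renormalizing a ray direction vector in device code. __fdividef trades a
// small amount of accuracy for speed relative to the '/' operator, matching
// the slight accuracy loss reported for the intrinsic-function variant.
__device__ __forceinline__ float3 normalizeDirection(float3 v)
{
    float inv = __fdividef(1.0f, sqrtf(v.x * v.x + v.y * v.y + v.z * v.z));
    return make_float3(v.x * inv, v.y * inv, v.z * inv);
}
```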
The results show that the GPU implementations proposed in this paper can produce physically accurate results while achieving substantial speedups against equivalent CPU implementations using a single core or multiple cores. For the two-dimensional case, the speedup reaches 43.13× against the single-core CPU implementation and 5.65× against the CPU implementation using 6 CPU cores (12 threads). For the three-dimensional case, the speedup reaches 35.61× against the single-core CPU implementation and 2.07× against the CPU implementation using 14 CPU cores (28 threads).

Because of the high flexibility of the MC method, the implementations in this paper can be easily modified to simulate radiative heat transfer in a variety of cases with different geometries, boundary conditions or media. The optimization methods in this paper for GPU implementations based on the architecture of NVIDIA GPUs can be used as references for other researchers when building GPU applications.

Author statement
Author contributions: This work was conceived by Yong Huang. Jiang Shao performed the research and wrote the manuscript. Keyong Zhu analyzed and discussed some results.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant No. 51876004].

References

[1] Kroger RH, Gislen A. Compensation for longitudinal chromatic aberration in the eye of the firefly squid, Watasenia scintillans. Vision Res 2004;44:2129–34.
[2] Nilsson DE, Gislen L, Coates MM, Skogh C, Garm A. Advanced optics in a jellyfish eye. Nature 2005;435:201–5.
[3] Jagger WS, Sands PJ. A wide-angle gradient index optical model of the crystalline lens and eye of the octopus. Vision Res 1999;39:2841–52.
[4] Pierscionek BK, Augusteyn RC. Species variability in optical parameters of the eye lens. Clin Exp Optometry 1993;76:22–5.
[5] Augusteyn RC, Stevens A. Macromolecular structure of the eye lens. Prog Polym Sci 1998:375–413.
[6] Ji S, Ponting M, Lepkowicz RS, Rosenberg A, Flynn R, Beadie G, Baer E. A bio-inspired polymeric gradient refractive index (GRIN) human eye lens. Opt Express 2012;20:26746–54.
[7] Zuccarello G, Scribner D, Sands R, Buckley LJ. Materials for bio-inspired optics. Adv Mater 2002;14:1261–4.
[8] Mao Y, Chang S, Sherif S, Flueraru C. Graded-index fiber lens proposed for ultrasmall probes used in biomedical imaging. Appl Opt 2007;46:5887–94.
[9] White IM, Fan X. On the performance quantification of resonant refractive index sensors. Opt Express 2008;16:1020–8.
[10] Liu LH. Meshless method for radiation heat transfer in graded index medium. Int J Heat Mass Transf 2006;49:219–29.
[11] Liu LH, Zhang L, Tan HP. Finite element method for radiation heat transfer in multi-dimensional graded index medium. J Quant Spectrosc Radiat Transf 2006;97:436–45.
[12] Liu LH. Finite volume method for radiation heat transfer in graded index medium. J Thermophys Heat Transf 2006;20:59–66.
[13] Zhang Y, Yi HL, Tan HP. The lattice Boltzmann method for one-dimensional transient radiative transfer in graded index gray medium. J Quant Spectrosc Radiat Transf 2014;137:1–12.
[14] Howell JR, Siegel R, Mengüç MP. Thermal radiation heat transfer. 5th ed. Boca Raton: CRC Press; 2010.
[15] Modest MF. Radiative heat transfer. New York: Academic Press; 2003.
[16] Liu LH. Benchmark numerical solutions for radiative heat transfer in two-dimensional medium with graded index distribution. J Quant Spectrosc Radiat Transf 2006;102:293–303.
[17] Sharma A, Kumar DV, Ghatak AK. Tracing rays through graded-index media: a new method. Appl Opt 1982;21:984–7.
[18] Sharma A. Computing optical path length in gradient-index media: a fast and accurate method. Appl Opt 1985;24:4367–70.
[19] Shi GD, Huang Y, Zhu KY. Thermal emissions of a two-dimensional graded-index medium solved using a high-precision numerical ray-tracing technique. J Quant Spectrosc Radiat Transf 2016;176:87–96.
[20] Huang Y, Shi GD, Zhu KY. Runge-Kutta ray tracing technique for solving radiative heat transfer in a two-dimensional graded-index medium. J Quant Spectrosc Radiat Transf 2016;176:24–33.
[21] Qian LF, Shi GD, Huang Y. Runge-Kutta ray-tracing technique for radiative transfer in a three-dimensional graded-index medium. J Thermophys Heat Transf 2018;32:747–55.
[22] Qian LF, Shi GD, Huang Y, Xing YM. Backward and forward Monte Carlo method for vector radiative transfer in a two-dimensional graded index medium. J Quant Spectrosc Radiat Transf 2017;200:225–33.
[23] Modest MF. Backward Monte Carlo simulations in radiative heat transfer. J Heat Transf-Trans ASME 2003;125:57–62.
[24] Wang et al. Transient radiative transfer in two dimensional graded index medium by Monte Carlo method combined with the time shift and superposition principle. Numer Heat Transfer Part A 2016;69(6):574–88.
[25] Wang et al. Time-dependent polarized radiative transfer in an atmosphere-ocean system exposed to external illumination. Opt Express 2019;27(16):A981–94.
[26] Yong H, Guo-Dong S, Ke-Yong Z. Backward and forward Monte Carlo method in polarized radiative transfer. Astrophys J 2016;820:11.
[27] Mangiardi CM, Meyer R. A hybrid algorithm for parallel molecular dynamics simulations. Comput Phys Commun 2017;219:196–208.
[28] Volobuev YL, Truhlar DG. An MIMD strategy for quantum mechanical reactive scattering calculations. Comput Phys Commun 2000;128:465–76.
[29] Efremenko DS, Loyola DG, Doicu A, Spurr RJD. Multi-core-CPU and GPU-accelerated radiative transfer models based on the discrete ordinate method. Comput Phys Commun 2014;185:3079–89.
[30] Sweezy JE. A Monte Carlo volumetric-ray-casting estimator for global fluence tallies on GPUs. J Comput Phys 2018;372:426–45.
[31] Silvestri S, Pecnik R. A fast GPU Monte Carlo radiative heat transfer implementation for coupling with direct numerical simulation. J Comput Phys: X 2019;3.
[32] Tickner J. Monte Carlo simulation of X-ray and gamma-ray photon transport on a graphics-processing unit. Comput Phys Commun 2010;181:1821–32.
[33] Ren N, Liang J, Qu X, Li J, Lu B, Tian J. GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues. Opt Express 2010;18:6811–23.
[34] Horiuchi S, Yoshida S, Yamamoto M. Fast GPU-based ray tracing in radial GRIN lenses. Appl Opt 2014;53:4343–8.
[35] NVIDIA Corporation. NVIDIA® CUDA zone homepage, https://developer.nvidia.com/cuda-zone.
[36] Born M, Wolf E. Principles of optics: electromagnetic theory of propagation, interference and diffraction of light. 7th ed. Cambridge, U.K.: Cambridge University Press; 1999.
[37] Burden R, Faires JD. Numerical analysis. 9th ed. Hampshire: Cengage Learning; 2010.
[38] NVIDIA Corporation. CUDA C++ programming guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/; 2019.
[39] NVIDIA Corporation. CUDA C++ best practices guide, https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/; 2019.
[40] NVIDIA Corporation. NVIDIA Nsight Visual Studio Edition 2019.4 user guide, https://docs.nvidia.com/nsight-visual-studio-edition/2019.4/Nsight_Visual_Studio_Edition_User_Guide.htm/; 2019.
[41] NVIDIA Corporation. cuRAND, https://docs.nvidia.com/cuda/curand/; 2019.
[42] Oberman SF, Siu MY, Montuschi P, Schwarz E. A high-performance area-efficient multifunction interpolator. In: 17th IEEE symposium on computer arithmetic, proceedings. Los Alamitos: IEEE Computer Society; 2005. p. 272–9.