Professional Documents
Culture Documents
Scalable FBP Decomposition For Cone-Beam CT Reconstruction: November 2021
Scalable FBP Decomposition For Cone-Beam CT Reconstruction: November 2021
net/publication/356171619
CITATIONS READS
0 115
9 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Peng Chen on 17 November 2021.
CITATIONS READS
0 15
9 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Peng Chen on 14 November 2021.
3
4 for j ← 0 to 𝑁 𝑦 -1 do
5 for i ← 0 to 𝑁𝑥 -1 do
Figure 2: Cone-beam CT with a Flat Panel Detector (FPD). 6 𝑧 ← ⟨𝑀𝑎𝑡 [𝑠] [2], [𝑖, 𝑗, 𝑘, 1]⟩ ⊲ Eqn 8
7 𝑥 ← ⟨𝑀𝑎𝑡 [𝑠] [0], [𝑖, 𝑗, 𝑘, 1]⟩/𝑧 ⊲ Eqn 8
Table 1: The parameters of a CBCT system. 8 𝑦 ← ⟨𝑀𝑎𝑡 [𝑠] [1], [𝑖, 𝑗, 𝑘, 1]⟩/𝑧 ⊲ Eqn 8
Description Parameter Unit 9 𝐼 [𝑘] [ 𝑗] [𝑖]←𝐼 [𝑘] [ 𝑗] [𝑖] + 𝑧12 ·𝑆𝑢𝑏𝑃𝑖𝑥𝑒𝑙 (𝑃 [𝑠], 𝑥, 𝑦)
Rotation angle 𝜙 degree
Distance from source to object (Z-axis) 𝐷𝑠𝑜 mm
Distance from source to detector (FPD) 𝐷𝑠𝑑 mm 10 Function SubPixel(𝑄, 𝑥, 𝑦)
The number of 2D projections 𝑁𝑝 — 11 [𝑖𝑢 , 𝑖 𝑣 ]←[int(𝑥), int(𝑦)] ⊲ 𝑖𝑢 , 𝑖 𝑣 are integers
The width and height of a 2D projection 𝑁𝑢 , 𝑁 𝑣 pixel 12 [𝜀𝑢 , 𝜀 𝑣 ]←[𝑥 − 𝑖𝑢 , 𝑦 − 𝑖 𝑣 ] ⊲ 𝜀𝑢 , 𝜀 𝑣 ∈ [0, 1)
Pixel pitch at U- and V-axis △𝑢 , △𝑣 mm/pixel 13 𝑡 1 ←𝑄 [𝑖 𝑣 ] [𝑖𝑢 ]·(1 − 𝜀𝑢 ) + 𝑄 [𝑖 𝑣 ] [𝑖𝑢 + 1]·𝜀𝑢 ⊲ interp
A 3×4 projection matrix at angle 𝜙 (Sec. 4.1) 𝑀𝜙 —
The number of voxels in X-, Y-, Z-axis 𝑁 𝑥 , 𝑁 𝑦 , 𝑁𝑧 voxel
14 𝑡 2 ←𝑄 [𝑖 𝑣 + 1] [𝑖𝑢 ]·(1 − 𝜀𝑢 ) + 𝑄 [𝑖 𝑣 + 1] [𝑖𝑢 + 1]·𝜀𝑢 ⊲ interp
Voxel pitch at X-, Y-, and Z-axis △𝑥 , △𝑦 , △𝑧 mm/voxel 15 return 𝑡 1 ·(1 − 𝜀 𝑣 ) + 𝑡 2 ·𝜀 𝑣 ⊲ 2D interpolation value
Offset of FPD at U- and V-axis (Figure 7a) 𝜎𝑢 , 𝜎𝑣 pixel
Rotation center offset (Figure 7b) 𝜎𝑐𝑜𝑟 mm
Table 2: State-of-the-art image reconstruction solutions by FBP and Iterative Reconstruction (IR) algorithms.
Beam Decomposition Lower-bound Out-of-Core Multiple Communication
Implementation Algorithm
Shape Input Output Input Size Capability GPUs Nodes (MPI)
Trace [5] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑙𝑜𝑔(𝑁 ))
NU-PSV [63] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑙𝑜𝑔(𝑁 ))
MemXCT [27] IR Parallel 2D 2D O(𝑁𝑝 ) ✗ ✓ ✓ O(𝑁 )
Peta-scale XCT [28] IR Parallel 3D 3D O(𝑁𝑝 ) ✗ ✓ ✓ O(𝑁 )
Consensus Equilibrium [64] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
DMLEM [12] IR Cone 1D ✗ O(𝑁𝑢 ×N𝑣 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
Palenstijn et al. [44] IR Cone 1D 1D O(𝑁𝑢 ×𝑁𝑝 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
TIGRE [6] IR Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✗ ✓ ✗ ✗
Lu et al. [38] FBP Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✓ ✗ ✗ ✗
iFDK [9] FBP Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
This work FBP Cone 2D 1D O(𝑁𝑢 ) ✓ ✓ ✓ O(𝑙𝑜𝑔(𝑁 ))
Table 3: The system parameters in our framework. volume data (output) in a way that enables us to build a scalable
Description Parameter and deep pipeline on heterogeneous computing systems.
The batch size of slices 𝑁𝑏
3.1.1 Splitting the volume data. We split the volume data in a way
The number of slices generated by a rank (or group) 𝑁𝑠
The batch count of batch processing 𝑁𝑐
that we could perform independent and concurrent reconstruc-
The number of sub-volumes 𝑁𝑛 tion of different sub-volumes. This has been proposed before for
The number of groups (of MPI ranks) 𝑁𝑔 parallel-beam reconstruction (e.g. [28, 62]), but not for cone-beam,
The number of MPI ranks in a single group 𝑁𝑟 to the author’s knowledge. Furthermore, this method of splitting
The number of GPUs 𝑁𝑔𝑝𝑢𝑠 the volume enables us to construct volumes in an out-of-core fash-
The number of launched ranks 𝑁𝑟𝑎𝑛𝑘𝑠 ion. This is particularly useful in cases when the required memory
for storing high-resolution volume data is larger than the memory
capacity (i.e. GPU device memory). As Figure 3c shows, we split the
In comparison to cone-beam image reconstruction, parallel-beam volume into a sub-volumes vertically (or along Z-axis in Figure 2),
is simpler since the projection and volume data can be naturally split e.g., 𝑉0 , 𝑉1 , . . . , and 𝑉𝑁𝑛 −1 in Figure 3c. 𝑁𝑏 is defined as the batch
without special consideration to the irregular decomposition of com- size (or the number of 2D slices) in each sub-volume. Hence, the
putation and communication. TIGRE [6] implemented a collection total number of sub-volumes may be calculated as
of IR algorithms for CBCT. However, TIGRE is restricted to using
𝑁𝑧
multiple GPUs on a single node. Palenstijn et al. [44] presented an 𝑁𝑛 = (3)
𝑁𝑏
extension of the ASTRA Toolbox [60] with an optimization of the
SIRT algorithm that enables distributed computation with multiple 3.1.2 Splitting 2D projections. As Figure 3a shows, each 2D pro-
GPUs. The authors only considered decomposing the projections jection is decomposed into several sub-projections according to
of the cone-beam in the dimension of 𝑁 𝑣 . DMLEM [12] scales up the geometric position of the corresponding sub-volume. Due to
to tens of GPUs, yet is restricted to reconstructing extremely small the use of cone-beam geometry, the required sub-projection for
volumes, e.g. smaller than 2003 . iFDK [9] is an efficient framework the partial image reconstruction varies from the position of each
to scale the FBP algorithm on CBCT. However, iFDK only decom- sub-volume. That is since the magnification factor exists through
poses the projection at the dimension of 𝑁𝑝 , and the size of output a cone geometry. Unlike the parallel-beam CT system, decompos-
volume on each GPU is limited by the GPU memory capacity. Un- ing a cone-beam projection may introduce an overlapped area. In
like the prior works, our algorithm decomposes the projections other words, in comparison to the parallel-beam CT, the character-
on both 𝑁 𝑣 and 𝑁𝑝 dimensions, resulting in better scalability and istics of the magnification make the CBCT more challenging w.r.t.
also out-of-core capability. Furthermore, we can greatly reduce the decomposing the domain in distributed reconstruction.
inter-processes communication and remove all serialization and Figure 4 (YZ-view) shows the decomposition of a 2D projection.
redundancy from the end-to-end pipeline, i.e. all steps from loading To generate a sub-volume of 𝑉𝑖 , the required range of projection is
until storing can be overlapped.
𝑎𝑖 𝑏𝑖 = 𝐶𝑜𝑚𝑝𝑢𝑡𝑒𝐴𝐵(𝑖·𝑁𝑏 , (𝑖 + 1)·𝑁𝑏 ) (4)
3 PROPOSED ALGORITHM Where the function ComputeAB, which computes the maximum
In this section, we discuss the algorithm we propose for scalable projection area, is shown in Algorithm 2.
FBP decomposition for cone-beam CT. In Algorithm 2, the projection operation (at angles of 135◦ and
315◦ ), according to Equation 8, is called four times to compute
3.1 Projection & Volume Decomposition four y coordinates. The range of 𝑎𝑖 𝑏𝑖 is determined by the mini-
Table 3 lists all the parameters we use in the following sections. mum and maximum values of y (min4 and max4 in the algorithm).
As Figure 3 shows, we introduce a novel algorithm to parallel FBP- Since the positions of y are at sub-pixel precision (expressed as a
based image reconstruction: decomposing projections (input) and single-precision variable), we adjust them to integer values since
Scalable Image Reconstruction Algorithm SC ’21, November 14–19, 2021, St. Louis, MO, USA
Sub-volume
𝑁𝑝
Sub-volume
𝑁𝑏
Sub-volume 𝑉𝑖 𝑉0
𝑉1
Sub-volume
…… 𝑉𝑖
𝑉𝑖+1
Required by 𝑉𝑖 Sub-volume
𝑁𝑣 Sub-volume
Required by 𝑉𝑖 +1 𝑉𝑖 +1
…… Sub-volume
Sub-volume
𝑉𝑁𝑛−1
𝑁𝑢
(a) 2D Projections decomposition. (b) Reduced sub-volumes. (c) Aggregate Volume.
Figure 3: Overview of the projection and volume decomposition. An example of four MPI ranks work as a group, 𝑁𝑟 = 4.
3D Volume
Load data Filtering Back-projection
𝛿𝑢 ഥ 𝑁𝑢
Load data Filtering Back-projection MPI-Reduce Store O
Load data Filtering Back-projection 𝛿𝑣 𝑁𝑣 𝛿𝑐𝑜𝑟
Load data Filtering Back-projection MPI-Reduce Store o ഥ
O
Load data Filtering Back-projection
Figure 6: Overview of how the proposed framework operates (a) FPD center offset. (b) Rotation center offset (top-view).
on a distributed system.
Figure 7: Geometric correction. The blue and red points are
the theoretical and calibrated positions, respectively.
This interaction between sub-volumes allows us to run an end-to-
end pipeline from loading projections to storing the volume (details where [𝑖, 𝑗, 𝑘] is a 3D point in the unit of voxel, [𝑥, 𝑦] is the pro-
in Section 4.4.3). It is important to note that the lack of projection jected 2D position at the FPD plane in the unit of pixel at sub-pixel
splitting in other cone-beam frameworks[9, 30, 38] forces them precision. We can obtain a projection matrix based on the angle
to serialize the sub-volumes: sub-volumes are computed one after 𝜙, and perform the mapping operation as shown in lines 6∼8 of
the other. In addition, the lack of projection splitting forces the Algorithm 1.
frameworks to repeatedly move the same projections from host to
device in an inefficient manner. 4.2 Filtering Computation Optimization
As Figure 6 shows, we take advantage of the heterogeneity of the
4 IMPLEMENTATION
GPU-accelerated system to pipeline and parallelize the FBP compu-
In this section, we discuss the implementation of the proposed algo- tation. Typically the GPU is used to perform the filtering computa-
rithm in a framework. Figure 6 gives an overview of the proposed tion in Equation 2 [7, 38]. We however do the filtering computation
algorithm when running on distributed systems. More specifically, on the CPU (using IPP/MKL libraries [58]) to have the end-to-end
we use the muti-core CPUs to perform the operations such as pipeline shown in Figure 6. Running the filtering computation on
Parallel File System (PFS) I/O, filtering computation, and MPI com- CPU contributes to building an efficient pipeline for the FBP algo-
munications. We use the GPUs to run the back-projection kernel. rithm as follows. (i) The GPU focuses on back-projection, which
is the computational bottleneck. Also, we can overlap the filtering
4.1 A General Projection Matrix computation and back-projection. Hence, we can hide the latency
It is important to assess the geometric parameters of a CT system of the filtering computation. (ii) The limited device memory of GPU
with a high degree of precision in order to reconstruct tomographic can be fully used for back-projection. More specifically, we can
images with good spatial resolution and low artifact content [70]. alleviate the pressure on device memory when generating high-
Many reconstruction works [38, 42, 66] assume CT systems have no -resolution volumes. (iii) The data movement remains simplified
geometric error or that the geometric error is corrected by physical since the filtering computation can be performed immediately after
adjustment. In this work, we correct the geometric offset dynami- loading data from storage.
cally when performing the projection operations. In Figure 7, we
show the geometric offsets that must be carefully calibrated before 4.3 Novel CUDA Back-projection Kernel
scanning [70]. In our system, the projection matrix is formulated Conventional approaches [38, 72] rely on 2D layered texture [11]
with consideration of the geometric offset, e.g. 𝜎𝑥 , 𝜎 𝑦 and 𝜎𝑐𝑜𝑟 . to improve the data locality and update the volume by batches
We propose a general projection matrix available to correct these of projections. We, however, split the projections horizontally, in
offsets for real-world datasets as listed in Table 4. addition to the batches, and update each voxel by all projections in
Most importantly, the proposed matrix is general and can be a single batch. We also take advantage of the 3D texture to improve
reused for most CBCT systems, e.g. microscope CT system with the data locality of projections.
rotation center offset (𝜎𝑐𝑜𝑟 ). The CBCT geometry can be described
as a pinhole model, similar to a digital camera [23]. All geometric 4.3.1 CUDA Implementation. In Listing 1, our kernel is imple-
parameters could be expressed in a well-aligned matrix of size 3×4 mented as kernelBackProjection. Each 2D position, namely (x, y)
(known as the projection matrix). These matrices are convenient in lines 12∼14, is computed according to Equation 8 and then, is
for performing the back-projection computation, i.e. for projecting used to fetch the intensity value of the projection at sub-pixel pre-
a voxel to the plane of FPD such as the matrix 𝑀𝜙 in Algorithm 1. cision using the device function devSubPixel. The device function
Our projection matrix is defined as devSubPixel is strictly consistent with the original interpolation
△𝑥 0 1−△𝑥 ·𝑁𝑥 function implementation in Algorithm 1. Each pixel is accessed
0
𝐷𝑠𝑑 0 𝑁𝑢2−1 + 𝜎𝑢 0 𝑐𝑜𝑠 (𝜙) −𝑠𝑖𝑛 (𝜙) 0 𝜎𝑐𝑜𝑟 2
via 3D texture, as the device function devPixel shows. We use the
△𝑦 ·𝑁 𝑦
0 0 −1 0
△𝑢
0 −△ 0
2
3D texture to improve the data reuse. To perform a bilinear inter-
𝑀𝜙 = 0 𝐷𝑠𝑑 𝑁 𝑣 −1
+ 𝜎𝑣 0 𝑠𝑖𝑛 (𝜙) 𝑐𝑜𝑠 (𝜙) 0 𝐷𝑠𝑜
𝑦
2 △𝑧 ·𝑁𝑧
△𝑣 0 0 △𝑧
0 0 0 2
𝐷𝑠𝑑 0 0 0 1 polation operation, we need to load four neighboring pixels from
0 0 0 1
Using this matrix, we can project the 3D position of any voxel the device memory to registers. However, those four pixels are not
relative to the FPD plane. We formulate the projection as physically contiguous in the device memory. It is worth mentioning
that the texture memory in CUDA provides a hardware-optimized
[𝑥, 𝑦] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀𝜙 , [𝑖, 𝑗, 𝑘]) (8) bilinear interpolation function at 8-bits precision [11]. However, to
Scalable Image Reconstruction Algorithm SC ’21, November 14–19, 2021, St. Louis, MO, USA
(a) Reconstructing 20483 volume of Tomo_00029 in Table 5. 𝑁𝑔𝑝𝑢𝑠 =1. (b) Reconstructing 40963 volume of Bumblebee. 𝑁𝑔𝑝𝑢𝑠 =128, 𝑁𝑔 =64, 𝑁𝑟 =8.
Figure 10: Empirical examples using real-world datasets: efficient overlapping of the pipelined stages (BP = Back-projection).
Algorithm 3: Moving data from host to device by cudaMem- Namely: loading projections from storage, filtering, back-projection,
cpy3D in the computing pipeline of Figure 9. reduction of the 3D sub-volumes, and finally saving the recon-
Input: Queue1 , Queue2 (as in Fig. 9), devMem is texture-optimized 3D structed results to storage. Five threads manage different stages
device memory with depth of H (tex in Listing 1). devMem(x,y) of the pipeline, concurrently. The main thread (also called MPI
means 3D memory ranges from x to y in depth dimension. thread) manages the MPI communication. Then four other threads
Output: sub-volumes of 𝑉0 , . . . , 𝑉𝑁𝑐 −1 are launched via the C++ standard library (namely std::thread) [33].
1 for i ← 0 to 𝑁𝑐 -1 do The inter-thread hand-over is managed through four queues (FIFO
2 s← 0 buffers) that assume all threads are running independently. The
3 if i == 0 then load thread is launched to load the partial projections according to
4 P(𝑎 0𝑏 0 ) ← 𝑄𝑢𝑒𝑢𝑒 1 ⊲ pop head of filtered projections the requirements of the sub-volumes (as in Figure 3). The filtering
5 d←b0 − 𝑎 0 + 1 thread is used to perform the filtering computation on the data
6 Memcpy3D 𝑃 (𝑎 0𝑏 0 ) to devMem(0, d-1) ⊲ H2D handed-over from the load thread. We then parallelize the filtering
7 else computation (using OpenMP). The Back-projection thread launches
8 P(𝑏𝑖−1𝑏𝑖 ) ← 𝑄𝑢𝑒𝑢𝑒 1 ⊲ get filtered projection the back-projection CUDA kernel on the GPU (proposed CUDA
9 d←b𝑖 − 𝑏𝑖−1 +1 kernel in Listing 1). The back-projection thread also manages the
10 if s%H + d < H then data movement in a way that allows the pipeline to seamlessly
11 Memcpy3D P(𝑏𝑖−1𝑏𝑖 ) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (𝑠%𝐻, 𝑠%𝐻 + 𝑑) support out-of-core image reconstruction if necessary (shown in
12 else Algorithm 3). In the algorithm, we use cudaMemcpy3D to move the
13 k← 𝐻 − 𝑠%𝐻 partial projections waiting for processing in 𝑄𝑢𝑒𝑢𝑒 1 to be processed
14 Memcpy3D P(𝑏𝑖−1 (𝑏𝑖−1 + 𝑘)) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (𝑠%𝐻, 𝐻 − 1) on the device using the proposed CUDA kernel in listing 1. Then
15 Memcpy3D P((𝑏𝑖−1 + 𝑘 + 1)𝑏𝑖 ) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (0, 𝑏𝑖 −𝑏𝑖−1 ) we push the generated sub-volume to 𝑄𝑢𝑒𝑢𝑒 2 . Finally, the store
16 s← 𝑠 + ℎ thread writes the final volume data to global storage (or PFS) using
17 𝑉𝑖 ← generate a sub-volume on GPU ⊲ use kernel in Listing 1 OpenMP threads. Figure 10 shows two empirical examples of the
18 𝑄𝑢𝑒𝑢𝑒 2 ←V𝑖 ⊲ push back a sub-volume end-to-end pipeline processing and demonstrates the effectiveness
of the overlapping.
Performance Projection. Each MPI rank loads a small part Table 4: Geometric correction for computing projection ma-
of projections from the local storage to memory as expressed in trix (𝑀𝜙 ) and projection pre-processing by Equation 1.
Equation 5 and Equation 7. tomo_ID
Coffee bean Bumblebee
( 00027 00028 00029 00030
𝜂·𝑆𝑖𝑧𝑒𝐴𝐵𝑖 /𝐵𝑊𝑙𝑜𝑎𝑑 , if 𝑖 = 0. 0 0 25 26 27 -10
𝑖
𝑇𝑙𝑜𝑎𝑑 = (13) 𝜎𝑢
𝜂·𝑆𝑖𝑧𝑒𝐵𝐵𝑖 /𝐵𝑊𝑙𝑜𝑎𝑑 , else 𝑖 ∈ [1, 𝑁𝑐 ). 𝜎𝑣 0 0 0.25 0.25 0.2 0.2
𝜎𝑐𝑜𝑟 -0.0021 1.03 0 0 0 0
Similar to the equation above, we can obtain 𝑇𝑓𝑖 𝑙𝑡 and 𝑇𝐻𝑖 2𝐷 by 𝜆𝑑𝑎𝑟𝑘 0 dark data is provided in dataset
216 blank data is provided in dataset
𝑇 𝐻 𝑓 𝑙𝑡 and 𝐵𝑊𝑝𝑐𝑖 , respectively. The back-projection execution time
𝜆𝑏𝑙𝑎𝑛𝑘
Therefore, the required time of moving a sub-volume from device to 6.1 Datasets, Environment, and Measurements
host, reducing the sub-volume by MPI_Reduce, and writing a sub- Computing Platform. ABCI supercomputer (14𝑡ℎ in the Top500
volume data to storage could be written as 𝑇𝐷2𝐻 𝑖 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝐵𝑊𝑝𝑐𝑖 , list at November 2020) is used for performance evaluation. The sys-
𝑇𝑟𝑒𝑑𝑢𝑐𝑒 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝑇 𝐻𝑟𝑒𝑑𝑢𝑐𝑒 , and 𝑇𝑠𝑡𝑜𝑟𝑒 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝐵𝑊𝑠𝑡𝑜𝑟𝑒 , respec-
𝑖 𝑖 tem is equipped with 1,088 compute nodes (4,352 Nvidia Tesla V100
tively. We define following parameters that are aggregations: 𝑇𝐶𝑃𝑈 GPUs) and 35PB shared storage. The specification of each compute
is the runtime on CPU, 𝑇𝐺𝑃𝑈 is the runtime on GPU, 𝑇𝑙𝑜𝑎𝑑 is the node is as follows: two Intel Xeon Gold 6148 CPUs (2.40GHz, 20
time to load from local storage to memory, 𝑇𝑓 𝑙𝑡 is the time to do Cores), 384GB (DDR4 2666MHz) memory, 1.6TB NVMe SSD, four
the filtering computation on CPU, 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 the time to do the sub- Nvidia Tesla V100 GPUs (16GB/GPU) with PCIe 3.0×16, and two
volume reductions, 𝑇𝑠𝑡𝑜𝑟𝑒 is the time to store the resulting volume InfiniBand EDR HCAs. Our framework is developed on CUDA-10.2
to the PFS. Note that the aggregate runtimes are comprised of run- Toolkit (CUDA Driver: 440.33.01) running on CentOS 7.4. The ver-
times for different batch counts (𝑁𝑐 ), e.g. 𝑇𝐶𝑃𝑈 = 𝑁
Í 𝑐 −1 𝑖 sion of Intel libraries for IPP and MPI is 2020.4.304. We use Nvidia
𝑗=0 𝑇𝐶𝑃𝑈 and
nvcc and Intel mpicc compilers for compiling the device and host
𝑇𝐶𝑃𝑈 ∝ 𝑇 𝑖
∼ 𝐶𝑃𝑈 .
codes, respectively.
The runtime on the CPU and GPU is thus
Datasets. Six real-world datasets are evaluated: (i) A Coffee
𝑖 𝑖
𝑇𝐶𝑃𝑈 = 𝑇𝑙𝑜𝑎𝑑 + 𝑇𝑓𝑖 𝑙𝑡 bean dataset. A roasted coffee bean sample was scanned by Zeiss
𝑖
(16) Xradia Versa 510 (3D X-ray Microscopy) at 80 kV, 87.5 𝜇A. 𝐷𝑠𝑑 =151.7,
𝑇𝐺𝑃𝑈 = 𝑇𝐻𝑖 2𝐷 + 𝑇𝑏𝑝
𝑖 𝑖
+ 𝑇𝐷2𝐻
𝐷𝑠𝑜 =16.0, the X-ray and optical magnification are 9.48 (=𝐷𝑠𝑑 /𝐷𝑠𝑜 )
Assuming a perfect overlap of the operations in the pipeline, the and 0.39 (by calibration), respectively. Offseting a detector of size
overall runtime is projected as 2000×2000 to the left and right side with overlapped region was
0
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 =𝑇𝐶𝑃𝑈 0
+ 𝑇𝐺𝑃𝑈 0
+ 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 0
+ 𝑇𝑠𝑡𝑜𝑟𝑒 + conducted at two full scan. The size of each stitched projection
𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ (17)
becomes 𝑁𝑢 =3728 and 𝑁 𝑣 =2000. The number of acquired projec-
max( 𝑖
𝑇𝐶𝑃𝑈 , 𝑖
𝑇𝐺𝑃𝑈 , 𝑖
𝑇𝑟𝑒𝑑𝑢𝑐𝑒 , 𝑖
𝑇𝑠𝑡𝑜𝑟𝑒 ) tions is 𝑁𝑝 =6401. The exposure time is 14 seconds and total scan
𝑖=1 𝑖=1 𝑖=1 𝑖=1
time is about 5 hours. (ii) A bumblebee dataset. A bumblebee
Insights from the Performance Model. We list observations was scanned on Nikon Metrology HMX ST 225 (a micro-CT scan-
from the performance model analysis. (i) Scaling: The operations ner) at 40Kv, 173𝜇A. 𝐷𝑠𝑑 =672.5, 𝐷𝑠𝑜 =39.8, the X-ray magnification
on the partial projections such as loading, filtering are light-weight is 16.9. The projection parameters are: 𝑁𝑢 =𝑁 𝑣 =2000, △𝑢 =△𝑣 =0.2,
in comparison with other operations on the volume that are often 𝑁𝑝 =3142. The number of acquired projections is 𝑁𝑝 =6401. The
the bottleneck. We observe that 𝑇𝐶𝑃𝑈 ∝ ∼1/𝑁𝑔 since fewer slices cor- total scan time is about 13 hours. (iii) Four Tomobank datasets
respond to less computation on the projections (as expressed in (tomo_00027, tomo_00028, tomo_00029, and tomo_00030). The scan-
Equation 8). The total number of generated slices by each group of ner and the related configurations are described in [13]. The dataset
MPI ranks becomes 𝑁𝑧 /𝑁𝑔 (as shown in Equation 10), where 𝑁𝑧 of 00027, 00028 and 00029 share the same geometric parameters:
is a constant value in this scenario. According to Equations 10, 12, 𝐷𝑠𝑑 =250, 𝐷𝑠𝑜 =100, 𝑁𝑢 =2004, 𝑁 𝑣 =1335, △𝑢 =△𝑣 =0.025, and 𝑁𝑝 =1800.
and 15, 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 ·𝑁𝑐 = 𝜂·𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧2 /𝑁𝑔 . It results in 𝑇𝐺𝑃𝑈
𝑖 , 𝑇𝑟𝑒𝑑𝑢𝑐𝑒
𝑖
The geometric parameters of tomo_00030 are: 𝐷𝑠𝑑 =350, 𝐷𝑠𝑜 =250,
and 𝑇𝑠𝑡𝑜𝑟𝑒 are inversely proportional to 𝑁𝑔 . Hence, we can derive
𝑖 𝑁𝑢 =668, 𝑁 𝑣 =445, △𝑢 =△𝑣 =0.075, and 𝑁𝑝 =720.
that 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ∝
∼1/𝑁𝑔 . Considering that 𝑁𝑟 is a fixed value for a given Importance of the Datasets. The datasets we use in this paper
problem, according to Equations 9 and 11, we can conclude that give valuable insight. We elaborate in specific on the coffee bean
∼1/𝑁𝑔𝑝𝑢𝑠 since 𝑁𝑔𝑝𝑢𝑠 = 𝑁𝑔 ·𝑁𝑟 . More specifically, the per-
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ∝ dataset (reconstruction shown in Figure 11a). (i) the shape and
formance theoretically scales linearly with the number of GPUs. aspect ratio of a coffee bean made it an appropriate candidate for
(ii) Peak performance: Using Equation 17, we can predict the using the wide-field macro mode of the CT scanner, allowing us to
potential peak performance and quantify the implementation effi- capture a high-resolution dataset. It has a variable and representa-
ciency. tive structure for a large group of problems and some interesting
SC ’21, November 14–19, 2021, St. Louis, MO, USA Chen, P. et al.
1 3
2 4
3
2
(a) Reconstruction of the coffee bean dataset: 3728 × 2000 × 6401 ⇒ 40963 . (b) Bumblebee visualization by 3D Slicer [18].
Figure 11: Image reconstruction of real-world datasets. The container for Bumblebee is hidden using a mask.
Performance [FLOP/s]
Single-precision Operations Roofline
a node with a single GPU (V100 or A100). The input sizes of 1013
Peak performance: 13.4e+12
tomo_00030 and tomo_00029 are 668×445×720 (816 MB) and
2004×1335×1800 (17.9 GB), respectively. The columns in the 1012
4.0TFLOP/s 4.0TFLOP/s 4.4TFLOP/s 4.5TFLOP/s 4.5TFLOP/s
table are not a breakdown of the total runtime 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ; the AI: 40.9 AI: 41.5 AI: 157.7 AI: 900.4 AI: 2954.7
operations are overlapped in a pipeline fashion as shown in 1011 14.9
Figure 10. 0.1 1 10 100 1000 10000
Arithmetic Intensity (AI) [FLOP/byte]
Input Output 𝑇𝑙𝑜𝑎𝑑 𝑇𝑓 𝑙𝑡 𝑇𝐻 2𝐷 𝑇𝑏𝑝 𝑇𝐷2𝐻 𝑇𝑠𝑡𝑜𝑟𝑒 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 Perf. (GUPS)
tomo_ID (𝑣𝑜𝑥𝑒𝑙 3 ) (s) (s) (s) (s) (s) (s) (s) Ours RTK Figure 12: Roofline analysis on a V100 GPU (generated by
5123 (512MB) ∼0.5 ∼0.95 ∼0.3 0.87 0.11 0.47 1.4 111.6 110.8
Nvidia Nsight Compute [15]). Using tomo_00030 dataset, the
V100 GPU (16GB)
00030 10243 (4GB) ∼0.5 ∼0.95 ∼0.3 6.7 0.70 0.95 7.9 115.7 113.7
(816MB) 20483 (32GB) ∼0.5 ∼0.95 ∼0.3 53.1 6.4 2.60 60.2 117.2 ✗ blue, red, and black points indicate kernels that reconstruct
40963 (256GB) 423.3 50.2 21.8 475.0 120.1
the volumes of 5123 , 10243 , and 20483 , respectively. The ◦ and
∼0.5 ∼0.95 ∼0.3 ✗
5123 (512MB) 9.5 17.0 3.5 8.6 0.10 0.73 19.7 29.5 104.7
00029 10243 (4GB) 9.4 17.3 3.5 18.1 0.83 0.95 25.4 107.0 107.7
(17.9GB) 20483 (32GB) 9.5 17.0 3.8 124.2 6.4 7.61 137.7 125.1 ✗
△ denote RTK and our kernels, respectively.
40963 (256GB) 9.5 17.1 6.9 971.1 49.6 21.1 1028.8 129.2 ✗
5123 (512MB) ∼0.2 ∼0.7 ∼0.1 0.69 0.06 0.47 1.1 111.6 125.4
A100 GPU (40GB)
00030 10243 (4GB) ∼0.2 ∼0.7 ∼0.1 5.1 0.4 0.4 6.853 152.0 127.4
(816MB) 20483 (32GB) 40.1 3.2 4.0 52.4 155.3
40963 (256GB)
∼0.2
∼0.2
∼0.7
∼0.6
∼0.1
∼0.1 318.8 27.1 36.3 347.1 159.7
✗
✗
of 1e-5 is the threshold for the difference of the generated and
5123 (512MB) 6.4 9.0 2.8 2.8 0.06 0.5 10.1 87.8 122.0 standard volume data. (ii) The datasets described earlier are used
00029 10243 (4GB) 6.5 8.9 2.7 14.2 0.45 0.6 19.7 137.5 124.3
(17.9GB) 20483 (32GB) 6.3 8.8 3.2 98.2 3.2 3.9 114.9 158.2 ✗ in generating a wide variety of volume data as in Figure 8 and
40963 (256GB) 6.3 8.7 3.0 756.0 27.0 36.2 807.2 166.4 ✗
Figure 11. All volume data and slices are visualized for manual
visual inspection by the widely used 3D Slicer [18] viewer.
There are several reasons why we do not conduct a direct per-
features, e.g. walls, hollow pores, voids, laminar features. More formance comparison with other frameworks. (i) It would be in-
importantly, it could broadly represent the low-contrast usually adequate to compare with parallel-beam based algorithms such as
found in CFRP [55]. The pore structure is not dissimilar to that NU-PSV [63] and Peta-scale XCT [28] since they target the older
found in auxetic or metal foams [49], or for the bioengineering generation of CT devices. (ii) Cone-beam based systems such as
research, trabecular (cancellous) bone [25, 59]. RTK, Lu et al. [38], and iFDK [9] are incapable of processing the
Geometric correction & pre-processing of parameters. All CBCT datasets of the sizes and types we use. That is since all these
parameters we correct are listed in Table 4. The parameter 𝜎𝑐𝑜𝑟 is libraries do not consider the geometric correction as listed in Ta-
very small in value, yet due to the magnification effect, it is critical ble 4. Furthermore, RTK and TIGRE are restricted to a single GPU
to the image quality of X-ray microscopy. and compute node, respectively.
Measurement methods. All runs use single-precision on CPUs
and GPUs. The reported performance is averaged using a hundred 6.2 Out-of-core Back-projection Evaluation
repeated executions. The runtime of GPU kernel and host code are CUDA Kernel Performance. We report the CUDA back-projection
measured by cudaEvent and MPI_Wtime, respectively. We conduct kernel performance in the unit of GUPS 2 (Giga-updates Per Second).
both numerical and visual inspection assessments to assure the The specification of the A100 compute node used in this section
generated volume data is correct. (i) Regarding the numerical is different from V100 as follows: two Intel Xeon Platinum 8360Y
assessment, the digital phantom of Shepp-Logan phantom [53] CPUs, eight Nvidia Tesla A100 SMX4 GPUs, and NVMe SSD/Intel
is used to generate the projections by RTK tool and reconstruct
volumes using those projections. The root mean square error [1] 2 𝑃𝑒𝑟 𝑓 𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧 ·𝑁𝑝
= , where T is the runtime in the unit of second.
𝑇 ·109
Scalable Image Reconstruction Algorithm SC ’21, November 14–19, 2021, St. Louis, MO, USA
Runtime (s)
300 300
400
Runtime (s)
268.8 250
Runtime (s)
Runtime (s)
250 329.2 250 227.4 209.2
300 200
200 200
140.8 181.7 130.2 150 120.8
150 200 150
100 75.7 95.1 100 69.2 100 61.7
40.2 100 49.2 35.5 32.3
50 22.7 15.3 25.8 14.5 12.7 50 18.7 13.712.6 50 16.8 13.2 11.911.5
0 0 0 0
16
32
64
128
256
512
1024
16
32
64
128
256
512
1024
4
8
16
32
64
128
256
512
1024
8
16
32
64
128
256
512
1024
Number of GPUs Number of GPUs Number of GPUs Number of GPUs
(a) 3928×1998×6401⇒40963 , 𝑁𝑟 =16. (b) 1864×999×6401⇒40963 , 𝑁𝑟 =8. (c) 20002 ×3142⇒40963 , 𝑁𝑟 =8. (d) 2004×1335×1800⇒40963 , 𝑁𝑟 =4.
Figure 13: Strong scaling. Projected denotes the potential best runtime as predicted by our performance model. Coffee bean 2x
is a rebinning of the original dataset (i.e. double the pixel size to reduce the input size to 1/4)
40000
18 16
35000 Coffee bean
Coffee bean Bumblebee
Performance (GUPS)
16 14
30000 Coffee bean 2x
14 14.8 15.3 12 12.7 Bumblebee
13.9 25000
12 12.9 13.1 11.5 11.7 11.9
10 Tomo_00029
10 20000
Runtime (s)
Runtime (s)
8 9 9 9 9
8 9 9 9 9 9 15000
6
6 10000
4 4
Projected 5000
2 Mesured 2
0
0 0
4 8 16 32 64 128 256 512 1024
128
256
512
1024
64
128
256
512
1024
Number of GPUs
Number of GPUs Number of GPUs
6401·𝑁𝑔𝑝𝑢𝑠 3142·𝑁𝑔𝑝𝑢𝑠 Figure 15: Performance (GUPS) when generating 40963 vol-
(a) 3928×1998× 1024 ⇒40963 . (b) 2000×2000× 1024 ⇒40963 .
umes, data sets and parameter configuration are similar to
𝑁𝑔𝑝𝑢𝑠 𝑁
Figure 14: Weak scaling. (a) 𝑁𝑟 = 64 .
𝑔𝑝𝑢𝑠
(b) 𝑁𝑟 = 128 . Figure 13).
SSD P4610 1.6T×2. We take advantage of the widely used CUDA while the data movement between the device and host is overlapped
kernel in the RTK library for performance comparison. We achieve with the storing of the volume on the PFS (𝑇𝑠𝑡𝑜𝑟𝑒 ). Accordingly, all
performance that is competitive to the RTK library (as shown in the operations in the end-to-end pipeline are overlapped as shown
Table 5). Using a single Nvidia V100 GPU, the performance of the in Figure 10b.
back-projection kernel in RTK is 104.7∼113.7 GUPS while our ker- Out-of-core Back-projection. Based on the 2D projection de-
nel is 29.5∼129.2 GUPS. The results on the Nvidia A100 GPU show composition, our algorithm can tackle problem sizes beyond the
performance improvement that is proportional to the difference in device capacity (i.e. out-of-core). We use the two Tomobank datasets
peak performance between V100 and A100 (15.7 TFlops in V100 tomo_00029 and tomo_00030 in this evaluation. Figure 8 shows a
vs. 19.5 TFlops in A100 in single precision). We use the Roofline reconstructed slice of tomo_00030 at resolution of 512 × 512. We
model to analyze back-projection kernels on a single V100 GPU increase the output sizes gradually to go beyond device memory
using Nsight Compute [15] profiler as shown in Figure 12. We capacity. In Table 5, we list the pipelined computations and the per-
observe that the proposed CUDA kernel achieves a competitive formance of our system using a compute node with a single GPU
performance of ∼4.5 TFLOP/s, which is about 32.8% peak perfor- (V100/A100). Figure 10a presents a detailed example of generating
mance of the V100 GPU. This is very similar to the performance a 20483 volume (32GB) on a single V100 GPU.
of RTK, despite the extra redundant computation (e.g. the offset Our algorithm can generate volumes of any arbitrary sizes by
computations for K, Y, and Z in Listing 1) we use in our kernel to moving the projections once from host to device. The execution
enable the decomposition of input and volume. In summary, we time 𝑇𝐻 2𝐷 in Table 5 is nearly constant while the 𝑇𝐷2𝐻 is propor-
achieve performance that matches that of one of the most widely tional to the size of volume data. Note that the RTK library can not
used and optimized open libraries (RTK), despite increasing the generate volumes beyond 8GB and 20GB on a V100 and A100 GPU,
amount of computation. On top of that, our kernel provides the respectively. Our out-of-core performance is slightly better than
benefit of cone-beam decomposition that enables better scaling and the highly optimized algorithm in RTK. Furthermore, we can solve
also out-of-core capability. larger problems (i.e. 40963 volume).
It is important to note that the data movement between host and To sum up, we achieve out-of-core computing capability, without
device (𝑇𝐻 2𝐷 ) is overlapped with the filtering computation (𝑇𝑓 𝑙𝑡 ), sacrificing the back-projection performance on GPUs.
SC ’21, November 14–19, 2021, St. Louis, MO, USA Chen, P. et al.
6.3 Scalability & Performance thread, and the back-projection thread also waits for processed
This section reports the performance and scalability of our dis- data from the filter thread. On the other hand, when we scale input
tributed FPB framework. We configure the batch count parameter, size and not the volume, we eventually become bound by the PFS
i.e. number batches, at 𝑁𝑐 =8 in all runs. The batch sizes is calculated throughput.
as 𝑁𝑏 =𝑁𝑧 /(𝑁𝑔 ·𝑁𝑐 ) (according to Equations 10 and Equations 12). Discussion. In high-resolution CBCT, it is common to do 10s
In Figure 13, projected is the potential best runtime predicted by of repeated reconstructions after tuning the reconstruction param-
our performance model in Equation 17. According to Equation 15, eters to achieve a high-quality image result, e.g. Metal Artifact
the volumes sizes that are generated by each GPU and reduced in Reduction (MAR) processing [2]. Hence, the aggregate time saving
each MPI group may be expressed as 𝜂·𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧 ·𝑁𝑟 /𝑁𝑔𝑝𝑢𝑠 . Hence, when doing reconstruction at a large scale is substantial and con-
when 𝑁𝑔𝑝𝑢𝑠 =16, the sizes of the generated volumes in Figure 13a, tributes highly to productivity. Additionally, since the algorithm
b, c, and d are 256GB, 128GB, 128GB, and 64GB, respectively. This we propose enables out-of-core computing, one could also resort
demonstrates that our solution can generate volumes beyond the to using fewer resources (if there is budget constant for instance)
GPU memory capacity. and yet still be able to do high-resolution image reconstructions
Strong scaling. We elaborate on the strong scaling of our frame- for CBCT.
work in this paragraph. Figure 13 shows the strong scaling of several
datasets. Note that Figure 13b is a rebinning of the original coffee 6.4 Summary of Findings
bean dataset. All figures demonstrate that our implementation’s We summarize the findings of our experiments:
scaling matches that of the projected best runtime. According to • With up to 1024 V100 GPUs, we reconstruct CBCT volumes of
the performance model (Section 5), 𝑇𝑙𝑜𝑎𝑑 , 𝑇𝑓 𝑖𝑙𝑡𝑒𝑟 , 𝑇𝑏𝑝 , and 𝑇𝐷2𝐻 sizes 20483 and 40963 in 3s and 16s (including I/O), respectively.
decrease linearly with the number of GPUs. As shown in Figure 10b • By decomposing the projection in two dimensions, we improve
of the Bumblebee dataset using 128 GPUs, our implementation effi- performance and scalability (Figures 13, 14, 15). This is mainly be-
ciently overlaps the operations and approaches the potential peak cause the decomposition scheme achieves: a) insignificant redun-
performance. As expected with strong scaling, the performance dancy in operations, and b) effective pipelining of all operations
becomes flat as the number of GPUs increase (beyond 256 GPUs) as (end-to-end).
the observed overheads of I/O and communication start to dominate • Based on our decomposition scheme, we implement the first
the runtime. framework, to the authors knowledge, that support out-of-core
Weak scaling. This paragraph presents the weak scaling of our capability for cone-beam (8∼17 minutes for 40963 volumes on a
framework. In Figure 14, we show the weak scaling when gen- single V100 GPU).
erating 40963 volumes. We only present the weak scaling on the
coffee bean (Figure 14a) and bumblebee (Figure 14b) datasets due 7 CONCLUSION
to space limitations. In Figure 14a, the size of each projection is
In this work, we propose a novel algorithm to decompose the image
3928×1998, the size of the generated volume is 40963 . The evaluated
reconstruction problem for current generation cone-beam CT de-
pairs of (𝑁𝑝 , 𝑁𝑟 ) are (400, 1), (800, 2), . . . , (6401, 16). As Figure 14b
vices. The intensive computations can then be easily parallelized on
shows, to generate the volume data of size 40963 , the pairs of eval-
distributed systems, in a simple processing fashion similar to that
uated parameters (𝑁𝑝 , 𝑁𝑟 ) are (392, 1), (785, 2), . . . , (3142, 8). Differ-
of parallel-beam CT. We demonstrate the efficient performance and
ent MPI groups call the MPI_Reduce operation (𝑇𝑟𝑒𝑑𝑢𝑐𝑒 ) indepen-
scaling of our implementation using several real-world datasets on
dently, i.e., a segmented MPI_Reduce. Therefore 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 increases
the ABCI supercomputer. Also, we take advantage of the hetero-
slightly with more GPUs (i.e. more ranks) while the other opera-
geneous architecture of the GPU-accelerated supercomputers by
tions within each rank are basically constant: 𝑇𝑙𝑜𝑎𝑑 , 𝑇𝑓 𝑖𝑙𝑡𝑒𝑟 , and
orchestrating the different computations on the CPUs and GPUs.
𝑇𝑏𝑝 . According to our performance model, the projected runtime is
The proposed CUDA kernel supports out-of-core image reconstruc-
0 0 0 Í𝑁𝑐 −1 𝑖 Í𝑁𝑐 −1 𝑖
𝑇𝑠𝑡𝑜𝑟𝑒 .
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 = 𝑇𝐶𝑃𝑈 + 𝑇𝐺𝑃𝑈 + 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 + 𝑖=0 𝑇𝑠𝑡𝑜𝑟𝑒 ≈ 𝑖=0 tion. That is, we can use a single GPU to generate 40963 volumes
Since 𝐵𝑊𝑠𝑡𝑜𝑟𝑒 ≈ 28.5𝐺𝐵/𝑠 in the system we use in our experiments. that require a memory capacity of more than 256GB, far exceeding
The required time for storing a single 40963 volume is ∼9s, which the GPU’s memory capacity.
makes it the longest stage in the pipeline. Hence, the projected time
in Figure 14 becomes ∼9s since the performance model assumes a ACKNOWLEDGMENT
perfect overlap. This work was supported by JSPS KAKENHI under Grant Number
Performance. This paragraph analyzes the performance of our JP21K17750. This paper is based on results obtained from a project,
framework. Figure 15 shows the performance for generating the JPNP20006, commissioned by the New Energy and Industrial Tech-
volumes of size 40963 from different datasets. We can observe two nology Development Organization (NEDO). This research was par-
orders of magnitude speed up as we go from a single GPU to hun- tially supported by EPSRC grant EP/R002495/1 and EURAMET
dreds of GPUs. In Figure 13 and Figure 14, we show the potential grant 17IND08. This work was partially supported by JST-CREST
best runtime as predicted by our performance model using Equa- under Grant Number JPMJCR19F5; JST, PRESTO Grant Number
tion 17. The empirical results demonstrate that we can achieve 78% JPMJPR20MA, Japan. We would like to thank Endo Lab at Tokyo
of the peak performance on average. As Figure 10 shows, moving Institute of Technology for providing computing resources. The
and collecting data within a single MPI rank introduces most of author wishes to acknowledge useful discussions with Prof. Qinyou
the overhead, e.g. the filter thread waits for data from the load Hu at SMU and Dr. Jintao Meng at CAS.
Scalable Image Reconstruction Algorithm SC ’21, November 14–19, 2021, St. Louis, MO, USA
REFERENCES [22] Randolf Hanke, Theobald Fuchs, and Norman Uhlmann. 2008. X-ray based
[1] J Scott Armstrong and Fred Collopy. 1992. Error measures for generalizing about methods for non-destructive testing and material characterization. Nuclear
forecasting methods: Empirical comparisons. International journal of forecasting Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers,
8, 1 (1992), 69–80. Detectors and Associated Equipment 591, 1 (2008), 14–18.
[2] Sonja Vulcu Tomas Dobrocky Werner J. Z’Graggenm Franca Wagner Ar- [23] Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer
sany Hakim, Manuela Pastore-Wapp. 2019. Efficiency of Iterative Metal Ar- vision. Cambridge university press.
tifact Reduction Algorithm (iMAR) Applied to Brain Volume Perfusion CT in the [24] Sepideh Hatamikia, Ander Biguri, Gernot Kronreif, Michael Figl, Tom Russ,
Follow-up of Patients after Coiling or Clipping of Ruptured Brain Aneurysms. Joachim Kettenbach, Martin Buschmann, and Wolfgang Birkfellner. 2021. To-
Nature Scientific Reports 9, 19423 (2019), 201–213. ward on-the-fly trajectory optimization for C-arm CBCT under strong kinematic
[3] Thilo Balke, S. Majee, G. Buzzard, Scott Poveromo, P. Howard, M. Groeber, John constraints. PLOS ONE 16, 2 (02 2021), 1–17. https://doi.org/10.1371/journal.
McClure, and C. Bouman. 2018. Separable Models for cone-beam MBIR Recon- pone.0245508
struction. electronic imaging 2018 (2018). [25] Rong-Ting He, Ming-Gene Tu, Heng-Li Huang, Ming-Tzu Tsai, Jay Wu, and
[4] GM Besson. 2016. Seventh-generation CT. In Medical Imaging 2016: Physics Jui-Ting Hsu. 2019. Improving the prediction of the trabecular bone microarchi-
of Medical Imaging, Vol. 9783. International Society for Optics and Photonics, tectural parameters using dental cone-beam computed tomography. BMC Medical
978350. Imaging 19, 1 (2019), 10:1–10:9. https://doi.org/10.1186/s12880-019-0313-9
[5] Tekin Bicer, Doğa Gürsoy, Vincent De Andrade, Rajkumar Kettimuthu, William [26] I Henry and Ming Chen. 2012. An FPGA Architecture for Real-Time 3-D Tomo-
Scullin, Francesco De Carlo, and Ian T. Foster. 2017. Trace: a high-throughput graphic Reconstruction. Ph.D. Dissertation. University of California, Los Angeles.
tomographic reconstruction engine for large-scale datasets. Advanced Structural [27] Mert Hidayetoğlu, Tekin Biçer, Simon Garcia De Gonzalo, Bin Ren, Doğa Gür-
and Chemical Imaging 3, 1 (jan 2017). https://doi.org/10.1186/s40679-017-0040-7 soy, Rajkumar Kettimuthu, Ian T Foster, and Wen-mei W Hwu. 2019. Memxct:
[6] Ander Biguri, Reuben Lindroos, Robert Bryll, Hossein Towsyfyan, Hans Deyhle, Memory-centric x-ray ct reconstruction with massive parallelization. In Proceed-
Ibrahim El khalil Harrane, Richard Boardman, Mark Mavrogordato, Manjit ings of the International Conference for High Performance Computing, Networking,
Dosanjh, Steven Hancock, and Thomas Blumensath. 2020. Arbitrarily large Storage and Analysis. 1–56.
tomography with iterative algorithms on multiple GPUs using the TIGRE tool- [28] Mert Hidayetoğlu, Tekin Bicer, Simon Garcia de Gonzalo, Bin Ren, Vincent De An-
box. J. Parallel and Distrib. Comput. 146 (2020), 52 – 63. https://doi.org/10.1016/ drade, Doga Gursoy, Raj Kettimuthu, Ian T. Foster, and Wen-mei W. Hwu. 2020.
j.jpdc.2020.07.004 Petascale XCT: 3D Image Reconstruction with Hierarchical Communications
[7] Javier Garcia Blas, Monica Abella, Florin Isaila, Jesus Carretero, and Manuel Desco. on Multi-GPU Nodes. In Proceedings of the International Conference for High
2014. Surfing the optimization space of a multiple-GPU parallel implementation Performance Computing, Networking, Storage and Analysis (SC ’20). IEEE Press,
of a X-ray tomography reconstruction algorithm. Journal of Systems and Software Article 37, 13 pages.
95 (2014), 166–175. [29] Johannes Hofmann, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Per-
[8] Brian Cabral, Nancy Cam, and Jim Foran. 1994. Accelerated volume rendering formance engineering for a medical imaging application on the Intel Xeon Phi
and tomographic reconstruction using texture mapping hardware. In Proceedings accelerator. In ARCS 2014; 2014 Workshop Proceedings on Architecture of Computing
of the 1994 symposium on Volume visualization. 91–98. Systems. VDE, 1–8.
[9] Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, and Satoshi [30] F. Ino, Y. Okitsu, T. Kishi, S. Ohnishi, and K. Hagihara. 2010. Out-of-core cone
Matsuoka. 2019. iFDK: A Scalable Framework for Instant High-Resolution beam reconstruction using multiple GPUs. In 2010 IEEE International Symposium
Image Reconstruction. In Proceedings of the International Conference for High on Biomedical Imaging: From Nano to Macro. 792–795. https://doi.org/10.1109/
Performance Computing, Networking, Storage and Analysis (SC ’19). Associ- ISBI.2010.5490055
ation for Computing Machinery, New York, NY, USA, Article 84, 24 pages. [31] Intel. 2021. Intel MPI Benchmarks User Guide. https://software.intel.com/
https://doi.org/10.1145/3295500.3356163 content/www/us/en/develop/documentation/imb-user-guide/top.html [Online;
[10] Srdjan Coric, Miriam Leeser, Eric Miller, and Marc Trepanier. 2002. Parallel- accessed 27-May-2021].
beam backprojection: an FPGA implementation optimized for medical imaging. [32] DA Jaffray and JH Siewerdsen. 2000. Cone-beam computed tomography with
In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field- a flat-panel imager: initial performance characterization. Medical physics 27, 6
programmable gate arrays. ACM, 217–226. (2000), 1311–1323.
[11] NVIDIA CUDA. 2021. CUDA Toolkit Documentation. NVIDIA Developer Zone. [33] Nicolai M Josuttis. 2012. The C++ standard library: a tutorial and reference.
http://docs.nvidia.com/cuda/index.html (2021). (2012).
[12] Jingyu Cui, Guillem Pratx, Bowen Meng, and Craig S Levin. 2013. Distributed [34] Avinash C.. Kak and Malcolm Slaney. 1988. Principles of computerized tomographic
MLEM: An iterative tomographic image reconstruction algorithm for distributed imaging. IEEE press New York.
memory architectures. IEEE transactions on medical imaging 32, 5 (2013), 957–967. [35] Vladimir Kasik, Martin Cerny, Marek Penhaker, Václav Snášel, Vilem Novak, and
[13] Francesco De Carlo, Doğa Gürsoy, Daniel J Ching, K Joost Batenburg, Wolfgang Radka Pustkova. 2012. Advanced CT and MR image processing with FPGA. In
Ludwig, Lucia Mancini, Federica Marone, Rajmund Mokso, Daniël M Pelt, Jan International Conference on Intelligent Data Engineering and Automated Learning.
Sijbers, et al. 2018. TomoBank: a tomographic data repository for computational Springer, 787–793.
x-ray science. Measurement Science and Technology 29, 3 (2018), 034004. [36] Jean Pierre Kruth, Markus Bartscher, Simone Carmignato, Robert Schmitt,
[14] W De Vos, Jan Casselman, and GRJ Swennen. 2009. Cone-beam computerized Leonardo De Chiffre, and Albert Weckenmann. 2011. Computed tomography for
tomography (CBCT) imaging of the oral and maxillofacial region: a systematic dimensional metrology. CIRP annals 60, 2 (2011), 821–842.
review of the literature. International journal of oral and maxillofacial surgery 38, [37] Wenxuan Liang, Hui Zhang, and Guangshu Hu. 2010. Optimized implementation
6 (2009), 609–625. of the FDK algorithm on one digital signal processor. Tsinghua Science and
[15] Nvidia Developer Tools Document. 2021. Nvidia Nsight Compute. https://docs. Technology 15, 1 (2010), 108–113.
nvidia.com/nsight-compute/NsightComputeCli/index.html [Online; accessed [38] Yuechao Lu, Fumihiko Ino, and Kenichi Hagihara. 2016. Cache-aware GPU
27-May-2021]. optimization for out-of-core cone beam CT reconstruction of high-resolution
[16] Daniel Castaño Díez, Hannes Mueller, and Achilleas S Frangakis. 2007. Imple- volumes. IEICE TRANSACTIONS on Information and Systems 99, 12 (2016), 3060–
mentation and performance evaluation of reconstruction algorithms on graphics 3071.
processors. Journal of Structural Biology 157, 1 (2007), 288–295. [39] John B Ludlow and Marija Ivanovic. 2008. Comparative dosimetry of dental
[17] Anders Eklund, Paul Dufort, Daniel Forsberg, and Stephen M. LaConte. 2013. CBCT devices and 64-slice CT for oral and maxillofacial radiology. Oral Surgery,
Medical image processing on the GPU – Past, present and future. Medical Image Oral Medicine, Oral Pathology, Oral Radiology, and Endodontology 106, 1 (2008),
Analysis 17, 8 (2013), 1073 – 1094. https://doi.org/10.1016/j.media.2013.05.008 106–114.
[18] Andriy Fedorov, Reinhard Beichel, Jayashree Kalpathy-Cramer, Julien Finet, Jean- [40] Dmitri Matenine, Geoffroi Côté, Julia Mascolo-Fortin, Yves Goussard, and
Christophe Fillion-Robin, Sonia Pujol, Christian Bauer, Dominique Jennings, Philippe Després. 2018. System matrix computation vs storage on GPU: A
Fiona Fennessy, Milan Sonka, et al. 2012. 3D Slicer as an image computing comparative study in cone beam CT. Medical physics 45, 2 (2018), 579–588.
platform for the Quantitative Imaging Network. Magnetic resonance imaging 30, [41] Klaus Mueller, F Xu, and N Neophytou. 2007. Why do GPUs work so well for
9 (2012), 1323–1341. acceleration of CT? SPIE Electronic Imaging07 (2007). http://cvc.cs.stonybrook.
[19] LA Feldkamp, LC Davis, and JW Kress. 1984. Practical cone-beam algorithm. edu/Publications/2007/MXN07a
JOSA A 1, 6 (1984), 612–619. [42] Nassir Navab, A Bani-Hashemi, Mariappan S Nadar, Karl Wiesent, Peter Durlak,
[20] Yushan Gao, Ander Biguri, and Thomas Blumensath. 2019. Block stochastic Thomas Brunner, Karl Barth, and Rainer Graumann. 1998. 3D reconstruction from
gradient descent for large-scale tomographic reconstruction in a parallel network. projection matrices in a C-arm based 3D-angiography system. In International
arXiv preprint arXiv:1903.11874 (2019). Conference on Medical Image Computing and Computer-Assisted Intervention.
[21] Jens Gregor and Thomas Benson. 2008. Computational analysis and improvement Springer, 119–129.
of SIRT. IEEE transactions on medical imaging 27, 7 (2008), 918–924. [43] Brian Nett. 2020. Animated CT Generations for Radiologic Technologists. https:
//howradiologyworks.com/ctgenerations/
SC ’21, November 14–19, 2021, St. Louis, MO, USA Chen, P. et al.
[44] Willem Jan Palenstijn, Jeroen Bédorf, and K Joost Batenburg. 2015. A distributed [68] Fang Xu and Klaus Mueller. 2005. Accelerating popular tomographic recon-
SIRT implementation for the ASTRA toolbox. Proc. Fully Three-Dimensional struction algorithms on commodity PC graphics hardware. IEEE Transactions on
Image Reconstruct. Radiol. Nucl. Med (2015), 166–169. nuclear science 52, 3 (2005), 654–663.
[45] Xiaochuan Pan, Emil Y Sidky, and Michael Vannier. 2009. Why do commercial [69] Xinwei Xue, Arvi Cheryauka, and David Tubbs. 2006. Acceleration of fluoro-CT
CT scanners still employ traditional, filtered back-projection for image recon- reconstruction for a mobile C-Arm on GPU and FPGA hardware: a simulation
struction? Inverse problems 25, 12 (2009), 123009. study. In Medical Imaging 2006: Physics of Medical Imaging, Vol. 6142. International
[46] Ruben Pauwels, Jilke Beinsberger, Bruno Collaert, Chrysoula Theodorakou, Jes- Society for Optics and Photonics, 61424L.
sica Rogers, Anne Walker, Lesley Cockmartin, Hilde Bosmans, Reinhilde Jacobs, [70] Kai Yang, Alexander LC Kwan, DeWitt F Miller, and John M Boone. 2006. A
Ria Bogaerts, et al. 2012. Effective dose range for dental cone beam computed geometric calibration method for cone beam CT systems. Medical physics 33,
tomography scanners. European journal of radiology 81, 2 (2012), 267–271. 6Part1 (2006), 1695–1706.
[47] N Rezvani, D Aruliah, K Jackson, D Moseley, and J Siewerdsen. 2007. SU-FF-I-16: [71] X-ray Tomography Solutions ZEISS. 2021. High Resolution 3D X-ray Microscopy
OSCaR: An open-source cone-beam CT reconstruction tool for imaging research. and Computed Tomography. https://www.zeiss.com/microscopy/int/products/x-
Medical Physics 34, 6Part2 (2007), 2341–2341. ray-microscopy.html. [Online; accessed 27-May-2021].
[48] John C Russ. 1990. Image processing. In Computer-assisted microscopy. Springer, [72] Timo Zinsser and Benjamin Keck. 2013. Systematic performance optimization of
33–69. cone-beam back-projection on the Kepler architecture. Proceedings of the 12th
[49] Mohammad Saadatfar, Francisco García-Moreno, S. Hutzler, A.P. Sheppard, Mark Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine
Knackstedt, John Banhart, and Denis Weaire. 2009. Imaging of metallic foams (2013), 225–228.
using X-ray micro-CT. Colloids and Surfaces A: Physicochemical and Engineering [73] Yu Zou and Xiaochuan Pan. 2004. Exact image reconstruction on PI-lines from
Aspects 344 (07 2009), 107–112. https://doi.org/10.1016/j.colsurfa.2009.01.008 minimum data in helical cone-beam CT. Physics in Medicine & Biology 49, 6
[50] Amit Sabne, Xiao Wang, Sherman J Kisner, Charles A Bouman, Anand Raghu- (2004), 941.
nathan, and Samuel P Midkiff. 2017. Model-based iterative CT image reconstruc-
tion on GPUs. ACM SIGPLAN Notices 52, 8 (2017), 207–220.
[51] Paul Sack and William Gropp. 2010. A scalable mpi_comm_split algorithm for
exascale computing. In European MPI Users’ Group Meeting. Springer, 1–10.
[52] Holger Scherl, Markus Kowarschik, Hannes G Hofmann, Benjamin Keck, and
Joachim Hornegger. 2012. Evaluation of state-of-the-art hardware architectures
for fast cone-beam CT reconstruction. Parallel computing 38, 3 (2012), 111–124.
[53] Lawrence A Shepp and Benjamin F Logan. 1974. The Fourier reconstruction of a
head section. IEEE Transactions on nuclear science 21, 3 (1974), 21–43.
[54] Lawrence A Shepp and Yehuda Vardi. 1982. Maximum likelihood reconstruction
for emission tomography. IEEE transactions on medical imaging 1, 2 (1982),
113–122.
[55] Rainer Stoessel, Denis Kiefel, Reinhold Oster, Björn Diewel, and L Llopart Prieto.
2011. 𝜇 -computed tomography for 3d porosity evaluation in Carbon Fibre Rein-
forced Plastics (CFRP). In International Symposium on Digital Industrial Radiology
and Computed Tomography.
[56] Frederick C Strong. 1952. Theoretical basis of Bouguer-Beer law of radiation
absorption. Analytical Chemistry 24, 2 (1952), 338–342.
[57] Nikhil Subramanian. 2009. A C-to-FPGA solution for accelerating tomographic
reconstruction. Ph.D. Dissertation. University of Washington.
[58] Stewart Taylor. 2007. Optimizing applications for multi-core processors: using the
intel integrated performance primitives. Intel.
[59] Ming-Tzu Tsai, Rong-Ting He, Heng-Li Huang, Ming-Gene Tu, and Jui-Ting
Hsu. 2020. Effect of Scanning Resolution on the Prediction of Trabecular Bone
Microarchitectures Using Dental Cone Beam Computed Tomography. Diagnostics
10, 6 (2020). https://doi.org/10.3390/diagnostics10060368
[60] Wim Van Aarle, Willem Jan Palenstijn, Jeroen Cant, Eline Janssens, Folkert
Bleichrodt, Andrei Dabravolski, Jan De Beenhouwer, K Joost Batenburg, and
Jan Sijbers. 2016. Fast and flexible X-ray tomography using the ASTRA toolbox.
Optics express 24, 22 (2016), 25129–25147.
[61] Richard Wilson Vuduc. 2003. Automatic performance tuning of sparse matrix
kernels. Vol. 1. Citeseer.
[62] Xiao Wang, Amit Sabne, Sherman J. Kisner, Anand Raghunathan, Charles A.
Bouman, and Samuel P. Midkiff. 2016. High Performance Model-Based Image
Reconstruction. 21st ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP’16) (2016), 2:1–2:12. https://github.com/HPImaging/
sv-mbirct
[63] Xiao Wang, Amit Sabne, Putt Sakdhnagool, Sherman J Kisner, Charles A Bouman,
and Samuel P Midkiff. 2017. Massively parallel 3D image reconstruction. In
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis. 1–12.
[64] Xiao Wang, Venkatesh Sridhar, Zahra Ronaghi, Rollin Thomas, Jack Deslippe, Dil-
worth Parkinson, Gregery T Buzzard, Samuel P Midkiff, Charles A Bouman, and
Simon K Warfield. 2019. Consensus equilibrium framework for super-resolution
and extreme-scale CT reconstruction. In Proceedings of the International Confer-
ence for High Performance Computing, Networking, Storage and Analysis. 1–23.
[65] Jason M Warnett, Valeriy Titarenko, Ercihan Kiraci, Alex Attridge, William RB
Lionheart, Philip J Withers, and Mark A Williams. 2016. Towards in-process
x-ray CT for dimensional metrology. Measurement Science and Technology 27, 3
(2016), 035401.
[66] Karl Wiesent, Karl Barth, Nassir Navab, Peter Durlak, Thomas Brunner, Oliver
Schuetz, and Wolfgang Seissler. 2000. Enhanced 3-D-reconstruction algorithm
for C-arm systems suitable for interventional procedures. IEEE transactions on
medical imaging 19, 5 (2000), 391–403.
[67] Michael A Wu. 1991. ASIC applications in computed tomography systems. In
ASIC Conference and Exhibit, 1991. Proceedings., Fourth Annual IEEE International.
IEEE, P1–3.