Scalable FBP Decomposition For Cone-Beam CT Reconstruction: November 2021

See discussions, stats, and author profiles for this publication at: https://www.researchgate.
net/publication/356171619
Scalable FBP Decomposition for Cone-beam CT Reconstruction
Conference Paper · November 2021

DOI: 10.1145/3458817.3476139
CITATIONS READS
0 115
9 authors, including:
Peng Chen Xiao Wang

National Institute of Advanced Industrial Science and Technology Oak Ridge National Laboratory
22 PUBLICATIONS 32 CITATIONS 19 PUBLICATIONS 131 CITATIONS
SEE PROFILE SEE PROFILE
Takahiro Hirofuchi Hirotaka Ogawa

National Institute of Advanced Industrial Science and Technology National Institute of Advanced Industrial Science and Technology
Some of the authors of this publication are also working on these related projects:
General method for motion correction in X-ray CT View project
Gridvon View project
All content following this page was uploaded by Peng Chen on 17 November 2021.
The user has requested enhancement of the downloaded file.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/356171619
Scalable FBP Decomposition for Cone-beam CT Reconstruction
Conference Paper · November 2021

DOI: 10.1145/3458817.3476139
CITATIONS READS
0 15
9 authors, including:
Peng Chen Xiao Wang

National Institute of Advanced Industrial Science and Technology Oak Ridge National Laboratory
Takahiro Hirofuchi Hirotaka Ogawa

National Institute of Advanced Industrial Science and Technology National Institute of Advanced Industrial Science and Technology
Some of the authors of this publication are also working on these related projects:
OpenJIT View project
Super-Voxel View project
All content following this page was uploaded by Peng Chen on 14 November 2021.
The user has requested enhancement of the downloaded file.

Scalable FBP Decomposition for Cone-Beam CT Reconstruction
Peng Chen Mohamed Wahib Xiao Wang
National Institute of Advanced National Institute of Advanced Oak Ridge National Laboratory, US
Industrial Science and Technology Industrial Science and Technology Boston Children’s Hospital, US
RIKEN CCS, Hyogo, Japan RIKEN CCS, Hyogo, Japan xiao.wang2@childrens.harvard.edu
chin.hou@aist.go.jp mohamed.attia@aist.go.jp
Takahiro Hirofuchi Hirotaka Ogawa Ander Biguri

National Institute of Advanced National Institute of Advanced Institute of Nuclear Medicine,
Industrial Science and Technology Industrial Science and Technology University College London, UK
t.hirofuchi@aist.go.jp h-ogawa@aist.go.jp a.biguri@ucl.ac.uk
Richard Boardman Thomas Blumensath Satoshi Matsuoka

𝜇-VIS X-Ray Imaging Centre, 𝜇-VIS X-Ray Imaging Centre, Tokyo Institute of Technology, Japan
University of Southampton, UK University of Southampton, UK RIKEN CCS, Hyogo, Japan
rpb@soton.ac.uk thomas.blumensath@soton.ac.uk matsu@acm.org
ABSTRACT ACM Reference Format:

Filtered Back-Projection (FBP) is a fundamental compute intense Peng Chen, Mohamed Wahib, Xiao Wang, Takahiro Hirofuchi, Hirotaka
Ogawa, Ander Biguri, Richard Boardman, Thomas Blumensath, and Satoshi
algorithm used in tomographic image reconstruction. Cone-Beam
Matsuoka. 2021. Scalable FBP Decomposition for Cone-Beam CT Recon-
Computed Tomography (CBCT) devices use a cone-shaped X-ray struction. In The International Conference for High Performance Computing,
beam, in comparison to the parallel beam used in older CT gen- Networking, Storage and Analysis (SC ’21), November 14–19, 2021, St. Louis,
erations. Distributed image reconstruction of cone-beam datasets MO, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/
typically relies on dividing batches of images into different nodes. 3458817.3476139
This simple input decomposition, however, introduces limits on
input/output sizes and scalability.
We propose a novel decomposition scheme and reconstruction
algorithm for distributed FPB. This scheme enables arbitrarily large 1 INTRODUCTION
input/output sizes, eliminates the redundancy arising in the end- Computed Tomography (CT) imaging has a central role in clinical
to-end pipeline and improves the scalability by replacing two com- and scientific applications. There exists two classes of image re-
munication collectives with only one segmented reduction. Finally, construction algorithms: one is the Filtered Back-projection (FBP)
we implement the proposed decomposition scheme in a framework approaches such as FDK [19] and Katsevich [73] methods. The other
that is useful for all current-generation CT devices (7𝑡ℎ gen). In is iterative image reconstruction (IR) algorithms, such as MBIR [62],
our experiments using up to 1024 GPUs, our framework can con- SIRT [21], and MLEM [12]. However, FBP is commonly regarded as
struct 40963 volumes, for real-world datasets, in under 16 seconds the standard image reconstruction for most of the production CT
(including I/O). systems [45]. FBP employs filtering computation (i.e. convolution)
and back-projection to generate 3D tomographic images (named
CCS CONCEPTS volume) from thousands of 2D projections, such as X-ray and elec-
• Computing methodologies → Massively parallel algorithms; tron microscopy images. In recent years, significant effort has been
Image processing. underway to design image reconstruction algorithms for large scale
HPC systems [5, 9, 12, 20, 28, 44, 63, 64], mainly for the purpose
of clinical diagnosis, non-destructive testing [22], and dimensional
KEYWORDS
metrology [36, 65].
HPC, Image Reconstruction, FBP, GPU There is a strong motivation to optimize the FBP algorithms for
Cone-Beam CT (CBCT) given that the current generation produc-
tion CT devices use cone-beam (as Figure 1 illustrates). CBCT de-
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed vices are abundantly used in clinical and industrial applications [14,
for profit or commercial advantage and that copies bear this notice and the full citation 24, 24, 39, 46]. Additionally, CBCT is the main technology used
on the first page. Copyrights for components of this work owned by others than ACM in microscopic X-ray CT devices [71] due to the 3D cone-shaped
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a geometry caused by device optics. Unlike the existing algorithms
fee. Request permissions from permissions@acm.org. that use parallel-beam CT as shown in Figure 1a (e.g. Tekin et al. [5],
SC ’21, November 14–19, 2021, St. Louis, MO, USA Xiao et al. [63, 64] and Mert et al. [27, 28]), this work targets the cur-
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8442-1/21/11. . . $15.00 rent generation of production CT devices by parallelizing and scale
https://doi.org/10.1145/3458817.3476139 the computation of CBCT on GPU-accelerated supercomputers.
SC ’21, November 14–19, 2021, St. Louis, MO, USA Chen, P. et al.
of the cone-shaped projections. This solves the problem of

scaling FBP for CBCT datasets by reducing the communica-
tion to a segmented reduction.
• We implemented the proposed algorithm in the form of a
back-projection CUDA kernel, which can perform compu-
(a) Parallel beam (b) Fan beam (c) Cone beam
tations in a streaming fashion. This not only improves scal-
ability but also allows us to use a single GPU to solve a
Figure 1: Different geometries for X-ray sources and detec- high-resolution problem in an out-of-core fashion (generat-
tors. Cone-beam is the geometry used in the latest (7𝑡ℎ gen- ing an 40963 volume).
eration) of CT [4, 43]. • We evaluate our framework on a set of real-world problems
on up to 1000s of GPUs. We can generate 20483 and 40963
volume data on 1, 024 GPUs in ∼3 seconds and ∼16 seconds
There are several challenges to optimize the FBP algorithms (including I/O), respectively.
for distributed CBCT reconstruction in seconds or minutes. Back-
The rest of this paper is organized as follows: In Section 2, we
projection is the computational bottleneck in the FBP algorithm.
elaborate on the background. Section 3 presents the proposed algo-
The complexity of filtering computation is O (𝑁 2𝑙𝑜𝑔(𝑁 )) and back-
rithms. Section 4 shows our implementation on a distributed system.
projection is O (𝑁 4 ) [38, 52]. This is a very high computational cost
In Section 5, we build a performance model. Section 6 displays the
when considering the increasingly large input sizes, e.g. the coffee
evaluated result. Finally, Section 7 concludes.
bean dataset we evaluate in this paper includes more than 177GB
of projections. Scaling the back-projection for CBCT is limited by
the manner in which the problem is decomposed. More specifically,
2 BACKGROUND AND RELATED WORK
decomposing projections (images) of cone-beamed geometry would This section introduces image reconstruction using the FBP algo-
require an algorithm that handles the complexity of constructing rithm on Cone-Beam CT and elaborates on the related work.
sub-volumes from cone-shaped slices of images. This is the reason
that distributed CBCT frameworks (summarized in Table 2) only 2.1 Projection (Image) Acquisition
decompose the images into batches, and do not split the images. As Figure 2 shows, the measurement setup of a 3D volumetric CT
This leads to scalability limitations and forces constraints on the includes an X-ray source, an object to be scanned, and an X-ray
input image sizes due to device capacity (as will be discussed later). imaging sensor (or detector). The object is placed at the center of
We propose a novel CBCT distributed image reconstruction al- scanner system and rotates in a range of 360◦ . The 3D volume of the
gorithm that could operate on input images that are sliced in a scanned object can be reconstructed from hundreds or thousands
cone-shaped geometry. This algorithm, implemented as part of a of projection images. Figure 1 shows the typical X-ray and detector
framework, is capable of operating on sliced images to improve geometries. The photon count 𝜆 can be observed by the X-ray sensor.
scalability. That is, we can decompose the input and output at two According to Beer’s Law [56], the projection is measured as
and one dimensions, respectively. This is in comparison to the tra- 𝜆 − 𝜆𝑑𝑎𝑟𝑘
ditional CBCT algorithms that decompose for one dimension of the 𝑃 = −𝑙𝑜𝑔( ) (1)
𝜆𝑏𝑙𝑎𝑛𝑘 − 𝜆𝑑𝑎𝑟𝑘
input (i.e. batches of images). The decomposition scheme results in
finer granularity of image reconstruction problems and better local- where 𝜆𝑑𝑎𝑟𝑘 and 𝜆𝑏𝑙𝑎𝑛𝑘 are the background offset and normaliza-
ity since each of the sub-volume computing tasks can be processed tion scan, respectively. The output P is commonly used for generat-
independently. The data communication overhead can be reduced ing the 3D volume by most of the image reconstruction algorithms.
greatly from the typical distributed cone-beam implementations at
which only the batch dimension is split. In addition, decomposing 2.2 FBP-based Image Reconstruction
the problem in the manner we propose eliminates the need for se- We briefly introduce the CBCT system and the 3D image recon-
quential communication and storing of results; the communication struction by the FBP algorithm as presented [19, 32].
and storage of sub-volumes can be overlapped with computation. It
is important to note that in addition to the algorithm’s scalability, it 2.2.1 Geometric Parameters. Table 1 lists all the geometric param-
can also be a solution for high-resolution out-of-core CBCT image eters for a CBCT system (illustrated in Figure 2). The X-ray source
reconstruction problems. is a micro-focus X-ray tube. The X-ray sensor of the Flat Panel De-
We can also perform each image reconstruction task on accelera- tector (FPD) is a type of digital radiography imaging sensor similar
tors in a streaming fashion that overlaps operations in an elaborate to digital photography. The distances of the source to the rotation
pipeline. We can generate high-resolution volumes (i.e. 20483 and axis (namely Z-axis) and FPD are 𝐷𝑠𝑜 and 𝐷𝑠𝑑 , respectively. The
40963 volumes) by moving the projections from host to device only sizes of FPD in the unit of pixels are 𝑁𝑢 and 𝑁 𝑣 , respectively. Note
once, in comparison to the typical CBCT decomposition scheme that the U-axis and V-axis of FPD are parallel to X-axis and Z-axis,
that redundantly moves the data several times (as in the algorithm respectively. The size of the 3D volume in the unit of voxels is 𝑁𝑥 ,
by Lu et al. [38]). 𝑁 𝑦 , and 𝑁𝑧 , respectively. The pixels in volume data (or 3D images)
The contributions in this paper are as follows: are often called voxels.
• We propose a novel image reconstruction algorithm that re- 2.2.2 Magnification Factor. Unlike parallel-beam CT, CBCT sys-
constructs the volume from a cone-geometry decomposition tems can magnify the scanned object based on the cone-shaped
Scalable Image Reconstruction Algorithm SC ’21, November 14–19, 2021, St. Louis, MO, USA
Z-axis Algorithm 1: 3D Back-projection algorithm. SubPixel is a bi-

linear interpolation function [48].
𝑁𝑦 Input: P is projection of size 𝑁𝑝 ×𝑁 𝑣 ×𝑁𝑢 , Mat is projection
O matrix of size 𝑁𝑝 ×3×4 (details in Sec. 4.1).
𝑁𝑣 𝑁𝑧 Cone-beam Output: I is output volume data of size 𝑁𝑧 ×𝑁 𝑦 ×𝑁𝑥 .
X-ray source 1 𝐼 ←0
2 for s ← 0 to 𝑁𝑝 -1 do
for k ← 0 to 𝑁𝑧 -1 do
V-axis
3
4 for j ← 0 to 𝑁 𝑦 -1 do
5 for i ← 0 to 𝑁𝑥 -1 do
Figure 2: Cone-beam CT with a Flat Panel Detector (FPD). 6 𝑧 ← ⟨𝑀𝑎𝑡 [𝑠] [2], [𝑖, 𝑗, 𝑘, 1]⟩ ⊲ Eqn 8
7 𝑥 ← ⟨𝑀𝑎𝑡 [𝑠] [0], [𝑖, 𝑗, 𝑘, 1]⟩/𝑧 ⊲ Eqn 8
Table 1: The parameters of a CBCT system. 8 𝑦 ← ⟨𝑀𝑎𝑡 [𝑠] [1], [𝑖, 𝑗, 𝑘, 1]⟩/𝑧 ⊲ Eqn 8
Description Parameter Unit 9 𝐼 [𝑘] [ 𝑗] [𝑖]←𝐼 [𝑘] [ 𝑗] [𝑖] + 𝑧12 ·𝑆𝑢𝑏𝑃𝑖𝑥𝑒𝑙 (𝑃 [𝑠], 𝑥, 𝑦)
Rotation angle 𝜙 degree
Distance from source to object (Z-axis) 𝐷𝑠𝑜 mm
Distance from source to detector (FPD) 𝐷𝑠𝑑 mm 10 Function SubPixel(𝑄, 𝑥, 𝑦)
The number of 2D projections 𝑁𝑝 — 11 [𝑖𝑢 , 𝑖 𝑣 ]←[int(𝑥), int(𝑦)] ⊲ 𝑖𝑢 , 𝑖 𝑣 are integers
The width and height of a 2D projection 𝑁𝑢 , 𝑁 𝑣 pixel 12 [𝜀𝑢 , 𝜀 𝑣 ]←[𝑥 − 𝑖𝑢 , 𝑦 − 𝑖 𝑣 ] ⊲ 𝜀𝑢 , 𝜀 𝑣 ∈ [0, 1)
Pixel pitch at U- and V-axis △𝑢 , △𝑣 mm/pixel 13 𝑡 1 ←𝑄 [𝑖 𝑣 ] [𝑖𝑢 ]·(1 − 𝜀𝑢 ) + 𝑄 [𝑖 𝑣 ] [𝑖𝑢 + 1]·𝜀𝑢 ⊲ interp
A 3×4 projection matrix at angle 𝜙 (Sec. 4.1) 𝑀𝜙 —
The number of voxels in X-, Y-, Z-axis 𝑁 𝑥 , 𝑁 𝑦 , 𝑁𝑧 voxel
14 𝑡 2 ←𝑄 [𝑖 𝑣 + 1] [𝑖𝑢 ]·(1 − 𝜀𝑢 ) + 𝑄 [𝑖 𝑣 + 1] [𝑖𝑢 + 1]·𝜀𝑢 ⊲ interp
Voxel pitch at X-, Y-, and Z-axis △𝑥 , △𝑦 , △𝑧 mm/voxel 15 return 𝑡 1 ·(1 − 𝜀 𝑣 ) + 𝑡 2 ·𝜀 𝑣 ⊲ 2D interpolation value
Offset of FPD at U- and V-axis (Figure 7a) 𝜎𝑢 , 𝜎𝑣 pixel
Rotation center offset (Figure 7b) 𝜎𝑐𝑜𝑟 mm
each pixel value as line 9 shows. The SubPixel in Algorithm 1 demon-

geometry. Taking the coffee bean dataset we use as an instance strates the bilinear interpolation function [48]. The computational
(Section 6.1), the magnification factor (𝐷𝑠𝑑 /𝐷𝑠𝑜 ) is as big as 9.48. complexity of back-projection may be O (𝑁 4 ) [52].
However, the magnification enablement comes at the cost of compli-
cating the input decomposition in distributed computation. To solve 2.3 Related Work
this challenge, the proposed algorithm employs a novel projection- In the past decades, a wide variety of accelerators with different
based methodology that allows us to split the volumes and projec- computing architectures have been used to optimize the image
tions. reconstruction algorithms, e.g. ASIC [67], Xeon Phi processor [29],
FPGA [10, 26, 35, 57, 69], DSP [37], and GPUs [8, 16, 17, 28, 41, 47,
2.2.3 Filtering computation. The filtering computation includes a 50, 68, 72].
point-wise multiplication and a 1D linear convolution Table 2 lists the recent work and shows the trend of optimizing
𝑃𝜙 (𝑢, 𝑣) = { q
𝐷𝑠𝑑
·𝑃𝜙 (𝑢, 𝑣)} ∗ 𝑓𝑟𝑎𝑚𝑝 the image reconstruction algorithms using multi-GPU and large-
(2) scale systems. There are many parallel-beam based image recon-
𝐷 (𝑢, 𝑣) 2 + 𝐷𝑠𝑑 2
struction algorithms. Trace [5], is a distributed framework for IR
where 𝐷 (𝑢, 𝑣) 2 = {△𝑢 (𝑢 − 𝑁𝑢 /2)}2 + {△𝑣 (𝑣 − 𝑁 𝑣 /2)}2 , 𝑃𝜙 is a two- algorithms (i.e. MLEM [54], SIRT [21]). Trace enables fine-grained
dimensional projection at angle 𝜙 and 𝑓𝑟𝑎𝑚𝑝 is an one-dimensional parallelization techniques, e.g. the shared and distributed memory
Ramp filter as illustrated in [19, 34]. Due to the use of FFT primitive parallelization. Using 32 compute nodes (384 cores), Trace achieved
to perform the convolution computation, the complexity of filtering 158× speedup in comparison to a single-core configuration. Xiao
computation becomes O (𝑁 2𝑙𝑜𝑔(𝑁 )) [52]. et al. [62] proposed a Non-Uniform Parallel Super-Voxel (NU-PSV)
algorithm to optimize the MBIR algorithm resulting in better data
2.2.4 Back-projection. Algorithm 1 shows the back-projection im-
locality, massive parallelism, and fast convergence. On a 69,632-
plementation in RTK 1 library. P is the set of filtered images of size
core distributed system, the NU-PSV algorithm is on average 1,665×
𝑁𝑝 ×𝑁 𝑣 ×𝑁𝑢 , Mat is the corresponding projection matrix of size
faster than the baseline implementations. In [64], Xiao et al. took
𝑁𝑝 × 4 × 3 for all projections (more details in Section 4.1). The Mat
advantage of Consensus Equilibrium algorithm to alleviate the
can be derived as 𝑀𝑎𝑡 [𝑠] = 𝑀𝜙 , where s∈[0, 𝑁𝑝 ) and 𝜙 = 360·𝑠/𝑁𝑝
memory footprint pressure when parallelizing the MBIR algorithm
(a full scan). The projection operation (as formulated in Equation 8)
achieving 3.5 PFLOP/s using 58,7520 cores. MemXCT was imple-
is presented in Algorithm 1 lines 6∼8. Clearly, each computation
mented to accelerate the IR algorithm by Mert et al. [27] on 4,096
requires three inner product operations 𝑀𝜙 · [𝑖, 𝑗, 𝑘, 1]𝑇 . To update
KNL nodes (256K cores) and achieved real-time image reconstruc-
the volume data, the pixel values at sub-pixel precision are fetched
tion for a 2D image of size 11K×11K. Mert et al. [28] extend their
according to the projecting coordinates, namely the variables x and
algorithm to generate a 3D image using the entire system of Summit
y. The value of z is also used as a geometric weight for adjusting
(24,576 GPUs) reaching the performance of 65 PFLOPS (excluding
1 Reconstruction Toolkit (RTK): https://www.openrtk.org the I/O).
Table 2: State-of-the-art image reconstruction solutions by FBP and Iterative Reconstruction (IR) algorithms.
Beam Decomposition Lower-bound Out-of-Core Multiple Communication
Implementation Algorithm
Shape Input Output Input Size Capability GPUs Nodes (MPI)
Trace [5] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑙𝑜𝑔(𝑁 ))
NU-PSV [63] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑙𝑜𝑔(𝑁 ))
MemXCT [27] IR Parallel 2D 2D O(𝑁𝑝 ) ✗ ✓ ✓ O(𝑁 )
Peta-scale XCT [28] IR Parallel 3D 3D O(𝑁𝑝 ) ✗ ✓ ✓ O(𝑁 )
Consensus Equilibrium [64] IR Parallel 2D 1D O(𝑁𝑝 ) ✓ ✗ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
DMLEM [12] IR Cone 1D ✗ O(𝑁𝑢 ×N𝑣 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
Palenstijn et al. [44] IR Cone 1D 1D O(𝑁𝑢 ×𝑁𝑝 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
TIGRE [6] IR Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✗ ✓ ✗ ✗
Lu et al. [38] FBP Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✓ ✗ ✗ ✗
iFDK [9] FBP Cone 1D 1D O(𝑁𝑢 ×𝑁 𝑣 ) ✗ ✓ ✓ O(𝑁𝑙𝑜𝑔(𝑁 ))
This work FBP Cone 2D 1D O(𝑁𝑢 ) ✓ ✓ ✓ O(𝑙𝑜𝑔(𝑁 ))
Table 3: The system parameters in our framework. volume data (output) in a way that enables us to build a scalable
Description Parameter and deep pipeline on heterogeneous computing systems.
The batch size of slices 𝑁𝑏
3.1.1 Splitting the volume data. We split the volume data in a way
The number of slices generated by a rank (or group) 𝑁𝑠
The batch count of batch processing 𝑁𝑐
that we could perform independent and concurrent reconstruc-
The number of sub-volumes 𝑁𝑛 tion of different sub-volumes. This has been proposed before for
The number of groups (of MPI ranks) 𝑁𝑔 parallel-beam reconstruction (e.g. [28, 62]), but not for cone-beam,
The number of MPI ranks in a single group 𝑁𝑟 to the author’s knowledge. Furthermore, this method of splitting
The number of GPUs 𝑁𝑔𝑝𝑢𝑠 the volume enables us to construct volumes in an out-of-core fash-
The number of launched ranks 𝑁𝑟𝑎𝑛𝑘𝑠 ion. This is particularly useful in cases when the required memory
for storing high-resolution volume data is larger than the memory
capacity (i.e. GPU device memory). As Figure 3c shows, we split the
In comparison to cone-beam image reconstruction, parallel-beam volume into a sub-volumes vertically (or along Z-axis in Figure 2),
is simpler since the projection and volume data can be naturally split e.g., 𝑉0 , 𝑉1 , . . . , and 𝑉𝑁𝑛 −1 in Figure 3c. 𝑁𝑏 is defined as the batch
without special consideration to the irregular decomposition of com- size (or the number of 2D slices) in each sub-volume. Hence, the
putation and communication. TIGRE [6] implemented a collection total number of sub-volumes may be calculated as
of IR algorithms for CBCT. However, TIGRE is restricted to using
𝑁𝑧
multiple GPUs on a single node. Palenstijn et al. [44] presented an 𝑁𝑛 = (3)
𝑁𝑏
extension of the ASTRA Toolbox [60] with an optimization of the
SIRT algorithm that enables distributed computation with multiple 3.1.2 Splitting 2D projections. As Figure 3a shows, each 2D pro-
GPUs. The authors only considered decomposing the projections jection is decomposed into several sub-projections according to
of the cone-beam in the dimension of 𝑁 𝑣 . DMLEM [12] scales up the geometric position of the corresponding sub-volume. Due to
to tens of GPUs, yet is restricted to reconstructing extremely small the use of cone-beam geometry, the required sub-projection for
volumes, e.g. smaller than 2003 . iFDK [9] is an efficient framework the partial image reconstruction varies from the position of each
to scale the FBP algorithm on CBCT. However, iFDK only decom- sub-volume. That is since the magnification factor exists through
poses the projection at the dimension of 𝑁𝑝 , and the size of output a cone geometry. Unlike the parallel-beam CT system, decompos-
volume on each GPU is limited by the GPU memory capacity. Un- ing a cone-beam projection may introduce an overlapped area. In
like the prior works, our algorithm decomposes the projections other words, in comparison to the parallel-beam CT, the character-
on both 𝑁 𝑣 and 𝑁𝑝 dimensions, resulting in better scalability and istics of the magnification make the CBCT more challenging w.r.t.
also out-of-core capability. Furthermore, we can greatly reduce the decomposing the domain in distributed reconstruction.
inter-processes communication and remove all serialization and Figure 4 (YZ-view) shows the decomposition of a 2D projection.
redundancy from the end-to-end pipeline, i.e. all steps from loading To generate a sub-volume of 𝑉𝑖 , the required range of projection is
until storing can be overlapped.
𝑎𝑖 𝑏𝑖 = 𝐶𝑜𝑚𝑝𝑢𝑡𝑒𝐴𝐵(𝑖·𝑁𝑏 , (𝑖 + 1)·𝑁𝑏 ) (4)
3 PROPOSED ALGORITHM Where the function ComputeAB, which computes the maximum
In this section, we discuss the algorithm we propose for scalable projection area, is shown in Algorithm 2.
FBP decomposition for cone-beam CT. In Algorithm 2, the projection operation (at angles of 135◦ and
315◦ ), according to Equation 8, is called four times to compute
3.1 Projection & Volume Decomposition four y coordinates. The range of 𝑎𝑖 𝑏𝑖 is determined by the mini-
Table 3 lists all the parameters we use in the following sections. mum and maximum values of y (min4 and max4 in the algorithm).
As Figure 3 shows, we introduce a novel algorithm to parallel FBP- Since the positions of y are at sub-pixel precision (expressed as a
based image reconstruction: decomposing projections (input) and single-precision variable), we adjust them to integer values since
Input (projections): 𝑁𝑣 × 𝑁𝑝 × 𝑁𝑢 Overlapped area: Output (volume): 𝑁𝑧 × 𝑁𝑦 × 𝑁𝑥
Sub-volume
𝑁𝑝
Sub-volume
𝑁𝑏
Sub-volume 𝑉𝑖 𝑉0
𝑉1
Sub-volume
…… 𝑉𝑖
𝑉𝑖+1
Required by 𝑉𝑖 Sub-volume
𝑁𝑣 Sub-volume
Required by 𝑉𝑖 +1 𝑉𝑖 +1
…… Sub-volume
Sub-volume
𝑉𝑁𝑛−1
𝑁𝑢
(a) 2D Projections decomposition. (b) Reduced sub-volumes. (c) Aggregate Volume.
Figure 3: Overview of the projection and volume decomposition. An example of four MPI ranks work as a group, 𝑁𝑟 = 4.
Z-axis X-axis X-axis X

𝑎0 𝑏0 𝑎0 X
Z 𝑁𝑢 𝑁𝑢 Y
Z
𝑎1
Y-axis Y-axis
Y
𝑎1 𝑏1 𝑏0 𝑉0
𝑉1
𝑏1
𝑉2 (a) 135° rotation. (b) 315° rotation.
Figure 5: The nearest and furthest voxel positions to the X-
FPD 𝑎1 𝑏0 :Overlapped area ray source (from a top-view). The red points are rotated to
Figure 4: Schematic YZ-view of overlapped projections. the position of blue points, which are on the Y-axis.
Algorithm 2: Compute maximum projection area. 𝑀135◦ and

this facilitates the projection decomposition processing. Figure 5 ex-
𝑀315◦ are projection matrices at angles of 135◦ and 315◦ , re-
plains the reasons why we choose the projection matrices at angles
spectively. min4 and max4 are the maximum and minimum
of 135◦ and 315◦ . As the top-view shows, a voxel of position (0, 0)
values of the four input elements, respectively.
rotates from the position of the red point to the blue point. Hence,
the nearest (Figure 5a) and furthest (Figure 5b) positions of voxels Function ComputeAB(𝑏𝑒𝑔𝑖𝑛_𝑖𝑑𝑥, 𝑒𝑛𝑑_𝑖𝑑𝑥)
on a horizontal slice (w.r.t. the X-ray source) can be identified and [-, y0 ] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀135◦ , [0, 0, 𝑏𝑒𝑔𝑖𝑛_𝑖𝑑𝑥]) ⊲ use Eqn 8
used in Algorithm 2. [-, y1 ] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀315◦ , [0, 0, 𝑏𝑒𝑔𝑖𝑛_𝑖𝑑𝑥]) ⊲ use Eqn 8
[-, y2 ] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀135◦ , [0, 0, 𝑒𝑛𝑑_𝑖𝑑𝑥 − 1]) ⊲ use Eqn 8
3.1.3 Splitting 3D projections. This section explains how we decom- [-, y3 ] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀315◦ , [0, 0, 𝑒𝑛𝑑_𝑖𝑑𝑥 − 1]) ⊲ use Eqn 8
pose the projection in the 𝑁𝑝 dimension. Unlike the decomposition a = (int)floor(min4(y0, 𝑦1, 𝑦3, 𝑦4 )) ⊲ floor operation
along the 𝑁 𝑣 , the dimension of 𝑁𝑝 can be split equally and has b = (int)ceil(max4(y0, 𝑦1, 𝑦3, 𝑦4 )) ⊲ ceil operation
no overlap region. We decompose the projections in the 𝑁 𝑣 and return 𝑎𝑏 ⊲ integer values
𝑁𝑝 dimensions. This decomposition scheme improves both perfor-
mance and scalability since the small-sized problems (or partial
projections) can be solved efficiently on all distributed systems. of the next sub-volume. Taking the sub-volumes of 𝑉𝑖 and 𝑉𝑖+1 for
As Figure 3a shows, for the projections of size 𝑁 𝑣 ×𝑁𝑝 ×𝑁𝑢 , the an instance, we can update the sub-projections required by 𝑉𝑖+1
dimension of 𝑁𝑝 is split into four parts (𝑁𝑟 =4). Using the corre- without regard to the overlapped area that is required by both 𝑉𝑖
sponding range of 𝑎𝑖 𝑏𝑖 (computed as in Equation 4), four partial and 𝑉𝑖+1 . More specifically, as shown in Figure 4, the projections for
sub-volumes can be reconstructed in parallel. The actual processing the overlapped area of 𝑎 1𝑏 0 can be reused and thus, only the partial
at the level of the four sub-volume will be discussed later. projections of 𝑏 0𝑏 1 are required for the update. The differential part
In Figure 3c, for a sub-volume of size 𝑁𝑥 ×𝑁 𝑦 ×𝑁𝑏 , the size (or required for updating may be expressed as
total elements) of the corresponding partial projection is
𝑏𝑖 𝑏𝑖+1 = 𝑎𝑖+1𝑏𝑖+1 − 𝑎𝑖 𝑏𝑖 ∩𝑎𝑖+1𝑏𝑖+1 (6)
𝑁𝑢 ·𝑁𝑝 ·(𝑏𝑖 − 𝑎𝑖 )
𝑆𝑖𝑧𝑒𝐴𝐵𝑖 = (5)
𝑁𝑟 Where 𝑖 ∈ [0, 𝑁𝑛 − 1), n is the number of sub-volumes as in Equa-
tion 3. Hence, the total number of elements of the corresponding
where, 𝑁𝑝 is split into 𝑁𝑟 partitions. Splitting into two dimensions
partial projections is
can help in alleviating the memory pressure on the heterogeneous
accelerators with limited device memory, e.g. GPUs. To generate 𝑁𝑢 ·𝑁𝑝 ·(𝑏𝑖+1 − 𝑏𝑖 )
𝑆𝑖𝑧𝑒𝐵𝐵𝑖 = (7)
the sub-volumes in the pipeline shown in Figure 3, we only update 𝑁𝑟
a small part of the required sub-projection for the reconstruction
Load data Filtering Back-projection MPI-Reduce Store 𝑁𝑢 Rotation center
3D Volume
Load data Filtering Back-projection
𝛿𝑢 ഥ 𝑁𝑢
Load data Filtering Back-projection MPI-Reduce Store O
Load data Filtering Back-projection 𝛿𝑣 𝑁𝑣 𝛿𝑐𝑜𝑟
Load data Filtering Back-projection MPI-Reduce Store o ഥ
O
Load data Filtering Back-projection
Figure 6: Overview of how the proposed framework operates (a) FPD center offset. (b) Rotation center offset (top-view).
on a distributed system.
Figure 7: Geometric correction. The blue and red points are
the theoretical and calibrated positions, respectively.
This interaction between sub-volumes allows us to run an end-to-
end pipeline from loading projections to storing the volume (details where [𝑖, 𝑗, 𝑘] is a 3D point in the unit of voxel, [𝑥, 𝑦] is the pro-
in Section 4.4.3). It is important to note that the lack of projection jected 2D position at the FPD plane in the unit of pixel at sub-pixel
splitting in other cone-beam frameworks[9, 30, 38] forces them precision. We can obtain a projection matrix based on the angle
to serialize the sub-volumes: sub-volumes are computed one after 𝜙, and perform the mapping operation as shown in lines 6∼8 of
the other. In addition, the lack of projection splitting forces the Algorithm 1.
frameworks to repeatedly move the same projections from host to
device in an inefficient manner. 4.2 Filtering Computation Optimization
As Figure 6 shows, we take advantage of the heterogeneity of the
4 IMPLEMENTATION
GPU-accelerated system to pipeline and parallelize the FBP compu-
In this section, we discuss the implementation of the proposed algo- tation. Typically the GPU is used to perform the filtering computa-
rithm in a framework. Figure 6 gives an overview of the proposed tion in Equation 2 [7, 38]. We however do the filtering computation
algorithm when running on distributed systems. More specifically, on the CPU (using IPP/MKL libraries [58]) to have the end-to-end
we use the muti-core CPUs to perform the operations such as pipeline shown in Figure 6. Running the filtering computation on
Parallel File System (PFS) I/O, filtering computation, and MPI com- CPU contributes to building an efficient pipeline for the FBP algo-
munications. We use the GPUs to run the back-projection kernel. rithm as follows. (i) The GPU focuses on back-projection, which
is the computational bottleneck. Also, we can overlap the filtering
4.1 A General Projection Matrix computation and back-projection. Hence, we can hide the latency
It is important to assess the geometric parameters of a CT system of the filtering computation. (ii) The limited device memory of GPU
with a high degree of precision in order to reconstruct tomographic can be fully used for back-projection. More specifically, we can
images with good spatial resolution and low artifact content [70]. alleviate the pressure on device memory when generating high-
Many reconstruction works [38, 42, 66] assume CT systems have no -resolution volumes. (iii) The data movement remains simplified
geometric error or that the geometric error is corrected by physical since the filtering computation can be performed immediately after
adjustment. In this work, we correct the geometric offset dynami- loading data from storage.
cally when performing the projection operations. In Figure 7, we
show the geometric offsets that must be carefully calibrated before 4.3 Novel CUDA Back-projection Kernel
scanning [70]. In our system, the projection matrix is formulated Conventional approaches [38, 72] rely on 2D layered texture [11]
with consideration of the geometric offset, e.g. 𝜎𝑥 , 𝜎 𝑦 and 𝜎𝑐𝑜𝑟 . to improve the data locality and update the volume by batches
We propose a general projection matrix available to correct these of projections. We, however, split the projections horizontally, in
offsets for real-world datasets as listed in Table 4. addition to the batches, and update each voxel by all projections in
Most importantly, the proposed matrix is general and can be a single batch. We also take advantage of the 3D texture to improve
reused for most CBCT systems, e.g. microscope CT system with the data locality of projections.
rotation center offset (𝜎𝑐𝑜𝑟 ). The CBCT geometry can be described
as a pinhole model, similar to a digital camera [23]. All geometric 4.3.1 CUDA Implementation. In Listing 1, our kernel is imple-
parameters could be expressed in a well-aligned matrix of size 3×4 mented as kernelBackProjection. Each 2D position, namely (x, y)
(known as the projection matrix). These matrices are convenient in lines 12∼14, is computed according to Equation 8 and then, is
for performing the back-projection computation, i.e. for projecting used to fetch the intensity value of the projection at sub-pixel pre-
a voxel to the plane of FPD such as the matrix 𝑀𝜙 in Algorithm 1. cision using the device function devSubPixel. The device function
Our projection matrix is defined as devSubPixel is strictly consistent with the original interpolation
 △𝑥 0 1−△𝑥 ·𝑁𝑥 function implementation in Algorithm 1. Each pixel is accessed
0
 𝐷𝑠𝑑 0 𝑁𝑢2−1 + 𝜎𝑢 0  𝑐𝑜𝑠 (𝜙) −𝑠𝑖𝑛 (𝜙) 0 𝜎𝑐𝑜𝑟  2
via 3D texture, as the device function devPixel shows. We use the

△𝑦 ·𝑁 𝑦
 0 0 −1 0 
 △𝑢  
 0 −△ 0 
2
3D texture to improve the data reuse. To perform a bilinear inter-
𝑀𝜙 =  0 𝐷𝑠𝑑 𝑁 𝑣 −1
+ 𝜎𝑣 0  𝑠𝑖𝑛 (𝜙) 𝑐𝑜𝑠 (𝜙) 0 𝐷𝑠𝑜 
  𝑦 
2 △𝑧 ·𝑁𝑧
△𝑣  0 0 △𝑧
 
 0 0 0 2

𝐷𝑠𝑑   0 0 0 1  polation operation, we need to load four neighboring pixels from


   0 0 0 1 
 
Using this matrix, we can project the 3D position of any voxel the device memory to registers. However, those four pixels are not
relative to the FPD plane. We formulate the projection as physically contiguous in the device memory. It is worth mentioning
that the texture memory in CUDA provides a hardware-optimized
[𝑥, 𝑦] = 𝑃𝑟𝑜 𝑗𝑒𝑐𝑡𝑖𝑜𝑛(𝑀𝜙 , [𝑖, 𝑗, 𝑘]) (8) bilinear interpolation function at 8-bits precision [11]. However, to
Listing 1: Proposed CUDA Back-projection kernel enabling

streaming and out-of-core computations.
1 t e x t u r e < f l o a t , cudaTextureType3D > tex ; / / 3D t e x t u r e memory MPI_Reduce
2 _ _ g l o b a l void kernelBackProjection ( f l o a t ∗ volume , i n t
o f f s e t _ v o l u m e _ z , i n t 3 vol_dim , const f l o a t 4 ∗ p r o j _ m a t ,
i n t 3 proj_dim , i n t o f f s e t _ p r o j _ y ) {
3 i n t i = b l o c k I d x . x ∗ blockDim . x+ t h r e a d I d x . x ;
4 i n t j = b l o c k I d x . y ∗ blockDim . y+ t h r e a d I d x . y ;
5 i n t k = b l o c k I d x . z ∗ blockDim . z + t h r e a d I d x . z ;
6 i f ( i >= v o l _ d i m . x | | j >= v o l _ d i m . y | | k>= v o l _ d i m . z )
7 return ;
8 f l o a t sum = 0 ;
9 i n t K = k + offset_volume_z / / o f f s e t k Figure 8: MPI_Reduce on a slice (512 × 512) of tomo_00030.
10 f l o a t 4 i j k = make_float4 ( i , j , K , 1.0 f ) ;
11 f o r ( i n t s = 0 ; s < p r o j _ d i m . y ; s ++ , p r o j _ m a t += 3 ) {
12 f l o a t z = d o t ( _ _ l d g (& p r o j _ m a t [ 2 ] ) , i j k ) ; / / Eqn 8
13 f l o a t x = d o t ( _ _ l d g (& p r o j _ m a t [ 0 ] ) , i j k ) / z ; / / Eqn 8
14 f l o a t y = d o t ( _ _ l d g (& p r o j _ m a t [ 1 ] ) , i j k ) / z ; / / Eqn 8 thus, we only need to copy the projections in the range 𝑏 1𝑏 2 from
15 f l o a t Y = y - offset_proj_y ; / / o f f s e t y host to device memory for computing the second sub-volume. To
16 sum += 1 . f / ( z ∗ z ) ∗ d e v S u b P i x e l ( x , Y , s , p r o j _ d i m ) ;
17 } reuse projections and stream the back-projection computation, the
18 / / update a v o x e l o f t h e volume data offset operations for projections and volume are essential (variables
19 volume [ v o l _ d i m . x ∗ v o l _ d i m . y ∗ k+ v o l _ d i m . x ∗ j + i ] = sum ;
20 } K, Y, and Z are emphasized in gray color). The optimization of data
21 / / d e v i c e f u n c t i o n a s S u b P i x e l i n Alg. 1 movement is explained later in Section 4.4.3.
22 _ _ d e v i c e f l o a t devSubPixel ( f l o a t x , f l o a t y , f l o a t z , i n t 3
& proj_dim ) {
23 f l o a t 2 f = make_float2 ( f l o o r ( x ) , f l o o r ( y ) ) ; 4.4 Distributed FBP Framework
24 f l o a t 2 d = m a k e _ f l o a t 2 ( x , y ) − f ; / / sub − p i x e l p o s i t i o n
25 f l o a t 2 _d = 1 . f − d ; / / sub − p i x e l p o s i t i o n This section describes a novel distributed framework for FBP-based
26 / / fetch four neighbour p i x e l values
27 f l o a t v0 = devPixel ( f . x , z , f . y , proj_dim . z ) ; image reconstruction that is based on the algorithm we propose
28 f l o a t v1 = devPixel ( f . x + 1 , z , f . y , proj_dim . z ) ; in this paper. We build an end-to-end pipeline to overlap different
29 f l o a t v2 = devPixel ( f . x , z , f . y + 1 , p r o j _ d i m . z ) ;
30 f l o a t v3 = devPixel ( f . x + 1 , z , f . y + 1 , p r o j _ d i m . z ) ; operations of the volume reconstruction. We also take advantage of
31 return ( v0 ∗ _d . x+ v1 ∗ d . x ) ∗ _d . y + ( v2 ∗ _d . x+ v3 ∗ d . x ) ∗ d . y ;
32 } the heterogeneous computing resources to orchestrate a balanced
33 _ _ d e v i c e f l o a t devPixel ( i n t x , i n t y , i n t z , i n t dimZ ) { workload on both CPUs and GPUs.
34 i n t Z = z % dimZ ; / / o f f s e t z
35 return tex3D ( tex , x , y , Z ) ; / / r e a d 3D t e x t u r e 4.4.1 Grouping MPI Ranks. We equally divide the all MPI ranks
36 }
into several groups to perform the image reconstruction. The total
number of ranks is
𝑁𝑟𝑎𝑛𝑘𝑠 = 𝑁𝑟 ·𝑁𝑔 (9)
maintain the required high resolution of generated volumes, we use
the single-precision (32-bits) device function (namely devSubPixel). Where 𝑁𝑔 is the number of groups and 𝑁𝑟 the number of MPI ranks
The forward-projection and back-projection are optimized as 𝐴𝑥 in a group (both defined in Table 3). We use MPI_Comm_split [51]
and 𝐴𝑇 𝑦 (two SpMV operations [61] computations), where A is a to group the MPI ranks: the ranks in the same group share the same
sparse matrix, called system matrix [40, 62]. The huge size, and color 𝑖𝑑/𝑁𝑔 , where 𝑖𝑑 is the world MPI rank number. Therefore,
sparsity, of the A-matrix (O (𝑁 5 ) [3]) that arises from the CBCT each group may be assigned a new MPI communicator that can
technology makes it inefficient to use Tensor Cores. be used for an efficient segmented collective communication, i.e.
Also, we use the device memory to store the projection matrices the MPI_Reduce in our implementation. The total number of sub-
(declared as proj_mat) and access them via the cache-optimized volume slices generated by each MPI group is
intrinsic __ldg, rather than using constant memory. That is since 𝑁𝑧
𝑁𝑠 = (10)
those matrices are used by several projections and thus, we can 𝑁𝑔
avoid the constant memory capacity restriction [11]. Note that the We assign one rank to each GPU. Hence, the number of GPUs
proj_mat is used as read-only. Since all projections are used to equals the number of the MPI ranks in our framework. Therefore,
update the voxels in a single batch, we can accumulate the values
via a register (declared as sum) and update the volume once. This 𝑁𝑔𝑝𝑢𝑠 = 𝑁𝑟𝑎𝑛𝑘𝑠 (11)
helps in reducing the memory traffic when accessing the device This simplifies the pipeline and improves the portability of the code
memory. between systems with different numbers of GPUs per node. Based
4.3.2 Data Reuse in Device Memory. Our CUDA kernel enables on Equation 3, the required number of batches for generating all
streamed processing of the back-projection computation. This is volumes in each rank (or group) becomes
beneficial for device memory reuse. This also facilitates the pipeline 𝑁𝑠 𝑁𝑛
𝑁𝑐 = = (12)
scheme. As Figure 4 shows, we load projections in the range 𝑎 1𝑏 1 to 𝑁𝑏 𝑁𝑔
generate the first sub-volume, then we move the volume data from 𝑁𝑐 can be used to control the device memory budget to use since
the device to host memory for further processing. The projections it is inversely proportional to the value of 𝑁𝑏 . In other words, we
in the range 𝑎 2𝑏 2 are required for generating the next volume data. can process fewer slices when using larger 𝑁𝑐 . To simplify these
However, the overlapped area in the range 𝑎 2𝑏 1 can be reused and configurations, we fix 𝑁𝑐 =8 in the following evaluation.
Storage Queue0 Queue1 Queue2 Queue3 Storage

Projections Load Filter Back-projection MPI Store 3D Volume
Thread Thread Thread Thread Thread
Figure 9: A end-to-end view of the pipeline in a single MPI rank. Queues correspond to the stages of the pipeline.
Load Thread 9.5s 137.7s Load Thread 0.5s 35.5s

Filter Thread 17s Filter Thread 1.0s
BP Thread H2D D2H …… 134.4 s D2H BP Thread H2D D2H …… 16.3 s D2H
MPI Thread MPI Thread 23.3 s
Store Thread 7.6s Time (s) Store Thread 5.1s Time (s)
(a) Reconstructing 20483 volume of Tomo_00029 in Table 5. 𝑁𝑔𝑝𝑢𝑠 =1. (b) Reconstructing 40963 volume of Bumblebee. 𝑁𝑔𝑝𝑢𝑠 =128, 𝑁𝑔 =64, 𝑁𝑟 =8.
Figure 10: Empirical examples using real-world datasets: efficient overlapping of the pipelined stages (BP = Back-projection).
Algorithm 3: Moving data from host to device by cudaMem- Namely: loading projections from storage, filtering, back-projection,
cpy3D in the computing pipeline of Figure 9. reduction of the 3D sub-volumes, and finally saving the recon-
Input: Queue1 , Queue2 (as in Fig. 9), devMem is texture-optimized 3D structed results to storage. Five threads manage different stages
device memory with depth of H (tex in Listing 1). devMem(x,y) of the pipeline, concurrently. The main thread (also called MPI
means 3D memory ranges from x to y in depth dimension. thread) manages the MPI communication. Then four other threads
Output: sub-volumes of 𝑉0 , . . . , 𝑉𝑁𝑐 −1 are launched via the C++ standard library (namely std::thread) [33].
1 for i ← 0 to 𝑁𝑐 -1 do The inter-thread hand-over is managed through four queues (FIFO
2 s← 0 buffers) that assume all threads are running independently. The
3 if i == 0 then load thread is launched to load the partial projections according to
4 P(𝑎 0𝑏 0 ) ← 𝑄𝑢𝑒𝑢𝑒 1 ⊲ pop head of filtered projections the requirements of the sub-volumes (as in Figure 3). The filtering
5 d←b0 − 𝑎 0 + 1 thread is used to perform the filtering computation on the data
6 Memcpy3D 𝑃 (𝑎 0𝑏 0 ) to devMem(0, d-1) ⊲ H2D handed-over from the load thread. We then parallelize the filtering
7 else computation (using OpenMP). The Back-projection thread launches
8 P(𝑏𝑖−1𝑏𝑖 ) ← 𝑄𝑢𝑒𝑢𝑒 1 ⊲ get filtered projection the back-projection CUDA kernel on the GPU (proposed CUDA
9 d←b𝑖 − 𝑏𝑖−1 +1 kernel in Listing 1). The back-projection thread also manages the
10 if s%H + d < H then data movement in a way that allows the pipeline to seamlessly
11 Memcpy3D P(𝑏𝑖−1𝑏𝑖 ) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (𝑠%𝐻, 𝑠%𝐻 + 𝑑) support out-of-core image reconstruction if necessary (shown in
12 else Algorithm 3). In the algorithm, we use cudaMemcpy3D to move the
13 k← 𝐻 − 𝑠%𝐻 partial projections waiting for processing in 𝑄𝑢𝑒𝑢𝑒 1 to be processed
14 Memcpy3D P(𝑏𝑖−1 (𝑏𝑖−1 + 𝑘)) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (𝑠%𝐻, 𝐻 − 1) on the device using the proposed CUDA kernel in listing 1. Then
15 Memcpy3D P((𝑏𝑖−1 + 𝑘 + 1)𝑏𝑖 ) 𝑡𝑜 𝑑𝑒𝑣𝑀𝑒𝑚 (0, 𝑏𝑖 −𝑏𝑖−1 ) we push the generated sub-volume to 𝑄𝑢𝑒𝑢𝑒 2 . Finally, the store
16 s← 𝑠 + ℎ thread writes the final volume data to global storage (or PFS) using
17 𝑉𝑖 ← generate a sub-volume on GPU ⊲ use kernel in Listing 1 OpenMP threads. Figure 10 shows two empirical examples of the
18 𝑄𝑢𝑒𝑢𝑒 2 ←V𝑖 ⊲ push back a sub-volume end-to-end pipeline processing and demonstrates the effectiveness
of the overlapping.
4.4.2 Communication: Efficient Segmented Reduction. This section

discusses the communication in our proposed framework in com- 5 PERFORMANCE MODEL
parison to the related works listed in Table 2. As Figure 3 shows,
To better quantify the efficiency of our framework, we model the
four MPI ranks (namely 𝑁𝑟 =4) generating the sub-volumes in par-
computing performance to predict the overall runtime.
allel, then the four sub-volumes are reduced (using MPI_Reduce)
Micro-benchmark measurements. The parameters used in
as Figure 8 shows. Note that the MPI_Reduce call is segmented, i.e.,
our model, such as the PFS throughput, are measured by micro-
different subsets of nodes do reduction together (the MPI_Reduce
benchmarks. 𝐵𝑊𝑙𝑜𝑎𝑑 and 𝐵𝑊𝑠𝑡𝑜𝑟𝑒 are the peak throughput of load-
is not collective for all nodes). We conduct the reduction on the
ing data from local storage and writing data to the PFS, respectively.
CPU and avoid using the CUDA-aware reduce primitive since the
𝑇 𝐻 𝑓 𝑙𝑡 is the throughput, on a given CPU, of Intel’s primitives library
latency of the reduction can be hidden in our pipeline, and to make
(IPP) we use for the filtering function. 𝑇 𝐻𝑏𝑝 is the back-projection
our framework portable to systems not supporting CUDA-aware
throughput on a GPU. 𝑇 𝐻𝑟𝑒𝑑𝑢𝑐𝑒 is the throughput of MPI_Reduce
MPI. To improve the performance of the segmented MPI_Reduce,
that we measure using Intel MPI-benchmarks [31]. 𝐵𝑊𝑝𝑐𝑖 is the
we do a hierarchical reduction at which MPI ranks on the same
throughput of moving data by a single PCIe (x16) interconnect
nodes first do a round of reduction to a leader MPI rank.
and is measured by Nvidia’s SDK toolkit. 𝜂 represents a constant
4.4.3 End-to-end Pipeline. In Figure 9, we show the pipeline we value of sizeof(float) since we use single-precision by default in our
use to perform the image reconstruction in an end-to-end fashion. framework.
Performance Projection. Each MPI rank loads a small part Table 4: Geometric correction for computing projection ma-
of projections from the local storage to memory as expressed in trix (𝑀𝜙 ) and projection pre-processing by Equation 1.
Equation 5 and Equation 7. tomo_ID
Coffee bean Bumblebee
( 00027 00028 00029 00030
𝜂·𝑆𝑖𝑧𝑒𝐴𝐵𝑖 /𝐵𝑊𝑙𝑜𝑎𝑑 , if 𝑖 = 0. 0 0 25 26 27 -10
𝑖
𝑇𝑙𝑜𝑎𝑑 = (13) 𝜎𝑢
𝜂·𝑆𝑖𝑧𝑒𝐵𝐵𝑖 /𝐵𝑊𝑙𝑜𝑎𝑑 , else 𝑖 ∈ [1, 𝑁𝑐 ). 𝜎𝑣 0 0 0.25 0.25 0.2 0.2
𝜎𝑐𝑜𝑟 -0.0021 1.03 0 0 0 0
Similar to the equation above, we can obtain 𝑇𝑓𝑖 𝑙𝑡 and 𝑇𝐻𝑖 2𝐷 by 𝜆𝑑𝑎𝑟𝑘 0 dark data is provided in dataset
216 blank data is provided in dataset
𝑇 𝐻 𝑓 𝑙𝑡 and 𝐵𝑊𝑝𝑐𝑖 , respectively. The back-projection execution time
𝜆𝑏𝑙𝑎𝑛𝑘
mainly depends on the size of the partial projections

𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑏 ·𝑁𝑝 6 EVALUATION
𝑖
𝑇𝑏𝑝 = (14)
𝑁𝑟 ·𝑇 𝐻𝑏𝑝
In this section, we show the experimental environment, discuss the
The required memory for a single sub-volume is evaluated performance of the proposed algorithm, and analyze the
scaling of our implementation.
𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 = 𝜂·𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑏 (15)
Therefore, the required time of moving a sub-volume from device to 6.1 Datasets, Environment, and Measurements
host, reducing the sub-volume by MPI_Reduce, and writing a sub- Computing Platform. ABCI supercomputer (14𝑡ℎ in the Top500
volume data to storage could be written as 𝑇𝐷2𝐻 𝑖 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝐵𝑊𝑝𝑐𝑖 , list at November 2020) is used for performance evaluation. The sys-
𝑇𝑟𝑒𝑑𝑢𝑐𝑒 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝑇 𝐻𝑟𝑒𝑑𝑢𝑐𝑒 , and 𝑇𝑠𝑡𝑜𝑟𝑒 = 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 /𝐵𝑊𝑠𝑡𝑜𝑟𝑒 , respec-
𝑖 𝑖 tem is equipped with 1,088 compute nodes (4,352 Nvidia Tesla V100
tively. We define following parameters that are aggregations: 𝑇𝐶𝑃𝑈 GPUs) and 35PB shared storage. The specification of each compute
is the runtime on CPU, 𝑇𝐺𝑃𝑈 is the runtime on GPU, 𝑇𝑙𝑜𝑎𝑑 is the node is as follows: two Intel Xeon Gold 6148 CPUs (2.40GHz, 20
time to load from local storage to memory, 𝑇𝑓 𝑙𝑡 is the time to do Cores), 384GB (DDR4 2666MHz) memory, 1.6TB NVMe SSD, four
the filtering computation on CPU, 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 the time to do the sub- Nvidia Tesla V100 GPUs (16GB/GPU) with PCIe 3.0×16, and two
volume reductions, 𝑇𝑠𝑡𝑜𝑟𝑒 is the time to store the resulting volume InfiniBand EDR HCAs. Our framework is developed on CUDA-10.2
to the PFS. Note that the aggregate runtimes are comprised of run- Toolkit (CUDA Driver: 440.33.01) running on CentOS 7.4. The ver-
times for different batch counts (𝑁𝑐 ), e.g. 𝑇𝐶𝑃𝑈 = 𝑁
Í 𝑐 −1 𝑖 sion of Intel libraries for IPP and MPI is 2020.4.304. We use Nvidia
𝑗=0 𝑇𝐶𝑃𝑈 and
nvcc and Intel mpicc compilers for compiling the device and host
𝑇𝐶𝑃𝑈 ∝ 𝑇 𝑖
∼ 𝐶𝑃𝑈 .
codes, respectively.
The runtime on the CPU and GPU is thus
Datasets. Six real-world datasets are evaluated: (i) A Coffee
𝑖 𝑖
𝑇𝐶𝑃𝑈 = 𝑇𝑙𝑜𝑎𝑑 + 𝑇𝑓𝑖 𝑙𝑡 bean dataset. A roasted coffee bean sample was scanned by Zeiss
𝑖
(16) Xradia Versa 510 (3D X-ray Microscopy) at 80 kV, 87.5 𝜇A. 𝐷𝑠𝑑 =151.7,
𝑇𝐺𝑃𝑈 = 𝑇𝐻𝑖 2𝐷 + 𝑇𝑏𝑝
𝑖 𝑖
+ 𝑇𝐷2𝐻
𝐷𝑠𝑜 =16.0, the X-ray and optical magnification are 9.48 (=𝐷𝑠𝑑 /𝐷𝑠𝑜 )
Assuming a perfect overlap of the operations in the pipeline, the and 0.39 (by calibration), respectively. Offseting a detector of size
overall runtime is projected as 2000×2000 to the left and right side with overlapped region was
0
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 =𝑇𝐶𝑃𝑈 0
+ 𝑇𝐺𝑃𝑈 0
+ 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 0
+ 𝑇𝑠𝑡𝑜𝑟𝑒 + conducted at two full scan. The size of each stitched projection
𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ 𝑐 −1
𝑁Õ (17)
becomes 𝑁𝑢 =3728 and 𝑁 𝑣 =2000. The number of acquired projec-
max( 𝑖
𝑇𝐶𝑃𝑈 , 𝑖
𝑇𝐺𝑃𝑈 , 𝑖
𝑇𝑟𝑒𝑑𝑢𝑐𝑒 , 𝑖
𝑇𝑠𝑡𝑜𝑟𝑒 ) tions is 𝑁𝑝 =6401. The exposure time is 14 seconds and total scan
𝑖=1 𝑖=1 𝑖=1 𝑖=1
time is about 5 hours. (ii) A bumblebee dataset. A bumblebee
Insights from the Performance Model. We list observations was scanned on Nikon Metrology HMX ST 225 (a micro-CT scan-
from the performance model analysis. (i) Scaling: The operations ner) at 40Kv, 173𝜇A. 𝐷𝑠𝑑 =672.5, 𝐷𝑠𝑜 =39.8, the X-ray magnification
on the partial projections such as loading, filtering are light-weight is 16.9. The projection parameters are: 𝑁𝑢 =𝑁 𝑣 =2000, △𝑢 =△𝑣 =0.2,
in comparison with other operations on the volume that are often 𝑁𝑝 =3142. The number of acquired projections is 𝑁𝑝 =6401. The
the bottleneck. We observe that 𝑇𝐶𝑃𝑈 ∝ ∼1/𝑁𝑔 since fewer slices cor- total scan time is about 13 hours. (iii) Four Tomobank datasets
respond to less computation on the projections (as expressed in (tomo_00027, tomo_00028, tomo_00029, and tomo_00030). The scan-
Equation 8). The total number of generated slices by each group of ner and the related configurations are described in [13]. The dataset
MPI ranks becomes 𝑁𝑧 /𝑁𝑔 (as shown in Equation 10), where 𝑁𝑧 of 00027, 00028 and 00029 share the same geometric parameters:
is a constant value in this scenario. According to Equations 10, 12, 𝐷𝑠𝑑 =250, 𝐷𝑠𝑜 =100, 𝑁𝑢 =2004, 𝑁 𝑣 =1335, △𝑢 =△𝑣 =0.025, and 𝑁𝑝 =1800.
and 15, 𝑆𝑖𝑧𝑒 𝑣𝑜𝑙 ·𝑁𝑐 = 𝜂·𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧2 /𝑁𝑔 . It results in 𝑇𝐺𝑃𝑈
𝑖 , 𝑇𝑟𝑒𝑑𝑢𝑐𝑒
𝑖
The geometric parameters of tomo_00030 are: 𝐷𝑠𝑑 =350, 𝐷𝑠𝑜 =250,
and 𝑇𝑠𝑡𝑜𝑟𝑒 are inversely proportional to 𝑁𝑔 . Hence, we can derive
𝑖 𝑁𝑢 =668, 𝑁 𝑣 =445, △𝑢 =△𝑣 =0.075, and 𝑁𝑝 =720.
that 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ∝
∼1/𝑁𝑔 . Considering that 𝑁𝑟 is a fixed value for a given Importance of the Datasets. The datasets we use in this paper
problem, according to Equations 9 and 11, we can conclude that give valuable insight. We elaborate in specific on the coffee bean
∼1/𝑁𝑔𝑝𝑢𝑠 since 𝑁𝑔𝑝𝑢𝑠 = 𝑁𝑔 ·𝑁𝑟 . More specifically, the per-
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ∝ dataset (reconstruction shown in Figure 11a). (i) the shape and
formance theoretically scales linearly with the number of GPUs. aspect ratio of a coffee bean made it an appropriate candidate for
(ii) Peak performance: Using Equation 17, we can predict the using the wide-field macro mode of the CT scanner, allowing us to
potential peak performance and quantify the implementation effi- capture a high-resolution dataset. It has a variable and representa-
ciency. tive structure for a large group of problems and some interesting
1 3
2 4
3
2
(a) Reconstruction of the coffee bean dataset: 3728 × 2000 × 6401 ⇒ 40963 . (b) Bumblebee visualization by 3D Slicer [18].
Figure 11: Image reconstruction of real-world datasets. The container for Bumblebee is hidden using a mask.
Table 5: Out-of-core evaluation using tomobank dataset on
Performance [FLOP/s]
Single-precision Operations Roofline
a node with a single GPU (V100 or A100). The input sizes of 1013
Peak performance: 13.4e+12
tomo_00030 and tomo_00029 are 668×445×720 (816 MB) and
2004×1335×1800 (17.9 GB), respectively. The columns in the 1012
4.0TFLOP/s 4.0TFLOP/s 4.4TFLOP/s 4.5TFLOP/s 4.5TFLOP/s
table are not a breakdown of the total runtime 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 ; the AI: 40.9 AI: 41.5 AI: 157.7 AI: 900.4 AI: 2954.7
operations are overlapped in a pipeline fashion as shown in 1011 14.9
Figure 10. 0.1 1 10 100 1000 10000
Arithmetic Intensity (AI) [FLOP/byte]
Input Output 𝑇𝑙𝑜𝑎𝑑 𝑇𝑓 𝑙𝑡 𝑇𝐻 2𝐷 𝑇𝑏𝑝 𝑇𝐷2𝐻 𝑇𝑠𝑡𝑜𝑟𝑒 𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 Perf. (GUPS)
tomo_ID (𝑣𝑜𝑥𝑒𝑙 3 ) (s) (s) (s) (s) (s) (s) (s) Ours RTK Figure 12: Roofline analysis on a V100 GPU (generated by
5123 (512MB) ∼0.5 ∼0.95 ∼0.3 0.87 0.11 0.47 1.4 111.6 110.8
Nvidia Nsight Compute [15]). Using tomo_00030 dataset, the
V100 GPU (16GB)
00030 10243 (4GB) ∼0.5 ∼0.95 ∼0.3 6.7 0.70 0.95 7.9 115.7 113.7
(816MB) 20483 (32GB) ∼0.5 ∼0.95 ∼0.3 53.1 6.4 2.60 60.2 117.2 ✗ blue, red, and black points indicate kernels that reconstruct
40963 (256GB) 423.3 50.2 21.8 475.0 120.1
the volumes of 5123 , 10243 , and 20483 , respectively. The ◦ and
∼0.5 ∼0.95 ∼0.3 ✗
5123 (512MB) 9.5 17.0 3.5 8.6 0.10 0.73 19.7 29.5 104.7
00029 10243 (4GB) 9.4 17.3 3.5 18.1 0.83 0.95 25.4 107.0 107.7
(17.9GB) 20483 (32GB) 9.5 17.0 3.8 124.2 6.4 7.61 137.7 125.1 ✗
△ denote RTK and our kernels, respectively.
40963 (256GB) 9.5 17.1 6.9 971.1 49.6 21.1 1028.8 129.2 ✗
5123 (512MB) ∼0.2 ∼0.7 ∼0.1 0.69 0.06 0.47 1.1 111.6 125.4
A100 GPU (40GB)
00030 10243 (4GB) ∼0.2 ∼0.7 ∼0.1 5.1 0.4 0.4 6.853 152.0 127.4
(816MB) 20483 (32GB) 40.1 3.2 4.0 52.4 155.3
40963 (256GB)
∼0.2
∼0.2
∼0.7
∼0.6
∼0.1
∼0.1 318.8 27.1 36.3 347.1 159.7
✗
✗
of 1e-5 is the threshold for the difference of the generated and
5123 (512MB) 6.4 9.0 2.8 2.8 0.06 0.5 10.1 87.8 122.0 standard volume data. (ii) The datasets described earlier are used
00029 10243 (4GB) 6.5 8.9 2.7 14.2 0.45 0.6 19.7 137.5 124.3
(17.9GB) 20483 (32GB) 6.3 8.8 3.2 98.2 3.2 3.9 114.9 158.2 ✗ in generating a wide variety of volume data as in Figure 8 and
40963 (256GB) 6.3 8.7 3.0 756.0 27.0 36.2 807.2 166.4 ✗
Figure 11. All volume data and slices are visualized for manual
visual inspection by the widely used 3D Slicer [18] viewer.
There are several reasons why we do not conduct a direct per-
features, e.g. walls, hollow pores, voids, laminar features. More formance comparison with other frameworks. (i) It would be in-
importantly, it could broadly represent the low-contrast usually adequate to compare with parallel-beam based algorithms such as
found in CFRP [55]. The pore structure is not dissimilar to that NU-PSV [63] and Peta-scale XCT [28] since they target the older
found in auxetic or metal foams [49], or for the bioengineering generation of CT devices. (ii) Cone-beam based systems such as
research, trabecular (cancellous) bone [25, 59]. RTK, Lu et al. [38], and iFDK [9] are incapable of processing the
Geometric correction & pre-processing of parameters. All CBCT datasets of the sizes and types we use. That is since all these
parameters we correct are listed in Table 4. The parameter 𝜎𝑐𝑜𝑟 is libraries do not consider the geometric correction as listed in Ta-
very small in value, yet due to the magnification effect, it is critical ble 4. Furthermore, RTK and TIGRE are restricted to a single GPU
to the image quality of X-ray microscopy. and compute node, respectively.
Measurement methods. All runs use single-precision on CPUs
and GPUs. The reported performance is averaged using a hundred 6.2 Out-of-core Back-projection Evaluation
repeated executions. The runtime of GPU kernel and host code are CUDA Kernel Performance. We report the CUDA back-projection
measured by cudaEvent and MPI_Wtime, respectively. We conduct kernel performance in the unit of GUPS 2 (Giga-updates Per Second).
both numerical and visual inspection assessments to assure the The specification of the A100 compute node used in this section
generated volume data is correct. (i) Regarding the numerical is different from V100 as follows: two Intel Xeon Platinum 8360Y
assessment, the digital phantom of Shepp-Logan phantom [53] CPUs, eight Nvidia Tesla A100 SMX4 GPUs, and NVMe SSD/Intel
is used to generate the projections by RTK tool and reconstruct
volumes using those projections. The root mean square error [1] 2 𝑃𝑒𝑟 𝑓 𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧 ·𝑁𝑝
= , where T is the runtime in the unit of second.
𝑇 ·109
500 700 500 450

489.5 Coffee bean 631.7 Bumbleebee Tomo_00029
450 Coffee bean 2x 450 430.0 400 384.6
600
400 400 350
Projected
350 500 350
Measured 300
Runtime (s)
300 300
400
Runtime (s)
268.8 250
Runtime (s)
Runtime (s)
250 329.2 250 227.4 209.2
300 200
200 200
140.8 181.7 130.2 150 120.8
150 200 150
100 75.7 95.1 100 69.2 100 61.7
40.2 100 49.2 35.5 32.3
50 22.7 15.3 25.8 14.5 12.7 50 18.7 13.712.6 50 16.8 13.2 11.911.5
0 0 0 0
16
32
64
128
256
512
1024
16
32
64
128
256
512
1024
4
8
16
32
64
128
256
512
1024
8
16
32
64
128
256
512
1024
Number of GPUs Number of GPUs Number of GPUs Number of GPUs
(a) 3928×1998×6401⇒40963 , 𝑁𝑟 =16. (b) 1864×999×6401⇒40963 , 𝑁𝑟 =8. (c) 20002 ×3142⇒40963 , 𝑁𝑟 =8. (d) 2004×1335×1800⇒40963 , 𝑁𝑟 =4.
Figure 13: Strong scaling. Projected denotes the potential best runtime as predicted by our performance model. Coffee bean 2x
is a rebinning of the original dataset (i.e. double the pixel size to reduce the input size to 1/4)
40000
18 16
35000 Coffee bean
Coffee bean Bumblebee
Performance (GUPS)
16 14
30000 Coffee bean 2x
14 14.8 15.3 12 12.7 Bumblebee
13.9 25000
12 12.9 13.1 11.5 11.7 11.9
10 Tomo_00029
10 20000
Runtime (s)
Runtime (s)
8 9 9 9 9
8 9 9 9 9 9 15000
6
6 10000
4 4
Projected 5000
2 Mesured 2
0
0 0
4 8 16 32 64 128 256 512 1024
128
256
512
1024
64
128
256
512
1024
Number of GPUs
Number of GPUs Number of GPUs
6401·𝑁𝑔𝑝𝑢𝑠 3142·𝑁𝑔𝑝𝑢𝑠 Figure 15: Performance (GUPS) when generating 40963 vol-
(a) 3928×1998× 1024 ⇒40963 . (b) 2000×2000× 1024 ⇒40963 .
umes, data sets and parameter configuration are similar to
𝑁𝑔𝑝𝑢𝑠 𝑁
Figure 14: Weak scaling. (a) 𝑁𝑟 = 64 .
𝑔𝑝𝑢𝑠
(b) 𝑁𝑟 = 128 . Figure 13).
SSD P4610 1.6T×2. We take advantage of the widely used CUDA while the data movement between the device and host is overlapped
kernel in the RTK library for performance comparison. We achieve with the storing of the volume on the PFS (𝑇𝑠𝑡𝑜𝑟𝑒 ). Accordingly, all
performance that is competitive to the RTK library (as shown in the operations in the end-to-end pipeline are overlapped as shown
Table 5). Using a single Nvidia V100 GPU, the performance of the in Figure 10b.
back-projection kernel in RTK is 104.7∼113.7 GUPS while our ker- Out-of-core Back-projection. Based on the 2D projection de-
nel is 29.5∼129.2 GUPS. The results on the Nvidia A100 GPU show composition, our algorithm can tackle problem sizes beyond the
performance improvement that is proportional to the difference in device capacity (i.e. out-of-core). We use the two Tomobank datasets
peak performance between V100 and A100 (15.7 TFlops in V100 tomo_00029 and tomo_00030 in this evaluation. Figure 8 shows a
vs. 19.5 TFlops in A100 in single precision). We use the Roofline reconstructed slice of tomo_00030 at resolution of 512 × 512. We
model to analyze back-projection kernels on a single V100 GPU increase the output sizes gradually to go beyond device memory
using Nsight Compute [15] profiler as shown in Figure 12. We capacity. In Table 5, we list the pipelined computations and the per-
observe that the proposed CUDA kernel achieves a competitive formance of our system using a compute node with a single GPU
performance of ∼4.5 TFLOP/s, which is about 32.8% peak perfor- (V100/A100). Figure 10a presents a detailed example of generating
mance of the V100 GPU. This is very similar to the performance a 20483 volume (32GB) on a single V100 GPU.
of RTK, despite the extra redundant computation (e.g. the offset Our algorithm can generate volumes of any arbitrary sizes by
computations for K, Y, and Z in Listing 1) we use in our kernel to moving the projections once from host to device. The execution
enable the decomposition of input and volume. In summary, we time 𝑇𝐻 2𝐷 in Table 5 is nearly constant while the 𝑇𝐷2𝐻 is propor-
achieve performance that matches that of one of the most widely tional to the size of volume data. Note that the RTK library can not
used and optimized open libraries (RTK), despite increasing the generate volumes beyond 8GB and 20GB on a V100 and A100 GPU,
amount of computation. On top of that, our kernel provides the respectively. Our out-of-core performance is slightly better than
benefit of cone-beam decomposition that enables better scaling and the highly optimized algorithm in RTK. Furthermore, we can solve
also out-of-core capability. larger problems (i.e. 40963 volume).
It is important to note that the data movement between host and To sum up, we achieve out-of-core computing capability, without
device (𝑇𝐻 2𝐷 ) is overlapped with the filtering computation (𝑇𝑓 𝑙𝑡 ), sacrificing the back-projection performance on GPUs.
6.3 Scalability & Performance thread, and the back-projection thread also waits for processed
This section reports the performance and scalability of our dis- data from the filter thread. On the other hand, when we scale input
tributed FPB framework. We configure the batch count parameter, size and not the volume, we eventually become bound by the PFS
i.e. number batches, at 𝑁𝑐 =8 in all runs. The batch sizes is calculated throughput.
as 𝑁𝑏 =𝑁𝑧 /(𝑁𝑔 ·𝑁𝑐 ) (according to Equations 10 and Equations 12). Discussion. In high-resolution CBCT, it is common to do 10s
In Figure 13, projected is the potential best runtime predicted by of repeated reconstructions after tuning the reconstruction param-
our performance model in Equation 17. According to Equation 15, eters to achieve a high-quality image result, e.g. Metal Artifact
the volumes sizes that are generated by each GPU and reduced in Reduction (MAR) processing [2]. Hence, the aggregate time saving
each MPI group may be expressed as 𝜂·𝑁𝑥 ·𝑁 𝑦 ·𝑁𝑧 ·𝑁𝑟 /𝑁𝑔𝑝𝑢𝑠 . Hence, when doing reconstruction at a large scale is substantial and con-
when 𝑁𝑔𝑝𝑢𝑠 =16, the sizes of the generated volumes in Figure 13a, tributes highly to productivity. Additionally, since the algorithm
b, c, and d are 256GB, 128GB, 128GB, and 64GB, respectively. This we propose enables out-of-core computing, one could also resort
demonstrates that our solution can generate volumes beyond the to using fewer resources (if there is budget constant for instance)
GPU memory capacity. and yet still be able to do high-resolution image reconstructions
Strong scaling. We elaborate on the strong scaling of our frame- for CBCT.
work in this paragraph. Figure 13 shows the strong scaling of several
datasets. Note that Figure 13b is a rebinning of the original coffee 6.4 Summary of Findings
bean dataset. All figures demonstrate that our implementation’s We summarize the findings of our experiments:
scaling matches that of the projected best runtime. According to • With up to 1024 V100 GPUs, we reconstruct CBCT volumes of
the performance model (Section 5), 𝑇𝑙𝑜𝑎𝑑 , 𝑇𝑓 𝑖𝑙𝑡𝑒𝑟 , 𝑇𝑏𝑝 , and 𝑇𝐷2𝐻 sizes 20483 and 40963 in 3s and 16s (including I/O), respectively.
decrease linearly with the number of GPUs. As shown in Figure 10b • By decomposing the projection in two dimensions, we improve
of the Bumblebee dataset using 128 GPUs, our implementation effi- performance and scalability (Figures 13, 14, 15). This is mainly be-
ciently overlaps the operations and approaches the potential peak cause the decomposition scheme achieves: a) insignificant redun-
performance. As expected with strong scaling, the performance dancy in operations, and b) effective pipelining of all operations
becomes flat as the number of GPUs increase (beyond 256 GPUs) as (end-to-end).
the observed overheads of I/O and communication start to dominate • Based on our decomposition scheme, we implement the first
the runtime. framework, to the authors knowledge, that support out-of-core
Weak scaling. This paragraph presents the weak scaling of our capability for cone-beam (8∼17 minutes for 40963 volumes on a
framework. In Figure 14, we show the weak scaling when gen- single V100 GPU).
erating 40963 volumes. We only present the weak scaling on the
coffee bean (Figure 14a) and bumblebee (Figure 14b) datasets due 7 CONCLUSION
to space limitations. In Figure 14a, the size of each projection is
In this work, we propose a novel algorithm to decompose the image
3928×1998, the size of the generated volume is 40963 . The evaluated
reconstruction problem for current generation cone-beam CT de-
pairs of (𝑁𝑝 , 𝑁𝑟 ) are (400, 1), (800, 2), . . . , (6401, 16). As Figure 14b
vices. The intensive computations can then be easily parallelized on
shows, to generate the volume data of size 40963 , the pairs of eval-
distributed systems, in a simple processing fashion similar to that
uated parameters (𝑁𝑝 , 𝑁𝑟 ) are (392, 1), (785, 2), . . . , (3142, 8). Differ-
of parallel-beam CT. We demonstrate the efficient performance and
ent MPI groups call the MPI_Reduce operation (𝑇𝑟𝑒𝑑𝑢𝑐𝑒 ) indepen-
scaling of our implementation using several real-world datasets on
dently, i.e., a segmented MPI_Reduce. Therefore 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 increases
the ABCI supercomputer. Also, we take advantage of the hetero-
slightly with more GPUs (i.e. more ranks) while the other opera-
geneous architecture of the GPU-accelerated supercomputers by
tions within each rank are basically constant: 𝑇𝑙𝑜𝑎𝑑 , 𝑇𝑓 𝑖𝑙𝑡𝑒𝑟 , and
orchestrating the different computations on the CPUs and GPUs.
𝑇𝑏𝑝 . According to our performance model, the projected runtime is
The proposed CUDA kernel supports out-of-core image reconstruc-
0 0 0 Í𝑁𝑐 −1 𝑖 Í𝑁𝑐 −1 𝑖
𝑇𝑠𝑡𝑜𝑟𝑒 .
𝑇𝑟𝑢𝑛𝑡𝑖𝑚𝑒 = 𝑇𝐶𝑃𝑈 + 𝑇𝐺𝑃𝑈 + 𝑇𝑟𝑒𝑑𝑢𝑐𝑒 + 𝑖=0 𝑇𝑠𝑡𝑜𝑟𝑒 ≈ 𝑖=0 tion. That is, we can use a single GPU to generate 40963 volumes
Since 𝐵𝑊𝑠𝑡𝑜𝑟𝑒 ≈ 28.5𝐺𝐵/𝑠 in the system we use in our experiments. that require a memory capacity of more than 256GB, far exceeding
The required time for storing a single 40963 volume is ∼9s, which the GPU’s memory capacity.
makes it the longest stage in the pipeline. Hence, the projected time
in Figure 14 becomes ∼9s since the performance model assumes a ACKNOWLEDGMENT
perfect overlap. This work was supported by JSPS KAKENHI under Grant Number
Performance. This paragraph analyzes the performance of our JP21K17750. This paper is based on results obtained from a project,
framework. Figure 15 shows the performance for generating the JPNP20006, commissioned by the New Energy and Industrial Tech-
volumes of size 40963 from different datasets. We can observe two nology Development Organization (NEDO). This research was par-
orders of magnitude speed up as we go from a single GPU to hun- tially supported by EPSRC grant EP/R002495/1 and EURAMET
dreds of GPUs. In Figure 13 and Figure 14, we show the potential grant 17IND08. This work was partially supported by JST-CREST
best runtime as predicted by our performance model using Equa- under Grant Number JPMJCR19F5; JST, PRESTO Grant Number
tion 17. The empirical results demonstrate that we can achieve 78% JPMJPR20MA, Japan. We would like to thank Endo Lab at Tokyo
of the peak performance on average. As Figure 10 shows, moving Institute of Technology for providing computing resources. The
and collecting data within a single MPI rank introduces most of author wishes to acknowledge useful discussions with Prof. Qinyou
the overhead, e.g. the filter thread waits for data from the load Hu at SMU and Dr. Jintao Meng at CAS.
REFERENCES [22] Randolf Hanke, Theobald Fuchs, and Norman Uhlmann. 2008. X-ray based
[1] J Scott Armstrong and Fred Collopy. 1992. Error measures for generalizing about methods for non-destructive testing and material characterization. Nuclear
forecasting methods: Empirical comparisons. International journal of forecasting Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers,
8, 1 (1992), 69–80. Detectors and Associated Equipment 591, 1 (2008), 14–18.
[2] Sonja Vulcu Tomas Dobrocky Werner J. Z’Graggenm Franca Wagner Ar- [23] Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer
sany Hakim, Manuela Pastore-Wapp. 2019. Efficiency of Iterative Metal Ar- vision. Cambridge university press.
tifact Reduction Algorithm (iMAR) Applied to Brain Volume Perfusion CT in the [24] Sepideh Hatamikia, Ander Biguri, Gernot Kronreif, Michael Figl, Tom Russ,
Follow-up of Patients after Coiling or Clipping of Ruptured Brain Aneurysms. Joachim Kettenbach, Martin Buschmann, and Wolfgang Birkfellner. 2021. To-
Nature Scientific Reports 9, 19423 (2019), 201–213. ward on-the-fly trajectory optimization for C-arm CBCT under strong kinematic
[3] Thilo Balke, S. Majee, G. Buzzard, Scott Poveromo, P. Howard, M. Groeber, John constraints. PLOS ONE 16, 2 (02 2021), 1–17. https://doi.org/10.1371/journal.
McClure, and C. Bouman. 2018. Separable Models for cone-beam MBIR Recon- pone.0245508
struction. electronic imaging 2018 (2018). [25] Rong-Ting He, Ming-Gene Tu, Heng-Li Huang, Ming-Tzu Tsai, Jay Wu, and
[4] GM Besson. 2016. Seventh-generation CT. In Medical Imaging 2016: Physics Jui-Ting Hsu. 2019. Improving the prediction of the trabecular bone microarchi-
of Medical Imaging, Vol. 9783. International Society for Optics and Photonics, tectural parameters using dental cone-beam computed tomography. BMC Medical
978350. Imaging 19, 1 (2019), 10:1–10:9. https://doi.org/10.1186/s12880-019-0313-9
[5] Tekin Bicer, Doğa Gürsoy, Vincent De Andrade, Rajkumar Kettimuthu, William [26] I Henry and Ming Chen. 2012. An FPGA Architecture for Real-Time 3-D Tomo-
Scullin, Francesco De Carlo, and Ian T. Foster. 2017. Trace: a high-throughput graphic Reconstruction. Ph.D. Dissertation. University of California, Los Angeles.
tomographic reconstruction engine for large-scale datasets. Advanced Structural [27] Mert Hidayetoğlu, Tekin Biçer, Simon Garcia De Gonzalo, Bin Ren, Doğa Gür-
and Chemical Imaging 3, 1 (jan 2017). https://doi.org/10.1186/s40679-017-0040-7 soy, Rajkumar Kettimuthu, Ian T Foster, and Wen-mei W Hwu. 2019. Memxct:
[6] Ander Biguri, Reuben Lindroos, Robert Bryll, Hossein Towsyfyan, Hans Deyhle, Memory-centric x-ray ct reconstruction with massive parallelization. In Proceed-
Ibrahim El khalil Harrane, Richard Boardman, Mark Mavrogordato, Manjit ings of the International Conference for High Performance Computing, Networking,
Dosanjh, Steven Hancock, and Thomas Blumensath. 2020. Arbitrarily large Storage and Analysis. 1–56.
tomography with iterative algorithms on multiple GPUs using the TIGRE tool- [28] Mert Hidayetoğlu, Tekin Bicer, Simon Garcia de Gonzalo, Bin Ren, Vincent De An-
box. J. Parallel and Distrib. Comput. 146 (2020), 52 – 63. https://doi.org/10.1016/ drade, Doga Gursoy, Raj Kettimuthu, Ian T. Foster, and Wen-mei W. Hwu. 2020.
j.jpdc.2020.07.004 Petascale XCT: 3D Image Reconstruction with Hierarchical Communications
[7] Javier Garcia Blas, Monica Abella, Florin Isaila, Jesus Carretero, and Manuel Desco. on Multi-GPU Nodes. In Proceedings of the International Conference for High
2014. Surfing the optimization space of a multiple-GPU parallel implementation Performance Computing, Networking, Storage and Analysis (SC ’20). IEEE Press,
of a X-ray tomography reconstruction algorithm. Journal of Systems and Software Article 37, 13 pages.
95 (2014), 166–175. [29] Johannes Hofmann, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Per-
[8] Brian Cabral, Nancy Cam, and Jim Foran. 1994. Accelerated volume rendering formance engineering for a medical imaging application on the Intel Xeon Phi
and tomographic reconstruction using texture mapping hardware. In Proceedings accelerator. In ARCS 2014; 2014 Workshop Proceedings on Architecture of Computing
of the 1994 symposium on Volume visualization. 91–98. Systems. VDE, 1–8.
[9] Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, and Satoshi [30] F. Ino, Y. Okitsu, T. Kishi, S. Ohnishi, and K. Hagihara. 2010. Out-of-core cone
Matsuoka. 2019. iFDK: A Scalable Framework for Instant High-Resolution beam reconstruction using multiple GPUs. In 2010 IEEE International Symposium
Image Reconstruction. In Proceedings of the International Conference for High on Biomedical Imaging: From Nano to Macro. 792–795. https://doi.org/10.1109/
Performance Computing, Networking, Storage and Analysis (SC ’19). Associ- ISBI.2010.5490055
ation for Computing Machinery, New York, NY, USA, Article 84, 24 pages. [31] Intel. 2021. Intel MPI Benchmarks User Guide. https://software.intel.com/
https://doi.org/10.1145/3295500.3356163 content/www/us/en/develop/documentation/imb-user-guide/top.html [Online;
[10] Srdjan Coric, Miriam Leeser, Eric Miller, and Marc Trepanier. 2002. Parallel- accessed 27-May-2021].
beam backprojection: an FPGA implementation optimized for medical imaging. [32] DA Jaffray and JH Siewerdsen. 2000. Cone-beam computed tomography with
In Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field- a flat-panel imager: initial performance characterization. Medical physics 27, 6
programmable gate arrays. ACM, 217–226. (2000), 1311–1323.
[11] NVIDIA CUDA. 2021. CUDA Toolkit Documentation. NVIDIA Developer Zone. [33] Nicolai M Josuttis. 2012. The C++ standard library: a tutorial and reference.
http://docs.nvidia.com/cuda/index.html (2021). (2012).
[12] Jingyu Cui, Guillem Pratx, Bowen Meng, and Craig S Levin. 2013. Distributed [34] Avinash C.. Kak and Malcolm Slaney. 1988. Principles of computerized tomographic
MLEM: An iterative tomographic image reconstruction algorithm for distributed imaging. IEEE press New York.
memory architectures. IEEE transactions on medical imaging 32, 5 (2013), 957–967. [35] Vladimir Kasik, Martin Cerny, Marek Penhaker, Václav Snášel, Vilem Novak, and
[13] Francesco De Carlo, Doğa Gürsoy, Daniel J Ching, K Joost Batenburg, Wolfgang Radka Pustkova. 2012. Advanced CT and MR image processing with FPGA. In
Ludwig, Lucia Mancini, Federica Marone, Rajmund Mokso, Daniël M Pelt, Jan International Conference on Intelligent Data Engineering and Automated Learning.
Sijbers, et al. 2018. TomoBank: a tomographic data repository for computational Springer, 787–793.
x-ray science. Measurement Science and Technology 29, 3 (2018), 034004. [36] Jean Pierre Kruth, Markus Bartscher, Simone Carmignato, Robert Schmitt,
[14] W De Vos, Jan Casselman, and GRJ Swennen. 2009. Cone-beam computerized Leonardo De Chiffre, and Albert Weckenmann. 2011. Computed tomography for
tomography (CBCT) imaging of the oral and maxillofacial region: a systematic dimensional metrology. CIRP annals 60, 2 (2011), 821–842.
review of the literature. International journal of oral and maxillofacial surgery 38, [37] Wenxuan Liang, Hui Zhang, and Guangshu Hu. 2010. Optimized implementation
6 (2009), 609–625. of the FDK algorithm on one digital signal processor. Tsinghua Science and
[15] Nvidia Developer Tools Document. 2021. Nvidia Nsight Compute. https://docs. Technology 15, 1 (2010), 108–113.
nvidia.com/nsight-compute/NsightComputeCli/index.html [Online; accessed [38] Yuechao Lu, Fumihiko Ino, and Kenichi Hagihara. 2016. Cache-aware GPU
27-May-2021]. optimization for out-of-core cone beam CT reconstruction of high-resolution
[16] Daniel Castaño Díez, Hannes Mueller, and Achilleas S Frangakis. 2007. Imple- volumes. IEICE TRANSACTIONS on Information and Systems 99, 12 (2016), 3060–
mentation and performance evaluation of reconstruction algorithms on graphics 3071.
processors. Journal of Structural Biology 157, 1 (2007), 288–295. [39] John B Ludlow and Marija Ivanovic. 2008. Comparative dosimetry of dental
[17] Anders Eklund, Paul Dufort, Daniel Forsberg, and Stephen M. LaConte. 2013. CBCT devices and 64-slice CT for oral and maxillofacial radiology. Oral Surgery,
Medical image processing on the GPU – Past, present and future. Medical Image Oral Medicine, Oral Pathology, Oral Radiology, and Endodontology 106, 1 (2008),
Analysis 17, 8 (2013), 1073 – 1094. https://doi.org/10.1016/j.media.2013.05.008 106–114.
[18] Andriy Fedorov, Reinhard Beichel, Jayashree Kalpathy-Cramer, Julien Finet, Jean- [40] Dmitri Matenine, Geoffroi Côté, Julia Mascolo-Fortin, Yves Goussard, and
Christophe Fillion-Robin, Sonia Pujol, Christian Bauer, Dominique Jennings, Philippe Després. 2018. System matrix computation vs storage on GPU: A
Fiona Fennessy, Milan Sonka, et al. 2012. 3D Slicer as an image computing comparative study in cone beam CT. Medical physics 45, 2 (2018), 579–588.
platform for the Quantitative Imaging Network. Magnetic resonance imaging 30, [41] Klaus Mueller, F Xu, and N Neophytou. 2007. Why do GPUs work so well for
9 (2012), 1323–1341. acceleration of CT? SPIE Electronic Imaging07 (2007). http://cvc.cs.stonybrook.
[19] LA Feldkamp, LC Davis, and JW Kress. 1984. Practical cone-beam algorithm. edu/Publications/2007/MXN07a
JOSA A 1, 6 (1984), 612–619. [42] Nassir Navab, A Bani-Hashemi, Mariappan S Nadar, Karl Wiesent, Peter Durlak,
[20] Yushan Gao, Ander Biguri, and Thomas Blumensath. 2019. Block stochastic Thomas Brunner, Karl Barth, and Rainer Graumann. 1998. 3D reconstruction from
gradient descent for large-scale tomographic reconstruction in a parallel network. projection matrices in a C-arm based 3D-angiography system. In International
arXiv preprint arXiv:1903.11874 (2019). Conference on Medical Image Computing and Computer-Assisted Intervention.
[21] Jens Gregor and Thomas Benson. 2008. Computational analysis and improvement Springer, 119–129.
of SIRT. IEEE transactions on medical imaging 27, 7 (2008), 918–924. [43] Brian Nett. 2020. Animated CT Generations for Radiologic Technologists. https:
//howradiologyworks.com/ctgenerations/
[44] Willem Jan Palenstijn, Jeroen Bédorf, and K Joost Batenburg. 2015. A distributed [68] Fang Xu and Klaus Mueller. 2005. Accelerating popular tomographic recon-
SIRT implementation for the ASTRA toolbox. Proc. Fully Three-Dimensional struction algorithms on commodity PC graphics hardware. IEEE Transactions on
Image Reconstruct. Radiol. Nucl. Med (2015), 166–169. nuclear science 52, 3 (2005), 654–663.
[45] Xiaochuan Pan, Emil Y Sidky, and Michael Vannier. 2009. Why do commercial [69] Xinwei Xue, Arvi Cheryauka, and David Tubbs. 2006. Acceleration of fluoro-CT
CT scanners still employ traditional, filtered back-projection for image recon- reconstruction for a mobile C-Arm on GPU and FPGA hardware: a simulation
struction? Inverse problems 25, 12 (2009), 123009. study. In Medical Imaging 2006: Physics of Medical Imaging, Vol. 6142. International
[46] Ruben Pauwels, Jilke Beinsberger, Bruno Collaert, Chrysoula Theodorakou, Jes- Society for Optics and Photonics, 61424L.
sica Rogers, Anne Walker, Lesley Cockmartin, Hilde Bosmans, Reinhilde Jacobs, [70] Kai Yang, Alexander LC Kwan, DeWitt F Miller, and John M Boone. 2006. A
Ria Bogaerts, et al. 2012. Effective dose range for dental cone beam computed geometric calibration method for cone beam CT systems. Medical physics 33,
tomography scanners. European journal of radiology 81, 2 (2012), 267–271. 6Part1 (2006), 1695–1706.
[47] N Rezvani, D Aruliah, K Jackson, D Moseley, and J Siewerdsen. 2007. SU-FF-I-16: [71] X-ray Tomography Solutions ZEISS. 2021. High Resolution 3D X-ray Microscopy
OSCaR: An open-source cone-beam CT reconstruction tool for imaging research. and Computed Tomography. https://www.zeiss.com/microscopy/int/products/x-
Medical Physics 34, 6Part2 (2007), 2341–2341. ray-microscopy.html. [Online; accessed 27-May-2021].
[48] John C Russ. 1990. Image processing. In Computer-assisted microscopy. Springer, [72] Timo Zinsser and Benjamin Keck. 2013. Systematic performance optimization of
33–69. cone-beam back-projection on the Kepler architecture. Proceedings of the 12th
[49] Mohammad Saadatfar, Francisco García-Moreno, S. Hutzler, A.P. Sheppard, Mark Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine
Knackstedt, John Banhart, and Denis Weaire. 2009. Imaging of metallic foams (2013), 225–228.
using X-ray micro-CT. Colloids and Surfaces A: Physicochemical and Engineering [73] Yu Zou and Xiaochuan Pan. 2004. Exact image reconstruction on PI-lines from
Aspects 344 (07 2009), 107–112. https://doi.org/10.1016/j.colsurfa.2009.01.008 minimum data in helical cone-beam CT. Physics in Medicine & Biology 49, 6
[50] Amit Sabne, Xiao Wang, Sherman J Kisner, Charles A Bouman, Anand Raghu- (2004), 941.
nathan, and Samuel P Midkiff. 2017. Model-based iterative CT image reconstruc-
tion on GPUs. ACM SIGPLAN Notices 52, 8 (2017), 207–220.
[51] Paul Sack and William Gropp. 2010. A scalable mpi_comm_split algorithm for
exascale computing. In European MPI Users’ Group Meeting. Springer, 1–10.
[52] Holger Scherl, Markus Kowarschik, Hannes G Hofmann, Benjamin Keck, and
Joachim Hornegger. 2012. Evaluation of state-of-the-art hardware architectures
for fast cone-beam CT reconstruction. Parallel computing 38, 3 (2012), 111–124.
[53] Lawrence A Shepp and Benjamin F Logan. 1974. The Fourier reconstruction of a
head section. IEEE Transactions on nuclear science 21, 3 (1974), 21–43.
[54] Lawrence A Shepp and Yehuda Vardi. 1982. Maximum likelihood reconstruction
for emission tomography. IEEE transactions on medical imaging 1, 2 (1982),
113–122.
[55] Rainer Stoessel, Denis Kiefel, Reinhold Oster, Björn Diewel, and L Llopart Prieto.
2011. 𝜇 -computed tomography for 3d porosity evaluation in Carbon Fibre Rein-
forced Plastics (CFRP). In International Symposium on Digital Industrial Radiology
and Computed Tomography.
[56] Frederick C Strong. 1952. Theoretical basis of Bouguer-Beer law of radiation
absorption. Analytical Chemistry 24, 2 (1952), 338–342.
[57] Nikhil Subramanian. 2009. A C-to-FPGA solution for accelerating tomographic
reconstruction. Ph.D. Dissertation. University of Washington.
[58] Stewart Taylor. 2007. Optimizing applications for multi-core processors: using the
intel integrated performance primitives. Intel.
[59] Ming-Tzu Tsai, Rong-Ting He, Heng-Li Huang, Ming-Gene Tu, and Jui-Ting
Hsu. 2020. Effect of Scanning Resolution on the Prediction of Trabecular Bone
Microarchitectures Using Dental Cone Beam Computed Tomography. Diagnostics
10, 6 (2020). https://doi.org/10.3390/diagnostics10060368
[60] Wim Van Aarle, Willem Jan Palenstijn, Jeroen Cant, Eline Janssens, Folkert
Bleichrodt, Andrei Dabravolski, Jan De Beenhouwer, K Joost Batenburg, and
Jan Sijbers. 2016. Fast and flexible X-ray tomography using the ASTRA toolbox.
Optics express 24, 22 (2016), 25129–25147.
[61] Richard Wilson Vuduc. 2003. Automatic performance tuning of sparse matrix
kernels. Vol. 1. Citeseer.
[62] Xiao Wang, Amit Sabne, Sherman J. Kisner, Anand Raghunathan, Charles A.
Bouman, and Samuel P. Midkiff. 2016. High Performance Model-Based Image
Reconstruction. 21st ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP’16) (2016), 2:1–2:12. https://github.com/HPImaging/
sv-mbirct
[63] Xiao Wang, Amit Sabne, Putt Sakdhnagool, Sherman J Kisner, Charles A Bouman,
and Samuel P Midkiff. 2017. Massively parallel 3D image reconstruction. In
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis. 1–12.
[64] Xiao Wang, Venkatesh Sridhar, Zahra Ronaghi, Rollin Thomas, Jack Deslippe, Dil-
worth Parkinson, Gregery T Buzzard, Samuel P Midkiff, Charles A Bouman, and
Simon K Warfield. 2019. Consensus equilibrium framework for super-resolution
and extreme-scale CT reconstruction. In Proceedings of the International Confer-
ence for High Performance Computing, Networking, Storage and Analysis. 1–23.
[65] Jason M Warnett, Valeriy Titarenko, Ercihan Kiraci, Alex Attridge, William RB
Lionheart, Philip J Withers, and Mark A Williams. 2016. Towards in-process
x-ray CT for dimensional metrology. Measurement Science and Technology 27, 3
(2016), 035401.
[66] Karl Wiesent, Karl Barth, Nassir Navab, Peter Durlak, Thomas Brunner, Oliver
Schuetz, and Wolfgang Seissler. 2000. Enhanced 3-D-reconstruction algorithm
for C-arm systems suitable for interventional procedures. IEEE transactions on
medical imaging 19, 5 (2000), 391–403.
[67] Michael A Wu. 1991. ASIC applications in computed tomography systems. In
ASIC Conference and Exhibit, 1991. Proceedings., Fourth Annual IEEE International.
IEEE, P1–3.
View publication stats

Scalable FBP Decomposition For Cone-Beam CT Reconstruction: November 2021

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Scalable FBP Decomposition For Cone-Beam CT Reconstruction: November 2021

Uploaded by

Copyright:

Available Formats

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Scalable FBP Decomposition for Cone-beam CT Reconstruction

Conference Paper · November 2021

Peng Chen Xiao Wang

SEE PROFILE SEE PROFILE

Takahiro Hirofuchi Hirotaka Ogawa

SEE PROFILE SEE PROFILE

General method for motion correction in X-ray CT View project

Gridvon View project

The user has requested enhancement of the downloaded file.

Scalable FBP Decomposition for Cone-beam CT Reconstruction

Conference Paper · November 2021

Peng Chen Xiao Wang

SEE PROFILE SEE PROFILE

Takahiro Hirofuchi Hirotaka Ogawa

SEE PROFILE SEE PROFILE

OpenJIT View project

Super-Voxel View project

The user has requested enhancement of the downloaded file.

Takahiro Hirofuchi Hirotaka Ogawa Ander Biguri

Richard Boardman Thomas Blumensath Satoshi Matsuoka

ABSTRACT ACM Reference Format:

of the cone-shaped projections. This solves the problem of

Z-axis Algorithm 1: 3D Back-projection algorithm. SubPixel is a bi-

each pixel value as line 9 shows. The SubPixel in Algorithm 1 demon-

Input (projections): 𝑁𝑣 × 𝑁𝑝 × 𝑁𝑢 Overlapped area: Output (volume): 𝑁𝑧 × 𝑁𝑦 × 𝑁𝑥

Z-axis X-axis X-axis X

Algorithm 2: Compute maximum projection area. 𝑀135◦ and

Load data Filtering Back-projection MPI-Reduce Store 𝑁𝑢 Rotation center

Listing 1: Proposed CUDA Back-projection kernel enabling

Storage Queue0 Queue1 Queue2 Queue3 Storage

Load Thread 9.5s 137.7s Load Thread 0.5s 35.5s

4.4.2 Communication: Efficient Segmented Reduction. This section

mainly depends on the size of the partial projections

Table 5: Out-of-core evaluation using tomobank dataset on

500 700 500 450

View publication stats

You might also like