Parallel Rasterisation Zhou 2016
Abstract
This research develops a parallel scheme to adopt multiple graphics processing units (GPUs) to accelerate
large-scale polygon rasterization. Three new parallel strategies are proposed. First, a decomposition strat-
egy considering the calculation complexity of polygons and limited GPU memory is developed to achieve
balanced workloads among multiple GPUs. Second, a parallel CPU/GPU scheduling strategy is proposed
to conceal the data read/write times. The CPU is engaged with data reads/writes while the GPU rasterizes
the polygons in parallel. This strategy can save considerable time spent in reading and writing, further
improving the parallel efficiency. Third, a strategy for utilizing the GPU's internal memory and cache is
proposed to reduce the time required to access the data. The parallel boundary algebra filling (BAF) algo-
rithm is implemented using the programming models of compute unified device architecture (CUDA), mes-
sage passing interface (MPI), and open multi-processing (OpenMP). Experimental results confirm that the
implemented parallel algorithm delivers substantial acceleration when a massive dataset is addressed (50.32
GB with approximately 1.3 × 10⁸ polygons), reducing conversion time from 25.43 to 0.69 h and obtain-
ing a speedup ratio of 36.91. The proposed parallel strategies outperform the conventional method and
can be effectively extended to a CPU-based environment.
1 Introduction
Address for correspondence: Zhenjie Chen, Department of Geographic Information Science, Nanjing University, 163 Xianlin Avenue,
Nanjing, Jiangsu Province, China 210023. E-mail: chenzj@nju.edu.cn
Acknowledgements: This work was supported by the National Natural Science Foundation of China (Grant no. 41571378), and the
National High Technology Research and Development Program of China (Grant no. 2011AA120301). Sincere thanks are given to Dr.
Sun Chao for technical assistance.
2012). Developing GPU-based parallel technology has become a means for rapidly converting
massive vector data to raster data.
In recent years, considerable effort has been dedicated to the development of parallel tech-
niques for polygon rasterization; nevertheless, there remains much room to improve the efficiency of
rasterizing massive datasets. Conventional CPU-based parallel rasterization techniques
have achieved moderate parallel speedups (Healey et al. 1998; Wang et al. 2013). How-
ever, the acceleration performance was limited when addressing massive polygon datasets. To
develop a GPU-based accelerating technique, Zhang (2011) parallelized the scan-line rasteriza-
tion algorithm on a single GPU, where all of the coordinates were stored in the GPU's shared
memory. Although a considerable speedup ratio (20.5) was achieved, three urgent issues
remain to be solved.
1. The GPU's internal memory is constrained, limiting the volume of data that can be processed
(Hou et al. 2011; Zhang and Owens 2011). Existing polygon-calculation studies addressed
datasets with small data volumes rather than large-scale datasets exceeding the GPU's
memory (Simion et al. 2012; Zhao and Zhou
2013). When managing a massive dataset that exceeds the GPU memory, polygons must
be decomposed into subsets. Moreover, vector polygons have inherently complex struc-
tures; they vary significantly in size and calculation complexity and involve voluminous
data (Meng et al. 2007; Bakkum and Skadron 2010; Ye et al. 2011; Luo et al. 2012).
The equal treatment of different polygons can lead to unbalanced workloads. A rational
decomposition strategy for polygons that can achieve effective load balancing is urgently
needed.
2. In the existing strategy, a single GPU was used instead of multiple GPUs, and thus the
achieved performance acceleration was limited. To fully utilize the computational resour-
ces of CPUs and GPUs, a rational strategy for task scheduling is required.
3. Although the shared memory in a GPU is faster to access, it is typically small and cannot
store all the polygon coordinates. Zhang (2011) utilized shared memory to store the
polygonal nodes and could not process polygons with more than 1,024 nodes. A strategy
for more rationally utilizing the GPU memory that can improve data-reading efficiency
and process large-sized polygons is required. To address these issues, new GPU-based
parallel strategies for large-scale polygon rasterization are necessary.
The objective of this research is to develop a parallel scheme to accelerate the large-scale
polygon rasterization process under multiple GPUs based on the compute unified device archi-
tecture (CUDA). Three novel parallel strategies are proposed for this purpose. First, a decom-
position strategy is developed to ensure balanced workloads for multiple GPUs. In this
strategy, a measure model is first designed to estimate polygon complexity; then, polygons can
be decomposed according to ascending calculated complexity and limited GPU memory. Sec-
ond, a parallel CPU/GPU scheduling strategy is proposed to conceal data read/write time,
where the CPU is engaged with data reads/writes while the GPU rasterizes polygons in parallel.
Third, a strategy for utilizing the GPU's internal memory and cache is proposed to improve
data-reading efficiency and address polygons with excessive numbers of nodes. The proposed
parallel scheme is implemented and performed in a cluster with two GPUs. The accuracy loss
of the rasterized result is evaluated and the parallel performance is tested in terms of execution
time, speedup ratio, and load balancing. Performance of the proposed and conventional strat-
egies is compared when addressing different datasets. Finally, the extension of the proposed
strategy to CPU-based parallel implementations is discussed.
2 Background
2.1 GPU Architecture and CUDA Programming Model
GPUs have evolved into highly parallel multi-core systems that permit large datasets to be manip-
ulated in an efficient manner, especially for massively parallel applications. Sample GPU architec-
ture is illustrated in Figure 1a. In the GPU computing architecture supported by NVIDIA, the
multiprocessors, or blocks of processing units, are structured in a grid. The block threads are
grouped into warps (32 threads/warp). Each warp performs the same computation on different
data; this is called single-instruction, multiple-data (SIMD) mode (Nickolls et al. 2008).
The memory hierarchy for the GPU includes registers in addition to local, global, shared, texture,
and constant memories. The multiprocessor registers are distributed evenly across the threads
that are currently running. Data can also be stored in local memory that is private to each thread;
however, this has the same high latency as global memory. A portion of each multiprocessor's
on-chip memory serves as shared memory for the threads within a block. Accessing shared memory is slower than using
registers, but faster than global memory. GPU global memory is large (several GB) and can be
accessed by all threads, blocks, and grids; access, however, is slow. Texture memory is typically
used for graphics and constant memory accelerates uniform access. With the latest Kepler GPUs,
level-1 (L1) and level-2 (L2) caches are optimized for memory access.
CUDA is a popular framework that allows general-purpose programming of GPUs
(Mielikainen et al. 2013). A major advantage of CUDA, compared to other GPU programming
models, is that it uses a C language; hence, the C function code originally written for a CPU can
frequently be ported to a CUDA kernel with minimal modification. Furthermore, NVIDIA pro-
vides developers with C libraries that expose all the device functionalities required to integrate
CUDA into a C program (Sui et al. 2012; Tang et al. 2015). For these reasons, CUDA is chosen
in this research to accelerate the polygon rasterization process. In the CUDA programming envi-
ronment, there is a clear separation between the host code (for the CPU) and the kernel code (for
the GPU) (see Figure 1b). The host code contains all the processing allocated to the CPU; it can
manipulate data transfers between the CPU and GPU memories and launch kernel code on the
GPU. The kernel code is executed in parallel on the GPU in SIMD mode (NVIDIA Corp 2013).
Currently, two types of GPU parallel environments are commonly employed when performing
GPU-based parallel algorithms. They are one GPU on one CPU node and multiple GPUs on sepa-
rate parallel CPU nodes where one node has one GPU. In this research, we focus on the design and
implementation of parallel strategies under multiple GPUs that reside on separate CPU nodes.
downward, add the attribute value to the raster pixels on the left side; if the direction is parallel,
skip this boundary; and (3) repeat step (2) until all the boundaries are processed.
The general process of parallel rasterization includes five procedures: (1) data decomposi-
tion; (2) polygon data reading; (3) data transfers; (4) polygon filling computation; and (5) ras-
terization results writing. Before applying the sequential algorithm in a GPU environment, we
must first consider those parts that can be implemented on the GPU and those that can be allo-
cated to the CPU. Among the five procedures, polygon-filling computation is the most time-
consuming procedure; it can be written as a CUDA kernel function executed on the GPU. In
parallel execution, each CUDA thread is responsible for rasterizing a different polygon. The
other procedures are executed on the CPU in this implementation.
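The one-polygon-per-thread mapping described above can be sketched as follows. This is illustrative Python that simulates the launch sequentially; the grid-stride loop stands in for CUDA's block/thread indexing, and all names (`rasterize_chunk`, `fill_polygon`) are our own, not the paper's:

```python
def fill_polygon(poly):
    # Placeholder for the BAF filling of one polygon; here it just
    # returns the node count so the mapping can be observed.
    return len(poly)

def rasterize_chunk(polygons, block_num, threads_per_block):
    """Simulate the CUDA launch: global thread id -> polygon index."""
    results = {}
    total_threads = block_num * threads_per_block
    for block in range(block_num):
        for thread in range(threads_per_block):
            tid = block * threads_per_block + thread  # global thread id
            # grid-stride loop: thread tid handles polygons tid, tid+T, ...
            for poly_id in range(tid, len(polygons), total_threads):
                results[poly_id] = fill_polygon(polygons[poly_id])
    return results
```

On a GPU the two outer loops run concurrently; the sequential simulation only illustrates which polygon each thread would receive.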
where \(\mathit{number\_of\_nodes}^{right}_{i}\) is the number of nodes located in the right half of the MBR
and \(\mathit{number\_of\_nodes}_{i}\) is the total number of nodes. The cell size affects the rasterization effi-
ciency by changing the number of raster pixels. In our evaluation, three groups of experiments
were conducted to independently test the influence of each factor on the rasterization time. For
each group of experiments, the value of cell size was set to 10, 30, 50, and 70 m, and then:
1. To test number of nodes, its value was changed from 20 to 380, with MBR
area = 238,974.38 m² and shape = 0.5.
Figure 2 Influence of different factors on rasterization time (cell size was set as 10, 30, 50, and
70 m). Experimental results for: (a) number of nodes; (b) MBR area; and (c) shape
2. To test MBR area, its value was changed from 238,974.38 to 1,314,359.09 m², with
number of nodes = 20 and shape = 0.5.
3. To test shape, its value was changed from 0.1 to 0.9, with number of nodes = 20 and
MBR area = 238,974.38 m².
Figure 2 indicates that the changes in the different factors influenced the rasterization time
to varying degrees: (1) the number of nodes and MBR area are dominant factors affecting
rasterization efficiency, while shape is a subordinate factor; and (2) the cell size has an
obvious influence on number of nodes and MBR area. Specifically, when cell size ≤ 30 m, the
increasing ratio of MBR area is larger than that of number of nodes; when cell size > 30 m, the
increasing ratio of number of nodes is larger than that of MBR area. Accordingly, different
weight values should be assigned to number of nodes and MBR area for different
cell sizes. The measure model can be developed using the following steps: (1) For the i-th poly-
gon, the values of number of nodes, MBR area, and shape are first calculated; and (2) The nor-
malized values of number of nodes, \(\mathit{number\_of\_nodes}^{norm}_{i}\), can be calculated as:

\[
\mathit{number\_of\_nodes}^{norm}_{i} =
\frac{\log_{10}\mathit{number\_of\_nodes}_{i} \,-\, \min_{i=1 \ldots n}\{\log_{10}\mathit{number\_of\_nodes}_{i}\}}
     {\max_{i=1 \ldots n}\{\log_{10}\mathit{number\_of\_nodes}_{i}\} \,-\, \min_{i=1 \ldots n}\{\log_{10}\mathit{number\_of\_nodes}_{i}\}}
\tag{2}
\]

where n denotes the total number of polygons, \(\log_{10}\mathit{number\_of\_nodes}_{i}\) is the logarithmic
value of number of nodes, and \(\max_{i=1 \ldots n}\{\log_{10}\mathit{number\_of\_nodes}_{i}\}\) and
\(\min_{i=1 \ldots n}\{\log_{10}\mathit{number\_of\_nodes}_{i}\}\) are the maximum and minimum of that logarithmic
value, respectively. The normalized values of MBR area, \(\mathit{MBR\_area}^{norm}_{i}\), can be calculated as:

\[
\mathit{MBR\_area}^{norm}_{i} =
\frac{\log_{10}\mathit{MBR\_area}_{i} \,-\, \min_{i=1 \ldots n}\{\log_{10}\mathit{MBR\_area}_{i}\}}
     {\max_{i=1 \ldots n}\{\log_{10}\mathit{MBR\_area}_{i}\} \,-\, \min_{i=1 \ldots n}\{\log_{10}\mathit{MBR\_area}_{i}\}}
\tag{3}
\]

where \(\log_{10}\mathit{MBR\_area}_{i}\) is the logarithmic value of MBR area, and the maximum and minimum
are likewise taken over all n polygons. And finally, (3) the complexity, \(\mathit{complexity}_{i}\), can be calculated as:

\[
\mathit{complexity}_{i} =
\begin{cases}
0.4\,\mathit{number\_of\_nodes}^{norm}_{i} + 0.5\,\mathit{MBR\_area}^{norm}_{i} + 0.1\,\mathit{shape}_{i}, & \text{when } \mathit{cell\_size} \le 30\ \text{m} \\
0.5\,\mathit{number\_of\_nodes}^{norm}_{i} + 0.4\,\mathit{MBR\_area}^{norm}_{i} + 0.1\,\mathit{shape}_{i}, & \text{when } \mathit{cell\_size} > 30\ \text{m}
\end{cases}
\tag{4}
\]

where \(\mathit{number\_of\_nodes}^{norm}_{i}\) denotes the normalized value of number of nodes and \(\mathit{MBR\_area}^{norm}_{i}\)
denotes the normalized value of MBR area. When cell size ≤ 30 m, the weights for \(\mathit{number\_of\_nodes}^{norm}_{i}\)
and \(\mathit{MBR\_area}^{norm}_{i}\) are 0.4 and 0.5, respectively; when cell size > 30 m, the weights
are 0.5 and 0.4, respectively. The weight for \(\mathit{shape}_{i}\) is always 0.1. Thus, the polygon complex-
ity ranges from zero to one; a larger value means a more complex calculation.
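The measure model of the preceding equations can be sketched as follows. This is illustrative Python with our own function names; the weights and the 30 m switch follow the text above:

```python
import math

def normalize_log(values):
    """Min-max normalization of log10-transformed values (Eqs. 2 and 3)."""
    logs = [math.log10(v) for v in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:  # degenerate case: all values identical
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in logs]

def complexity(nodes, mbr_areas, shapes, cell_size):
    """Weighted complexity per polygon (Eq. 4); weights switch at 30 m."""
    n_norm = normalize_log(nodes)
    a_norm = normalize_log(mbr_areas)
    w_nodes, w_area = (0.4, 0.5) if cell_size <= 30 else (0.5, 0.4)
    return [w_nodes * n + w_area * a + 0.1 * s
            for n, a, s in zip(n_norm, a_norm, shapes)]
```

By construction each complexity value lies in [0, 1], matching the range stated above.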
Because each GPU has limited internal memory, massive polygons cannot be processed
simultaneously and must be decomposed into several subsets. The procedure for data decompo-
sition includes the following three steps (Figure 3):
1. Forming a distribution queue. For the i-th polygon, its calculation complexity is first
computed using Equation (4). Then, its memory usage is calculated according to the number of
nodes, which can be expressed as:

\[
MU_{i} = \mathit{number\_of\_nodes}_{i} \times \big(\mathrm{sizeof}(\mathit{PointX}) + \mathrm{sizeof}(\mathit{PointY})\big) + \mathrm{sizeof}(\mathit{AttributeValue}) \tag{5}
\]

where PointX and PointY are the arrays of X and Y coordinates, respectively, and AttributeValue
is the attribute value. Then, all the polygons are sorted in ascending order according to
the complexity and a polygon distribution queue is formed.
2. Decomposing polygons into subsets and chunks. Two polygons are taken from the
queue to a subset each time: one from the head and the other from the end. The polygons are
assigned in this order until completion. Each GPU is assigned one of the subsets. However, if
the total memory of a subset exceeds the memory limitation of the GPU, the subset must be
further subdivided into smaller chunks within each GPU. When the GPU's memory limit is
\(MU_{limit}\), the number of polygons in a subdivided chunk should conform to the following:

\[
\sum_{i=1}^{N_{max}} MU_{i} \;\le\; MU_{limit} - MU_{gpuresult} \tag{6}
\]
where \(MU_{gpuresult}\) is the memory used to store the rasterization results produced by the GPU
and \(N_{max}\) is the maximum number of polygons. Within each GPU, a polygon is removed from
the head of the corresponding subset and its memory usage is added to the running total; if the
total does not exceed \(MU_{limit} - MU_{gpuresult}\), another polygon is removed, until Equation (6)
is no longer satisfied. The selected \(N_{max}\) polygons are assigned
in a chunk; a GPU must address its own chunks sequentially.
3. Distributing polygons to blocks and threads. In a chunk of polygons, the grid holds all the
polygons sorted in ascending order. Each time, the first and last polygons in the queue are
assigned to a block until all the polygons are assigned, completing the block-level decomposition.
Inside each block, polygons are assigned to threads in a circular distribution: one polygon is
assigned to each thread; whenever a thread completes, another polygon is assigned to that thread
until all the polygons in this block are processed, completing the thread-level decomposition.
Using this approach, the original dataset can be decomposed evenly for multiple GPUs.
Each GPU can hold several chunks of polygons and address these chunks sequentially. More-
over, datasets with different sizes can be rationally decomposed and addressed. A small dataset
is decomposed into different subsets for multiple GPUs; a large dataset is further decomposed
into chunks within each GPU.
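The head-and-tail distribution into subsets and the memory-bounded chunking described above can be sketched as follows. This is illustrative Python under our own assumptions: each polygon is represented as a `(complexity, memory_usage)` tuple, pre-sorted ascending by complexity as the text specifies, and all names are ours:

```python
def decompose(polygons, n_gpus, mem_limit, mem_result):
    """polygons: list of (complexity, mem_usage) tuples, sorted ascending."""
    # Step 2a: take one polygon from the head and one from the tail of the
    # queue each time, so every GPU subset mixes cheap and expensive polygons.
    queue = list(polygons)
    subsets = [[] for _ in range(n_gpus)]
    g = 0
    while queue:
        subsets[g].append(queue.pop(0))      # head (least complex)
        if queue:
            subsets[g].append(queue.pop())   # tail (most complex)
        g = (g + 1) % n_gpus
    # Step 2b: split each subset into chunks whose total memory fits the
    # budget of Eq. (6): sum of memory usages <= mem_limit - mem_result.
    budget = mem_limit - mem_result
    all_chunks = []
    for subset in subsets:
        chunks, cur, used = [], [], 0
        for comp, mem in subset:
            if cur and used + mem > budget:
                chunks.append(cur)
                cur, used = [], 0
            cur.append((comp, mem))
            used += mem
        if cur:
            chunks.append(cur)
        all_chunks.append(chunks)
    return subsets, all_chunks
```

Each GPU then processes its own list of chunks sequentially, as in the text.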
Significant time in the streaming dedicated to reading polygons and writing results can be saved.
Hence, data reads/writes by the CPU and GPU parallel processing can be executed concurrently.
Figure 5 (a) Strategy for GPU memory storage; and (b) strategy for segmenting large-sized polygons
limited size of the L2 cache. The formula to calculate the number of nodes, Nodes, for new pol-
ygons after the segmentation process is:
\[
\mathit{Nodes} =
\begin{cases}
\dfrac{M_{cache}}{\mathrm{sizeof} \times 2 \times N_{t}}, & \text{when } N_{t} < N_{tpm} \times N_{p} \\[2ex]
\dfrac{M_{cache}}{\mathrm{sizeof} \times 2 \times N_{tpm} \times N_{p}}, & \text{when } N_{t} \ge N_{tpm} \times N_{p}
\end{cases}
\tag{7}
\]

where \(M_{cache}\) is the memory size of the L2 cache, \(N_{t}\) is the total number of threads, \(N_{tpm}\) is the
maximum number of threads per multiprocessor, and \(N_{p}\) is the number of multiprocessors. All
smaller segmented polygons have Nodes nodes, except for the last polygon, which could have
fewer nodes. When addressing a convex polygon, it can be segmented immediately; when proc-
essing a concave polygon, it should be converted into several convex polygons and then seg-
mented using the above strategy (Rogers 1984). To ensure the correct rasterization results,
every two spatially adjacent smaller polygons must have common boundaries (Figure 5b). In
this manner, the coordinates of the newly segmented polygons can be stored entirely in the L2
cache, resulting in a faster coordinates-access rate.
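Equation (7) amounts to dividing the L2 cache budget by the number of concurrently active threads and by the two coordinate arrays (X and Y) stored per node. A minimal sketch, in illustrative Python with our own names (`elem_size` stands for the unspecified `sizeof` of one coordinate):

```python
def nodes_per_segment(m_cache, elem_size, n_threads, n_tpm, n_mp):
    """Max node count per segmented polygon so that the X and Y coordinate
    arrays of all concurrently active threads fit in the L2 cache (Eq. 7)."""
    # No more than n_tpm * n_mp threads can be resident at once, so the
    # per-thread cache share is bounded by the smaller of the two counts.
    active = n_threads if n_threads < n_tpm * n_mp else n_tpm * n_mp
    return m_cache // (elem_size * 2 * active)  # factor 2: X and Y arrays
```

For the Tesla K20c used later in the paper (1.25 MB L2, 13 multiprocessors, 2,048 threads per multiprocessor), 8-byte coordinates and 1,000 threads would give 81 nodes per segment under this reading.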
Table 1 Implementation pseudo-code of the GPU-based parallel scheme. poSrcVector and poD-
stImg are the input vector data files and output result, respectively; p denotes the specified num-
ber of CPU parallel nodes; CellSize is the specified resultant cell size, with pszAttribute as the
polygon attribute values. GPUNum is the number of GPUs. For each GPU, BlockNum is the speci-
fied number of blocks and ThreadNum is the number of threads
scheme is implemented using a standard C++ programming environment. Under the GPU-
based environment, CUDA is used as the parallel programming framework. In the CPU-based
environment, the message-passing interface (MPI) and open multi-processing (OpenMP) pro-
gramming models are used. MPI is the specification of a standard library for message passing
and is used to access different parallel CPU nodes and distribute tasks. OpenMP is an industry
standard application programming interface (API) for shared memory programming and is
employed for parallelizing calculations within each node. The open-source geospatial data
abstraction library (GDAL) is employed to read the vector data and write the raster data. The
pseudo-code of the general parallel implementation is described in Table 1; the pseudo-code of
the detailed CUDA kernel function is described in Table 2. The main procedures are presented
below.
Step 1: Initialization step. The master parallel node (with rank 0) analyzes all the input
parameters provided by the user. Such parameters include the number of GPUs, blocks per
GPU, threads per block (TPB), and resultant cell size. The master node creates a resultant raster
dataset according to the specified cell size.
Step 2: The master parallel node calculates the values of the number of nodes, the
MBR area, the shape, and the memory usage for each polygon. The master node calculates
the polygon complexity for each polygon, sorts all polygons according to their complexity,
and then performs data decomposition according to the number of GPUs. It forms a distri-
bution queue and then forms different subsets. During this procedure, other parallel nodes
Table 2 Pseudo-code of the BAF kernel function. PointX and PointY are the coordinates of the
polygons; AttributeValue represents the array of attribute values of the polygons. hDstDS is used
to store the GPU rasterization result. BlockPolygon denotes the serial number of the polygons in
each block. BlockNum is the specified number of blocks and ThreadNum is the number of
threads
wait. Upon completion, the master parallel node sends the decomposition results to other
parallel nodes.
Step 3: All parallel nodes receive the decomposition results and read their
corresponding polygons. Each parallel node forms different chunks of polygons according to
the limited memory of the GPU. The polygon chunks are sent to the GPU to process in
sequence.
Step 4: For each chunk, the large-sized polygons are segmented first. Each GPU distributes
polygons to different blocks and threads. Threads invoke the BAF algorithm to rasterize the
polygons in parallel. During multi-threading parallel execution, concurrent reading and writing
are a common problem. CUDA provides a solution for concurrent reading. In particular,
block threads in a GPU are grouped into warps (32 threads/warp). Half of a warp (i.e. 16
threads) in a block can concurrently access the global memory. Different threads can access the
global memory alternately. This approach can alleviate the problem of concurrent reading.
Furthermore, the calculation of a polygon is restricted to its MBR, reducing the number of
raster pixels affected by concurrent access, and the atomic operation in CUDA is applied to concurrent writes. Using
this approach, the correctness of the rasterized result can be guaranteed. Upon completion,
each GPU returns the rasterization result to the CPU.
Step 5: Parallel nodes on the CPU receive the GPU result and write it into the resultant raster
dataset. When all chunks are processed, the master parallel node updates the resultant raster
dataset and exits the parallel execution.
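The atomic handling of concurrent writes in Step 4 can be imitated on the CPU with a lock. This is an illustrative Python sketch with our own names; in the actual kernel the lock's role would be played by CUDA's atomic operation, and the pixel lists stand in for the rasterized MBR of each polygon:

```python
import threading

def rasterize_parallel(polygon_pixels, values, raster_size):
    """Each worker adds its polygon's attribute value to the shared raster;
    the lock stands in for an atomic add on pixels touched concurrently."""
    raster = [0] * raster_size
    lock = threading.Lock()

    def worker(pixels, value):
        for p in pixels:
            with lock:              # analogous to atomicAdd(&raster[p], value)
                raster[p] += value

    threads = [threading.Thread(target=worker, args=(px, v))
               for px, v in zip(polygon_pixels, values)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return raster
```

Without the atomic update, two polygons writing the same pixel could lose one contribution; with it, the result is deterministic regardless of thread interleaving.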
When processing a vector dataset with n polygons, the time complexity of the sequential
polygon rasterization algorithm is O(n³). For the parallel algorithm, the time mainly comprises
the data decomposition, I/O, polygon filling computation, and data transfer times. Data
decomposition includes the calculations of polygon complexity and memory usage, decom-
position into subsets, decomposition into chunks, and decomposition for blocks and threads,
with corresponding time complexities O(n²), O(n), O(n), and O(n), respectively. Therefore,
the time complexity of data decomposition is O(n²). The time complexities for I/O, polygon
filling computation, and data transfers are O(n), O(n³), and O(n), respectively. Consequently,
the overall time complexity of the GPU-based parallel algorithm is O(n³).
4 Experiments
4.1 Experiment Design
The experimental GPU cluster contained two HP Z620 workstations. Each workstation
included a NVIDIA Corporation Tesla K20c GPU. The GPU had 2,496 CUDA cores, 5 GB of
global memory, 48 KB of shared memory, and 1.25 MB of L2 cache memory. According to the
technical specifications, it contained 13 multiprocessors and up to 2,048 threads could be
assigned to each multiprocessor. Thus, a maximum of 2,048 × 13 = 26,624 threads could run
concurrently in parallel on the physical GPU. The two workstations were interconnected via a
dual-port gigabit Ethernet network. The CPU-based parallel implementations were performed
on an IBM parallel cluster that contained eight computing nodes, each with the following hard-
ware configuration: two Intel(R) Xeon(R) CPUs (E5-2620 clocked at 2.00 GHz, six-core
model), 16 GB of memory, and a 2 TB hard drive. The software implementation included
CUDA 5.5, OpenMPI 1.4.1, OpenMP 2.0, and GDAL 1.9.2.
For the experiments, four datasets stored in a PostgreSQL/PostGIS database were
employed. A basic description of the datasets used is listed in Table 3. The geographical projec-
tions of the datasets were all Albers equal-area conic projections. Dataset 1 had a data volume
of 5.03 GB and approximately 1.3 × 10⁷ polygons. The total area was approximately
104,199 m². In this dataset, the mean number of nodes was 49.65, the mean MBR area was
56,516.02 m², the mean shape was 0.47, and the mean complexity was 0.41. This dataset was
used to verify the accuracy of the rasterized result. Dataset 2 was formed by replicating dataset
1 ten times and spatially offsetting each copy so that the subsets remained disjoint. Using this
approach, dataset 2 had a data volume of 50.32 GB and was primarily employed to evaluate
the performance of the proposed parallel implementation. Datasets 3 and 4 were used to con-
duct comparisons on the parallel performance of the proposed implementation and a conven-
tional implementation. These datasets were actual Chinese land use data derived from the
national land survey program of China introduced in 2007. These datasets support efforts
to recognize, manage, and utilize Chinese land resources. For each dataset, there
were more than 20 land types, e.g., forest, shrub, woods, dense grass, moderate grass, sparse
grass, streams and rivers, lakes, reservoirs and ponds, beach and shore, urban built-up, and
rural settlements (Liu et al. 2009).
In our experiments, the efficiency of the parallel algorithm was evaluated according to the
execution time, speedup ratio, and load balancing index. The execution time is the time
between invoking the algorithm and the completion of the last computing unit. The speedup
ratio is that of the time used by the sequential CPU algorithm to the time used by the GPU algo-
rithm (Preis et al. 2009). The load balancing index is the ratio of time spent on the slowest com-
puting unit to the fastest. To verify the proposed parallel algorithm, four sets of experiments
were conducted: (1) evaluating the accuracy loss of the rasterized result quantitatively; (2) cal-
culating and analyzing the execution time, speedup ratio, and load balancing; (3) comparing
the performance of the proposed and conventional implementations when addressing different
types of datasets; and (4) testing the extension of the proposed method to CPU-based parallel
implementations.
Figure 6 Parallel polygon rasterization result: (a) Overview of original dataset; and (b) its raster-
ization result
where \(i = 1, 2, \ldots, n\) is the index of the land use type, \(A_{i0}\) is the area of each land type in vector
format (the reference for this land use type), and \(A_{i}\) is the area of each land type in raster format
after rasterization (Liao and Bai 2010). Positive (negative) accuracy loss indicates that the ras-
ter area after rasterization is larger (smaller) than the vector area before rasterization for a cer-
tain land type. The comparisons are listed in Table 4. The results indicate that the rasterization
result incurs little accuracy loss. The total accuracy loss was 0.549174% and the largest
loss was 0.914937% for mountain dry land. This demonstrates that the proposed rasterization
result was highly accurate and is appropriate for further spatial analysis.
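As described here, the accuracy loss is the signed percentage difference between the raster area and the reference vector area of a land type. A minimal sketch (illustrative Python; the function name is ours):

```python
def accuracy_loss(vector_area, raster_area):
    """Signed percentage difference between the rasterized area and the
    reference vector area of one land use type; positive means the raster
    area is larger than the vector area."""
    return (raster_area - vector_area) / vector_area * 100.0
```

Summing the per-type losses over all land types yields the total accuracy loss reported above.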
Table 4 columns: Land use type; Vector area (m²); Number of raster pixels; Raster area (m²); Accuracy loss (%)
fully exploited, attaining the highest level of efficiency. Increasing the number of threads
further is likely to incur competition for resources and cause access conflicts that could lead
to a rapid reduction in parallel efficiency. This means that additional increases in computa-
tional resources will not enhance efficiency once performance has reached its peak.
3. When the total number of threads remained unchanged, the parallel efficiency varied
with different TPB and block configurations. In the following, (TPB number, block
number) denotes the configuration of threads per block and blocks allocated. When
the number of total threads was 13,312, the speedup ratios for configurations (128, 104),
(256, 52), (512, 26), and (1,024, 13) were 16.86, 18.57, 14.57, and 14.06, respectively.
When the number of total threads was 19,968, the speedup ratios for configurations (128,
156), (256, 78), and (512, 39) were 29.24, 31.14, and 20.34, respectively. The speedup
attained its peak when the TPB number was 256. This ratio is marginally higher than
when the TPB number was 128. This is mainly because of the increasing warps in each
Figure 7 Experimental results for: (a) execution time; and (b) speedup ratio
Table 5 columns: TPB × block; Decomposition; I/O; Computing; Data transfers; Execution time; Speedup
block, which lead to faster thread switches and improved performance. The speedup
value was significantly higher than when the TPB number was 512 or 1,024. This is pri-
marily because of the competition for the limited memory and cache caused by the
excessive threads in each block, leading to a considerable decrease in efficiency.
Parallel processing time consists mainly of data decomposition, I/O, polygon filling computa-
tion, and data transfer times (Table 5). Despite an increase in the block number, the time required
for data decomposition, I/O, and data transfers does not vary significantly for the same data,
because the time required for these tasks is closely related to the data source. The I/O time was
less than with the sequential algorithm, demonstrating that the proposed CPU/GPU scheduling
strategy can effectively conceal the data read/write time. Computing time decreased rapidly from
24.57 to 0.22 h, further indicating the effectiveness of the proposed parallel strategies.
effectiveness of the adopted parallel strategies. Experimental statistics were collected for the
execution time of each block for each GPU. The ratio of the longest to the shortest processing
time was calculated for each GPU. The average of these ratios was then used to assess the load
balancing. The formula is:
\[
\mathit{Load\ balancing} = \frac{1}{N_{gpu}} \sum_{i=1}^{N_{gpu}} \frac{\max_{j \le N_{b}}\{T_{j}\}}{\min_{j \le N_{b}}\{T_{j}\}} \tag{9}
\]

where \(N_{gpu}\) is the number of GPUs, \(N_{b}\) is the block number of each GPU, \(T_{j}\) is the processing
time for block j on GPU i, and \(\max_{j \le N_{b}}\{T_{j}\}\) and \(\min_{j \le N_{b}}\{T_{j}\}\) are the longest and shortest proc-
essing times of the blocks, respectively. As Load balancing approaches one, the load is increas-
ingly balanced. In this experiment, the TPB number is set at 256. Load balancing was
measured with different block numbers (Figure 8). The results demonstrate that the parallel
algorithm delivers a desirable load balance, with its maximum Load balancing value no higher
than 1.30. When the block number increased to 104, the Load balancing value decreased grad-
ually and the load became more balanced, reaching its optimal value of 1.16. As the block number
increases further, delays and resource competition occur among the threads, leading to a gradual
unbalancing of the load and a slow rise in the Load balancing value.
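The load balancing index of Equation (9) can be computed directly from per-block timings. An illustrative Python sketch (names ours; `block_times` holds one list of per-block processing times for each GPU):

```python
def load_balancing(block_times):
    """Mean over GPUs of (slowest block time / fastest block time), Eq. (9).
    A value of 1.0 means a perfectly balanced load."""
    ratios = [max(times) / min(times) for times in block_times]
    return sum(ratios) / len(ratios)
```

For example, one GPU whose slowest block takes twice as long as its fastest, paired with a perfectly balanced GPU, yields an index of 1.5.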
Table 6 Comparison of the conventional and the presented implementations:

                     Dataset 3 (simple)            Dataset 2 (large)            Dataset 4 (complicated)
Comparative method   Conventional  Presented       Conventional  Presented      Conventional  Presented
Sequential time      142.17 s                      25.43 h                      584.72 s
Parallel time        5.97 s        5.56 s          Cannot work   0.69 h         Cannot work   17.95 s
Speedup ratio        23.81         25.68           -             36.91          -             32.58
Large-scale Polygon Rasterization on CUDA-enabled GPUs
to be processed cannot exceed the GPU memory and the number of polygon nodes in the data-
set cannot be greater than 1,024. Therefore, the datasets employed in this experiment were
classified into three types: simple, large, and complicated. The simple dataset represents the
case where the data volume is less than the GPU's memory limit and the included polygons
have less than 1,024 nodes. The large dataset represents the case where the volume is larger
than the GPU memory, and the complicated dataset represents the case where the polygons
vary considerably in complexity. In this experiment, dataset 3, excluding polygons with more
than 1,024 nodes, was used as the simple dataset. There were only 105 polygons with more
than 1,024 nodes in dataset 3. Datasets 2 and 4 were used as the large and complicated data-
sets, respectively. The execution time and speedup ratios of the conventional and the proposed
implementations for different datasets were calculated (Table 6).
For dataset 3, both the proposed and conventional implementations performed well. The
CPU sequential execution time was 142.17 s. The conventional implementation obtained a
minimum time of 5.97 s and the best speedup ratio was 23.81. The proposed implementation
demonstrated superior performance with a time of 5.56 s and speedup of 25.68. In the conven-
tional strategy, the dataset was placed completely into the GPU memory and all polygon nodes
were stored in the shared memory. This strategy can accelerate the access rate of the polygon
nodes and improve parallel efficiency. For the proposed implementation, the decomposition strategy achieved better-balanced workloads among the GPU threads, leading to marginally improved efficiency. The result demonstrates that the proposed strategy can achieve superior performance
compared with the conventional strategy. In the proposed implementation, although the data decomposition procedure incurred a time cost, its contribution to the acceleration of the GPU computation was significant.
For dataset 2, which was large relative to the GPU memory, the conventional implementation could not work: the conventional strategy does not account for data volumes greater than the GPU memory and provides no suitable approach to decompose a large dataset into subsets. The proposed implementation performed well: the execution time
was reduced from 25.43 to 0.69 h and the best speedup achieved was 36.91.
Dataset 4 was a complicated dataset in which some of the polygons exhibited considerable complexity. In this dataset, the mean values of the number of nodes, MBR area, shape, and complexity were 643.15, 11,928,497.24 m², 0.47, and 0.56, respectively, which are larger than those of the
other datasets. In particular, there were 740 polygons with more than 1,024 nodes; the maximum number of nodes was 1,175,064. The conventional implementation stores all the coordinates in the GPU's shared memory and therefore failed to address dataset 4. In general,
although the shared memory is faster to access compared with the global memory, it is small.
Therefore, the conventional method could not manage those polygons with excessive numbers
of nodes. In the proposed implementation, the polygon coordinates are stored in the global
memory and the attribute values are stored in the shared memory. This approach ensures sufficient memory to store the polygons with excessive numbers of nodes. Polygons are decomposed
according to their calculation complexities to ensure balanced workloads. Moreover, polygons
with excessive numbers of nodes can be segmented into smaller polygons. Using these
approaches, complicated datasets can be effectively addressed. Against a sequential time of 584.72 s, the proposed implementation obtained an optimized execution time of 17.95 s and a best speedup of 32.58.
In summary, compared with the conventional strategy, the proposed strategies can achieve
slightly better parallel performance for simple datasets and perform much better for large and
complicated datasets.
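The decomposition idea summarized above, balancing by estimated calculation complexity rather than by polygon count, can be illustrated with a greedy longest-processing-time heuristic. This is a minimal Python sketch, not the paper's exact algorithm; the complexity weights are assumed to be precomputed (e.g. from node count, MBR area, and shape):

```python
import heapq

def decompose(polygons, n_units):
    """Greedily assign polygons to computing units so that the summed
    calculation complexity (not the polygon count) is balanced.
    `polygons` is a list of (polygon_id, complexity) pairs."""
    # Min-heap of (total_complexity, unit_index): always fill the
    # currently lightest unit; heaviest polygons are placed first.
    heap = [(0.0, u) for u in range(n_units)]
    heapq.heapify(heap)
    assignment = {u: [] for u in range(n_units)}
    for pid, complexity in sorted(polygons, key=lambda p: -p[1]):
        load, unit = heapq.heappop(heap)
        assignment[unit].append(pid)
        heapq.heappush(heap, (load + complexity, unit))
    return assignment
```

For nine hypothetical polygons with complexities [10, 9, 8, 2, 2, 2, 1, 1, 1] and three units, this yields a load of 12 per unit, whereas a sequential split by polygon count would give loads of 27, 6, and 3.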
two versions. One implemented the proposed strategy; the other utilized the conventional strategy, in which polygons were distributed sequentially and each process was allocated the same number of polygons. The execution time and speedup ratios were calculated (see Figure 9). For the parallel MPI implementation, the computing units represent processes; for the parallel MPI/OpenMP implementation, they represent the combination of processes and threads.
In Figure 9, the sequential time is 25.43 h. For the MPI implementation, the optimized times were 1.51 and 1.25 h for the conventional and proposed strategies, respectively. The speedup
ratios were 16.89 and 20.29, respectively. For the MPI/OpenMP implementation, the minimum
times were 1.47 and 1.20 h for the conventional and proposed strategies, respectively. The
speedup ratios were 17.24 and 21.12, respectively. The results suggest that for each parallel
implementation, the version that used the proposed strategy demonstrated superior performance.
The conventional strategy ensures load balancing only in terms of polygon numbers, whereas the proposed strategy achieves balanced workloads in terms of calculation complexity. Experimental results demonstrate that the proposed strategy can obtain better load balancing, requires less execution time, and achieves a considerably higher speedup ratio. The MPI/OpenMP implementation required less time than the MPI implementation because hybrid parallelization can fully exploit lightweight threads, accelerating the computationally intensive filling calculations.
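The read/compute overlap at the heart of the CPU/GPU scheduling strategy can be mimicked with a simple two-stage pipeline. The sketch below is only an analogy in Python threads (the paper overlaps CPU reads/writes with CUDA kernel execution); `read_chunk` and `rasterize_chunk` are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(chunk_ids, read_chunk, rasterize_chunk):
    """While one chunk is being rasterized, the next chunk is read
    in a background thread, so I/O time is hidden behind compute time."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(read_chunk, chunk_ids[0])
        for i in range(len(chunk_ids)):
            data = pending.result()  # wait for the prefetched chunk
            if i + 1 < len(chunk_ids):
                # Start the next read before computing on the current chunk.
                pending = io_pool.submit(read_chunk, chunk_ids[i + 1])
            results.append(rasterize_chunk(data))
    return results
```

With this structure, total time approaches max(read, compute) per chunk instead of their sum, which is the saving the scheduling strategy exploits.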
5 Discussion
This section discusses the broader extension and limitations of the proposed approaches.
It is practical to apply the proposed strategies to other geospatial applications that share
similar algorithmic characteristics. The generic characteristics of polygon rasterization can be
described as follows: high-level independence exists between polygons, and little intercommunication is necessary during parallel computation. Many polygon operations in geo-computation share similar characteristics, e.g. polygon area calculations, coordinate transformations, and data format conversion. When developing new GPU-based CUDA codes for
these applications, the proposed parallel strategies, including those of data decomposition, CPU/GPU scheduling, and memory and cache utilization, can be broadened to achieve different parallel implementations with little modification.
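This independence is what makes such operations straightforward to parallelize: each polygon can be processed by a separate thread with no communication. A minimal sketch, using the shoelace area formula as a stand-in for any per-polygon operation:

```python
def polygon_area(ring):
    """Shoelace formula over a closed ring of (x, y) vertices.
    Each polygon is processed independently of all others, so
    polygons can be distributed across GPU threads (or CPU
    processes) without any inter-communication."""
    area = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```

Because the function reads only its own ring, mapping it over a polygon list parallelizes trivially, which is exactly the property the decomposition and scheduling strategies rely on.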
Nevertheless, there are limitations when applying the proposed approaches to other appli-
cations such as overlay and intersection calculations, wherein strong relationships exist
between different polygons. For these applications, the calculation complexity is related to the
spatial proximity of the polygons, rather than the complexity of the polygons themselves (i.e.
type, area, shape, and structural features). When parallelizing these applications, spatially adjacent polygons must first be determined; only then can the intersection calculation of these polygons be conducted. Therefore, the strategies proposed in this study are not appropriate for this kind of polygon-based application, and new strategies need to be studied.
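By way of illustration (this step is not part of the proposed method), determining spatially adjacent polygons typically begins with a cheap minimum bounding rectangle (MBR) overlap test to prune candidate pairs before any exact intersection work is distributed:

```python
def mbrs_overlap(a, b):
    """a and b are MBRs given as (xmin, ymin, xmax, ymax).
    Two polygons can intersect only if their MBRs overlap,
    so this cheap test prunes the candidate pairs."""
    return not (a[2] < b[0] or b[2] < a[0] or
                a[3] < b[1] or b[3] < a[1])

def candidate_pairs(mbrs):
    """Naive all-pairs filter; real systems would use a spatial
    index (grid or R-tree) instead of the O(n^2) scan."""
    return [(i, j) for i in range(len(mbrs))
                   for j in range(i + 1, len(mbrs))
                   if mbrs_overlap(mbrs[i], mbrs[j])]
```

Because the workload now depends on how polygons cluster in space rather than on per-polygon complexity, a decomposition balanced on polygon complexity alone, as in this study, no longer guarantees balanced workloads.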
6 Conclusions
This research proposed three novel strategies to support the parallel scheme such that multiple GPUs could be fully utilized to accelerate massive-scale polygon rasterization processes. In particular: (1) a data decomposition strategy was designed in accordance with the calculation complexity of the polygons and the GPU internal memory to ensure load balancing; (2) a parallel CPU/GPU scheduling strategy was suggested to conceal data read/write times and improve performance; and (3) a utilization strategy for the GPU internal memory and cache was proposed to hasten data access. The parallel BAF algorithm was implemented using the CUDA programming model and executed on a GPU cluster of two workstations with two GPUs. Results confirm that:
1. The proposed GPU-based parallel polygon rasterization implementation can significantly accelerate the enormously time-consuming conversion process. For a 50.32 GB dataset with approximately 1.3 × 10⁸ polygons, the processing time was reduced from 25.43 to 0.69 h, achieving a desirable speedup (36.91) and effective load balancing. This indicates improved performance compared with the sequential implementation.
2. Compared with the conventional GPU-based parallel polygon rasterization algorithm, the proposed parallel algorithm performs better across different dataset types, including simple, large-scale, and complicated datasets.
3. The proposed data decomposition strategy can be extended efficiently to CPU-based parallel environments. The CPU-based parallel implementations that use the proposed decomposition strategy can also achieve superior performance compared with the conventional strategy.