
University of Cape Town
Department of Computer Science
Department of Mathematics and Applied Mathematics

General Purpose Computing on Graphics Processing Units
MAM4001W GPGPU Final Project: Synthesized Terrains

Lecturer: Associate Professor James E. Gain
Patrick Adams (ADMPAT002)
June 10, 2011

Contents

1 Synthesized Terrains
  1.1 Problem Statement
  1.2 Parallel Solution Strategy
2 Results
  2.1 Non-variation of Parameters and Results
    2.1.1 Naïve (NV) Implementation
    2.1.2 Query Constant Memory (QC) Implementation
    2.1.3 Query Constant Memory/Terrain Texture (QCTTex) Memory Implementation
  2.2 Variation of Parameters and Results
    2.2.1 Varying Total Terrains
    2.2.2 Varying Terrain Size
    2.2.3 Varying Patch Size
3 Conclusion
A Validation
  A.1 Results on a Different GPU
  A.2 Further Variation of Parameters (TSESSEBE CLUSTER)

1 Synthesized Terrains

1.1 Problem Statement

Synthesized terrains are integral to the fields of computer games and visual effects. One way of performing this synthesis is by indirectly combining same-size patches of real-life landscapes provided by different Digital Elevation Models (DEMs). This data is most often provided in the form of a grid consisting solely of height values. To make the synthesized landscape appear as realistic as possible, the so-called L2-norm is computed between a certain query patch Q_ij and some reference patch R_ij in a terrain (most typically a set of terrains) T. For a smooth appearance, the L2-norm should be minimized over all such R_ij in the set T for a given Q_ij. For reference, the L2-norm E is calculated as follows:
E = sum_{i=0}^{m-1} sum_{j=0}^{n-1} (Q_ij - R_ij)^2,

where m and n are the dimensions of the patches and the remaining symbols are as defined above. Our purpose, then, is to find an optimal parallel solution to this problem and to briefly study its performance under changes of various simulation parameters.
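As a concrete illustration, the E-value for a single candidate patch position might be computed serially as follows. This is only a sketch: the function and parameter names, and the flat row-major layout, are assumptions for illustration, not the report's actual code.

```c
/* Sketch: E between the query patch and the reference patch whose top-left
 * corner is at (rx, ry) in a flat, row-major terrain of width texX.
 * All names (terrain, query, qx, qy) are illustrative assumptions. */
float patchError(const float *terrain, const float *query,
                 int texX, int qx, int qy, int rx, int ry)
{
    float e = 0.0f;
    for (int i = 0; i < qy; ++i) {          /* rows of the patch */
        for (int j = 0; j < qx; ++j) {      /* columns of the patch */
            float d = query[i * qx + j]
                    - terrain[(ry + i) * texX + (rx + j)];
            e += d * d;                     /* accumulate squared difference */
        }
    }
    return e;
}
```

The serial CPU benchmark described in Section 2 would simply evaluate this for every candidate offset and keep the smallest result.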

1.2 Parallel Solution Strategy

Our proposed parallel solution, based on one outlined in [1], involves treating any given terrain as a single computational block with many threads operating on its various subsections. The E-values are calculated simultaneously with respect to the query patch and the results stored in an array in device memory. When this process is complete, the resultant array is copied back from device memory and searched for the smallest E-value with respect to Q and R. This value of E, along with the co-ordinates of the patch R and the index of the terrain T, is the answer sought. Graphically, if we were to join these two terrain patches together, the transition would appear the smoothest possible.
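A kernel implementing this strategy might look like the following sketch, in which each thread computes the E-value for one candidate patch position. The names, the flat row-major layout, and the result-array width are assumptions; the report's actual kernel may differ.

```cuda
#include <cuda_runtime.h>

// Sketch: one thread per candidate patch position (rx, ry); each thread
// writes its E-value into the flat result array. Names are assumptions.
__global__ void computeE(const float *terrain, const float *query,
                         float *result, int texX, int texY, int qx, int qy)
{
    int rx = blockIdx.x * blockDim.x + threadIdx.x;  // candidate column
    int ry = blockIdx.y * blockDim.y + threadIdx.y;  // candidate row
    if (rx >= texX - qx || ry >= texY - qy) return;  // stay in bounds

    float e = 0.0f;
    for (int i = 0; i < qy; ++i)
        for (int j = 0; j < qx; ++j) {
            float d = query[i * qx + j]
                    - terrain[(ry + i) * texX + (rx + j)];
            e += d * d;
        }
    result[ry * (texX - qx) + rx] = e;  // one E-value per candidate offset
}
```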

2 Results

In this section we present the results of our simulations, beginning with a naïve implementation of the problem and then gradually optimizing this implementation, examining the resulting speedup at every iteration. For benchmarking purposes, we simulate the problem only once using a similar naïve serial (CPU) implementation and compare all results obtained from our parallel (GPU) implementation against it in order to determine the effectiveness of our parallel solution.

For the reader's convenience, we define the following code parameters: NUM_TERRAIN, TEX_X, TEX_Y, QUERY_X, QUERY_Y, DIMBLOCK_X, DIMBLOCK_Y, DIMTHREAD_X and DIMTHREAD_Y. The first parameter represents the number of terrains used in a simulation; the second and third, the dimensions of the terrain(s) under consideration; the fourth and fifth, the dimensions of the query patch; the sixth and seventh, the dimensions of the computational block; and the last two, the dimensions of the computational grid. For all simulations in this section, we define the following standard set of parameter values: NUM_TERRAIN = 20, TEX_X = TEX_Y = 1024, QUERY_X = QUERY_Y = 64, DIMBLOCK_X = DIMBLOCK_Y = 16 and DIMTHREAD_X = DIMTHREAD_Y = 960; the reader may assume that these values remain unchanged throughout all simulations in this section. Note that unless stated otherwise, all simulations for the following subsection were performed on the GPU cluster at the Centre for High Performance Computing (CHPC). Note also that during the time the simulations were performed, the cluster appeared to be slightly overloaded, leading us to believe that long access times and lag hampered the performance of our solution. For further validation of our solution, please see Appendix A for a very brief discussion.

2.1 Non-variation of Parameters and Results

2.1.1 Naïve (NV) Implementation

This implementation is based on the course notes by Gain [1]. In this implementation the query patch, terrain and result are all stored on the host (CPU) and device (GPU) as one-dimensional arrays of the appropriate sizes: QUERY_X*QUERY_Y*sizeof(float), TEX_X*TEX_Y*sizeof(float) and (TEX_X - QUERY_X)*(TEX_Y - QUERY_Y)*sizeof(float) respectively. This one-dimensional storage is an example of flat indexing and is done only to avoid the potential complexity associated with using a multi-dimensional data structure. The CPU arrays are first populated with the necessary data and then transferred to the GPU for computation, the kernel accepting pointers to the three structures as its input arguments. Parallel calculation of E is done on the GPU side, with the results stored in the result array. This array is then returned to the CPU side, where it is searched for the minimum value of E as described above. Note that this implementation features no optimization whatsoever and shall later also be used as a second benchmark when the other implementations are considered. Despite this, we can already see a notable speedup compared to the same naïve CPU implementation, and we conclude the first iteration of our parallel algorithm a success.

Table 1: Computation time for the NV implementation and speedup (GPU CLUSTER).

  CPU Time (ms)     GPU Time (ms)     Speedup
  714531.187500     290762.375000     2.457440
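The host-side search of the returned result array can be sketched as a simple linear scan. The names here, and the recovery of the patch co-ordinates from the flat index, are illustrative assumptions rather than the report's actual code.

```c
#include <stddef.h>

/* Sketch: linear scan of the flat result array for the minimum E-value.
 * resultW is TEX_X - QUERY_X, the number of candidate columns, so the
 * flat index converts back to patch co-ordinates (rx, ry). */
size_t findMinE(const float *result, size_t n, int resultW,
                float *minE, int *rx, int *ry)
{
    size_t best = 0;
    for (size_t k = 1; k < n; ++k)
        if (result[k] < result[best])
            best = k;
    *minE = result[best];
    *rx = (int)(best % resultW);   /* column of the best patch */
    *ry = (int)(best / resultW);   /* row of the best patch */
    return best;
}
```

Running this once per terrain and keeping the overall smallest value yields the terrain index, patch co-ordinates and E-value sought.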

2.1.2 Query Constant Memory (QC) Implementation

Re-examining the problem statement, we note that the query remains constant throughout all the terrains processed. This warrants placing the query data in constant memory on the GPU, which is allocated once and left untouched until all of the queries have been completed. The modus operandi is entirely as before, with only the constant memory now being declared at global scope in our code, the query data copied to it, and it then being referenced in the kernel. Multiple runs on the cluster suggest some speedup compared to the previous implementation, with a definite speedup overall when compared to the CPU implementation. A typical result is presented in the table below.

Table 2: Computation time for the QC implementation and speedup (GPU CLUSTER).

  CPU Time (ms)     GPU Time (ms)     Speedup
  714531.187500     289247.783203     2.470308
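In CUDA this pattern might be sketched as follows. The symbol name and the use of the standard parameter set are assumptions for illustration; a 64x64 float patch (16 KB) fits comfortably within the 64 KB constant-memory limit of the devices of this era.

```cuda
#include <cuda_runtime.h>

#define QUERY_X 64
#define QUERY_Y 64

// Sketch: the query patch lives in constant memory, declared at global
// scope as described in the text. The symbol name is an assumption.
__constant__ float d_query[QUERY_X * QUERY_Y];

void uploadQuery(const float *h_query)
{
    // Copy the host query patch into constant memory once, before any
    // kernel launches; it is then read-only for the rest of the run.
    cudaMemcpyToSymbol(d_query, h_query,
                       QUERY_X * QUERY_Y * sizeof(float));
}

// Inside the kernel, d_query is referenced directly rather than passed
// as a pointer argument, e.g.:
//   float d = d_query[i * QUERY_X + j] - terrainValue;
```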

2.1.3 Query Constant Memory/Terrain Texture (QCTTex) Memory Implementation

Another optimization that we considered and implemented was the use of texture memory. From the previous iterations, we noted that the terrain data also remains unmodified while the value of E is being computed, but is too large to fit into constant memory. Texture memory was then considered as a possible solution, and the terrain data was bound to a one-dimensional texture for the computations on the GPU. Comparing the elapsed time to that of the CPU and the previous two implementations, it is clear that our problem benefited well from the use of texture memory.

Table 3: Computation time for the QCTTex implementation and speedup (GPU CLUSTER).

  CPU Time (ms)     GPU Time (ms)     Speedup
  714531.187500     143670.54687      4.973400
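Using the texture reference API current at the time of writing (since deprecated in favour of texture objects), the binding might be sketched as follows. The names are assumptions; the cached texture fetches help here because neighbouring threads read overlapping terrain regions.

```cuda
#include <cuda_runtime.h>

// Sketch: terrain heights read through a one-dimensional texture
// reference, as described in the text. Names are assumptions.
texture<float, 1, cudaReadModeElementType> texTerrain;

void bindTerrain(const float *d_terrain, int texX, int texY)
{
    // Bind the linear device array holding the terrain to the texture,
    // so kernel reads go through the (cached) texture path.
    cudaBindTexture(NULL, texTerrain, d_terrain,
                    texX * texY * sizeof(float));
}

// Inside the kernel, a height is then fetched by flat index, e.g.:
//   float h = tex1Dfetch(texTerrain, (ry + i) * texX + (rx + j));
```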

2.2 Variation of Parameters and Results

In this section we varied the code parameters defined above in order to test the parameter dependence of our solution. As the algorithms remained the same throughout, we do not discuss their implementation again. For convenience, we define the following standard parameter set for the subsequent simulations, and the reader may assume that these values remain the same throughout unless stated otherwise: NUM_TERRAIN = 20, TEX_X = TEX_Y = 1024, QUERY_X = QUERY_Y = 64, DIMBLOCK_X = DIMBLOCK_Y = 16 and DIMTHREAD_X = DIMTHREAD_Y = 960. We choose to investigate only the effects of varying the number of terrains, the terrain size and the patch size; for other variation results, please consult Appendix A. In all simulations, for brevity, we make use of only the most optimized version of our solution. Note that simulations for this section were carried out on the CHPC's Tsessebe cluster, using the visualisation node viz01.

2.2.1 Varying Total Terrains

We can see from the results presented in Table 4 that our solution performs better as the number of terrains is decreased. This is intuitively obvious: a smaller number of terrains implies fewer operations to be performed by the GPU, as well as a larger share of threads available per unit of work, and thus a smaller running time.

Table 4: Computation time for varying number of terrains (TSESSEBE CLUSTER).

  NUM_TERRAIN      1               10               15
  GPU Time (ms)    12095.289062    120854.070312    180778.125000

2.2.2 Varying Terrain Size

Here we varied the values of TEX_X and TEX_Y and investigated our solution's performance under the changes; the results are shown in Table 5. Note that the terrain is square in all cases. Though we obtained a computation time for a size of 512, there appeared to be a problem with the computation of E, the result obtained being 0. We suspect that this may be an issue with the terrain size's relationship to the other parameters. For a size of 256 we obtained a segmentation fault, for which we suspect the same reason. Despite this, we noted that as the terrain size increased, so did the computation time of our solution.

Table 5: Computation time for varying terrain size (TSESSEBE CLUSTER).

  TEX_X            256        512             4096
  GPU Time (ms)    SIGSEGV    17702.134766    652443.562500

2.2.3 Varying Patch Size

The results obtained suggest that as the query patch size is increased, the computation time also increases. Holding the other parameters as before, it is intuitively obvious that this should happen: the same number of threads must now each operate on a larger query patch. Increasing the number of threads while holding the patch size constant should combat this slowdown and result in an overall speedup, there being less work for each thread to do. Note that in all cases the query patch is square.

Table 6: Computation time for varying patch size (TSESSEBE CLUSTER).

  QUERY_X          16              32              96
  GPU Time (ms)    12005.465820    35817.128906    242187.390625

3 Conclusion

Using the code provided by Gain [1], we implemented a parallel solution to the terrain synthesis problem. We noted that an optimal speedup was obtained by the combined use of texture and constant memory for the terrain and query data respectively. The use of streams was considered as well, but not implemented, as we chose to optimise the GPU solution over that of the CPU. Furthermore, streams were not seen as integral to this problem: we were only comparing the running time of the GPU solution to the CPU solution for various optimisations of the former, so using streams would have had no real effect on the speed of the GPU solution. Though not implemented here, we also noted that we could obtain a much faster solution by restricting the determination of the minimum value of E for any one terrain to the GPU and then selecting the global minimum on the CPU. The dependence of our QCTTex solution on the aforementioned simulation parameters was also examined and comments were made on the results. What stood out to us was the value of the global minimum obtained (viz. 0) when some parameters were set to certain values, as well as the segmentation fault obtained when the terrain size was set to 256. We conclude that for this problem a GPU solution is indeed a definite way forward in terms of computation time, and that it can be further optimised in order to attain the best possible performance.
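The suggested per-terrain GPU minimisation could be realised as a standard shared-memory tree reduction, sketched below. This is not the report's code; the names are assumptions, and the CPU would finish by taking the minimum over the per-block partial results.

```cuda
#include <cuda_runtime.h>

// Sketch: block-level min-reduction over the E-value array, so that only
// one partial minimum per block (rather than the full array) is returned
// to the CPU. A standard tree reduction in shared memory.
__global__ void minReduce(const float *result, float *blockMins, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int k = blockIdx.x * blockDim.x + tid;

    // Pad out-of-range threads with FLT_MAX so they never win.
    s[tid] = (k < n) ? result[k] : 3.402823466e38f;
    __syncthreads();

    // Halve the active range each step, keeping the smaller value.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s[tid + stride] < s[tid])
            s[tid] = s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockMins[blockIdx.x] = s[0];  // one partial minimum per block
}
```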

References

[1] J. Gain. GPGPU lecture notes, University of Cape Town, 2011.

A Validation

A.1 Results on a Different GPU

Due to the slow access times on the CHPC cluster's resources, we also simulated our solution on a personal system sporting an NVIDIA GTS 450 GPU, in order to confirm and compare the speedups obtained in Section 2. Note that all of the figures below are unrounded and that the simulation parameter set of Section 2.1 was used. As is clear from the result shown in Table 9 for the QCTTex implementation, our solution has indeed benefited greatly from the use of texture memory, whilst the NV and QC implementations provide a more modest, but still very significant, speedup. The other results can be seen in Tables 7 and 8. Similar results and speedups can be expected if the simulation parameters were varied.

Table 7: Computation time for the NV implementation and speedup.

  CPU Time (ms)     GPU Time (ms)    Speedup
  339025.343750     39991.476563     8.477440

Table 8: Computation time for the QC implementation and speedup.

  CPU Time (ms)     GPU Time (ms)    Speedup
  339025.343750     36252.617188     9.351748

A.2 Further Variation of Parameters (TSESSEBE CLUSTER)

In this section we briefly examine the effects of changes in the block and grid dimensions on our QCTTex implementation. These simulations were once again performed on the Tsessebe cluster at the CHPC. For brevity, we consider only the speedup compared to the results obtained for the same implementation when the parameters were kept constant. The parameter set used here is identical to the one used in Section 2.2. Note that in all cases the block and grid dimensions were kept square. For all of these simulations, we obtained a value of 0 for our calculated minimum. We suspect that this may be due either to our providing dimensions incompatible with the card, or again to a possible sensitive dependence of the block and grid dimensions on the other simulation parameters.

Table 9: Computation time for the QCTTex implementation and speedup.

  CPU Time (ms)     GPU Time (ms)    Speedup
  339025.343750     13476.184570     25.157368

Table 10: Computation time for varying block and grid dimensions (TSESSEBE CLUSTER).

  DIMBLOCK_X       32            64            96
  GPU Time (ms)    940.611023    945.150024    941.534973

  DIMGRID_X        512           1024          2048
  GPU Time (ms)    939.822021    942.500977    961.567993
