
High Performance Computing

Lab 2 Report
Contents

1. Introduction
2. Method
3. Results and Discussion
Appendix
mp2-part1.cu
mp2-part2.cu
mp2-part3.cu

1. Introduction
In this lab, we explored the use of atomic operations in the particle-binning
problem, where a bin location may need to be filled simultaneously by
different particles, presenting a race condition at that location. We also
explored the use of shared memory, and methods for allocating it, in the
k-nearest-neighbours (knn) problem, where we needed to perform repeated
read-modify-write operations on the same memory locations as well as repeated
random accesses of memory locations, both of which would waste throughput if
carried out in global memory. The third part of this lab also explored the
algorithmic concept of splitting a large workload into smaller, manageable
batches.

2. Method
In part 1, the templated kernel initialize is used to fill the arrays d_bins
and d_bin_counters with initial placeholder values. Alternatively, these
values can be copied from the equivalent host arrays using cudaMemcpy.
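A minimal sketch of such a fill kernel is given below; the launch
configuration shown in the comments is an assumption for illustration, not
the exact mp2-part1.cu code.

#include <cuda_runtime.h>

// Templated fill kernel: writes a sentinel value into every element of an
// array of any element type, one element per thread.
template <typename T>
__global__ void initialize(T *array, T value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        array[i] = value;
}

// Hypothetical launches, e.g. marking bin slots empty and zeroing counters:
//   initialize<<<(num_slots + 255) / 256, 256>>>(d_bins, -1, num_slots);
//   initialize<<<(num_bins + 255) / 256, 256>>>(d_bin_counters, 0, num_bins);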
The function bin_index is declared with the __device__ qualifier so that it
can be called from the binning kernel on the GPU. An atomicAdd operation is
used to update the d_bin_counters array, avoiding the race condition that
arises when multiple threads update the same counter simultaneously.
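The binning step can be sketched as follows; the kernel signature, the bin
layout and the bin_index arithmetic are assumptions for illustration, not the
report's actual code (which is listed in the mp2-part1.cu appendix).

#include <cuda_runtime.h>

// Hypothetical device helper: maps a particle position in [0, world_size)^3
// to a linear bin index. The __device__ qualifier lets the binning kernel
// call it on the GPU.
__device__ int bin_index(float3 p, int bins_per_side, float world_size)
{
    int bx = min((int)(p.x / world_size * bins_per_side), bins_per_side - 1);
    int by = min((int)(p.y / world_size * bins_per_side), bins_per_side - 1);
    int bz = min((int)(p.z / world_size * bins_per_side), bins_per_side - 1);
    return (bz * bins_per_side + by) * bins_per_side + bx;
}

__global__ void bin_particles(const float3 *d_particles, int num_particles,
                              int *d_bins, int *d_bin_counters,
                              int bins_per_side, float world_size, int bin_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_particles) return;

    int bin = bin_index(d_particles[i], bins_per_side, world_size);

    // atomicAdd returns the counter's old value, so each particle claims a
    // unique slot even when several threads hit the same bin at once.
    int slot = atomicAdd(&d_bin_counters[bin], 1);
    if (slot < bin_size)
        d_bins[bin * bin_size + slot] = i;
}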
In part 2, the search for the k-nearest neighbours of each particle is
carried out by a separate thread in the grid. We use two arrays in the shared
memory of each block, s_neigh_dist and s_neigh_ids, to update the
neighbour-distance and particle-id lists for the k-nearest neighbours of each
particle. These arrays are of length num_neighbours*blockSize, since each
thread in the block updates num_neighbours elements of both arrays as it
performs its knn search. The size of the shared-memory arrays can be fixed at
compile time, by making the dimensions num_neighbours and blockSize constants
of a templated kernel, or set by dynamic allocation, by specifying the
per-block shared-memory size when the knn search kernel is launched. When the
search is complete, each thread loops over its num_neighbours elements of the
d_knn array, starting at an offset determined by threadIdx.
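A minimal sketch of the statically sized (templated) variant follows; the
insertion logic and the d_knn layout are assumptions for illustration, not
the exact mp2-part2.cu code.

#include <cfloat>
#include <cuda_runtime.h>

// Statically sized variant: num_neighbours and blockSize are compile-time
// template constants, so the shared arrays can be declared directly.
// blockSize must match the block dimension used at launch.
template <int num_neighbours, int blockSize>
__global__ void knn_kernel(const float3 *d_particles, int num_particles,
                           int *d_knn)
{
    __shared__ float s_neigh_dist[num_neighbours * blockSize];
    __shared__ int   s_neigh_ids [num_neighbours * blockSize];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_particles) return;

    // Each thread owns one num_neighbours-long segment of both arrays.
    float *my_dist = &s_neigh_dist[threadIdx.x * num_neighbours];
    int   *my_ids  = &s_neigh_ids [threadIdx.x * num_neighbours];

    for (int k = 0; k < num_neighbours; ++k) {
        my_dist[k] = FLT_MAX;   // "no neighbour yet" sentinels
        my_ids[k]  = -1;
    }

    float3 p = d_particles[tid];
    for (int j = 0; j < num_particles; ++j) {
        if (j == tid) continue;
        float dx = p.x - d_particles[j].x;
        float dy = p.y - d_particles[j].y;
        float dz = p.z - d_particles[j].z;
        float d2 = dx * dx + dy * dy + dz * dz;

        // Repeated read-modify-write on the candidate list: insert j into
        // the sorted segment if it beats the current worst candidate.
        if (d2 < my_dist[num_neighbours - 1]) {
            int k = num_neighbours - 1;
            while (k > 0 && my_dist[k - 1] > d2) {
                my_dist[k] = my_dist[k - 1];
                my_ids[k]  = my_ids[k - 1];
                --k;
            }
            my_dist[k] = d2;
            my_ids[k]  = j;
        }
    }

    // Copy the finished list out to global memory (assumed row layout).
    for (int k = 0; k < num_neighbours; ++k)
        d_knn[tid * num_neighbours + k] = my_ids[k];
}

In the dynamically sized variant, the same two arrays would instead be carved
out of a single extern __shared__ buffer, with the total byte count supplied
as the third launch parameter, e.g.
knn_kernel<<<num_blocks, blockSize, smem_bytes>>>(...).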
In part 3, each thread in the grid corresponds to a particle bin, and the
kernel loops, advancing by the grid length each time, until every bin has
been processed. (This allows for the possibility that the total number of
bins is larger than the kernel grid size.) The knn search over neighbouring
bins begins by making the linear bin index a function of the grid index of
each thread in each threadblock. The linear bin index is then decomposed into
the x, y and z bin indices for that bin, which are used to determine its
neighbouring bins. Hence, each thread is assigned to a bin and,
correspondingly, performs the knn search between the particles in that bin
and the particles in its neighbouring bins. As in part 2, we use
shared-memory arrays to update the neighbour-distance and particle-id lists,
except that, in part 3, a single thread loops over the shared-memory arrays
multiple times in order to obtain the k-nearest neighbours of all the
particles in the bin that the thread represents.
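A sketch of the grid-stride traversal described above is given below; the
helper names and bin layout are assumptions, and the per-particle
candidate-list update from part 2 is elided.

#include <cuda_runtime.h>

__global__ void knn_by_bins(const float3 *d_particles,
                            const int *d_bins, const int *d_bin_counters,
                            int bins_per_side, int bin_size)
{
    int total_bins = bins_per_side * bins_per_side * bins_per_side;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: each thread takes a bin, then jumps ahead by the
    // grid size, so all bins are covered even when total_bins exceeds the
    // number of threads launched.
    for (int bin = blockIdx.x * blockDim.x + threadIdx.x;
         bin < total_bins; bin += stride)
    {
        // Decompose the linear bin index into x, y, z bin indices.
        int bx =  bin % bins_per_side;
        int by = (bin / bins_per_side) % bins_per_side;
        int bz =  bin / (bins_per_side * bins_per_side);

        // Enumerate this bin and its (up to) 26 neighbours.
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = bx + dx, ny = by + dy, nz = bz + dz;
            if (nx < 0 || nx >= bins_per_side ||
                ny < 0 || ny >= bins_per_side ||
                nz < 0 || nz >= bins_per_side)
                continue;
            int nbin = (nz * bins_per_side + ny) * bins_per_side + nx;

            int count = min(d_bin_counters[nbin], bin_size);
            for (int s = 0; s < count; ++s) {
                int j = d_bins[nbin * bin_size + s];
                // ... compare particle j against every particle in bin
                // `bin`, updating the shared-memory candidate lists as in
                // part 2 ...
            }
        }
    }
}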

3. Results and Discussion

Figure 1: Results of particle binning in mp2-part1


The result from part 1 above shows that a significant speedup was achieved by
performing the particle-binning operation on the GPU. The result shown used
cudaMemcpy to initialize the GPU arrays d_bins and d_bin_counters; the same
result was obtained when the algorithm was run with templated kernels
initializing these arrays.

Figure 2: Results of k-nearest neighbours algorithm in part 2
Figure 2 above shows the results of the k-nearest-neighbours algorithm in
part 2. Again, it shows a significant speedup of the GPU version over the CPU
version, although using local memory to store the neighbour-distance and
particle-id update arrays was slightly faster than using shared-memory
arrays. The same shared-memory timing was obtained for both statically and
dynamically declared shared-memory arrays.

Figure 3: Results of k-nearest neighbours algorithm in part 3


Figure 3 above shows the results of the knn algorithm using the
neighbouring-bins approach. The results show a speedup of the GPU version
compared to the CPU version.
