Lab 4 Functors

Lab 4 Report

Contents

1. Introduction
2. Method
3. Results and Discussion
Appendix
mp4-part1.cu
mp4-part2.cu

1. Introduction
In this lab, we explored the features of the Thrust library that allow us to create
containers on both the CPU and the GPU and to write GPU code composed of the
parallel-primitive algorithms that Thrust already provides. Thrust offers an
interface similar to that of the C++ Standard Template Library (STL), which lets
us write GPU programs at a higher level of abstraction and develop them rapidly
without handling intricate details such as thread block dimensions.
In this lab, we built our programs from Thrust implementations of parallel
primitives such as scan, sort and compact.
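
As a minimal sketch of this style of programming, the following moves data between a host and a device container and chains two primitives. The helper name sorted_prefix_sums is hypothetical and is not taken from the lab code.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>

// Hypothetical helper (not from the lab code): sorts the input on the GPU,
// then replaces it with its running sum.
thrust::host_vector<int> sorted_prefix_sums(const thrust::host_vector<int>& h_in)
{
    // Constructing a device_vector from a host_vector copies the data to the GPU.
    thrust::device_vector<int> d_vals = h_in;

    // Sort in place on the device; no grid or block dimensions are specified.
    thrust::sort(d_vals.begin(), d_vals.end());

    // Inclusive scan (prefix sum), written back in place.
    thrust::inclusive_scan(d_vals.begin(), d_vals.end(), d_vals.begin());

    // Constructing a host_vector from a device_vector copies the result back.
    thrust::host_vector<int> h_out = d_vals;
    return h_out;
}
```

The algorithm calls have the same shape as their STL counterparts; the container type determines where the work runs.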

2. Method
The code for all the lab exercises was edited and run using the CUDA 11.2
Runtime API in Microsoft Visual Studio 2019 in Release mode. The GPU used for
the experiments was an NVIDIA GeForce GTX 1660 Ti in a laptop with a 2.6 GHz
Intel Core i7 processor.
In mp4-part1, we re-implement the binning program for 3-dimensional particles
from mp2, this time using sorting operations rather than atomic operations to
count the number of particles per bin.
The bin index of each point in the device vector is calculated using the functor
to_bin_index. Its member variables, which describe the width, height and depth
of the 3D space, are initialized first, and the functor is then applied to the
points on the device using Thrust’s transform algorithm. The points are then
sorted using the bin indices as the sort keys. Next, a compaction operation is
carried out on the sorted bin indices, producing a compacted list of bin indices
and the number of particles in each of those bins. Thrust’s constant_iterator
lets us create a virtual array of ones for this counting operation: it
“simulates” an array without allocating an actual array that would use a
significant amount of memory. Finally, the result of the compaction is scattered
into an array of per-bin counts using Thrust’s scatter algorithm.
The program for mp4-part1 is given in the appendix of the report.
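
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the appendix code: the body of to_bin_index, the assumption that coordinates lie in [0, 1), the use of thrust::reduce_by_key for the counting compaction, and the function name count_particles_per_bin are all assumptions, since the report does not spell them out.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>
#include <thrust/iterator/constant_iterator.h>

// Assumed form of the functor: maps a point in [0,1)^3 to a linear bin index.
struct to_bin_index
{
    unsigned int w, h, d;  // number of bins along x, y and z

    __host__ __device__
    to_bin_index(unsigned int w_, unsigned int h_, unsigned int d_) : w(w_), h(h_), d(d_) {}

    // Bin along one axis, clamped so a coordinate on the upper edge stays in range.
    __host__ __device__
    unsigned int axis_bin(float coord, unsigned int bins) const
    {
        unsigned int i = static_cast<unsigned int>(coord * bins);
        return i < bins ? i : bins - 1;
    }

    __host__ __device__
    unsigned int operator()(const float3& p) const
    {
        return (axis_bin(p.z, d) * h + axis_bin(p.y, h)) * w + axis_bin(p.x, w);
    }
};

// Hypothetical driver: returns one particle count per bin. Sorts d_points in place.
thrust::device_vector<unsigned int> count_particles_per_bin(
    thrust::device_vector<float3>& d_points,
    unsigned int w, unsigned int h, unsigned int d)
{
    const std::size_t n = d_points.size();
    thrust::device_vector<unsigned int> d_bin_index(n);

    // 1. Compute the bin index of every point.
    thrust::transform(d_points.begin(), d_points.end(),
                      d_bin_index.begin(), to_bin_index(w, h, d));

    // 2. Sort the points, using the bin indices as keys.
    thrust::sort_by_key(d_bin_index.begin(), d_bin_index.end(), d_points.begin());

    // 3. Compact runs of equal keys, summing a virtual array of ones per run.
    thrust::device_vector<unsigned int> d_unique_bins(n);
    thrust::device_vector<unsigned int> d_counts(n);
    auto ends = thrust::reduce_by_key(d_bin_index.begin(), d_bin_index.end(),
                                      thrust::constant_iterator<unsigned int>(1),
                                      d_unique_bins.begin(), d_counts.begin());
    const std::size_t num_occupied = ends.first - d_unique_bins.begin();

    // 4. Scatter the counts to their bins; bins with no particles keep a count of 0.
    thrust::device_vector<unsigned int> d_bin_counts(w * h * d, 0u);
    thrust::scatter(d_counts.begin(), d_counts.begin() + num_occupied,
                    d_unique_bins.begin(), d_bin_counts.begin());
    return d_bin_counts;
}
```

Because the keys are sorted first, each occupied bin forms one contiguous run, so a single pass can produce its count without atomic operations.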

In mp4-part2, we implement the Black-Scholes algorithm from mp3, together with
the compaction operation required for its subsequent rounds, using calls to
Thrust algorithms. A Structure of Arrays approach is taken in storing the
input and output variables of the Black-Scholes algorithm. The first round of
the algorithm is implemented by passing the functor black_scholes_functor to
the Thrust transform algorithm. The compaction operation that precedes the
subsequent rounds is carried out with thrust::remove_if(), which compacts the
input vector d_stock by removing the elements whose corresponding entries in
the stencil d_first_round_result satisfy the fail condition given by the
functor option_fails_threshold.
The program for mp4-part2 is given in the appendix of the report.
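
The stencil form of the compaction step might look like the sketch below. The names d_stock, d_first_round_result and option_fails_threshold come from the report, but the predicate body (a result below the threshold counts as a failure), the float element type and the wrapper compact_stock are assumptions made for illustration.

```cuda
#include <cstddef>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/remove.h>

// Assumed fail condition: a first-round result below the threshold fails.
struct option_fails_threshold
{
    float threshold;

    __host__ __device__
    explicit option_fails_threshold(float t) : threshold(t) {}

    __host__ __device__
    bool operator()(float first_round_result) const
    {
        return first_round_result < threshold;
    }
};

// Hypothetical wrapper: drops every d_stock[i] whose stencil entry
// d_first_round_result[i] fails, and returns the number of survivors.
std::size_t compact_stock(thrust::device_vector<float>& d_stock,
                          const thrust::device_vector<float>& d_first_round_result,
                          float threshold)
{
    // remove_if(first, last, stencil, pred) tests the stencil, not the data,
    // and keeps the surviving elements in their original relative order.
    thrust::device_vector<float>::iterator new_end =
        thrust::remove_if(d_stock.begin(), d_stock.end(),
                          d_first_round_result.begin(),
                          option_fails_threshold(threshold));

    // remove_if only moves the survivors to the front; erase shrinks the vector.
    d_stock.erase(new_end, d_stock.end());
    return d_stock.size();
}
```

With a Structure of Arrays layout, the same call would be repeated for each input array against the same stencil, so the stencil itself has to stay unmodified until every array has been compacted.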

3. Results and Discussion

The results of the binning operation on the CPU and on the GPU, with the GPU
version built from Thrust algorithms, are shown in figure 1 below.

Figure 1: Implementation of binning operation on GPU using thrust::transform algorithm

Figure 2: Implementation of binning operation on CPU using thrust::transform
algorithm
In figure 1, we observe a speedup in the binning operation implemented on the
GPU compared to the CPU implementation. Figure 2, however, shows the same
binning operation run on the CPU by replacing the thrust::device_vector
containers with thrust::host_vector containers. This host-side Thrust
implementation of the sorting algorithm (timing given by CPU2) appears to be
faster than the same Thrust implementation on the GPU. This may be due to
differences in how the sorting and compaction steps are carried out on the CPU
and on the GPU.
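
The container swap described above can be illustrated with a small template. This is a hedged sketch rather than the timing code used for the figures: the function name time_sort_ms and the use of std::chrono are assumptions.

```cuda
#include <chrono>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical timing helper. Vector may be thrust::host_vector<T> or
// thrust::device_vector<T>; Thrust picks the CPU or GPU implementation
// from the iterator type, so the call site is identical for both.
template <typename Vector>
double time_sort_ms(Vector& keys)
{
    const auto start = std::chrono::steady_clock::now();

    thrust::sort(keys.begin(), keys.end());

    // Make sure any outstanding device work has finished before stopping the clock.
    cudaDeviceSynchronize();

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

When comparing the two paths this way, the first device call usually also pays one-time CUDA initialization costs, so a warm-up run before timing is a reasonable precaution.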
Figure 3 below shows the implementation of the Black-Scholes algorithm on the
GPU and the CPU. We observe a speedup in the first round on the GPU compared
to the CPU implementation. However, after compaction, the CPU implementation
of the subsequent rounds appears to be faster than the GPU implementation.
This may be due to the reduced size of the input vectors, since the GPU tends
to outperform the CPU only for larger input sizes. The Array of Structures
arrangement of the input and output of the Black-Scholes implementation may
also be a factor in the speed changes, because an Array of Structures layout
incurs slower memory accesses than the more efficient Structure of Arrays
layout.
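
To make the two layouts concrete, the sketch below applies the same per-option computation to an Array of Structures and to a Structure of Arrays. The struct option_aos, the functors and the placeholder payoff max(stock - strike, 0) are hypothetical and stand in for the real Black-Scholes functor, which appears only in the appendix.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

// Array of Structures: the fields of one option sit next to each other in memory.
struct option_aos
{
    float stock;
    float strike;
};

// Placeholder payoff (NOT Black-Scholes), AoS version.
struct payoff_aos
{
    __host__ __device__
    float operator()(const option_aos& o) const
    {
        const float v = o.stock - o.strike;
        return v > 0.0f ? v : 0.0f;
    }
};

// Same placeholder payoff, SoA version: the zip_iterator hands over one tuple per option.
struct payoff_soa
{
    __host__ __device__
    float operator()(const thrust::tuple<float, float>& t) const
    {
        const float v = thrust::get<0>(t) - thrust::get<1>(t);
        return v > 0.0f ? v : 0.0f;
    }
};

void price_aos(const thrust::device_vector<option_aos>& d_options,
               thrust::device_vector<float>& d_out)
{
    thrust::transform(d_options.begin(), d_options.end(), d_out.begin(), payoff_aos());
}

// Structure of Arrays: one contiguous device_vector per field.
void price_soa(const thrust::device_vector<float>& d_stock,
               const thrust::device_vector<float>& d_strike,
               thrust::device_vector<float>& d_out)
{
    thrust::transform(
        thrust::make_zip_iterator(thrust::make_tuple(d_stock.begin(), d_strike.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(d_stock.end(), d_strike.end())),
        d_out.begin(), payoff_soa());
}
```

In the Structure of Arrays version, neighbouring threads read neighbouring floats from each array, which generally favours coalesced memory access on the GPU.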

Figure 3: Implementation of Black-Scholes algorithm using Thrust algorithms
