
Chapter 1

Introduction

In this project, we have attempted to develop parallel formulations for four different
operations on a GPU using the CUDA API and compare the speedups obtained with single
threaded CPU implementations. The four operations are:

1. Vector Multiplication
2. Vector Dot Product
3. Two-Dimensional Convolution
4. Trapezoidal Numerical Integration

1.1 Background:

1.1.1 An introduction to modern GPUs:

Graphics processing units (GPUs) have evolved into coprocessors larger than typical CPUs.
While CPUs use large portions of the chip area for caches, GPUs use most of the area for
arithmetic logic units (ALUs). The main concept that both NVIDIA and AMD GPUs use to
exploit the computational power of these ALUs is executing a single instruction stream on
multiple independent data streams (SIMD) [23]. This concept is known from CPUs with
vector registers and instructions operating on these registers. For example, a 128-bit vector
register can hold four single-precision floating-point values; an addition instruction
operating on two such registers performs four independent additions in parallel. Instead of
using vector registers, GPUs use hardware threads that all execute the same instruction
stream on different sets of data. NVIDIA calls this approach to SIMD computing single
instruction stream, multiple threads (SIMT). The number of threads required to keep the
ALUs busy is much larger than the number of elements inside vector registers on CPUs. GPU
performance therefore relies on a high degree of data-level parallelism in the application.
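
As a minimal sketch of the SIMT model described above (kernel name, array size and launch configuration are illustrative, not taken from the project code), every thread below executes the same instruction stream but operates on its own array element:

// Every thread runs the same instructions; the thread index selects its data element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index per thread
    if (i < n)
        data[i] += 1.0f;                             // same instruction, different data
}

// Host-side launch: far more threads than a CPU vector register could hold.
// add_one<<<(n + 255) / 256, 256>>>(d_data, n);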

To alleviate these requirements on data-level parallelism, GPUs can also exploit task-level
parallelism by running different independent tasks of a computation in parallel. This is
possible on all modern GPUs through the use of conditional statements. Some recent GPUs
also support the exploitation of task-level parallelism through concurrent execution of
independent GPU programs. Each of the independent tasks again needs to involve a
relatively high degree of data-level parallelism to make full use of the computational power
of the GPU, but exploitation of task-level parallelism gives the programmer more flexibility
and extends the set of applications that can make use of GPUs to accelerate computations.

The remainder of this section gives an overview of the hardware architectures of modern
GPUs, introduces the relevant programming languages, and discusses typical performance
bottlenecks and GPU benchmarking issues. The project focuses on NVIDIA GPUs because
most of the implementations in the subsequent sections target these GPUs.

1.1.2 NVIDIA GPUs:

In 2006 NVIDIA introduced the Compute Unified Device Architecture (CUDA). Today all of NVIDIA's
GPUs are CUDA GPUs. CUDA is not a computer architecture in the sense of a definition of an
instruction set and a set of architectural registers; binaries compiled for one CUDA GPU do
not necessarily run on all CUDA GPUs. More specifically, NVIDIA defines different CUDA
compute capabilities to describe the features supported by CUDA hardware. The first CUDA
GPUs had compute capability 1.0.
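
The compute capability of an installed device can be queried at run time with cudaGetDeviceProperties; the short program below is a minimal sketch (device 0 is assumed and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prop.major and prop.minor encode the compute capability, e.g. 6.0 for sm_60.
    printf("Device %s has compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}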

A CUDA GPU consists of multiple so-called streaming multiprocessors (SMs). The threads
executing a GPU program, a so-called kernel, are grouped in blocks. Threads belonging to
one block all run on the same multiprocessor, but one multiprocessor can run multiple
blocks concurrently. Blocks are further divided into groups of 32 threads called warps; the
threads belonging to one warp are executed in lock step, i.e., they are synchronized. As a
consequence, if threads inside one warp diverge via a conditional branch instruction,
execution of the different branches is serialized. On GPUs with compute capability 1.x all
streaming multiprocessors must execute the same kernel. Compute capability 2.x supports
concurrent execution of different kernels on different streaming multiprocessors.
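
The following sketch (illustrative only) shows a conditional that splits every warp: the two branches are executed one after the other, with the inactive half of the warp idle each time:

__global__ void divergent_kernel(float *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)            // half of each warp takes this branch ...
        out[tid] = tid * 2.0f;
    else                         // ... then the other half executes this one
        out[tid] = tid * 0.5f;
}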

Each streaming multiprocessor contains several so-called CUDA cores, 8 per SM in compute
capability 1.x, 32 per SM in compute capability 2.0 and 48 per SM in compute capability 2.1.
One might think that a reasonable number of threads per SM is, for example, 8 for
compute-capability-1.x GPUs or 48 for compute-capability-2.1 GPUs. In fact, many more
threads are needed to fully utilize the ALUs; the reason is that concurrent execution of many
threads on one SM is used to hide arithmetic latencies and, to some extent, memory-access
latencies. For compute capability 1.x NVIDIA recommends running at least 192 or 256
threads per SM. To fully utilize the power of compute-capability-2.x GPUs even more
threads need to run concurrently on one SM. For applications that involve a very high
degree of data-level parallelism it might sound like a good idea to just run as many
concurrent threads as possible. The problem is that the register banks are shared among
threads; the more threads are executed, the fewer registers are available per thread. Finding
the optimal number of threads running concurrently on one streaming multiprocessor is a
crucial step towards achieving good performance.
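
Choosing this thread count can be guided by the CUDA occupancy API; the snippet below is a small sketch in which my_kernel and the block size of 256 are placeholders rather than code from this project:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { if (data) data[0] = 0.0f; }

void report_occupancy()
{
    int blocks_per_sm = 0;
    int block_size = 256;   // candidate number of threads per block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  block_size, 0 /* dynamic shared memory */);
    printf("Resident blocks of %d threads per SM: %d\n", block_size, blocks_per_sm);
}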

Aside from registers, each thread also has access to various memory domains. Each
streaming multiprocessor has several KB of fast shared memory accessible by all threads
on this multiprocessor. This memory is intended for exchanging data between the threads of a
thread block; latencies are as low as for register access, but throughput depends on access
patterns. The shared memory is organized in 16 banks. If two threads within the same half-
warp (16 threads) load from or store to different addresses on the same memory bank in
the same instruction, these requests are serialized. Such requests to different addresses on
the same memory bank are called bank conflicts. Graphics cards also contain several
hundred MB up to a few GB of device memory. Each thread has a part of this device
memory dedicated as so-called local memory. Another part of the device memory is global
memory accessible by all threads. Access to device memory has a much higher latency than
access to shared memory or registers. Additionally, each thread has cached read-only
access to constant memory and texture and surface memory. Loads from constant cache are
efficient if all threads belonging to a half-warp load from the same address; if two threads
within the same half-warp load from different addresses in the same instruction,
throughput decreases by a factor equal to the number of different load addresses. Another
decision (aside from the number of threads per SM) that can have a huge impact on
performance is what data is kept in which memory domain.
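
The sketch below illustrates, with purely illustrative names and sizes, how data can be placed in the different memory domains just described:

__constant__ float coeffs[16];            // constant memory: cached, read-only in kernels

__global__ void memory_domains(const float *g_in, float *g_out)   // g_in/g_out live in global memory
{
    __shared__ float tile[256];           // shared memory: visible to all threads of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = g_in[i];          // stage a global value in shared memory (blockDim.x <= 256 assumed)
    __syncthreads();

    float local_val = tile[threadIdx.x] * coeffs[0];   // local_val lives in a register
    g_out[i] = local_val;                 // write the result back to global memory
}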

Communication between CPU and GPU is done by transferring data between the host
memory and the GPU device memory or by mapping page-locked host memory into the
GPU's address space. Asynchronous data transfers between page-locked host memory and
device memory can overlap with computations on the CPU. For some CUDA devices since
compute capability 1.1 they can also overlap with computations on the GPU. Since CUDA 4.0
NVIDIA simplifies data exchange between host memory and device memory of Fermi GPUs
by supporting a unified virtual address space. The unified virtual address space is
particularly interesting in conjunction with peer-to-peer memory access between multiple
GPUs. This technique makes it possible to access the memory of one GPU directly from
another GPU without data transfers through host memory.
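
A minimal sketch of such an asynchronous transfer from page-locked host memory is shown below (buffer size and names are illustrative; error checking is omitted):

#include <cuda_runtime.h>

void async_copy_example()
{
    const size_t bytes = 1 << 20;                 // 1 MB, illustrative
    float *h_buf = NULL, *d_buf = NULL;

    cudaMallocHost((void **)&h_buf, bytes);       // page-locked (pinned) host allocation
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    // ... the CPU can do other work here while the copy is in flight ...
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}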

1.1.3 GPU-Accelerated Computing:

GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a
CPU to accelerate deep learning, analytics, and engineering applications. Pioneered in 2007
by NVIDIA, GPU accelerators now power energy-efficient data centers in government labs,
universities, enterprises, and small-and-medium businesses around the world. They play a
huge role in accelerating applications in platforms ranging from artificial intelligence to
cars, drones, and robots.

GPU-accelerated computing offloads compute-intensive portions of the application to the
GPU, while the remainder of the code still runs on the CPU. From a user's perspective,
applications simply run much faster.

1.1.4 GPU vs CPU Performance:

A simple way to understand the difference between a GPU and a CPU is to compare how
they process tasks. A CPU consists of a few cores optimized for sequential serial processing
while a GPU has a massively parallel architecture consisting of thousands of smaller, more
efficient cores designed for handling multiple tasks simultaneously.

GPUs have thousands of cores to process parallel workloads efficiently.

1.2 Project Methodology:

We produced two different implementations for each of the four operations: one using global
memory and the other using shared memory. We then compared the run-time of the
single-threaded CPU solution with the run-time of the parallel GPU implementation to obtain
the speedup. Additionally, we incorporated a tolerance check in the program to ensure that
the results computed with CUDA matched the values produced by the serial implementation.
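
The tolerance check follows the usual pattern of comparing each GPU result against the CPU reference within a small epsilon; the function below is a minimal sketch of that idea (the name and epsilon value are illustrative, not the exact code used in the project):

#include <cmath>
#include <cstdio>

int check_results(const float *cpu_ref, const float *gpu_res, int n, float eps)
{
    for (int i = 0; i < n; i++) {
        if (fabsf(cpu_ref[i] - gpu_res[i]) > eps) {
            printf("Mismatch at %d: CPU %f vs GPU %f\n", i, cpu_ref[i], gpu_res[i]);
            return 0;   // test failed
        }
    }
    return 1;           // all elements within tolerance
}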

Chapter 2
Parallel Formulation of Vector Multiplication

2.1 Environment Information:

All code was run remotely on the eeitnu.nirmauni.ac.in host. The code was compiled using
the following command:

nvcc -o vec_mat_mult vec_mat_mult.cu vec_mat_mult_gold.cpp -O3 -gencode arch=compute_60,code=sm_60

After compilation the program was run as follows:

./vec_mat_mult

2.2 Code Analysis:

2.2.1 Vector Multiplication using Global Memory:

A) CPU Side:

Matrix Ad=allocate_matrix_on_gpu(A);
copy_matrix_to_device(Ad,A);
Matrix Xd=allocate_matrix_on_gpu(X);
copy_matrix_to_device(Xd,X);
Matrix Yd=allocate_matrix_on_gpu(Y);

The allocate_matrix_on_gpu function calls cudaMalloc to dynamically allocate memory on
the GPU. The copy_matrix_to_device function calls cudaMemcpy to copy the array from the CPU
into the allocated GPU memory.
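
A plausible sketch of these two helpers is shown below, assuming a Matrix struct that carries its dimensions and an elements pointer (the exact struct layout in the project code may differ):

struct Matrix { int height; int width; float *elements; };

Matrix allocate_matrix_on_gpu(const Matrix M)
{
    Matrix Mdevice = M;
    size_t bytes = (size_t)M.height * M.width * sizeof(float);
    cudaMalloc((void **)&Mdevice.elements, bytes);       // device-side allocation
    return Mdevice;
}

void copy_matrix_to_device(Matrix Mdevice, const Matrix Mhost)
{
    size_t bytes = (size_t)Mhost.height * Mhost.width * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, bytes, cudaMemcpyHostToDevice);
}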

struct timeval start, stop;


dim3 threads(32,1,1);
dim3 grid(MATRIX_SIZE/threads.x,1);
gettimeofday(&start, NULL);

The timeval struct was used to time the program. The dim3 threads() declaration chooses the
number of threads per block. In this case we create 32 threads in the x direction
and no extra threads in the y and z dimensions. The dim3 grid() declaration determines the
number of blocks in the execution grid.
vec_mat_kernel_naive<<<grid, threads>>>(Ad.elements, Xd.elements, Yd.elements);
cudaThreadSynchronize();
gettimeofday(&stop, NULL);
tp1 = (float)(stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec)/(float)1000000);
}
printf("Parallel time %f\n", tp1);

The first line launches the kernel function vec_mat_kernel_naive, which runs on the GPU; the
arguments in parentheses are passed to the kernel from the CPU. cudaThreadSynchronize() is
used to synchronize the threads within the kernel and makes sure all threads join before
control is passed back to the CPU. tp1 is a variable used to time the kernel run.

}
copy_matrix_from_device(Y,Yd);
cudaFree(Ad.elements);
cudaFree(Xd.elements);
cudaFree(Yd.elements);
}

copy_matrix_from_device() is used to copy the result from the GPU back to the CPU. The
cudaFree() function is used to free the dynamically allocated device memory.

B) Kernel Side:

int tx=threadIdx.x+blockIdx.x*blockDim.x,k;
float Y_temp=0;
for(k=0;k<MATRIX_SIZE;k++)

{
float element1=Ad[tx*MATRIX_SIZE+k];
float element2=Xd[k];
Y_temp+=element1*element2;
}

Yd[tx]=Y_temp;

The integer tx is used to store the global id of a particular thread. Y_temp is temporary
storage for the partial summation that each thread performs. The total number of threads is
equal to the number of rows (or, equivalently, columns) of the square matrix, split into blocks
of 32. Each thread performs the multiplication of one row of A with the column vector X and
accumulates the result in Y_temp. Once it has processed all elements in its row, it stores the
summation in the array Yd[tx], which is copied back to the CPU.

2.2.2 Vector Multiplication using Shared Memory:

A) CPU Side:

The CPU-side code for the shared memory vector multiplication was almost identical to
the CPU-side code for the global memory implementation. The few lines of code that differ
are shown below and explained.

dim3 dimBlock(TILE_WIDTH,TILE_WIDTH,1);
dim3 dimGrid( MATRIX_SIZE/dimBlock.x,1);
vec_mat_kernel_optimized <<< dimGrid, dimBlock >>>
(Ad.elements,Xd.elements,Yd.elements);

As in the CPU side for vector multiplication using global memory, dim3 is used to
define the execution configuration. dimBlock was used to define the tile size: a constant
TILE_WIDTH was defined in the header file and used. Based on trial and error, the optimum
tile width was found to be 8. dimGrid was used to define the number of blocks needed, which
in this case is the matrix size divided by the dimension of an individual block in the x
direction.

B) Kernel Side:

int tx = threadIdx.x;
int ty = threadIdx.y;
int by = blockIdx.y;
int bxdim = blockDim.x;
int bydim = blockDim.y;
int Row = by*bydim;

tx and ty are used to store the x and y coordinates of the thread within its block. by is used to
store the y location of the block in the grid. bxdim stores the dimension of the block in the x
direction and bydim stores the dimension of the block in the y direction. Row stores the
number of the row from which to extract values.

__shared__ float shared_X[TILE_WIDTH];
__shared__ float shared_A[TILE_WIDTH][TILE_WIDTH];

__shared__ specifies that a variable is to be stored in shared memory. In this case two
arrays, shared_X and shared_A, of the specified sizes are stored in shared memory. The tile
used for X is one-dimensional and the tile used for A is two-dimensional, as can be seen in the
above declarations.

// Temporary storage for partial summation
float Pvalue = 0;
// Variable used to choose tile
int m;

The float Pvalue is a temporary variable used to store the partial summation for each thread.
m is an integer that marks the starting index of a specific tile.

for (int i = 0; i < MATRIX_SIZE/TILE_WIDTH; i++)
{
// Choose which tile the thread will load
m = i*TILE_WIDTH;
int el = tx + m;

// Load into shared memory
shared_X[tx] = Xd[el];
shared_A[ty][tx] = Ad[Row*MATRIX_SIZE + el];

The two lines above load data from global memory into the shared memory arrays declared
previously.

//sync threads acts as barrier sync. Waits for all threads to finish
__syncthreads();

__syncthreads() waits for all threads in the block to reach the barrier and synchronize before proceeding.

// First thread of block multiplies elements and accumulates into Pvalue
if (tx == 0)
{
for (int j = 0; j < bxdim; j++)
Pvalue += shared_A[tx][j] * shared_X[j];
}
__syncthreads();
}

// First thread stores the value of Pvalue into Yd
if (tx == 0)
Yd[Row] = Pvalue;

The first thread of each block is used to perform the multiplication of the elements held in
shared memory. Once the threads have synchronized, the first thread is used again to store
the accumulated sum into the array Yd[], which is later copied back to the CPU.

Chapter 3
Parallel Formulation of Vector Dot Product

3.1 Environment Information:


All code was run remotely on the eeitnu.nirmauni.ac.in host. The code was compiled
using the following command:

/usr/local/cuda/bin/nvcc -o vector_dot_product vector_dot_product.cu vector_dot_product_gold.cpp -O3 -gencode arch=compute_60,code=sm_60

After compilation the program was run as follows:

./vector_dot_product <num elements>

3.2 Code Analysis:

3.2.1 Vector Dot Product using Texture Memory:

A) CPU Side:

float *A_d = NULL, *B_d = NULL, *P_d = NULL;
size_t sf = sizeof(float);

Declare pointers for the vectors to be stored on the GPU for the computation. The second line
defines a variable that stores the size of a float.

cudaMalloc((void **)&A_d, num_elements * sf);
cudaMalloc((void **)&B_d, num_elements * sf);
cudaMalloc((void **)&P_d, GRID * sf);

Dynamically allocate space on the GPU to store the vectors A and B, as well as the per-block result array P.

cudaMemcpy(A_d, A_on_host, num_elements * sf, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_on_host, num_elements * sf, cudaMemcpyHostToDevice);
cudaMemset(P_d, 0, GRID * sf);

Copy the elements of A and B to the GPU. Since P_d is the result vector, we just initialize all
of its elements to 0 for now, using the cudaMemset function.

dim3 thread_block(BLOCK, 1, 1);
dim3 grid(GRID, 1);

cudaBindTexture(NULL, A_t, A_d, num_elements * sf);
cudaBindTexture(NULL, B_t, B_d, num_elements * sf);
Allocate the thread block and execution grid using dim3. cudaBindTexture binds A and B to
the texture references A_t and B_t (which are later used on the GPU to quickly access the
elements of A and B).

struct timeval start, stop;
gettimeofday(&start, NULL);
vector_dot_product_kernel<<<grid, thread_block>>>(P_d, A_d, B_d, num_elements);
cudaThreadSynchronize();
gettimeofday(&stop, NULL);
tp2 = (float)(stop.tv_sec - start.tv_sec + (stop.tv_usec - start.tv_usec)/(float)1000000);
printf("Kernel Execution Time= %fs. \n", tp2);
speedup1 = tp1 / tp2;
printf("Speedup %f\n", speedup1);

The same simple timing approach as before is used to measure the time spent in the kernel.
tp1 was used to store the time of the compute-gold (serial CPU) function, and the ratio of tp1
to tp2 gives the speedup. The synchronize call was placed before stopping the timer to ensure
all threads had finished before control was passed back to the CPU.

float result = 0.0;
cudaMemcpy(&result, P_d, sf, cudaMemcpyDeviceToHost);
cudaUnbindTexture(A_t);
cudaUnbindTexture(B_t);
cudaFree(P_d);
cudaFree(A_d);
cudaFree(B_d);
return result;

The final answer from the kernel was passed to the CPU and stored in the variable
result through the cudaMemcpy() function.

cudaUnbindTexture() was used to unbind the texture references, and cudaFree() was used to
free the memory that was dynamically allocated earlier.

B) Kernel Side:

float partial_sum = 0.00;
__shared__ float sum[BLOCK];
int tx = threadIdx.x;

Initialize a float variable in which each thread stores its partial sum. The partial sum of each
thread in the thread block is later stored in the shared-memory array called sum. tx simply
gives the thread id of each thread in the x direction.

for (int j = (bd * bx + tx); j < num_elements; j = j + (bd * gx))
{
partial_sum += tex1Dfetch(A_t, j) * tex1Dfetch(B_t, j);
}
sum[tx] = partial_sum;
__syncthreads();

Each thread strides through the input, multiplying and accumulating the corresponding
elements of A and B fetched from texture memory. The sum[] array is used to store the partial
sum of each thread of the block in shared memory. __syncthreads() was used to synchronize
all threads once they finished their work.

int h;
if(BLOCK % 2 ==0)

{
h = BLOCK/2;
}
else
{
h = ((BLOCK+1)/2);
}

while ( h != 0 )

{
if ( tx < h)
{
sum[tx] += sum[tx+ h];
}

__syncthreads();
h = h / 2;
}

Depending on whether the block size was odd or even, h was initialized accordingly. After this,
h is halved at each reduction step, in which every active thread adds two values to produce
one, so the number of threads required also halves. We therefore only use threads with index
below h and keep summing until a single value remains, which is the sum of all the elements
that the threads of the block placed in the shared array sum.

if(tx == 0)
{
atomicAdd(P, sum[0]);
}

The first thread is used to atomically add sum[0] (which now holds the block's reduced result)
to the global variable P. Once all thread blocks have atomically added their sums to P, the
final value is passed back to the CPU.

Chapter 4
Parallel Formulation of Two-Dimensional Convolution

4.1 Environment Information:


All code was run remotely on the eeitnu.nirmauni.ac.in host. The code was compiled using
the following command:

/usr/local/cuda/bin/nvcc -o 2Dconvolution 2Dconvolution.cu 2Dconvolution_gold.cpp -O3 -gencode arch=compute_60,code=sm_60

After compilation the program was run as follows:

./2Dconvolution

4.2 Code Analysis:

4.2.1 Two-Dimensional Convolution using Global Memory:

A) CPU Side:

Matrix Md = AllocateDeviceMatrix(M);
CopyToDeviceMatrix(Md, M);
Matrix Nd = AllocateDeviceMatrix(N);
CopyToDeviceMatrix(Nd, N);
struct timeval start, stop;

Allocate device memory for the matrices M and N that are given as input to the function and
copy them to the device. Here matrix N is the matrix to be convolved and matrix M is the
convolution kernel.

Matrix Pd = AllocateDeviceMatrix(P);
CopyToDeviceMatrix(Pd, P);

Allocate memory for the matrix in which the convolution result will be stored and copy
that matrix to the device.

dim3 thread_block(THREAD_BLOCK_SIZE, THREAD_BLOCK_SIZE, 1);
int num_elements = N.height;
int num_thread_blocks_x = num_elements / thread_block.x;
int num_thread_blocks_y = num_elements / thread_block.y;
dim3 grid(num_thread_blocks_x, num_thread_blocks_y, 1);

Specify a two-dimensional grid and a two-dimensional thread block according to the size of
the given matrix. The thread block size is kept at 32 and the kernel matrix size is fixed
at 5x5.

gettimeofday(&start, NULL);
ConvolutionKernel<<<grid,thread_block>>>(Md,Nd,Pd);
cudaThreadSynchronize();
gettimeofday(&stop, NULL);
tp=(stop.tv_sec - start.tv_sec +(stop.tv_usec - start.tv_usec)/(float)1000000);

The kernel call is made. Threads are synchronized to get the exact time it takes to run
the code on the GPU.

CopyFromDeviceMatrix(P, Pd);

The output matrix is copied back from the device to the host.

FreeDeviceMatrix(&Md);
FreeDeviceMatrix(&Nd);
FreeDeviceMatrix(&Pd);

Free the device memory that was allocated for input and output matrices.

B) Kernel Side:

int hN=N.height,wN=N.width;

Stores the height and width of matrix.

int i = blockIdx.y*blockDim.y + threadIdx.y;
int j = blockIdx.x*blockDim.x + threadIdx.x;

Select the row and column indices according to the block dimensions and the two-dimensional
thread indices.

The rest of the code logic is similar to the compute-gold reference. mbegin is used to calculate
the beginning row index of the kernel overlap in the main matrix for the convolution process,
that is, multiplying each element of the kernel with the corresponding element of the matrix.
Similarly, mend is used to calculate the last row of the overlap, and nbegin and nend give the
corresponding column indices. Convolution is then performed over the 5x5 kernel-sized
portion of the matrix.

double sum = 0;
unsigned int mbegin = (i < 2) ? 2 - i : 0;
unsigned int mend = (i > (hN - 3)) ? hN - i + 2 : 5;
unsigned int nbegin = (j < 2) ? 2 - j : 0;
unsigned int nend = (j > (wN - 3)) ? (wN - j) + 2 : 5;
// overlay M over N centered at element (i,j). For each
// overlapping element, multiply the two and accumulate
for (unsigned int m = mbegin; m < mend; ++m) {
for (unsigned int n = nbegin; n < nend; n++) {
sum += M.elements[m * 5 + n] * N.elements[wN*(i + m - 2) + (j + n - 2)];
}
}
// store the result
P.elements[i*wN + j] = (float)sum;

Chapter 5
Parallel Formulation of Trapezoidal Numerical
Integration
5.1 Environment Information:
All code was run remotely on the eeitnu.nirmauni.ac.in host. The code was compiled using
nvcc with the same options as in the previous chapters.

5.2 Code Analysis:

5.2.1 Trapezoidal Numerical Integration using Shared Memory:

A) CPU Side:

int Block_num = (GRID/BLOCK);
size_t fs = sizeof(float);
float *Result_fromGPU;
double sum, interim;

The first line computes the number of blocks needed from the given block and grid sizes,
which are both predefined constants. The next line saves the size of a float to make the code
easier to write. The last line declares two variables, sum and interim.

float *partial_result = (float *)malloc(Block_num * fs);
cudaMalloc((void**)&Result_fromGPU, Block_num * fs);
dim3 thread_block(BLOCK, 1, 1);
dim3 grid(Block_num, 1);

partial_result is a dynamically allocated array used to store the per-block summation results
from the kernel. This is a workaround to get a precise yet fast answer without making
multiple kernel calls.

The dim3 declarations are used to set up the thread blocks and the execution grid.

struct timeval start, stop;
gettimeofday(&start, NULL);
trap_kernel<<<grid, thread_block>>>(a, b, n, h, Result_fromGPU);
cudaThreadSynchronize();
gettimeofday(&stop, NULL);

These lines make the kernel call and wrap a timer around it to measure the time spent in the
kernel.

cudaMemcpy(partial_result, Result_fromGPU, Block_num * fs, cudaMemcpyDeviceToHost);

Used to bring the per-block summation results back to the host as an array once the kernel
has finished computing the summation of each block.

sum = ((b + 1)/sqrt(pow(b, 2) + b + 1) + (a + 1)/sqrt(pow(a, 2) + a + 1)) / 2;
int i = 0;
while (i < Block_num)
{
sum = partial_result[i] + sum;
i++;
}
cudaFree(Result_fromGPU);
free(partial_result);
return (h*(sum));

The first line computes the endpoint term (f(a) + f(b))/2 defined in the pseudocode for the
standard trapezoidal rule. The partial results are then added to this to obtain the final answer,
using a while loop that simply iterates over each element of the partial_result array and adds
it to sum. Finally, this total is multiplied by the step size h to give the final answer, which is
returned.

B) Kernel Side:

int tx = threadIdx.x;
int bd = blockDim.x;
int bi = blockIdx.x;
int gd = gridDim.x;
int i;
float x, y;
double partial_sum = 0.00;
__shared__ float sum[BLOCK];

tx, bd, bi and gd are used to save the various indices and dimensions of the thread block and
the grid. i is an iterating variable. A double-precision variable partial_sum was initialized;
this holds the partial sum of each thread.

To develop an optimized implementation, the partial results of the threads are stored in
shared memory.

for (i = (bi * bd) + (tx + 1); i < n; i = i + bd * gd)
{
x = (i * h);
y = a + x;
partial_sum = partial_sum + (y + 1)/sqrt(pow(y, 2) + y + 1);
}
sum[tx] = partial_sum;

This is the core of the program. The number of threads is smaller than the number of
elements: in our case the total number of threads, which equals the grid size, is 16,384, while
the default case has 100,000,000 trapezoid slices. Therefore, after processing one loop
iteration, each thread must move to the next unprocessed location, which is offset by the
total number of threads, i.e. bd (block dimension) * gd (number of blocks in the grid).

After this each thread performs the operation highlighted in the Pseudocode below
(provided in the description of the problem).

int half;
if (BLOCK % 2 == 0)
{
half = BLOCK/2;
}
else
{
half = ((BLOCK+1)/2);
}
while (half > 0)
{
if (tx < half)
{
sum[tx] = sum[tx] + sum[tx + half];
}
__syncthreads();
half = half/2;
}
Once all threads in a block have stored their respective partial sums in shared memory, the
above code block reduces the shared array to the sum of all elements in the block. Since the
reduction halves the active range after each step, we use an int half that is halved after each
iteration. Because the number of active threads is also halved, only the lower half of the
threads is used to reduce the elements.

if (tx == 0)
Result_fromGPU[bi] = sum[0];

Finally, one thread in each block is used to store the reduced sum of its respective block in an
array, which is passed back to the CPU.

Chapter 6
Results and Calculations
6.1 Vector Multiplication:
Table 1: Speedup using Global Memory

Matrix size Speedup

512X512 11.12

1024X1024 17.33

2048X2048 29.6

Table 2: Speedup using Shared Memory

Matrix size Speedup

512X512 19.72

1024X1024 31.81

2048X2048 59.7

Calculations:

The number of floating point operations per byte is (8800 x 10^9 FLOPS) / (320 x 10^9
bytes/s) = 27.5 floating point operations per byte. Considering a floating point value as 4
bytes, there are 110 floating point operations per 4-byte load.

When there are 512X512 row-column combinations in total, with 2 operations for each, the
number of floating point operations is 0.000524 GFLOP.

When there are 1024X1024 row-column combinations in total, with 2 operations for each, the
number of floating point operations is 0.00209 GFLOP.

When there are 2048X2048 row-column combinations in total, with 2 operations for each, the
number of floating point operations is 0.008388 GFLOP.

Performance = (number of floating point operations per element) x (number of rows) x
(number of columns) / (time for execution)
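
The small helper below shows how this formula can be evaluated; the example call uses a hypothetical kernel time, since the individual measured times are not listed here:

#include <cstdio>

double gflops(double flops_per_element, int rows, int cols, double seconds)
{
    double total_flops = flops_per_element * rows * cols;
    return total_flops / seconds / 1e9;    // convert to GFLOPS
}

// Example: 2 floating point operations per element of a 2048 x 2048 product,
// with a hypothetical kernel time of 0.6 ms:
// printf("%.2f GFLOPS\n", gflops(2.0, 2048, 2048, 0.0006));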

Table 3: Kernel performance for each implementation

Matrix size    Global Memory Performance (in GFLOPS)    Shared Memory Performance (in GFLOPS)

512X512        6.55                                      11.64

1024X1024      13.66                                     25.18

2048X2048      28.24                                     50.22

6.2 Vector Dot Product:


Table 4: Speedup using Texture Memory

Num elements Speedup

100000 2.67
1000000 16.7
10000000 36.7

Table 5: Speedup for 10000000 elements with varying thread block size

Block size Speedup

16 26.5
32 39.74
64 33.23
512 36.21
1024 35.99

6.3 Two-Dimensional Convolution:

Table 6: Speedup using Global Memory

Matrix size Speedup Block size

512X512 55.72 32
1024X1024 74.72 32
2048X2048 85.5 32
512X512 54.2 16
1024X1024 72.1 16
2048X2048 84.3 16
512X512 51.2 8
1024X1024 63.35 8
2048X2048 76.2 8

Calculations:

Performance = (number of floating point operations per element) x (number of rows) x
(number of columns) / (time for execution). The total number of floating point operations for
the computation of one output element is 25 multiplications and 24 additions, i.e. 49
operations.

Table 7: Kernel performance for each implementation

Matrix size    Global Memory Performance (in GFLOPS)

512X512        36.8

1024X1024      62.88

2048X2048      76.65

Table 8: CPU performance for each implementation

Matrix size CPU performance (in GFLOPS)

512X512 0.67

1024X1024 0.83

2048X2048 0.96

Timing overhead is calculated by measuring the elapsed time from just before the GPU call to just after it returns.

Table 9: Timing overhead for different matrix sizes

Matrix size Timing overhead (in seconds)

512X512 1.016

1024X1024 1.026

2048X2048 1.077

As the size of the matrix increases, the timing overhead also increases.

6.4 Trapezoidal Numerical Integration:

Table 10: Speedup for varying NUM_TRAPEZOIDS with thread block size 64.

NUM_TRAPEZOIDS Speedup

100000 3.35

1000000 457.69

10000000 1557.35

100000000 1610.41

1000000000 1679.51

Table 11: Execution Time for Various Thread Block sizes

Block Size No. of Blocks Execution Time (s)

32 512 0.002115

64 256 0.002100

128 128 0.002116

256 64 0.002361

512 32 0.002363

1024 16 0.002392

Chapter 7

Conclusions

Parallel implementation of vector multiplication resulted in appreciable speedup. It was
noted that the shared-memory implementation was almost twice as fast as the global-memory
parallel implementation of vector multiplication. This can be attributed to the fact that
accessing shared memory (a read or write takes ~5 cycles) is much quicker than accessing
global memory (a read or write takes ~500 cycles). It was also noticed that the speedup was
almost directly proportional to the matrix size: with a larger data set, more parallelism can be
extracted by running a higher number of threads, so the larger the number of elements, the
better the speedup obtained on the GPU.

It was also noted that the thread block size does not greatly affect the speedup in the case of
the vector dot product. A block size of 32 gave consistently high speedups compared to block
sizes of 64, 16 and 512. The most noticeable difference was for a block size of 16: as can be
seen from Table 5, this led to the least speedup. It was also noticed that a block size of 2048
caused the test to fail, because 2048 threads per block exceeds the maximum number of
threads allowed in a thread block, so the kernel cannot be launched.

Parallel implementation of 2D convolution resulted in appreciable speedup. It was noted that
the higher the number of elements, the better the speedup obtained on the GPU, simply
because a higher number of elements allows better extraction of parallelism.

As with vector multiplication, it was also noted that the block size does not greatly affect the
speedup. A block size of 32 gave consistently higher speedups than block sizes of 8 and 16.
The timing overhead of the compute-on-device function increased with the matrix size.

Parallel implementation of numerical integration resulted in appreciable speedup. Shared
memory was used to optimize the trapezoid kernel code; shared memory provides much
faster access than global memory, as was already seen in the vector multiplication results.

Table 10 shows the speedups achieved for various numbers of trapezoid slices. It was
concluded that GPU speedups are best seen when the number of elements being worked on is
large. Table 11 shows the execution time for various block sizes. It can be seen that smaller
block sizes (32-128) have quicker execution times compared to larger block sizes (256-1024).
