
GPGPU Parallel Merge Sort Algorithm

Jim Kukunas and James Devine


May 4, 2009

Abstract
The increasingly high data throughput and computational power of today's Graphics Processing Units (GPUs) has led to the development of General Purpose computation on the GPU (GPGPU). nVidia has created a unified GPGPU architecture, known as CUDA, which can be utilized through language extensions to the C programming language. This work seeks to explore CUDA through the implementation of a parallel merge sort algorithm.

1 Introduction

GPGPU allows for the parallel execution of code on the GPU. The GPU is essentially used as a co-processor to handle the execution of highly parallel code. There are very large performance improvements that can be realized by outsourcing highly parallel, computationally intense tasks to the GPU.

Typically, Central Processing Units (CPUs) achieve computational parallelism through the use of an out-of-order execution core, which typically contains only a few Arithmetic Logic Units (ALUs). ALUs are execution units dedicated to integer and floating point computations. For example, three of the six ports within the execution core of the Intel Core 2 Duo are ALUs. This means that, assuming a full pipeline, three arithmetic computations can execute at once. GPUs, unlike CPUs, are specialized for intense floating point calculations, and thus contain many stream processors. Stream processors are ALUs which specialize in a limited form of parallelism, similar to SIMD vector instructions, in which there exists a stream of data and a kernel function which is applied to each element in the stream. These streams are ideal for intensive, highly specialized, parallel floating point operations. The nVidia GTX 260, for example, contains 250 on-board stream processors.

Figure 1: CUDA Architecture

Compared with a CPU, the GPU has a much greater computation throughput, and thus it is ideal to outsource heavy, embarrassingly parallel computations onto the GPU. This outsourcing onto the graphics card is known as General Purpose computation on Graphics Processing Units (GPGPU). Historically, GPGPU architectures were highly specific to certain graphics cards and required the developer to write all GPGPU code in assembly language, assembled specifically for the card in use. However, with the increase in the number of stream processors, the increase in the amount of on-board memory, and increased pipeline speeds, the demand for GPGPU tools has grown, and both nVidia and ATI have released unified GPGPU architectures for their latest cards. nVidia's architecture, CUDA, is a set of language extensions to the C programming language. The CUDA architecture can be seen in Figure 1.

2 Extended CUDA C

To allow for simple integration with existing applications, nVidia implemented their GPGPU architecture through extensions to the C programming language. The nVidia CUDA compiler supports full ISO C++ for code running on the host but, for functions executed on the device, supports only CUDA C.

Function Qualifiers

The CUDA extensions for C contain three types of function type qualifiers: __device__, __global__, and __host__. The __device__ qualifier declares a function executed completely on the GPU, and hence this type of function is the most restricted. These functions cannot be dereferenced, cannot be recursive, cannot contain static variable declarations, and cannot take a variable number of arguments. These functions can also only be called from the device. Functions declared with the __global__ qualifier define a kernel function, the function which will be executed on each element in the stream of data. __global__ functions are executed on the GPU, but are only callable from the host. They also must return void. The third type of qualifier, __host__, declares a function which is executed on the host. These functions are only callable from the host. Once a __global__ function has been declared, it can be called using the following syntax:

FUNC<<<Dg, Db, Ns>>>(parameters);

Here, Dg is a three-element tuple specifying the dimensions and size of the grid to be launched. Db is also a three-element tuple, specifying the dimension and size of each block. Finally, Ns is of the type size_t, and represents the number of bytes in shared memory that is dynamically allocated per block. The number of threads spawned per block can be represented by [1]:

threadsperblock = Db.x × Db.y × Db.z

Variable Qualifiers

The CUDA extensions for C contain three types of variable qualifiers: __device__, __constant__, and __shared__. Variables declared with the __device__ qualifier reside within the memory of the device. These variables are in the global memory space and have the lifetime of the application. The __constant__ qualifier declares a variable similar to __device__, but the value of which is constant. The third qualifier, __shared__, declares a variable which is accessible from all threads within a block [1].
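As an illustrative CUDA C fragment (not compiled here; the identifiers are invented for illustration), the three variable qualifiers might appear as:

```cuda
// Illustrative declarations only; names are made up.
__device__   int   d_counter;    // device global memory; application lifetime
__constant__ float c_scale;      // constant memory; read-only on the device

__global__ void scale_kernel(float *data)
{
    __shared__ float tile[64];   // shared by all threads in the block
    tile[threadIdx.x] = data[threadIdx.x] * c_scale;
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x];
}
```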

3 Learning About CUDA and GPGPU

The starting point for the project was to learn about GPGPU and CUDA. The Internet served as our best resource. We started with the Wikipedia entry for general information and then followed the links to more authoritative websites. These links can be found in the References section.

Install the CUDA SDK and Driver

The CUDA SDK and drivers are available for free from nVidia's CUDA ZONE website [2]. After downloading the software, we first installed the CUDA drivers. These replace the previously installed video drivers to allow the CUDA extensions to run. Once the CUDA drivers were installed, the SDK installer was run. This installed the files required to write code that uses the CUDA extensions. Along with the SDK, a package of sample CUDA programs was installed.
Integrate the CUDA SDK into Visual Studio 2008

After the CUDA driver is installed, it is necessary to set the environment variables in Visual Studio to allow proper integration. Inside of Visual Studio, you must set the bin, include, lib, and source paths to those installed by the driver and by the SDK. Once that is completed, you must create a custom build rule to allow the CUDA application to be built by the NVCC compiler.
Porting an Iterative Merge Sort to the CUDA Architecture

Since CUDA does not support recursion, we had to implement an iterative version of merge sort. After a few attempts, we settled on an implementation we found at [4]. Using existing CUDA examples and the CUDA Programming Guide [1], we created a host function which was passed two input arrays: the first a list of the numbers to be sorted, and the second the destination array. Both of these arrays were first created and initialized in host memory. Then, using cudaMalloc and cudaMemcpy, they were shuttled onto the GPU's dedicated memory. Inside the host function, the merge function is called. The merge function is a __device__ function which performs the actual merge sort algorithm on the GPU.

Challenges

The two main challenges involved in this project were setting up Visual Studio 2008 and writing the algorithm. After several hours of work, we were able to get Visual Studio to build applications using the CUDA API. It was quite frustrating at the time, but after the correct files were added to the Windows PATH variable, programs began to compile without any problems. Once we were able to use the API, we were faced with the challenge of learning CUDA and then coding the merge sort algorithm. One of the more difficult tasks in writing the algorithm was deciding how to convert the recursive merge sort into an iterative one. After finding some sample C code for an iterative merge sort online, we were able to replicate it using the CUDA API.
4 Future Work

If more time were allotted for this project, it would have been interesting to look at the speedup gained by performing merge sort on the GPU. There is much further work that can be done with GPGPU. The increasing speed and decreasing price of GPUs make them a promising platform for future research. Undoubtedly, better tools and easier extensions to implement GPGPU are not far down the road.
5 Conclusion

The project was both fun and educational. We got a chance to learn how to write parallel code that runs on the GPU.

6 Appendix: Iterative Merge Sort [4]

// This is an iterative version of merge sort.
// It merges pairs of two consecutive lists one after another.
// After scanning the whole array to do the above job,
// it goes to the next stage. Variable k controls the
// size of the lists to be merged. k doubles each time
// the main loop is executed.

#include <stdio.h>
#include <stdlib.h>   /* for random() */
#include <time.h>     /* for clock() */

int i, n, t, k;
int a[100000], b[100000];

void merge(int l, int r, int u)
{
    int i, j, k;
    i = l;
    j = r;
    k = l;
    while (i < r && j < u) {
        if (a[i] <= a[j]) { b[k] = a[i]; i++; }
        else              { b[k] = a[j]; j++; }
        k++;
    }
    while (i < r) { b[k] = a[i]; i++; k++; }
    while (j < u) { b[k] = a[j]; j++; k++; }
    for (k = l; k < u; k++) {
        a[k] = b[k];
    }
}

void sort()
{
    int k, u;
    k = 1;
    while (k < n) {
        i = 1;
        while (i + k <= n) {
            u = i + k * 2;
            if (u > n)
                u = n + 1;
            merge(i, i + k, u);
            i = i + k * 2;
        }
        k = k * 2;
    }
}

int main()
{
    printf("input size \n");
    scanf("%d", &n);
    /* for (i = 1; i <= n; i++) scanf("%d", &a[i]); */
    for (i = 1; i <= n; i++) a[i] = random() % 1000;
    t = clock();
    sort();
    for (i = 1; i <= 10; i++) printf("%d ", a[i]);
    printf("\n");
    printf("time= %d millisec\n", (clock() - t) / 1000);
    return 0;
}

GPGPU Merge Sort Kernel

/* merge_kernel.cu
 * Merge sort implementation on the GPU through threads run in parallel
 * nVidia CUDA Framework
 * Jim Kukunas and James Devine
 */

#ifndef MERGE_KERNEL_CU
#define MERGE_KERNEL_CU

#define NUM 64

__device__ inline void Merge(int *values, int *results, int l, int r, int u)
{
    int i, j, k;
    i = l;
    j = r;
    k = l;
    while (i < r && j < u) {
        if (values[i] <= values[j]) { results[k] = values[i]; i++; }
        else                        { results[k] = values[j]; j++; }
        k++;
    }
    while (i < r) { results[k] = values[i]; i++; k++; }
    while (j < u) { results[k] = values[j]; j++; k++; }
    for (k = l; k < u; k++) {
        values[k] = results[k];
    }
}

__global__ static void MergeSort(int *values, int *results)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;
    int k, u, i;

    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();

    k = 1;
    while (k < NUM) {
        i = 1;
        while (i + k <= NUM) {
            u = i + k * 2;
            if (u > NUM) {
                u = NUM + 1;
            }
            Merge(shared, results, i, i + k, u);
            i = i + k * 2;
        }
        k = k * 2;
        __syncthreads();
    }

    values[tid] = shared[tid];
}

#endif
GPGPU Merge Sort Driver Application

/* merge.cu
 * Tests the merge sort implementation run in parallel threads on the GPU
 * through the nVidia CUDA Framework
 * Jim Kukunas and James Devine
 */

#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>

#include "merge_kernel.cu"

int main(int argc, char **argv)
{
    if (cutCheckCmdLineFlag(argc, (const char **)argv, "device"))
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice(cutGetMaxGflopsDeviceId());

    int values[NUM];

    // initialize a random dataset
    for (int i = 0; i < NUM; i++) {
        values[i] = rand();
    }

    int *dvalues, *results;
    cutilSafeCall(cudaMalloc((void **)&dvalues, sizeof(int) * NUM));
    cutilSafeCall(cudaMemcpy(dvalues, values, sizeof(int) * NUM,
                             cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMalloc((void **)&results, sizeof(int) * NUM));
    cutilSafeCall(cudaMemcpy(results, values, sizeof(int) * NUM,
                             cudaMemcpyHostToDevice));

    MergeSort<<<1, NUM, sizeof(int) * NUM * 2>>>(dvalues, results);

    // check for any errors
    cutilCheckMsg("Kernel execution failed");

    cutilSafeCall(cudaFree(dvalues));
    cutilSafeCall(cudaMemcpy(values, results, sizeof(int) * NUM,
                             cudaMemcpyDeviceToHost));
    cutilSafeCall(cudaFree(results));

    bool passed = true;
    for (int i = 1; i < NUM; i++) {
        if (values[i - 1] > values[i]) {
            passed = false;
        }
    }
    printf("Test %s\n", passed ? "PASSED" : "FAILED");

    cudaThreadExit();
    cutilExit(argc, argv);
}
References

[1] NVIDIA Corporation. NVIDIA CUDA Programming Guide. NVIDIA, 2008. http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf

[2] NVIDIA Corporation. CUDA Zone. NVIDIA, 2009. http://www.nvidia.com/object/cuda_home.html

[3] Mark Harris. General purpose computation on graphics hardware. GPGPU.org, 2009. http://gpgpu.org

[4] Tadao Takaoka. Iterative merge sort. http://www.cosc.canterbury.ac.nz/tad.takaoka/cosc229/imerge.c
