
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 10(4), 315–332 (10 APRIL 1998)

The performance of a selection of sorting algorithms
on a general purpose parallel computer

R. D. DOWSING∗ AND W. S. MARTINS†
School of Information Systems, University of East Anglia, Norwich NR4 7TJ, UK
(e-mail: rdd@sys.uea.ac.uk)

SUMMARY
In the past few years, there has been considerable interest in general purpose computational
models of parallel computation to enable independent development of hardware and software.
The BSP and related models represent an important step in this direction, providing a simple
view of a parallel machine and permitting the design and analysis of algorithms whose
performance can be predicted for real machines. In this paper we analyse the performance
of three sorting algorithms on a BSP-type architecture and show qualitative agreement
between experimental results from a simulator and theoretical performance equations.
© 1998 John Wiley & Sons, Ltd.


1. INTRODUCTION
Parallel machines are in widespread use but most applications use architecture dependent
software which is thus not portable and quickly becomes obsolete with a new generation
of parallel machines. Some researchers[1–4] argue that a parallel model of computation is
required which separates hardware and software development so that portable software can
be developed for a range of different parallel architectures.
Parallel models of computation can broadly be classified into two categories: PRAM
(shared memory) based models and network (nonshared memory) models. The former
type has been extensively used for analysing the complexity of parallel algorithms. In this
idealised model processors operate completely synchronously and have access to a single
shared memory whose cells can be accessed in unit time. Although these assumptions ease
the design and analysis of parallel algorithms they usually result in algorithms which are
unsuitable for direct implementation on current parallel machines. Network models are
based on real parallel machines but, by including realistic costs for operations, they make
analysis more difficult. In particular, by including the topology of the architecture in the
model, algorithms map with different degrees of difficulty on to machines with different
topologies. The bulk-synchronous parallel (BSP) model developed by Valiant[4] attempts
to bridge the gap between these two types of model.

∗ Correspondence to: R. D. Dowsing, School of Information Systems, University of East Anglia, Norwich
NR4 7TJ, UK. (e-mail: rdd@sys.uea.ac.uk)
† Present Address: Informatics Department, Universidade Federale de Goias - IMF, Caixa Postal 131,
Campus II, CEP:74001-970, Goiania - GO - Brazil.
Contract grant sponsor: CNPq, Brazil; Contract grant number: 200721/91-7.

Received 12 May 1996; revised 15 January 1997.

The BSP and related models define a general purpose computational model in
which the programmer is presented with a two-level memory hierarchy. Each processor has
its own local memory and access to a common shared memory. Logically, shared memory
is a uniform address space, although physically it may be implemented across a range of
distributed memories. The model abstracts the topology of the machine by characterising the
communication network via two parameters (L and g)[4], which are related to the latency
and bandwidth, respectively. Global accesses are costed using these two parameters and the
lower the values the lower the communication costs. This is the case in the idealised PRAM
model where global operations are assumed to take the same time as local operations. In
existing parallel machines, however, these values are considerably higher and dependent
on the access patterns to global memory. Valiant[5] has shown that by introducing some
random behaviour in the routing algorithm (two-phase randomised routing) it is possible
for real machines to maintain good bandwidth and latency. This enables real machines to
reflect a two-level memory hierarchy, thus hiding physical topology from the programmer.

2. BSP AND RELATED MODELS


The BSP and related models aim to provide a simple view of a parallel machine such that the
programmer is able to design and analyse algorithms whose performance is predictable in
real machines. However, the way a computation is viewed, as well as the way performance
analysis of algorithms is accomplished, may differ from one model to another.
A computation in the BSP model consists of a sequence of supersteps separated by barrier
synchronisation. In a superstep local operations can be carried out and messages can be sent
and received to implement global operations Read and Write. These communication events,
however, are not guaranteed to terminate before the end of a superstep. Therefore remote
data requested in a superstep can only be used in the next superstep. The performance of
a BSP algorithm is determined by adding the costs of its supersteps. Individual supersteps
are costed by analysing their total computation and communication demands. The global
parameters L and g, together with the number of processors p and the problem size n,
are used in this analysis. The parameter L represents the synchronisation periodicity of
the machine, whereas g is the value such that an h-relation can be performed in g·h time;
an h-relation is defined as a routing problem where each processor sends and receives h
messages. The cost of a superstep is then given by max{L, c, g·h_s, g·h_r}, where c is the
maximum time spent in local operations, h_s is the number of messages sent and h_r is the
number of messages received in that superstep. Alternative cost definitions can be used,
depending on the assumptions made about the implementation of the supersteps.
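As an illustration of this cost rule, the following minimal C sketch evaluates max{L, c, g·h_s, g·h_r} for a sequence of supersteps; the parameter values in main are purely illustrative and are not taken from the paper.

```c
#include <stdio.h>

/* Cost of one BSP superstep as quoted in the text: max{L, c, g*hs, g*hr},
 * where c is the maximum time spent in local operations and hs/hr are the
 * messages sent/received by a processor in that superstep. */
static double superstep_cost(double L, double g, double c, double hs, double hr)
{
    double cost = L;
    if (c > cost)      cost = c;
    if (g * hs > cost) cost = g * hs;
    if (g * hr > cost) cost = g * hr;
    return cost;
}

int main(void)
{
    /* Illustrative parameter values only; they are not taken from the paper. */
    double L = 100.0, g = 4.0;
    double total = superstep_cost(L, g, 250.0, 32.0, 32.0)   /* superstep 1 */
                 + superstep_cost(L, g,  80.0, 64.0,  8.0);  /* superstep 2 */
    printf("total BSP cost = %.1f time units\n", total);
    return 0;
}
```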
In contrast with the bulk synchronous nature of algorithms for the BSP model discussed
above, the execution of algorithms on the LogP model can be viewed as a number of
separately executing processes which are asynchronous with respect to each other.
Communication and synchronisation among the processes are performed via message passing. A
message sent to a processor can be used as soon as it arrives, instead of having to wait for a
barrier operation as in the BSP model. The LogP model defines four main parameters: L, o,
g and P. P represents the number of processors. The parameter o is defined as the overhead
associated with the transmission or reception of a message. The parameters L and g,
although using the same labels as in the BSP model, have different meanings. The parameter
L sets an upper bound on the latency incurred in sending a small message, whereas g is
defined as the minimum time period (or gap) between consecutive message transmissions
or receptions, the reciprocal of g being the bandwidth per processor. The parameter g is
similar to that in the BSP model, in that it provides a measure of the efficiency of message
delivery. However, since there is no implicit synchronisation in the LogP model, the notion
of supersteps performing h-relations does not apply to this model. The model assumes that
the network has a finite capacity, i.e. each processor can have no more than L/g outstanding
messages in the network at any one time. Processors attempting to exceed this limit are
stalled until the message transfer can be initiated. This is in contrast to the BSP model,
where any balanced communication event can be done in gh time. The performance of
LogP algorithms can be quantified by summing all computation and communication costs
of the algorithm. Communications are costed in terms of primitive message events. For
example, the cost for reading a remote location is 2L + 4o. Two message transmissions are
required, one requesting and another sending the data. In each transmission each processor
involved in the operation spends o time units interacting with the network and the message
takes L time units to get to its destination. The cost of a writing operation is the same,
although in this case the response is the acknowledge required for sequential consistency.
This analysis assumes the data fits into a single transmission block. When dealing with a
block of n such basic data, the cost becomes 2L + 4o + (n − 1)g, assuming o < g. This
is because, after the first transmission, subsequent transmissions have to wait g time units.
The LogP model encourages the use of balanced communication events so as to avoid a
processor being flooded with incoming messages – a situation where all but L/g of the
sending processors would stall.
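The LogP costing described above can likewise be written down directly; the sketch below encodes the 2L + 4o single-word cost and the 2L + 4o + (n − 1)g block cost (assuming o < g), again with purely illustrative parameter values.

```c
#include <stdio.h>

/* LogP cost of reading one remote word, as described in the text: two
 * message transmissions (request and reply), each incurring the overhead o
 * at both end points plus the network latency L. */
static double logp_read_word(double L, double o)
{
    return 2.0 * L + 4.0 * o;
}

/* Cost of reading a block of n such words, assuming o < g: after the first
 * transmission, successive ones are spaced by the gap g. */
static double logp_read_block(double L, double o, double g, long n)
{
    return 2.0 * L + 4.0 * o + (double)(n - 1) * g;
}

int main(void)
{
    /* Illustrative parameter values only; they are not taken from the paper. */
    double L = 10.0, o = 2.0, g = 4.0;
    printf("single word    : %.1f time units\n", logp_read_word(L, o));
    printf("1024-word block: %.1f time units\n", logp_read_block(L, o, g, 1024));
    return 0;
}
```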
The WPRAM model views the BSP and LogP models as architectural models. The
following description is restricted to this level of abstraction.∗ The WPRAM model attempts
to extend the BSP model to a more flexible form. One important difference is that barrier
synchronisation is supported at the message-passing level using point-to-point routing.
Message passing can also be used to implement any synchronisation operation. This makes
the WPRAM model closer to the LogP model. However, while the BSP and LogP models
are applicable to a broad range of machine classes, the WPRAM model has been designed
for a class of scalable distributed memory machines. This means that network latency should
increase at a logarithmic rate with respect to the number of processors, i.e. D = O(log(p)),
and that each processor should be able to send messages into the network at a constant
frequency, i.e. g = O(1). The global parameters D and g are similar to L and g defined in
the LogP model. However, instead of an upper bound given by a constant, the parameter D
represents a mean delay which increases logarithmically with the number of processors. It
is modelled by the mean network delay resulting from continuous random traffic and it thus
includes the effects of switch contention and contention for shared data at the destination
processor. The parameter g is the same as that of the LogP model, though no limit is
imposed on the total number of outstanding messages a processor can have in the network.
Because the network is capable of handling a constant maximum frequency of accesses per
processor, the analysis of WPRAM algorithms is facilitated since the programmer does not
have to be concerned with a network limit capacity, as is the case with LogP algorithms.
Besides the global parameters D and g, the WPRAM model defines a number of other
machine parameters. These parameters have been incorporated in a simulator[6] so that the
execution time of WPRAM programs can be obtained. This allows one to determine the
relative importance of various low-level parameters on the performance of algorithms.

∗ At a higher level, the WPRAM model provides a number of operations which, by taking advantage of
the network properties, are guaranteed to be scalable. The use of these operations makes the WPRAM model
particularly suitable for the design and analysis of scalable algorithms.


The models described earlier have in common the two-level memory structure with uni-
form global access. Despite some differences in the number and meaning of the parameters,
performance analysis in these models becomes very similar when barrier synchronisation
is used in the algorithms. In this paper we analyse the performance of three such bulk
synchronous algorithms using the WPRAM model with a smaller number of performance
parameters. The theoretical results are then compared with the more accurate results
produced by the WPRAM simulator[7].

Figure 1. WPRAM simulator

3. WPRAM SIMULATOR
The WPRAM simulator mentioned earlier was used to obtain the experimental results
given in this paper. The simulator is based on the interaction of processes that are used
to represent both the nodes of the target machine and the user processes. Algorithms are
implemented using a programming interface and can be subsequently executed directly on
the simulator. This way the sequence of operations generated by the program drives the
simulator (execution-driven discrete-event simulation). The WPRAM target architecture
for the simulator is a distributed memory machine, which supports uniform global access by
the use of data randomisation;∗ see Figure 1. The simulator includes a detailed performance
model which costs operations based on measured performance figures for the T9000 transputer
processor and simulations of the C104 packet router[6]. Local operations modelled
by the simulator include arithmetic calculation, context switching, message handling and
local process management. Messages entering the network are assumed to be split up to
guarantee that no one message ties up a switch for long periods of time, while global data
are assumed to be randomised based on the unit of a cache line. Global operations are costed
based on the high-level parameters g and D. The value for g and the performance figures
mentioned above were obtained from the Esprit PUMA Project, while D was derived from
a simulation of the C104 router carried out at Inmos. Work on validating the simulator
results has been carried out at Leeds University, with a close match being found between
theoretical predictions and experimental results[7].

∗ The use of data randomisation, where data are distributed throughout the local memories, obviates the need
for randomised routing.
The simulator is written in C and provides a rich programming interface to execute
algorithms written in C. The programming interface consists of a set of library procedures
which support process management, shared data access and process synchronisation[6].
Only a small subset of these library calls were used in this research, including procedures
for process management (fork, join, my node and my index), read and write procedures
for data access and a procedure to barrier synchronise processes.
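The paper names the library calls used (fork, join, my node, my index, read, write and barrier) but not their C prototypes. The single-process stubs in the sketch below are therefore hypothetical stand-ins, included only so that the bulk-synchronous read–compute–write–barrier shape followed by the sorting implementations can be shown as compilable code.

```c
#include <stdio.h>
#include <string.h>

#define P 4                    /* number of (emulated) processors  */
#define N 16                   /* problem size                     */

static int global_mem[N];      /* stand-in for the shared memory   */

/* Hypothetical stand-ins for the WPRAM library calls; a real program would
 * use the simulator's own read/write/barrier procedures instead. */
static void shared_read (int *dst, int off, int n) { memcpy(dst, global_mem + off, n * sizeof(int)); }
static void shared_write(int *src, int off, int n) { memcpy(global_mem + off, src, n * sizeof(int)); }
static void barrier(void) { /* all processes would synchronise here */ }

/* One superstep for one emulated processor: copy a block to local memory,
 * transform it locally, copy it back, then synchronise. */
static void process(int my_index)
{
    int local[N / P];
    shared_read(local, my_index * (N / P), N / P);
    for (int i = 0; i < N / P; i++)        /* purely local computation */
        local[i] += 1;
    shared_write(local, my_index * (N / P), N / P);
    barrier();
}

int main(void)
{
    for (int i = 0; i < N; i++) global_mem[i] = i;
    for (int p = 0; p < P; p++)            /* a real run would fork P processes */
        process(p);
    printf("global_mem[0..3] = %d %d %d %d\n",
           global_mem[0], global_mem[1], global_mem[2], global_mem[3]);
    return 0;
}
```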

4. PERFORMANCE ANALYSIS
In analysing the performance of the algorithms, both the computational performance of the
individual processors and the communication performance of the interconnection network
have been taken into account. Local computations were modelled by run-time functions
resulting from fitting the known form of the functions to the observed times. Local sort
was modelled as k_s n log n and local merge as k_m n.∗ The values of k_s and k_m were
both found to be equal to 1.7 µs. As for global communications, only two parameters,
T_i and T_g, were used. These parameters capture two distinct situations that occur in the
global communications of the algorithms studied. T_i represents the time to read/write an
individual item from/to global memory in a situation where the issuing processor has
to wait for a response – either the data (read) or an acknowledgement (write). T_g is used
when several independent read/write operations are required; pipelining is used to improve
communication performance by overlapping computation and communication, implying a
cost of T_g n to read/write n elements. While T_i is dominated by send/receive overheads and
network delay, these costs are insignificant for block operations (modelled by T_g), where
processor bandwidth is the determining factor. For the machine under study the values of T_i
and T_g, for integer data items, were found to be 47 µs and 5.7 µs, respectively. In the analysis
a small number of processors is assumed so that T_i's value can be taken as a constant. Also,
the synchronisation costs of the algorithms have been found to be small, compared to the
other communication costs, and were thus not included in the equations. Although these
assumptions restrict the analysis, they allow the derivation of simpler equations whose
results approximate well to those obtained experimentally.
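These measured parameters translate directly into a small cost model; the C sketch below simply encodes the values quoted above (k_s = k_m = 1.7 µs, T_i = 47 µs, T_g = 5.7 µs) so that individual cost terms can be evaluated.

```c
#include <stdio.h>
#include <math.h>

/* Performance parameters reported in the text (microseconds, integer data,
 * logarithms to base 2).  These constants are specific to the simulated
 * T9000/C104 machine and would differ on other hardware. */
static const double KS = 1.7;   /* local sort:  KS * n * log2(n)           */
static const double KM = 1.7;   /* local merge: KM * n                     */
static const double TI = 47.0;  /* one individual global read/write        */
static const double TG = 5.7;   /* per element, pipelined block read/write */

static double sort_cost (double n)        { return KS * n * log2(n); }
static double merge_cost(double n)        { return KM * n; }
static double item_cost (double items)    { return TI * items; }
static double block_cost(double elements) { return TG * elements; }

int main(void)
{
    double n = 1024.0;
    printf("local sort of %.0f ints : %8.1f us\n", n, sort_cost(n));
    printf("local merge of %.0f ints: %8.1f us\n", n, merge_cost(n));
    printf("block copy of %.0f ints : %8.1f us\n", n, block_cost(n));
    printf("one individual access   : %8.1f us\n", item_cost(1.0));
    return 0;
}
```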
The following Sections describe the performance analysis of a selection of sorting
algorithms. The implementations make use of explicit memory management with the
processes not operating directly on global data, but, instead, copying the data to local
memory and then operating locally. This was necessary due to the high access times
incurred for global data access. It has been found that the ratio of global to local
access time is roughly 50:1. Measurements have also shown that, whenever the number
of accesses involved in a computation step is greater than half the elements involved in
this step, it is faster to copy all elements to local memory and do the computations locally.
For each algorithm, performance equations are derived based on the parameters previously
described. The performance of the algorithms is then analysed for different input sizes
and numbers of processors. The results obtained theoretically are then compared to those
obtained from the simulator.
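The copy-versus-individual-access rule of thumb from these measurements can be stated in a couple of lines; the function name below is, of course, just illustrative.

```c
#include <stdbool.h>
#include <stdio.h>

/* Rule of thumb from the measurements quoted above: copy a whole block of
 * `elements` to local memory whenever a computation step will make more
 * than elements/2 global accesses; otherwise fetch items individually. */
static bool copy_block_locally(long accesses, long elements)
{
    return 2 * accesses > elements;
}

int main(void)
{
    printf("%d %d\n", copy_block_locally(600, 1000), copy_block_locally(20, 1000));
    return 0;
}
```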

∗ Logarithms are assumed to be to the base 2.


5. THE PERFORMANCE OF A SELECTION OF SORTING ALGORITHMS


Sorting is one of the most common activities performed by computers. It is often said that
it accounts for 25–50% of all the work performed by computers. Many computer programs
(compilers, databases etc.) incorporate a sort in order to enhance the speed and simplicity of
the algorithms. In addition to its practical importance, sorting has been studied extensively
in computer science because of its intrinsic theoretical significance. Numerous algorithms
exist for sorting on sequential computers – for example, bubble sort, insertion sort, selection
sort, heap sort, merge sort and quick sort. With the advent of parallel computers, research on
sorting algorithms has become very active and, in fact, an entire book[8] has been devoted
to the subject. Most parallel sorting algorithms have been developed either in the context of
PRAM models, e.g. [9–11], or network models[12–14]. These algorithms typically assume
a large number of processors (comparable to the number of data elements) and either
neglect communication costs (PRAM algorithms) or rely on a specific machine structure
(network-based algorithms). Hence they are not usually suitable for implementation on a
BSP-like machine with high communication costs and no notion of network locality.
In this Section the performance of three parallel sorting algorithms, designed to take
advantage of data grouping (i.e. p ≪ n), is analysed. Only a single set of data inputs is
considered, so no pipelining is possible. The first algorithm, Parallel Mergesort, is a simple
parallel version of the well-known Sequential Mergesort. The second algorithm, Balanced
Mergesort, improves the processors’ utilisation by using all processors at each step of the
merging phase. Finally, the third algorithm, parallel sorting using regular sampling, not
only utilises all processors at each step but also requires just one merge step. These three
algorithms have in common a data-parallel, bulk-synchronous style. Each processor performs
the same operations (or does nothing) over disjoint groups of data. At the end of each
computation phase, the data is remapped (communications) before the next transformation
can be applied. Thus, processors can proceed independently up to the beginning of each
new step, when they all have to synchronise (bulk synchronisation).

5.1. Parallel Mergesort


This well known sort uses a divide-and-conquer paradigm. The original n-element sequence
is divided into two subsequences of n/2 elements which are each sorted in parallel. Each
subsequence can be split into two and sorted in parallel, and this process can be repeated
as many times as desired. If it is assumed that there are p processors available, where p is a
power of 2, then each processor should be given n/p elements to sort; that is, the sequence
should be split in half log p times, as shown in Figure 2 for p = 8. To produce the sorted
original sequence the sorted subsequences are merged in pairs. The merging consists of
log p steps, with step i using p/2^{i+1} processors, where i = 0 . . . log p − 1.
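A sequential C emulation of this data flow is sketched below: p blocks are sorted independently and then merged pairwise in log p rounds. In the real implementation each block sort and each merge within a round runs on a separate processor, with block reads and writes to global memory between rounds; the emulation only shows the structure.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* Linear merge of two sorted runs a[0..na) and b[0..nb) into out. */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Sequential emulation of the Parallel Mergesort data flow. */
static void parallel_mergesort(int *data, int n, int p)
{
    int block = n / p;                      /* assumes p divides n, p a power of 2 */
    int *tmp = malloc(n * sizeof(int));

    for (int b = 0; b < p; b++)             /* phase 1: every processor sorts n/p  */
        qsort(data + b * block, block, sizeof(int), cmp_int);

    for (int run = block; run < n; run *= 2) {        /* phase 2: log p merge rounds */
        for (int lo = 0; lo + run < n; lo += 2 * run) /* idle processors drop out    */
            merge(data + lo, run, data + lo + run, run, tmp + lo);
        memcpy(data, tmp, n * sizeof(int));
    }
    free(tmp);
}

int main(void)
{
    int a[16] = {9, 3, 14, 1, 12, 7, 0, 5, 11, 2, 15, 8, 6, 13, 4, 10};
    parallel_mergesort(a, 16, 4);
    for (int i = 0; i < 16; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```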

5.1.1. Analysis
Computation: The computation time of the Parallel Mergesort algorithm can be calculated
by summing the contributions of the initial sorting phase with the log p merges of the
merging phase, giving
$$\mathrm{comp}(n, p) = \mathrm{sort}\!\left(\frac{n}{p}\right) + \sum_{i=0}^{\log p - 1} \mathrm{merge}\!\left(\frac{n}{2^i}\right) \qquad (1)$$


Figure 2. Parallel Mergesort

By considering that the n/p-element sequences are all sorted using a sequential version of
mergesort, i.e. sort(n/p) = ks n/p log(n/p), and that a linear merge algorithm is used in
the second phase, i.e. merge(n) = km n, this equation can be simplified to
 
$$\mathrm{comp}(n, p) = k_s \frac{n}{p}\log\frac{n}{p} - 2 k_m \frac{n}{p} + 2 k_m n \qquad (2)$$

Communication: To calculate the communication time of the parallel mergesort algorithm,


all data movements (read/write from/to global memory) must be considered. Figure 3 shows
the action of each processor during the execution of this algorithm. The numbered squares
represent the processors available, in this case eight. Active processors are coloured grey.
At each step, the computation performed is indicated on the top, while the communications
are indicated at the bottom of the Figure. The sorting is performed as follows. Initially,
all processors read their n/p-element sequence from global memory and then sort it sequentially
in local memory. The even-numbered processors write their sequences to global memory. Then
each odd-numbered processor reads one of these sequences to its local memory and does the
merging with its own sequence. The second merge is performed by processors 1 and 5. Finally,
processor 1 reads the sequence written by processor 5, and does the final merge, writing the
result back to global memory.

Figure 3. Action of eight processors in the Parallel Mergesort
The communication time of this algorithm is given by the following equation:
$$\mathrm{comm}(n, p) = \mathrm{read}\!\left(\frac{n}{p}\right) + \mathrm{write}\!\left(\frac{n}{p}\right) + \sum_{i=0}^{\log p - 1}\left[\mathrm{read}\!\left(\frac{n}{2^{i+1}}\right) + \mathrm{write}\!\left(\frac{n}{2^{i}}\right)\right] \qquad (3)$$

Since the time to copy blocks of data to/from global memory is given by read(n) =
write(n) = T_g n, this equation can be simplified to

$$\mathrm{comm}(n, p) = \left(3n - \frac{n}{p}\right) T_g \qquad (4)$$

The total execution time of the Parallel Mergesort algorithm is then the sum of (2) and (4):

$$T_{exec}(n, p) = k_s \frac{n}{p}\log\frac{n}{p} - 2 k_m \frac{n}{p} + 2 k_m n + \left(3n - \frac{n}{p}\right) T_g \qquad (5)$$

or, substituting the values of the coefficients k_s, k_m and T_g, the total execution time, in
microseconds, is given by

$$T_{exec}(n, p) = 20.5\,n - 9.1\,\frac{n}{p} + 1.7\,\frac{n}{p}\log\frac{n}{p} \qquad (6)$$

5.1.2. Results
Figure 4 shows the predicted and measured time to sort 1K, 5K and 10K data elements,
using 2, 4, 8, 16 and 32 processors. Each point in the graph represents the average of five
runs using data generated randomly.
The measured times correspond closely to the predicted times, all lying within 2% of
them. As expected from equation (6), if n is kept constant, the execution time decreases
slightly but soon reaches a point where additional processors make little difference. For a
fixed n, as p increases, T_exec tends to 20.5n (in agreement with Figure 4). Increasing the
number of processors reduces the time spent in the sequential sort, since the sequence
assigned to each processor is smaller, but requires more work in the merge phase.

5.2. Parallel sort using a balanced merge


The problem with the previous algorithm is that it does not fully utilise all the available
processors. Initially, all processors participate in sorting their subsequences, but as the
merge phase proceeds more and more processors remain idle, the final merge step requiring
only a single processor. As a result both computation and communication have a linear
component in n. Francis and Mathieson[15] proposed a balanced merge algorithm that
utilises all processors in all steps of the merge phase.


Figure 4. Results for Parallel Mergesort

Figure 5 illustrates the algorithm for eight processors; data sets are represented by letters
and processors by numbers. Each processor initially sorts n/p data items and at every
following stage each processor finds n/p data items in a set of data items and merges them.
For example, for a system of two processors, the data would initially be split into two data
sets and each processor would sequentially sort its own data set. After this each processor
would, in parallel, find the n/2 smallest data items and the n/2 largest data items in the
complete set. This is equivalent to partitioning the two data sets each into two segments
where the number of data items in the pair of segments containing the smaller values is
the same as the number of values in the pair of segments containing the larger values.
The two segments containing the smaller values are then merged, as are the two segments
containing the larger values. The resultant data sets are then concatenated to form the result.
The calculation of the partition position in a data set is complex; the algorithm is defined
in [15].

Figure 5. Parallel Sort with Balanced Merge
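The full p-processor partitioning algorithm is the one given in [15]; the C sketch below shows only the simpler two-set case, using a binary search to find how many of the k globally smallest elements come from each of two sorted sets, which is the kind of boundary calculation involved.

```c
#include <stdio.h>

/* How many of the k globally smallest elements of two sorted sets a[0..na)
 * and b[0..nb) lie in a.  Binary search over the number taken from a;
 * this is only the two-set case, not the general algorithm of [15]. */
static int partition_point(const int *a, int na, const int *b, int nb, int k)
{
    int lo = (k > nb) ? k - nb : 0;     /* at least this many must come from a */
    int hi = (k < na) ? k : na;         /* at most this many can come from a   */
    while (lo < hi) {
        int i = lo + (hi - lo) / 2;     /* take i elements from a, k-i from b  */
        if (a[i] < b[k - i - 1])
            lo = i + 1;                 /* a[i] belongs to the k smallest      */
        else
            hi = i;
    }
    return lo;
}

int main(void)
{
    int a[] = {1, 4, 6, 9, 12, 14, 20, 23};
    int b[] = {2, 3, 5, 7, 8, 10, 11, 13};
    int k = 8;                          /* split 16 elements into two halves   */
    int i = partition_point(a, 8, b, 8, k);
    printf("%d from a, %d from b form the %d smallest\n", i, k - i, k);
    return 0;
}
```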

5.2.1. Analysis
Computation: As expected, the computation time of this algorithm does not have any linear
component in n. Initially, each processor sorts n/p elements and then they all participate
in each step of the merge phase, each one producing exactly 1/pth of the final merged data.
The computation time is given by
   
$$\mathrm{comp}(n, p) = \mathrm{sort}\!\left(\frac{n}{p}\right) + \mathrm{merge}\!\left(\frac{n}{p}\right) \log p \qquad (7)$$

By making the same considerations as previously, this equation can be simplified to


 
$$\mathrm{comp}(n, p) = k_s \frac{n}{p}\log\frac{n}{p} + k_m \frac{n}{p}\log p \qquad (8)$$

Communication: Figure 6 illustrates how such an implementation can be done. Each processor
starts by reading n/8 elements, sorting them and writing the result to global memory. The
merge phase has, in this case, three steps. In each step, the processors merge and write
n/8 elements. The partition's boundaries are calculated using a binary search.

Figure 6. Action of eight processors in the Balanced Mergesort algorithm

The decision whether to copy the entire segments to be searched to local memory or to read the elements
individually as they are required is dependent on the machine parameters and data input size.
For the machine (WPRAM simulator) and data input size (1K, 5K and 10K) considered, it
was verified experimentally that the first option is better only when the number of processors
is small (less than four). Hence it was decided to use the second alternative, where elements
are read individually. The search space size varies in each step of the merge phase; thus,
for the example considered, 2 log(n/8) elements are read in the first step, 2 log(n/4) in the second
and 2 log(n/2) in the final step. In each of these steps, once the partition’s boundaries have
been calculated, the corresponding n/8 elements are read, merged and written back. The
communication time of this algorithm is given by
   
$$\mathrm{comm}(n, p) = \mathrm{read}\!\left(\frac{n}{p}\right) + \mathrm{write}\!\left(\frac{n}{p}\right) + \sum_{i=1}^{\log p}\left[\mathrm{read}\!\left(2\log\frac{n}{2^{i}}\right) + \mathrm{read}\!\left(\frac{n}{p}\right) + \mathrm{write}\!\left(\frac{n}{p}\right)\right] \qquad (9)$$

which can be simplified to


 
$$\mathrm{comm}(n, p) = (2\log n - \log p - 1)\log p\; T_i + \left(2\frac{n}{p} + 2\frac{n}{p}\log p\right) T_g \qquad (10)$$
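The T_i term in equation (10) comes from summing the individual binary-search reads over the log p merge steps:

$$\sum_{i=1}^{\log p} 2\log\frac{n}{2^i} = 2\log p\,\log n - 2\sum_{i=1}^{\log p} i = (2\log n - \log p - 1)\log p$$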

The total execution time of the Balanced Mergesort algorithm is then found by adding (8)
to (10):
   
$$T_{exec}(n, p) = k_s \frac{n}{p}\log\frac{n}{p} + 2 T_g \frac{n}{p} + (k_m + 2)\frac{n}{p}\log p + (2\log n - \log p - 1)\log p\; T_i \qquad (11)$$
By substituting the values of the coefficients k_s, k_m, T_i and T_g, equation (11) can be
simplified to

$$T_{exec}(n, p) = 11.4\,\frac{n}{p} + 1.7\,\frac{n}{p}\log\frac{n}{p} + 3.7\,\frac{n}{p}\log p + 94\log n\,\log p - 47(\log p)^2 - 47\log p \qquad (12)$$
which gives the total execution time in microseconds.
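For the same illustrative case as before (1K = 1024 elements, p = 8, so n/p = 128, log(n/p) = 7, log n = 10 and log p = 3), equation (12) gives

$$T_{exec}(1024, 8) \approx 1459 + 1523 + 1421 + 2820 - 423 - 141 \approx 6659\ \mu\mathrm{s},$$

roughly a third of the ≈21 ms predicted by equation (6) for the plain Parallel Mergesort on the same input.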

5.2.2. Results
The predicted and measured time for the Balanced Mergesort algorithm is shown in Figure 7.
As with the Mergesort algorithm, the results were obtained for input sizes of 1K, 5K and
10K, using 2, 4, 8, 16 and 32 processors. As can be seen from the graph, the measured
values are all within 1% of the predicted values.
The Balanced Mergesort algorithm removes the bottleneck existing in the previous
algorithm by distributing the work more equally among the processors. Thus the execution
times are much lower than those obtained previously.


Figure 7. Results for Parallel Balanced Mergesort

5.3. Parallel sort using regular sampling


The two previous parallel sorting algorithms have the same basic structure with a single
full sort step and log p merge steps. Li et al.[16] developed a parallel sorting algorithm,
called Parallel Sort using Regular Sampling (PSRS), which goes to the extreme of having
only a single merge step. Figure 8 illustrates how the algorithm works when sorting a list
of 27 elements using three processors.

Figure 8. Parallel Sort with Regular Sampling


The algorithm has four phases. Assuming there are n elements to be sorted and p
processors available, in the first phase each processor sorts n/p elements using a sequential
sorting algorithm and selects the elements located at positions 1, n/p^2 + 1, 2n/p^2 +
1, . . . , (p − 1)(n/p^2) + 1, of its n/p-element sorted sequence, as a regular sample. In the
second phase, splitters calculation, one processor gathers and sorts all regular samples and
then selects the elements at positions p + p/2, 2p + p/2, . . . , (p − 1)p + p/2 of its p^2-element
sorted sequence, to be used as splitters in the next phase. In the distribution phase, every
processor uses the splitters to partition its sequence into p disjoint subsequences. Binary
search is used with each splitter to determine the partitions. Next, each processor i keeps its
ith partition and assigns the jth partition to processor j. Finally, in the merge phase, each
processor merges its p partitions into a single sequence. Although the number of elements
each processor merges may vary, it has been proved that this number is never greater than
2n/p.
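The index arithmetic of the first two phases can be made concrete with the short C sketch below, which uses 0-based C indexing in place of the 1-based positions quoted above (and integer division where p is odd); the data values and the single-process structure are purely illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* Phase 1 sampling: from one processor's sorted n/p-element block, take the
 * p elements at 1-based positions 1, n/p^2 + 1, 2n/p^2 + 1, ..., i.e. at
 * 0, n/p^2, 2n/p^2, ... in 0-based indexing. */
static void take_regular_sample(const int *sorted_block, int n, int p, int *sample)
{
    int stride = n / (p * p);
    for (int j = 0; j < p; j++)
        sample[j] = sorted_block[j * stride];
}

/* Phase 2: sort the p*p gathered samples and pick p-1 splitters at 1-based
 * positions p + p/2, 2p + p/2, ..., i.e. j*p + p/2 - 1 in 0-based indexing. */
static void choose_splitters(int *all_samples, int p, int *splitters)
{
    qsort(all_samples, p * p, sizeof(int), cmp_int);
    for (int j = 1; j < p; j++)
        splitters[j - 1] = all_samples[j * p + p / 2 - 1];
}

int main(void)
{
    enum { N = 27, P = 3 };                 /* the 27-element, 3-processor example */
    int data[N], samples[P * P], splitters[P - 1];

    for (int i = 0; i < N; i++) data[i] = (7 * i + 3) % N;   /* some permutation    */
    for (int b = 0; b < P; b++) {
        qsort(data + b * (N / P), N / P, sizeof(int), cmp_int);  /* local sort      */
        take_regular_sample(data + b * (N / P), N, P, samples + b * P);
    }
    choose_splitters(samples, P, splitters);
    printf("splitters: %d %d\n", splitters[0], splitters[1]);
    return 0;
}
```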

5.3.1. Analysis
Computation: The computation time of the PSRS algorithm is calculated by summing up
the contributions of each phase: sorting of the initial n/p elements, sorting of the samples,
calculation of the partitions and merging of the p subsequences:
     
$$\mathrm{comp}(n, p) = \mathrm{sort}\!\left(\frac{n}{p}\right) + \mathrm{sort}(p^2) + (p - 1)\log\frac{n}{p} + \mathrm{merge}\!\left(\frac{n}{p}\right)\log p \qquad (13)$$

This equation can be rewritten as


   
$$\mathrm{comp}(n, p) = k_s\, p^2 \log p^2 + \left(k_s \frac{n}{p} + p - 1\right)\log\frac{n}{p} + k_m \frac{n}{p}\log p \qquad (14)$$

Communication: Figure 9 shows the action of each processor during the execution of the
PSRS algorithm. Initially every processor reads its n/8-element sequence and sorts it. The
regular samples are then selected and written to the global memory. Next, processor 1 reads
all the samples, sorts them and writes the splitters to the global memory. Every processor
then reads the splitters and uses them to find its partitions. The indexes delimiting the local
partitions are then written to the global memory. All these indexes are subsequently read
by each processor, and the n/8-element sequence, sorted previously, is written back to the
global memory. Knowing the indexes delimiting its partitions, processor i now keeps its ith
partition and reads p − 1 other partitions, one each from the other processors.
The communication time of the PSRS algorithm is given by
   
$$\mathrm{comm}(n, p) = 2\,\mathrm{read}\!\left(\frac{n}{p}\right) + 2\,\mathrm{write}\!\left(\frac{n}{p}\right) + \mathrm{read}(p) + 3\,\mathrm{write}(p) + 2\,\mathrm{read}(p^2) \qquad (15)$$

which can be simplified to


 
$$\mathrm{comm}(n, p) = \left(2p^2 + 4p + 4\frac{n}{p}\right) T_g \qquad (16)$$


Figure 9. Action of eight processors in the Sort using Regular Sampling

The total execution time is given by the following equation:


   
$$T_{exec}(n, p) = k_s\, p^2 \log p^2 + k_m \frac{n}{p}\log p + k_s \frac{n}{p}\log\frac{n}{p} + \left(2p^2 + 4p + 4\frac{n}{p}\right) T_g \qquad (17)$$

which can be rewritten as


   
$$T_{exec}(n, p) = 1.7\,p^2 \log p^2 + 1.7\,\frac{n}{p}\log p + 1.7\,\frac{n}{p}\log\frac{n}{p} + \left(2p^2 + 4p + 4\frac{n}{p}\right)5.7 \qquad (18)$$
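Evaluating equation (18) for an illustrative input of 1K = 1024 elements shows how the p^2 terms eventually dominate:

$$T_{exec}(1024, 4) \approx 10\,571\ \mu\mathrm{s},\qquad T_{exec}(1024, 8) \approx 6659\ \mu\mathrm{s},\qquad T_{exec}(1024, 16) \approx 9312\ \mu\mathrm{s},$$

so for this input size the equation predicts an optimum at around eight processors.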

5.3.2. Results
The predicted and measured time obtained for the PSRS algorithm is shown in Figure 10.
The same range of input sizes (1K, 5K, 10K) and number of processors (2, 4, 8, 16, 32)
were used as in the previous examples. From the graph it can be seen that the measured
and predicted times are within 5% of each other, the difference being more accentuated
when the number of processors is small. This is because equation (15) was obtained by
assuming that each processor reads a total of n/p elements in the distribution phase,
which is an overestimation. Although the exact number of elements read in this phase is
data dependent (the partitions depend on the input data), a better approximation could be
obtained with [(p − 1)/p][read(n/p)] since roughly 1/p of the data are already in local
memory. However, this would further complicate the expression which gives the total
execution time – equation (18).


Figure 10. Results for Sort using Regular Sampling

The behaviour of the curves obtained for the PSRS algorithm differs from the curves for
the previous algorithms in that they reach a minimum and then increase. This is because the
PSRS algorithm works best when the number of samples is much less than the number of
elements to be sorted. As the number of processors increases, the overheads of organising
the recombination of the sorted segments outweigh the advantages of performing the
sorting in parallel. However, for a relatively small number of processors, this algorithm
produces better results than the previous ones.

6. COMPARISON
The computation and communication contributions, together with the total execution time,
of the three algorithms, are shown in Figure 11 for the case of 1K elements. As can be seen,
the Parallel Mergesort has the worst performance and benefits least from an increase in the
number of processors. Because its merging steps have to deal with an increasing number of
elements, the resulting communication costs remain quite high. The Balanced Mergesort
gives the best performance overall. By dividing the work to be done in the merging steps
more equally, the communication costs are substantially reduced. For fixed input data,
as the number of processors increases, there is a decrease in the number of elements each
processor has to work with. This results in less time spent computing and moving data in
each step, but more steps are required to complete the work. Parallel sorting using regular
sampling has good performance but this is critically dependent on the number of processors.
When compared to the Balanced Mergesort, its best performance is slightly inferior but is
achieved using fewer processors. Thus, if there are not many processors available and their
number can be tailored to the size of the data set, regular sampling is a good technique
to be used. Otherwise, Balanced Mergesort is better since its performance improves more
uniformly with the number of processors.

Figure 11. Computation, communication and total execution times of the three sorting algorithms

7. CONCLUSIONS
In this paper we have shown how to predict the performance of some well-known algorithms
on a general purpose parallel architecture using simple analysis of the communication
and computation costs. By deriving performance equations for the algorithms we have
shown that similar results to those produced by the simulator, which includes a much
more detailed performance model, can be obtained. Such analysis can be helpful when
developing algorithms for real parallel machines, since approximate execution times can
be obtained without the step of coding and executing the algorithms. This advantage is
evident with the regular sampling algorithm, where the optimum number of processors can be
determined theoretically instead of by trial and error. However, the performance analysis
developed is only applicable when the communication requirements of the program are
statically predictable.
From the sorting algorithms implemented, the Parallel Mergesort has the worst perfor-
mance because it does not fully utilise all the available processors. The Balanced Mergesort
overcomes this problem by making sure that each processor has the same amount of work
during the execution, and thus improves the performance. Parallel sorting using regular
sampling makes better use of the processors than Parallel Mergesort but relies on a single
processor to sort the regular samples which results in a bottleneck when the number of
processors is large. Its performance is thus critically dependent on the number of proces-
sors and, unless this number can be tailored to the size of the data set, Balanced Mergesort
produces better performance.
The BSP (automatic mode) and the WPRAM models provide a global shared memory
abstraction which offers a single address space and automatically maps data to processors
by the use of hash functions. However, when the cost of accessing global memory is high,
this facility cannot be utilised efficiently. To overcome this problem, data locality can be
exploited by copying global data to (unhashed) local memory and including synchronisation
points in the program so as to keep memory coherent, i.e. taking advantage of weak
coherency. This was the technique used in the sorting implementations described. However,
even though substantial improvements in performance can be obtained with this approach,
too much burden is placed on the programmer, who has to be concerned with memory
management in addition to algorithm design.

ACKNOWLEDGEMENTS
We would like to thank Jonathan Nash for the WPRAM simulator. One of the authors,
W. S. Martins, acknowledges the support of both the Depart. Estatística e Informática,
UFG, Goiânia, Brazil, and CNPq, Brazil, research grant number 200721/91-7.

REFERENCES
1. D. E. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian and
T. von Eicken, ‘LogP: Towards a realistic model of parallel computation’, Proceedings of the 4th
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPOPP, San
Diego, California, Vol. 28, ACM Press, May 1993, pp. 1–12.
2. W. F. McColl, ‘General purpose parallel computing’, in A. M. Gibbons and P. Spirakis (Eds.),
Lectures on Parallel Computation, Cambridge International Series on Parallel Computation,
Vol. 4, Cambridge University Press, Cambridge, UK, 1993, pp. 337–391.


3. J. M. Nash and P. Dew, ‘Parallel algorithm design on the XPRAM model’, Proceedings of the
2nd Abstract Machines Workshop, Leeds, 1993.
4. L. G. Valiant, ‘A bridging model for parallel computation’, Commun. ACM, 33, 103–111 (1990).
5. L. G. Valiant, ‘General purpose parallel architectures’, in J. van Leeuwen (Ed.), The Handbook
of Theoretical Computer Science, North Holland, 1990.
6. J. M. Nash, ‘A study of the XPRAM model for parallel computing’, PhD thesis, University of
Leeds, 1993.
7. J. M. Nash, M. E. Dyer and P. M. Dew, ‘Designing practical parallel algorithms for scalable
message passing machines’, Proceedings of the 1995 World Transputer Congress, September
1995, pp. 529–541.
8. S. G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, FL, 1985.
9. A. Borodin and J. Hopcroft, ‘Routing, merging and sorting on parallel models of computation’,
J. Comput. Syst. Sci., 30, 130–145 (1985).
10. R. Cole, ‘Parallel merge sort’, SIAM J. Comput., 17(4), 770–785 (1988).
11. F. P. Preparata, ‘New parallel sorting schemes’, IEEE Trans. Comput., C-27(7), 669–673 (1978).
12. B. Abali, F. Ozguner and A. Bataineh, ‘Balanced parallel sort on hypercube multiprocessors’,
IEEE Trans. Parallel Distrib. Syst., 4(5), 572–581 (1993).
13. Q. F. Stout, ‘Sorting, merging, selecting and filtering on tree and pyramid machines’, Proceedings
of the International Conference on Parallel Processing, IEEE, New York, August 1983, pp. 214–
221.
14. C. D. Thompson and H. T. Kung, ‘Sorting on a mesh-connected parallel computer’, Commun.
ACM, 20(4), 263–271 (1977).
15. R. S. Francis and I. D. Mathieson, ‘A benchmark parallel sort for shared memory multiproces-
sors’, IEEE Trans. Comput., 37(12), 1619–1626 (1988).
16. X. Li, P. Lu, J. Schaeffer, J. Shillington, P. S. Wong and H. Shi, ‘On the versatility of parallel
sorting by regular sampling’, Technical Report TR 91-06, March 1991, Department of Computing
Science, University of Alberta, Edmonton, Alberta, Canada.
