Comp422 2011 Lecture19 Sorting

Parallel Sorting

John Mellor-Crummey
Department of Computer Science, Rice University

COMP 422

Lecture 19, 5 April 2011

Topics for Today

Introduction
Issues in parallel sorting
Sorting networks and Batcher's bitonic sort
Bubble sort and odd-even transposition sort
Parallel quicksort

Why Study Parallel Sorting?

One of the most common operations performed
Close relation to the task of routing on parallel computers
  e.g. HPC Challenge RandomAccess benchmark

Sorting Algorithm Attributes

Internal vs. external
  internal: data fits in memory
  external: uses tape or disk

Comparison-based or not
  comparison sort
    basic operation: compare elements and exchange as necessary
    Θ(n log n) comparisons to sort n numbers

  non-comparison-based sort
    e.g. radix sort based on the binary representation of data
    Θ(n) operations to sort n numbers

Parallel vs. sequential

Parallel Sorting Is Intrinsically Interesting

Different algorithms for different architecture variants

Abstract parallel architecture


Network topology
hypercube, mesh

Communication mechanism
shared address space, message passing

Today's focus: parallel comparison-based sorting


Parallel Sorting Basics

Where are the input and output lists stored?
we assume that both input and output lists are distributed

What is a parallel sorted sequence?

sequence partitioned among the processors
each processor's sub-sequence is sorted
all elements in P_j's sub-sequence < all elements in P_k's sub-sequence if j < k
  the best process numbering can depend on network topology

Element-wise Parallel Compare-Exchange

When partitioning is one element per process

1. Processes P_j and P_k send their elements to each other [communication step]
   before: P_j holds a_j; P_k holds a_k
   after: each process now has both elements (a_j and a_k)

2. Process P_j keeps min(a_j, a_k), and P_k keeps max(a_j, a_k) [comparison step]

Bulk Parallel Compare-Split

1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block
   P_i retains smaller values; process P_j retains larger values
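
The four steps above can be sketched sequentially in Python. This is only a simulation under stated assumptions: both blocks are already sorted and of equal size, and the function name `compare_split` is a hypothetical helper, not something from the slides. The exchange of blocks is modeled simply by passing both lists to one function.

```python
import heapq

def compare_split(block_i, block_j):
    """Simulate one compare-split between partners P_i and P_j (i < j).

    Assumes both blocks are already sorted and have the same length n/p.
    Returns (low, high): the block P_i retains and the block P_j retains.
    """
    size = len(block_i)
    # Merging two sorted blocks takes time linear in their total length.
    merged = list(heapq.merge(block_i, block_j))
    return merged[:size], merged[size:]

low, high = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
```

On a real machine each partner would perform the merge itself and keep only its own half.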

Basic Analysis

P_i and P_j are neighbors
communication channels are bi-directional
(t_s: message startup time; t_w: per-word transfer time)

Elementwise compare-exchange: 1 element per processor

time = $t_s + t_w$

Bulk compare-split: n/p elements per processor

after compare-split on pair of processors P_i and P_j, i < j
  smaller n/p elements are at processor P_i
  larger n/p elements are at processor P_j

time = $t_s + t_w \, n/p$

merge in O(n/p) time, as long as partial lists are sorted

Sorting Network

Network of comparators designed for sorting
Comparator: two inputs x and y; two outputs x' and y'
  increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
  decreasing (denoted ⊖): x' = max(x,y) and y' = min(x,y)

Sorting time is proportional to the depth of the network


Sorting Networks

Network structure: a series of columns
Each column consists of a vector of comparators (in parallel)
Sorting network organization:


Example: Bitonic Sorting Network

Bitonic sequence
two parts: increasing and decreasing
1,2,4,7,6,0: first increases and then decreases (or vice versa)

cyclic rotation of a bitonic sequence is also considered bitonic

8,9,2,1,0,4: cyclic rotation of 0,4,8,9,2,1
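
A small Python check can make the definition concrete. The sketch below is one possible test, not taken from the slides: it walks once around the sequence treated as a cycle and counts how often the direction of change flips. The function name `is_bitonic` is hypothetical, and equal neighbors are simply ignored, which is an assumption about how ties should be treated.

```python
def is_bitonic(seq):
    """Return True if seq is bitonic, allowing cyclic rotations."""
    n = len(seq)
    # Signs of the cyclic differences, skipping equal neighbors.
    signs = []
    for i in range(n):
        diff = seq[(i + 1) % n] - seq[i]
        if diff != 0:
            signs.append(1 if diff > 0 else -1)
    # Count direction flips while walking once around the cycle.
    # signs[k - 1] wraps to the last entry when k == 0.
    flips = sum(1 for k in range(len(signs)) if signs[k] != signs[k - 1])
    return flips <= 2

examples = [[1, 2, 4, 7, 6, 0], [8, 9, 2, 1, 0, 4], [1, 3, 2, 4]]
results = [is_bitonic(s) for s in examples]
```

The first two sequences are the examples from this slide; the third rises, falls, and rises again, so it is not bitonic under any rotation.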

Bitonic sorting network

sorts n elements in Θ(log² n) time
network kernel: rearranges a bitonic sequence into a sorted one


Bitonic Split

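Let $s = \langle a_0, a_1, \ldots, a_{n-1} \rangle$ be a bitonic sequence such that

$a_0 \le a_1 \le \cdots \le a_{n/2-1}$, and $a_{n/2} \ge a_{n/2+1} \ge \cdots \ge a_{n-1}$

Consider the following subsequences of s

$s_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \ldots, \min(a_{n/2-1}, a_{n-1}) \rangle$

$s_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \ldots, \max(a_{n/2-1}, a_{n-1}) \rangle$

Sequence properties
  $s_1$ and $s_2$ are both bitonic
  $\forall x \, \forall y: \; x \in s_1, \; y \in s_2 \Rightarrow x < y$

Apply recursively on $s_1$ and $s_2$ to produce a sorted sequence
Works for any bitonic sequence, even if $|s_1| \ne |s_2|$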
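
As an illustration, a single bitonic split can be written in a few lines of Python. This is a sketch that assumes an even-length input and distinct elements; `bitonic_split` is a hypothetical name chosen for this example.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length into (s1, s2).

    s1 takes the element-wise minimum of the two halves and s2 the
    element-wise maximum, following the definitions of s1 and s2 above.
    """
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2

s1, s2 = bitonic_split([3, 5, 8, 9, 7, 4, 2, 1])
```

For this input every element of `s1` is smaller than every element of `s2`, and each half is again bitonic, so the split can be applied recursively.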

Splitting Bitonic Sequences - I

Sequence properties
  $s_1$ and $s_2$ are both bitonic
  $\forall x \, \forall y: \; x \in s_1, \; y \in s_2 \Rightarrow x < y$




Splitting Bitonic Sequences - II

Sequence properties
  $s_1$ and $s_2$ are both bitonic
  $\forall x \, \forall y: \; x \in s_1, \; y \in s_2 \Rightarrow x < y$




Bitonic Merge

Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort 16-element bitonic sequence
How: perform a series of $\log_2 16 = 4$ bitonic splits
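
A recursive Python sketch of bitonic merge follows. It assumes the input is bitonic and its length is a power of two; the name `bitonic_merge` and the `ascending` flag are choices made for this example only. The 16-element input is an arbitrary bitonic sequence used for illustration.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    if len(s) == 1:
        return list(s)
    half = len(s) // 2
    # One bitonic split: element-wise min and max of the two halves.
    lows = [min(s[i], s[i + half]) for i in range(half)]
    highs = [max(s[i], s[i + half]) for i in range(half)]
    # Both halves are bitonic again, so recurse on each of them.
    if ascending:
        return bitonic_merge(lows, True) + bitonic_merge(highs, True)
    return bitonic_merge(highs, False) + bitonic_merge(lows, False)

bitonic_input = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
merged_up = bitonic_merge(bitonic_input)
merged_down = bitonic_merge(bitonic_input, ascending=False)
```

With 16 elements the recursion is four levels deep, one level per bitonic split.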


Sorting via Bitonic Merging Network

Sorting network can implement bitonic merge algorithm
  bitonic merging network

Network structure
  log_2 n columns
  each column
    n/2 comparators
    performs one step of the bitonic merge

Bitonic merging network with n inputs: ⊕BM[n]
  yields increasing output sequence

Replacing ⊕ comparators by ⊖ comparators: ⊖BM[n]
  yields decreasing output sequence


Bitonic Merging Network, BM[16]

Input: bitonic sequence

input wires are numbered 0, 1, ..., n - 1 (shown in binary)

Output: sequence in sorted order
Each column of comparators is drawn separately


Bitonic Sort

How do we sort an unsorted sequence using a bitonic merge?
Two steps
  Build a bitonic sequence
  Sort it using a bitonic merging network


Building a Bitonic Sequence

Build a single bitonic sequence from the given sequence

any sequence of length 2 is a bitonic sequence
build bitonic sequence of length 4
  sort first two elements using ⊕BM[2]
  sort next two using ⊖BM[2]

Repeatedly merge to generate larger bitonic sequences
⊕BM[k] & ⊖BM[k]: bitonic merging networks of size k


Building a Bitonic Sequence

Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers


Bitonic Sort, n = 16

First 3 stages create bitonic sequence input to stage 4
Last stage (⊕BM[16]) yields sorted sequence

Complexity of Bitonic Sorting Networks

Depth of the network is Θ(log² n)

  log_2 n merge stages
  depth of the jth merge stage is $\log_2 2^j = j$

$$\text{depth} = \sum_{j=1}^{\log_2 n} \log_2 2^j = \sum_{j=1}^{\log_2 n} j = \frac{(\log_2 n + 1)(\log_2 n)}{2} = \Theta(\log^2 n)$$

Each column of the network contains n/2 comparators
Complexity of serial implementation = Θ(n log² n)
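
The depth and comparator counts can be checked by generating the network explicitly. The Python sketch below builds the comparator columns with the usual iterative index pattern (an assumption about wire layout, since the slide figures are not reproduced here) and then runs values through them. The names `bitonic_network` and `run_network` are hypothetical.

```python
def bitonic_network(n):
    """Return the comparator columns of a bitonic sorting network.

    n must be a power of two. Each column is a list of (lo, hi) wire
    pairs: after the comparator, wire lo holds the minimum and wire hi
    the maximum of the two inputs.
    """
    columns = []
    k = 2
    while k <= n:                 # merge stage producing sorted runs of length k
        j = k // 2
        while j >= 1:             # one column per bitonic split in this stage
            column = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if i & k == 0:
                        column.append((i, partner))   # increasing comparator
                    else:
                        column.append((partner, i))   # decreasing comparator
            columns.append(column)
            j //= 2
        k *= 2
    return columns

def run_network(columns, values):
    """Apply every comparator column in order to a copy of values."""
    wires = list(values)
    for column in columns:
        for lo, hi in column:
            if wires[lo] > wires[hi]:
                wires[lo], wires[hi] = wires[hi], wires[lo]
    return wires

columns = bitonic_network(16)
data = [10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]
result = run_network(columns, data)
```

For n = 16 the network has (4 + 1)(4)/2 = 10 columns of 8 comparators each, which matches the depth formula.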


Mapping Bitonic Sort to a Hypercube

Consider one item per processor

How do we map wires in the bitonic network onto a hypercube?
In earlier examples
  compare-exchange occurs only between two wires whose labels differ in 1 bit

Direct mapping of wires to processors
  all communication is nearest neighbor
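
To make the mapping concrete, here is a sequential Python simulation with one item per hypercube node. It is a sketch under assumptions: the rule for which partner keeps the minimum follows the common textbook formulation (compare a higher label bit with the bit of the current dimension), and the function name `hypercube_bitonic_sort` is hypothetical. Every exchange pairs a node with the neighbor whose label differs in exactly one bit.

```python
def hypercube_bitonic_sort(values):
    """Simulate bitonic sort with one item per hypercube node.

    len(values) must be a power of two. Returns the sorted list and the
    hypercube dimension used in each communication step.
    """
    n = len(values)
    d = n.bit_length() - 1            # hypercube dimension, n == 2**d
    data = list(values)
    dims_used = []
    for stage in range(d):            # stage produces runs of length 2**(stage+1)
        for dim in range(stage, -1, -1):
            dims_used.append(dim)
            nxt = list(data)
            for label in range(n):
                partner = label ^ (1 << dim)   # neighbor across dimension dim
                bit_stage = (label >> (stage + 1)) & 1
                bit_dim = (label >> dim) & 1
                if bit_stage != bit_dim:
                    nxt[label] = max(data[label], data[partner])
                else:
                    nxt[label] = min(data[label], data[partner])
            data = nxt
    return data, dims_used

sorted_items, dims = hypercube_bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14])
```

On a 3-dimensional cube the steps use dimensions 0, then 1 and 0, then 2, 1 and 0, so each merge stage reaches one dimension further than the previous one.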


Mapping Bitonic Merge to a Hypercube

Communication during the last merge stage of bitonic sort

Each number is mapped to a hypercube node
Each connection represents a compare-exchange


Mapping Bitonic Sort to Hypercubes

Communication in bitonic sort on a hypercube

Processes communicate along dims shown in each stage

Algorithm is cost optimal w.r.t. its serial counterpart
Not cost optimal w.r.t. the best sorting algorithm

Batcher's Bitonic Sort in NESL

    function merge(a) =
      if (#a == 1) then a
      else
        let halves = bottop(a);
            mins = {min(x, y) : x in halves[0]; y in halves[1]};
            maxs = {max(x, y) : x in halves[0]; y in halves[1]};
        in flatten({merge(x) : x in [mins,maxs]});

    function bitonic_sort(a) =
      if (#a == 1) then a
      else
        let b = {bitonic_sort(x) : x in bottop(a)};
        in merge(b[0]++reverse(b[1]));

    bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

Try it at:

Bubble Sort and Variants

Sequential bubble sort algorithm
Compares and exchanges adjacent elements in the sequence


Bubble Sort and Variants

Bubble sort complexity: Θ(n²)
Difficult to parallelize
  algorithm has no concurrency

A simple variant uncovers concurrency


Sequential Odd-Even Transposition Sort


Odd-Even Transposition Sort, n = 8

In each phase, n = 8 elements are compared


Odd-Even Transposition Sort

After n phases of odd-even exchanges, sequence is sorted
Each phase of algorithm requires Θ(n) comparisons
Serial complexity is Θ(n²)
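
A sequential version is short enough to sketch in Python. Note the assumptions: indices are 0-based here (the slides number elements from 1), so the phases alternate between pairs starting at index 0 and pairs starting at index 1. The function name is hypothetical.

```python
def odd_even_transposition_sort(values):
    """Sort a copy of values using n phases of odd-even exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Alternate between pairs (0,1), (2,3), ... and pairs (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

sorted_values = odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1])
```

The compare-exchanges inside one phase touch disjoint pairs, which is the concurrency that the parallel formulation exploits.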


Parallel Odd-Even Transposition

Consider one item per processor

n iterations
in each iteration, each processor does one compare-exchange

Parallel run time of this formulation is Θ(n)
Cost optimal with respect to the base serial algorithm
  but not the optimal one!


Parallel Odd-Even Transposition Sort

note: if partner id < 1 or > n, then skip compare



Quicksort

Popular sequential sorting algorithm
simplicity, low overhead, optimal average complexity

select an entry in the sequence to be the pivot
divide the sequence into two parts
  one with all elements less than the pivot
  the other with the remaining, greater elements

Apply the process recursively to each sublist


Parallelizing Quicksort

First, recursive decomposition
  partition the list serially
  handle each subproblem on a different processor

Time for this algorithm is lower-bounded by Ω(n)!
Can we parallelize the partitioning step?
  can we use n processors to partition a list of length n around a pivot in O(1) time?

Tricky on real machines


Practical Parallel Quicksort

Each processor initially responsible for n/p elements

Shared memory formulation

select first pivot & broadcast
each processor partitions own data
globally rearrange data into smaller and larger parts (in place)
recurse with proportional # processors on each part

Message passing formulation

each processor first partitions local portion of array
determine which processes will be responsible for each partition
  (based on size of smaller than pivot and larger than pivot groups)
divide up the data among the processor subsets responsible for each part
continue recursively


Data Parallel Quicksort in NESL


Total work is O(n log n)
Recursion depth is O(log n)
Depth of each operation is constant

Total depth is O(log n) as well
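
The code listing for this slide is not reproduced in these notes. As a stand-in, the Python sketch below shows quicksort written in a data-parallel style; it is an illustration, not the NESL listing from the lecture. Each list comprehension plays the role of a parallel filter over the whole sequence, and the two recursive calls are independent of each other. Choosing the middle element as pivot is an assumption made for this example.

```python
def quicksort(a):
    """Quicksort written as whole-sequence filters plus two independent recursions."""
    if len(a) < 2:
        return list(a)
    pivot = a[len(a) // 2]                   # assumed pivot choice: middle element
    lesser = [e for e in a if e < pivot]     # each filter is a constant-depth step
    equal = [e for e in a if e == pivot]     #   in a data-parallel model
    greater = [e for e in a if e > pivot]
    # The two recursive calls could run in parallel.
    return quicksort(lesser) + equal + quicksort(greater)

unsorted = [8, 14, -8, -9, 5, -9, -3, 0, 17, 19]
ordered = quicksort(unsorted)
```

Python evaluates the comprehensions and recursive calls one after another; the parallelism is only implied by the structure of the code.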

Other Sorting Algorithms

Shellsort - another variant of bubble sort

two stage process
  log p rounds of long distance exchanges
  followed by rounds of odd-even transposition sort until done

key idea: long distance moves of first stage reduce number of rounds necessary in second stage

Radix sort: in a series of rounds, sort elements into buckets by digit

Bucket and sample sort
  assumes evenly distributed items in an interval
  buckets represent evenly-sized subintervals

Enumeration sort:
  determine rank of each element
  place it in the correct position
CRCW PRAM algorithm: n² processors, sort in Θ(1) time
  assumes that all concurrent writes to a location deposit their sum
  n processes in column j test element j against the rest; each writes 1 into C[j] for every element that ranks before element j
  place A[j] into A[C[j]]
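
The rank computation can be simulated sequentially in Python. In this sketch each (i, j) comparison stands for one of the n² PRAM processors, and incrementing `rank[j]` models the summing concurrent write into C[j]. Breaking ties by index is an assumption added so that equal keys receive distinct ranks; the function name is hypothetical.

```python
def enumeration_sort(a):
    """Sort by computing each element's rank, then placing it directly."""
    n = len(a)
    rank = [0] * n
    for j in range(n):
        for i in range(n):
            # One PRAM processor per (i, j) pair; ties are broken by index.
            if a[i] < a[j] or (a[i] == a[j] and i < j):
                rank[j] += 1      # models a summing concurrent write to C[j]
    out = [None] * n
    for j in range(n):
        out[rank[j]] = a[j]       # place A[j] at position C[j]
    return out

ranked = enumeration_sort([5, 2, 9, 2, 7])
```

The simulation takes Θ(n²) sequential time; the constant parallel time on the PRAM comes from doing all n² comparisons at once.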


Other Sorting Algorithms (Cont)

Histogram Sorting
goal: divide keys into p evenly sized pieces
  use an iterative approach to do so
initiating processor broadcasts k > p-1 splitter guesses
each processor determines how many keys fall in each bin
sum histogram with global reduction
one processor examines guesses to see which are satisfactory
broadcast finalized splitters and number of keys for each processor
each processor sends local data to appropriate processors using all-to-all communication
each processor merges chunks it receives
Kale and Solomonik improved this (IPDPS 2010)
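
One round of the histogramming step can be sketched in Python. This is a simplified, sequential illustration with hypothetical names (`local_histogram`, `global_histogram`): each inner list stands for one processor's keys, the splitter guesses are assumed to be sorted, and the element-wise sum stands in for the global reduction. It does not attempt the iterative refinement of the guesses.

```python
from bisect import bisect_right

def local_histogram(sorted_keys, splitters):
    """Count how many local keys fall in each bin defined by the splitters."""
    cuts = [bisect_right(sorted_keys, s) for s in splitters]
    bounds = [0] + cuts + [len(sorted_keys)]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]

def global_histogram(per_processor_keys, splitters):
    """Sum the local histograms, standing in for a global reduction."""
    hist = [0] * (len(splitters) + 1)
    for keys in per_processor_keys:
        local = local_histogram(sorted(keys), splitters)
        hist = [h + c for h, c in zip(hist, local)]
    return hist

keys = [[1, 4, 9, 12], [2, 3, 15, 20], [5, 6, 7, 30]]
hist = global_histogram(keys, [4, 10])
```

Here the two guesses happen to split the 12 keys into three bins of 4 keys each, so both would be accepted for p = 3.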



References

Adapted from slides "Sorting" by Ananth Grama
Based on Chapter 9 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.
Guy Blelloch. "Programming Parallel Algorithms." Communications of the ACM, volume 39, number 3, March 1996.
Edgar Solomonik and Laxmikant V. Kale. "Highly Scalable Parallel Sorting." Proceedings of IPDPS 2010.


