Chap9 Slides
• Sorting Networks
• Quicksort
[Figure: a compare-split operation. In Step 1, processes Pi and Pj exchange their sorted blocks (2 7 9 10 12 and 1 6 8 11 13). In Step 2, each process merges the two blocks (1 2 6 7 8 9 10 11 12 13). In Steps 3 and 4, Pi retains the smaller half (1 2 6 7 8) and Pj retains the larger half (9 10 11 12 13).]
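The compare-split operation can be simulated sequentially in a few lines (the function name and list-based simulation are illustrative, not from the slides):

```python
def compare_split(block_i, block_j):
    """Simulate a compare-split between two processes holding sorted
    blocks: both merge the blocks (Steps 1-2); the lower-ranked process
    keeps the smaller half, the higher-ranked the larger (Steps 3-4)."""
    merged = sorted(block_i + block_j)
    k = len(block_i)
    return merged[:k], merged[k:]
```

Running it on the figure's data returns the blocks `[1, 2, 6, 7, 8]` and `[9, 10, 11, 12, 13]`.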
[Figure (b): a decreasing comparator: for inputs x and y, the outputs are x′ = max{x, y} and y′ = min{x, y}.]
[Figure: a typical sorting network: input wires pass through columns of comparators (Step 1 through Step 4) of an interconnection network to produce sorted output wires.]
T_P = Θ(log² n)   (2)
• This algorithm is cost-optimal with respect to its serial counterpart, but not with respect to the best serial sorting algorithm.
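The compare-exchange pattern of the bitonic sorting network can be simulated sequentially; the sketch below is a recursive simulation (not the network itself), assuming the input length is a power of two:

```python
def bitonic_merge(a, ascending):
    """Merge a bitonic sequence into sorted order: compare-exchange
    elements a[i] and a[i + n/2], then merge each half recursively."""
    if len(a) == 1:
        return list(a)
    a = list(a)
    mid = len(a) // 2
    for i in range(mid):
        if (a[i] > a[i + mid]) == ascending:
            a[i], a[i + mid] = a[i + mid], a[i]
    return bitonic_merge(a[:mid], ascending) + bitonic_merge(a[mid:], ascending)

def bitonic_sort(a, ascending=True):
    """Sort by building a bitonic sequence (ascending half followed by
    descending half) and then bitonic-merging it; len(a) must be 2^k."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    first = bitonic_sort(a[:mid], True)
    second = bitonic_sort(a[mid:], False)
    return bitonic_merge(first + second, ascending)
```

Each level of recursion corresponds to one column of comparators in the network.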
Mapping Bitonic Sort to Meshes
0100 0101 0110 0111 0111 0110 0101 0100 0010 0011 0110 0111
1000 1001 1010 1011 1000 1001 1010 1011 1000 1001 1100 1101
1100 1101 1110 1111 1111 1110 1101 1100 1010 1011 1110 1111
T_P = Θ(log² n) [comparisons] + Θ(√n) [communication].
• Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
On a hypercube:

T_P = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log² p) [comparisons] + Θ((n/p) log² p) [communication].
On a mesh:

T_P = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log² p) [comparisons] + Θ(n/√p) [communication].
procedure ODD-EVEN(n)
begin
    for i := 1 to n do
    begin
        if i is odd then
            for j := 0 to n/2 − 1 do
                compare-exchange(a_{2j+1}, a_{2j+2});
        if i is even then
            for j := 1 to n/2 − 1 do
                compare-exchange(a_{2j}, a_{2j+1});
    end for
end ODD-EVEN
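The pseudocode above translates directly into Python (the array is 1-indexed in the pseudocode, 0-indexed below):

```python
def odd_even_sort(a):
    """Odd-even transposition sort: n phases alternating between odd
    pairs (a1,a2),(a3,a4),... and even pairs (a2,a3),(a4,a5),..."""
    a = list(a)
    n = len(a)
    for i in range(1, n + 1):
        if i % 2 == 1:                       # odd phase
            for j in range(0, n // 2):
                # compare-exchange(a_{2j+1}, a_{2j+2}), 0-based (2j, 2j+1)
                if a[2 * j] > a[2 * j + 1]:
                    a[2 * j], a[2 * j + 1] = a[2 * j + 1], a[2 * j]
        else:                                # even phase
            for j in range(1, n // 2):
                # compare-exchange(a_{2j}, a_{2j+1}), 0-based (2j-1, 2j)
                if a[2 * j - 1] > a[2 * j]:
                    a[2 * j - 1], a[2 * j] = a[2 * j], a[2 * j - 1]
    return a
```

On the example sequence 3 2 3 8 5 6 4 1 it reproduces the phase-by-phase trace shown below.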
Initial:              3 2 3 8 5 6 4 1
after Phase 1 (odd):  2 3 3 8 5 6 1 4
after Phase 2 (even): 2 3 3 5 8 1 6 4
after Phase 3 (odd):  2 3 3 5 1 8 4 6
after Phase 4 (even): 2 3 3 1 5 4 8 6
after Phase 5 (odd):  2 3 1 3 4 5 6 8
after Phase 6 (even): 2 1 3 3 4 5 6 8
after Phase 7 (odd):  1 2 3 3 4 5 6 8
after Phase 8 (even): 1 2 3 3 4 5 6 8   (sorted)
• This is cost-optimal with respect to the base serial algorithm (odd-even transposition), but not with respect to the optimal serial sorting algorithm.
Parallel Odd-Even Transposition
T_P = Θ((n/p) log(n/p)) [local sort] + Θ(n) [comparisons] + Θ(n) [communication].
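The parallel formulation can be simulated sequentially with each "process" holding a sorted block and neighboring processes doing compare-splits (names and the sequential simulation are illustrative):

```python
def parallel_odd_even(blocks):
    """Block odd-even transposition with p processes: local sort,
    then p phases of compare-split between alternating neighbor pairs."""
    p = len(blocks)
    blocks = [sorted(b) for b in blocks]           # local sort
    for phase in range(p):
        start = 0 if phase % 2 == 0 else 1         # alternate pairings
        for i in range(start, p - 1, 2):
            merged = sorted(blocks[i] + blocks[i + 1])   # compare-split
            k = len(blocks[i])
            blocks[i], blocks[i + 1] = merged[:k], merged[k:]
    return blocks
```

After p phases every block is sorted and in its final position.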
Parallel Shellsort
• During the first phase, processes that are far away from each
other in the array compare-split their elements.
• The processes are split into two groups of size p/2 each and the
process repeated in each group.
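One way to sketch the pairing pattern this describes is below; the exact mirror-pairing scheme within each group is an assumption based on the description, not code from the slides:

```python
def shellsort_stages(p):
    """Pairing pattern for the first phase of parallel shellsort:
    within each group of size g, process i compare-splits with its
    'mirror' process g-1-i; the group is then halved and repeated."""
    stages = []
    g = p
    while g > 1:
        pairs = []
        for base in range(0, p, g):               # each group of size g
            for i in range(g // 2):
                pairs.append((base + i, base + g - 1 - i))
        stages.append(pairs)
        g //= 2                                   # split groups in half
    return stages
```

For p = 8 the first stage pairs processes (0,7), (1,6), (2,5), (3,4), i.e., processes far apart in the array, and later stages pair progressively closer processes.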
Parallel Shellsort
[Figure: the first phases of parallel shellsort on eight processes: processes far apart in the array compare-split their blocks; the groups are then halved and the process repeats within each half.]
Quicksort

[Figure: a partial trace of serial quicksort, with the pivot shown reaching its final position at each step:
(b) 1 2 3 5 8 4 3 7
(c) 1 2 3 3 4 5 8 7
(d) 1 2 3 3 4 5 7 8
(e) 1 2 3 3 4 5 7 8]
• In the best case, the pivot divides the list in such a way that the
larger of the two lists does not have more than αn elements (for
some constant α).
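A minimal serial quicksort sketch (list-based for clarity; an in-place partition is what the figures depict):

```python
def quicksort(a):
    """Serial quicksort: pick a pivot, partition into elements no
    larger and elements larger, recurse on each side. The pivot
    choice determines the quality of the split."""
    if len(a) <= 1:
        return list(a)
    pivot = a[0]
    smaller = [x for x in a[1:] if x <= pivot]
    larger = [x for x in a[1:] if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)
```

In the best case each pivot keeps the larger sublist below αn elements, giving Θ(n log n) work; a consistently bad pivot degrades this to Θ(n²).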
[Figure, panels (d)-(f): the binary tree induced by successive pivot choices in quicksort, stored in leftchild/rightchild arrays indexed 1 through 8.]
• Each processor partitions its list into two, say Li and Ui, based
on the selected pivot.
• All of the Li lists are merged and all of the Ui lists are merged
separately.
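A sequential simulation of this partition-and-merge step (each "processor" holds one list; names are illustrative):

```python
def parallel_partition(lists, pivot):
    """Each process splits its list into L (elements <= pivot) and
    U (elements > pivot); all L parts are then concatenated, followed
    by all U parts (simulated here sequentially, in process order)."""
    L, U = [], []
    for lst in lists:
        L.extend(x for x in lst if x <= pivot)
        U.extend(x for x in lst if x > pivot)
    return L, U
```

Applied to the example blocks in the figures below with pivot 7, this yields exactly the "after global rearrangement" row: 7 2 1 6 3 4 5 followed by the elements greater than 7.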
First Step (P0–P4)
  after local rearrangement:   7 2 18 13 1 17 14 20 6 10 15 9 3 4 19 16 5 12 11 8
  after global rearrangement:  7 2 1 6 3 4 5 18 13 17 14 20 10 15 9 19 16 12 11 8
  pivot selection:             7 2 1 6 3 4 5 18 13 17 14 20 10 15 9 19 16 12 11 8   (pivot = 5, pivot = 17)

Second Step (P0–P4)
  after local rearrangement:   1 2 7 6 3 4 5 14 13 17 18 20 10 15 9 19 16 12 11 8
  after global rearrangement:  1 2 3 4 5 7 6 14 13 17 10 15 9 16 12 11 8 18 20 19
  pivot selection:             1 2 3 4 5 7 6 14 13 17 10 15 9 16 12 11 8 18 20 19   (pivot = 11)

Third Step (P0–P4)
  after local rearrangement:   1 2 3 4 5 6 7 10 13 17 14 15 9 8 12 11 16 18 19 20
  after global rearrangement (P2, P3): 10 9 8 12 11 13 17 14 15 16

Fourth Step (P0–P4)
  Solution: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Parallelizing Quicksort: Shared Address Space Formulation
P0–P4, after local rearrangement: 7 2 18 13 1 17 14 20 6 10 15 9 3 4 19 16 5 12 11 8
  |S_i| = 2 1 1 2 1    |L_i| = 2 3 3 2 3
  prefix sum of |S_i|: 0 2 3 4 6 7    prefix sum of |L_i|: 0 2 5 8 10 13
after global rearrangement: 7 2 1 6 3 4 5 18 13 17 14 20 10 15 9 19 16 12 11 8
  (array indices 0–19)
• The parallel time depends on the split and merge time, and the
quality of the pivot.
• Each iteration has four steps: the first (selecting and broadcasting the pivot) takes Θ(log p); the second (local rearrangement), Θ(n/p); the third (prefix sums), Θ(log p); and the fourth (global rearrangement), Θ(n/p).
• The process recurses until there are p lists, at which point, the
lists are sorted locally.
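The prefix-sum-based global rearrangement can be sketched as follows (sequential simulation; variable names are illustrative):

```python
def global_rearrange(blocks, pivot):
    """Exclusive prefix sums over the per-process counts |S_i|
    (elements <= pivot) and |L_i| (elements > pivot) tell each process
    the offset at which to write its two parts, so all processes can
    copy into the shared array independently."""
    S = [[x for x in b if x <= pivot] for b in blocks]
    L = [[x for x in b if x > pivot] for b in blocks]
    s_off, l_off = [0], [0]
    for i in range(len(blocks)):                  # exclusive prefix sums
        s_off.append(s_off[-1] + len(S[i]))
        l_off.append(l_off[-1] + len(L[i]))
    total_s = s_off[-1]                           # all L parts go after the S parts
    out = [None] * sum(len(b) for b in blocks)
    for i in range(len(blocks)):                  # independent copies per process
        out[s_off[i]:s_off[i] + len(S[i])] = S[i]
        start = total_s + l_off[i]
        out[start:start + len(L[i])] = L[i]
    return out
```

On the example above (pivot 7) the computed |S_i|, |L_i|, and prefix sums match the figure, producing the row 7 2 1 6 3 4 5 18 13 17 14 20 10 15 9 19 16 12 11 8.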
T_P = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log p) [array splits] + Θ(log² p).   (4)
• Each processor splits its local list into two lists: Li, with the elements smaller than the pivot, and Ui, with the elements greater than the pivot.
• A processor in the low half of the machine sends its list Ui to the
paired processor in the other half. The paired processor sends
its list Li.
• It is easy to see that after this step, all elements less than the
pivot are in the low half of the machine and all elements
greater than the pivot are in the high half.
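One split step of this hypercube formulation can be simulated sequentially (the function, its parameters, and the XOR-based partner choice along one hypercube dimension are an illustrative sketch):

```python
def hypercube_split(blocks, pivot, dim):
    """One split step on a hypercube: process i exchanges with partner
    i XOR 2^(dim-1). A process in the low half keeps its own elements
    <= pivot and receives the partner's elements <= pivot; a process in
    the high half keeps and receives the elements > pivot."""
    p = len(blocks)
    mask = 1 << (dim - 1)
    new = [None] * p
    for i in range(p):
        partner = i ^ mask
        if i < partner:                     # low half of the machine
            new[i] = [x for x in blocks[i] if x <= pivot] + \
                     [x for x in blocks[partner] if x <= pivot]
        else:                               # high half of the machine
            new[i] = [x for x in blocks[i] if x > pivot] + \
                     [x for x in blocks[partner] if x > pivot]
    return new
```

After the step, every element no larger than the pivot resides in the low half of the machine and every larger element in the high half, so each half can recurse independently.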
Parallelizing Quicksort: Message Passing Formulation
• The above process recurses until each processor has its own local list, which it then sorts locally.
• Each processor runs through its local list and assigns each of its
elements to the appropriate processor.
Sample Sort

[Figure: sample sort with three processes:]
  local sort & sample selection:
    P0: 1 2 7 13 14 17 18 22    P1: 3 6 9 10 15 20 21 24    P2: 4 5 8 11 12 16 19 23
    samples: 7 17 | 9 20 | 8 16
  sample combining & global splitter selection: 7 8 9 16 17 20
  final element assignment:
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24   (distributed over P0, P1, P2)
T_P = Θ((n/p) log(n/p)) [local sort] + Θ(p² log p) [sort sample] + Θ(p log(n/p)) [block partition] + Θ(n/p) [communication].   (5)
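A sequential sketch of sample sort; the particular sample and splitter selection scheme below (evenly spaced samples, evenly spaced splitters from the combined sample) is one common choice, shown here as an assumption:

```python
import bisect

def sample_sort(blocks):
    """Sample sort: each process sorts its block and contributes p-1
    evenly spaced samples; the combined sample is sorted and p-1 global
    splitters are chosen from it; every element is then assigned to the
    process whose splitter interval contains it."""
    p = len(blocks)
    blocks = [sorted(b) for b in blocks]                       # local sort
    sample = []
    for b in blocks:                                           # sample selection
        sample.extend(b[len(b) * (i + 1) // p] for i in range(p - 1))
    sample.sort()                                              # sample combining
    splitters = [sample[(i + 1) * (p - 1) - 1] for i in range(p - 1)]
    buckets = [[] for _ in range(p)]
    for b in blocks:                                           # element assignment
        for x in b:
            buckets[bisect.bisect_left(splitters, x)].append(x)
    return [sorted(b) for b in buckets]
```

On the three-process example in the figure, the samples come out as 7 17, 9 20, and 8 16, the combined sample as 7 8 9 16 17 20, and the final assignment gives each process eight consecutive elements of 1 through 24.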