
Parallel Algorithms

Arnaud Legrand, CNRS, University of Grenoble
LIG laboratory, arnaud.legrand@imag.fr

October 2, 2011
Outline

Part I    Network Models
Part II   Communications on a Ring
Part III  Speedup and Efficiency
Part IV   Algorithms on a Ring
Part V    Algorithms on a Heterogeneous Ring
Part VI   Algorithms on a Grid
Part I

Network Models
Motivation

▶ Scientific computing: large needs in computation or storage resources.
▶ Need to use systems with “several processors”:
  ▶ Parallel computers with shared/distributed memory
  ▶ Clusters
  ▶ Heterogeneous clusters
  ▶ Clusters of clusters
  ▶ Networks of workstations
  ▶ The Grid
  ▶ Desktop Grids
▶ When modeling a platform, communication modeling seems to be the most controversial part.
▶ Two kinds of people produce communication models: those who are concerned with scheduling and those who are concerned with performance evaluation.
▶ All these models are imperfect and intractable.
Outline

1 Point to Point Communication Models
    Hockney
    LogP and Friends
    TCP

2 Modeling Concurrency
    Multi-port
    Single-port (Pure and Full Duplex)
    Flows

3 Remind This is a Model, Hence Imperfect

4 Topology
    A Few Examples
    Virtual Topologies
UET-UCT

Unit Execution Time - Unit Communication Time.

Hem. . . This one is mainly used by scheduling theoreticians to prove that their problem is hard and to know whether there is some hope to prove some clever result or not.

Some people have introduced a model with a cost of ε for local communications and 1 for communications with the outside.
“Hockney” Model

Hockney [Hoc94] proposed the following model for performance evaluation of the Paragon. A message of size m from Pi to Pj requires:

    t_{i,j}(m) = L_{i,j} + m/B_{i,j}

In scheduling, there are three types of “corresponding” models:
▶ Communications are not “splittable” and each communication k is associated to a communication time t_k (accounting for message size, latency, bandwidth, middleware, . . . ).
▶ Communications are “splittable” but latency is considered to be negligible (linear divisible model):

    t_{i,j}(m) = m/B_{i,j}

▶ Communications are “splittable” and latency cannot be neglected (affine divisible model):

    t_{i,j}(m) = L_{i,j} + m/B_{i,j}
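As a quick illustration, here is a minimal C sketch (not from the original slides) of the affine Hockney cost and the resulting effective bandwidth; the parameter values are made up for the example.

    #include <stdio.h>

    /* Hockney model: time to ship m bytes over a link with latency L (s)
       and bandwidth B (bytes/s). */
    double hockney_time(double m, double L, double B) {
        return L + m / B;
    }

    /* Effective bandwidth seen by the application for a message of size m. */
    double effective_bandwidth(double m, double L, double B) {
        return m / hockney_time(m, L, B);
    }

    int main(void) {
        double L = 50e-6;   /* 50 us latency (illustrative value)       */
        double B = 125e6;   /* ~1 Gb/s = 125 MB/s (illustrative value)  */
        for (double m = 1; m <= 2e7; m *= 64)
            printf("m = %10.0f B -> t = %.6f s, eff. bw = %.1f MB/s\n",
                   m, hockney_time(m, L, B), effective_bandwidth(m, L, B) / 1e6);
        return 0;
    }

Small messages are latency-bound, large ones approach B, which is exactly the shape of the measured curve shown a few slides below.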
LogP

The LogP model [CKP+96] is defined by 4 parameters:
▶ L is the network latency
▶ o is the middleware overhead (message splitting and packing, buffer management, connection, . . . ) for a message of size w
▶ g is the gap (the minimum time between two packet communications) between two messages of size w
▶ P is the number of processors/modules

[Figure: time diagram of a transfer; the sender and receiver cards each pay o per packet, consecutive packets are separated by g on the network cards, and the network adds L.]

▶ Sending m bytes with packets of size w:  2o + L + ⌈m/w⌉ · max(o, g)
▶ Occupation on the sender and on the receiver:  o + L + (⌈m/w⌉ − 1) · max(o, g)
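A small C sketch of these two LogP expressions (my own helper, not part of the course material); the ceiling handles the last partial packet, and the parameter values are made up.

    #include <math.h>
    #include <stdio.h>

    typedef struct { double L, o, g; } logp_t;   /* LogP parameters, in seconds */

    /* Time to send m bytes as packets of size w: 2o + L + ceil(m/w)*max(o,g). */
    double logp_send_time(logp_t p, double m, double w) {
        double packets = ceil(m / w);
        double og = p.o > p.g ? p.o : p.g;
        return 2 * p.o + p.L + packets * og;
    }

    /* Occupation of the sender (or receiver): o + L + (ceil(m/w)-1)*max(o,g). */
    double logp_occupation(logp_t p, double m, double w) {
        double packets = ceil(m / w);
        double og = p.o > p.g ? p.o : p.g;
        return p.o + p.L + (packets - 1) * og;
    }

    int main(void) {
        logp_t net = { 5e-6, 2e-6, 3e-6 };   /* L, o, g: illustrative numbers */
        printf("send time  = %g s\n", logp_send_time(net, 64 * 1024, 1024));
        printf("occupation = %g s\n", logp_occupation(net, 64 * 1024, 1024));
        return 0;
    }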
LogGP & pLogP

The previous model works fine for short messages. However, many parallel machines have special support for long messages, hence a higher bandwidth. LogGP [AISS97] is an extension of LogP where G captures the bandwidth for long messages:

    short messages:  2o + L + ⌈m/w⌉ · max(o, g)
    long messages:   2o + L + (m − 1)G

There is no fundamental difference. . .

OK, it works for small and large messages. Does it work for average-size messages? pLogP [KBV00] is an extension of LogP where L, o and g depend on the message size m. It also introduces a distinction between the send overhead o_s and the receive overhead o_r. This is more and more precise, but concurrency is still not taken into account.
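As a sketch of the two LogGP regimes, assuming a message goes either through the packetized (short-message) path or the long-message path, whichever is cheaper (the slides do not fix a switch-over size, so the code below simply takes the minimum):

    #include <math.h>

    typedef struct { double L, o, g, G; } loggp_t;   /* illustrative parameter set */

    /* LogGP: short messages pay per-packet gaps, long messages pay G per byte. */
    double loggp_time(loggp_t p, double m, double w) {
        double og    = p.o > p.g ? p.o : p.g;
        double small = 2 * p.o + p.L + ceil(m / w) * og;   /* packetized transfer */
        double large = 2 * p.o + p.L + (m - 1) * p.G;      /* long-message mode   */
        return small < large ? small : large;
    }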
Bandwidth as a Function of Message Size

With the Hockney model, the effective bandwidth is m / (L + m/B).

[Figure: measured bandwidth (Mbit/s) versus message size (1 B to 16 MB) for MPICH 1.2.6 with and without optimization; TCP with Gigabit Ethernet.]
What About TCP-based Networks?

The previous models work fine for parallel machines. Most networks use TCP, which has fancy flow-control mechanisms and slow start. Is it valid to use an affine model for such networks?
The answer seems to be yes, but latency and bandwidth parameters have to be carefully measured [LQDB05].

▶ Probing for m = 1 b and m = 1 Mb leads to bad results.
▶ The whole middleware stack should be benchmarked (the theoretical latency is useless because of middleware; the theoretical bandwidth is useless because of middleware and latency).

The slow start does not seem to be too harmful.
Most people forget that the round-trip time has a huge impact on the bandwidth.
Multi-ports

▶ A given processor can communicate with as many other processors as it wishes without any degradation.
▶ This model is widely used by scheduling theoreticians (think about all DAG-with-communications scheduling problems) to prove that their problem is hard and to know whether there is some hope to prove some clever result or not.
  This model is borderline, especially when allowing duplication, when one communicates with everybody at the same time, or when trying to design algorithms with super-tight approximation ratios.
  Frankly, such a model is totally unrealistic.
▶ Using MPI and synchronous communications, it may not be an issue. However, with multi-core, multi-processor machines, it cannot be ignored. . .

[Figure: triangle A–B–C, each edge labeled 1 — multi-port (numbers in s).]
Bounded Multi-port

▶ Assume now that we have threads or multi-core processors. We can then write a bound on the sum of the throughputs of all communications (incoming and outgoing). Such a model is OK for wide-area communications [HP04].
▶ Remember, the bounds due to the round-trip time must not be forgotten!

[Figure: triangle A–B–C, each edge labeled β/2 — bounded multi-port (β) (numbers in Mb/s).]
Single-port (Pure)

▶ A process can communicate with only one other process at a time. This constraint is generally written as a constraint on the sum of communication times and is thus rather easy to use in a scheduling context (even though it complexifies problems).
▶ This model makes sense when using non-threaded versions of communication libraries (e.g., MPI). As soon as you are allowed to use threads, bounded multi-port seems a more reasonable option (both for performance and scheduling complexity).

[Figure: triangle A–B–C, each edge labeled 1/3 — 1-port (pure) (numbers in s).]
Single-port (Full-Duplex)

At a given time, a process can be engaged in at most one emission and one reception. This constraint is generally written as two constraints: one on the sum of incoming communication times and one on the sum of outgoing communication times.

[Figure: triangle A–B–C, each edge labeled 1/2 — 1-port (full duplex) (numbers in Mb/s).]

This model somehow makes sense when using networks like Myrinet that have few multiplexing units and protocols without flow control [Mar07].

Even if it does not model complex situations well, such a model is not harmful.
Fluid Modeling

When using TCP-based networks, it is generally reasonable to use flows to model bandwidth sharing [MR99, Low03]. Each flow r ∈ R gets a rate ρ_r subject to the link capacity constraints

    ∀l ∈ L:  Σ_{r ∈ R s.t. l ∈ r} ρ_r ≤ c_l

and the rates are set according to some objective:

    Income Maximization:           maximize Σ_{r ∈ R} ρ_r
    Max-Min Fairness:              maximize min_{r ∈ R} ρ_r         (ATM)
    Proportional Fairness:         maximize Σ_{r ∈ R} log(ρ_r)      (TCP Vegas)
    Potential Delay Minimization:  minimize Σ_{r ∈ R} 1/ρ_r
    Some weird function:           minimize Σ_{r ∈ R} arctan(ρ_r)   (TCP Reno)
Flows Extensions

▶ Note that this model is a multi-port model with capacity constraints (like the previous bounded multi-port).
▶ When latencies are large, using multiple connections enables one to get more bandwidth. As a matter of fact, there is very little to lose in using multiple connections. . .
▶ Therefore many people enforce a sometimes artificial (but less intrusive) bound on the maximum number of connections per link [Wag05, MYCR06].
Remind This is a Model, Hence Imperfect

▶ The previous sharing models are nice, but you generally do not know the other flows. . .
▶ Communications use the memory bus and hence interfere with computations. Taking such interferences into account may become more and more important with multi-core architectures.
▶ Interferences between communications are sometimes. . . surprising.

Modeling is an art. You have to know your platform and your application to know what is negligible and what is important. Even if your model is imperfect, you may still derive interesting results.
Various Topologies Used in the Literature

[Figure: gallery of classical interconnection topologies.]
Beyond MPI_Comm_rank()?

▶ So far, MPI gives us a unique number for each processor
▶ With this one can do anything
▶ But it’s pretty inconvenient because one can do anything with it
▶ Typically, one likes to impose constraints about which processor/process can talk to which other processor/process
▶ With this constraint, one can then think of the algorithm in simpler terms
  ▶ There are fewer options for communications between processors
  ▶ So there are fewer choices when implementing an algorithm

Courtesy of Henri Casanova
Virtual Topologies?

▶ MPI provides an abstraction over physical computers
  ▶ Each host has an IP address
  ▶ MPI hides this address with convenient numbers
  ▶ There could be multiple such numbers mapped to the same IP address
  ▶ All “numbers” can talk to each other
▶ A Virtual Topology provides an abstraction over MPI
  ▶ Each process has a number, which may be different from the MPI number
  ▶ There are rules about which “numbers” a “number” can talk to
▶ A virtual topology is defined by specifying the neighbors of each process

Courtesy of Henri Casanova
Implementing a Virtual Topology

Example: a binary tree over MPI ranks 0 . . . 7, numbered level by level (rank 0 at the root, ranks 1–2 on level 1, ranks 3–6 on level 2, rank 7 on level 3). Each process gets tree coordinates (i,j):

    (i,j) = (floor(log2(rank+1)), rank − 2^max(i,0) + 1)
    rank  = j − 1 + 2^max(i,0)

    my_parent(i,j)      = (i−1, floor(j/2))
    my_left_child(i,j)  = (i+1, j*2),   if any
    my_right_child(i,j) = (i+1, j*2+1), if any

    MPI_Send(…, rank(my_parent(i,j)), …)
    MPI_Recv(…, rank(my_left_child(i,j)), …)

Courtesy of Henri Casanova
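A minimal C sketch of these conversions (the helper names follow the slide; the struct and the "no such child" tests are my own additions), handy for sanity-checking the formulas:

    #include <math.h>
    #include <stdio.h>

    /* Binary-tree virtual topology on ranks 0..p-1, as on the slide. */
    typedef struct { int i, j; } coord_t;

    coord_t coord_of_rank(int rank) {
        coord_t c;
        c.i = (int)floor(log2(rank + 1));
        c.j = rank - (1 << c.i) + 1;
        return c;
    }

    int rank_of_coord(coord_t c) { return c.j - 1 + (1 << c.i); }

    coord_t my_parent(coord_t c)      { coord_t p = { c.i - 1, c.j / 2 };     return p; }
    coord_t my_left_child(coord_t c)  { coord_t l = { c.i + 1, c.j * 2 };     return l; }
    coord_t my_right_child(coord_t c) { coord_t r = { c.i + 1, c.j * 2 + 1 }; return r; }

    int main(void) {
        int p = 8;                              /* 8 processes, as in the example */
        for (int rank = 0; rank < p; rank++) {
            coord_t c = coord_of_rank(rank);
            int parent = (rank == 0) ? -1 : rank_of_coord(my_parent(c));
            int left   = rank_of_coord(my_left_child(c));
            int right  = rank_of_coord(my_right_child(c));
            printf("rank %d = (%d,%d), parent %d, children %d/%d\n",
                   rank, c.i, c.j, parent,
                   left  < p ? left  : -1,      /* -1: no such child */
                   right < p ? right : -1);
        }
        return 0;
    }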
Typical Topologies

▶ Common topologies (see Section 3.1.2)
  ▶ Linear array
  ▶ Ring
  ▶ 2-D grid
  ▶ 2-D torus
  ▶ One-level tree
  ▶ Fully connected graph
  ▶ Arbitrary graph
▶ Two options for all topologies:
  ▶ Monodirectional links: more constrained but simpler
  ▶ Bidirectional links: less constrained but potentially more complicated
  ▶ By “complicated” we typically mean more bug-prone
▶ We’ll look at Ring and Grid in detail

Courtesy of Henri Casanova
Main Assumption and Big Question

▶ The main assumption is that once we’ve defined the virtual topology we forget it’s virtual and write parallel algorithms assuming it’s physical
  ▶ We assume communications on different (virtual) links do not interfere with each other
  ▶ We assume that computations on different (virtual) processors do not interfere with each other
▶ The big question: how well do these assumptions hold?
  ▶ The question being mostly about the network
▶ Two possible “bad” cases
  ▶ Case #1: the assumptions do not hold and there are interferences
    ▶ We’ll most likely achieve bad performance
    ▶ Our performance models will be broken and reasoning about performance improvements will be difficult
  ▶ Case #2: the assumptions do hold but we leave a lot of the network resources unutilized
    ▶ We could perhaps do better with another virtual topology

Courtesy of Henri Casanova
Which Virtual Topology to Pick

▶ We will see that some topologies are really well suited to certain algorithms
▶ The question is whether they are well suited to the underlying architecture
▶ The goal is to strike a good compromise
  ▶ Not too bad given the algorithm
  ▶ Not too bad given the platform
▶ Fortunately, many platforms these days use switches, which naturally support many virtual topologies
  ▶ Because they support concurrent communications between disjoint pairs of processors
▶ As part of a programming assignment, you will explore whether some virtual topology makes sense on our cluster

Courtesy of Henri Casanova
Topologies and Data Distribution

▶ One of the common steps when writing a parallel algorithm is to distribute some data (array, data structure, etc.) among the processors in the topology
  ▶ Typically, one does data distribution in a way that matches the topology
  ▶ E.g., if the data is 3-D, then it’s nice to have a 3-D virtual topology
▶ One question that arises then is: how is the data distributed across the topology?
▶ In the next set of slides we look at our first topology: a ring

Courtesy of Henri Casanova
Part II

Communications on a Ring
Outline

5 Assumptions
6 Broadcast
7 Scatter
8 All-to-All
9 Broadcast: Going Faster
Ring Topology (Section 3.3)

▶ Each processor is identified by a rank
  ▶ MY_NUM()
▶ There is a way to find the total number of processors
  ▶ NUM_PROCS()
▶ Each processor can send a message to its successor
  ▶ SEND(addr, L)
▶ And receive a message from its predecessor
  ▶ RECV(addr, L)
▶ We’ll just use the above pseudo-code rather than MPI
▶ Note that this is much simpler than the example tree topology we saw in the previous set of slides

[Figure: ring of processors P0, P1, P2, P3, . . . , Pp−1.]

Courtesy of Henri Casanova
Virtual vs. Physical Topology

▶ Now that we have chosen to consider a ring topology, we “pretend” our physical topology is a ring topology
▶ We can always implement a virtual ring topology (see previous set of slides)
  ▶ And read Section 4.6
▶ So we can write many “ring algorithms”
▶ It may be that another virtual topology is better suited to our physical topology
▶ But the ring topology makes for very simple programs and is known to be reasonably good in practice
▶ So it’s a good candidate for our first look at parallel algorithms

Courtesy of Henri Casanova
Cost of communication (Sect. 3.2.1)

▶ It is actually difficult to precisely model the cost of communication
  ▶ E.g., MPI implementations do various optimizations depending on the message sizes
▶ We will be using a simple model:

    Time = L + m/B

  L: start-up cost or latency
  B: bandwidth (b = 1/B)
  m: message size
▶ We assume that if a message of length m is sent from P0 to Pq, then the communication cost is q(L + m b)
▶ There are many assumptions in our model, some not very realistic, but we’ll discuss them later

Courtesy of Henri Casanova
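A one-line C sketch of this cost model (my own helper; L and b would come from measurements on the target platform):

    /* Cost of sending m bytes over q hops of the ring: q * (L + m * b),
       with L the latency and b = 1/B the inverse bandwidth. */
    double ring_comm_time(int q, double m, double L, double b) {
        return q * (L + m * b);
    }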
Assumptions about Communications

▶ Several options:
  ▶ Both Send() and Recv() are blocking
    ▶ Called “rendez-vous”
    ▶ Very old-fashioned systems
  ▶ Recv() is blocking, but Send() is not
    ▶ Pretty standard
    ▶ MPI supports it
  ▶ Both Recv() and Send() are non-blocking
    ▶ Pretty standard as well
    ▶ MPI supports it

Courtesy of Henri Casanova
Assumptions about Concurrency

▶ One question that’s important is: can the processor do multiple things at the same time?
▶ Typically we will assume that the processor can send, receive, and compute at the same time
  ▶ Call MPI_Irecv(), call MPI_Isend(), compute something
▶ This of course implies that the three operations are independent
  ▶ E.g., you don’t want to send the result of the computation
  ▶ E.g., you don’t want to send what you’re receiving (forwarding)
▶ When writing parallel algorithms (in pseudo-code), we’ll simply indicate concurrent activities with a || sign

Courtesy of Henri Casanova
Collective Communications

▶ To write a parallel algorithm, we will need collective operations
  ▶ Broadcasts, etc.
▶ MPI provides those, and they likely:
  ▶ Do not use the ring logical topology
  ▶ Utilize the physical resources well
▶ Let’s still go through the exercise of writing some collective communication algorithms
▶ We will see that for some algorithms we really want to do these communications “by hand” on our virtual topology rather than using the MPI collectives

Courtesy of Henri Casanova
Broadcast (Section 3.3.1)

▶ We want to write a program that has Pk send the same message of length m to all other processors
    Broadcast(k, addr, m)
▶ On the ring, we just send to the next processor, and so on, with no parallel communications whatsoever
▶ This is of course not the way one should implement a broadcast in practice if the physical topology is not merely a ring
  ▶ MPI uses some type of tree topology

Courtesy of Henri Casanova
Broadcast (Section 3.3.1)

Broadcast(k, addr, m)
  q = MY_NUM()
  p = NUM_PROCS()
  if (q == k)
    SEND(addr, m)
  else if (q == k-1 mod p)
    RECV(addr, m)
  else
    RECV(addr, m)
    SEND(addr, m)
  endif

▶ Assumes a blocking receive
▶ Sending may be non-blocking
▶ The broadcast time is (p−1)(L + m b)

Courtesy of Henri Casanova
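For reference, a minimal MPI sketch of this ring broadcast (my own translation of the pseudo-code, not the course's reference implementation):

    #include <mpi.h>

    /* Ring broadcast: root k sends m bytes around the ring. Every process
       except the root receives from its predecessor, and every process
       except the root's predecessor forwards to its successor. */
    void ring_broadcast(int k, void *addr, int m, MPI_Comm comm) {
        int q, p;
        MPI_Comm_rank(comm, &q);
        MPI_Comm_size(comm, &p);
        int succ = (q + 1) % p;
        int pred = (q - 1 + p) % p;

        if (q == k) {
            MPI_Send(addr, m, MPI_BYTE, succ, 0, comm);
        } else if (q == (k - 1 + p) % p) {
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(addr, m, MPI_BYTE, pred, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(addr, m, MPI_BYTE, succ, 0, comm);
        }
    }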
Scatter (Section 3.3.2)

▶ Processor k sends a different message to all other processors (and to itself)
▶ Pk stores the message destined to Pq at address addr[q], including a message at addr[k]
▶ At the end of the execution, each processor holds the message it has received in msg
▶ The principle is just to pipeline communication by starting to send the message destined to Pk−1, the most distant processor

Courtesy of Henri Casanova
Scatter (Section 3.3.2)

Scatter(k, msg, addr, m)
  q = MY_NUM()
  p = NUM_PROCS()
  if (q == k)
    for i = 0 to p-2
      SEND(addr[k+p-1-i mod p], m)
    msg ← addr[k]
  else
    RECV(tempR, m)
    for i = 1 to k-1-q mod p
      tempS ↔ tempR                     /* swap send and receive buffers (pointers) */
      SEND(tempS, m) || RECV(tempR, m)  /* send and receive in parallel, non-blocking send */
    msg ← tempR

▶ Same execution time as the broadcast: (p−1)(L + m b)

Courtesy of Henri Casanova
Scatter (Section 3.3.2)

Example with k = 2, p = 4 (ring 0 → 1 → 2 → 3 → 0):

  Proc q=2:  send addr[2+4-1-0 mod 4 = 1]
             send addr[2+4-1-1 mod 4 = 0]
             send addr[2+4-1-2 mod 4 = 3]
             msg = addr[2]
  Proc q=3:  recv (addr[1])
             // loop 2-1-3 mod 4 = 2 times
             send (addr[1]) || recv (addr[0])
             send (addr[0]) || recv (addr[3])
             msg = addr[3]
  Proc q=0:  recv (addr[1])
             // loop 2-1-2 mod 4 = 1 time
             send (addr[1]) || recv (addr[0])
             msg = addr[0]
  Proc q=1:  recv (addr[1])
             // loop 2-1-1 mod 4 = 0 times
             msg = addr[1]

Courtesy of Henri Casanova
All-to-all (Section 3.3.3)

All2All(my_addr, addr, m)
  q = MY_NUM()
  p = NUM_PROCS()
  addr[q] ← my_addr
  for i = 1 to p-1
    SEND(addr[q-i+1 mod p], m) || RECV(addr[q-i mod p], m)

▶ Same execution time as the scatter: (p−1)(L + m b)

Courtesy of Henri Casanova
A faster broadcast?

▶ How can we improve performance?
▶ One can cut the message into many small pieces, say r pieces where m is divisible by r.
▶ The root processor just sends r messages
▶ The performance is as follows:
  ▶ Consider the last processor to get the last piece of the message
  ▶ There need to be p−1 steps for the first piece to arrive, which takes (p−1)(L + m b / r)
  ▶ Then the remaining r−1 pieces arrive one after another, which takes (r−1)(L + m b / r)
  ▶ For a total of: (p − 2 + r)(L + m b / r)

Courtesy of Henri Casanova
A faster broadcast?

▶ The question is, what is the value of r that minimizes (p − 2 + r)(L + m b / r)?
▶ One can view the above expression as (c + ar)(d + b/r), with four constants a, b, c, d
▶ The non-constant part of the expression is then ad·r + cb/r, which must be minimized
▶ It is known that this value is minimized for r = sqrt(cb / ad), and we have

    r_opt = sqrt(m (p−2) b / L)

  with the optimal time

    (sqrt((p−2) L) + sqrt(m b))²

  which tends to m b when m is large, and is then independent of p!

Courtesy of Henri Casanova
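A small numeric C sketch of this trade-off (the parameter values are made up; ceil keeps r integral, which the analysis above ignores):

    #include <math.h>
    #include <stdio.h>

    /* Pipelined ring broadcast time when the message is cut into r pieces. */
    double pipelined_bcast_time(int p, double m, double L, double b, double r) {
        return (p - 2 + r) * (L + m * b / r);
    }

    int main(void) {
        int p = 16;                 /* processors                 */
        double m = 1e7;             /* message size in bytes      */
        double L = 1e-4, b = 1e-8;  /* latency, inverse bandwidth */

        double r_opt = sqrt(m * (p - 2) * b / L);   /* optimal number of pieces */
        printf("plain broadcast : %g s\n", (p - 1) * (L + m * b));
        printf("r_opt = %.0f    : %g s\n", ceil(r_opt),
               pipelined_bcast_time(p, m, L, b, ceil(r_opt)));
        printf("lower bound m*b : %g s\n", m * b);
        return 0;
    }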
Well-known Network Principle

▶ We have seen that if we cut a (large) message into many (small) messages, then we can send the message over multiple hops (in our case p−1) almost as fast as we can send it over a single hop
▶ This is a fundamental principle of IP networks
  ▶ We cut messages into IP frames
  ▶ Send them over many routers
  ▶ But really go as fast as the slowest router

Courtesy of Henri Casanova
Part III

Speedup and Efficiency
Outline

10 Speedup
11 Amdahl’s Law
Speedup

▶ We need a metric to quantify the impact of your performance enhancement
▶ Speedup: ratio of “old” time to “new” time
  ▶ old time = 2h
  ▶ new time = 1h
  ▶ speedup = 2h / 1h = 2
▶ Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial
  ▶ Happens more often than one thinks
Parallel Performance

▶ The notion of speedup is completely generic
  ▶ By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking
▶ For parallel programs one defines the Parallel Speedup (we’ll just say “speedup”):
  ▶ Parallel program takes time T1 on 1 processor
  ▶ Parallel program takes time Tp on p processors
  ▶ Parallel Speedup: S(p) = T1 / Tp
▶ In the ideal case, if my sequential program takes 2 hours on 1 processor, it takes 1 hour on 2 processors: called linear speedup
Speedup

[Figure: speedup vs. number of processors, with super-linear, linear, and sub-linear regions.]

Superlinear speedup? There are several possible causes:

Algorithm: with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast.
Hardware: with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk).
Bad News: Amdahl’s Law

Consider a program whose execution consists of two phases:
1 One sequential phase: Tseq = (1 − f) T1
2 One phase that can be perfectly parallelized (linear speedup): Tpar = f T1

Therefore: Tp = Tseq + Tpar / p = (1 − f) T1 + f T1 / p.

Amdahl’s Law:

    Sp = 1 / ((1 − f) + f/p)

[Figure: speedup vs. number of processors (up to 60) for f = 10%, 20%, 50%, 80%.]
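A tiny C sketch that tabulates this bound (processor counts chosen arbitrarily for illustration):

    #include <stdio.h>

    /* Amdahl's law: speedup with parallel fraction f on p processors. */
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        double fractions[] = { 0.10, 0.20, 0.50, 0.80 };
        for (int i = 0; i < 4; i++) {
            double f = fractions[i];
            printf("f = %.0f%%: S(16) = %.2f, S(64) = %.2f, bound = %.2f\n",
                   100 * f, amdahl_speedup(f, 16), amdahl_speedup(f, 64),
                   1.0 / (1.0 - f));
        }
        return 0;
    }

No matter how many processors are used, the speedup never exceeds 1/(1 − f).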
Lessons from Amdahl’s Law

▶ It’s a law of diminishing returns
▶ If a significant fraction of the code (in terms of time spent in it) is not parallelizable, then parallelization is not going to be good
▶ It sounds obvious, but people new to high performance computing often forget how bad Amdahl’s law can be
▶ Luckily, many applications can be almost entirely parallelized, so the sequential fraction (1 − f) is small
Parallel Efficiency

▶ Efficiency is defined as Eff(p) = S(p)/p
▶ Typically < 1, unless linear or superlinear speedup
▶ Used to measure how well the processors are utilized
▶ If increasing the number of processors by a factor 10 increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5
▶ Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster
Scalability

▶ Measure of the “effort” needed to maintain efficiency while adding processors
▶ Efficiency also depends on the problem size: Eff(n, p)
▶ Isoefficiency: at which rate does the problem size need to increase to maintain efficiency?
  ▶ nc(p) such that Eff(nc(p), p) = c
▶ By making a problem ridiculously large, one can typically achieve good efficiency
  ▶ Problem: is that how the machine/code will be used?
Part IV

Algorithms on a Ring
Outline

12 Matrix Vector Product
    OpenMP Version
    First MPI Version
    Distributing Matrices
    Second MPI Version
    Third MPI Version
    Mixed Parallelism Version
13 Matrix Multiplication
14 Stencil Application
    Principle
    Greedy Version
    Reducing the Granularity
15 LU Factorization
    Gaussian Elimination
    LU
Parallel Matrix-Vector Product

▶ y = A x
▶ Let n be the size of the matrix

int a[n][n];
int x[n];
for i = 0 to n-1 {
  y[i] = 0;
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j];
}

▶ How do we do this in parallel? (Section 4.1 in the book)

[Figure: matrix a[N][N] multiplied by vector x[N], producing y[N].]

Courtesy of Henri Casanova
Parallel Matrix-Vector Product

▶ How do we do this in parallel? For example:
  ▶ Computations of elements of vector y are independent
  ▶ Each of these computations requires one row of matrix a and vector x
▶ In shared memory:

#pragma omp parallel for private(i,j)
for i = 0 to n-1 {
  y[i] = 0;
  for j = 0 to n-1
    y[i] = y[i] + a[i,j] * x[j];
}

Courtesy of Henri Casanova
Parallel Matrix-Vector Product

▶ In distributed memory, one possibility is that each process has a full copy of matrix a and of vector x
▶ Each processor declares a vector y of size n/p
  ▶ We assume that p divides n
▶ Therefore, the code can just be:

load(a); load(x)
p = NUM_PROCS(); r = MY_RANK();
for (i=r*n/p; i<(r+1)*n/p; i++) {
  y[i-r*n/p] = 0;
  for (j=0; j<n; j++)
    y[i-r*n/p] = y[i-r*n/p] + a[i][j] * x[j];
}

▶ It’s embarrassingly parallel
▶ What about the result?

Courtesy of Henri Casanova
What about the result?

▶ After the processes complete the computation, each process has a piece of the result
▶ One probably wants to, say, write the result to a file
  ▶ Requires synchronization so that the I/O is done correctly
▶ For example:

. . .
if (r != 0) {
  recv(&token, 1);
}
open(file, “append”);
for (j=0; j<n/p; j++)
  write(file, y[j]);
send(&token, 1);
close(file);
barrier();   // optional

▶ Could also use a “gather” so that the entire vector is returned to processor 0
  ▶ vector y fits in the memory of a single node

Courtesy of Henri Casanova
What if matrix a is too big?

▶ Matrix a may not fit in memory
  ▶ Which is a motivation to use distributed-memory implementations
▶ In this case, each processor can store only a piece of matrix a
▶ For the matrix-vector multiply, each processor can just store n/p rows of the matrix
  ▶ Conceptually: A[n][n]
  ▶ But the program declares a[n/p][n]
▶ This raises the (annoying) issue of global indices versus local indices

Courtesy of Henri Casanova
Global vs. Local indices

▶ When an array is split among processes:
  ▶ a global index (I,J) references an element of the matrix
  ▶ a local index (i,j) references an element of the local array that stores a piece of the matrix
▶ Translation between global and local indices:
  ▶ think of the algorithm in terms of global indices
  ▶ implement it in terms of local indices

Example with n/p rows per process: global element A[5][3] is local element a[1][3] on process P1, and in general

    a[i][j] = A[(n/p)*rank + i][j]

Courtesy of Henri Casanova
Global Index Computation

▶ Real-world parallel code often implements actual translation functions
  ▶ GlobalToLocal()
  ▶ LocalToGlobal()
▶ This may be a good idea in your code, although for the ring topology the computation is pretty easy, and writing functions may be overkill
▶ We’ll see more complex topologies with more complex associated data distributions, and then it’s probably better to implement such functions

Courtesy of Henri Casanova
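For the block-row distribution used here these helpers are one-liners; a sketch in C (the names mirror the slide, the struct is my own; the column index is the same globally and locally, so it is omitted):

    /* Block-row distribution: process `rank` owns rows [rank*n/p, (rank+1)*n/p). */
    typedef struct { int rank, i; } local_index_t;   /* owner process and local row */

    local_index_t global_to_local(int I, int n, int p) {
        local_index_t loc = { I / (n / p), I % (n / p) };
        return loc;
    }

    int local_to_global(int rank, int i, int n, int p) {
        return rank * (n / p) + i;
    }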
Distributions of arrays

▶ At this point we have:
  ▶ 2-D array a distributed
  ▶ 1-D array y distributed
  ▶ 1-D array x replicated
▶ Having distributed arrays makes it possible to partition work among processes
▶ But it makes the code more complex due to global/local index translations
▶ It may require synchronization to load/save the array elements to file

Courtesy of Henri Casanova
All vectors distributed?

▶ So far we have array x replicated
▶ It is usual to try to have all arrays involved in the same computation be distributed in the same way
  ▶ makes it easier to read the code without constantly keeping track of what’s distributed and what’s not
  ▶ e.g., “local indices for array y are different from the global ones, but local indices for array x are the same as the global ones” will lead to bugs
▶ What one would like is for each process to have:
  ▶ n/p rows of matrix A in an array a[n/p][n]
  ▶ n/p components of vector x in an array x[n/p]
  ▶ n/p components of vector y in an array y[n/p]
▶ It turns out there is an elegant solution to do this

Courtesy of Henri Casanova
Principle of the Algorithm

Initial data distribution for n = 8, p = 4, n/p = 2: each Pq holds a block of two consecutive rows of A (P0 holds rows 0–1, P1 rows 2–3, P2 rows 4–5, P3 rows 6–7) and the matching two components of x.

Courtesy of Henri Casanova
Principle of the Algorithm

At each step, every processor computes the partial products that involve the block of x it currently holds, then forwards that block to its successor on the ring and receives a new one from its predecessor. For n = 8, p = 4:

    Step 0: Pq uses block column q of A with its own block of x.
    Step 1: Pq uses block column (q − 1) mod p of A with the block of x received from its predecessor.
    Step 2: Pq uses block column (q − 2) mod p.
    Step 3: Pq uses block column (q − 3) mod p.

Final state: after p steps every Pq has accumulated its block of y, and the distribution of x is back to the initial one. The final exchange of vector x is not strictly necessary, but one may want to have it distributed at the end of the computation like it was distributed at the beginning.

Courtesy of Henri Casanova
Algorithm

▶ Uses two buffers: tempS for sending and tempR for receiving

float A[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x   /* My piece of the vector (n/p elements) */
for (step=0; step<p; step++) {   /* p steps */
  SEND(tempS, r)
  RECV(tempR, r)
  for (i=0; i<n/p; i++)
    for (j=0; j<n/p; j++)
      y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

▶ In our example, the process of rank 2 at step 3 would work with the 2×2 matrix block starting at column ((2 − 3) mod 4) * 8/4 = 3 * 8/4 = 6

Courtesy of Henri Casanova
A few General Principles

▶ Large data needs to be distributed among processes (running on different nodes of a cluster for instance)
  ▶ causes many arithmetic expressions for index computation
  ▶ People who do this for a living always end up writing local_to_global() and global_to_local() functions
▶ Data may need to be loaded/written before/after the computation
  ▶ requires some type of synchronization among processes
▶ Typically a good idea to have all data structures distributed similarly to avoid confusion about which indices are global and which ones are local
  ▶ In our case, all indices are local
▶ In the end the code looks much more complex than the equivalent OpenMP implementation

Courtesy of Henri Casanova
Performance

▶ There are p identical steps
▶ During each step each processor performs three activities: computation, receiving, and sending
  ▶ Computation: r² w
    ▶ w: time to perform one “+= *” operation
  ▶ Receiving: L + r b
  ▶ Sending: L + r b

T(p) = p (r² w + 2L + 2 r b)

Courtesy of Henri Casanova
Asymptotic Performance

▶ T(p) = p (r² w + 2L + 2 r b)
▶ Speedup(p) = n² w / (p (r² w + 2L + 2 r b)) = n² w / (n² w / p + 2 p L + 2 n b)
▶ Eff(p) = n² w / (n² w + 2 p² L + 2 p n b)
▶ For p fixed, when n is large, Eff(p) ~ 1
▶ Conclusion: the algorithm is asymptotically optimal

Courtesy of Henri Casanova
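A small C sketch that evaluates these expressions, with r = n/p as above (the platform parameters are made up for illustration):

    #include <stdio.h>

    /* Per-step model of the ring matrix-vector product: r = n/p rows per process,
       w = time of one multiply-add, L = latency, b = inverse bandwidth. */
    int main(void) {
        double w = 1e-9, L = 1e-5, b = 1e-8;   /* illustrative platform parameters */
        int n = 4096;
        for (int p = 1; p <= 64; p *= 2) {
            double r  = (double)n / p;
            double Tp = p * (r * r * w + 2 * L + 2 * r * b);
            double T1 = (double)n * n * w;     /* sequential time: n^2 multiply-adds */
            printf("p=%2d  T(p)=%.4f s  speedup=%.2f  efficiency=%.2f\n",
                   p, Tp, T1 / Tp, T1 / Tp / p);
        }
        return 0;
    }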
Performance (2)

▶ Note that an algorithm that initially broadcasts the entire vector to all processors and then has every processor compute independently would run in time

    (p − 1)(L + n b) + p r² w

▶ It could use the pipelined broadcast
▶ This algorithm:
  ▶ has the same asymptotic performance
  ▶ is a simpler algorithm
  ▶ wastes only a tiny little bit of memory
  ▶ is arguably much less elegant
▶ It is important to think of simple solutions and see what works best given the expected matrix size, etc.

Courtesy of Henri Casanova
Back to the Algorithm

float A[n/p][n], x[n/p], y[n/p];
r ← n/p
tempS ← x   /* My piece of the vector (n/p elements) */
for (step=0; step<p; step++) {   /* p steps */
  SEND(tempS, r)
  RECV(tempR, r)
  for (i=0; i<n/p; i++)
    for (j=0; j<n/p; j++)
      y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

▶ In the above code, at each iteration, the SEND, the RECV, and the computation can all be done in parallel
▶ Therefore, one can overlap communication and computation by using non-blocking SEND and RECV if available
  ▶ MPI provides MPI_Isend() and MPI_Irecv() for this purpose

Courtesy of Henri Casanova
More Concurrent Algorithm

▶ Notation for concurrent activities:

float A[n/p][n], x[n/p], y[n/p];
tempS ← x   /* My piece of the vector (n/p elements) */
r ← n/p
for (step=0; step<p; step++) {   /* p steps */
  SEND(tempS, r)
  || RECV(tempR, r)
  || for (i=0; i<n/p; i++)
       for (j=0; j<n/p; j++)
         y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

Courtesy of Henri Casanova
Better Performance

▶ There are p identical steps
▶ During each step each processor performs three activities: computation, receiving, and sending
  ▶ Computation: r² w
  ▶ Receiving: L + r b
  ▶ Sending: L + r b
▶ With full overlap:

    T(p) = p max(r² w, L + r b)

▶ Same asymptotic performance as above, but better performance for smaller values of n

Courtesy of Henri Casanova
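A quick C comparison of the two estimates (same made-up platform parameters as in the earlier sketch):

    #include <stdio.h>

    /* Non-overlapped vs. overlapped ring matrix-vector product estimates. */
    int main(void) {
        double w = 1e-9, L = 1e-5, b = 1e-8;   /* illustrative platform parameters */
        int n = 1024, p = 32;
        double r = (double)n / p;
        double comm = L + r * b, comp = r * r * w;
        printf("no overlap : %.6f s\n", p * (comp + 2 * comm));
        printf("overlap    : %.6f s\n", p * (comp > comm ? comp : comm));
        return 0;
    }

For small n the per-step communication dominates, which is where overlapping pays off.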
Hybrid parallelism

▶ We have said many times that multi-core architectures are about to become the standard
▶ When building a cluster, the nodes you will buy will be multi-core
▶ Question: how to exploit the multiple cores? Or in our case, how to exploit the multiple processors in each node
▶ Option #1: run multiple processes per node
  ▶ Causes more overhead and more communication
  ▶ In fact will cause network communication among processes within a node!
  ▶ MPI will not know that processes are co-located

Courtesy of Henri Casanova
OpenMP MPI Program

▶ Option #2: run a single multi-threaded process per node
  ▶ Much lower overhead, fast communication within a node
  ▶ Done by combining MPI with OpenMP!
▶ Just write your MPI program
▶ Add OpenMP pragmas around loops
▶ Let’s look back at our matrix-vector multiplication example

Courtesy of Henri Casanova
Hybrid Parallelism

float A[n/p][n], x[n/p], y[n/p];
tempS ← x   /* My piece of the vector (n/p elements) */
for (step=0; step<p; step++) {   /* p steps */
  SEND(tempS, r)
  || RECV(tempR, r)
  || #pragma omp parallel for private(i,j)
     for (i=0; i<n/p; i++)
       for (j=0; j<n/p; j++)
         y[i] ← y[i] + a[i, ((rank - step) mod p) * n/p + j] * tempS[j]
  tempS ↔ tempR
}

▶ This is called Hybrid Parallelism
  ▶ Communication via the network among nodes
  ▶ Communication via the shared memory within nodes

Courtesy of Henri Casanova
Matrix Multiplication on the Ring

▶ See Section 4.2
▶ It turns out one can do matrix multiplication in a way very similar to matrix-vector multiplication
▶ A matrix multiplication is just the computation of n² scalar products, not just n
▶ We have three matrices, A, B, and C
▶ We want to compute C = A*B
▶ We distribute the matrices so that each processor “owns” a block row of each matrix
  ▶ Easy to do if row-major order is used, because all matrix elements owned by a processor are contiguous in memory

Courtesy of Henri Casanova
Parallel
Algorithms

A. Legrand
Data Distribution
Matrix Vector
Product
Open MP
Version
First MPI
Version r
B
Distributing
Matrices
Second MPI
Version
Third MPI
Version
Mixed
Parallelism
Version
n
Matrix
Multiplication

Stencil
Application

A C
Principle
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
94 / 235
Parallel
Algorithms

A. Legrand
First Step
Matrix Vector
Product
Open MP
Version
First MPI
p=4
Version r B1,0 B1,1 B1,2 B1,3
Distributing
Matrices let’s look at
processor P1
Second MPI
Version
Third MPI
Version
Mixed
Parallelism
Version
n
Matrix
Multiplication

Stencil
Application += += += +=
Principle
A1,0 A1,1 A1,2 A1,3 A1,1xB1,0 A1,1xB1,1 A1,1xB1,2 A1,1xB1,3
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
95 / 235
Parallel
Algorithms

A. Legrand
Shifting of block rows of B
Matrix Vector
Product
Open MP
Version
First MPI
p=4
Version r
Distributing
Matrices let’s look at
processor Pq
Second MPI
Version
Third MPI
Version
Mixed
Parallelism
Version
n
Matrix
Multiplication

Stencil
Application
Principle
Aq,0 Aq,1 Aq,2 Aq,3
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
96 / 235
Parallel
Algorithms

A. Legrand
Second step
Matrix Vector
Product
Open MP
Version
First MPI
p=4
Version r B0,0 B0,1 B0,2 B0,3
Distributing
Matrices let’s look at
processor P1
Second MPI
Version
Third MPI
Version
Mixed
Parallelism
Version
n
Matrix
Multiplication

Stencil
Application += += += +=
Principle
A1,0 A1,1 A1,2 A1,3 A1,0xB0,0 A1,0xB0,1 A1,0xB0,2 A1,0xB0,3
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
97 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP
Version
 In the end, every Ci,j block has the correct value: Ai,0B0,j + Ai,1B1,j +
First MPI ...
Version
Distributing  Basically, this is the same algorithm as for matrix-vector
Matrices
Second MPI
multiplication, replacing the partial scalar products by submatrix
Version products (gets tricky with loops and indices)
Third MPI
Version
float A[N/p][N], B[N/p][N], C[N/p][N];
r ← N/p
tempS ← B
q ← MY_RANK()
for (step=0; step<p; step++) { /* p steps */
    SEND(tempS,r*N)
    || RECV(tempR,r*N)
    || for (l=0; l<p; l++)
         for (i=0; i<N/p; i++)
           for (j=0; j<N/p; j++)
             for (k=0; k<N/p; k++)
               C[i,l*r+j] ← C[i,l*r+j] + A[i, r*((q - step)%p) + k] * tempS[k,l*r+j]
    tempS ↔ tempR
}
Courtesy of Henri Casanova
98 / 235
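A possible stand-alone C/MPI sketch of one process of this ring matrix multiplication, under the same assumptions as the pseudo-code (block-row distribution, N divisible by p); matrix contents, names and sizes are invented for illustration.

/* Hypothetical sketch: block rows of B circulate on the ring while each
 * process accumulates its block row of C. Row-major flat arrays. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int q, p, N = 512;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &q);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int r = N / p;
    double *A = calloc((size_t)r * N, sizeof(double));
    double *C = calloc((size_t)r * N, sizeof(double));
    double *tempS = calloc((size_t)r * N, sizeof(double)); /* my block row of B, then received ones */
    double *tempR = calloc((size_t)r * N, sizeof(double));

    for (int step = 0; step < p; step++) {
        int b = ((q - step) % p + p) % p;     /* origin of the block row of B held now */
        MPI_Request req[2];
        MPI_Isend(tempS, r * N, MPI_DOUBLE, (q + 1) % p, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(tempR, r * N, MPI_DOUBLE, (q - 1 + p) % p, 0, MPI_COMM_WORLD, &req[1]);
        for (int l = 0; l < p; l++)           /* block column l of C */
            for (int i = 0; i < r; i++)
                for (int j = 0; j < r; j++)
                    for (int k = 0; k < r; k++)
                        C[i * N + l * r + j] +=
                            A[i * N + b * r + k] * tempS[k * N + l * r + j];
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* compute overlapped the exchange */
        double *tmp = tempS; tempS = tempR; tempR = tmp;
    }
    MPI_Finalize();
    return 0;
}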
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=0
Stencil l=0
i=0
Application
Principle
Greedy Version
Reducing the
Granularity j=0
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
99 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=0
Stencil l=0
i=0
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
100 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=0
Stencil l=0
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
101 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=0
Stencil l=1
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
102 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=0
Stencil l=*
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
103 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=1
Stencil l=*
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
104 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=2
Stencil l=*
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
105 / 235
Parallel
Algorithms

A. Legrand
Algorithm
Matrix Vector
Product
Open MP for (step=0; step<p; step++) { /* p steps */
Version
First MPI
SEND(tempS,r*N)
Version || RECV(tempR,r*N)
Distributing || for (l=0; l<p; l++)
Matrices
Second MPI for (i=0; i<N/p; i++)
Version for (j=0; j<N/p; j++)
Third MPI
Version for (k=0; k<N/p; k++)
Mixed C[i,lr+j] ← C[i,lr+j] + A[i,r((rank ­ step)%p)+k] * tempS[k,lr+j]
Parallelism
Version tempS ↔ tempR

Matrix
Multiplication
}
step=3
Stencil l=*
i=*
Application
Principle
Greedy Version
Reducing the
Granularity j=*
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
106 / 235
Parallel
Algorithms

A. Legrand
Performance
Matrix Vector
Product
Open MP
Version
 Performance Analysis is straightforward
First MPI
Version
Distributing
 p steps and each step takes time: max(n r² w , L + n r b)
   p products of r×r blocks = p r³ = n r² operations
 Hence, the running time is: T(p) = p max(n r² w , L + n r b)
 Note that a naive algorithm computing n matrix-vector products in sequence using our previous algorithm would take time T(p) = p max(n r² w , n L + n r b)
 We just saved network latencies!
LU
Courtesy of Henri Casanova
107 / 235
Outline

Parallel
Algorithms 12 Matrix Vector Product
A. Legrand
Open MP Version
Matrix Vector First MPI Version
Product
Open MP Distributing Matrices
Version
First MPI Second MPI Version
Version
Distributing Third MPI Version
Matrices
Second MPI Mixed Parallelism Version
Version
Third MPI
Version 13 Matrix Multiplication
Mixed
Parallelism
Version 14 Stencil Application
Matrix
Multiplication
Principle
Stencil Greedy Version
Application
Principle
Reducing the Granularity
Greedy Version
Reducing the 15 LU Factorization
Granularity
LU Factorization
Gaussian Elimination
Gaussian
Elimination
LU
LU

108 / 235
Parallel
Algorithms Stencil Application (Section
A. Legrand
4.3)
Matrix Vector
Product
Open MP
 We’ve talked about stencil applications in the
Version context of shared-memory programs
First MPI
Version
Distributing
Matrices
Second MPI
Version
0 1 2 3 4 5 6 t+1
Third MPI
Version
1 2 3 4 5 6 7 t+1 t
Mixed
Parallelism
2 3 4 5 6 7 8
Version
3 4 5 6 7 8 9
Matrix new = update(old,W,N)
Multiplication 4 5 6 7 8 9 10
Stencil 5 6 7 8 9 10 11
Application
Principle 6 7 8 9 10 11 12
Greedy Version
Reducing the
Granularity
LU Factorization
 We found that we had to cut the matrix in “small”
Gaussian blocks
Elimination
LU
 On a ring the same basic idea applies, but let’s do it step-
by-step Courtesy of Henri Casanova
109 / 235
Parallel
Algorithms

A. Legrand
Stencil Application
Matrix Vector
Product
Open MP
Version
 Let us, for now, consider that the domain is of size nxn and
First MPI that we have p=n processors
Version
Distributing
 Classic way to first approach a problem
Matrices
Second MPI
 Each processor is responsible for computing one row of the
Version domain (at each iteration)
Third MPI
Version
 Each processor holds one row of the domain and has the
Mixed
Parallelism
following declaration:
Version var A: array[0..n-1] of real
Matrix
Multiplication
 One first simple idea is to have each processor send each
cell value to its neighbor as soon as that cell value is
Stencil
Application
computed
Principle  Basic principle: do communication as early as possible to
Greedy Version
get your “neighbors” started as early as possible
Reducing the
Granularity  Remember that one of the goals of a parallel program is to
LU Factorization reduce idle time on the processors
Gaussian
Elimination
 We call this algorithm the Greedy algorithm, and seek an
LU evaluation of its performance
Courtesy of Henri Casanova
110 / 235
Parallel
Algorithms

A. Legrand
The Greedy Algorithm
Matrix Vector
Product
q = MY_NUM()
p = NUM_PROCS
/* First element of the row */
if (q == 0) then
    A[0] = Update(A[0],nil,nil)
    Send(A[0],1)
else
    Recv(v,1)
    A[0] = Update(A[0],nil,v)
endif
/* Other elements */
for j = 1 to n-1
    if (q == 0) then
        A[j] = Update(A[j], A[j-1], nil)
        Send(A[j],1)
    elsif (q == p-1) then
        Recv(v,1)
        A[j] = Update(A[j], A[j-1], v)
    else
        Send(A[j-1], 1) || Recv(v,1)
        A[j] = Update(A[j], A[j-1], v)
    endif
endfor
/* note the use of "nil" for borders and corners */
Courtesy of Henri Casanova
111 / 235
Parallel
Algorithms

A. Legrand
Greedy Algorithm
Matrix Vector
Product
Open MP
Version
 This is all well and good, but typically we have n > p
First MPI
Version
 Assuming that p divides n, each processor will hold n/p
Distributing rows
Matrices
Second MPI
 Good for load balancing
Version  The goal of a greedy algorithm is always to allow
Third MPI
Version processors to start computing as early as possible
Mixed
Parallelism
 This suggests a cyclic allocation of rows among processors
Version
Matrix P0
Multiplication P1
Stencil
P2
Application P0
Principle P1
Greedy Version P2
Reducing the P0
Granularity P1
LU Factorization P2
Gaussian
Elimination
LU  P1 can start computing after P0 has computed its Courtesy
first cell
of Henri Casanova
112 / 235
Parallel
Algorithms

A. Legrand
Greedy Algorithm
Matrix Vector
Product
Open MP
Version
 Each processor holds n/p rows of the domain
First MPI
Version  Thus it declares:
Distributing
Matrices
Second MPI
var A[0..n/p-1,n] of real
Version
Third MPI
 Which is a contiguous array of rows, with these
Version
Mixed
rows not contiguous in the domain
 Therefore we have a non-trivial mapping
Parallelism
Version
Matrix between global indices and local indices, but
we’ll see that they don’t appear in the code
Multiplication

Stencil
Application  Let us rewrite the algorithm
Principle
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
113 / 235
Parallel
Algorithms

A. Legrand
The Greedy Algorithm
Matrix Vector
q = MY_NUM()
p = NUM_PROCS
for i = 0 to n/p-1
    if (q == 0) and (i == 0) then
        A[0,0] = Update(A[0,0],nil,nil)
        Send(A[0,0],1)
    else
        Recv(v,1)
        A[i,0] = Update(A[i,0],nil,v)
    endif
    for j = 1 to n-1
        if (q == 0) and (i == 0) then
            A[i,j] = Update(A[i,j], A[i,j-1], nil)
            Send(A[i,j],1)
        elsif (q == p-1) and (i == n/p-1) then
            Recv(v,1)
            A[i,j] = Update(A[i,j], A[i,j-1], v)
        else
            Send(A[i,j-1], 1) || Recv(v,1)
            A[i,j] = Update(A[i,j], A[i,j-1], v)
        endif
    endfor
endfor
Courtesy of Henri Casanova
114 / 235
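A possible C/MPI rendering of this greedy algorithm, assuming the cyclic row allocation described above (process q owns global rows q, q+p, ...); update() is a placeholder stencil, sizes are invented, and plain blocking calls are kept for simplicity (one-double messages are normally buffered eagerly, so the pipeline still progresses).

/* Hypothetical sketch of the greedy stencil on a ring, cyclic row allocation. */
#include <mpi.h>
#include <stdlib.h>

static double update(double old, double west, double north) {
    return (old + west + north) / 3.0;           /* placeholder update rule */
}

int main(int argc, char **argv)
{
    int q, p, n = 256;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &q);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int rows = n / p;                            /* assume p divides n */
    double *A = calloc((size_t)rows * n, sizeof(double));
    int up = (q - 1 + p) % p, down = (q + 1) % p;

    for (int i = 0; i < rows; i++) {             /* local row i = global row i*p + q */
        int first_row = (q == 0 && i == 0);
        int last_row  = (q == p - 1 && i == rows - 1);
        for (int j = 0; j < n; j++) {
            double north = 0.0, west = (j > 0) ? A[i * n + j - 1] : 0.0;
            if (!first_row)                      /* north value comes from the process above */
                MPI_Recv(&north, 1, MPI_DOUBLE, up, j, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            A[i * n + j] = update(A[i * n + j], west, north);
            if (!last_row)                       /* greedily forward the fresh cell */
                MPI_Send(&A[i * n + j], 1, MPI_DOUBLE, down, j, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}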
Parallel
Algorithms

A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
 Let T(n,p) denote the computation time of the algorithm for
First MPI a nxn domain and with p processors
Version
Distributing
 At each step a processor does at most three things
Matrices
Second MPI
 Receive a cell
Version  Send a cell
Third MPI
Version
 Update a cell
Mixed
Parallelism
 The algorithm is “clever” because at each step k, the
Version sending of messages from step k is overlapped with the
Matrix receiving of messages at step k+1
Multiplication
 Therefore, the time needed to compute one algorithm step
Stencil
Application
is the sum of
Principle  Time to send/receive a cell: L+b
Greedy Version  Time to perform a cell update: w
Reducing the
Granularity  So, if we can count the number of steps, we can simply
LU Factorization multiply and get the overall execution time
Gaussian
Elimination
LU
Courtesy of Henri Casanova
115 / 235
Parallel
Algorithms

A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
 It takes p-1 steps before processor Pp-1 can start computing
Version
First MPI
its first cell
Version
Distributing
 Thereafter, this processor can compute one cell at every step
Matrices
Second MPI
 The processor holds n*n/p cells
Version
Third MPI
 Therefore, the whole program takes: p-1+n*n/p steps
Version
Mixed
 And the overall execution time:
Parallelism
Version T(n,p) = (p - 1 + n2/p ) (w + L + b)
Matrix  The sequential time is: n2w
Multiplication
 The Speedup, S(n,p) = n2w / T(n,p)
Stencil
Application  When n gets large, T(n,p) ~ n2/p (w + L + b)
Principle
Greedy Version
 Therefore, Eff(n,p) ~ w / (w + L + b)
Reducing the
Granularity
 This could be WAY below one
LU Factorization
 In practice, and often, L + b >> w
Gaussian
Elimination
 Therefore, this greedy algorithm is probably not a good idea
LU at all! Courtesy of Henri Casanova
116 / 235
Parallel
Algorithms

A. Legrand
Granularity
Matrix Vector
Product
Open MP
Version
 How do we improve on performance?
What really kills performance is that we have to
First MPI
Version 
Distributing
Matrices do so much communication
 Many bytes of data
Second MPI
Version
Third MPI
Version  Many individual messages
Mixed
Parallelism
Version
 So what we want is to augment the granularity of
Matrix the algorithm
Multiplication
 Our “tasks” are not going to be “update one

cell” but instead “update multiple cells”


Stencil
Application

This will allow us to reduce both the amount of


Principle

Greedy Version
Reducing the
Granularity
data communicated and the number of messages
LU Factorization exchanged
Gaussian
Elimination
LU
Courtesy of Henri Casanova
117 / 235
Parallel
Algorithms

A. Legrand
Reducing the Granularity
Matrix Vector
Product
Open MP
Version
 A simple approach: have a processor compute k
First MPI
Version
cells in sequence before sending them
Distributing
Matrices
 This is in conflict with the “get processors to
Second MPI
Version compute as early as possible” principle we based
Third MPI
Version
our initial greedy algorithm on
Mixed  So we will reduce communication cost, but will
Parallelism
Version
increase idle time
Matrix
Multiplication  Let use assume that k divides n
Stencil
Application
 Each row now consists of n/k segments
Principle  If k does not divide n we have left over cells
Greedy Version
Reducing the and it complicates the program and the
performance analysis and as usual doesn’t
Granularity
LU Factorization
Gaussian change the asymptotic performance analysis
Elimination
LU
Courtesy of Henri Casanova
118 / 235
Parallel
Algorithms

A. Legrand
Reducing the Granularity
Matrix Vector

k
Product
Open MP
Version
First MPI
Version
P0 0 1 2 3
Distributing
Matrices
Second MPI
P1 1 2 3 4
Version
Third MPI
Version P2 2 3 4 5
Mixed
Parallelism
Version P3 3 4 5 6
Matrix

P0
Multiplication
4 5 6
Stencil
Application
Principle
Greedy Version
 The algorithm computes segment after segment
Reducing the
Granularity
 The time before P1 can start computing is the
LU Factorization time for P0 to compute a whole segment
Gaussian
Elimination  Therefore, it will take longer until Pp-1 can start
LU
computing Courtesy of Henri Casanova
119 / 235
Parallel
Algorithms Reducing the Granularity
A. Legrand
More
Matrix Vector
Product
Open MP
Version
 So far, we’ve allocated non-contiguous rows of
First MPI
Version
the domain to each processor
Distributing
Matrices
 But we can reduce communication by allocating
Second MPI
Version processors groups of contiguous rows
Third MPI  If two contiguous rows are on the same
Version
Mixed
Parallelism processors, there is no communication
Version
involved to update the cells of the second row
Matrix
Multiplication  Let us use say that we allocate blocks of rows of
Stencil size r to each processor
Application
Principle
 We assume that r*p divides n
Greedy Version
Reducing the
 Processor Pi holds rows j such that
i = floor(j/r) mod p
Granularity
LU Factorization
Gaussian
Elimination
 This is really a “block cyclic” allocation
LU
Courtesy of Henri Casanova
120 / 235
Parallel
Algorithms

A. Legrand
Reducing the Granularity
Matrix Vector
Product
Open MP
Version
k
P0
First MPI
r
Version
Distributing 0 1 2 3
Matrices
Second MPI
Version
P1 1 2 3 4
Third MPI
Version
Mixed
P2 2 3 4 5
Parallelism
Version
Matrix
P3 3 4 5 6
Multiplication

Stencil P0 4 5 6 7
Application
Principle
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
121 / 235
Parallel
Algorithms

A. Legrand
Idle Time?
Matrix Vector
Product
Open MP
Version  One question is: does any processor stay idle?
First MPI
Version
Distributing
 Processor P0 computes all values in its first block
Matrices
Second MPI
of rows in n/k algorithm steps
Version
Third MPI  After that, processor P0 must wait for cell values
Version
Mixed
Parallelism
from processor Pp-1
Version
Matrix
 But Pp-1 cannot start computing before p steps
Multiplication

Stencil
 Therefore:
Application  If p >= n/k, P0 is idle
Principle
Greedy Version  If p < n/k, P1 is not idle
Reducing the
Granularity  If p < n/k, then processors had better be able to
LU Factorization
Gaussian buffer received cells while they are still
Elimination
LU computing
Courtesy of Henri Casanova
 Possible increase in memory consumption 122 / 235
Parallel
Algorithms

A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version  It is actually very simple
First MPI
Version
Distributing
 At each step a processor is involved at most in
Matrices  Receiving k cells from its predecessor
Second MPI
Version  Sending k cells to its successor
Third MPI
Version
Mixed
 Updating k*r cells
Parallelism
Version  Since sending and receiving are overlapped, the
Matrix
Multiplication
time to perform a step is L + k b + k r w
Stencil
 Question: How many steps?
Application
Principle
 Answer: It takes p-1 steps before Pp-1 can start
Greedy Version
Reducing the
doing any thing. Pp-1 holds n2/(pkr) blocks
Granularity
LU Factorization
 Execution time:
Gaussian
Elimination T(n,p,r,k) = (p-1 + n2/(pkr)) (L + kb + k r w)
LU
Courtesy of Henri Casanova
123 / 235
Parallel
Algorithms

A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
 Our naïve greedy algorithm had asymptotic efficiency equal
First MPI to w / (w + L + b)
Version
Distributing
 This algorithm does better: Asympt. Eff = w / (w + L/(rk) + b/r)
Second MPI
Version  Divide n2w by p T(n,p,r,k)
Third MPI
Version
 And make n large
Mixed
Parallelism
 In the formula for the efficiency we clearly see the effect of
Version the granularity increase
Matrix  Asymptotic efficiency is higher
Multiplication
 But not equal to 1
Stencil
Application  Therefore, this is a “difficult” application to parallelize
Principle  We can try to do the best we can by increasing r and k, but it’s
Greedy Version
Reducing the
never going to be perfect
Granularity  One can compute the optimal values of r and k using
LU Factorization numerical solving
Gaussian
Elimination
 See the book for details
LU
Courtesy of Henri Casanova
124 / 235
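One simple way to do the numerical solving mentioned above is a brute-force sweep over the admissible values of r and k; the sketch below assumes the cost model of the previous slide, T(n,p,r,k) = (p-1 + n²/(pkr))(L + kb + krw), and the platform parameters are invented placeholders.

/* Hypothetical brute-force search for the (r,k) minimizing T(n,p,r,k). */
#include <stdio.h>

int main(void)
{
    double L = 1e-4, b = 1e-8, w = 1e-9;        /* assumed latency, per-byte, per-flop costs */
    int n = 4096, p = 16;
    double best = 1e300; int best_r = 1, best_k = 1;
    for (int r = 1; r <= n / p; r++) {
        if ((n / p) % r) continue;              /* keep r*p dividing n */
        for (int k = 1; k <= n; k++) {
            if (n % k) continue;                /* keep k dividing n */
            double T = (p - 1 + (double)n * n / ((double)p * k * r))
                       * (L + k * b + (double)k * r * w);
            if (T < best) { best = T; best_r = r; best_k = k; }
        }
    }
    printf("best r=%d k=%d T=%g\n", best_r, best_k, best);
    return 0;
}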
Outline

Parallel
Algorithms 12 Matrix Vector Product
A. Legrand
Open MP Version
Matrix Vector First MPI Version
Product
Open MP Distributing Matrices
Version
First MPI Second MPI Version
Version
Distributing Third MPI Version
Matrices
Second MPI Mixed Parallelism Version
Version
Third MPI
Version 13 Matrix Multiplication
Mixed
Parallelism
Version 14 Stencil Application
Matrix
Multiplication
Principle
Stencil Greedy Version
Application
Principle
Reducing the Granularity
Greedy Version
Reducing the 15 LU Factorization
Granularity
LU Factorization
Gaussian Elimination
Gaussian
Elimination
LU
LU

125 / 235
Parallel
Algorithms

A. Legrand
Solving Linear Systems of Eq.
Matrix Vector
Product
Open MP
Version
 Method for solving Linear Systems
First MPI  The need to solve linear systems arises in an estimated 75% of all scientific
Version
Distributing
computing problems [Dahlquist 1974]
Matrices  Gaussian Elimination is perhaps the most well-known
Second MPI
Version method
Third MPI
Version
 based on the fact that the solution of a linear system is
Mixed invariant under scaling and under row additions
Parallelism
Version
 One can multiply a row of the matrix by a constant as long as one
multiplies the corresponding element of the right-hand side by the
Matrix same constant
Multiplication  One can add a row of the matrix to another one as long as one
Stencil adds the corresponding elements of the right-hand side
Application  Idea: scale and add equations so as to transform matrix A in
Principle
Greedy Version
an upper triangular matrix:
[Figure: the system A x = b transformed into an upper triangular system; equation n−i has i unknowns]
Courtesy of Henri Casanova
126 / 235
Parallel
Algorithms

A. Legrand
Gaussian Elimination
Matrix Vector
Product
    [ 1  1  1 ]        [ 0 ]
    [ 1 -2  2 ]  x  =  [ 4 ]
    [ 1  2 -1 ]        [ 2 ]

Subtract row 1 from rows 2 and 3:
    [ 1  1  1 ]        [ 0 ]
    [ 0 -3  1 ]  x  =  [ 4 ]
    [ 0  1 -2 ]        [ 2 ]

Multiply row 3 by 3 and add row 2:
    [ 1  1  1 ]        [ 0 ]
    [ 0 -3  1 ]  x  =  [ 4 ]
    [ 0  0 -5 ]        [ 10 ]

Solving equations in reverse order (backsolving):
    -5x3 = 10           x3 = -2
    -3x2 + x3 = 4       x2 = -2
    x1 + x2 + x3 = 0    x1 = 4
Elimination
LU
Courtesy of Henri Casanova
127 / 235
Parallel
Algorithms

A. Legrand
Gaussian Elimination
Matrix Vector
Product
Open MP
Version  The algorithm goes through the matrix from the
First MPI
Version top-left corner to the bottom-right corner
Distributing
Matrices
Second MPI
 the ith step eliminates non-zero sub-diagonal
Version
Third MPI
elements in column i, subtracting the ith row
Version
Mixed
scaled by aji/aii from row j, for j=i+1,..,n.
Parallelism
Version
Matrix values already computed
Multiplication

Stencil pivot row i


Application
to be zeroed

Principle
Greedy Version
Reducing the
0 values yet to be
updated
Granularity
LU Factorization
Gaussian
Elimination
LU
i Courtesy of Henri Casanova
128 / 235
Parallel
Algorithms Sequential Gaussian
A. Legrand
Elimination
Matrix Vector
Product
Open MP
Version
Simple sequential algorithm
First MPI
Version
// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
    // for each row j below row i
    for j = i+1 to n
        // add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Matrix
Multiplication  Several “tricks” that do not change the spirit of the
Stencil
Application
algorithm but make implementation easier and/or more
Principle efficient
Greedy Version  Right-hand side is typically kept in column n+1 of the matrix
Reducing the
Granularity and one speaks of an augmented matrix
LU Factorization
 Compute the A(j,i)/A(i,i) term outside of the innermost loop
Gaussian
Elimination
LU
Courtesy of Henri Casanova
129 / 235
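A small self-contained C version of this sequential algorithm on an augmented matrix, using the example system of the earlier slide; indices are 0-based (unlike the pseudo-code) and there is still no pivoting.

/* Hypothetical C sketch: Gaussian elimination + backsolving, no pivoting. */
#include <stdio.h>

#define N 3

static void gaussian_elimination(double A[N][N + 1])
{
    for (int i = 0; i < N - 1; i++)             /* for each pivot column i */
        for (int j = i + 1; j < N; j++) {       /* for each row below row i */
            double scale = A[j][i] / A[i][i];   /* hoisted out of the k loop */
            for (int k = i; k <= N; k++)        /* includes the RHS column */
                A[j][k] -= scale * A[i][k];
        }
}

int main(void)
{
    double A[N][N + 1] = { { 1,  1,  1, 0 },    /* the example system above */
                           { 1, -2,  2, 4 },
                           { 1,  2, -1, 2 } };
    double x[N];
    gaussian_elimination(A);
    for (int i = N - 1; i >= 0; i--) {          /* backsolving */
        x[i] = A[i][N];
        for (int k = i + 1; k < N; k++) x[i] -= A[i][k] * x[k];
        x[i] /= A[i][i];
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);   /* expect (4, -2, -2) */
    return 0;
}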
Parallel
Algorithms

A. Legrand
Pivoting: Motivation
Matrix Vector
Product
0 1
Open MP
Version
 A few pathological cases 1 1
First MPI
Version
Distributing  Division by small numbers → round-off error in computer
Matrices
Second MPI arithmetic
Version
Third MPI
 Consider the following system
Version 0.0001x1 + x2 = 1.000
Mixed
Parallelism x1 + x2 = 2.000
Version
Matrix
 exact solution: x1=1.00010 and x2 = 0.99990
Multiplication
 say we round off after 3 digits after the decimal point
Stencil
Application
 Multiply the first equation by 10^4 and subtract it from the second equation
  (1 - 1)x1 + (1 - 10^4)x2 = 2 - 10^4
 But, in finite precision with only 3 digits:
  1 - 10^4 = -0.9999 E+4 ~ -0.999 E+4
  2 - 10^4 = -0.9998 E+4 ~ -0.999 E+4
 Therefore, x2 = 1 and x1 = 0 (from the first equation)
130 / 235
Parallel
Algorithms

A. Legrand
Partial Pivoting
Matrix Vector
Product
Open MP
Version
 One can just swap rows
First MPI x1 + x2 = 2.000
Version
Distributing
0.0001x1 + x2 = 1.000
Matrices  Multiple the first equation my 0.0001 and subtract it from the second
Second MPI
Version
equation gives:
Third MPI (1 - 0.0001)x2 = 1 - 0.0001
Version 0.9999 x2 = 0.9999 => x2 = 1
Mixed
Parallelism and then x1 = 1
Version
 Final solution is closer to the real solution. (Magical?)
Matrix
Multiplication  Partial Pivoting
 For numerical stability, one doesn’t go in order, but picks as the next pivot the row among rows i to n that has the largest element in column i
Application  This row is swapped with row i (along with elements of the right hand side)
Principle before the subtractions
Greedy Version 
the swap is not done in memory but rather one keeps an indirection array
Reducing the
Granularity
 Total Pivoting
 Look for the greatest element ANYWHERE in the matrix
LU Factorization  Swap columns
Gaussian
Elimination
 Swap rows
LU
Courtesy of Henri Casanova
 Numerical stability is really a difficult field 131 / 235
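A short C sketch of the indirection-array idea mentioned above: rows are never moved in memory, only a permutation of row indices is updated. The function name and parameters are hypothetical; perm[] is assumed initialized to the identity by the caller, and A is the augmented matrix of the previous sketch (compile with -lm for fabs).

/* Hypothetical partial pivoting step with a virtual row swap via perm[]. */
#include <math.h>

void pivoting_step(int n, double A[n][n + 1], int perm[n], int i)
{
    int best = i;
    for (int j = i + 1; j < n; j++)             /* largest |a_{j,i}| in column i */
        if (fabs(A[perm[j]][i]) > fabs(A[perm[best]][i]))
            best = j;
    int tmp = perm[i]; perm[i] = perm[best]; perm[best] = tmp;   /* swap indices only */
    for (int j = i + 1; j < n; j++) {           /* then eliminate as usual */
        double scale = A[perm[j]][i] / A[perm[i]][i];
        for (int k = i; k <= n; k++)
            A[perm[j]][k] -= scale * A[perm[i]][k];
    }
}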
Parallel
Algorithms Parallel Gaussian
A. Legrand
Elimination?
Matrix Vector
Product
Open MP
Version
 Assume that we have one processor per matrix element
First MPI
Version
Distributing
Matrices
Second MPI
[Figure: one elimination step with one processor per matrix element]
Reduction (to find the max aji) — Broadcast (max aji is needed to compute the scaling factors) — Compute (independent computation of the scaling factors)
Broadcasts (every update needs the scaling factor and the element from the pivot row) — Compute (independent computations: the updates)
Gaussian
Elimination
LU
Courtesy of Henri Casanova
132 / 235
Parallel
Algorithms

A. Legrand
LU Factorization (Section 4.4)
Matrix Vector
Product
Open MP  Gaussian Elimination is simple but
Version  What if we have to solve many Ax = b systems for different values of b?
First MPI
Version

This happens a LOT in real applications
Distributing  Another method is the “LU Factorization”
Matrices
Second MPI
 Ax = b
Version  Say we could rewrite A = L U, where L is a lower triangular matrix, and U is
Third MPI an upper triangular matrix O(n3)
Version
Mixed
 Then Ax = b is written L U x = b
Parallelism  Solve L y = b O(n2)
Version
 Solve U x = y O(n2)
Matrix
Multiplication

Stencil
triangular system solves are easy
[Figure: the two triangular systems L y = b and U x = y; in the lower triangular system equation i has i unknowns, in the upper triangular one equation n−i has i unknowns]
Courtesy of Henri Casanova
133 / 235
Parallel
Algorithms

A. Legrand
LU Factorization: Principle
Matrix Vector
Product
Open MP
Version
 It works just like the Gaussian Elimination, but instead of zeroing
First MPI out elements, one “saves” scaling coefficients.
Version
Distributing
Matrices
Gaussian elimination, saving each scaling factor in place of the zero it creates:

        [ 1   2  -1 ]          [ 1   2  -1 ]          [ 1   2   -1 ]
    A = [ 4   3   1 ]   --->   [ 4  -5   5 ]   --->   [ 4  -5    5 ]
        [ 2   2   3 ]          [ 2  -2   5 ]          [ 2  2/5   3 ]
                            (factors 4 and 2 saved)  (factor 2/5 saved)

        [ 1   0    0 ]         [ 1   2  -1 ]
    L = [ 4   1    0 ]     U = [ 0  -5   5 ]
        [ 2  2/5   1 ]         [ 0   0   3 ]
Principle

Magically, A = L x U !
Greedy Version

Reducing the
Granularity
LU Factorization
 Should be done with pivoting as well
Gaussian
Elimination
LU
Courtesy of Henri Casanova
134 / 235
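A quick sanity check of the factorization above (not from the slides): multiply L by U and compare with A; a small tolerance is used because 2/5 is not exactly representable in binary floating point.

/* Hypothetical check that L x U reproduces the 3x3 example matrix A. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double L[3][3] = { {1, 0, 0}, {4, 1, 0}, {2, 2.0 / 5.0, 1} };
    double U[3][3] = { {1, 2, -1}, {0, -5, 5}, {0, 0, 3} };
    double A[3][3] = { {1, 2, -1}, {4, 3, 1}, {2, 2, 3} };
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double s = 0;
            for (int k = 0; k < 3; k++) s += L[i][k] * U[k][j];
            if (fabs(s - A[i][j]) > 1e-12) { printf("mismatch at (%d,%d)\n", i, j); return 1; }
        }
    printf("L x U reproduces A\n");
    return 0;
}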
Parallel
Algorithms

A. Legrand
LU Factorization
Matrix Vector
Product
Open MP
Version
 We’re going to look at the simplest possible version
First MPI  No pivoting:just creates a bunch of indirections that are easy but make
Version
the code look complicated without changing the overall principle
Distributing
Matrices
Second MPI
LU-sequential(A,n) {
    for k = 0 to n-2 {
        // preparing column k (stores the scaling factors)
        for i = k+1 to n-1
            aik ← -aik / akk
        for j = k+1 to n-1
            // Task Tkj: update of column j
            for i = k+1 to n-1
                aij ← aij + aik * akj
    }
}
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
135 / 235
Parallel
Algorithms

A. Legrand
LU Factorization
Matrix Vector
Product
Open MP
Version
 We’re going to look at the simplest possible version
First MPI
Version
 No pivoting:just creates a bunch of indirections that are easy
Distributing but make the code look complicated without changing the
Matrices
overall principle
Second MPI
Version LU­sequential(A,n) {
Third MPI for k = 0 to n­2 {
Version
Mixed // preparing column k
Parallelism for i = k+1 to n­1
Version
aik ← ­aik / akk
Matrix
Multiplication
for j = k+1 to n­1
// Task Tkj: update of column j
Stencil for i=k+1 to n­1 update
Application
aij ← aij + aik * akj
Principle k
Greedy Version }
Reducing the } k
Granularity
LU Factorization j
Gaussian
Elimination
i
LU
Courtesy of Henri Casanova
136 / 235
Parallel
Algorithms

A. Legrand
Parallel LU on a ring
Matrix Vector
Product
Open MP
Version
 Since the algorithm operates by columns from left to right,
First MPI
Version
we should distribute columns to processors
Distributing
Matrices
 Principle of the algorithm
Second MPI  At each step, the processor that owns column k does the
Version
Third MPI
“prepare” task and then broadcasts the bottom part of column
Version k to all others
Mixed
Parallelism 
Annoying if the matrix is stored in row-major fashion
Version
 Remember that one is free to store the matrix in anyway one
Matrix wants, as long as it’s coherent and that the right output is
Multiplication
generated
Stencil
Application
 After the broadcast, the other processors can then update
Principle their data.
Greedy Version
Reducing the
 Assume there is a function alloc(k) that returns the rank of
Granularity the processor that owns column k
LU Factorization  Basically so that we don’t clutter our program with too many
Gaussian
Elimination global-to-local index translations
LU  In fact, we will first write everything in terms of global
Courtesy of Henri Casanova
indices, as to avoid all annoying index arithmetic 137 / 235
Parallel
Algorithms

A. Legrand
LU-broadcast algorithm
Matrix Vector
Product
LU-broadcast(A,n) {
    q ← MY_NUM()
    p ← NUM_PROCS()
    for k = 0 to n-2 {
        if (alloc(k) == q)
            // preparing column k
            for i = k+1 to n-1
                buffer[i-k-1] ← aik ← -aik / akk
        broadcast(alloc(k),buffer,n-k-1)
        for j = k+1 to n-1
            if (alloc(j) == q)
                // update of column j
                for i = k+1 to n-1
                    aij ← aij + buffer[i-k-1] * akj
    }
}
Gaussian
Elimination
LU
Courtesy of Henri Casanova
138 / 235
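A possible MPI rendering of one step of this algorithm, assuming a column-major local storage of the r = n/p owned columns and two helper mappings alloc() (global column → owning rank, as in the pseudo-code) and loc() (global column → local index); both helpers and all names are hypothetical.

/* Hypothetical sketch of one LU-broadcast step in MPI. */
#include <mpi.h>

void lu_broadcast_step(int n, int k, int q, double *a, double *buffer,
                       int (*alloc)(int), int (*loc)(int))
{
    int root = alloc(k);
    if (root == q) {                            /* I own column k: prepare it */
        int lk = loc(k);
        for (int i = k + 1; i < n; i++) {
            a[lk * n + i] = -a[lk * n + i] / a[lk * n + k];
            buffer[i - k - 1] = a[lk * n + i];
        }
    }
    MPI_Bcast(buffer, n - k - 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
    for (int j = k + 1; j < n; j++)             /* update my columns j > k */
        if (alloc(j) == q) {
            int lj = loc(j);
            for (int i = k + 1; i < n; i++)
                a[lj * n + i] += buffer[i - k - 1] * a[lj * n + k];
        }
}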
Parallel
Algorithms

A. Legrand
Dealing with local indices
Matrix Vector
Product
Open MP
Version  Assume that p divides n
First MPI
Version
Distributing
 Each processor needs to store r=n/p columns and
Matrices
Second MPI
its local indices go from 0 to r-1
Version
Third MPI
 After step k, only columns with indices greater
Version
Mixed than k will be used
Parallelism
Version  Simple idea: use a local index, l, that everyone
Matrix
Multiplication initializes to 0
Stencil  At step k, processor alloc(k) increases its local
Application
Principle index so that next time it will point to its next
Greedy Version
Reducing the
local column
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
139 / 235
Parallel
Algorithms

A. Legrand
LU-broadcast algorithm
Matrix Vector
Product
Open MP ...
Version
First MPI double a[n­1][r­1];
Version
Distributing
Matrices q ← MY_NUM()
Second MPI p ← NUM_PROCS()
Version
Third MPI l ← 0
Version for k = 0 to n­2 {
Mixed
Parallelism if (alloc(k) == q)
Version
for i = k+1 to n­1
Matrix buffer[i­k­1] ← a[i,k] ← ­a[i,l] / a[k,l]
Multiplication
l ← l+1
Stencil broadcast(alloc(k),buffer,n­k­1)
Application
Principle
for j = l to r­1
Greedy Version for i=k+1 to n­1
Reducing the
Granularity
a[i,j] ← a[i,j] + buffer[i­k­1] * a[k,j]
}
LU Factorization
Gaussian
}
Elimination
LU
Courtesy of Henri Casanova
140 / 235
Parallel
Algorithms What about the Alloc
A. Legrand
function?
Matrix Vector
Product
Open MP
Version
 One thing we have left completely unspecified is
First MPI
Version
how to write the alloc function: how are columns
Distributing
Matrices
distributed among processors
Second MPI
Version
 There are two complications:
Third MPI
Version
 The amount of data to process varies throughout the
Mixed algorithm’s execution
Parallelism
Version

At step k, columns k+1 to n-1 are updated
Matrix

Fewer and fewer columns to update
Multiplication  The amount of computation varies among columns
Stencil 
e.g., column n-1 is updated more often than column 2
Application
Principle

Holding columns on the right of the matrix leads to much
Greedy Version more work
Reducing the
Granularity
 There is a strong need for load balancing
LU Factorization  All processes should do the same amount of work
Gaussian
Elimination
LU
Courtesy of Henri Casanova
141 / 235
Parallel
Algorithms

A. Legrand
Bad load balancing
Matrix Vector
Product
P1 P2 P3 P4
Open MP
Version
First MPI
Version
Distributing
Matrices
Second MPI
Version
Third MPI
Version
Mixed
already
Parallelism
Version done
Matrix
Multiplication

Stencil already
done
Application
Principle working
on it
Greedy Version
Reducing the
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
142 / 235
Parallel
Algorithms

A. Legrand
Good Load Balancing?
Matrix Vector
Product
Open MP
Version
First MPI
Version
Distributing
Matrices
Second MPI
Version
Third MPI
already
Version
Mixed
done
Parallelism
Version
Matrix
Multiplication already
Stencil done
Application
Principle
working
Greedy Version on it
Reducing the
Granularity
LU Factorization

Cyclic distribution
Gaussian
Elimination
LU
Courtesy of Henri Casanova
143 / 235
Parallel
Algorithms Proof that load balancing is
A. Legrand
good
Matrix Vector
Product
Open MP
Version
 The computation consists of two types of operations
First MPI  column preparations
Version
Distributing
 matrix element updates
Matrices
Second MPI
 There are many more updates than preparations, so we really
Version care about good balancing of the preparations
Third MPI
Version  Consider column j
Mixed
Parallelism
 Let’s count the number of updates performed by the processor
Version holding column j
Matrix
Multiplication
 Column j is updated at steps k=0, ..., j-1
 At step k, elements i=k+1, ..., n-1 are updated
Stencil
Application  indices start at 0
Principle  Therefore, at step k, the update of column j entails n-k-1 updates
Greedy Version
Reducing the  The total number of updates for column j in the execution is:
Granularity
LU Factorization
Gaussian
Elimination
LU
Courtesy of Henri Casanova
144 / 235
Parallel
Algorithms Proof that load balancing is
A. Legrand
good
Matrix Vector
Product
Open MP
Version
 Consider processor Pi, which holds columns lp+i for l=0, ... , n/p -1
First MPI  Processor Pi needs to perform this many updates:
Version
Distributing
Matrices
Second MPI
Version
Third MPI
Version
Mixed
Parallelism
Version  Turns out this can be computed
Matrix  separate terms
Multiplication  use formulas for sums of integers and sums of squares
Stencil  What it all boils down to is:
Application
Principle
Greedy Version
Reducing the
Granularity
LU Factorization  This does not depend on i !!
Gaussian
Elimination
 Therefore it is (asymptotically) the same for all Pi processors
LU  Therefore we have (asymptotically) perfect load balancing!Courtesy of Henri Casanova
145 / 235
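One way to carry out the computation the slide alludes to, under the stated assumptions (cyclic column allocation, p dividing n); the lower-order constants are not checked against the book, only the leading term matters.

\begin{align*}
\text{updates for column } j &= \sum_{k=0}^{j-1} (n-k-1) \;=\; jn - \frac{j(j+1)}{2},\\
U_i \;=\; \sum_{l=0}^{n/p-1} \Big( (lp+i)\,n - \tfrac{(lp+i)(lp+i+1)}{2} \Big)
    &= \frac{n^3}{2p} - \frac{n^3}{6p} + O(n^2) \;=\; \frac{n^3}{3p} + O(n^2),
\end{align*}

which is independent of i, hence the (asymptotically) perfect load balancing.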
Parallel
Algorithms

A. Legrand
Load-balanced program
Matrix Vector
Product
Open MP ...
Version
First MPI double a[n­1][r­1];
Version
Distributing
Matrices q ← MY_NUM()
Second MPI p ← NUM_PROCS()
Version
Third MPI l ← 0
Version for k = 0 to n­2 {
Mixed
Parallelism if (k mod p == q)
Version
for i = k+1 to n­1
Matrix buffer[i­k­1] ← a[i,k] ← ­a[i,l] / a[k,l]
Multiplication
l ← l+1
Stencil broadcast(alloc(k),buffer,n­k­1)
Application
Principle
for j = l to r­1
Greedy Version for i=k+1 to n­1
Reducing the
Granularity
a[i,j] ← a[i,j] + buffer[i­k­1] * a[k,j]
}
LU Factorization
Gaussian
}
Elimination
LU
Courtesy of Henri Casanova
146 / 235
Parallel
Algorithms

A. Legrand
Performance Analysis
Matrix Vector
Product
Open MP
Version
 How long does this code take to run?
First MPI
Version  This is not an easy question because there are
Distributing
Matrices many tasks and many communications
Second MPI
Version  A little bit of analysis shows that the execution
Third MPI
Version time is the sum of three terms
Mixed
Parallelism
 n-1 communications: n L + (n2/2) b + O(1)
Version  n-1 column preparations: (n2/2) w’ + O(1)
Matrix
Multiplication
 column updates: (n3/3p) w + O(n2)
Stencil  Therefore, the execution time is ~ (n3/3p) w
Application
Principle
 Note that the sequential time is: (n3 /3) w
Greedy Version
Reducing the
 Therefore, we have perfect asymptotic efficiency!
Granularity
LU Factorization
 This is good, but isn’t always the best in practice
Gaussian
Elimination
 How can we improve this algorithm?
LU
Courtesy of Henri Casanova
147 / 235
Parallel
Algorithms

A. Legrand
Pipelining on the Ring
Matrix Vector
Product
Open MP
Version
 So far, the algorithm we’ve used a simple
First MPI
Version
broadcast
Distributing
Matrices
 Nothing was specific to being on a ring of
Second MPI
Version
processors and it’s portable
Third MPI
Version
 in fact you could just write raw MPI that just looks like
Mixed our pseudo-code and have a very limited, inefficient for
Parallelism
Version small n, LU factorization that works only for some
Matrix
number of processors
Multiplication  But it’s not efficient
Stencil
Application
 The n-1 communication steps are not overlapped with
Principle computations
Greedy Version  Therefore Amdahl’s law, etc.
Reducing the
Granularity  Turns out that on a ring, with a cyclic distribution
LU Factorization
Gaussian
of the columns, one can interleave pieces of the
Elimination broadcast with the computation
LU
 It almost looks like inserting the source code from the
Courtesy of Henri Casanova
148 / 235
Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Part V
Network
(General Case)

A Complete Example on an Heterogeneous Ring

149 / 235
The Context: Distributed Heterogeneous Platforms

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network How to embed a ring in a complex network [LRRV04].
(Complete)
Heterogeneous
Network
Sources of problems
(General Case)
I Heterogeneity of processors (computational power, memory, etc.)
I Heterogeneity of communications links.
I Irregularity of interconnection network.

150 / 235
Targeted Applications: Iterative Algorithms

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
I A set of data (typically, a matrix)
Network
(Complete)
Heterogeneous
I Structure of the algorithms:
Network
(General Case)
1 Each processor performs a computation on its chunk of data
2 Each processor exchange the “border” of its chunk of data with
its neighbor processors
3 We go back at Step 1
Question: how can we efficiently execute such an algorithm on such
a platform?

151 / 235
The Questions

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Network
I Which processors should be used ?
(General Case)
I What amount of data should we give them ?
I How do we cut the set of data ?

152 / 235
First of All, a Simplification: Slicing the Data

Parallel
Algorithms

A. Legrand

The Problem
I Data: a 2-D array
Fully
Homogeneous
Network
P1 P2
Heterogeneous
Network
(Complete)
Heterogeneous
Network
(General Case) P1 P2 P4 P3

P3 P4
I Unidimensional cutting into vertical slices
I Consequences:
1 Borders and neighbors are easily defined
2 Constant volume of data exchanged between neighbors: Dc
3 Processors are virtually organized into a ring

153 / 235
Notations

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network I Processors: P1 , ..., Pp
Heterogeneous
Network
(Complete)
I Processor Pi executes a unit task in a time wi
Heterogeneous
Network
(General Case)
I Overall amount of work Dw ; share of Pi : αi .Dw , processed in a time αi .Dw .wi (αi ≥ 0, Σj αj = 1)

I Cost of a unit-size communication from Pi to Pj : ci,j


I Cost of a sending from Pi to its successor in the ring: Dc .ci,succ(i)

154 / 235
Communications: 1-Port Model (Full-Duplex)

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete) A processor can:
Heterogeneous
Network
(General Case) I send at most one message at any time;
I receive at most one message at any time;
I send and receive a message simultaneously.

155 / 235
Objective

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
1 Select q processors among p
Heterogeneous
Network
(Complete)
2 Order them into a ring
Heterogeneous
Network
3 Distribute the data among them
(General Case)

So as to minimize:

    max_{1 ≤ i ≤ p}  I{i}[ αi .Dw .wi + Dc .(ci,pred(i) + ci,succ(i) ) ]

where I{i}[x] = x if Pi participates in the computation, and 0 otherwise

156 / 235
Special Hypotheses

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
1 There exists a communication link between any two processors
(Complete)
Heterogeneous
2 All links have the same capacity
Network
(General Case) (∃c, ∀i, j ci,j = c)

157 / 235
Consequences

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
I Either the most powerful processor performs all the work, or all the processors participate
I If all processors participate, all end their share of work simultaneously (∃τ, ∀i, αi .Dw .wi = τ , so 1 = Σi τ/(Dw .wi ))
I Time of the optimal solution:

    Tstep = min( Dw .wmin , Dw . 1/(Σi 1/wi ) + 2.Dc .c )

158 / 235
Special hypothesis

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
1 There exists a communication link between any two processors
Network
(General Case)

159 / 235
All the Processors Participate: Study (1)

Parallel
Algorithms

A. Legrand

[Figure: Gantt chart with one bar per processor P1 , ..., P5 ; each bar stacks the compute time αi .Dw .wi and the communication times Dc .ci,succ(i) and Dc .ci,pred(i) , and all bars reach the same height]

All processors end simultaneously

160 / 235
All the Processors Participate: Study (2)

Parallel
Algorithms

A. Legrand

The Problem
I All processors end simultaneously:

    Tstep = αi .Dw .wi + Dc .(ci,succ(i) + ci,pred(i) )    for every participating Pi

I Σ_{i=1..p} αi = 1  ⇒  Σ_{i=1..p} (Tstep − Dc .(ci,succ(i) + ci,pred(i) )) / (Dw .wi ) = 1. Thus

    Tstep / (Dw .wcumul ) = 1 + (Dc /Dw ) . Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi

  where wcumul = 1 / (Σi 1/wi )

161 / 235
All the Processors Participate: Interpretation

Parallel
Algorithms

A. Legrand

The Problem
    Tstep / (Dw .wcumul ) = 1 + (Dc /Dw ) . Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi

Tstep is minimal when Σ_{i=1..p} (ci,succ(i) + ci,pred(i) ) / wi is minimal

Look for a Hamiltonian cycle of minimal weight in a graph where the edge from Pi to Pj has a weight of di,j = ci,j /wi + cj,i /wj

NP-complete problem

162 / 235
All the Processors Participate: Linear Program

Parallel
Algorithms

A. Legrand

The Problem
Minimize Σ_{i=1..p} Σ_{j=1..p} di,j .xi,j ,

satisfying the (in)equations

  (1) Σ_{j=1..p} xi,j = 1              1 ≤ i ≤ p
  (2) Σ_{i=1..p} xi,j = 1              1 ≤ j ≤ p
  (3) xi,j ∈ {0, 1}                    1 ≤ i, j ≤ p
  (4) ui − uj + p.xi,j ≤ p − 1         2 ≤ i, j ≤ p, i ≠ j
  (5) ui integer, ui ≥ 0               2 ≤ i ≤ p

xi,j = 1 if, and only if, the edge from Pi to Pj is used

163 / 235
General Case: Linear program

Parallel
Algorithms

A. Legrand
Best ring made of q processors
The Problem
Fully
Minimize T satisfying the (in)equations

  (1) xi,j ∈ {0, 1}                                                1 ≤ i, j ≤ p
  (2) Σ_{i=1..p} xi,j ≤ 1                                          1 ≤ j ≤ p
  (3) Σ_{i=1..p} Σ_{j=1..p} xi,j = q
  (4) Σ_{i=1..p} xi,j = Σ_{i=1..p} xj,i                            1 ≤ j ≤ p
  (5) Σ_{i=1..p} αi = 1
  (6) αi ≤ Σ_{j=1..p} xi,j                                         1 ≤ i ≤ p
  (7) αi .wi + (Dc /Dw ) Σ_{j=1..p} (xi,j ci,j + xj,i cj,i ) ≤ T   1 ≤ i ≤ p
  (8) Σ_{i=1..p} yi = 1
  (9) −p.yi − p.yj + ui − uj + q.xi,j ≤ q − 1                      1 ≤ i, j ≤ p, i ≠ j
  (10) yi ∈ {0, 1}                                                 1 ≤ i ≤ p
  (11) ui integer, ui ≥ 0                                          1 ≤ i ≤ p

164 / 235
Linear Programming

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete) I Problems with rational variables: can be solved in polynomial time
Heterogeneous
Network
(General Case)
(in the size of the problem).
I Problems with integer variables: solved in exponential time in the
worst case.
I No relaxation in rationals seems possible here. . .

165 / 235
And, in Practice ?

Parallel
Algorithms

A. Legrand

The Problem
Fully
All processors participate. One can use a heuristic to solve the traveling salesman problem (such as Lin-Kernighan's).
No guarantee, but excellent results in practice.
General case.
1 Exhaustive search: feasible up to about a dozen processors. . .
2 Greedy heuristic: start with the best pair of processors; then, for a given ring, try to insert each unused processor between every pair of neighboring processors in the ring. . .

166 / 235
New Difficulty: Communication Links Sharing

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network P1
Heterogeneous
Network
P1 P2
(Complete) P2
Heterogeneous
Network
(General Case) P4

P3 P3 P4
Heterogeneous platform Virtual ring

We must take communication link sharing into account.

167 / 235
New Notations

Parallel
Algorithms

A. Legrand

The Problem
Fully
I A set of communications links: e1 , ..., en
I Bandwidth of link em : bem
I There is a path Si from Pi to Psucc(i) in the network
I Si uses a fraction si,m of the bandwidth bem of link em
I Pi needs a time Dc . 1/(min_{em ∈ Si} si,m ) to send a message of size Dc to its successor
I Constraints on the bandwidth of em : Σ_{1 ≤ i ≤ p} si,m ≤ bem
I Symmetrically, there is a path Pi from Pi to Ppred(i) in the network, which uses a fraction pi,m of the bandwidth bem of link em

168 / 235
Toy Example: Choosing the Ring

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
P1 Q
Network
Heterogeneous
a R c d
Network P2 P3
(Complete)
Heterogeneous
h e
Network
(General Case)
g b f

P5 P4

I 7 processors and 8 bidirectional communication links


I We choose a ring of 5 processors:
P1 → P2 → P3 → P4 → P5 (we use neither Q, nor R)

169 / 235
Toy Example: Choosing the Paths

Parallel
Algorithms

A. Legrand

The Problem
P1 Q
Fully
Homogeneous
a R c d
Network P2 P3
Heterogeneous
Network h e
(Complete) g b f
Heterogeneous
Network
(General Case)

P5 P4

From P1 to P2 , we use the links a and b: S1 = {a, b}.


From P2 to P1 , we use the links b, g and h: P2 = {b, g , h}.
From P1 : to P2 , S1 = {a, b} and to P5 , P1 = {h}
From P2 : to P3 , S2 = {c, d} and to P1 , P2 = {b, g , h}
From P3 : to P4 , S3 = {d, e} and to P2 , P3 = {d, e, f }
From P4 : to P5 , S4 = {f , b, g } and to P3 , P4 = {e, d}
From P5 : to P1 , S5 = {h} and to P4 , P5 = {g , b, f }

170 / 235
Toy Example: Bandwidth Sharing

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Network
(General Case)
From P1 to P2 we use links a and b: c1,2 = 1/ min(s1,a , s1,b ).
From P1 to P5 we use the link h: c1,5 = 1/p1,h .
Set of all sharing constraints:
Link a: s1,a ≤ ba
Link b: s1,b + s4,b + p2,b + p5,b ≤ bb
Link c: s2,c ≤ bc
Link d: s2,d + s3,d + p3,d + p4,d ≤ bd
Link e: s3,e + p3,e + p4,e ≤ be
Link f : s4,f + p3,f + p5,f ≤ bf
Link g : s4,g + p2,g + p5,g ≤ bg
Link h: s5,h + p1,h + p2,h ≤ bh

171 / 235
Toy Example: Final Quadratic System

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Network
(General Case)
Minimize max1≤i≤5 (αi .Dw .wi + Dc .(ci,i−1 + ci,i+1 )) under the constraints

Σi=1..5 αi = 1

s1,a ≤ ba                          s1,b + s4,b + p2,b + p5,b ≤ bb     s2,c ≤ bc
s2,d + s3,d + p3,d + p4,d ≤ bd     s3,e + p3,e + p4,e ≤ be            s4,f + p3,f + p5,f ≤ bf
s4,g + p2,g + p5,g ≤ bg            s5,h + p1,h + p2,h ≤ bh

s1,a .c1,2 ≥ 1     s1,b .c1,2 ≥ 1     p1,h .c1,5 ≥ 1
s2,c .c2,3 ≥ 1     s2,d .c2,3 ≥ 1     p2,b .c2,1 ≥ 1
p2,g .c2,1 ≥ 1     p2,h .c2,1 ≥ 1     s3,d .c3,4 ≥ 1
s3,e .c3,4 ≥ 1     p3,d .c3,2 ≥ 1     p3,e .c3,2 ≥ 1
p3,f .c3,2 ≥ 1     s4,f .c4,5 ≥ 1     s4,b .c4,5 ≥ 1
s4,g .c4,5 ≥ 1     p4,e .c4,3 ≥ 1     p4,d .c4,3 ≥ 1
s5,h .c5,1 ≥ 1     p5,g .c5,4 ≥ 1     p5,b .c5,4 ≥ 1
p5,f .c5,4 ≥ 1
172 / 235
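To make the objective above concrete, here is a hedged Python sketch that, for a fixed ring and a given bandwidth allocation (the si,m and pi,m), computes the communication costs ci,i+1, ci,i−1 and the resulting Tstep; finding the allocation that minimizes this value (subject to the link constraints) is the quadratic-programming part. All names are illustrative.

def step_time(alpha, w, Dw, Dc, succ_alloc, pred_alloc):
    # alpha[i], w[i]: share of work and cycle time of processor Pi
    # succ_alloc[i] / pred_alloc[i]: dict {link: fraction} allocated to the path
    # from Pi to its successor / predecessor (the s_{i,m} and p_{i,m} above).
    # The link capacity constraints (sum of fractions <= b_e on each link)
    # must hold separately for the allocation to be feasible.
    p = len(alpha)
    times = []
    for i in range(p):
        c_succ = 1.0 / min(succ_alloc[i].values())   # c_{i,i+1}
        c_pred = 1.0 / min(pred_alloc[i].values())   # c_{i,i-1}
        times.append(alpha[i] * Dw * w[i] + Dc * (c_succ + c_pred))
    return max(times)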
Toy Example: Conclusion

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
The problem sums up to a quadratic system if
Network
Heterogeneous
1 The processors are selected;
Network
(Complete)
Heterogeneous
2 The processors are ordered into a ring;
Network
(General Case) 3 The communication paths between the processors are known.
In other words: a quadratic system if the ring is known.

If the ring is known:


I Complete graph: closed-form expression;

I General graph: quadratic system.

173 / 235
And, in Practice ?

Parallel
Algorithms

A. Legrand

The Problem
Fully We adapt our greedy heuristic:
Homogeneous
Network
Heterogeneous
1 Initially: best pair of processors
Network
(Complete) 2 For each processor Pk (not already included in the ring)
Heterogeneous
Network
(General Case)
I For each pair (Pi , Pj ) of neighbors in the ring
1 We build the graph of the unused bandwidths
(Without considering the paths between Pi and Pj )
2 We compute the best paths in terms of (bottleneck) bandwidth between
Pk and Pi , and between Pk and Pj (a sketch follows this slide)
3 We evaluate the solution
3 We keep the best solution found at step 2 and we start again
+ refinements (max-min fairness, quadratic solving).

174 / 235
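The "best paths in terms of bandwidth" of step 2 are bottleneck (widest) paths in the graph of unused bandwidths. A minimal Python sketch, assuming the graph is given as a dict of dicts of residual bandwidths; illustrative only, not the author's implementation.

import heapq

def widest_paths(graph, src):
    # For every node, compute the best achievable bottleneck bandwidth of a
    # path from src, with a Dijkstra-like search driven by a max-heap
    # (negative keys, since heapq is a min-heap).
    best = {node: 0.0 for node in graph}
    best[src] = float("inf")
    heap = [(-best[src], src)]
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if width < best[u]:
            continue
        for v, bw in graph[u].items():
            cand = min(width, bw)          # bottleneck along the extended path
            if cand > best[v]:
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return best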
Is This Meaningful ?

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
I No guarantee, neither theoretical, nor practical
Heterogeneous
Network
(General Case)
I Simple solution:
1 we build the complete graph whose edges are labeled with the
bandwidths of the best communication paths
2 we apply the heuristic for complete graphs
3 we allocate the bandwidths

175 / 235
Example: an Actual Platform (Lyon)

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous
Network
Heterogeneous
Network
(Complete)
Heterogeneous
Network
(General Case)
[Figure: topology of the Lyon platform — popc0, sci0–sci6, myri0–myri2, moby, canaria and routlhpc, interconnected through two Hubs, a Switch, and a router to the backbone]
Topology

P0 P1 P2 P3 P4 P5 P6 P7 P8
0.0206 0.0206 0.0206 0.0206 0.0291 0.0206 0.0087 0.0206 0.0206

P9 P10 P11 P12 P13 P14 P15 P16


0.0206 0.0206 0.0206 0.0291 0.0451 0 0 0

Processor processing times (in seconds per megaflop)

176 / 235
Results

Parallel
Algorithms First heuristic building the ring without taking link sharing into ac-
A. Legrand count
The Problem
Fully
Second heuristic taking into account link sharing (and with quadratic
Homogeneous
Network
programing)
Heterogeneous
Network
(Complete)
Heterogeneous
Ratio Dc /Dw H1 H2 Gain
Network
(General Case) 0.64 0.008738 (1) 0.008738 (1) 0%
0.064 0.018837 (13) 0.006639 (14) 64.75%
0.0064 0.003819 (13) 0.001975 (14) 48.28%
Ratio Dc /Dw H1 H2 Gain
0.64 0.005825 (1) 0.005825 (1) 0%
0.064 0.027919 (8) 0.004865 (6) 82.57%
0.0064 0.007218 (13) 0.001608 (8) 77.72%
Table: Tstep /Dw for each heuristic on Lyon’s and Strasbourg’s platforms (the
numbers in parentheses show the size of the rings built).

177 / 235
Conclusion

Parallel
Algorithms

A. Legrand

The Problem
Fully
Homogeneous Even though this is a very basic application, it illustrates many diffi-
Network
Heterogeneous culties encountered when:
Network
(Complete)
Heterogeneous
I Processors have different characteristics
Network
(General Case) I Communications links have different characteristics
I There is an irregular interconnection network with complex band-
width sharing issues.
We need to use a realistic model of networks... Even though a more
realistic model leads to a much more complicated problem, this is worth
the effort as derived solutions are more efficient in practice.

178 / 235
Parallel
Algorithms

A. Legrand

Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
Part VI
Fox
Snyder
Data
Distribution Algorithms on a Grid

179 / 235
Outline

Parallel
Algorithms

A. Legrand

Communications

Matrix 16 Communications
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox 17 Matrix Multiplication
Snyder
Data Outer Product
Distribution
Grid Rocks!
Cannon
Fox
Snyder
Data Distribution

180 / 235
Parallel
Algorithms

A. Legrand
2-D Grid (Chapter 5)
Communications

Matrix
Multiplication
P0,0 P0,1 P0,2
Outer Product
Grid Rocks!
Cannon P1,0 P1,1 P1,2
Fox
Snyder
Data
Distribution P2,0 P2,1 P2,2

 Consider p = q² processors
 We can think of them arranged in a square grid
 A rectangular grid is also possible, but we’ll stick to square grids for most of our algorithms
 Each processor is identified as Pi,j
 i: processor row
 j: processor column

Courtesy of Henri Casanova


181 / 235
Parallel
Algorithms

A. Legrand
2-D Torus
Communications

Matrix
Multiplication
Outer Product
P0,0 P0,1 P0,2
Grid Rocks!
Cannon
Fox P1,0 P1,1 P1,2
Snyder
Data
Distribution
P2,0 P2,1 P2,2

 Wrap-around links from edge to edge


 Each processor belongs to 2 different rings
 Will make it possible to reuse algorithms we develop for
the ring topology
 Mono-directional links OR Bi-directional links
 Depending on what we need the algorithm to do and on
what makes sense for the physical platform
Courtesy of Henri Casanova
182 / 235
Parallel
Algorithms Concurrency of Comm. and
A. Legrand
Comp.
Communications

Matrix
 When developing performance models we will
Multiplication
Outer Product
assume that a processor can do all three activities in
Grid Rocks! parallel
Cannon
 Compute
Fox
Snyder
 Send
Data
Distribution
 Receive

 What about the bi-directional assumption?


 Two models

 Half-duplex: two messages on the same link

going in opposite directions contend for the


link’s bandwidth
 Full-duplex: it’s as if we had two links between each pair of neighboring processors


 The validity of the assumption depends on the

platform Courtesy of Henri Casanova


183 / 235
Parallel
Algorithms Multiple concurrent
A. Legrand
communications?
Communications

Matrix
Multiplication  Now that we have 4 (logical) links at each
processor, we need to decide how many
Outer Product
Grid Rocks!
Cannon
Fox
concurrent communications can happen at the
Snyder
Data
same time
Distribution  There could be 4 sends and 4 receives in the

bi-directional link model


 If we assume that 4 sends and 4 receives can
happened concurrently without loss of
performance, we have a multi-port model
 If we only allow 1 send and 1 receive to occur
concurrently we have a single-port model

Courtesy of Henri Casanova


184 / 235
Parallel
Algorithms

A. Legrand
So what?
Communications

Matrix
Multiplication  We have many options:
Outer Product
Grid Rocks!
 Grid or torus
Cannon
Fox
 Mono- or bi-directional
Snyder
Data
 Single-or multi-port
Distribution  Half- or full-duplex
 We’ll mostly use the torus, bi-directional, full-
duplex assumption
 We’ll discuss the multi-port and the single-port
assumptions
 As usual, it’s straightforward to modify the
performance analyses to match with whichever
assumption makes sense for the physical
platform
Courtesy of Henri Casanova
185 / 235
Parallel
Algorithms How realistic is a grid
A. Legrand
topology?
Communications

Matrix
Multiplication  Some parallel computers are built as
physical grids (2-D or 3-D)
Outer Product
Grid Rocks!
Cannon
Fox
 Example: IBM’s Blue Gene/L
Snyder
Data
Distribution
 If the platform uses a switch with all-to-all
communication links, then a grid is
actually not a bad assumption
 Although the full-duplex or multi-port
assumptions may not hold
 We will see that even if the physical
platform is a shared single medium (e.g.,
a non-switched Ethernet), it’s sometimes
preferable to think of it as a grid when
developing algorithms! Courtesy of Henri Casanova
186 / 235
Parallel
Algorithms

A. Legrand
Communication on a Grid
Communications

Matrix
Multiplication  As usual we won’t write MPI here, but
some pseudo code
Outer Product
Grid Rocks!
Cannon
Fox  A processor can call two functions to
known where it is in the grid:
Snyder
Data
Distribution
 My_Proc_Row()
 My_Proc_Col()
 A processor can find out how many
processors there are in total by:
 Num_Procs()
 Recall that here we assume we have a square
grid
 In programming assignment we may need to
use a rectangular grid Courtesy of Henri Casanova
187 / 235
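A hedged sketch of how My_Proc_Row() / My_Proc_Col() could be emulated from a flat rank on a square grid. Row-major numbering of the processors is an assumption for illustration; actual runtimes may number them differently.

from math import isqrt

def proc_row(rank, num_procs):
    q = isqrt(num_procs)      # q x q grid
    return rank // q

def proc_col(rank, num_procs):
    q = isqrt(num_procs)
    return rank % q

# Example: on a 3x3 grid, rank 5 sits at (row 1, col 2).
assert (proc_row(5, 9), proc_col(5, 9)) == (1, 2)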
Parallel
Algorithms

A. Legrand
Communication on the Grid
Communications

Matrix
Multiplication  We have two point-to-point functions
Outer Product
Grid Rocks!  Send(dest, addr, L)
Cannon
Fox  Recv(src, addr, L)
Snyder
Data
Distribution
 We will see that it’s convenient to have
broadcast algorithms within processor
rows or processor columns
 BroadcastRow(i, j, srcaddr, dstaddr, L)
 BroadcastCol(i, j, srcaddr, dstaddr, L)
 We assume that a call to these functions by a processor not on the relevant processor row or column simply returns immediately
 How do we implement these broadcasts?
Courtesy of Henri Casanova
188 / 235
Parallel
Algorithms

A. Legrand
Row and Col Broadcasts?
Communications

Matrix
Multiplication  If we have a torus
Outer Product
Grid Rocks!
 If we have mono-directional links, then we can reuse the
Cannon broadcast that we developed on a ring of processors
Fox  Either pipelined or not
Snyder
Data  It we have bi-directional links AND a multi-port model,
Distribution
we can improved performance by going both-ways
simultaneously on the ring
 We’ll see that the asymptotic performance is not changed
 If we have a grid
 If links are bi-directional then messages can be sent
both ways from the source processor

Either concurrently or not depending on whether we have a
one-port or multi-port model
 If links are mono-directional, then we can’t implement
the broadcasts at all

Courtesy of Henri Casanova


189 / 235
Outline

Parallel
Algorithms

A. Legrand

Communications

Matrix 16 Communications
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox 17 Matrix Multiplication
Snyder
Data Outer Product
Distribution
Grid Rocks!
Cannon
Fox
Snyder
Data Distribution

190 / 235
Parallel
Algorithms Matrix Multiplication on a
A. Legrand
Grid
Communications

Matrix
Multiplication  Matrix multiplication on a Grid has been studied a
Outer Product
Grid Rocks!
lot because
Cannon  Multiplying huge matrices fast is always
Fox
Snyder important in many, many fields
Data
Distribution  Each year there is at least a new paper on

the topic
 It’s a really good way to look at and learn

many different issues with a grid topology


 Let’s look at the natural matrix distribution
scheme induced by a grid/torus

Courtesy of Henri Casanova


191 / 235
Parallel
Algorithms

A. Legrand
2-D Matrix Distribution
Communications

Matrix
Multiplication
Outer Product
P0,0 P0,1 P0,2 P0,2
Grid Rocks!
Cannon
 We denote by ai,j an
Fox P1,0 P1,1 P1,2 P1,2 element of the matrix
Snyder
Data
Distribution
 We denote by Ai,j (or Aij)
P2,0 P2,1 P2,2 P2,2 the block of the matrix
allocated to Pi,j
P2,0 P2,1 P2,2 P2,2

C00 C01 C02 C03 A00 A01 A02 A03 B00 B01 B02 B03
C10 C11 C12 C13 A10 A11 A12 A13 B10 B11 B12 B13
C20 C21 C22 C23 A20 A21 A22 A23 B20 B21 B22 B23
C30 C31 C32 C33 A30 A31 A32 A33 B30 B31 B32 B33
Courtesy of Henri Casanova
192 / 235
Parallel
Algorithms
How do Matrices Get Distributed? (Sec.
A. Legrand
4.7)
Communications

Matrix
 Data distribution can be completely ad-hoc
Multiplication  But what about when developing a library that will be used by others?
Outer Product  There are two main options:
Grid Rocks!  Centralized
Cannon
Fox
 when calling a function (e.g., matrix multiplication)
 the input data is available on a single “master” machine (perhaps in a file)
Snyder  the input data must then be distributed among workers
Data
Distribution

the output data must be undistributed and returned to the “master” machine (perhaps in a file)
 More natural/easy for the user
 Allows for the library to make data distribution decisions transparently to the user
 Prohibitively expensive if one does sequences of operations
 and one almost always does so
 Distributed
 when calling a function (e.g., matrix multiplication)

Assume that the input is already distributed

Leave the output distributed
 May lead to having to “redistribute” data in between calls so that distributions match,
which is harder for the user and may be costly as well

For instance one may want to change the block size between calls, or go from a non-cyclic to a
cyclic distribution
 Most current software adopt the distributed approach
 more work for the user
 more flexibility and control
Courtesy of Henri Casanova
 We’ll always assume that the data is magically already distributed by the user
193 / 235
Parallel
Algorithms Four Matrix Multiplication
A. Legrand
Algorithms
Communications

Matrix
Multiplication  We’ll look at four algorithms
Outer Product
 Outer-Product
Grid Rocks!
Cannon
 Cannon
Fox
Snyder
 Fox
Data
Distribution
 Snyder

 The first one is used in practice


 The other three are more “historical” but are
really interesting to discuss
 We’ll have a somewhat hand-wavy discussion

here, rather than look at very detailed code

Courtesy of Henri Casanova


194 / 235
Parallel
Algorithms

A. Legrand
The Outer-Product Algorithm
Communications

Matrix
Multiplication
 Consider the “natural” sequential matrix multiplication
Outer Product algorithm
Grid Rocks!
Cannon
for i=0 to n-1
Fox for j=0 to n-1
Snyder for k=0 to n-1
Data
Distribution ci,j += ai,k * bk,j
 This algorithm is a sequence of inner-products (also called
scalar products)
 We have seen that we can switch loops around
 Let’s consider this version
for k=0 to n-1
for i=0 to n-1
for j=0 to n-1
ci,j += ai,k * bk,j
 This algorithm is a sequence of outer-products!

Courtesy of Henri Casanova


195 / 235
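A small NumPy check of this loop reordering (a sketch, not part of the original slides): the k-outermost version accumulates n rank-1 updates, i.e. outer products, and yields the same product.

import numpy as np

n = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)

C = np.zeros((n, n))
for k in range(n):
    # rank-1 update: outer product of column k of A with row k of B
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)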
Parallel
Algorithms

A. Legrand
The Outer-Product Algorithm
Communications

Matrix
Multiplication for k=0 to n-1
Outer Product for i=0 to n-1
Grid Rocks!
Cannon for j=0 to n-1
Fox
Snyder
ci,j += ai,k * bk,j
Data
Distribution

K=0 B K=1 B

A C += x A C += x

Courtesy of Henri Casanova


196 / 235
Parallel
Algorithms

A. Legrand
The outer-product algorithm
Communications

Matrix
Multiplication
 Why do we care about switching the loops around to view the matrix
Outer Product multiplication as a sequence of outer products?
Grid Rocks!  Because it makes it possible to design a very simple parallel algorithm on
Cannon a grid of processors!
Fox
Snyder
 First step: view the algorithm in terms of the blocks assigned to the
Data processors
Distribution
for k=0 to q-1
for i=0 to q-1
for j=0 to q-1
Ci,j += Ai,k * Bk,j

C00 C01 C02 C03 A00 A01 A02 A03 B00 B01 B02 B03
C10 C11 C12 C13 A10 A11 A12 A13 B10 B11 B12 B13
C20 C21 C22 C23 A20 A21 A22 A23 B20 B21 B22 B23
C30 C31 C32 C33 A30 A31 A32 A33 B30 B31 BCourtesy
32 B33 of Henri Casanova
197 / 235
Parallel
Algorithms

A. Legrand
The Outer-Product Algorithm
Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox
Snyder
Data
Distribution
for k=0 to q-1
  for i=0 to q-1
    for j=0 to q-1
      Ci,j += Ai,k * Bk,j

[Figure: the blocks Cij, Aij and Bij (i, j = 0..3) laid out on the 4x4 processor grid]

 At step k, processor Pi,j needs Ai,k and Bk,j


 If k = j, then the processor already has the

needed block of A
 Otherwise, it needs to get it from P
i,k
 If k = I, then the processor already has the

needed block of B
 Otherwise, it needs to get it from P
k,j

Courtesy of Henri Casanova


198 / 235
Parallel
Algorithms

A. Legrand
The Outer-Product Algorithm
Communications

Matrix
Multiplication  Based on the previous statements, we can now
see how the algorithm works
Outer Product
Grid Rocks!
Cannon
Fox  At step k
Snyder
 Processor P
i,k broadcasts its block of matrix A
Data
Distribution
to all processors in processor row i
 True for all i

 Processor P
k,j broadcasts its block of matrix B
to all processors in processor column j
 True for all j

 There are q steps (k = 0, . . . , q−1)

Courtesy of Henri Casanova


199 / 235
Parallel
Algorithms

A. Legrand
The Outer Product Algorithm
Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox P00 A01 P02 P03 P00 P01 P02 P03
Snyder
Data
Distribution
P10 A11 P12 P13 B10 B11 B12 B13
P20 A21 P22 P23 P20 P21 P22 P23
P30 A31 P32 P33 P30 P31 P32 P33

Step k=1 of the algorithm


Courtesy of Henri Casanova
200 / 235
Parallel
Algorithms

A. Legrand
The Outer-Product Algorithm
Communications

Matrix
// m = n/q
Multiplication
var A, B, C: array[0..m-1, 0..m-1] of real
Outer Product
Grid Rocks!
var bufferA, bufferB: array[0..m-1, 0..m-1] of real
Cannon var myrow, mycol
Fox myrow = My_Proc_Row()
Snyder mycol = My_Proc_Col()
Data for k = 0 to q-1
Distribution
// Broadcast A along rows
for i = 0 to q-1
BroadcastRow(i,k,A,bufferA,m*m)
// Broadcast B along columns
for j=0 to q-1
BroadcastCol(k,j,B,bufferB,m*m)
// Multiply matrix blocks (assuming a convenient MatrixMultiplyAdd() function)
if (myrow == k) and (mycol == k)
MatrixMultiplyAdd(C,A,B,m)
else if (myrow == k)
MatrixMultiplyAdd(C,bufferA,B,m)
else if (mycol == k)
MatrixMultiplyAdd(C, A, bufferB, m)
else
Courtesy of Henri Casanova
MatrixMultiplyAdd(C, bufferA, bufferB, m)
201 / 235
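A hedged, purely sequential simulation of the block outer-product algorithm (no MPI; the row and column broadcasts are emulated by directly reading the blocks held by Pi,k and Pk,j), just to check the block-level logic.

import numpy as np

q, m = 3, 2                      # q x q grid of m x m blocks
n = q * m
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.zeros((n, n))

def blk(M, i, j):
    # view on block (i, j) of matrix M
    return M[i*m:(i+1)*m, j*m:(j+1)*m]

for k in range(q):
    for i in range(q):
        for j in range(q):
            # bufferA is A_{i,k} (broadcast along processor row i) and
            # bufferB is B_{k,j} (broadcast along processor column j)
            Cij = blk(C, i, j)
            Cij += blk(A, i, k) @ blk(B, k, j)

assert np.allclose(C, A @ B)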
Parallel
Algorithms

A. Legrand
Performance Analysis
Communications

Matrix
Multiplication
 The performance analysis is straightforward
Outer Product  With a one-port model:
Grid Rocks!
Cannon
 The matrix multiplication at step k can occur in parallel with
Fox the broadcasts at step k+1
Snyder  Both broadcasts happen in sequence
Data
Distribution  Therefore, the execution time is equal to:

T(m,q) = 2 Tbcast + (q-1) max (2 Tbcast, m³ w) + m³ w


 w: elementary += * operation
 Tbcast: time necessary for the broadcast
 With a multi-port model:
 Both broadcasts can happen at the same time
T(m,q) = Tbcast + (q-1) max ( Tbcast, m³ w) + m³ w
 The time for a broadcast, using the pipelined broadcast:
Tbcast = ( sqrt( (q-2)L ) + sqrt( m² b ) )²
 When n gets large: T(m,q) ~ q m³ w = n³ w / q²
 Thus, asymptotic parallel efficiency is 1! Courtesy of Henri Casanova
202 / 235
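The model above is easy to play with numerically; a small Python helper that directly transcribes the one-port formulas (the parameter values in the example are made up for illustration):

from math import sqrt

def t_bcast(m, q, L, b):
    # pipelined broadcast time on a ring of q processors
    return (sqrt((q - 2) * L) + sqrt(m * m * b)) ** 2

def t_outer_product(m, q, w, L, b):
    # one-port model: both broadcasts in sequence, overlapped with computation
    return (2 * t_bcast(m, q, L, b)
            + (q - 1) * max(2 * t_bcast(m, q, L, b), m**3 * w)
            + m**3 * w)

# Example with made-up parameters: n = 1024 split on a 4x4 grid.
print(t_outer_product(m=256, q=4, w=1e-9, L=1e-5, b=1e-8))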
Parallel
Algorithms

A. Legrand
So what?
Communications

Matrix
Multiplication  On a ring platform we had already given an
asymptotically optimal matrix multiplication
Outer Product
Grid Rocks!
Cannon
Fox
algorithm on a ring in an earlier set of slides
Snyder
Data
 So what’s the big deal about another
Distribution
asymptotically optimal algorithm?
 Once again, when n is huge, indeed we don’t
care
 But communication costs are often non-negligible
and do matter
 When n is “moderate”
 When w/b is low
 It turns out, that the grid topology is
advantageous for reducing communication costs!
Courtesy of Henri Casanova
203 / 235
Parallel
Algorithms

A. Legrand
Ring vs. Grid
Communications

Matrix
Multiplication
 When we discussed the ring, we found that the
Outer Product communication cost of the matrix multiplication algorithm
Grid Rocks! was: n² b
Cannon
Fox
 At each step, the algorithm sends n²/p matrix elements among
Snyder neighboring processors
Data
Distribution
 There are p steps
 For the algorithm on a grid:
 Each step involves 2 broadcasts of n²/p matrix elements
 Assuming a one-port model, not to give an “unfair” advantage to
the grid topology
 Using a pipelined broadcast, this can be done in approximately
the same time as sending n²/p matrix elements between
neighboring processors on each ring (unless n is really small)
 Therefore, at each step, the algorithm on a grid spends twice
as much time communicating as the algorithm on a ring
 But it does sqrt(p) fewer steps!
 Conclusion: the algorithm on a grid spends at least sqrt(p)
less time in communication than the algorithm on a ring
Courtesy of Henri Casanova
204 / 235
Parallel
Algorithms

A. Legrand
Grid vs. Ring
Communications

Matrix
Multiplication  Why was the algorithm on a Grid much better?
Outer Product
Grid Rocks!  Reason: More communication links can be used
Cannon
Fox in parallel
Snyder
Data
 Point-to-point communication replaced by broadcasts
Distribution
 Horizontal and vertical communications may be
concurrent
 More network links used at each step
 Of course, this advantage isn’t really an
advantage if the underlying physical platform
does not really look like a grid
 But, it turns out that the 2-D distribution is
inherently superior to the 1-D distribution, no
matter what the underlying platform is!
Courtesy of Henri Casanova
205 / 235
Parallel
Algorithms

A. Legrand
Grid vs. Ring
Communications

Matrix  On a ring
Multiplication
Outer Product
 The algorithm communicates p matrix block rows that each
Grid Rocks! contain n²/p elements, p times
Cannon
Fox
 Total number of elements communicated: pn²
Snyder
Data
 On a grid
Distribution  Each step, 2sqrt(p) blocks of n²/p elements are sent, each to
sqrt(p)-1 processors, sqrt(p) times
 Total number of elements communicated: 2sqrt(p)n²
 Conclusion: the algorithm with a grid in mind
inherently sends less data around than the algorithm
on a ring
 Using a 2-D data distribution would be better than
using a 1-D data distribution even if the underlying
platform were a non-switched Ethernet for instance!
 Which is really 1 network link, and one may argue is closer to
a ring (p comm links) than a grid (p2 comm links) Courtesy of Henri Casanova
206 / 235
Parallel
Algorithms

A. Legrand
Conclusion
Communications

Matrix
Multiplication  Writing algorithms on a grid topology is a little bit
more complicated than in a ring topology
Outer Product
Grid Rocks!
Cannon
Fox  But there is often a payoff in practice and grid
topologies are very popular
Snyder
Data
Distribution

Courtesy of Henri Casanova


207 / 235
Parallel
Algorithms

A. Legrand
The Cannon Algorithm
Communications

Matrix
Multiplication
Outer Product
 This is a very old algorithm
Grid Rocks!
Cannon
 From the time of systolic arrays
Adapted to a 2-D grid
Fox

Snyder
Data
Distribution  The algorithm starts with a
redistribution of matrices A and B
 Called “preskewing”
 Then the matrices are multiplied
 Then the matrices are re-
redistributed to match the initial
distribution
 Called “postskewing” Courtesy of Henri Casanova
209 / 235
Parallel
Algorithms

A. Legrand
Cannon’s Preskewing
Communications

Matrix
Multiplication
Outer Product
 Matrix A: each block row of matrix A is
Grid Rocks!
Cannon
shifted so that each processor in the first
Fox
Snyder
processor column holds a diagonal block
Data
Distribution of the matrix

A00 A01 A02 A03 A00 A01 A02 A03


A10 A11 A12 A13 A11 A12 A13 A14
A20 A21 A22 A23 A22 A23 A20 A21
A30 A31 A32 A33 A33 A30 A31 A32

Courtesy of Henri Casanova


210 / 235
Parallel
Algorithms

A. Legrand
Cannon’s Preskewing
Communications

Matrix
Multiplication  Matrix B: each block column of matrix B is
shifted so that each processor in the first
Outer Product
Grid Rocks!
Cannon
Fox processor row holds a diagonal block of
Snyder
Data
the matrix
Distribution

B00 B01 B02 B03 B00 B11 B22 B33


B10 B11 B12 B13 B10 B21 B32 B03
B20 B21 B22 B23 B20 B31 B02 B13
B30 B31 B32 B33 B30 B01 B12 B23

Courtesy of Henri Casanova


211 / 235
Parallel
Algorithms

A. Legrand
Cannon’s Computation
Communications

Matrix
Multiplication
Outer Product
 The algorithm proceeds in q steps
Grid Rocks!
Cannon
Fox
 At each step each processor
Snyder
Data performs the multiplication of its
Distribution
block of A and B and adds the result
to its block of C
 Then blocks of A are shifted to the
left and blocks of B are shifted
upward
 Blocks of C never move
 Let’s see it on a picture Courtesy of Henri Casanova
212 / 235
Parallel
Algorithms

A. Legrand
Cannon’s Steps
Communications

Matrix
Multiplication
C00 C01 C02 C03 A00 A01 A02 A03 B00 B11 B22 B33
Outer Product
Grid Rocks!
C10 C11 C12 C13 A11 A12 A13 A10 B10 B21 B32 B03 local
B20 B31 B02 B13 computation
Cannon
C20 C21 C22 C23 A22 A23 A20 A21
Fox
Snyder
on proc (0,0)
Data C30 C31 C32 C33 A33 A30 A31 A32 B30 B01 B12 B23
Distribution
C00 C01 C02 C03 A01 A02 A03 A00 B10 B21 B32 B03
C10 C11 C12 C13 A12 A13 A10 A11 B20 B31 B02 B13 Shifts
C20 C21 C22 C23 A23 A20 A21 A22 B30 B01 B12 B23
C30 C31 C32 C33 A30 A31 A32 A33 B00 B11 B22 B33

C00 C01 C02 C03 A01 A02 A03 A00 B10 B21 B32 B03
C10 C11 C12 C13 A12 A13 A10 A11 B20 B31 B02 B13 local
computation
C20 C21 C22 C23 A23 A20 A21 A22 B30 B01 B12 B23
on proc (0,0)
C30 C31 C32 C33 A30 A31 A32 A33 B00 B11 B22 B33 Courtesy of Henri Casanova
213 / 235
Parallel
Algorithms

A. Legrand
The Algorithm
Communications

Matrix
Multiplication
Outer Product
Participate in preskewing of A
Grid Rocks!
Cannon Participate in preskewing of B
Fox
Snyder
Data
For k = 1 to q
Distribution
Local C = C + A*B
Vertical shift of B
Horizontal shift of A
Participate in postskewing of A
Participate in postskewing of B

Courtesy of Henri Casanova


214 / 235
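A hedged sequential simulation of Cannon's algorithm at the block level (the preskewing and the shifts are emulated with index arithmetic instead of messages), to check the data-movement pattern.

import numpy as np

q, m = 3, 2
n = q * m
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Ablk[i][j], Bblk[i][j], Cblk[i][j]: blocks currently held by processor P_{i,j}
Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Cblk = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]

# Preskewing: shift row i of A left by i, and column j of B up by j.
Ablk = [[Ablk[i][(j + i) % q] for j in range(q)] for i in range(q)]
Bblk = [[Bblk[(i + j) % q][j] for j in range(q)] for i in range(q)]

for _ in range(q):
    for i in range(q):
        for j in range(q):
            Cblk[i][j] += Ablk[i][j] @ Bblk[i][j]
    # Horizontal shift of A (left) and vertical shift of B (up).
    Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
    Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

assert np.allclose(np.block(Cblk), A @ B)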
Parallel
Algorithms

A. Legrand
Performance Analysis
Communications

Matrix
Multiplication  Let’s do a simple performance analysis
with a 4-port model
Outer Product
Grid Rocks!
Cannon
Fox
 The 1-port model is typically more complicated
Snyder
Data
Distribution
 Symbols
 n: size of the matrix
 qxq: size of the processor grid
 m=n/q
 L: communication start-up cost
 w: time to do a basic computation (+= . * .)
 b: time to communicate a matrix element
 T(m,q) = Tpreskew + Tcompute +
Tpostskew
Courtesy of Henri Casanova
215 / 235
Parallel
Algorithms

A. Legrand
Pre/Post-skewing times
Communications

Matrix
Multiplication
 Let’s consider the horizontal shift
Outer Product  Each row must be shifted so that the diagonal block ends
Grid Rocks!
Cannon
up on the first column
Fox  On a mono-directional ring:
Snyder
Data
 The last row needs to be shifted (q-1) times
Distribution  All rows can be shifted in parallel
 Total time needed: (q-1) (L + m² b)
 On a bi-directional ring, a row can be shifted left or right,
depending on which way is shortest!
 A row is shifted at most floor(q/2) times
 All rows can be shifted in parallel
 Total time needed: floor(q/2) (L + m² b)
 Because of the 4-port assumption, preskewing of A and B
can occur in parallel (horizontal and vertical shifts do not
interfere)
 Therefore: Tpreskew = Tpostskew = floor(q/2) (L+m²b)
Courtesy of Henri Casanova
216 / 235
Parallel
Algorithms

A. Legrand
Time for each step
Communications

Matrix
Multiplication
Outer Product
 At each step, each processor computes an
Grid Rocks!
Cannon
mxm matrix multiplication
Fox
Snyder
 Compute time: m³ w
Data
Distribution
 At each step, each processor sends/receives an m×m block in its processor row and its processor column
 Both can occur simultaneously with a 4-port model
 Takes time L + m² b
 Therefore, the total time for the q steps is:
Tcompute = q max (L + m²b, m³w)
Courtesy of Henri Casanova
217 / 235
Parallel
Algorithms

A. Legrand
Cannon Performance Model
Communications

Matrix
Multiplication
Outer Product
 T(m,q) = 2 floor(q/2) (L + m²b) +
Grid Rocks!
Cannon q max(m³w, L + m²b)
Fox
Snyder
Data
 This performance model is easily
Distribution
adapted
 If one assumes mono-directional links,
then the “floor(q/2)” above becomes
“(q-1)”
 If one assumes 1-port, there is a factor 2
added in front of communication terms
 If one assumes no overlap of communication and computation at a processor, the max(m³w, L + m²b) term becomes a sum
Courtesy of Henri Casanova
218 / 235
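As with the outer-product algorithm, this model is a one-liner to transcribe into Python (4-port, bi-directional assumption; illustrative only):

def t_cannon(m, q, w, L, b):
    # pre/post-skewing + q compute/shift steps (4-port, bi-directional model)
    skew = (q // 2) * (L + m * m * b)
    step = q * max(m**3 * w, L + m * m * b)
    return 2 * skew + step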
Parallel
Algorithms

A. Legrand
The Fox Algorithm
Communications

Matrix
Multiplication  This algorithm was originally developed to
run on a hypercube topology
Outer Product
Grid Rocks!
Cannon
Fox
 But in fact it uses a grid, embedded in the
Snyder
Data
hypercube
This algorithm requires no pre- or post-
Distribution

skewing
 It relies on horizontal broadcasts of the
diagonals of matrix A and on vertical shifts
of matrix B
 Sometimes called the “multiply-broadcast-
roll” algorithm
 Let’s see it on a picture
 Although it’s a bit awkward to draw because of. . .
Courtesy of Henri Casanova
219 / 235
Parallel
Algorithms

A. Legrand
Execution Steps...
Communications

Matrix
Multiplication
C00 C01 C02 C03 A00 A01 A02 A03 B00 B01 B02 B03
Outer Product
Grid Rocks!
C10 C11 C12 C13 A10 A11 A12 A13 B10 B11 B12 B13 initial
Cannon
C20 C21 C22 C23 A20 A21 A22 A23 B B B B state
Fox 20 21 22 23
Snyder
Data C30 C31 C32 C33 A30 A31 A32 A33 B30 B31 B32 B33
Distribution
C00 C01 C02 C03 A00 A00 A00 A00 B00 B01 B02 B03 Broadcast of
A’s 1st diag.
C10 C11 C12 C13 A11 A11 A11 A11 B10 B11 B12 B13 (stored in a
C20 C21 C22 C23 A22 A22 A22 A22 B20 B21 B22 B23 Separate
buffer)
C30 C31 C32 C33 A33 A33 A33 A33 B30 B31 B32 B33

C00 C01 C02 C03 A00 A00 A00 A00 B00 B01 B02 B03
C10 C11 C12 C13 A11 A11 A11 A11 B10 B11 B12 B13 Local
computation
C20 C21 C22 C23 A22 A22 A22 A22 B20 B21 B22 B23
C30 C31 C32 C33 A33 A33 A33 A33 B30 B31 B32 B33 Courtesy of Henri Casanova
220 / 235
Parallel
Algorithms

A. Legrand
Execution Steps...
Communications

Matrix
Multiplication
C00 C01 C02 C03 A00 A01 A02 A03 B10 B11 B12 B13
Outer Product
Grid Rocks!
C10 C11 C12 C13 A10 A11 A12 A13 B20 B21 B22 B23 Shift of B
Cannon
Fox C20 C21 C22 C23 A20 A21 A22 A23 B30 B31 B32 B33
Snyder
Data C30 C31 C32 C33 A30 A31 A32 A33 B00 B01 B02 B03
Distribution
C00 C01 C02 C03 A01 A01 A01 A01 B10 B11 B12 B13 Broadcast of
A’s 2nd diag.
C10 C11 C12 C13 A12 A12 A12 A12 B20 B21 B22 B23
(stored in a
C20 C21 C22 C23 A23 A23 A23 A23 B30 B31 B32 B33 Separate
C30 C31 C32 C33 A30 A30 A30 A30 B B B B buffer)
00 01 02 03

C00 C01 C02 C03 A01 A01 A01 A01 B10 B11 B12 B13
C10 C11 C12 C13 A12 A12 A12 A12 B20 B21 B22 B23 Local
computation
C20 C21 C22 C23 A23 A23 A23 A23 B30 B31 B32 B33
C30 C31 C32 C33 A30 A30 A30 A30 B00 B01 B02 B03 Courtesy of Henri Casanova
221 / 235
Parallel
Algorithms

A. Legrand
Fox’s Algorithm
Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
// No initial data movement
Fox
Snyder for k = 1 to q in parallel
Broadcast A’s kth diagonal
Data
Distribution

Local C = C + A*B
Vertical shift of B
// No final data movement

 Again note that there is an additional array to


store incoming diagonal block
 This is the array we use in the A*B multiplication
Courtesy of Henri Casanova
222 / 235
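A hedged block-level simulation of Fox's algorithm (the broadcast of the k-th diagonal and the vertical shift of B are emulated by index arithmetic), to check the multiply-broadcast-roll pattern.

import numpy as np

q, m = 3, 2
n = q * m
A = np.random.rand(n, n)
B = np.random.rand(n, n)

Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Cblk = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]

for k in range(q):
    for i in range(q):
        # the k-th "diagonal" block A_{i,(i+k)%q} is broadcast along row i
        bufferA = Ablk[i][(i + k) % q]
        for j in range(q):
            Cblk[i][j] += bufferA @ Bblk[i][j]
    # vertical (upward) shift of B
    Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

assert np.allclose(np.block(Cblk), A @ B)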
Parallel
Algorithms

A. Legrand
Performance Analysis
Communications

Matrix
Multiplication
Outer Product
 You’ll have to do it in a homework
Grid Rocks!
Cannon assignment
Fox
Snyder
Data
 Write pseudo-code of the algorithm in
Distribution
more details
 Write the performance analysis

Courtesy of Henri Casanova


223 / 235
Parallel
Algorithms

A. Legrand
Snyder’s Algorithm (1992)
Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
 More complex than Cannon’s or
Cannon
Fox
Snyder Fox’s
Data
Distribution
 First transposes matrix B
 Uses reduction operations (sums) on
the rows of matrix C
 Shifts matrix B

Courtesy of Henri Casanova


224 / 235
Parallel
Algorithms

A. Legrand
Execution Steps...
Communications

Matrix
Multiplication
C00 C01 C02 C03 A00 A01 A02 A03 B00 B01 B02 B03
Outer Product
Grid Rocks!
C10 C11 C12 C13 A10 A11 A12 A13 B10 B11 B12 B13 initial
Cannon
C20 C21 C22 C23 A20 A21 A22 A23 B B B B state
Fox 20 21 22 23
Snyder
Data C30 C31 C32 C33 A30 A31 A32 A33 B30 B31 B32 B33
Distribution
C00 C01 C02 C03 A00 A01 A02 A03 B00 B10 B20 B30
C10 C11 C12 C13 A10 A11 A12 A13 B01 B11 B21 B31
Transpose B
C20 C21 C22 C23 A20 A21 A22 A23 B02 B12 B22 B32
C30 C31 C32 C33 A30 A31 A32 A33 B03 B13 B23 B33

C00 C01 C02 C03 A00 A01 A02 A03 B00 B10 B20 B30
C10 C11 C12 C13 A10 A11 A12 A13 B01 B11 B21 B31 Local
computation
C20 C21 C22 C23 A20 A21 A22 A23 B02 B12 B22 B32
C30 C31 C32 C33 A30 A31 A32 A33 B03 B13 B23 B33 Courtesy of Henri Casanova
225 / 235
Parallel
Algorithms

A. Legrand
Execution Steps...
Communications

Matrix C00 C01 C02 C03 A00 A01 A02 A03 B01 B11 B21 B31
Multiplication
Outer Product C10 C11 C12 C13 A10 A11 A12 A13 B02 B12 B22 B32
Grid Rocks! Shift B
Cannon C20 C21 C22 C23 A20 A21 A22 A23 B03 B13 B23 B32
Fox
Snyder
C30 C31 C32 C33 A30 A31 A32 A33 B00 B10 B20 B30
Data
Distribution
C00 C01 C02 C03 A00 A01 A02 A03 B01 B11 B21 B31
Global
C10 C11 C12 C13 A10 A11 A12 A13 B02 B12 B22 B32
sum
C20 C21 C22 C23 A20 A21 A22 A23 B03 B13 B23 B32 on the rows
C30 C31 C32 C33 A30 A31 A32 A33 B00 B10 B20 B30 of C

C00 C01 C02 C03 A00 A01 A02 A03 B01 B11 B21 B31
C10 C11 C12 C13 A10 A11 A12 A13 B02 B12 B22 B32 Local
C20 C21 C22 C23 A20 A21 A22 A23 B03 B13 B23 B32 computation
C30 C31 C32 C33 A30 A31 A32 A33 B00 B10 B20 B30 Courtesy of Henri Casanova
226 / 235
Parallel
Algorithms

A. Legrand
Execution Steps...
Communications

Matrix
Multiplication
C00 C01 C02 C03 A00 A01 A02 A03 B02 B12 B22 B32
Outer Product
C10 C11 C12 C13 A10 A11 A12 A13 B03 B13 B23 B33
Grid Rocks! Shift B
Cannon
Fox C20 C21 C22 C23 A20 A21 A22 A23 B00 B10 B20 B30
Snyder
Data C30 C31 C32 C33 A30 A31 A32 A33 B01 B11 B21 B31
Distribution
C00 C01 C02 C03 A00 A01 A02 A03 B02 B12 B22 B32
Global
C10 C11 C12 C13 A10 A11 A12 A13 B03 B13 B23 B33 sum
C20 C21 C22 C23 A20 A21 A22 A23 B00 B10 B20 B30 on the rows
of C
C30 C31 C32 C33 A30 A31 A32 A33 B01 B11 B21 B31

C00 C01 C02 C03 A00 A01 A02 A03 B02 B12 B22 B32
C10 C11 C12 C13 A10 A11 A12 A13 B03 B13 B23 B33 Local
C20 C21 C22 C23 A20 A21 A22 A23 B B B B computation
00 10 20 30

C30 C31 C32 C33 A30 A31 A32 A33 B01 B11 B21 B31 Courtesy of Henri Casanova
227 / 235
Parallel
Algorithms

A. Legrand
The Algorithm
Communications

Matrix
Multiplication var A,B,C: array[0..m-1][0..m-1] of real
Outer Product
Grid Rocks! var bufferC: array[0..m-1][0..m-1] of real
Cannon
Fox
Snyder
Transpose B
Data
Distribution MatrixMultiplyAdd(bufferC, A, B, m)
Vertical shifts of B
For k = 1 to q-1
Global sum of bufferC on proc rows into Ci,(i+k-1)%q
MatrixMultiplyAdd(bufferC, A, B, m)
Vertical shift of B
Global sum of bufferC on proc rows into Ci,(i+k-1)%q
Transpose B
Courtesy of Henri Casanova
228 / 235
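For completeness, a hedged block-level sketch of Snyder's scheme (the block transpose of B, the vertical shifts and the row-wise global sums are all emulated sequentially); this only checks the data-movement pattern, not an actual distributed reduction.

import numpy as np

q, m = 3, 2
n = q * m
A = np.random.rand(n, n)
B = np.random.rand(n, n)

Ablk = [[A[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Bblk = [[B[i*m:(i+1)*m, j*m:(j+1)*m] for j in range(q)] for i in range(q)]
Cblk = [[None] * q for _ in range(q)]

# Block-transpose B: processor P_{i,j} now holds B_{j,i}.
Bblk = [[Bblk[j][i] for j in range(q)] for i in range(q)]

for k in range(q):
    # Local products, then a global sum along each processor row,
    # whose result is the block C_{i,(i+k)%q}.
    for i in range(q):
        Cblk[i][(i + k) % q] = sum(Ablk[i][j] @ Bblk[i][j] for j in range(q))
    # Vertical (upward) shift of B.
    Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

assert np.allclose(np.block(Cblk), A @ B)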
Parallel
Algorithms

A. Legrand
Performance Analysis
Communications

Matrix
Multiplication
Outer Product
 The performance analysis isn’t
Grid Rocks!
Cannon fundamentally different than what
we’ve done so far
Fox
Snyder
Data
Distribution
 But it’s a bit cumbersome
 See the textbook
 in particular the description of the
matrix transposition (see also Exercise
5.1)

Courtesy of Henri Casanova


229 / 235
Parallel
Algorithms

A. Legrand
Which Data Distribution?
Communications

Matrix
Multiplication
Outer Product
 So far we’ve seen:
Grid Rocks!
Cannon
 Block Distributions
1-D Distributions
Fox

Snyder
Data
Distribution  2-D Distributions
 Cyclic Distributions
 One may wonder what a good choice
is for a data distribution?
 Many people argue that a good
“Swiss Army knife” is the “2-D block
cyclic distribution
Courtesy of Henri Casanova
230 / 235
Parallel
Algorithms The 2-D block cyclic
A. Legrand
distribution
Communications

Matrix
Multiplication
Outer Product
 Goal: try to have all the advantages
Grid Rocks!
Cannon
of both the horizontal and the
Fox
Snyder
vertical 1-D block cyclic distribution
Data
Distribution  Works whichever way the computation
“progresses”

left-to-right, top-to-bottom, wavefront, etc.
 Consider a number of processors p =
r*c
 arranged in a rxc matrix
 Consider a 2-D matrix of size NxN
 Consider a block size b (which Courtesy of Henri Casanova
divides N) 231 / 235
Parallel
Algorithms The 2-D block cyclic
A. Legrand
distribution
Communications

Matrix
Multiplication
Outer Product
Grid Rocks! P0 P1 P2
Cannon
Fox P3 P4 P5
Snyder
Data
Distribution

b
Courtesy of Henri Casanova
232 / 235
Parallel
Algorithms The 2-D block cyclic
A. Legrand
distribution
Communications

Matrix
Multiplication
Outer Product
Grid Rocks! P0 P1 P2
Cannon
Fox P0 P1 P2 P3 P4 P5
Snyder
Data
Distribution P3 P4 P5
N

b
Courtesy of Henri Casanova
233 / 235
Parallel
Algorithms The 2-D block cyclic
A. Legrand
distribution
Communications

Matrix
Multiplication
Outer Product
Grid Rocks!
Cannon
Fox
Snyder
Data
Distribution
[Figure: the 2x3 processor grid P0..P5 tiled cyclically, in both dimensions, over the NxN matrix in bxb blocks]
 Slight load imbalance
 Becomes negligible with many blocks
 Index computations had better be implemented in separate functions (a sketch follows this slide)
 Also: functions that tell a process who its neighbors are
 Overall, requires a whole infrastructure, but many think you can’t go wrong with this distribution
Courtesy of Henri Casanova
234 / 235
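A minimal sketch of the index computations mentioned above, for an N x N matrix, block size b, and an r x c processor grid; the helper names are hypothetical, and the convention is the one of the picture (blocks dealt out cyclically in both dimensions).

def owner(i, j, b, r, c):
    # which processor (row, column) owns global entry (i, j)?
    return ((i // b) % r, (j // b) % c)

def local_index(i, j, b, r, c):
    # local coordinates of global entry (i, j) on its owner:
    # local block index times b, plus the offset inside the block
    li = (i // (b * r)) * b + i % b
    lj = (j // (b * c)) * b + j % b
    return (li, lj)

# Example: N = 12, b = 2, processors in a 2x3 grid.
print(owner(5, 7, b=2, r=2, c=3))        # -> (0, 0)
print(local_index(5, 7, b=2, r=2, c=3))  # -> (3, 3)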
Parallel
Algorithms

A. Legrand
Conclusion
Communications

Matrix
Multiplication
Outer Product
 All the algorithms we have seen in the
Grid Rocks!
Cannon
semester can be implemented on a 2-D
Fox
Snyder
block cyclic distribution
The code ends up much more complicated
Data
Distribution 

 But one may expect several benefits “for


free”
 The ScaLAPACK library recommends using the 2-D block cyclic distribution
 Although its routines support all other
distributions

Courtesy of Henri Casanova


235 / 235
Parallel
Algorithms

A. Legrand

A. Alexandrov, M. Ionescu, K. Schauser, and C. Scheiman.
LogGP: Incorporating long messages into the LogP model for parallel computation.
Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.

D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken.
LogP: a practical model of parallel computation.
Communications of the ACM, 39(11):78–85, 1996.

R. W. Hockney.
The communication challenge for MPP: Intel Paragon and Meiko CS-2.
Parallel Computing, 20:389–398, 1994.

B. Hong and V. K. Prasanna.
Distributed adaptive task allocation in heterogeneous computing environments to maximize throughput.
In International Parallel and Distributed Processing Symposium IPDPS’2004. IEEE Computer Society Press, 2004.

T. Kielmann, H. E. Bal, and K. Verstoep.
Fast measurement of LogP parameters for message passing platforms.
In Proceedings of the 15th IPDPS, Workshops on Parallel and Distributed Processing, 2000.

Steven H. Low.
A duality model of TCP and queue management algorithms.
IEEE/ACM Transactions on Networking, 2003.

Dong Lu, Yi Qiao, Peter A. Dinda, and Fabián E. Bustamante.
Characterizing and predicting TCP throughput on the wide area network.
In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05), 2005.

Arnaud Legrand, Hélène Renard, Yves Robert, and Frédéric Vivien.
Mapping and load-balancing iterative computations on heterogeneous clusters with shared links.
IEEE Trans. Parallel Distributed Systems, 15(6):546–558, 2004.

Maxime Martinasso.
Analyse et modélisation des communications concurrentes dans les réseaux haute performance.
PhD thesis, Université Joseph Fourier de Grenoble, 2007.

Laurent Massoulié and James Roberts.
Bandwidth sharing: Objectives and algorithms.
In INFOCOM (3), pages 1395–1403, 1999.

Loris Marchal, Yang Yang, Henri Casanova, and Yves Robert.
Steady-state scheduling of multiple divisible load applications on wide-area distributed computing platforms.
Int. Journal of High Performance Computing Applications, (3), 2006.

Frédéric Wagner.
Redistribution de données à travers un réseau haut débit.
PhD thesis, Université Henri Poincaré Nancy 1, 2005.

235 / 235
