Matrix Multiplications and Collective Communication: Michael Hanke

Michael Hanke
School of Engineering Sciences
Parallel Computations for Large-Scale Problems I
Chapter 5
Outline

1 Introduction
2 The Recursive Doubling Algorithm
3 Matrix-Vector Multiplication
4 Matrix-Matrix Multiplication
5 An Application: Google
Basic Matrix Operations

• Let A and B be two matrices of dimensions M × N and K × L
• Matrix addition C = A + B (defined if M = K and N = L):

    c_{m,n} = a_{m,n} + b_{m,n}   (1 ≤ m ≤ M, 1 ≤ n ≤ N)

• Matrix multiplication C = AB (defined if N = K):

    c_{m,l} = \sum_{n=1}^{N} a_{m,n} b_{n,l}

  c_{m,l} is the product of the m-th row of A and the l-th column of B
Addition of Matrices

for n = 1:N
    for m = 1:M
        c(m,n) = a(m,n)+b(m,n);
    end
end

• Embarrassingly parallel, all data access purely local
• Any data distribution will do, no overlap necessary
Multiplication of Matrices

• Assume that C is initialized to zero.

for l = 1:L
    for m = 1:M
        for n = 1:N
            c(m,l) = c(m,l)+a(m,n)*b(n,l);
        end
    end
end

• The innermost loop implies a global data dependency for every c_{m,l}
The Challenge

Find an algorithm and a data distribution such that
1 data doubling on different processes is avoided;
2 excessive data movement (communication!) is avoided.
What Will Come

Strategy
Consider the loops from the innermost to the outermost one.
1 The innermost (n-) loop is the scalar product of two vectors.
2 The n- and the m-loops are a matrix-vector multiplication for each l.
3 Finally, find an implementation of the complete algorithm.
Scalar Product of Two Vectors

• The scalar product of two vectors x, y ∈ R^M is defined as

    s = \langle x, y \rangle = x^T y = \sum_{m=1}^{M} x_m y_m

• Assume some data distribution on P processors such that x and y are distributed identically
• Then it holds

    s = \sum_{m=1}^{M} x_m y_m = \sum_{p=0}^{P-1} \underbrace{\sum_{i \in I_p} x_i y_i}_{\omega_p} = \sum_{p=0}^{P-1} \omega_p

• Assume that all local computations are done such that ω_p is available
Global Summation

Problem
Compute the sum s on P processors,

    s = ω_0 + ω_1 + · · · + ω_{P−1}

• sequential complexity: O(P) operations
• The MPI solution to this problem: MPI_Reduce (a sketch follows below)
• How could it be realized?
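As a concrete illustration, here is a minimal C sketch of the MPI solution: every rank forms its local partial sum ω_p and MPI_Reduce combines the P values on rank 0. The toy data (block length n_local and the contents of x and y) are assumptions made only for this example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);

    /* toy local data: in a real code these would be this rank's
       blocks of the distributed vectors x and y */
    const int n_local = 4;
    double x[4], y[4], omega = 0.0, s = 0.0;
    for (int i = 0; i < n_local; i++) { x[i] = 1.0; y[i] = p + 1.0; }

    /* local computation of the partial sum omega_p */
    for (int i = 0; i < n_local; i++)
        omega += x[i] * y[i];

    /* global summation s = omega_0 + ... + omega_{P-1}, result on rank 0 */
    MPI_Reduce(&omega, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (p == 0) printf("s = %g\n", s);
    MPI_Finalize();
    return 0;
}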
Global Summation: Idea

Idea (for P = 8): sum the terms pairwise in log_2 P = 3 stages,

    s = ((ω_0 + ω_1) + (ω_2 + ω_3)) + ((ω_4 + ω_5) + (ω_6 + ω_7))

First the pairs ω_0 + ω_1, ω_2 + ω_3, ω_4 + ω_5, ω_6 + ω_7 are formed in parallel, then the partial sums are combined pairwise again, and one final addition yields s.
Global Summation

[Figure: the butterfly communication pattern of recursive doubling for P = 8]
Realization

• Assume that P = 2^D is a power of 2. On process p, the program reads:

s = omega_p;
for d = 0:D-1
    q = bitflip(p,d);
    send(s,q);
    receive(h,q);
    s = s+h;
end

• The bitflip operation inverts bit number d in the binary representation of p.
Comments

• After execution of the program, every processor contains s
• This algorithm is called Recursive Doubling (butterfly algorithm)
• The basic algorithm can be applied in many different contexts (examples: barriers, broadcast)
• Since every processor contains the final sum, its MPI counterpart is MPI_Allreduce
• The algorithm as given may deadlock. Use MPI_Sendrecv; a sketch follows below
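A minimal sketch of the butterfly in C, assuming the number of processes is a power of two: bitflip(p,d) becomes an XOR with the mask 2^d, and MPI_Sendrecv pairs the exchange so that no two blocking sends can face each other.

#include <mpi.h>
#include <stdio.h>

/* Returns the global sum of omega on every rank; assumes the number
   of processes P is a power of two. */
double butterfly_sum(double omega, MPI_Comm comm) {
    int p, P;
    MPI_Comm_rank(comm, &p);
    MPI_Comm_size(comm, &P);
    double s = omega, h;
    for (int mask = 1; mask < P; mask <<= 1) {  /* D = log2(P) stages */
        int q = p ^ mask;  /* bitflip: partner differs in exactly one bit */
        /* the combined send/receive avoids the deadlock of two
           blocking sends meeting each other */
        MPI_Sendrecv(&s, 1, MPI_DOUBLE, q, 0,
                     &h, 1, MPI_DOUBLE, q, 0,
                     comm, MPI_STATUS_IGNORE);
        s += h;
    }
    return s;  /* every rank now holds the full sum */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    double s = butterfly_sum((double)p, MPI_COMM_WORLD);
    printf("rank %d: s = %g\n", p, s);
    MPI_Finalize();
    return 0;
}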
Modification For Any Number of Processes

• Let 2^D < P < 2^{D+1}. The surplus processes first fold their values into the lower 2^D processes, which run the butterfly and finally send the result back:

s = omega_p;
if p >= 2^D
    send(s, bitflip(p,D));
end
if p < P-2^D
    receive(h, bitflip(p,D));
    s = s+h;
end
if p < 2^D
    for d = 0:D-1
        send(s, bitflip(p,d));
        receive(h, bitflip(p,d));
        s = s+h;
    end
end
if p < P-2^D
    send(s, bitflip(p,D));
end
if p >= 2^D
    receive(s, bitflip(p,D));
end
Performance Analysis

• Obviously, T_1^* = (2M − 1)t_a
• The local computation time is t_{comp,1} = (2|I_p| − 1)t_a
• Time for recursive doubling:

    t = D(2(t_{startup} + 1 · t_{data}) + t_a)

• Using a load balanced data distribution, i.e., |I_p| = M/P, it holds

    T_P = (2M/P − 1)t_a + \log_2 P (2(t_{startup} + 1 · t_{data}) + t_a)
Matrix-Vector Multiplication

• For a given M × N matrix A and an N-dimensional vector x, y = Ax is given by

    y_m = \sum_{n=1}^{N} a_{m,n} x_n = \langle a_m, x \rangle

  where a_m is the m-th row of A
• Use a load-balanced data distribution on R = P × Q processors in array topology to store A (no overlap!)
• x must have the same data distribution as the rows of A
• Similarly, y has the same data distribution as the columns of A
Data Distribution

[Figure: A distributed in blocks over a P × Q processor mesh, with x distributed along the rows and y along the columns]

There is additional memory needed to store x and y: in the sequential version, M + N memory locations are necessary. In the parallel version, QM + PN locations are necessary.
Algorithm

• Once the data are distributed, all individual components y_m can be computed in parallel using the Recursive Doubling Algorithm (a sketch follows below)
• Do not forget to block data exchanges to avoid excessive startup times!
• The performance analysis is similar to the one given before
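One possible realization in C, a sketch rather than the lecture's reference code: each rank multiplies its local block of A by its slice of x, and the partial results of one mesh row are combined with MPI_Allreduce (which performs the recursive doubling internally). The row communicator built via MPI_Comm_split, the row-major rank numbering, and the names mloc, nloc are assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Computes this rank's block of y = A*x on a P x Q mesh. */
void parallel_matvec(int mloc, int nloc, const double *A_loc,
                     const double *x_loc, double *y_block,
                     int Q, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* ranks with the same mesh row form one communicator
       (assumes row-major numbering of the mesh) */
    MPI_Comm row_comm;
    MPI_Comm_split(comm, rank / Q, rank % Q, &row_comm);

    /* local partial products */
    double *part = malloc(mloc * sizeof *part);
    for (int m = 0; m < mloc; m++) {
        part[m] = 0.0;
        for (int n = 0; n < nloc; n++)
            part[m] += A_loc[m * nloc + n] * x_loc[n];
    }

    /* recursive doubling across the mesh row; afterwards every rank
       of the row holds its complete block of y */
    MPI_Allreduce(part, y_block, mloc, MPI_DOUBLE, MPI_SUM, row_comm);

    free(part);
    MPI_Comm_free(&row_comm);
}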
Matrix-Matrix Multiplication

• Let A be an M × N matrix, B an N × K matrix. Then C = AB is given by

    c_{m,k} = \sum_{n=1}^{N} a_{m,n} b_{n,k} = \langle a_m, b_k \rangle

  where a_m is the m-th row of A and b_k is the k-th column of B
• Why not simply generalize matrix-vector multiplication?
  We would need excessive additional memory (QMK + PNK)
Potential

Observation
If M = N = K, the matrix multiplication requires M^3 operations on M^2 data. So we expect a good potential for parallelization.

In the following we assume M = N = K and R = P × P.
A Simple Algorithm

• Assume that we have P = M
• Assign (a_m, b_k) to processor (m, k)
• Compute c_{m,k} on processor (m, k)
• Advantages:
  • Computation time t_comp = (2M − 1)t_a
  • No communication at the computation stage
• Drawbacks:
  • A lot of communication for initialization,
  • or, large memory requirements
Cannon's Algorithm

• In a first step, assume P = M
• Data distribution: processor (m − 1, n − 1) holds a_{m,n}, b_{m,n}, c_{m,n}
• Example (M = 3): the element c_{3,2} of C = AB needs the third row of A, (a_{3,1} a_{3,2} a_{3,3}), and the second column of B, (b_{1,2}, b_{2,2}, b_{3,2}), which are spread over different processors
Partial Products

• On the "diagonal" processors (m − 1, m − 1), parts of the sum

    c_{m,m} = a_{m,1} b_{1,m} + · · · + a_{m,m} b_{m,m} + · · · + a_{m,M} b_{M,m}

  are available. For m = 2: processor (1, 1) holds a_{2,2} and b_{2,2} and can form the term a_{2,2} b_{2,2} of c_{2,2}
• Shifting the rows of A cyclically to the right and those of B cyclically downwards provides the next term: processor (1, 1) now holds a_{2,1} and b_{1,2} and adds a_{2,1} b_{1,2} to c_{2,2}
• Repeating this cyclic exchange once again completes the calculation of the diagonal elements
The Outer Diagonal Elements

• Consider element c_{1,2},

    c_{1,2} = a_{1,1} b_{1,2} + a_{1,2} b_{2,2} + · · ·

• While a_{1,2} is available, b_{2,2} is not
• This can be corrected by shifting the second column of B (cyclically) upwards
• In order not to change the indices on the diagonal, the second row of A must be shifted (cyclically) to the left
Initial Data Distribution

• More generally, the matrices A and B must be initialized as follows:
  • The m-th row of A is shifted cyclically m − 1 positions to the left
  • The n-th column of B is shifted cyclically n − 1 positions upwards

For M = 3, the skewed layouts are:

    A:  a11 a12 a13        B:  b11 b22 b33
        a22 a23 a21            b21 b32 b13
        a33 a31 a32            b31 b12 b23
Algorithm

1 Initially, processor (p, q) has elements a_{p+1,q+1}, b_{p+1,q+1}
2 Shift the rows of A and the columns of B into the structure described above
3 for k = 1, . . . , N:
  1 Multiply the local values on each processor
  2 Shift the rows of A and the columns of B cyclically by one processor
4 If necessary: undo the shifts of step 2, restoring the distribution of step 1

Remarks:
• If the number of processors is much less than N^2, a blocked version will do (see the sketch below).
• The last shift in step 3.2 can be omitted.
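The following C sketch realizes the blocked version on a P × P periodic process grid with MPI_Cart_create, MPI_Cart_shift and MPI_Sendrecv_replace. The function names, the row-major block layout, and the choice of shifting A left and B up (matching the skew of step 2) are assumptions, not the lecture's reference code.

#include <mpi.h>

static void local_mm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];  /* C += A*B */
}

/* Each of the P*P ranks owns one nloc x nloc block (N = P*nloc);
   A and B are overwritten by the shifts, C accumulates the result. */
void cannon(int P, int nloc, double *A, double *B, double *C, MPI_Comm comm) {
    int dims[2] = {P, P}, periods[2] = {1, 1}, coords[2], rank, src, dst;
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    const int nb = nloc * nloc;

    /* step 2 (skew): block row m of A moves m positions left,
       block column n of B moves n positions up */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    /* step 3: P rounds of multiply-and-shift
       (the last shift could be omitted, as remarked above) */
    for (int k = 0; k < P; k++) {
        local_mm(nloc, A, B, C);                 /* step 3.1 */
        MPI_Cart_shift(grid, 1, -1, &src, &dst); /* A one block left */
        MPI_Sendrecv_replace(A, nb, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst); /* B one block up */
        MPI_Sendrecv_replace(B, nb, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}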
Efficiency Analysis

Assume: R = P × P and P divides N. Let I_p = N/P denote the local block dimension.

Step 2: Use red-black communication,

    t_2 = 4(t_{startup} + I_p^2 t_{data})

Step 3.1: t_{3,1} = 2 I_p^3 t_a
Step 3.2: t_{3,2} = 4(t_{startup} + I_p^2 t_{data})

    T_R = T_{P×P} = 2P I_p^3 t_a + 4(P + 1)(t_{startup} + I_p^2 t_{data})

    T_s^* = 2N^3 t_a
Speedup

    S_R = \frac{2N^3 t_a}{2P I_p^3 t_a + 4(P + 1)(t_{startup} + I_p^2 t_{data})}
        = \frac{2N^3 t_a}{2P(N/P)^3 t_a + 4(P + 1)(t_{startup} + (N/P)^2 t_{data})}
        = R \cdot \frac{1}{1 + 2\frac{R(\sqrt{R}+1)}{N^3}\frac{t_{startup}}{t_a} + 2\frac{P+1}{N}\frac{t_{data}}{t_a}}
Other Algorithms

There are a number of other algorithms available which have a better efficiency than Cannon's algorithm:
• Strassen's algorithm reduces the complexity of matrix multiplication from O(N^3) to O(N^{2.81...}). This approach can be refined to complexity O(N^{2.376...})
• A refined communication strategy on a differently organized processor topology was proposed by Dekel, Nassimi, and Sahni (DNS algorithm)
Google's Problem

The birth of Google
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: The PageRank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-0120, http://dbpubs.stanford.edu/pub/1999-66

A search engine faces two problems:
1 Find all web pages which contain the search phrase (more generally: satisfy some search criterion)
2 Among all these pages, find the most relevant

Ad 1: Web crawler
Ad 2: PageRank citation ranking
Page Ranking

• Every web page has a number of forward links (pointers to other web pages) and a number of backlinks (web pages pointing to the given page)
• A web page seems to be more important if the number of backlinks is large
• Simply counting the number of backlinks may not be sufficient: a backlink with a high importance should have a higher weight
• This is a recursive definition: a page is important if its backlinks are important.
• But what is the root of this recursion? The web does not have a root :-(
Definition of PageRank

• Let m be a web page
• Let F_m be the set of pages m points to
• Let B_m be the set of pages that point to m
• Let N_m = |F_m| be the number of forward links, and c be a normalization factor
• Then we define PageRank:

    R_m = c \sum_{n \in B_m} \frac{R_n}{N_n}

• Actually, this definition is slightly simplified.
Reformulation of PageRank

• Define a matrix A by

    a_{mn} = \begin{cases} 1/N_n, & \text{if there is a link from } n \text{ to } m, \\ 0, & \text{otherwise} \end{cases}

  (note the direction: column n of A describes the forward links of page n, so that the sum over n ∈ B_m is reproduced)
• Let R = (R_1, . . . , R_M)^T be the PageRank vector. Then it holds

    R = cAR

  (a small worked example follows below)
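To make the reformulation concrete, consider a purely illustrative three-page web with links 1 → 2, 1 → 3, 2 → 3 and 3 → 1, so that N_1 = 2 and N_2 = N_3 = 1. Then

    A = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}

Every column sums to 1, so c = 1 works, and R = AR gives R_1 = R_3, R_2 = R_1/2, R_3 = R_1/2 + R_2, hence R ∝ (2, 1, 2)^T: pages 1 and 3 are equally important, page 2 less so.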
PageRank

Definition
The PageRank of a given connectivity matrix A is an eigenvector R corresponding to the largest eigenvalue c of A.

Remark: Actually, A is a so-called sparse matrix, i.e., a matrix which contains very many zero entries. For the time being, we neglect this property. In connection with algorithms on graphs, we will consider sparse matrices.
The Power Method For Eigenvalue Problems

• Given a square matrix A and an initial guess x^{[0]} ≠ 0
• Form the sequence x^{[k+1]} = Ax^{[k]}, k = 0, 1, . . .
• It can be shown that, for many matrices A and almost all initial guesses x^{[0]}, the sequence {x^{[k]}/‖x^{[k]}‖} converges towards an eigenvector to the largest eigenvalue c of A
• An estimate c_k of c can be obtained by

    c_k = \frac{\langle x^{[k+1]}, x^{[k]} \rangle}{\langle x^{[k]}, x^{[k]} \rangle}
Algorithm

% Let x = x0 be chosen
err = tol;
c = 0;
while err >= tol*abs(c)
    nrm = 1/sqrt(<x,x>);  % recursive doubling
    x = nrm*x;
    y = A*x;              % matrix-vector multiplication
    cnew = <y,x>;         % recursive doubling
    err = abs(cnew-c);
    c = cnew;
    x = y;                % vector transposition
end
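A sketch of the same loop with distributed vectors in C: each rank owns mloc components of x and y, the inner products are completed with MPI_Allreduce (the recursive doubling step of the slides), and matvec() stands for the parallel matrix-vector product. The function names and the assumption that x and y share one distribution are illustrative, not the lecture's code.

#include <mpi.h>
#include <math.h>

/* assumed to compute this rank's block of y = A*x, including any
   communication and vector transposition it needs */
extern void matvec(const double *x, double *y, int mloc);

static double dot(const double *a, const double *b, int n, MPI_Comm comm) {
    double w = 0.0, s;
    for (int i = 0; i < n; i++) w += a[i] * b[i];  /* local omega_p */
    MPI_Allreduce(&w, &s, 1, MPI_DOUBLE, MPI_SUM, comm);
    return s;
}

/* Returns the estimate of the largest eigenvalue c. */
double power_method(double *x, double *y, int mloc,
                    double tol, MPI_Comm comm) {
    double c = 0.0, err = tol;
    while (err >= tol * fabs(c)) {
        double nrm = 1.0 / sqrt(dot(x, x, mloc, comm));
        for (int i = 0; i < mloc; i++) x[i] *= nrm;
        matvec(x, y, mloc);                    /* y = A*x */
        double cnew = dot(y, x, mloc, comm);   /* eigenvalue estimate */
        err = fabs(cnew - c);
        c = cnew;
        for (int i = 0; i < mloc; i++) x[i] = y[i];  /* x = y */
    }
    return c;
}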
Vector Transposition

• Assume a P × Q processor mesh and A distributed accordingly
• Let y be distributed in the same way as the columns of A
• The simple case: P = Q and row and column data distributions are equal:
  • MPI_Alltoall can be used to transpose y (see the sketch below)
• The general case is much more involved. Essentially, it is equivalent to one recursive doubling step
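For the simple case P = Q with equal chunk sizes, a minimal sketch of the redistribution: every rank of a communicator sends one chunk of its piece of y to each other rank and receives one chunk from each in return. The equal chunk size and the communicator layout are assumptions; with unequal pieces one would switch to MPI_Alltoallv.

#include <mpi.h>

/* Redistributes y: rank r's input is split into equal chunks, and
   chunk q is delivered to rank q; symmetrically for the output. */
void transpose_vector(const double *y_in, double *y_out,
                      int chunk, MPI_Comm comm) {
    MPI_Alltoall(y_in, chunk, MPI_DOUBLE,
                 y_out, chunk, MPI_DOUBLE, comm);
}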
Optimization of Parallel Algorithms

• In a sequential algorithm, the program will run faster if the consecutive steps are optimized individually.
• Hence, optimizing a sequential algorithm is a local problem.
• In a parallel environment, the execution time of a given algorithm depends both on the computation time and on the data distribution.
• A certain data distribution may be optimal at one step of the algorithm while suboptimal for another.
• Thus, optimizing a parallel program is a global problem!
