Matrix Multiplications and Collective Communication: Michael Hanke

Michael Hanke
School of Engineering Sciences
Parallel Computations for Large-Scale Problems I
Chapter 5
Outline

1 Introduction
2 The Recursive Doubling Algorithm
3 Matrix-Vector Multiplication
4 Matrix-Matrix Multiplication
5 An Application: Google
Basic Matrix Operations

• Let A and B be two matrices of dimensions M × N and K × L
• Matrix addition C = A + B (defined if M = K and N = L):

    c_{m,n} = a_{m,n} + b_{m,n}   (1 ≤ m ≤ M, 1 ≤ n ≤ N)

• Matrix multiplication C = AB (defined if N = K):

    c_{m,l} = \sum_{n=1}^{N} a_{m,n} b_{n,l}

  c_{m,l} is the product of the m-th row of A and the l-th column of B
Addition of Matrices

for n = 1:N
    for m = 1:M
        c(m,n) = a(m,n)+b(m,n);
    end
end

• Embarrassingly parallel, all data access purely local
• Any data distribution will do, no overlap necessary
Multiplication of Matrices

• Assume that C is initialized to zero.

for l = 1:L
    for m = 1:M
        for n = 1:N
            c(m,l) = c(m,l)+a(m,n)*b(n,l);
        end
    end
end

• The innermost loop implies a global data dependency for every c_{m,l}
The Challenge

Find an algorithm and a data distribution such that
1 data doubling on different processes is avoided;
2 excessive data movement (communication!) is avoided.
What Will Come

Strategy
Consider the loops from the innermost to the outermost one.
1 The innermost (n-) loop is the scalar product of two vectors.
2 The n- and the m-loops are a matrix-vector multiplication for each l.
3 Finally, find an implementation of the complete algorithm.
Scalar Product of Two Vectors

• The scalar product of two vectors x, y ∈ R^M is defined as

    s = \langle x, y \rangle = x^T y = \sum_{m=1}^{M} x_m y_m

• Assume some data distribution on P processors such that x and y are distributed identically
• Then it holds

    s = \sum_{m=1}^{M} x_m y_m = \sum_{p=0}^{P-1} \underbrace{\sum_{i \in I_p} x_i y_i}_{\omega_p} = \sum_{p=0}^{P-1} \omega_p

• Assume that all local computations are done such that ω_p is available
Global Summation

Problem
Compute the sum s on P processors,

    s = ω_0 + ω_1 + · · · + ω_{P−1}

• sequential complexity: O(P) operations
• The MPI solution to this problem: MPI_Reduce (a sketch follows below)
• How could it be realized?
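As a concrete illustration, here is a minimal C sketch of the MPI solution: every rank forms its local partial sum ω_p and MPI_Reduce combines the P values on rank 0. The toy data (block length n_local and the contents of x and y) are assumptions made only for this example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);

    /* toy local data: in a real code these would be this rank's
       blocks of the distributed vectors x and y */
    const int n_local = 4;
    double x[4], y[4], omega = 0.0, s = 0.0;
    for (int i = 0; i < n_local; i++) { x[i] = 1.0; y[i] = p + 1.0; }

    /* local computation of the partial sum omega_p */
    for (int i = 0; i < n_local; i++)
        omega += x[i] * y[i];

    /* global summation s = omega_0 + ... + omega_{P-1}, result on rank 0 */
    MPI_Reduce(&omega, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (p == 0) printf("s = %g\n", s);
    MPI_Finalize();
    return 0;
}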
Global Summation: Idea

Idea (for P = 8): sum the terms pairwise in log_2 P = 3 stages,

    s = ((ω_0 + ω_1) + (ω_2 + ω_3)) + ((ω_4 + ω_5) + (ω_6 + ω_7))

First the pairs ω_0 + ω_1, ω_2 + ω_3, ω_4 + ω_5, ω_6 + ω_7 are formed in parallel, then the partial sums are combined pairwise again, and one final addition yields s.
Global Summation

[Figure: the butterfly communication pattern of recursive doubling for P = 8]
Realization

• Assume that P = 2^D is a power of 2. On process p, the program reads:

s = omega_p;
for d = 0:D-1
    q = bitflip(p,d);
    send(s,q);
    receive(h,q);
    s = s+h;
end

• The bitflip operation inverts bit number d in the binary representation of p.
Comments

• After execution of the program, every processor contains s
• This algorithm is called Recursive Doubling (butterfly algorithm)
• The basic algorithm can be applied in many different contexts (examples: barriers, broadcast)
• Since every processor contains the final sum, its MPI counterpart is MPI_Allreduce
• The algorithm as given may deadlock. Use MPI_Sendrecv; a sketch follows below
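A minimal sketch of the butterfly in C, assuming the number of processes is a power of two: bitflip(p,d) becomes an XOR with the mask 2^d, and MPI_Sendrecv pairs the exchange so that no two blocking sends can face each other.

#include <mpi.h>
#include <stdio.h>

/* Returns the global sum of omega on every rank; assumes the number
   of processes P is a power of two. */
double butterfly_sum(double omega, MPI_Comm comm) {
    int p, P;
    MPI_Comm_rank(comm, &p);
    MPI_Comm_size(comm, &P);
    double s = omega, h;
    for (int mask = 1; mask < P; mask <<= 1) {  /* D = log2(P) stages */
        int q = p ^ mask;  /* bitflip: partner differs in exactly one bit */
        /* the combined send/receive avoids the deadlock of two
           blocking sends meeting each other */
        MPI_Sendrecv(&s, 1, MPI_DOUBLE, q, 0,
                     &h, 1, MPI_DOUBLE, q, 0,
                     comm, MPI_STATUS_IGNORE);
        s += h;
    }
    return s;  /* every rank now holds the full sum */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    double s = butterfly_sum((double)p, MPI_COMM_WORLD);
    printf("rank %d: s = %g\n", p, s);
    MPI_Finalize();
    return 0;
}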
Modification For Any Number of Processes

• Let 2^D < P < 2^{D+1}. The surplus processes first fold their values into the lower 2^D processes, which run the butterfly and finally send the result back:

s = omega_p;
if p >= 2^D
    send(s, bitflip(p,D));
end
if p < P-2^D
    receive(h, bitflip(p,D));
    s = s+h;
end
if p < 2^D
    for d = 0:D-1
        send(s, bitflip(p,d));
        receive(h, bitflip(p,d));
        s = s+h;
    end
end
if p < P-2^D
    send(s, bitflip(p,D));
end
if p >= 2^D
    receive(s, bitflip(p,D));
end
Performance Analysis

• Obviously, T_1^* = (2M − 1)t_a
• The local computation time is t_{comp,1} = (2|I_p| − 1)t_a
• Time for recursive doubling:

    t = D(2(t_{startup} + 1 · t_{data}) + t_a)

• Using a load balanced data distribution, i.e., |I_p| = M/P, it holds

    T_P = (2M/P − 1)t_a + \log_2 P (2(t_{startup} + 1 · t_{data}) + t_a)
Matrix-Vector Multiplication

• For a given M × N matrix A and an N-dimensional vector x, y = Ax is given by

    y_m = \sum_{n=1}^{N} a_{m,n} x_n = \langle a_m, x \rangle

  where a_m is the m-th row of A
• Use a load-balanced data distribution on R = P × Q processors in array topology to store A (no overlap!)
• x must have the same data distribution as the rows of A
• Similarly, y has the same data distribution as the columns of A
Data Distribution

[Figure: A distributed in blocks over a P × Q processor mesh, with x distributed along the rows and y along the columns]

There is additional memory needed to store x and y: in the sequential version, M + N memory locations are necessary. In the parallel version, QM + PN locations are necessary.
Algorithm

• Once the data are distributed, all individual components y_m can be computed in parallel using the Recursive Doubling Algorithm (a sketch follows below)
• Do not forget to block data exchanges to avoid excessive startup times!
• The performance analysis is similar to the one given before
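One possible realization in C, a sketch rather than the lecture's reference code: each rank multiplies its local block of A by its slice of x, and the partial results of one mesh row are combined with MPI_Allreduce (which performs the recursive doubling internally). The row communicator built via MPI_Comm_split, the row-major rank numbering, and the names mloc, nloc are assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Computes this rank's block of y = A*x on a P x Q mesh. */
void parallel_matvec(int mloc, int nloc, const double *A_loc,
                     const double *x_loc, double *y_block,
                     int Q, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* ranks with the same mesh row form one communicator
       (assumes row-major numbering of the mesh) */
    MPI_Comm row_comm;
    MPI_Comm_split(comm, rank / Q, rank % Q, &row_comm);

    /* local partial products */
    double *part = malloc(mloc * sizeof *part);
    for (int m = 0; m < mloc; m++) {
        part[m] = 0.0;
        for (int n = 0; n < nloc; n++)
            part[m] += A_loc[m * nloc + n] * x_loc[n];
    }

    /* recursive doubling across the mesh row; afterwards every rank
       of the row holds its complete block of y */
    MPI_Allreduce(part, y_block, mloc, MPI_DOUBLE, MPI_SUM, row_comm);

    free(part);
    MPI_Comm_free(&row_comm);
}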
Matrix-Matrix Multiplication

• Let A be an M × N matrix, B an N × K matrix. Then C = AB is given by

    c_{m,k} = \sum_{n=1}^{N} a_{m,n} b_{n,k} = \langle a_m, b_k \rangle

  where a_m is the m-th row of A and b_k is the k-th column of B
• Why not simply generalize matrix-vector multiplication?
  We would need excessive additional memory (QMK + PNK)
Potential

Observation
If M = N = K, the matrix multiplication requires M^3 operations on M^2 data. So we expect a good potential for parallelization.

In the following we assume M = N = K and R = P × P.
A Simple Algorithm

• Assume that we have P = M
• Assign (a_m, b_k) to processor (m, k)
• Compute c_{m,k} on processor (m, k)
• Advantages:
  • Computation time t_comp = (2M − 1)t_a
  • No communication at the computation stage
• Drawbacks:
  • A lot of communication for initialization,
  • or, large memory requirements
Cannon's Algorithm

• In a first step, assume P = M
• Data distribution: processor (m − 1, n − 1) holds a_{m,n}, b_{m,n}, c_{m,n}
• Example (M = 3): the element c_{3,2} of C = AB needs the third row of A, (a_{3,1} a_{3,2} a_{3,3}), and the second column of B, (b_{1,2}, b_{2,2}, b_{3,2}), which are spread over different processors
Partial Products

• On the "diagonal" processors (m − 1, m − 1), parts of the sum

    c_{m,m} = a_{m,1} b_{1,m} + · · · + a_{m,m} b_{m,m} + · · · + a_{m,M} b_{M,m}

  are available. For m = 2: processor (1, 1) holds a_{2,2} and b_{2,2} and can form the term a_{2,2} b_{2,2} of c_{2,2}
• Shifting the rows of A cyclically to the right and those of B cyclically downwards provides the next term: processor (1, 1) now holds a_{2,1} and b_{1,2} and adds a_{2,1} b_{1,2} to c_{2,2}
• Repeating this cyclic exchange once again completes the calculation of the diagonal elements
The Outer Diagonal Elements

• Consider element c_{1,2},

    c_{1,2} = a_{1,1} b_{1,2} + a_{1,2} b_{2,2} + · · ·

• While a_{1,2} is available, b_{2,2} is not
• This can be corrected by shifting the second column of B (cyclically) upwards
• In order not to change the indices on the diagonal, the second row of A must be shifted (cyclically) to the left
Initial Data Distribution

• More generally, the matrices A and B must be initialized as follows:
  • The m-th row of A is shifted cyclically m − 1 positions to the left
  • The n-th column of B is shifted cyclically n − 1 positions upwards

For M = 3, the skewed layouts are:

    A:  a11 a12 a13        B:  b11 b22 b33
        a22 a23 a21            b21 b32 b13
        a33 a31 a32            b31 b12 b23
Algorithm

1 Initially, processor (p, q) has elements a_{p+1,q+1}, b_{p+1,q+1}
2 Shift the rows of A and the columns of B into the structure described above
3 for k = 1, . . . , N:
  1 Multiply the local values on each processor
  2 Shift the rows of A and the columns of B cyclically by one processor
4 If necessary: undo the shifts of step 2, restoring the distribution of step 1

Remarks:
• If the number of processors is much less than N^2, a blocked version will do (see the sketch below).
• The last shift in step 3.2 can be omitted.
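The following C sketch realizes the blocked version on a P × P periodic process grid with MPI_Cart_create, MPI_Cart_shift and MPI_Sendrecv_replace. The function names, the row-major block layout, and the choice of shifting A left and B up (matching the skew of step 2) are assumptions, not the lecture's reference code.

#include <mpi.h>

static void local_mm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];  /* C += A*B */
}

/* Each of the P*P ranks owns one nloc x nloc block (N = P*nloc);
   A and B are overwritten by the shifts, C accumulates the result. */
void cannon(int P, int nloc, double *A, double *B, double *C, MPI_Comm comm) {
    int dims[2] = {P, P}, periods[2] = {1, 1}, coords[2], rank, src, dst;
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    const int nb = nloc * nloc;

    /* step 2 (skew): block row m of A moves m positions left,
       block column n of B moves n positions up */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    /* step 3: P rounds of multiply-and-shift
       (the last shift could be omitted, as remarked above) */
    for (int k = 0; k < P; k++) {
        local_mm(nloc, A, B, C);                 /* step 3.1 */
        MPI_Cart_shift(grid, 1, -1, &src, &dst); /* A one block left */
        MPI_Sendrecv_replace(A, nb, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst); /* B one block up */
        MPI_Sendrecv_replace(B, nb, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}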
Efficiency Analysis

Assume: R = P × P and P divides N. Let I_p = N/P denote the local block dimension.

Step 2: Use red-black communication,

    t_2 = 4(t_{startup} + I_p^2 t_{data})

Step 3.1: t_{3,1} = 2 I_p^3 t_a
Step 3.2: t_{3,2} = 4(t_{startup} + I_p^2 t_{data})

    T_R = T_{P×P} = 2P I_p^3 t_a + 4(P + 1)(t_{startup} + I_p^2 t_{data})

    T_s^* = 2N^3 t_a
Speedup

    S_R = \frac{2N^3 t_a}{2P I_p^3 t_a + 4(P + 1)(t_{startup} + I_p^2 t_{data})}
        = \frac{2N^3 t_a}{2P(N/P)^3 t_a + 4(P + 1)(t_{startup} + (N/P)^2 t_{data})}
        = R \cdot \frac{1}{1 + 2\frac{R(\sqrt{R}+1)}{N^3}\frac{t_{startup}}{t_a} + 2\frac{P+1}{N}\frac{t_{data}}{t_a}}
Other Algorithms

There are a number of other algorithms available which have a better efficiency than Cannon's algorithm:
• Strassen's algorithm reduces the complexity of matrix multiplication from O(N^3) to O(N^{2.81...}). This approach can be refined to complexity O(N^{2.376...})
• A refined communication strategy on a differently organized processor topology was proposed by Dekel, Nassimi, and Sahni (DNS algorithm)
Google's Problem

The birth of Google
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: The PageRank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-0120, http://dbpubs.stanford.edu/pub/1999-66

A search engine faces two problems:
1 Find all web pages which contain the search phrase (more generally: satisfy some search criterion)
2 Among all these pages, find the most relevant

Ad 1: Web crawler
Ad 2: PageRank citation ranking
Page Ranking

• Every web page has a number of forward links (pointers to other web pages) and a number of backlinks (web pages pointing to the given page)
• A web page seems to be more important if the number of backlinks is large
• Simply counting the number of backlinks may not be sufficient: a backlink with a high importance should have a higher weight
• This is a recursive definition: a page is important if its backlinks are important.
• But what is the root of this recursion? The web does not have a root :-(
Definition of PageRank

• Let m be a web page
• Let F_m be the set of pages m points to
• Let B_m be the set of pages that point to m
• Let N_m = |F_m| be the number of forward links, and c be a normalization factor
• Then we define PageRank:

    R_m = c \sum_{n \in B_m} \frac{R_n}{N_n}

• Actually, this definition is slightly simplified.
Reformulation of PageRank

• Define a matrix A by

    a_{mn} = \begin{cases} 1/N_n, & \text{if there is a link from } n \text{ to } m, \\ 0, & \text{otherwise} \end{cases}

  (note the direction: column n of A describes the forward links of page n, so that the sum over n ∈ B_m is reproduced)
• Let R = (R_1, . . . , R_M)^T be the PageRank vector. Then it holds

    R = cAR

  (a small worked example follows below)
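To make the reformulation concrete, consider a purely illustrative three-page web with links 1 → 2, 1 → 3, 2 → 3 and 3 → 1, so that N_1 = 2 and N_2 = N_3 = 1. Then

    A = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}

Every column sums to 1, so c = 1 works, and R = AR gives R_1 = R_3, R_2 = R_1/2, R_3 = R_1/2 + R_2, hence R ∝ (2, 1, 2)^T: pages 1 and 3 are equally important, page 2 less so.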
PageRank

Definition
The PageRank of a given connectivity matrix A is an eigenvector R corresponding to the largest eigenvalue c of A.

Remark: Actually, A is a so-called sparse matrix, i.e., a matrix which contains very many zero entries. For the time being, we neglect this property. In connection with algorithms on graphs, we will consider sparse matrices.
The Power Method For Eigenvalue Problems

• Given a square matrix A and an initial guess x^{[0]} ≠ 0
• Form the sequence x^{[k+1]} = Ax^{[k]}, k = 0, 1, . . .
• It can be shown that, for many matrices A and almost all initial guesses x^{[0]}, the sequence {x^{[k]}/‖x^{[k]}‖} converges towards an eigenvector to the largest eigenvalue c of A
• An estimate c_k of c can be obtained by

    c_k = \frac{\langle x^{[k+1]}, x^{[k]} \rangle}{\langle x^{[k]}, x^{[k]} \rangle}
Algorithm

% Let x = x0 be chosen
err = tol;
c = 0;
while err >= tol*abs(c)
    nrm = 1/sqrt(<x,x>);  % recursive doubling
    x = nrm*x;
    y = A*x;              % matrix-vector multiplication
    cnew = <y,x>;         % recursive doubling
    err = abs(cnew-c);
    c = cnew;
    x = y;                % vector transposition
end
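A sketch of the same loop with distributed vectors in C: each rank owns mloc components of x and y, the inner products are completed with MPI_Allreduce (the recursive doubling step of the slides), and matvec() stands for the parallel matrix-vector product. The function names and the assumption that x and y share one distribution are illustrative, not the lecture's code.

#include <mpi.h>
#include <math.h>

/* assumed to compute this rank's block of y = A*x, including any
   communication and vector transposition it needs */
extern void matvec(const double *x, double *y, int mloc);

static double dot(const double *a, const double *b, int n, MPI_Comm comm) {
    double w = 0.0, s;
    for (int i = 0; i < n; i++) w += a[i] * b[i];  /* local omega_p */
    MPI_Allreduce(&w, &s, 1, MPI_DOUBLE, MPI_SUM, comm);
    return s;
}

/* Returns the estimate of the largest eigenvalue c. */
double power_method(double *x, double *y, int mloc,
                    double tol, MPI_Comm comm) {
    double c = 0.0, err = tol;
    while (err >= tol * fabs(c)) {
        double nrm = 1.0 / sqrt(dot(x, x, mloc, comm));
        for (int i = 0; i < mloc; i++) x[i] *= nrm;
        matvec(x, y, mloc);                    /* y = A*x */
        double cnew = dot(y, x, mloc, comm);   /* eigenvalue estimate */
        err = fabs(cnew - c);
        c = cnew;
        for (int i = 0; i < mloc; i++) x[i] = y[i];  /* x = y */
    }
    return c;
}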
Vector Transposition

• Assume a P × Q processor mesh and A distributed accordingly
• Let y be distributed in the same way as the columns of A
• The simple case: P = Q and row and column data distributions are equal:
  • MPI_Alltoall can be used to transpose y (see the sketch below)
• The general case is much more involved. Essentially, it is equivalent to one recursive doubling step
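For the simple case P = Q with equal chunk sizes, a minimal sketch of the redistribution: every rank of a communicator sends one chunk of its piece of y to each other rank and receives one chunk from each in return. The equal chunk size and the communicator layout are assumptions; with unequal pieces one would switch to MPI_Alltoallv.

#include <mpi.h>

/* Redistributes y: rank r's input is split into equal chunks, and
   chunk q is delivered to rank q; symmetrically for the output. */
void transpose_vector(const double *y_in, double *y_out,
                      int chunk, MPI_Comm comm) {
    MPI_Alltoall(y_in, chunk, MPI_DOUBLE,
                 y_out, chunk, MPI_DOUBLE, comm);
}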
Optimization of Parallel Algorithms

• In a sequential algorithm, the program will run faster if the consecutive steps are optimized individually.
• Hence, optimizing a sequential algorithm is a local problem.
• In a parallel environment, the execution time of a given algorithm depends both on the computation time and on the data distribution.
• A certain data distribution may be optimal at one step of the algorithm while suboptimal for another.
• Thus, optimizing a parallel program is a global problem!
