Parallel Computing: MPI - III

CSCI-UA.0480-003
Parallel Computing
MPI - III
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com

Many slides of this lecture are adapted and slightly modified from:
• Gerassimos Barlas
• Peter S. Pacheco
Collective vs. point-to-point
Data distributions

Sequential version
Different partitions of a 12-component vector among 3 processes

Block-cyclic: when is it useful? e.g., computing the factorial of x[i].
Block: components 0, 1, 2, 3 involve little work while 8, 9, 10, 11 involve much work → load imbalance.
Cyclic: may do unnecessary work.
• Block: Assign blocks of consecutive components to each process.
• Cyclic: Assign components in a round robin fashion.
• Block-cyclic: Use a cyclic distribution of blocks of components.
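To make these three mappings concrete, here is a small self-contained sketch (not from the slides; the values n = 12, comm_sz = 3, and block size blk = 2 are illustrative) that prints which global components each rank would own under each distribution:

/* Hypothetical index-mapping sketch for block, cyclic, and block-cyclic
   distributions of an n-component vector among comm_sz processes.
   Assumes comm_sz evenly divides n. */
#include <stdio.h>

int main(void) {
    int n = 12, comm_sz = 3, blk = 2;   /* blk = block size for block-cyclic */
    int local_n = n / comm_sz;          /* components owned by each process  */

    for (int rank = 0; rank < comm_sz; rank++) {
        printf("rank %d block       : ", rank);
        for (int i = 0; i < local_n; i++)      /* consecutive chunk           */
            printf("%2d ", rank * local_n + i);

        printf("\nrank %d cyclic      : ", rank);
        for (int i = 0; i < local_n; i++)      /* round robin                 */
            printf("%2d ", rank + i * comm_sz);

        printf("\nrank %d block-cyclic: ", rank);
        for (int i = 0; i < local_n; i++)      /* round robin over blocks     */
            printf("%2d ", (i / blk) * comm_sz * blk + rank * blk + i % blk);
        printf("\n");
    }
    return 0;
}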
Parallel implementation of
vector addition

How will you distribute parts of x[] and y[] to processes?
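If x[] and y[] are block-distributed, each process simply adds the components it owns. A minimal sketch in the textbook's style (the names local_x, local_y, local_z, and local_n are conventions assumed here):

/* Each process adds only its own block of components;
   local_n = n / comm_sz components per process. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}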

Scatter
• Read an entire vector on process 0
• MPI_Scatter sends the needed
components to each of the other
processes.
Note: the "send" includes the sender itself, so the source process needs two buffers: a send buffer and a receive buffer.
send_count: the number of data items going to each process.
Data is delivered to the receive buffers in rank order.

Important:
• All arguments are important for the source process (process 0 in our example)
• For all other processes, only recv_buf_p, recv_count, recv_type, src_proc,
and comm are important
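For reference, the signature of MPI_Scatter, annotated with the parameter names used on these slides:

int MPI_Scatter(
    void*         send_buf_p,   /* data to distribute; used only on src_proc */
    int           send_count,   /* # data items going to EACH process        */
    MPI_Datatype  send_type,
    void*         recv_buf_p,   /* where each process stores its own block   */
    int           recv_count,   /* # data items received per process         */
    MPI_Datatype  recv_type,
    int           src_proc,     /* rank of the process holding the full data */
    MPI_Comm      comm);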
Reading and distributing a vector

Process 0 itself also receives data.
If the other processes do not also call MPI_Scatter (i.e., if there is no else { } branch), the call on process 0 never completes → deadlock → all processes blocked.
If free(a) were written outside the if block, then every process would need to free a, but not every process has allocated a (only process 0 has).
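A sketch of what such a Read_vector routine might look like (function and variable names are illustrative; error checking is omitted; assumes <stdio.h>, <stdlib.h>, and <mpi.h> are included):

/* Process 0 reads the whole n-component vector and scatters blocks of
   local_n = n / comm_sz components to every process (including itself). */
void Read_vector(double local_a[], int local_n, int n,
                 int my_rank, MPI_Comm comm) {
    double* a = NULL;

    if (my_rank == 0) {
        a = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            scanf("%lf", &a[i]);
        MPI_Scatter(a, local_n, MPI_DOUBLE,
                    local_a, local_n, MPI_DOUBLE, 0, comm);
        free(a);                 /* only process 0 allocated a */
    } else {
        /* without this matching call, the scatter would never complete:
           deadlock, all processes blocked */
        MPI_Scatter(a, local_n, MPI_DOUBLE,
                    local_a, local_n, MPI_DOUBLE, 0, comm);
    }
}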
• send_buf_p
– is not used except by the sender.
– However, on the other processes it must still be defined (NULL is fine) for the code to be correct.
– On the sender it must have at least communicator_size * send_count elements.
• All processes must call MPI_Scatter, not only the sender.
• send_count is the number of data items sent to each process.
• recv_buf_p must have at least recv_count (= send_count) elements.
• MPI_Scatter uses block distribution
Before MPI_Scatter, process 0 holds the whole vector: 0 1 2 3 4 5 6 7 8
After MPI_Scatter (block distribution):
Process 0: 0 1 2
Process 1: 3 4 5
Process 2: 6 7 8
Gather
• MPI_Gather collects all of the components of the vector onto the destination process (dest_proc), ordered by rank.

send_count: the number of elements in send_buf_p.
recv_count: the number of elements for any single receive.

Important:
• All arguments are important for the destination process.
• For all other processes, only send_buf_p, send_count, send_type, dest_proc,
and comm are important
Print a distributed vector (1)

Print a distributed vector (2)

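These slides show a routine that gathers the distributed blocks onto process 0 and prints the whole vector. A sketch in the same style (the name Print_vector and its arguments are assumptions for this illustration; assumes <stdio.h>, <stdlib.h>, and <mpi.h> are included):

/* Gather all local blocks onto process 0 (in rank order) and print them. */
void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
    double* b = NULL;

    if (my_rank == 0) {
        b = malloc(n * sizeof(double));
        MPI_Gather(local_b, local_n, MPI_DOUBLE,
                   b, local_n, MPI_DOUBLE, 0, comm);
        printf("%s\n", title);
        for (int i = 0; i < n; i++)
            printf("%f ", b[i]);
        printf("\n");
        free(b);
    } else {
        /* every process must make the collective call */
        MPI_Gather(local_b, local_n, MPI_DOUBLE,
                   b, local_n, MPI_DOUBLE, 0, comm);
    }
}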
Allgather
• Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
• As usual, recv_count is the amount of data
being received from each process.

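For reference, its signature (note there is no destination rank: every process in the communicator receives the concatenated result):

int MPI_Allgather(
    void*         send_buf_p,   /* each process's contribution                  */
    int           send_count,
    MPI_Datatype  send_type,
    void*         recv_buf_p,   /* gets the full concatenation on EVERY process */
    int           recv_count,   /* # items received from each process           */
    MPI_Datatype  recv_type,
    MPI_Comm      comm);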
Matrix-vector multiplication

The i-th component of y is the dot product of the i-th row of A with x:
y[i] = A[i][0]*x[0] + A[i][1]*x[1] + … + A[i][n-1]*x[n-1]

Matrix-vector multiplication

Pseudo-code: Serial Version

C-style two-dimensional arrays are stored in row-major order, so the matrix can be kept as a single one-dimensional array in which element A[i][j] is stored at A[i*n + j].

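A sketch of the serial version with the matrix stored as a flat, row-major array (function and variable names are illustrative):

/* Serial y = A*x for an m-by-n matrix A stored row-major in a 1-D array. */
void Serial_mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];   /* A[i][j] lives at A[i*n + j] */
    }
}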
Serial matrix-vector multiplication

Let’s assume x[] is distributed among the different processes


An MPI matrix-vector
multiplication function (1)

An MPI matrix-vector
multiplication function (2)

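These two slides show the parallel routine: each process owns a block of rows of A and a block of x, gathers the complete x with MPI_Allgather, and then computes its own block of y. A sketch along those lines (names are illustrative; assumes <stdlib.h> and <mpi.h> are included and that n = local_n * comm_sz):

/* Each process holds local_m rows of A (local_A, row-major), local_n
   components of x (local_x), and computes local_m components of y. */
void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
    double* x = malloc(n * sizeof(double));

    /* one collective call gathers the full x on every process */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                  x, local_n, MPI_DOUBLE, comm);

    for (int local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[local_i] += local_A[local_i*n + j] * x[j];
    }
    free(x);
}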
Keep in mind …
• In distributed memory systems,
communication is more expensive than
computation.
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.
Derived datatypes
• Used to represent any collection of data
items
• If a function that sends data knows this
information about a collection of data items,
it can collect the items from memory before
they are sent.
• A function that receives data can distribute
the items into their correct destinations in
memory when they’re received.

Derived datatypes
• A sequence of basic MPI data types
together with a displacement for each
of the data types.
Example: a and b are doubles and n is an int. The figure lists the address in memory where each variable is stored and its displacement from the beginning of the type (we assume the type starts at a).

MPI_Type_create_struct

• Builds a derived datatype that consists of individual elements that have different basic types.
• count: the number of elements in the type.
• array_of_displacements: each displacement is measured from the address of item 0.
• MPI_Aint: an integer type that is big enough to store an address on the system.
Why use MPI_Get_address rather than & directly? Because some machines use 32-bit addresses and others 64-bit, and MPI_Aint (filled in by MPI_Get_address) holds an address portably on either.
Before you start using your new data type, call MPI_Type_commit. This allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions.

When you are finished with your new type, call MPI_Type_free. This frees any additional storage used.

Example (1)

Example (2)

Example (3)

The receiving end can use the received complex data item as if it were a structure.
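The example on these slides builds a derived datatype describing two doubles and an int, commits it, and then sends all three values in a single message. A sketch along those lines (the name Build_mpi_type and the broadcast in the usage note are assumptions for this illustration; assumes <mpi.h> is included):

/* Build a derived datatype for (double a, double b, int n). */
void Build_mpi_type(double* a_p, double* b_p, int* n_p,
                    MPI_Datatype* input_mpi_t_p) {
    int          block_lengths[3] = {1, 1, 1};
    MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
    MPI_Aint     a_addr, b_addr, n_addr;
    MPI_Aint     displacements[3] = {0};

    /* MPI_Get_address fills an MPI_Aint, which is big enough to hold an
       address whether the machine uses 32-bit or 64-bit addresses. */
    MPI_Get_address(a_p, &a_addr);
    MPI_Get_address(b_p, &b_addr);
    MPI_Get_address(n_p, &n_addr);
    displacements[1] = b_addr - a_addr;   /* displacement of b from a */
    displacements[2] = n_addr - a_addr;   /* displacement of n from a */

    MPI_Type_create_struct(3, block_lengths, displacements, types,
                           input_mpi_t_p);
    MPI_Type_commit(input_mpi_t_p);       /* commit before first use  */
}

/* Possible usage: every process calls Build_mpi_type, then, e.g.,
     MPI_Bcast(&a, 1, input_mpi_t, 0, MPI_COMM_WORLD);
   and the receivers can use a, b, n as ordinary variables.
   When done:  MPI_Type_free(&input_mpi_t);  frees any additional storage. */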
MEASURING TIME IN MPI
We have seen in the past …
• time in Linux
• clock() inside your code
• Does MPI offer anything else?
Elapsed parallel time
• MPI_Wtime returns the number of seconds that have elapsed since some time in the past.
• It gives the elapsed (wall-clock) time for the calling process.
How to Sync Processes?
MPI_Barrier

• Ensures that no process will return from calling it until every process in the communicator has started calling it.
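A common pattern that combines the two calls: synchronize with MPI_Barrier, time the region with MPI_Wtime on every process, and report the slowest process's elapsed time. This is a sketch; the reduction to the maximum and the names comm and my_rank are assumptions here:

/* Timing a code section across all processes (fragment inside a function;
   assumes <stdio.h> and <mpi.h> are included). */
double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(comm);               /* start everyone at (roughly) the same time */
local_start = MPI_Wtime();
/* ... code being timed, e.g. the matrix-vector multiplication ... */
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* the parallel run-time is that of the slowest process */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", elapsed);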

Let’s see how we can analyze the
performance of an MPI program
The matrix-vector multiplication
Run-times of serial and parallel matrix-vector multiplication (in seconds)

Speedups of Parallel Matrix-Vector Multiplication

Efficiencies of Parallel Matrix-Vector Multiplication
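For reference (standard definitions, not stated explicitly on these slides), the quantities plotted here are:

speedup:    S(n, p) = T_serial(n) / T_parallel(n, p)
efficiency: E(n, p) = S(n, p) / p

where n is the problem size (here, the order of the matrix) and p is the number of processes.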

Conclusions
• Reducing messages is a good
performance strategy!
– Collective vs point-to-point
• Distributing a fixed amount of data
among several messages is more
expensive than sending a single big
message.
