12 MPI Program Performance
Introduction
• Asymptotic analysis
– For theoretical ease, performance is sometimes characterized in a large limit. You may encounter the following example: asymptotic analysis reveals that the algorithm requires O((N/P) log(N/P)) time on P processors, where N is some parameter characterizing the problem size.
– This analysis is not always relevant, and can be misleading, because you will often be interested in a regime where the lower-order terms are significant.
Developing Better Models
• Definition of Efficiency
– Relative efficiency is defined as T1/(P*Tp), where T1 is the execution time on one processor and Tp is the execution time on P processors. Absolute efficiency is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
• Definition of Speedup
– Relative speedup is defined as the execution time on one processor over the execution time on P processors (T1/Tp). Absolute speedup is obtained by replacing the execution time on one processor with the execution time of the fastest sequential algorithm.
Developing Better Models
• When you are collecting empirical data, you must take the following into account:
– Accuracy. In general, performance data obtained using sampling
techniques is less accurate than data obtained using counters or timers.
In the case of timers, the accuracy of the clock must also be considered.
– Simplicity. The best tools, in many circumstances, are those that
collect data automatically with little or no programmer intervention and
provide convenient analysis capabilities.
– Flexibility. A flexible tool can easily be extended to collect additional
performance data or to provide different views of the same data.
Flexibility and simplicity are often opposing requirements.
– Intrusiveness. Unless a computer provides hardware support,
performance data collection inevitably introduces some overhead. You
need to be aware of this overhead and account for it when analyzing
data.
– Abstraction. A good performance tool allows data to be examined at a
level of abstraction appropriate for the programming model of the
parallel program.
Finding Bottlenecks with Profiling Tools
• Bottlenecks in your code can be of two types:
– computational bottlenecks (slow serial performance)
– communications bottlenecks
• Tools are available for gathering information about both.
The simplest tool to use is the MPI routine MPI_Wtime, which can give you information about the performance of a particular section of your code. For a more detailed
analysis, you can typically use any of a number of
performance analysis tools designed for either serial or
parallel codes. These are discussed in the next two
sections.
• Definition of MPI_Wtime
– Returns the number of wall-clock seconds (as a floating-point value) that have elapsed since some time in the past.
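A minimal usage sketch in C (this assumes an MPI installation and is compiled with an MPI wrapper such as mpicc):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();    /* timestamp before the section of interest */
    /* ... section of code being timed ... */
    double t1 = MPI_Wtime();    /* timestamp after */

    /* MPI_Wtick() reports the resolution of MPI_Wtime in seconds */
    printf("elapsed: %g s (timer resolution: %g s)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}
```

Because the time origin is arbitrary, only differences between two MPI_Wtime calls are meaningful.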
Serial Profiling Tools
• We discuss the serial tools first since some of these tools form the basis for more sophisticated parallel performance tools.
• Also, you generally want to start from a highly optimized serial code when converting to a parallel implementation.
• Some useful serial performance analysis tools include Speedshop (ssrun, SGI) and the Performance Application Programming Interface (PAPI, http://icl.cs.utk.edu/papi/; many platforms).
• The Speedshop and PAPI tools use hardware event counters within your CPU to gather information about the performance of your code. Thus, they can be used to gather information without recompiling your original code. PAPI also provides the low-level interface for another tool called HPMcount.
Serial Profiling Tools
• Definition of Hardware Event Counters
– Hardware event counters are extra logic gates inserted in the actual processor to keep track of specific computational events; they are usually updated at every clock cycle.
– These counters are non-intrusive and very accurate, because you don't perturb the code whose performance you are trying to assess with extra instrumentation code.
– These tools usually allow you to determine which subroutines in your code are responsible for poor performance.
– More detailed analyses of code blocks within a subroutine can be done by instrumenting the code.
– The computational events that are counted (and the number of events you can track simultaneously) depend on the details of the specific processor, but some common ones are
• Cycles
• Instructions
• Floating point and fixed point operations
• Loads and Stores
• Cache misses
• TLB misses
– These built-in counters can be used to define other useful metrics such as
• CPI: cycles per instruction (or IPC)
• Floating point rate
• Computation intensity
• Instructions per cache miss
• Instructions per load/store
• Load/stores per data cache miss
• Loads per load miss
• Stores per TLB miss
• Cache hit rate
Parallel Profiling Tools