
MPI Application Tune-Up

Four Steps to Performance

Abstract
Cluster systems continue to grow in complexity and capability, and getting optimal performance
can be challenging. Making sense of the MPI communications, whether determining load
balance across processes/ranks or finding platform bandwidth limitations, can be daunting, and
the outcome can mean very different levels of performance. A four-step process is outlined using
a Poisson solver implemented as an MPI application. Intel® Trace Analyzer and Collector and
Intel® VTune™ Amplifier XE are used for the profiling and analysis to demonstrate the process.

This paper is a first introduction to and overview of the methodology, with emphasis on the most
important features that Intel® Trace Analyzer and Collector (1) (2) and VTune™ Amplifier XE (3)
offer for analysis of pure MPI applications on HPC clusters, using a Poisson solver as an
illustrative example. We concentrate on a fixed-size workload to perform a strong scaling
analysis by varying the number of MPI ranks, each of them mapped to a single physical core.

Besides showing a sample analysis, the reader will learn detailed command lines and GUI
usage information demonstrating how to use the above mentioned tools effectively in a
cluster environment.

We have used the Intel Endeavor cluster, housed at Intel’s Customer Response Team (CRT)
Datacenter in New Mexico. Each node of this cluster comprises two 12-core Intel® Xeon® E5 v2
processors based on the Intel microarchitecture code name Ivy Town (4), and the nodes are
connected through a Mellanox* InfiniBand fabric.

Methodology
Parallel High Performance Computing (HPC) applications often rely on multi-node
architectures of modern clusters. Performance tuning of such applications must involve
analysis of cross-node application behavior as well as single node performance analysis. Two
performance analysis tools, Intel Trace Analyzer and Collector and VTune™ Amplifier, can
provide important insights to help in this analysis. For example, message passing interface
(MPI) communication hotspots, synchronization bottlenecks, load balancing and other
complex issues can be investigated using Intel Trace Analyzer and Collector. At the same
time, VTune™ Amplifier can be used to understand intra-node performance issues. We will
apply both of these tools to the pure MPI version of the Poisson solver. The more complex
case of hybrid applications is left for later studies.

The methodology presented below represents broad recommendations for combining global
application performance metrics, such as speedup and parallel efficiency, with more detailed
measurements, such as message passing rates and memory bandwidth. For detailed analysis
beyond simple scaling profiles, software tools such as Intel® Trace Analyzer and Collector and
VTune Amplifier are necessary. Our goal is not to achieve better performance per se but to show
how to use Intel Trace Analyzer and Collector and VTune Amplifier to better understand
performance problems on a specific hardware platform as a prerequisite for subsequent tuning.

We focus on cluster-level performance optimizations, assuming that the application has already
gone through single-core performance tuning and hence has achieved some level of maturity.
Performance issues related to scalability have to be evaluated at the level of concurrency at
which the application is used in real-life situations. For example, by performing the analysis on
realistic workloads running at scale on the cluster, we investigate the application under relevant
memory footprint conditions.

The methodology consists of 4 phases depicted in Figure 1 (a more detailed version is shown
in Appendix C):

Figure 1 (flow chart boxes, summarized as text):

1. Global Analysis: characteristic values for the whole program (timing, speedup, efficiencies)
   plus the imbalance diagram. Central decision: imbalance > interconnect?
2. Algorithmic investigations (imbalance dominates): change the algorithm where the workload
   provides alternatives; analyze and change the MPI pattern; analyze the load imbalance of the
   computation. Decision: runtime reduced?
3. MPI runtime tuning (interconnect dominates): set Intel MPI environment variables; use a
   faster communication network; apply an optimized rank-to-node mapping. Decision: runtime
   reduced?
4. Single node/process tuning: hotspot analysis for MPI routines; tuning of routines showing a
   load imbalance; bandwidth analysis for single node scalability. Iterate.

Figure 1 Flow chart of the tuning methodology. The central decision is based on the relation of
imbalance time vs. interconnect time. These times can be determined with ITAC (see below). A large
imbalance time means that there is a lot of waiting time in MPI routines and the program cannot get
much faster even with an ideal communication network. A large interconnect time means that any
improvement of the network performance speeds up the program significantly.

Scaling using different MPI rank distributions on the computational grid

The experiments were conducted on the Intel Endeavor cluster with two sockets per node of Xeon E5
v2 processors (the Intel microarchitecture code name Ivy Town), providing 24 physical cores
per node. Intel® Hyper-Threading Technology was enabled on the cluster nodes; however, in
our pure MPI runs we used no more than 24 ranks per node, mapping one MPI rank to
one physical core.

We investigate our Poisson solver on a square 3200x3200 computational grid that is large
enough to run into bandwidth limitations. These bandwidth limitations can be remedied by
undersubscribing computational nodes, tuning hardware and software prefetchers, algorithmic
changes, and other means.

The 3200x3200 grid points can be distributed to MPI ranks using a 2D process grid, e.g., in the
case of 4 ranks, one can use 2 rows x 2 columns of processes or a 1D distribution with 4 rows
x 1 column or 1 row x 4 columns (Figure 2).

Figure 2 content: a 2x2 process grid (ranks 0-3) with 1600x1600 local grid points per MPI rank,
and a 1x4 process grid (ranks 0-3) with 3200x800 local grid points per MPI rank.

Figure 2 Mapping of the computational grid onto MPI processes using a 2D (2x2) and a 1D (1x4) process grid.

According to the proposed methodology, the first stage is Global Analysis, and we start it with a
scaling investigation by running the application with different numbers of processes (p),
recording the timings T[p] and then measuring the speedup, defined as S[p] = T[1]/T[p]. The
complementary metric, parallel efficiency, defined as E[p] = S[p]/p, will be used later as well.
The speedup curves for the 2D quadratic and 1D (1xN and Nx1) process grids show
some differences in scaling (Figure 3). Indeed, the 48x32 = 1536 process grid delivers a
speedup of 284 (a parallel efficiency of about 0.18), the 1536x1 process grid gives a speedup of
288, but the 1x1536 process decomposition only produces a 144x scaling (see Appendix A for
the cluster configuration that was used in these experiments).

Benchmark configuration in Appendix A

Figure 3 Speedup for the 2D and 1D process grids. Note the logarithmic scale of the Y axis. A single node
contains 24 cores (IVT). The “Ideal” curve is simply speedup == number of ranks. A small additional dent in the
“Ideal” curve results from the fact that the rank counts are not powers of 2. The 2D distribution uses a square
distribution NxM with N == M; if this is not possible, the “nearest” approximation with N > M is chosen, e.g., for
384 ranks NxM == 24x16.

From the timing data it is not obvious whether the scaling degradation is due to MPI
performance or to single core compute performance. We can use the Intel Trace Analyzer and
Collector function profile to separate and further analyze the impact of different rank
placements on inter-node and intra-node performance (ITAC: Charts -> Function Profile).

Before we do that, we may have a look at the message passing performance resulting from the
different process mappings.

Message Passing Profile


The Message Passing Profile displays various characteristics of message passing in a
sender/receiver matrix and can be obtained through Charts -> Message Profile Chart. Because
dealing with 1536 ranks generates a huge matrix, we may fuse all ranks for each node:
Advanced -> Process Aggregation -> All Nodes. The diagonal now shows the intra-node
performance characteristics, while the off-diagonal entries show the inter-node statistics. Without
process aggregation the diagonal would only be filled if we sent messages from rank n to the
same rank n, which is usually not a good idea.

Several attributes may be displayed using right click -> Attribute to show. Most interesting
is the attribute “Average Transfer Rate”, which displays the message passing rate including all
waiting times, but other attributes such as the total volume [in MB] and the number of messages
may also be of interest.

Figure 4: The Message Passing Profile for the 1536 = 48 x 32 rank process placement on 64 nodes. The rows and
columns represent senders and receivers, respectively; the squares represent the messages between different
sender/receiver pairs, color coded according to the chosen attribute (a legend is shown next to the matrix). In this
case we fused all ranks inside each node. The attribute here is “Average Transfer Rate” [MB/s]. The rates are pretty
low, and this gives a first indication that we should investigate the message passing in more detail.

Figure 5: The Message Passing Profile for the 1536x1 rank 1D process placement on 64 nodes. The rates are much
better compared to the 2D case. In particular, the intra-node communication in the 1D case reaches about
1.38 GB/s, compared to about 30 MB/s in the 2D case.

By looking at the other attributes offered by the charts in Figure 4 and Figure 5, one can see
that in the 1D case we have fewer but larger messages, which in turn leads to a higher
average transfer rate. However, in the 1D case we also transfer a larger amount of data.

These two different MPI rank placements, 2D (48x32) and 1D (1536x1), whose message profiles
are illustrated in Figure 4 and Figure 5, result in similar performance. While the simple 1D
message passing pattern offers little potential for optimization, we may apply some
optimizations to the 2D pattern, such as reordering the MPI_Isend, MPI_Irecv and MPI_Waitall()
calls for several messages. There are other optimization ideas that can be worked out using Intel
Trace Analyzer and Collector, and they are left for future work.

Application and compute part parallel efficiency

To shed light on the issues related to scalability at the cluster level, we first look at the
breakdown of the total runtime into MPI communication and compute time,

T[p] = T_comp[p] + T_mpi[p]

which can be accessed through the Trace Analyzer’s Function Profile (Intel® Trace Analyzer
displays the Function Profile Chart when opening a trace file) (Figure 6). The trace file for Trace
Analyzer and Collector can be generated by adding the flag “-trace” to the Intel MPI mpirun or
mpiexec.hydra command.
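
For example (the rank count and binary name are placeholders for this illustration), a trace of a
768-rank run can be collected and then opened in the GUI roughly like this; Intel Trace Collector
writes the trace as <binary>.stf by default:

mpirun -trace -n 768 ./poisson.x
traceanalyzer ./poisson.x.stf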

Figure 6 annotations: The trace analyzer API was used to time just 100 of over 1653 iterations;
VT_API is the paused time. The timing is accumulated over ranks; the Application time is T_comp.
The average-time-per-process column can be added via right click and Function Profile Settings.

Figure 6: Intel Trace Analyzer and Collector Function Profile Chart. This snapshot shows the output for a 768
process run.
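
The annotation above notes that only 100 of the more than 1653 iterations were traced. The exact
instrumentation is not shown in this paper; a minimal sketch of how the Intel Trace Collector API
could be used to restrict tracing to iterations 100-200 (assuming tracing is switched off right after
MPI_Init and that the loop counter is named it) is:

#include <VT.h>                        /* Intel Trace Collector API, from the ITAC installation */

/* inside the solver, after MPI_Init: */
VT_traceoff();                         /* do not trace startup and initialization */
for (int it = 0; it < max_it; it++) {
    if (it == 100) VT_traceon();       /* start recording events at iteration 100 */
    if (it == 200) VT_traceoff();      /* stop recording events after iteration 199 */
    /* ... red/black updates, halo exchanges, MPI_Allreduce ... */
}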

A breakdown of the MPI functions can be seen by right-clicking on Group MPI and choosing
Ungroup MPI. This reveals that MPI_Allreduce and MPI_Waitall are the main hotspots in the MPI
library.

Speedup and parallel efficiency can then be calculated and plotted separately for the compute time of
the application (Figure 7). First, one can see that MPI time is insignificant up to 48
cores (the equivalent of two nodes). Above 96 ranks (4 nodes), the pure computational part of the
application even yields super-linear scaling. However, at around the same data point, 96 ranks,
MPI time becomes the main reason for low efficiency.

Benchmark configuration in Appendix A

Figure 7: Compute vs. total parallel efficiency as a function of the number of MPI ranks for the 2D distribution.
Compute + MPI is the whole application. This plot is still part of the Global Analysis. It shows that the efficiency
is determined by the computation efficiency for small rank counts and by the network overhead for more than 2
nodes. The MPI hotspot functions were also determined with the Function Profile.

To investigate why this is the case, we need to look beyond the flat profile, since it is not
clear whether the poor timings shown in calls to MPI routines are caused by slow network
performance or by algorithmic inefficiencies causing unnecessary wait time. This is the first
decision branch in the proposed methodology chart (Figure 1).

Interconnect time vs. Imbalance time


To understand the relative impact of MPI application imbalance vs. the interconnect (hardware and
software stack) on application scalability (see the flow chart of the tuning methodology, Figure 1),
we can start by employing the ideal network simulator. This allows us to separate the network
stack’s impact on total MPI performance from algorithmic inefficiencies like imbalance and
dependencies. A simple network model for the transfer time as a function of the message volume V is

T_trans[V] = L + (1/BW)*V

where L is latency, defined as the time needed to transfer a 0-byte message, and the bandwidth
BW is the transfer rate for asymptotically large messages. The ideal network may be
simulated by setting all transfer times to 0, which means L = 0 and BW = ∞. The generation of the
ideal trace is automated in Intel Trace Analyzer and Collector and can be invoked through the
Advanced -> Idealization menu. The analyzer’s imbalance diagrams (Advanced ->
Application Imbalance Diagram menu) can then be generated using the real and idealized traces.
The imbalance diagrams are represented as stacked column charts for the different process
distributions (Figure 8).
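
As a hedged numerical illustration (the latency and bandwidth values are assumptions, not
measurements from this cluster): with L = 1 µs and BW = 5 GB/s, an 8-byte contribution to
MPI_Allreduce costs essentially the latency, T_trans[8 B] ≈ 1 µs, whereas a 1 MB halo message
costs T_trans[1 MB] ≈ 1 µs + 1 MB / (5 GB/s) ≈ 201 µs. Idealization sets both of these times to zero.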

Benchmark configuration in Appendix A


Figure 8: Poisson solver imbalance diagram for different process (rank) placements in the case of 1536 ranks (= 64
24-core nodes). The timings are looked up automatically by Intel Trace Analyzer and Collector from the original and
simulated traces. The 2D distribution and the 1536x1 distribution look quite similar, although the MPI
exchange pattern is very different. It was expected that the compute performance of the 1x1536 distribution is much
worse because each row will only contain one or two elements. The traces were collected for a reduced run time
of only 100 iterations. The Y axis is time in seconds summed over all ranks. We suggest using the minimum
number of iterations and ranks that still captures the main performance features of the application, to minimize
analysis time. This graph is not the original imbalance diagram delivered by Intel Trace Analyzer and Collector
because we wanted to combine three different experiments in a single plot.

The imbalance diagram can in theory be dominated either by transfer times (the algorithm is
balanced but we have to improve the network performance by a different process placement or
new network hardware) or by waiting times (the algorithm has to be revisited for better load
balancing and removal of dependencies). In the case of the Poisson solver, the imbalance
diagram (Figure 8) shows that the application suffers predominantly from high transfer times.
Therefore, following the tuning methodology chart (Figure 1), the first decision after Global
Analysis should be to concentrate on reducing the transfer times (interconnect) by
MPI runtime tuning before investigating the imbalance.

Algorithmic Investigations
In the last chapter we determined that investigations related to imbalance are not the most
efficient next step for this Poisson solver. However, to illustrate Trace Collector and Analyzer
capabilities, we may shortly describe what the root causes of the observed wait times are since
each application is unique and imbalance impact can prevail in case of other application.

One common reason behind possible imbalance issues is the inability to map the processor grid
perfectly onto the computational grid. For example, if we map 1536 processes onto a
computational grid of 3200x3200 points, in the 2D case with the 48x32 processor grid the local
size is 3200/48 x 3200/32 points, which leads to 32x32 processes with 67x100 grid points and
16x32 processes with 66x100 grid points. The difference of less than 2% in the number of grid
points may not be observable. In the 1D case with a 1536x1 process grid we would get 128
processes with 3x3200 grid points and 1408 processes with 2x3200 local grid points. The
differences in run time are observable by clicking on the Trace Analyzer’s Load Balance tab,
available next to the Flat Profile tab (Figure 9).

Figure 9 annotation: one additional row of grid points for processes 0-127.

Figure 9 Intel Trace Analyzer and Collector Load Balance information for each MPI process.

Another possible cause of load imbalance might be not algorithmic inefficiencies but a
phenomenon called OS jitter (OS for operating system). OS jitter describes operating system
events that can slow down compute performance. Some processes may run slower for some
iterations, causing imbalances and MPI wait time. For applications running on thousands of
nodes (a common scenario these days), this noise can become crucial because a single event on a
single process may slow down the whole application. The reduction of this noise by using a
specialized minimal OS is a current HPC research topic (5).

The other source of MPI waiting time is dependency1 in the MPI coding techniques used in the
Poisson solver. A closer look at the MPI hotspots in the idealized trace file reveals that 85% of
the wait time is due to MPI_Allreduce; the rest is due to MPI_Waitall. This shows that
practically all message passing dependencies have been removed. In a previous version of the
Poisson solver the exchange was programmed with blocking MPI_Send/Recv. That version clearly
showed substantial wait time in MPI_Recv in the idealized trace file. Changing the exchange to
MPI_Isend/Irecv/Waitall successfully removed these dependencies.

1
Dependency means that part B of an application can only be started when part A has been finished. Each
message introduces a dependency because the program on the receiver rank can only proceed after the message
from part A has been sent and part B on the receiver rank has received it. If the receiver has already arrived at part
B while part A has not yet sent the message, we see the receive routine in waiting mode. The time is, however,
reported as plain MPI time, and only the idealization can tell whether it is waiting time (imbalance time).
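
A sketch of such a non-blocking exchange for one dimension (the variable names are illustrative
and not taken from poisson.c; boundary ranks can use MPI_PROC_NULL as their neighbor, which
turns the corresponding calls into no-ops):

MPI_Request req[4];
MPI_Irecv(halo_from_up,   nx, MPI_DOUBLE, rank_up,   0, comm, &req[0]);
MPI_Irecv(halo_from_down, nx, MPI_DOUBLE, rank_down, 1, comm, &req[1]);
MPI_Isend(first_row,      nx, MPI_DOUBLE, rank_up,   1, comm, &req[2]);
MPI_Isend(last_row,       nx, MPI_DOUBLE, rank_down, 0, comm, &req[3]);
/* independent interior computation could be overlapped here */
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);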

MPI Runtime tuning


This is the third phase of the methodology, and it is necessary when we observe high transfer
times in the imbalance diagram. We can often improve MPI performance without
changing the source code. This can be done by using Intel MPI environment variables or by
changing the mapping of ranks to compute nodes. The process-to-node mapping can
be altered by advanced mechanisms like machine or configuration files, or by reordering
the ranks inside a communicator. The MPI standard also contains support for Cartesian
topologies, as described in chapter 4 of (6). For applications with high transfer times, it
is also beneficial to use faster communication hardware, in contrast to high wait times, which
cannot be removed even by an ideal network.
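
For instance, the rank-to-grid assignment can be delegated to the MPI library with a Cartesian
communicator; the reorder flag allows the implementation to remap ranks for a better placement.
A short sketch of the standard API (this is an illustration, not the scheme used by the Poisson
solver in this paper):

int dims[2] = {0, 0}, periods[2] = {0, 0};
int rank_up, rank_down, rank_left, rank_right;
MPI_Comm cart;
MPI_Dims_create(nprocs, 2, dims);                   /* e.g. 1536 -> 48 x 32          */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                1 /* reorder */, &cart);            /* library may remap the ranks   */
MPI_Cart_shift(cart, 0, 1, &rank_up,   &rank_down); /* neighbors in the row direction */
MPI_Cart_shift(cart, 1, 1, &rank_left, &rank_right);/* neighbors in the column direction */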

We may start the tuning by concentrating on global operations. The simple Poisson solver
uses only MPI_Allreduce with an MPI_SUM operation for a global sum. Just a single double
precision (8-byte) value is summed over all processes. This is necessary for building the
residual, which is a measure of the difference between the result arrays computed in two
subsequent iterations. The solver iterations stop when the residual falls below a predefined
threshold.

It is always good advice to set the environment variable I_MPI_DEBUG to the integer value 5.
This prints valuable information about the variables used, the network fabrics and the process
placement. Setting I_MPI_DEBUG to 6 additionally reveals the default algorithms used for
collective operations.
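
For example (rank count and binary name are placeholders), the debug output can be requested
directly on the command line:

mpirun -genv I_MPI_DEBUG 5 -n 1536 ./poisson.x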

The Intel MPI reference guide (8) reveals that we can select 8 different algorithms for
MPI_Allreduce. Some of these algorithms are not appropriate for single 8-byte values, but
we can simply test all of them for our application and find out whether a non-default
value provides better performance. The algorithm can easily be changed by setting the
environment variable I_MPI_ADJUST_ALLREDUCE to an integer value in the range 1-8,
corresponding to the algorithms listed in the Intel MPI reference manual. A comparison of run
times for each algorithm is shown in Figure 10. Algorithms 1 and 5 are pretty close, but #5
delivers the best performance for 1536 ranks.
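
A simple way to perform this sweep is a shell loop over all algorithm numbers (the binary name
and rank count are again placeholders):

for alg in 1 2 3 4 5 6 7 8; do
    mpirun -genv I_MPI_ADJUST_ALLREDUCE $alg -n 1536 ./poisson.x
done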

Benchmark configuration in Appendix A

Figure 10 Poisson speedup for the 8 different MPI_Allreduce algorithms. Algorithms 1 and 5 are pretty close, but
#5 delivers the best performance for 1536 ranks. The default is algorithm #1 (recursive doubling) for 1-48
processes and algorithm #5 (binomial gather + scatter) above that. We see that we can slightly optimize the
performance for 96-384 processes by choosing algorithm #1.

Analyzing Intra-Node performance using Intel® VTune™ Amplifier XE


Hotspot Analysis of MPI functions
VTune Amplifier was designed for single node analysis including threading. Many
performance events can be read from the Performance Monitoring Unit (PMU) for a detailed
analysis of Intel processor core and uncore behavior under a specific program. For a complete
analysis of parallel programs Intel Trace Analyzer and Collector is not sufficient due to its’
primarily focus on MPI performance. This becomes even more obvious when we start
analyzing hybrid codes that combine parallel MPI processes with threading for a more
efficient exploitation of computing resources.

We show first how to further analyze MPI hotspots with VTune Amplifier. Then we measure
the bandwidth in search of a better understanding of the efficiency curve plotted in Figure 7.

For VTune Amplifier hotspot analysis we may run the Amplifier command line interface as the
parallel MPI program to be distributed over N MPI ranks. The Poisson solver invocation comes as a
parameter to the Amplifier command line:

mpirun -n N amplxe-cl -result-dir hotspots_N -collect hotspots -- poisson.x

This command line runs poisson.x on N ranks and produces for each rank a result
directory containing the hotspot analysis for that rank. The result directory for rank m will be
named hotspots_N.m.
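
A quick text report for one rank can also be produced directly on the cluster, for example for rank 0
of a 768-rank run (the result directory name follows the scheme above):

amplxe-cl -report hotspots -result-dir hotspots_768.0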

The hotspot analysis for a chosen rank is usually done by transferring the result directory, the
executable and the sources to a workstation for further inspection in the VTune Amplifier GUI.
After unpacking the results on the workstation, we open the VTune Amplifier Bottom-up tab and
select the Call Stack Mode “user functions +1”. This will show MPI functions (prefixed
with P for the profiling version) in the VTune Amplifier GUI (Figure 11).

From the Intel Trace Analyzer and Collector analysis we know which MPI functions are the
hotspots, but we do not know which occurrence of the MPI_Waitall function actually has the largest
contribution to the application runtime. By revealing call stack information, VTune Amplifier
can point to the specific MPI_Waitall call dominating the runtime. This is a useful starting point
for implementing code changes to improve application performance.

Figure 11 MPI functions in the VTune Amplifier GUI. One of the MPI_Waitall call stacks dominates with 57.7% of
the runtime.

As a result we see that the last MPI_Waitall call stack (Appendix D, poisson.c, line 226, function
call #1) dominates with 57.7% of the runtime. The second call stack (Appendix D, call #2) gets
34.6% and the first call stack (Appendix D, #3) only 7.7%. The corresponding source can
be found in Appendix D. The reason is that the first exchange starts at almost the same time
for all processes but generates an imbalance. The second call is slowed down because of this
imbalance. This gets worse in the last exchange, until all ranks are synchronized again by
MPI_Allreduce and a new iteration starts. The corresponding Intel Trace Analyzer and
Collector snapshot for a single iteration is depicted in Figure 12:

Figure 12 annotation labels, left to right: Callstack #3, Callstack #2, Callstack #1.

Figure 12 Intel Trace Analyzer and Collector snapshot for a single iteration of the Poisson solver. A single iteration
can be delimited by using more advanced user function instrumentation; for this simple implementation we just
know that there is an MPI_Allreduce at the end of each iteration, so we zoom in after one Allreduce up to and
including the following Allreduce. The numbering of the call stacks is done by VTune Amplifier XE; the first call
stack is associated with the largest time fraction.

Such a VTune Amplifier hotspot analysis of the MPI hotspot functions may be conducted
for all ranks, but usually we can begin with a single rank, at least in the case of homogeneous
clusters (clusters consisting of Intel® Xeon or Xeon Phi™ processors only, but not a mixture of
both). Since the speedup curve for a single node (Figure 3) shows saturation on the node (24
cores per node), we may anticipate some bandwidth saturation issues. Fortunately, VTune
Amplifier XE provides a bandwidth analysis collection to verify this assumption.

Bandwidth Analysis
To use VTune Amplifier bandwidth analysis in conjunction with MPI, we can use the same trick
as in the previous example for interposing VTune Amplifier with the MPI invocation of
poisson.x, but with an added wrinkle to restrict VTune Amplifier to a single rank. Here is an
example of a command that starts 60 ranks in total, 59 of them as usual and the first one under
VTune Amplifier bandwidth analysis:

mpirun -n 1 amplxe-cl -start-paused -result-dir snb-bandwidth_60 -collect snb-bandwidth \
    -- poisson.x : -n 59 poisson.x

We have used the snb-bandwidth analysis type above, which also incorporates the
architecture bandwidth analysis for the microarchitecture code name Ivy Town (as follows
from <vtune_installation_dir>/config/analysis_type/snb_bandwidth.cfg) in the current release
of VTune Amplifier XE 2013 Update 16. Since bandwidth analysis employs hardware event-based
sampling, the SEP sampling driver must be installed on each node where data will
be collected. We also had to disable the NMI watchdog to enable collection with hardware
counters (see Appendix B).

Since we are not interested in analyzing the MPI startup or the data initialization section of the
application, we would like to collect VTune Amplifier data only for a specific time period, when
the application runs its computational kernel. The -start-paused command line option starts the
collection in a paused state. This option was used in conjunction with the VTune Amplifier API
functions __itt_resume() and __itt_pause() surrounding the computational kernel in the source
code.
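
A minimal sketch of that instrumentation is shown below. The header and library ship with VTune
Amplifier; the include and library paths (for example under $VTUNE_AMPLIFIER_XE_2013_DIR) are
installation-specific assumptions, and the helper function names are invented for illustration.

#include <ittnotify.h>

/* collection was started with -start-paused, so nothing is recorded here */
initialize_data();                /* hypothetical setup code, not profiled */

__itt_resume();                   /* start VTune Amplifier data collection */
run_solver_iterations();          /* hypothetical computational kernel     */
__itt_pause();                    /* stop collection before finalization   */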

It should be understood that even though we specified only one rank to be invoked under VTune™
Amplifier in the command above, that invocation will collect data for everything running on the
node executing that rank. When collecting event-based sampling data such as LLC (Last Level
Cache) misses, those events can be linked to the appropriate process under which they
occurred. In the case of a few MPI ranks on a single node, we will see the event-based data (LLC
misses) divided up among these MPI processes in the VTune Amplifier GUI, as well as the total
number of LLC misses. However, the bandwidth data are reported per memory channel and package
(not per rank/process) and then summed up for the whole node. A summary of the collected
bandwidth data is printed to standard output after the collection is done; in our environment it is
redirected by the LSF scheduler into the job report. Alternatively, one can use the command line
tool to generate a summary report to standard output:

amplxe-cl -R summary -r <results-directory>

The total bandwidth, in GB/s, is reported separately for each package (socket) on the node. We
subsequently summed the reported bandwidths of both packages to obtain the total
bandwidth for the node.

By plotting the results of the bandwidth analysis and the parallel efficiency together as we
scale out (Figure 13), we observe an inverse correlation between them. The bandwidth in our
experiments saturates at about 87 GB/s. This is about 88% of the STREAM benchmark (7)
bandwidth result of 98 GB/s, measured during the same job on the same node. The STREAM
bandwidth result in turn is ~80% of the theoretical peak for the quad-channel RAM installed on the
nodes used for this test.

Figure 13 chart data: bandwidth on one node (left axis, GB/s, values 8.908, 15.839, 30.392, 61.836,
80.717, 83.548, 86.569) and parallel efficiency (right axis, scale 0 to 1.4) versus the number of
ranks (1, 6, 12, 24, 48, 72, 96).
Benchmark configuration in Appendix A

Figure 13 Parallel efficiency vs. bandwidth (GB/s) as a function of the number of MPI ranks. Ranks 1-24 are located
on a single IVT node.

Summary
We present a methodology for performing analysis of HPC applications using Intel Trace
Analyzer and Collector and VTune Amplifier. This methodology is applied starting from the
whole application (or part of program that should be tuned) followed by detailed analysis
with focus on the communication patterns and single MPI routines.

A key role is played by the Intel Trace Analyzer and Collector’s idealizer, which can simulate
program execution on an ideal network with infinitely fast communication but the same
processor speed. The outcome of this simulation guides us to the next steps: algorithmic
investigations or tuning of the MPI library. While algorithmic investigations may lead to code
changes, we can also choose to use Intel MPI environment variables or to place MPI processes on
specific physical cores to reduce the MPI communication runtime.

Intel® VTune™ Amplifier XE can be used for call stack analysis of hotspot MPI functions
(MPI_Waitall in the Poisson case). VTune Amplifier based bandwidth analysis has been shown
to be useful in finding performance bottlenecks of the Poisson solver application on an HPC
cluster; it clearly explains the reasons behind the scaling saturation on a single node.

References
1. Intel® Trace Collector. Reference Guide. [Online]
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITC_Reference_Guide/index.htm.

2. Intel® Trace Analyzer. Reference Guide. [Online]
http://software.intel.com/sites/products/documentation/hpc/ics/itac/81/ITA_Reference_Guide/index.htm.

3. Intel® VTune™ Amplifier XE 2013. [Online]
http://software.intel.com/en-us/intel-vtune-amplifier-xe.

4. Intel® Xeon® Processor E5-2695 v2 (30M Cache, 2.40 GHz). [Online]
http://ark.intel.com/products/75281/Intel-Xeon-Processor-E5-2695-v2-30M-Cache-2_40-GHz.

5. OS Jitter Mitigation Techniques. [Online]
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/OS%20Jitter%20Mitigation%20Techniques.

6. William Gropp, Ewing L. Lusk, Anthony Skjellum. Using MPI - 2nd Edition: Portable Parallel
Programming with the Message Passing Interface. s.l.: The MIT Press, 1999.

7. STREAM benchmark. [Online]
https://www.nersc.gov/systems/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/stream.

8. Intel® MPI Library. Reference Manual for Linux* OS. [Online]
http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/index.htm.

9. Intel® Xeon® Processor E5 v2 and E7 v2 Product Families Uncore Performance Monitoring
Reference Manual. [Online]
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/xeon-e5-2600-v2-uncore-manual.pdf.

Appendix A
Benchmark Environment

Intel® Xeon® E5 v2 processors (Ivy Town) with 12 cores. Frequency: 2.7 GHz
2 processors per node (24 cores per node)
Mellanox QDR Infiniband
Operating system: RedHat EL 6.4
Intel® MPI 4.1.3.045

Appendix B
Disabling of Non Maskable Interrupt in cluster environment.

The Non-Maskable Interrupt (NMI) watchdog has to be disabled for VTune™ Amplifier to
function properly. The NMI watchdog is used by the Linux kernel to periodically detect whether a
CPU is locked up. However, the NMI watchdog needs a hardware performance counter, so other
performance tools, including VTune™ Amplifier, cannot use PMU event-based sampling data
collection while it is enabled.

To permanently disable the nmi_watchdog interrupt:

1. Under the root account, edit /boot/grub/grub.conf by adding the nmi_watchdog=0
parameter to the kernel line so that it looks like: /boot/vmlinuz-2.6.32-131.0.15.el6.x86_64 ro
root=/dev/sda8 panic=60 nmi_watchdog=0

2. Reboot the system.

3. After rebooting, enter the following command to verify whether nmi_watchdog is
disabled: grep NMI /proc/interrupts. If you see zeroes, nmi_watchdog is successfully disabled.

To temporarily disable the nmi_watchdog interrupt, enter:

echo 0 > /proc/sys/kernel/nmi_watchdog

On the Endeavor cluster, disabling of the NMI interrupt is implemented through setting a run-time
variable NMI_WATCHDOG=OFF. This runtime variable defines a new behavior in the modified
LSF job manager, where VTune Amplifier is treated as a resource enabled by a user request in
the cluster manager prologue. In the LSF prologue, this resource gets cleaned up once the job
is done.

Appendix C – Detailed Methodology
The methodology consists of 4 main phases:

1. Global Analysis of the whole application that gives first indications of performance
issues that can be further subdivided into:
a. Run time and scaling analysis
b. Message Passing performance analysis on an inter/intra node level, including
finding of MPI hotspots
c. Network idealization that yields an imbalance diagram, providing guidance on
how to proceed: to phase 2 below if significant wait time is found, or, in the
case of high transfer times, directly to phase 3, skipping phase 2.
2. Algorithmic investigation: source code changes to implement better message passing
practices or improve the load balance of the application by:
a. Fixing imbalances in communication patterns of MPI and non-MPI routines. For
example, slow sequential I/O often causes imbalances.
b. Removing unnecessary synchronization. For example, message passing
patterns using blocking send and receive may cause a send/receive order that
increases wait times. This may be resolved by using non-blocking
MPI_Isend/MPI_Irecv pairs.
3. MPI run-time tuning:
Intel MPI can be tuned without changing the source code using:
a. Environment variables for tuning of collective operations, e.g.,
I_MPI_ADJUST_ALLREDUCE
b. Environment variables for changing the message passing characteristics, e.g.,
I_MPI_DAPL_DIRECT_COPY_THRESHOLD
c. It is also possible to change the MPI process/rank to node mapping for a better
inter/intra node communication balance
4. Single process/node tuning is necessary for serial performance optimizations.
Furthermore, single node tuning is important for improving overall application
scalability and reducing load imbalance.
a. We suggest conducting a hotspot analysis for each rank, or for critical ranks
identified in phases 1 and 2. The call stack information for a specific MPI routine
may also be helpful in refining the analysis of 1(b).
b. Bandwidth analysis on the node is important for an understanding of
deficiencies in cluster level scaling. This technique will be used in the paper to
explain the “dive” in the parallel efficiency curve of our Poisson solver.

After each tuning step the analysis can be repeated starting with phase 1. At least steps 1(a)-(c)
should be conducted again to get new advice for the next tuning actions.

Appendix D – Compute Part Source Code
The following source code shows the iteration loop. The enumeration of exchange functions
calls (from #1 through #3) corresponds to the MPI functions hotspot analysis “weight” (the
hottest function is marked as #1 below and through the text surrounding Figure 11)

Iteration loop annotations (iteration index it):
- ITAC API cuts tracing to iterations 100 to 200.
- Copy of the solution array for the later residuum calculation.
- CALL #3: exchange routine at line 205, contains MPI_Waitall.
- Update of the "red" points.
- CALL #2: exchange routine at line 216.
- Update of the "black" points.
- CALL #1: exchange routine at line 226.
- Residuum calculation, contains MPI_Allreduce.
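
Since the original source listing is not reproduced here, the following is only a structural sketch of
the loop described by these annotations; the function and variable names are invented for
illustration, and the line numbers refer to poisson.c as discussed in the text.

for (it = 0; it < max_it; it++) {
    /* ITAC API restricts tracing to iterations 100-200 here (see earlier sketch) */

    copy_array(u_old, u);                    /* keep old solution for the residuum  */

    exchange_halos(u);                       /* CALL #3, around line 205,
                                                contains MPI_Waitall                */
    update_red_points(u);

    exchange_halos(u);                       /* CALL #2, around line 216            */
    update_black_points(u);

    exchange_halos(u);                       /* CALL #1, around line 226,
                                                the hottest MPI_Waitall call site   */

    res = local_residuum(u, u_old);          /* local part of the residuum          */
    MPI_Allreduce(MPI_IN_PLACE, &res, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (res < threshold) break;              /* global convergence test             */
}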

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software,
operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information
and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products.

For more information go to http://www.intel.com/performance .

For more information regarding performance and optimization choices in Intel® software products, visit
http://software.intel.com/en-us/articles/optimization-notice.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors
for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3
instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of
any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this
product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference
Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

© 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Cilk Plus, and Intel VTune are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.

