Title:
THE CASE OF THE MISSING SUPERCOMPUTER PERFORMANCE: ACHIEVING OPTIMAL PERFORMANCE ON THE 8,192 PROCESSORS OF ASCI Q
Author(s):
Fabrizio Petrini, 154601, CCS-3
Darren J. Kerbyson, 176262, CCS-3
Scott Pakin, 179752, CCS-3
Submitted to:
Supercomputing Conference 2003
Phoenix, AZ
Los Alamos NATIONAL LABORATORY
Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.
Form 836 (8/00)
The Case of the Missing Supercomputer Performance:
Achieving Optimal Performance on the 8,192 Processors of ASCI Q
Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin
Performance and Architecture Laboratory (PAL)
Computer and Computational Sciences (CCS) Division
Los Alamos National Laboratory
{fabrizio,djk,pakin}@lanl.gov
May 2, 2003
Abstract
In this paper we describe how we improved the effective performance of ASCI Q, the world's second-fastest supercomputer, to meet our expectations. Using an arsenal of performance-analysis techniques including analytical models, custom microbenchmarks, full applications, and simulators, we succeeded in observing a serious, but previously undetectable, performance problem. We identified the source of the problem, eliminated the problem, and "closed the loop" by demonstrating improved application performance. We present our methodology and provide insight into performance analysis that is immediately applicable to other large-scale cluster-based supercomputers.
TABLE 1: Performance analysis tools and techniques

  Technique     Meaning                                      Purpose
  measurement   running full applications under various      determine how well the application
                performance computing workloads              actually performs

As a result of complexity in applications and in supercomputers it is difficult to determine the source of suboptimal application performance, or even to determine if performance is suboptimal.

Ensuring that key, large-scale applications run at maximal efficiency requires a methodology that is highly disciplined and scientific, yet is still sufficiently flexible to adapt to unexpected observations. The approach we took is as follows:

1. Using detailed knowledge of both the application and the computer system, use performance modeling to determine the performance that SAGE ought to see when running on ASCI Q.

2. If SAGE's measured performance is less than the expected performance, determine the source of the discrepancy.

3. Eliminate the cause of the suboptimal performance.

4. Repeat from step 2 until the measured performance matches the expected performance.

Step 2 is the most difficult part of the procedure and is therefore the focus of this paper.

While following the above procedure the performance analyst has a number of tools and techniques at his disposal, as listed in Table 1. An important constraint is that time on ASCI Q is a scarce resource. As a result, any one researcher or research team has limited opportunity to take measurements on the actual supercomputer. Furthermore, configuration changes are not always practical. It often takes a significant length of time to install or reconfigure software on thousands of nodes, and cluster administrators are reluctant to make modifications that may adversely affect other users. In addition, a reboot of the entire system can take several hours [6] and is therefore performed only when absolutely necessary.

The remainder of the paper is structured as follows. Section 2 describes how we determined that ASCI Q was not performing as well as it could. Section 3 details how we systematically applied the tools and techniques shown in Table 1 to identify the source of the performance loss. Section 4 explains how we used the knowledge gained in Section 3 to achieve our goal of improving application performance to the point where it is within a small factor of the best that could be expected. Section 5 completes the analysis by re-measuring the performance of SAGE on an optimized ASCI Q and demonstrating how close the new performance is to the ideal for that application and system. A discussion of the insight gained in the course of this exercise is presented in Section 6, and we present our conclusions in Section 7.

2 Performance expectations

Based on the Top 500 results, ASCI Q appears to be performing well. It runs the LINPACK [1] benchmark at 75% of peak performance, which is well within range for machines of that class. However, there are more accurate methods for determining how well a system is actually performing. From the testing of the first ASCI Q hardware in March 2001, performance models of several applications representative of the ASCI workload were used to provide an expectation of the performance that should be achievable on the full-scale system [3, 4]. These performance models are parametric in terms of certain basic, system-related features such as the sequential processing time, as measured on a single processor, and the communication network performance.

In particular, a performance model of a large-scale hydrodynamic application, SAGE, was developed for the express purpose of predicting the performance of SAGE on the full-sized ASCI Q. The model has been validated on many large-scale systems, including all ASCI systems, with a typical prediction error of less
than 10% [5]. The HP ES45 nodes used in ASCI Q actually went through two major upgrades during installation: the PCI bus within the nodes was upgraded from 33 MHz to 66 MHz, and the processor speed was upgraded from 1 GHz to 1.25 GHz. The SAGE model was used to provide an expected performance of the ASCI Q nodes in all of these configurations.

The performance of the first 4,096-processor segment of ASCI Q ("QA") was measured in September 2002, and the performance of the second 4,096-processor segment ("QB"), at the time not physically connected to QA, was measured in November. The results of the two measurements are consistent with each other although they rapidly diverge from the performance predicted by the SAGE performance model (Figure 1). At 4,096 processors, SAGE's cycle time was twice the time predicted by the model. This was considered to be a "difference of opinion" between the model and the measurements. Without further analysis it would have been impossible to discern whether the performance model was inaccurate, although it has been validated on many other systems, or whether there was a problem with some aspect of ASCI Q's hardware or software configuration.

MYSTERY #1: SAGE performs significantly worse on ASCI Q than predicted by the performance model.

In order to identify why there was a difference between the measured and expected performance, we performed a battery of tests on ASCI Q. The most revealing result came from varying the number of processors per node used to run SAGE. Figure 2 shows the difference between the modeled and the measured performance when using 1, 2, 3, or 4 processors per node. Note that a log scale is used on the x axis. It can be seen that the only significant difference occurs when using all four processors per node, thus giving confidence to the model being accurate.

Figure 2: Error in modeled and measured SAGE performance when using 1, 2, 3, or 4 processors per node

segments. The ensuing cycle times are shown in Figure 4(a) and a histogram of the variability is shown in Figure 4(b). It is interesting to note that the cycle time ranges from just over 0.7 s to over 3.0 s, indicating greater than a factor of 4 in variability.

A profile of the cycle time when using all four processors per node, as shown in Figure 5, reveals a number of important characteristics in the execution of SAGE. The profile was obtained by separating out the time taken in each of the local boundary exchanges (token-get and token-put) and the collective-communication operations (allreduce, reduction, and broadcast) on the root processor. The overall cycle time, which includes computation time, is also shown.
Figure 3: Effective SAGE processing rate when using 1, 2, 3, or 4 processors per node

Figure 4(a): Variability
Figure 5: Profile of SAGE cycle time

Figure 7: allreduce and barrier latency with varying amounts of intervening computation
Within a node, the allreduce is fully scalable with up to three processors and takes, on average, less than 300 μs. With four processors the latency surges to more than 3 ms. These measurements were obtained on the QB segment of ASCI Q.

We made several attempts to optimize the allreduce in the four-processor case and were able to substantially improve the performance. To do so, we used a different synchronization mechanism. In the existing implementation the processes in the reduce tree poll while waiting for incoming messages. By changing the synchronization mechanism to poll for a limited time (100 μs) and then block, we were able to improve the latency by a factor of 7.

At 4,096 processors, SAGE spends over 51% of its time in allreduce. Therefore, a sevenfold speedup in allreduce ought to lead to a 78% performance gain in SAGE. In fact, although extensive testing was performed on the modified collectives, this resulted in only a marginal improvement in application performance.

MYSTERY #2: Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.

Figure 6: allreduce latency as a function of the number of nodes and processes per node
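The poll-then-block idea described above can be illustrated in ordinary Python. This is only a single-process sketch using a `threading.Event`, not the actual Quadrics/MPI machinery; the 100 μs poll window is kept as a parameter:

```python
import threading
import time

def poll_then_block_wait(event, poll_window=100e-6):
    """Wait for `event`: spin-poll for `poll_window` seconds, then block.

    Polling keeps latency low when the message arrives quickly;
    blocking afterwards releases the processor to other activity
    instead of burning cycles, which is the trade-off behind the
    allreduce optimization described in the text.
    """
    deadline = time.perf_counter() + poll_window
    while time.perf_counter() < deadline:   # fast path: spin
        if event.is_set():
            return "polled"
    event.wait()                            # slow path: block in the kernel
    return "blocked"

if __name__ == "__main__":
    ev = threading.Event()
    # Signal after 50 ms, far longer than the 100 us poll window,
    # so the waiter falls back to blocking.
    threading.Timer(0.05, ev.set).start()
    print(poll_then_block_wait(ev))
```

The design trade-off is exactly the one the text describes: a pure polling wait has the lowest wake-up latency but occupies the processor, while poll-then-block bounds the time spent spinning before yielding.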
Figure 7 provides more clues to our analysis. It shows the performance of the allreduce and barrier in a synthetic parallel benchmark that alternately computes for either 0, 1, or 5 ms then performs either an allreduce or a barrier. In an ideal, scalable system we should see a logarithmic growth with the number of nodes and insensitivity to the computational granularity. Instead, what we see is that the completion time increases linearly with both the number of nodes and the computational granularity. Figure 7 also shows that both allreduce and barrier exhibit similar performance. Given that the barrier is implemented using a simple hardware broadcast whose execution is almost instantaneous (only a few microseconds) and that it reproduces the same problem, we will consider a barrier benchmark later in the analysis.

We can therefore conclude that neither the MPI implementation nor the network is responsible for the performance problems. By process of elimination, we can infer that the source of the performance loss is in the nodes themselves.

3.2 Analyzing the computational noise

Our intuition was that periodic system activities were interfering with application execution. This hypothesis follows from the observation that using all four processors per node results in lower performance than using fewer processors. Figures 3 and 6 confirm this observation for both SAGE and allreduce performance. System activities can run without interfering with the application as long as there is a spare processor available to absorb them. When there is no spare processor, the application must relinquish one of its processors to the system activity. Doing so may introduce performance variability, which we refer to as "noise".

To determine if system noise is, in fact, the source of SAGE's performance variability, we crafted a simple microbenchmark designed to expose the problems. The benchmark works as shown in Figure 8: each node performs a synthetic computation carefully calibrated to run for exactly 1,000 seconds in the absence of noise.

Figure 8: Performance-variability microbenchmark

The total normalized run time for the microbenchmark is shown in Figure 9 for all 4,096 processors in QB. Because of interference from noise the processing time can be longer and can vary from process to process. However, the measurements indicate that the slowdown experienced by each process is low.

Sticking to our assumption that noise is somehow responsible for SAGE's performance problems, we refined our microbenchmark into the version shown in Figure 10. The new microbenchmark was intended to provide a finer level of detail into the measurements presented in Figure 9. In the new microbenchmark, each node performs 1 million iterations of a synthetic computation, with each iteration carefully calibrated to run for exactly 1 ms in the absence of noise, for an ideal total run time of 1,000 seconds. Using a small granularity, such as 1 ms, is important because many LANL codes exhibit such granularity between communication phases. During the purely computational phase there is no message exchange, I/O, or memory access. As a result, the run time of each iteration should always be 1 ms in a noiseless machine.
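The structure of this fine-grained microbenchmark can be sketched as follows. This is a scaled-down, single-node Python illustration (a few hundred iterations of calibrated work rather than one million 1 ms iterations); the loop sizes and iteration counts are illustrative, not the values used on ASCI Q:

```python
import time

def calibrated_work(n):
    """A fixed amount of pure computation: no I/O, no messages, no syscalls."""
    x = 0
    for i in range(n):
        x += i * i
    return x

def measure_iterations(iters=200, loop_size=20000):
    """Time each iteration of the synthetic computation.

    On a noiseless machine every sample would equal the ideal
    iteration time; interruptions by system activities show up
    as outliers well above the minimum.
    """
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        calibrated_work(loop_size)
        samples.append(time.perf_counter() - t0)
    ideal = min(samples)                       # best case ~ noise-free time
    overhead = sum(samples) / len(samples) / ideal - 1.0
    return samples, ideal, overhead

if __name__ == "__main__":
    samples, ideal, overhead = measure_iterations()
    print(f"ideal iteration: {ideal*1e3:.3f} ms, mean overhead: {overhead:.1%}")
```

Recording per-iteration times, rather than only the total, is what distinguishes this refined benchmark from the coarse one: it exposes when the noise occurs, not just how much there is in aggregate.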
at regular intervals, with a duration that can be up to 200 ms. Node 1 experiences a few heavyweight interrupts, one every 60 seconds, that freeze the process.

Figure 13: Identification of the events that cause the different types of noise (panel annotations: "335 ms every 60 s"; "RMS cluster")
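The kind of event identification summarized in Figure 13 amounts to binning the measured computational holes by duration. The sketch below is illustrative only: the thresholds are hypothetical values chosen to separate the three regimes the text describes (fine-grained noise of a few ms, medium-grained activity such as RMS monitoring, and heavyweight interrupts of hundreds of ms), not the paper's actual classifier:

```python
def classify_noise(durations_ms, fine_max=4.0, medium_max=50.0):
    """Bin measured computational holes by duration.

    fine_max and medium_max are illustrative thresholds (ms), not
    values taken from the paper's measurements.
    """
    bins = {"fine": 0, "medium": 0, "heavyweight": 0}
    for d in durations_ms:
        if d <= fine_max:
            bins["fine"] += 1
        elif d <= medium_max:
            bins["medium"] += 1
        else:
            bins["heavyweight"] += 1
    return bins

if __name__ == "__main__":
    # Hypothetical trace: many short holes, a few long freezes.
    trace = [0.5] * 90 + [30.0] * 8 + [200.0, 335.0]
    print(classify_noise(trace))  # {'fine': 90, 'medium': 8, 'heavyweight': 2}
```

Matching each bin's duration and period against the known schedule of system activities (daemon wake-ups, RMS monitoring intervals, kernel interrupts) is what turns the histogram into an identification of the responsible events.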
3.4 Modeling system events

We developed a discrete-event simulator that takes into account all the classes of events identified and characterized in Section 3.2. This simulator provides a realistic lower bound on the execution time of a barrier operation. We validated the simulator for the measured events, and we can see from Figure 15 that the model is close to the experimental data. The gap between the model and the data at high node counts can be explained by the presence of a few especially noisy (probably misconfigured) clusters.

Using the simulator we can predict the performance gain that can be obtained by selectively removing the sources of the noise. For example, Figure 15 shows that with a computational granularity of 1 ms, if we remove the noise generated by either node 0, 1, or 31, we only get a marginal improvement, approximately 15%. If we remove all three "special" nodes (0, 1, and 31) we get an improvement of 35%. However, the surprise is that the noise in the system dramatically reduces ...

FINDING #2: In fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on a few nodes.

4 Eliminating the sources of noise

On January 25, 2003 we undertook the following optimizations on ASCI Q:

- We removed about ten daemons (including envmod, insightd, snmpd, lpd, and niff) from all nodes.
- We decreased the frequency of RMS monitoring by a factor of 2 on each node (from an interval of 30 seconds to an interval of 60 seconds).
- We moved several daemons from nodes 1 and 2 to node 0 on each cluster, in order to confine the heavyweight noise to this node.

As an initial test of the efficacy of these optimizations we used a simple benchmark in which all nodes compute for a fixed amount of time and then synchronize using a global barrier, whose latency is measured. Figure 16 shows the results for three types of computational granularity: 0 ms (a simple sequence of barriers without any intervening computation), 1 ms, and 5 ms, both with and without the noise-reducing optimizations described above.

Figure 16: Barrier latency as a function of the computational granularity

5 SAGE: Optimized performance

Following the removal of much of the noise induced by the operating system, the performance of SAGE was again analyzed. This was done in two situations: one at the end of January 2003 on a 1,024-node segment of ASCI Q, followed by the performance on the full-sized ASCI Q at the start of May 2003 (after the two individual 1,024-node segments had been connected together). The average cycle time obtained is shown in Figure 17. Note that the performance obtained in September and November 2002 is repeated from Figure 1. Also, the performance obtained in January 2003 is measured only up to 3,716 processors, while that obtained in May 2003 is measured up to 7,680 processors. These tests represent the largest-size machine on those dates but with nodes 0 and 31 configured out of each 32-node cluster.

We expect to be able to increase the available processors by removing just one processor from each of node 0 and node 31 of each cluster. This will allow the operating-system tasks to be performed without interfering with the application, while at the same time increasing the number of usable processors per cluster from 120 (30 out of 32 usable nodes) to 126 (with only two processors removed). This should improve the processing rate by a further 5% just by the increase of the usable processors by 6 per cluster, while not increasing the effect of noise.
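The compute-then-barrier benchmark just described can be sketched with Python threads standing in for nodes. This is only a structural illustration: CPython's global interpreter lock serializes the busy loops, so the timings it produces are not representative of the node-level measurements, and the worker counts and granularities are illustrative:

```python
import threading
import time

def barrier_benchmark(n_workers=4, granularity_s=0.001, reps=10):
    """Compute for a fixed time, then synchronize; report mean barrier wait.

    Each worker busy-waits for `granularity_s` (the calibrated
    computation), then enters a barrier. The time spent waiting in
    the barrier is the quantity that noise inflates: one delayed
    worker delays everyone.
    """
    barrier = threading.Barrier(n_workers)
    waits = [0.0] * n_workers

    def worker(rank):
        total = 0.0
        for _ in range(reps):
            t0 = time.perf_counter()
            while time.perf_counter() - t0 < granularity_s:
                pass                         # calibrated busy-work
            t1 = time.perf_counter()
            barrier.wait()                   # all ranks must arrive
            total += time.perf_counter() - t1
        waits[rank] = total / reps

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits

if __name__ == "__main__":
    waits = barrier_benchmark()
    print(f"mean barrier wait: {sum(waits)/len(waits)*1e3:.2f} ms")
```

Running the same loop with and without background activity on the machine reproduces, in miniature, the with/without-optimization comparison of Figure 16.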
It can be seen that the performance obtained in January and May is much improved over that obtained before noise was removed from the system. Also shown in Figure 17 is the minimum cycle time obtained over 50 cycles. It can be seen that the minimum time closely matches the expected performance. The minimum time represents those cycles of the application that were least affected by noise. Thus it appears that further optimizations may be possible that will help reduce the average cycle time down towards the minimum cycle time.

Figure 17: SAGE performance: expected and measured after noise removal

The effective performance for the different configurations tested prior to and after noise removal is listed in Table 2. Listed are the cycle times for the different configurations. However, the total processing rate across the system should be considered in comparing the performance, as the number of usable processors varies between the configurations. The achieved processing rate of the application, that is, the total number of cell updates per second, is also listed. This is derived from the cycle time as (processor count × cells per processor) ÷ cycle time. The cells per processor in all the SAGE runs presented here was 13,500. Note that the default performance on 8,192 processors is an extrapolation from the 4,096-processor performance using a linear performance degradation observed in the measurements of September/November 2002.

FINDING #3: SAGE's performance with nodes 0 and 31 removed from each cluster is only 15% below the model's expectations.

6 Discussion

In the previous section we have seen how the elimination of a few system activities benefited SAGE with a specific input deck. We now try to provide some guidelines to generalize our analysis.

In order to estimate the potential gains on other applications we provide insight on how the computational granularity of a balanced bulk-synchronous application correlates to the type of noise. The intuition behind this discussion is the following: while any source of noise has a negative impact on the overall performance of the application, a few sources of noise tend to have a significant impact. As a rule of thumb, the computational granularity of the application is deemed to "enter in resonance" with noise of a very similar harmonic frequency and duration.

In order to explain this correlation, consider the barrier benchmark of Figure 16 for the three optimized configurations with 0, 1, and 5 ms of computational granularity. For each of these cases we analyze the barrier-synchronization latency for the largest node count. For example, in such a configuration the barrier takes 0.19 ms, 2 ms, and 7 ms, respectively. In each case we consider the cumulative latency distribution, as shown in Figure 18. Each graph describes how different sources of noise affect the barrier-synchronization latency.

Figure 18(a) shows the results for a sequence of barriers without any computation (which represents an extreme case of fine-grained computation). We can see that 66% of the delay is caused by fine-grained noise, generating computational holes of less than 4 ms. The heavyweight, but less frequent, noise generated by node 0 in each cluster impacts the barrier latency by only 17%.

With 1 ms of computational granularity, the impact of the fine-grained noise is reduced to only 33% while the relative effect of the heavyweight noise grows to 27%, as shown in Figure 18(b). The primary source of degradation is the medium-grained noise generated by RMS on node 31 and on each cluster node.

Finally, we can see that with 5 ms of computational granularity (Figure 18(c)), more than half of the barrier latency is caused by node 0, while 33% is caused by RMS on node 31 plus the cluster nodes.
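The resonance intuition above can be made concrete with a minimal bulk-synchronous model in the spirit of the discrete-event simulator of Section 3.4: each step costs the maximum over all nodes of the granularity plus any noise absorbed during the step, because the barrier makes everyone wait for the slowest node. The noise probabilities and durations below are illustrative, not the measured ASCI Q values:

```python
import random

def simulate_bsp(n_nodes=256, granularity_ms=1.0, steps=1000,
                 p_fine=0.05, fine_ms=0.5,
                 p_heavy=5e-5, heavy_ms=300.0, seed=0):
    """Toy lower-bound model of a bulk-synchronous phase under noise.

    Per step, each node computes for granularity_ms plus any noise it
    absorbs; the step then costs the MAXIMUM over all nodes, so one
    delayed node delays everyone. Returns the slowdown factor
    relative to a noiseless run. All noise parameters are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        worst = 0.0
        for _ in range(n_nodes):
            noise = 0.0
            if rng.random() < p_fine:       # frequent, short noise
                noise += fine_ms
            if rng.random() < p_heavy:      # rare, heavyweight interrupt
                noise += heavy_ms
            worst = max(worst, noise)
        total += granularity_ms + worst
    return total / (granularity_ms * steps)

if __name__ == "__main__":
    for g in (1.0, 5.0):
        print(f"granularity {g} ms: slowdown {simulate_bsp(granularity_ms=g):.2f}x")
```

With the same noise trace, the slowdown factor is larger at finer granularity, since the same absolute delay is amortized over less useful computation per step; this is the mechanism behind the granularity-dependent breakdown seen in Figure 18.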
TABLE 2: SAGE effective performance after noise removal

  Configuration                             Usable      Cycle     Processing rate            Improvement
                                            processors  time (s)  (10^6 cell updates/sec.)   factor
  Unoptimized system                        8,192       1.60       69.1                      N/A
  3 PEs/node                                6,144       0.64      129.3                      1.87
  Without node 0                            7,936       0.87      123.1                      1.78
  Without nodes 0 and 31                    7,680       0.86      120.6                      1.75
  Without nodes 0 and 31 (best observed)    7,680       0.68      152.5                      2.21
  Model                                     8,192       0.63      178.4                      2.58
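Table 2's processing-rate column follows directly from the formula given in the text, rate = (processor count × cells per processor) ÷ cycle time, with 13,500 cells per processor. A quick check reproduces the published rates to within rounding of the cycle times (the Model row, for instance, differs slightly because 0.63 s is itself rounded):

```python
def processing_rate(processors, cycle_time_s, cells_per_processor=13_500):
    """Cell updates per second: processors x cells/processor / cycle time."""
    return processors * cells_per_processor / cycle_time_s

if __name__ == "__main__":
    # Rows of Table 2: (configuration, usable processors, cycle time in s)
    rows = [
        ("Unoptimized system",                     8_192, 1.60),
        ("3 PEs/node",                             6_144, 0.64),
        ("Without node 0",                         7_936, 0.87),
        ("Without nodes 0 and 31",                 7_680, 0.86),
        ("Without nodes 0 and 31 (best observed)", 7_680, 0.68),
        ("Model",                                  8_192, 0.63),
    ]
    base = processing_rate(8_192, 1.60)   # unoptimized baseline
    for name, procs, cycle in rows:
        rate = processing_rate(procs, cycle)
        print(f"{name:40s} {rate/1e6:7.1f} Mcells/s  x{rate/base:.2f}")
```

Note how the 3 PEs/node row wins on cycle time but sacrifices a quarter of the processors, which is why the table compares total processing rate rather than cycle time alone.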
(a) Barrier synchronization with no intervening computation
(b) 1 ms granularity
(c) 5 ms granularity

Figure 18: Cumulative noise distribution for different computational granularities

References

[1] ... Department, University of Tennessee, Knoxville, Tennessee, 1989. Available from http://www.netlib.org/benchmark/performance.ps.

[2] E. Frachtenberg, F. Petrini, et al. STORM: Lightning-fast resource management. In Proceedings of SC2002, Baltimore, Maryland, Nov. 16-22, 2002. Available from http://sc-2002.org/paperpdfs/pap.pap297.pdf.

[3] A. Hoisie, O. Lubeck, et al. A general predictive performance model for wavefront algorithms on clusters of SMPs. In Proceedings of the 2000 International Conference on Parallel Processing (ICPP-2000), Toronto, Canada, Aug. 21-24, 2000. Available from http://www.c3.lanl.gov/par-arch/pubs/icpp.pdf.

[4] D. J. Kerbyson, H. J. Alme, et al. Predictive performance and scalability modeling of a large-scale application. In Proceedings of SC2001, Denver, Colorado, Nov. 10-16, 2001. Available from http://www.sc2001.org/papers/pap.pap255.pdf.

[5] D. J. Kerbyson, A. Hoisie, et al. Use of predictive performance modeling during large-scale system installation. Parallel Processing Letters, 2003. World Scientific Publishing.

[6] K. Koch. How does ASCI actually complete multi-month 1000-processor milestone simulations? In Proceedings of the Conference on High Speed Computing, Gleneden Beach, Oregon, Apr. 22-25, 2002. Available from http://www.ccs.lanl.gov/salishan02/koch.pdf.

[7] F. Petrini and W. Feng. Improved resource utilization with buffered coscheduling. Journal of Parallel Algorithms and Applications, 16:123-144, 2001. Available from http://www.c3.lanl.gov/~fabrizio/papers/paa00.ps.

[8] F. Petrini, W. Feng, et al. The Quadrics network: High-performance clustering technology. IEEE Micro, 22(1):46-57, Jan./Feb. 2002. ISSN 0272-1732. Available from http://www.computer.org/micro/mi2002/pdf/mi046.pdf.

[9] J. C. Phillips, G. Zheng, et al. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of SC2002, Baltimore, Maryland, Nov. 16-22, 2002. Available from http://www.sc-2002.org/paperpdfs/pap.pap277.pdf.

[10] Quadrics Supercomputers World Ltd. RMS Reference Manual, Jun. 2002.