The Effect of NUMA Tunings on CPU Performance
hollowec@bnl.gov
Typical NUMA Architecture
[Diagram of a NUMA system: multiple nodes, each pairing CPUs with local memory, joined by an interconnect.]
How Can NUMA Improve Performance?
In uniform memory access (UMA) SMP systems, generally only one processor can access memory at a time
CPUs block the shared memory bus when accessing RAM
This can lead to significant performance degradation
The problem gets worse as the number of CPUs in a system increases
NUMA's advantage: each CPU has its own local RAM that it can access effectively independently of the other CPUs in the system (the local-versus-remote cost is sketched below)
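A quick way to observe this on a two-node host is to pin a memory-bound job to one node's CPUs while placing its memory either locally or remotely. This is an illustrative sketch, not from the slides; ./membench is a hypothetical stand-in for any memory-bandwidth-bound program (a STREAM binary, for example):

# Run on node 0's CPUs with node 0 (local) memory
numactl --cpunodebind=0 --membind=0 ./membench
# Same CPUs, but memory forced onto node 1: every access now crosses the inter-node link
numactl --cpunodebind=0 --membind=1 ./membench

On typical two-socket hardware the second run would be expected to show noticeably lower bandwidth and higher latency; that gap is what the tunings in this talk try to exploit.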
numactl
Display system NUMA topology
Allows for the execution of a process with specific NUMA affinity tunings
# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 40920 MB
node 0 free: 23806 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 40960 MB
node 1 free: 24477 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
Force execution on node 0 with memory allocation forced to node 0 (an out-of-memory condition occurs when all node 0 memory is exhausted):
# numactl --cpunodebind=0 --membind=0 COMMAND
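The benchmark charts later in the talk also test a softer policy: binding CPUs to a node while only preferring its memory, so allocations spill to other nodes instead of failing. A hedged sketch using standard numactl flags (this exact invocation is not shown in the slides):

# Bind CPUs to node 0 but only prefer node 0 memory
numactl --cpunodebind=0 --preferred=0 COMMAND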
numad: a daemon that automatically manages NUMA affinity for running processes, added in Red Hat Enterprise Linux (RHEL) and Scientific Linux (SL) 6.3
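The tunings benchmarked below combine the khugepaged scan interval with numad's scan interval. A sketch of how such settings are typically applied, assuming root; the sysfs path varies by distribution (on RHEL/SL 6 the directory is named redhat_transparent_hugepage):

# Wake khugepaged every 100 ms instead of the 10000 ms default
echo 100 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# Start numad, scanning at most every 15 seconds
numad -i 15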
HEPSPEC06 (HS06)
Standard HEP-wide CPU performance benchmark
http://w3.hepix.org/benchmarks/doku.php/
Benchmarking NUMA Tunings (Cont.)
ATLAS Software Benchmarks
Software from ATLAS KitValidation executed and timed
Event Generation
Geant4 Simulation
Digitization
Reconstruction
HEPSPEC06

[Bar chart: total HS06 score (larger is better) for three systems - E5-2660v2 (40 processes), E5-2660 (32 processes), and X5660 (24 processes) - under eight tunings:
No tuning
100ms khugepaged scan_sleep_millisecs
100ms khugepaged, numad -i 15
100ms khugepaged, numad -i 30
1000ms khugepaged, numad -i 15
1000ms khugepaged, numad -i 30
numactl node cpu bind, preferred node memory bind, 100ms khugepaged
numactl forced cpu and memory node bind, 100ms khugepaged
Visible scores span 366.77 to 379.43 on the E5-2660v2 and 200.04 to 211.77 on the X5660.]
HEPSPEC06 (Cont.)
Manual NUMA bindings for each HS06 process had little effect on the systems with newer Intel CPUs, and were detrimental on the Westmere (X5660) host (a sketch of such a per-process launcher follows)
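The slides do not show the launcher used for these per-process bindings; a minimal sketch of the idea - round-robin binding each benchmark process to a node on a two-node host, with ./run_hs06_copy as a hypothetical wrapper around one benchmark instance - might look like:

NODES=2     # two NUMA nodes, as in the numactl -H output above
NPROC=40    # e.g. the E5-2660v2 host
for i in $(seq 0 $((NPROC - 1))); do
    n=$((i % NODES))
    # forced binding; use --preferred=$n instead of --membind=$n for the softer policy
    numactl --cpunodebind=$n --membind=$n ./run_hs06_copy "$i" &
done
wait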
ATLAS Software in Parallel – Reconstruction
[Bar chart: average reconstruction job execution time in seconds (smaller values preferred) for three systems - E5-2660v2 (40 processes), E5-2660 (32 processes), and X5660 (24 processes) - under six tunings:
No tuning
100ms khugepaged scan_sleep_millisecs
100ms khugepaged, numad -i 15
100ms khugepaged, numad -i 30
numactl node cpu bind, preferred node memory bind, 100ms khugepaged
numactl forced cpu and memory node bind, 100ms khugepaged
Visible times span roughly 495.78 to 580.7 seconds.]
ATLAS Software in Parallel – Reconstruction (Cont.)
ATLAS Software in Parallel – Simulation
[Bar chart: average simulation job execution time in seconds (smaller values preferred) for three systems - E5-2660v2 (40 processes), E5-2660 (32 processes), and X5660 (24 processes) - under the same six tunings as the reconstruction chart. Visible times span roughly 1416.54 to 1483.91 seconds.]
ATLAS Software in Parallel – Simulation (Cont.)
Conclusions
Specific NUMA tunings work best for particular workloads and hardware
Unfortunately, there does not appear to be a “one size fits all” option