HP-UX Performance Troubleshooting Class: CPU Bottlenecks
Performance
Troubleshooting Class
CPU Bottlenecks
1
CPU bottlenecks
• We have a CPU bottleneck if our application
spends most of its time…
• Executing on the CPU, or…
• Waiting to execute on the CPU
• We need to know
• Where CPU cycles are consumed
• Which load module, function, source line, instruction
• How CPU cycles are consumed
• Useful execution, NOPs, stalled and why
A process has a CPU bottleneck if it spends most of its time using CPU resources.
If many processes must wait in order to execute, then there may be a system-level
CPU bottleneck.
2
Time spent executing on the CPU
Glance (process threads)
3
Time spent waiting for the CPU
Glance (thread wait states)
4
Do we have a CPU bottleneck?
Prospect (thread report)
Section 2.1.1: Sorted Execution+Blocked=Elapsed Clocks for KTid(5711), EPB
-------------------------------------------------------------------------------
Clock Type Seconds % Elapsed uSec/Count Counts
•The guts of the report produced by Prospect consist of a set of sections for
each thread, potentially for each process on the system (you set the scope
when invoking Prospect). The first of these sections gives an overview of
how the thread spent its time.
•Here we can clearly see that of an elapsed 30 seconds, this thread spent 23
seconds blocked and 6.9 seconds executing.
5
Quick question
• My system has a CPU utilization of only 3%,
therefore I definitely don’t have a CPU bottleneck,
right?
6
OK, so we have a CPU bottleneck –
now what?
• Identify where we are consuming most CPU
• Which module? (library or application binary)
• tells us who owns the offending code
• Which function?
• focuses the developer's mind on a particular algorithm
• Which line of source? Which instruction?
• allows us to begin to identify microarchitecture-type issues e.g.
TLB misses, cache misses
7
Where are CPU cycles consumed?
Caliper fprof
# caliper fprof -o fprof.out --threads=all --event-defaults PLM=all \
--attach=4555 --duration=30
8
Where are CPU cycles consumed?
Caliper fprof
Function Details (All Threads)
---------------------------------------------------
% Total Line|
IP IP Slot| >Statement|
Samples Samples Col,Offset Instruction
---------------------------------------------------
12.70 [vmunix::wait_for_lock_spinner, 0xe000000000710e60, pdk_spinlock.c]
30463 ~391 Function Totals
------------------------------------------
[File not found: /ux/core/kern/em/svc/sync/pdk_spinlock.c]
(8231) ~416 >
~1,0x0050:0 M nop.m 0
:1 M nop.m 0
:2 B_ br.call.sptk.many
rp=preArbitration+0x0 ;;
(10) ~399 *>
6 ~1,0x0060:0 M ld8 r9=[r40]
:1 I nop.i 0
:2 I_ nop.i 0 ;;
(568) ~412 *>
568 ~1,0x0070:0 M cmp.eq.unc p0,p6=r0,r9
:1 M nop.m 0
:2 B_(p6) br.dpnt.many {self}+0x130 ;;
9
Where are CPU cycles consumed?
Prospect (PA RISC)
# prospect -P
# prospect -f prospect.out -V4 sleep 30
-------------------------------------------------------------------------------
Section 2.1.5: Thread KTid(4615499) Intersections with 100Hz System Clock, U
-------------------------------------------------------------------------------
pcnt accum% Hits Secs Address Routine name Instruction TEXT Filename
10
•One of the most powerful Prospect features is its ability to help identify precisely what a thread is doing
while it is executing on the CPU. It does this by providing separate userspace and kernel profiles for each
target thread.
•A profile shows how much time a target process spends in each function, and generally lists this
information with the most time-consuming function at the top. So in the above example we can quickly
see that this process is spending 5.73 seconds executing userspace code. Of this, 3.38 seconds is in the
function busy_loop(), 0.42 seconds in routine() and so on. You’ll notice that Prospect has profiled not
only the application code, but also HP and third party libraries; because of this Prospect can be very
helpful in “pointing the finger” towards the party that is the cause of the performance problem.
•A profile is generated by interrupting the running program every so often and recording its current
program counter value – the address of the instruction being executed. The system does this anyway
during processing of its periodic clock interrupt, and this data is available via the KI HARDCLOCK trace.
After collecting this data for a while, Prospect maps these addresses to functions in the application
binary or one of the mapped libraries, and determines approximately how much CPU time is consumed by
each function. The profile then lists those functions that consume over a threshold amount of CPU.
•A profile is an example of sampled data rather than measured data, and as such is not perfectly
accurate. However for most purposes it works just fine.
•(One issue with the HARDCLOCK mechanism underlying Prospect sampling should be mentioned:
HARDCLOCK occurs at a regular interval (once every 1/100th of a second), and is part of the same
mechanism used to process kernel callouts. Callouts are timed events such as wakeups of processes
sleeping in sleep(), nanosleep(), select() etc. This may distort the profile for certain application
types, but no tool is perfect…)
•In summary, if the application is consuming a lot of time in user-mode CPU, Prospect allows us to “look
inside” and see what it’s doing.
Where are CPU cycles consumed?
Prospect (PA RISC)
KERNEL portion of profile:
608 hits = 6.08 seconds. KTC time was 3.487 seconds
pcnt accum% Hits Secs Address Routine name Instruction TEXT Filename
11
High system CPU
• Don’t immediately assume a kernel bug:
•Are the application and environment sensible?
• the application with 150,000 files in a single directory
• the dumb mutex benchmark
•Is this the real cause, or an effect?
• transaction timeouts causing all processes to write to
the same log file
•Is there a more efficient way of doing things?
• e.g. replace gettimeofday() with gethrtime()
• e.g. replace semop() with userspace spinlocks or
postwait()…
12
•Always best if you are able to compare costs at good and bad times.
How to approach high %sys
• Identify the costly system calls
• Identify the hot kernel functions
• Check for spinlock contention
• Use syscall, function & lock names as keywords to search
for solutions:
• HP Knowledge Mine (“kmine”) support database:
• https://i3107kmi.atl.hp.com/kmine/
• Chart, the labs defect tracking system:
• http://chart.hp.com/
• Odin, the central repository for HP-UX development documents:
• http://odin.rsn.hp.com
• Apply for a jazz account (source code access):
• http://integration.cup.hp.com/uxdev/newusers/
13
CPU Consumed executing syscalls
Glance (global syscalls)
14
CPU consumed executing syscalls
tusc -c
$ tusc -c dd if=deldata of=/dev/null bs=8192
10000+0 records in
10000+0 records out
Syscall Seconds Calls Errors
exit 0.00 1
read 0.43 10015
write 0.19 10010
open 0.00 10 3
close 0.00 6
…
----- ----- ----- -----
Total 0.62 20083 4
15
In the second case, a separate process was writing to the file at the same time.
Tusc helped to measure the effect of file access contention on the read()
system call for this process.
CPU consumed executing syscalls
Prospect (PA RISC)
-------------------------------------------------------------------------------
Section 2.1.2: Time KTid(5711) Executed during System Calls
-------------------------------------------------------------------------------
TOTAL Sys 0.448207 100.000% 2.0 221941
lwp_cond_broadcast Sys 0.168130 37.512% 1.4 118409
lwp_mutex_unlock Sys 0.124931 27.873% 2.2 56976
lwp_mutex_init Sys 0.043733 9.757% 3.4 12792
lwp_mutex_lock_sys Sys 0.036828 8.217% 8.4 4386
lwp_cond_init Sys 0.030386 6.779% 2.4 12790
clock_gettime Sys 0.022919 5.113% 1.7 13413
lwp_cond_signal Sys 0.008531 1.903% 3.7 2327
lwp_cond_wait_sys Sys 0.006161 1.375% 19.9 310
sched_yield Sys 0.004528 1.010% 8.9 507
- Skip remainder, SPU of 0.0001 is below '-m 0.0010' Sec
16
•This shows the amount of CPU time spent executing each system call type.
•One trick to quickly reading Prospect output is to read the “seconds” or the
“samples” columns, and not to be distracted by the percentages quoted. Here
we see a total of 0.44 seconds executing in system calls; comparing this to
the elapsed time of 30 seconds from the previous section, we might conclude
that time spent executing in system calls is not a significant factor in the
performance of this application.
How does the kernel spend its time?
kgmon and kprof
# kgmon -b
<wait for a bit>
# kgmon -hp
# kprof > kp.out
•Kernel profiling through kgmon is “the definitive” kernel profile – the profiling
tool used and supported by the kernel performance team. However it’s
important to understand that kgmon tells us how the kernel as a whole is
spending its time, not how one particular thread or process spends its time in
kernel. As such, a kgmon profile represents the activity of all processes on the
system. This can actually be a good thing as it may give a different, broader
perspective on the problem.
17
How to approach high %sys
• Identify the costly system calls
• Identify the hot kernel functions
• Check for spinlock contention
• Use syscall, function & lock names as keywords to search
for solutions:
• HP Knowledge Mine (“kmine”) support database:
• https://i3107kmi.atl.hp.com/kmine/
• Chart, the labs defect tracking system:
• http://chart.hp.com/
• Odin, the central repository for HP-UX development documents:
• http://odin.rsn.hp.com
• Apply for a jazz account (source code access):
• http://integration.cup.hp.com/uxdev/newusers/
18
OK, so we have a CPU bottleneck –
now what?
• Identify where we are consuming most CPU
• Which module? (library or application binary)
• tells us who owns the offending code
• Which function?
• focuses the developer's mind on a particular algorithm
• Which line of source? Which instruction?
• allows us to begin to identify microarchitecture-type issues e.g.
TLB misses, cache misses
19
How are CPU cycles consumed?
The Itanium2 pipeline
Generate instruction pointers, begin L1 I-cache and I-TLB accesses
Format instruction stream, load into instruction buffer
Expand instruction templates, disperse instructions
Register renaming
Deliver data to functional units from registers
Execution phase
Detect exceptions and branch
mispredictions
Write-back result to registers
Instruction flow
20
•The first two stages are front-end processing that stage the instructions. Stalls
here are often due to ITLB or instruction cache misses.
•The last six stages are back-end processing; stalls here are often due to
DTLB or data cache misses.
How are CPU cycles consumed?
Bubble analysis
[Figure: instructions 1-8 flow through the pipeline stages over three successive
cycles. When the EXE stage stalls, a bubble (B) is inserted in place of the
stalled instruction; a second cycle of stall inserts a second bubble, and the
bubbles advance down the pipeline with the instruction flow.]
21
In this case the stall was in the EXE phase so the bubble was inserted at that
point. This indicates a cycle passed with no instruction being retired. In the
second interval a second bubble was inserted because the EXE was still
stalled. Bubbles allow caliper to report where the stalls originate.
Why does the CPU stall?
Caliper cpu_metrics --metrics=stall
• Shows impact of various stall types on
application performance
# caliper cpu_metrics -o cpu.stall --metrics=stall --att=16024 --dur=30
22
•Some commercial applications (like Oracle) will have high D-cache miss rates
because a lot of the processing is on large amounts of memory with poor
locality.
•Often scientific applications can be designed to get very low CPI – under 4.
What % of cycles do useful work?
Caliper cpu_metrics --metrics=cpi
• Cycles Per Instruction (CPI)
• Average number of cycles elapsed per completed
instruction
• Small numbers are better
• Theoretical best value of 0.2, typical 0.8 – 4.0
23
•If everything required (instructions and data) is in L1 cache, CPI might be very
low. However if required data is in L3 or main memory, the CPU might stall for
long periods waiting for this data. This will significantly increase CPI.
•An interesting case is when data is frequently shared between CPUs; in this
case cache-to-cache copies must occur. These are particularly slow on cell-
based systems, and can lead to high CPI.
Quick question
• Is a CPI of 3.0 good or bad?
24
The translation lookaside buffer
• Virtual-to-physical translation information is itself stored in main memory
• Executable text contains virtual addresses
• It would be horribly inefficient to have to look up every virtual address
through the structures in main memory!
25
The translation lookaside buffer
Virtual address
Page # Offset
26
Itanium 2 TLB implementation
[Figure: a virtual address is looked up first in the L1ITLB (32 entries,
instruction side) and L1DTLB (32 entries, 4k pages only, data side), then in
the L2ITLB and L2DTLB (128 entries each, 2 and 4 cycles; only the L2DTLB
holds large pages). On an L2 TLB miss the hardware page walker searches the
VHPT, reaching into the L2/L3 cache (>25 to >31 cycles); failing that, a trap
to the OS walks the OS page tables, at a cost of up to 2000 cycles.]
27
Where are the TLB misses
originating?
# caliper dtlb_miss -o chase.dtlb ./ctlb
•This report breaks down the number of translations according to the level
in the hierarchy at which they are completed.
28
Where are the TLB misses
originating?
Function Details
--------------------------------------------------------------------------------------
% Total % % %
Sampled … DTLB DTLB DTLB Line|
DTLB … L2 HPW Soft Slot| >Statement|
Misses … Fill Fill Fill Col,Offset Instruction
--------------------------------------------------------------------------------------
100.00 [ctlb::ptrchase_loop, 0x40017f0, ctlb.c]
79.6 20.4 0.0 109 Function Totals
-----------------------------------------------------------------------------
[/home/col/chase/ctlb.c]
(79.6) (20.4) (0.0) 110 > while (p = (unsigned char **) *p) ;
3,0x0020:0 M cmp4.eq p6,p0=r0,r33
:1 M nop.m 0
:2 B_(p6) br.dptk {self}+0x50 ;;
0.0 0.0 0.0 37,0x0030:0 M_ addp4 r8=0,r33 ;;
79.6 20.4 0.0 :1 M ld4 r33=[r8]
0.0 0.0 0.0 :2 I_ nop.i 0 ;;
37,0x0040:0 M cmp4.eq p0,p6=r0,r33
:1 M nop.m 0
:2 B_(p6) br.dptk {self}+0x30 ;;
--------------------------------------------------------------------------------------
29
Here we see the exact location in the program generating TLB misses. In this
case, 100% of sampled misses occurred in the function ptrchase_loop(), which
is defined in the file ctlb.c. Within this function, at line 110 of the file,
we see a while loop which accounts for all of the TLB misses. Within the
three bundles that implement this source line, we can see that all the TLB
misses originate in the second bundle, due to the ld4 instruction.
Improving TLB performance
• Cluster code and data that is used together
• Place functions that are used together next to each
other
• Place data that is used together next to each other
• (Occupy as small a number of pages as possible)
• Use larger pages. Supported sizes:
• 4k – 256M
• Merge the data segments of shared libraries, so
they can be placed in larger pages:
chatr +mergeseg enable <binary>
30
How to use large pages
• Link time:
-Wl,+pd <size> -Wl,+pi <size>
• chatr +pd <size> +pi <size> <binary>
• Kernel tuning
• vps_pagesize
• vps_ceiling
• vps_chatr_ceiling
31
•vps_pagesize is the minimum page size the kernel should use if the user has
not chatr()'d a specific value. vps_pagesize is specified in units of kilobytes and
should equate to one of the supported values. In the event vps_pagesize does
not correspond to a supported page size, the closest page size smaller than
the user's specification is used. For example, specifying 20K would result in the
kernel using 16K. vps_pagesize is essentially a boot time configurable page
size for all user objects created. The actual effectiveness of that page size on
the system is unknown. The actual page size used is also dependent on virtual
alignment. Even though vps_pagesize is configured to 16K, if the virtual
alignment is not suitable for a 16K page then 4K pages are used instead. The
current default value for vps_pagesize is 4 kilobytes (vps_pagesize = 4)
•vps_ceiling is the maximum size page the kernel uses when selecting page
size "transparently". vps_ceiling is specified in units of kilobytes. Like
vps_pagesize, vps_ceiling should be a valid page size. If not, the value is
rounded down to the closest valid page size. vps_ceiling places a limit on the
size used for process data/stack and the size of pages the kernel selects
transparently for non dynamic objects (text, shared memory, etc). The default
value for vps_ceiling is 16K (vps_ceiling = 16).
•vps_chatr_ceiling is the largest value to which a user is able to chatr() a
program or library. The command itself is not limited, but the kernel checks the chatr'd value
against the maximum and only values below or equal to vps_chatr_ceiling are
actually used. In the event the chatr() value exceeds vps_chatr_ceiling, the
actual value used is the value of vps_chatr_ceiling. Like the others,
vps_chatr_ceiling is specified in units of kilobytes and will be rounded down to
the closest page size, if an invalid size is specified. chatr() does not require
any sort of user privilege and can therefore be used by any user. This tunable
allows the system administrator to control the use of large pages if there are
"bad citizens" who abuse the facility. The default value for vps_chatr_ceiling is
64 Megabytes (vps_chatr_ceiling = 65536).
32
The downside of large pages
• Increased memory consumption
• Page faults take longer
• Memory locality for multithreaded applications
33
•This only applies on systems with cell local memory, and applications that
have shared memory (either heap shared by threads or shared memory
regions shared by processes and allocated IPC_MEM_FIRST_TOUCH), and
where each thread/process initializes a chunk of shared memory and is the
predominant user of that chunk. A good rule of thumb is to try to make the
page size no larger than 1/10 of the size of the chunks the threads work on
(that minimizes edge effects where a page spans the boundary between two
chunks). The flip side of this is that too small a page may generate TLB
misses.
•A good example of this is the stream benchmark, which has three
large arrays, copied in parallel by multiple threads. On a 64-way
machine, each thread copies a chunk that is 1/64 of each array. If
each array is 640MB, then each thread will operate on a 10MB
chunk, and a page size of 1MB will give you good locality without
generating too many TLB misses.
34
The memory hierarchy
registers: 128 words, CPU speed
L1 cache: 32KB, 1 cycle
L2 cache: 256KB, 5-11 cycles
L3 cache: 6 MB, 12-18 cycles
local cell memory: 64 GB, 322 cycles
memory within x-bar: 256 GB, 472 cycles
memory across x-bar: 768 GB, 548 cycles
disk: multiple TB, 7,500,000 cycles
35
Keep the most frequently accessed data as close as possible to the thread
using it.
Where are the cache misses
originating?
# caliper dcache_miss -o chase.dmiss ./chase L1
...
Sampling Specification
Sampling event: DATA_EAR_EVENTS
Sampling rate: 10000 events
Sampling rate variation: 500 (5.00% of sampling rate)
Sampling counter privilege: user_priv_level (user-space sampling)
Data granularity: 16 bytes
Number of samples: 2654
Data sampled: Data cache miss
Data Cache Metrics Summed for Entire Run
-----------------------------------------------
Counter Priv. Mask Count
-----------------------------------------------
L1D_READS 8 (USER) 26985136325
L1D_READ_MISSES_ALL 8 (USER) 212283019
DATA_REFERENCES 8 (USER) 26985171755
-----------------------------------------------
L1 data cache miss percentage:
0.79 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)
36
Not such a happy L1 data cache
----------------------------------------------
Counter Priv. Mask Count
----------------------------------------------
L1D_READS 8 (USER) 9224811357
L1D_READ_MISSES_ALL 8 (USER) 9224670482
DATA_REFERENCES 8 (USER) 9224848339
----------------------------------------------
L1 data cache miss percentage:
100.00 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)
Function Summary
--------------------------------------------------------------------------------------------
% Total Cumulat Avg.
Latency % of Sampled Latency Laten.
Cycles Total Misses Cycles Cycles Function File
--------------------------------------------------------------------------------------------
99.91 99.91 115290 576490 5.0 chase::ptrchase_loop chase.c
0.09 100.00 100 500 5.0 chase::main chase.c
--------------------------------------------------------------------------------------------
37
But which source code line?
Function Details
-----------------------------------------------------------------------
% Total Avg. Line|
Latency Sampled Latency Laten. Slot| >Statement|
Cycles Misses Cycles Cycles Col,Offset Instruction
-----------------------------------------------------------------------
99.91 [chase::ptrchase_loop, 0x40000000000019f0, chase.c]
115290 576490 5.0 ~109 Function Totals
--------------------------------------------------------------
[/tmp/col/chase.c]
(105) (525) (5.0) ~109 >{
0 0 ~1,0x0000:0 M alloc r31=ar.pfs,0,2,0,0
105 525 5.0 :1 M ld8 r8=[r32]
0 0 :2 I_ mov r33=pr ;;
(115185) (575965) (5.0) ~110 > while (p = (unsigned char **) *p) ;
~3,0x0010:0 M cmp.eq.unc p6,p0=r0,r8
:1 M nop.m 0
:2 B_(p6) br.dpnt.many {self}+0x50 ;;
~3,0x0020:0 M cmp.ne.or.andcm p17,p16=42,r0
:1 I mov.i ar.ec=1
:2 I_ nop.i 0 ;;
115185 575965 5.0 ~3,0x0030:0 M_(p17) ld8 r8=[r8] ;;
0 0 :1 M (p17) cmp.eq p0,p16=0,r8
0 0 :2 I nop.i 0
~3,0x0040:0 M nop.m 0
:1 M nop.m 0
:2 B_(p16) br.wtop.dptk.many {self}+0x30 ;;
-----------------------------------------------------------------------
38
An even less happy data cache
Data Cache Metrics Summed for Entire Run
---------------------------------------------
Counter Priv. Mask Count
---------------------------------------------
L1D_READS 8 (USER) 524471503
L1D_READ_MISSES_ALL 8 (USER) 524330677
DATA_REFERENCES 8 (USER) 524638283
---------------------------------------------
L1 data cache miss percentage:
99.97 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)
39
High cache miss rate –
What can the programmer do?
• Identify the source of the problem with Caliper
• Cluster frequently read data in as few cache lines as
possible
• gets most useful data into small caches
• reduces probability of cache thrashing
• (“machinfo” will show cache line size)
• CAVEAT: Keep frequently updated data in different cache lines, if
updates occur on different CPUs
• Be wary of “pointer chasing” code
• Long linked lists – use skip lists/hash pools/whatever
• Sparse structures holding pointers – keep pointers together
outside larger structures
• Don’t mix integer and float in a cache line
40
•Data items are generally placed into memory in the order in which they are
declared in the program source. This allows the programmer control over what
data is near what other data.
•Since floating point data is not held in the L1 cache, storing a float in a cache
line causes the entire line to be flushed from L1 cache, impairing the speed of
access to the non-floating-point data items. Mixing floats and integers in a single
line is therefore unwise.
Reducing memory latency with
locality domains
• HP cell-based systems are ccNUMA
• not all memory locations have the same access latency!
• Superdome example:
• Latency within cell (A-A): ~215 ns
• Latency within crossbar (A-B): ~315 ns
• Latency across crossbars (A-C): ~365 ns
41
Using locality domains, systems can be tuned to keep the maximum number of
memory accesses in the local cell.
Interleaved memory
• Blocks of physical memory from cells are organized round-robin to form
the physical address space of the system
• Interleaving performed at boot by firmware
• Stripe size is a cache line
• Memory access is uniformly mediocre
42
Cell-local memory
• Blocks of memory that come from a given cell appear at a range of
addresses
• Cell-local memory is not interleaved with memory from other cells
• Access times to these addresses will be optimal for the processors in
the same cell
43
A typical configuration
• Typically some percentage of memory will be interleaved, and some
configured as cell-local
• Some address ranges correspond to cell-local memory, and others to
interleave memory
• Amount of cell-local memory is configurable at boot time
44
Some system calls have been enhanced to allow applications to take explicit
control over their use of CLM. For example shmget() allows specification of
either IPC_MEM_LOCAL for cell-local memory or IPC_MEM_INTERLEAVED
for interleaved memory. If IPC_MEM_LOCAL is specified, the memory is local
to the thread calling shmget(), unless IPC_MEM_FIRST_TOUCH is also
specified, in which case the first thread to access the shared memory
determines the cell on which that memory is allocated.
Page allocation policies
• Memory that is private to a process is allocated
from same locality as the process
• e.g. stack, data segment, private memory-mapped data
• Memory that is shared between processes is
allocated from interleaved memory
• e.g. kernel memory, program text, shared libraries,
shared memory, shared memory-mapped data
• shmget() and mmap() enhanced to allow
interleave or CLM to be requested
• chatr +id requests interleave memory for
process data segment
45
For a process with many threads, the data segment should be interleaved, since
there is a greater chance that some of the threads will end up in a different
locality domain. With only a few threads, using cell-local memory may be better.
Default process and thread
launch policies
• Defines where processes/threads initially start to
execute
• Processes are placed in the locality with the
fewest number of threads currently queued
waiting to run (“least loaded”)
• Threads are placed in same locality as creator
until there are more threads than CPUs; then spill
to another locality (“fill first”)
• No binding is established; free to migrate
46
Controlling locality with mpsched(1)
•Display localities and their CPUs:
# mpsched -s
•Bind process 1234 to CPU 3:
# mpsched -c 3 -p 1234
•Execute the command newtm in LDOM 0:
# mpsched -l 0 ./newtm
•Execute all of newtm’s threads in the same
LDOM:
# mpsched -T PACKED ./newtm
47
How to monitor CLM usage
locinfo
• The source is in the pstat_getlocality man page!!
• Compile with +DD64 flag
# ./locinfo
--- System wide locality info: ---
index ldom physid type total free used
0 0 0 CLM 0 0 0
1 1 2 CLM 8175M 7939M 236M
2 -1 -1 ILV 23G 18G 5312M
----- ----- -----
31G 26G 5549M
48
How to monitor CLM usage
Caliper --memory-usage=exit
• Add “--memory-usage=exit” to any caliper invocation
# caliper fprof -o mem.cal --memory-usage=exit --att=12345 --dur=10
The weighted pages field attempts to provide a measure of the total memory
cost of this process by adding the private page count to the shared page count
divided by the number of sharing processes.
49
Does cell-local memory really help?
Interleave memory
hpcpc17: /home/col> chatr +id enable memm2
...
hpcpc17: /home/col> ./memm2
memmove() test
data moved per cycle: 671088640
move unit: 64 to 65536 bytes
memory footprint: 16 to 1024 pages
output in MB/s
source footprint (pages)
move size 16 32 64 128 256 512 1024
64 2468 2395 1997 1958 1944 1921 227
128 3578 3364 2531 2433 2426 2552 230
256 4614 4114 2336 2172 2163 2482 230
512 7519 6489 4883 4729 4653 4719 332
1024 10051 8494 6144 5953 5803 5810 405
2048 12088 9919 7078 6863 6685 6618 447
4096 13444 10836 7652 7420 7215 7097 465
8192 14240 11361 7971 7731 7539 7376 475
16384 14667 11523 8127 7879 7702 7529 481
32768 14896 11782 8216 7968 7784 7604 484
65536 15042 11849 8261 8003 7831 7645 486
hpcpc17: /home/col>
50
•The output of interest in the above test is that relating to a large (1024-page)
memory footprint, which cannot be satisfied from cache.
Does cell-local memory really help?
Cell-local memory
hpcpc17: /home/col> chatr +id disable memm2
...
hpcpc17: /home/col> ./memm2
memmove() test
data moved per cycle: 671088640
move unit: 64 to 65536 bytes
memory footprint: 16 to 1024 pages
output in MB/s
source footprint (pages)
move size 16 32 64 128 256 512 1024
64 2470 2402 1989 1957 1951 1935 373
128 3577 3421 2583 2484 2429 2457 377
256 4619 4236 2473 2253 2187 2222 375
512 7523 6926 4935 4781 4711 4554 550
1024 10064 9068 6192 6013 5911 5650 673
2048 12094 10678 7115 6918 6806 6474 730
4096 13468 11614 7682 7480 7353 7030 757
8192 14251 12316 7996 7785 7660 7347 773
16384 14694 12607 8143 7934 7813 7526 782
32768 14937 12800 8234 7975 7895 7610 787
65536 15044 12898 8285 8067 7942 7659 787
hpcpc17: /home/col>
51
What can a sysadmin do to
reduce memory latencies?
• Ensure enough CLM is configured on the system
• Consider whether interleave or cell-local memory
is best for process data segment
• chatr +id enable <binary>
• chatr +id disable <binary>
• Keep dissimilar processes apart to reduce cache
pollution
• mpsched
• PSETs
• Place cooperating processes close to each other
52
Snoopy cache coherence
[Figure: several CPUs, each with its own cache, attached to a single shared
bus; every cache snoops all bus transactions]
53
Snoopy cache coherence is suitable when all caches share a single bus.
However on a cell-based architecture bus traffic is segregated, with only the
necessary transactions traversing the crossbars. As a result snoopy cache
coherence is inappropriate in cell-based systems.
Directory cache coherence
CPU CPU CPU CPU CPU CPU CPU CPU
cache cache cache cache cache cache cache cache
directory directory
crossbar
directory directory
54
Load-to-use latencies for
sx1000-based systems
[Figure: load-to-use latency chart for the sx1000 chipset with Madison
processors]
55
Load-to-use implications
• Small bus-based systems can be much faster for
workloads with low concurrency
• Cell-based systems can offer much better
throughput because of improved scaling
• Workloads that frequently share data between
CPUs can perform poorly on cell-based systems
• A 4-CPU partition of a cell-based system is NOT
equivalent to a 4-CPU bus-based system
56
Further learning
• Introduction to Microarchitectural Optimization for Itanium 2 Processors:
http://www.intel.com/software/products/vtune/techtopic/software_optimization.pdf
57