Hpux Performance Troubleshooting Class CPU Bottlenecks


HPUX

Performance
Troubleshooting Class

CPU Bottlenecks

© 2004 Hewlett-Packard Development Company, L.P.


The information contained herein is subject to change without notice

1
CPU bottlenecks
• We have a CPU bottleneck if our application
spends most of its time…
• Executing on the CPU, or…
• Waiting to execute on the CPU
• We need to know
• Where CPU cycles are consumed
• Which load module, function, source line, instruction
• How CPU cycles are consumed
• Useful execution, NOPs, stalled and why

A process has a CPU bottleneck if it spends most of its time using, or waiting
for, CPU resources. If many processes must wait in order to execute, there may
be a system-level CPU bottleneck.

2
Time spent executing on the CPU
Glance (process threads)

3
Time spent waiting for the CPU
Glance (thread wait states)

4
Do we have a CPU bottleneck?
Prospect (thread report)
Section 2.1.1: Sorted Execution+Blocked=Elapsed Clocks for KTid(5711), EPB
-------------------------------------------------------------------------------
Clock Type Seconds % Elapsed uSec/Count Counts

ELAPSED TOTAL 30.119902 100.000% Total


BLOCKED TOTAL 23.193053 77.002% 2345.1 9890
Blocked misc 21.542943 71.524% 4356.5 4945
EXECUTION TOTAL 6.926849 22.998% Total
Execution UseR 6.423332 21.326% 28.0 229600
Blocked PreempT 1.650110 5.478% 333.7 4945
Execution Sys 0.448207 1.488% 2.0 221941
Execution VfaulT 0.018977 0.063% 27.6 687
Execution IntUseR 0.017371 0.058% 22.3 778
Execution CsW 0.016974 0.056% 2.8 6022
Execution IntSys 0.001186 0.004% 18.0 66
- Skip remainder, SPU of 0.0008 is below '-m 0.0010' Sec

•The guts of the report produced by Prospect consists of a set of sections for
each thread, potentially for each process on the system (you set the scope
when invoking Prospect). The first of these sections gives an overview of
how the thread spent its time.

•Here we can clearly see that of an elapsed 30 seconds, this thread spent 23
seconds blocked and 6.9 seconds executing.

5
Quick question
• My system has a CPU utilization of only 3%,
therefore I definitely don’t have a CPU bottleneck,
right?

6
OK, so we have a CPU bottleneck –
now what?
• Identify where we are consuming most CPU
• Which module? (library or application binary)
• tells us who owns the offending code
• Which function?
• focuses the developers mind on a particular algorithm
• Which line of source? Which instruction?
• allows us to begin to identify microarchitecture-type issues e.g.
TLB misses, cache misses

• Identify how we are consuming most CPU


• Genuine efficient execution, or…
• Inefficient execution
• excessive NOPS, cache misses, TLB misses etc.

7
Where are CPU cycles consumed?
Caliper fprof
# caliper fprof -o fprof.out --threads=all --event-defaults PLM=all \
--attach=4555 --duration=30

Function Summary (All Threads)


-------------------------------------------------------------------------
% Total Cumulat
IP % of IP
Samples Total Samples Function File
-------------------------------------------------------------------------
12.70 12.70 30463 vmunix::wait_for_lock_spinner pdk_spinlock.c
11.59 24.29 27793 vmunix::resume_cleanup pdk_swtch.c
11.09 35.38 26594 vmunix::preArbitration pdk_spinlock.c
7.26 42.64 17416 vmunix::spinlock
3.91 46.54 9368 vmunix::swtch_to_thread pm_swtch.c
3.75 50.30 9003 vmunix::syscall syscall.c
3.30 53.59 7906 vmunix::spinlock_usav
2.19 55.78 5247 vmunix::setrq pm_spu_runq.c
1.85 57.63 4431 vmunix::force_run_rtsched sched_setrun.c
1.42 59.05 3411 libpthread.so.1::__mxn_sleep sleep.c
1.09 60.14 2604 vmunix::sched_yield_core pm_rtsched.c
1.05 61.19 2515 libpthread.so.1::__spin_lock_spin spin.c
0.98 62.17 2350 prcthr::cmplt_ cmplt.f
0.98 63.15 2349 vmunix::owns_spinlock spinlock.c

8
Where are CPU cycles consumed?
Caliper fprof
Function Details (All Threads)
---------------------------------------------------
% Total Line|
IP IP Slot| >Statement|
Samples Samples Col,Offset Instruction
---------------------------------------------------
12.70 [vmunix::wait_for_lock_spinner, 0xe000000000710e60, pdk_spinlock.c]
30463 ~391 Function Totals
------------------------------------------
[File not found: /ux/core/kern/em/svc/sync/pdk_spinlock.c]
(8231) ~416 >
~1,0x0050:0 M nop.m 0
:1 M nop.m 0
:2 B_ br.call.sptk.many
rp=preArbitration+0x0 ;;
(10) ~399 *>
6 ~1,0x0060:0 M ld8 r9=[r40]
:1 I nop.i 0
:2 I_ nop.i 0 ;;
(568) ~412 *>
568 ~1,0x0070:0 M cmp.eq.unc p0,p6=r0,r9
:1 M nop.m 0
:2 B_(p6) br.dpnt.many {self}+0x130 ;;

9
Where are CPU cycles consumed?
Prospect (PA RISC)
# prospect -P
# prospect -f prospect.out -V4 sleep 30
-------------------------------------------------------------------------------
Section 2.1.5: Thread KTid(4615499) Intersections with 100Hz System Clock, U
-------------------------------------------------------------------------------

USER portion of profile:


573 hits = 5.73 seconds. KTC time was 5.636 seconds

pcnt accum% Hits Secs Address Routine name Instruction TEXT Filename

59% 59% 338 3.38 0x000055fc busy_loop ENTRY GLOBAL /home/col/mallbench/main


7% 66% 42 0.42 0x00003c90 routine CODE GLOBAL /home/col/mallbench/main
6% 72% 35 0.35 0xc0197360 _sigfillset STUB EXTERNAL /usr/lib/libc.2
4% 76% 23 0.23 0xc004dfa0 pthread_mutex_lock CODE GLOBAL /usr/lib/libpthread.1
3% 80% 18 0.18 0xc01a6d18 rand_r ENTRY GLOBAL /usr/lib/libc.2
3% 82% 16 0.16 0xc004e568 pthread_mutex_unlock CODE GLOBAL /usr/lib/libpthread.1
2% 84% 11 0.11 0xc004e538 __pthread_mutex_unlock ENTRY GLOBAL /usr/lib/libpthread.1
2% 86% 10 0.10 0x000021bc $$remU MILLICODE GLOBAL /home/col/mallbench/main
2% 88% 10 0.10 0xc021cd00 __thread_mutex_lock CODE GLOBAL /usr/lib/libc.2
1% 89% 8 0.08 0xc0137adc $$dyncall_external MILLICODE GLOBAL /usr/lib/libc.2
1% 91% 8 0.08 0xc0196950 mallinfo CODE GLOBAL /usr/lib/libc.2
1% 92% 8 0.08 0xc019ab70 malloc CODE GLOBAL /usr/lib/libc.2
1% 93% 7 0.07 0xc004df88 __pthread_mutex_lock ENTRY GLOBAL /usr/lib/libpthread.1
1% 94% 7 0.07 0xc019ad38 free ENTRY GLOBAL /usr/lib/libc.2
1% 96% 7 0.07 0xc021cdb0 __thread_mutex_unlock ENTRY GLOBAL /usr/lib/libc.2

10

•One of the most powerful Prospect features is its ability to help identify precisely what a thread is doing
while it is executing on the CPU. It does this by providing separate userspace and kernel profiles for each
target thread.

•A profile shows how much time a target process spends in each function, and generally lists this
information with the most time-consuming function at the top. So in the above example we can quickly
see that this process is spending 5.73 seconds executing userspace code. Of this, 3.38 seconds is in the
function busy_loop(), 0.42 seconds in routine() and so on. You’ll notice that Prospect has profiled not
only the application code, but also HP and third party libraries; because of this Prospect can be very
helpful in “pointing the finger” towards the party that is the cause of the performance problem.

•A profile is generated by interrupting the running program every so often and recording its current
program counter value – the address of the instruction being executed. The system does this anyway
during processing of its periodic clock interrupt, and this data is available via the KI HARDCLOCK trace.
After collecting this data for a while, Prospect maps these addresses to functions in the application
binary or one of the mapped libraries, and determines approximately how much CPU time is consumed by
each function. The profile then lists those functions that consume more than a threshold amount of CPU.
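•As a minimal illustration of the sampling idea (not Prospect's actual mechanism – Prospect
consumes the kernel's KI HARDCLOCK trace), a userspace program can profile itself at 100Hz
with ITIMER_PROF and SIGPROF. This sketch only counts ticks; a real profiler would record the
interrupted program counter and bucket it by function using the symbol table:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile unsigned long samples;

/* A real profiler would record the interrupted program counter here
   (available via the ucontext argument of an SA_SIGINFO handler) and
   map it to a function using the symbol table. */
static void on_prof(int sig)
{
    (void)sig;
    samples++;
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;
    volatile double x = 0.0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 10000;   /* 100Hz, like HARDCLOCK */
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    while (samples < 500)             /* ITIMER_PROF counts CPU time consumed */
        x += 1.0;

    printf("collected %lu samples\n", samples);
    return 0;
}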

•A profile is an example of sampled data rather than measured data, and as such is not perfectly
accurate. However for most purposes it works just fine.

•(One issue that should be mentioned about the HARDCLOCK mechanism
underlying Prospect sampling is that HARDCLOCK occurs at a regular interval
(once every 1/100th of a second), and is part of the same mechanism used to
process kernel callouts. Callouts are timed events such as wakeups of
processes sleeping in sleep(), nanosleep(), select() etc. This may distort the
profile for certain application types, but no tool is perfect…)

•In summary, if the application is consuming a lot of time in user-mode CPU, Prospect allows us to “look
inside” and see what it’s doing.

10
Where are CPU cycles consumed?
Prospect (PA RISC)
KERNEL portion of profile:
608 hits = 6.08 seconds. KTC time was 3.487 seconds

pcnt accum% Hits Secs Address Routine name Instruction TEXT Filename

26% 26% 159 1.59 0x0014e0f8 syscall FUNC GLOBAL /stand/vmunix


26% 52% 159 1.59 0x003f9000 msgsnd2 FUNC GLOBAL /stand/vmunix
9% 61% 52 0.52 0x003f95b0 msgrcv2 FUNC GLOBAL /stand/vmunix
7% 68% 44 0.44 0x003f7dd8 ki_syscalltrace FUNC GLOBAL /stand/vmunix
4% 73% 27 0.27 0x00097810 resume_cleanup FUNC GLOBAL /stand/vmunix
3% 76% 21 0.21 0x00033a08 syscallinit OBJECT GLOBAL /stand/vmunix
3% 79% 20 0.20 0x00034cb0 spinunlock OBJECT GLOBAL /stand/vmunix
2% 82% 14 0.14 0x00038e00 ulbcopy_gr_method OBJECT GLOBAL /stand/vmunix
2% 84% 14 0.14 0x0014efd8 copyin FUNC GLOBAL /stand/vmunix
2% 86% 14 0.14 0x008678b0 ki_accum_push_TOS FUNC GLOBAL /stand/vmunix
2% 88% 13 0.13 0x00033f64 $syscallrtn OBJECT GLOBAL /stand/vmunix
2% 90% 13 0.13 0x0043cef8 CopyOutLong FUNC GLOBAL /stand/vmunix
1% 92% 9 0.09 0x003fb0b8 msgconv FUNC GLOBAL /stand/vmunix
1% 93% 7 0.07 0x00867790 ki_accum_pop_TOS_sys FUNC GLOBAL /stand/vmunix
1% 94% 4 0.04 0x00038c40 ulbcopy OBJECT GLOBAL /stand/vmunix
1% 94% 4 0.04 0x0014ed98 spluser FUNC GLOBAL /stand/vmunix
0% 95% 3 0.03 0x000344d0 $syscallexit OBJECT GLOBAL /stand/vmunix
0% 95% 3 0.03 0x00038ca0 ulbcopy_pcxu_method OBJECT GLOBAL /stand/vmunix

11


11
High system CPU
• Don’t immediately assume a kernel bug:
•Is the application and environment sensible?
• the application with 150,000 files in a single directory
• the dumb mutex benchmark
•Is this the real cause, or an effect?
• transaction timeouts causing all processes to write to
the same log file
•Is there a more efficient way of doing things?
• e.g. replace gettimeofday() with gethrtime()
• e.g. replace semop() with userspace spinlocks or
postwait()…
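
As a hedged sketch of the gettimeofday()-to-gethrtime() substitution suggested above
(assuming the HP-UX gethrtime() interface, which returns a monotonic nanosecond counter
and is much cheaper than a full gettimeofday() call; the declaring header may vary by release):

#include <stdio.h>
#include <sys/time.h>   /* gettimeofday(); gethrtime() assumed declared here */

int main(void)
{
    struct timeval tv;
    hrtime_t t0, t1;

    gettimeofday(&tv, NULL);   /* wall-clock time: relatively costly */

    t0 = gethrtime();          /* cheap monotonic nanosecond counter */
    /* ... code being timed ... */
    t1 = gethrtime();

    printf("elapsed: %lld ns\n", (long long)(t1 - t0));
    return 0;
}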

12

•It is always best if you are able to compare costs at good and bad times.

12
How to approach high %sys
• Identify the costly system calls
• Identify the hot kernel functions
• Check for spinlock contention
• Use syscall, function & lock names as keywords to search
for solutions:
• HP Knowledge Mine (“kmine”) support database:
• https://i3107kmi.atl.hp.com/kmine/
• Chart, the labs defect tracking system:
• http://chart.hp.com/
• Odin, the central repository for HP-UX development documents:
• http://odin.rsn.hp.com
• Apply for a jazz account (source code access):
• http://integration.cup.hp.com/uxdev/newusers/

13

13
CPU Consumed executing syscalls
Glance (global syscalls)

14

14
CPU consumed executing syscalls
tusc -c
$ tusc -c dd if=deldata of=/dev/null bs=8192
10000+0 records in
10000+0 records out
Syscall Seconds Calls Errors
exit 0.00 1
read 0.43 10015
write 0.19 10010
open 0.00 10 3
close 0.00 6

----- ----- ----- -----
Total 0.62 20083 4

$ tusc -c dd if=deldata of=/dev/null bs=8192


10000+0 records in
10000+0 records out
Syscall Seconds Calls Errors
exit 0.00 1
read 1.32 10015
write 0.13 10010
open 0.00 10 3
close 0.00
...
----- ----- ----- -----
Total 1.45 20083 4

15

In the second case, a separate process was writing to the file at the same time.
Tusc helped to measure the effect of file access contention on the read()
system call for this process.

15
CPU consumed executing syscalls
Prospect (PA RISC)
-------------------------------------------------------------------------------
Section 2.1.2: Time KTid(5711) Executed during System Calls
-------------------------------------------------------------------------------
TOTAL Sys 0.448207 100.000% 2.0 221941
lwp_cond_broadcast Sys 0.168130 37.512% 1.4 118409
lwp_mutex_unlock Sys 0.124931 27.873% 2.2 56976
lwp_mutex_init Sys 0.043733 9.757% 3.4 12792
lwp_mutex_lock_sys Sys 0.036828 8.217% 8.4 4386
lwp_cond_init Sys 0.030386 6.779% 2.4 12790
clock_gettime Sys 0.022919 5.113% 1.7 13413
lwp_cond_signal Sys 0.008531 1.903% 3.7 2327
lwp_cond_wait_sys Sys 0.006161 1.375% 19.9 310
sched_yield Sys 0.004528 1.010% 8.9 507
- Skip remainder, SPU of 0.0001 is below '-m 0.0010' Sec

16

•This shows the amount of CPU time spent executing each system call type.

•One trick to quickly reading Prospect output is to read the “seconds” or the
“samples” columns, and not to be distracted by the percentages quoted. Here
we see a total of 0.44 seconds executing in system calls; comparing this to
the elapsed time of 30 seconds from the previous section, we might conclude
that time spent executing system calls is not a significant factor in the
performance of this application.

16
How does the kernel spend its time?
kgmon and kprof
# kgmon -b
<wait for a bit>
# kgmon -hp
# kprof > kp.out

%time cumsecs seconds name


16.0 67.04 67.04 psema_spin_1
13.5 123.74 56.70 psema_spin_n
11.2 170.82 47.08 idle
9.6 211.19 40.37 enable_hw_tlb_miss_ll
6.6 239.07 27.88 wait_for_lock_spinner
3.8 255.04 15.97 ulbcopy
3.0 267.63 12.59 spinlock
2.8 279.30 11.67 asm_spinlock_usav
2.4 289.27 9.97 syscall
1.7 296.27 7.00 syscallinit
1.4 302.13 5.86 ki_accum_push_TOS
1.2 307.36 5.23 ki_data
1.1 311.84 4.48 psema
17

•Kernel profiling through kgmon is “the definitive” kernel profile – the profiling
tool used and supported by the kernel performance team. However it’s
important to understand that kgmon tells us how the kernel as a whole is
spending its time, not how one particular thread or process spends its time in
kernel. As such, a kgmon profile represents the activity of all processes on the
system. This can actually be a good thing as it may give a different, broader
perspective on the problem.

17
How to approach high %sys
• Identify the costly system calls
• Identify the hot kernel functions
• Check for spinlock contention
• Use syscall, function & lock names as keywords to search
for solutions:
• HP Knowledge Mine (“kmine”) support database:
• https://i3107kmi.atl.hp.com/kmine/
• Chart, the labs defect tracking system:
• http://chart.hp.com/
• Odin, the central repository for HP-UX development documents:
• http://odin.rsn.hp.com
• Apply for a jazz account (source code access):
• http://integration.cup.hp.com/uxdev/newusers/

18

18
OK, so we have a CPU bottleneck –
now what?
• Identify where we are consuming most CPU
• Which module? (library or application binary)
• tells us who owns the offending code
• Which function?
• focuses the developers mind on a particular algorithm
• Which line of source? Which instruction?
• allows us to begin to identify microarchitecture-type issues e.g.
TLB misses, cache misses

• Identify how we are consuming most CPU


• Genuine efficient execution, or…
• Inefficient execution
• excessive NOPS, cache misses, TLB misses etc.

19

19
How are CPU cycles consumed?
The Itanium2 pipeline
IPG – generate instruction pointers, begin L1 I-cache and I-TLB accesses
ROT – format instruction stream, load into instruction buffer (IB)
EXP – expand instruction templates, disperse instructions
REN – register renaming
REG – deliver data to functional units from registers
EXE – execution phase
DET – detect exceptions and branch mispredictions
WRB – write back results to registers

Instruction flow: IPG ROT EXP REN REG EXE DET WRB
20

•Itanium2 is an 8-stage pipeline.

•The first 2 stages are Front End processing that stage the instructions. Stalls
here are often due to ITLB or instruction cache misses.

•The last 6 stages are Back End processing, and stalls here are often due to
DTLB or data cache misses.

20
How are CPU cycles consumed?
Bubble analysis
[Diagram: instructions 1–8 flow through the pipeline via the instruction buffer
(IB); when a stall occurs, a bubble (B) is inserted in place of the stalled
instruction, and a further bubble is inserted for each additional stalled cycle.]
21

In this case the stall was in the EXE stage, so the bubble was inserted at that
point. This indicates that a cycle passed with no instruction being retired. In the
second interval another bubble was inserted because EXE was still stalled.
Bubbles allow Caliper to report where the stalls originate.

21
Why does the CPU stall?
Caliper cpu_metrics --metrics=stall
• Shows impact of various stall types on
application performance
# caliper cpu_metrics -o cpu.stall --metrics=stall --att=16024 --dur=30

STALL Summary Statistics


---------------------------------------------------------------------------------------------------------------------------
------FE Components------ -----------------------------BE Components-------------------------
Stats RAW ----Iaccess---- Unstall BE Score ----------Daccess--------- RSE
CPI Itlb Icache Branch Execute Flush Board L1Dtlb L2Dtlb Dcache Active
---------------------------------------------------------------------------------------------------------------------------
MEAN 3.329 0.16% 1.73% 0.53% 10.72% 0.78% 0.03% 0.28% 0.00% 85.13% 0.65%
STDEV 0.413 0.01% 0.16% 0.03% 1.45% 0.13% 0.00% 0.02% 0.00% 1.43% 0.11%
90%CI 0.082 0.00% 0.03% 0.01% 0.29% 0.03% 0.00% 0.00% 0.00% 0.29% 0.02%
MINIMUM 2.585 0.14% 1.43% 0.46% 8.26% 0.55% 0.02% 0.23% 0.00% 82.22% 0.48%
LOW90 3.246 0.16% 1.69% 0.52% 10.43% 0.75% 0.03% 0.27% 0.00% 84.84% 0.63%
HIGH90 3.411 0.16% 1.76% 0.53% 11.01% 0.81% 0.03% 0.28% 0.00% 85.41% 0.68%
MAXIMUM 4.072 0.18% 2.04% 0.60% 13.62% 1.06% 0.04% 0.33% 0.00% 87.76% 0.83%
---------------------------------------------------------------------------------------------------------------------------

22

•Some commercial applications (like Oracle) will have high Dcache miss rates
because a lot of the processing is on large amounts of memory with poor
localization.

•Scientific applications can often be designed to get very low CPI – under 4.

22
What % of cycles do useful work?
Caliper cpu_metrics --metrics=cpi
• Cycles Per Instruction (CPI)
• Average number of cycles elapsed per completed
instruction
• Small numbers are better
• Theoretical best value of 0.2, typical 0.8 – 4.0

# caliper cpu_metrics -o cpu.cpi --metrics=cpi --att=16024 --dur=30

CPI Summary Statistics


---------------------------------------------------------------------------------------------------------------------------
Stats Cycles IA64 Instr Nops Pred-Off %Useful %Nops %Pred ------CPI----- -----MIPS-----
(raw) (eff) (raw) (eff)
---------------------------------------------------------------------------------------------------------------------------
MEAN 1500000 446441 140147 6675 67.96% 30.47% 1.57% 3.469 5.066 446.4 299.6
STDEV 0 76250 45624 827 5.08% 5.56% 0.48% 0.642 0.559 76.2 31.9
90%CI 0 477 285 5 0.03% 0.03% 0.00% 0.004 0.003 0.5 0.2
MINIMUM 1500000 296331 58396 4235 60.27% 18.27% 0.76% 2.390 3.894 296.3 230.2
LOW90 1500000 445964 139861 6670 67.92% 30.43% 1.57% 3.465 5.062 446.0 299.4
HIGH90 1500000 446918 140432 6680 67.99% 30.50% 1.58% 3.473 5.069 446.9 299.8
MAXIMUM 1500000 627598 239676 8708 79.11% 38.95% 2.62% 5.062 6.517 627.6 385.2
---------------------------------------------------------------------------------------------------------------------------

23

•Itanium instructions come in bundles of three. If the compiler cannot organize
the code to make use of all three instruction slots in a bundle, NOPs will be
inserted. Since a NOP achieves no real work, it needs to be factored out to
determine the number of useful instructions executed. This accounts for the
difference between raw and effective CPI.
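
•A quick sketch of the raw-versus-effective arithmetic, using the MEAN row above (note the
report averages per-interval ratios, so its figures differ slightly from this ratio of means):

#include <stdio.h>

int main(void)
{
    double cycles   = 1500000.0;                 /* cycles per sample window  */
    double instr    = 446441.0;                  /* IA64 instructions retired */
    double nops     = 140147.0;
    double pred_off = 6675.0;
    double useful   = instr - nops - pred_off;   /* ~299619 useful instructions */

    printf("raw CPI       = %.2f\n", cycles / instr);   /* ~3.4 */
    printf("effective CPI = %.2f\n", cycles / useful);  /* ~5.0 */
    return 0;
}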

•If everything required (instructions and data) is in L1 cache, CPI might be very
low. However if required data is in L3 or main memory, the CPU might stall for
long periods waiting for this data. This will significantly increase CPI.

•An interesting case is when data is frequently shared between CPUs; in this
case cache-to-cache copies must occur. These are particularly slow on cell-
based systems, and can lead to high CPI.
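
•A hedged illustration of that sharing cost: if two threads repeatedly update counters
that live in the same cache line, every store forces the line to migrate between CPUs
(“false sharing”). The 64-byte line size and padding below are illustrative:

#include <pthread.h>
#include <stdio.h>

/* Pad each counter onto its own 64-byte line; remove the pad and both
   counters share one line, forcing cache-to-cache transfers. */
struct { volatile long n; char pad[56]; } ctr[2];

static void *bump(void *arg)
{
    long i, id = (long)arg;
    for (i = 0; i < 100000000L; i++)
        ctr[id].n++;              /* each thread writes only its own counter */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    long id;

    for (id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, bump, (void *)id);
    for (id = 0; id < 2; id++)
        pthread_join(t[id], NULL);

    printf("%ld %ld\n", ctr[0].n, ctr[1].n);
    return 0;
}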

23
Quick question
• Is a CPI of 3.0 good or bad?

24

24
The translation lookaside buffer
• Executable text contains virtual addresses
• Virtual-to-physical translation information is itself stored in main memory
• It would be horribly inefficient to have to look up every virtual address
through the structures in main memory!

25

25
The translation lookaside buffer
• A virtual address consists of a page number and an offset
• The TLB caches recently-used translations

26

26
Itanium 2 TLB implementation
[Diagram: a virtual address is looked up first in the level 1 TLBs – the
32-entry L1ITLB (instruction) and 32-entry L1DTLB (data), 2 cycles, 4k pages
only. On a miss, the 128-entry L2ITLB and L2DTLB are searched (4 cycles; large
pages are held only in the level 2 TLBs). A remaining miss is handled by the
hardware page walker through the VHPT (>25 cycles when the VHPT entry is in L2
cache, >31 cycles from L3 cache or beyond), and finally by an OS trap to the OS
page tables in main memory – up to 2000 cycles.]
27

•Level 1 TLBs are only used for 4k pages.

27
Where are the TLB misses
originating?
# caliper dtlb_miss -o chase.dtlb ./ctlb

Event Name U..K TH Count


-----------------------------------------------
DATA_REFERENCES x___ 0 8599487
L2DTLB_MISSES x___ 0 2099510
DTLB_INSERTS_HPW x___ 0 2067721
-----------------------------------------------
Total L1 data TLB references:
8599487 = DATA_REFERENCES

Percentage of data references covered by L1 and L2 DTLB:


75.59 % = 100 * (1 - L2DTLB_MISSES / DATA_REFERENCES)

Percentage of data references covered by the HPW:


24.04 % = 100 * (DTLB_INSERTS_HPW / DATA_REFERENCES)

Percentage of data references covered by software trap:


0.37 % = 100 * ((L2DTLB_MISSES - DTLB_INSERTS_HPW) / DATA_REFERENCES)

Percentage of L2 DTLB misses covered by the HPW:


98.49 % = 100 * (DTLB_INSERTS_HPW / L2DTLB_MISSES)
----------------------------------------------- 28

•For the instruction TLB, use the itlb_miss measurement.

•This report breaks down the number of translations according to the level in
the hierarchy at which they are completed.

28
Where are the TLB misses
originating?
Function Details
--------------------------------------------------------------------------------------
% Total % % %
Sampled … DTLB DTLB DTLB Line|
DTLB … L2 HPW Soft Slot| >Statement|
Misses … Fill Fill Fill Col,Offset Instruction
--------------------------------------------------------------------------------------
100.00 [ctlb::ptrchase_loop, 0x40017f0, ctlb.c]
79.6 20.4 0.0 109 Function Totals
-----------------------------------------------------------------------------
[/home/col/chase/ctlb.c]
(79.6) (20.4) (0.0) 110 > while (p = (unsigned char **) *p) ;
3,0x0020:0 M cmp4.eq p6,p0=r0,r33
:1 M nop.m 0
:2 B_(p6) br.dptk {self}+0x50 ;;
0.0 0.0 0.0 37,0x0030:0 M_ addp4 r8=0,r33 ;;
79.6 20.4 0.0 :1 M ld4 r33=[r8]
0.0 0.0 0.0 :2 I_ nop.i 0 ;;
37,0x0040:0 M cmp4.eq p0,p6=r0,r33
:1 M nop.m 0
:2 B_(p6) br.dptk {self}+0x30 ;;
--------------------------------------------------------------------------------------

29

Here we see the exact location in the program generating TLB misses. In this
case, 100% of sampled misses occurred in the function ptrchase_loop(), which
is defined in the file ctlb.c. Within this function, at line 110 of the
file, we see a while loop which accounts for all of the TLB misses. Within the
three bundles that implement this source line, we can see that all the TLB
misses originate in the second bundle, due to the ld4 instruction.
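
The benchmark source is not reproduced in these notes, but a minimal pointer-chasing
kernel of the same shape looks like this (sizes and stride are illustrative, assuming
8-byte pointers; striding by one 4k page per hop makes nearly every dependent load
miss the TLB):

#include <stdio.h>
#include <stdlib.h>

#define NPTRS  (4u * 1024 * 1024)   /* 32MB of pointers spans many pages   */
#define STRIDE 512                  /* 512 * 8 bytes = one 4k page per hop */

static void **ptrchase_loop(void **p)
{
    while ((p = (void **)*p))       /* serialized dependent loads */
        ;
    return p;
}

int main(void)
{
    void **chain = malloc(NPTRS * sizeof *chain);
    void **sink = NULL;
    unsigned i, rep;

    if (chain == NULL)
        return 1;
    for (i = 0; i + STRIDE < NPTRS; i += STRIDE)
        chain[i] = &chain[i + STRIDE];  /* each hop lands on a new page */
    chain[i] = NULL;                    /* terminate the chain */

    for (rep = 0; rep < 1000; rep++)
        sink = ptrchase_loop(chain);
    printf("%p\n", (void *)sink);       /* keep the loads live */

    free(chain);
    return 0;
}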

29
Improving TLB performance
• Cluster code and data that is used together
• Place functions that are used together next to each
other
• Place data that is used together next to each other
• (Occupy as small a number of pages as possible)
• Use larger pages. Supported sizes:
• 4k – 256M
• Merge the data segments of shared libraries, so
they can be placed in larger pages:
chatr +mergeseg enable <binary>
30

30
How to use large pages
• Link time:
-Wl,+pd <size> -Wl,+pi <size>
• chatr +pd <size> +pi <size> <binary>
• Kernel tuning
• vps_pagesize
• vps_ceiling
• vps_chatr_ceiling
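
A hedged usage sketch (the 1M and 64K sizes are purely illustrative; on HP-UX
11i v3 the tunables are set with kctune, while older releases use kmtune or SAM):
# chatr +pd 1M +pi 64K ./myapp
# kctune vps_ceiling=1024        (units are kilobytes, so 1024 = 1M)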

31

•vps_pagesize is the minimum page size the kernel should use if the user has
not chatr'd a specific value. vps_pagesize is specified in units of kilobytes and
should equate to one of the supported values. In the event vps_pagesize does
not correspond to a supported page size, the closest page size smaller than
the user's specification is used. For example, specifying 20K would result in the
kernel using 16K. vps_pagesize is essentially a boot-time configurable page
size for all user objects created. The actual effectiveness of that page size on
the system is unknown, since the actual page size used also depends on virtual
alignment: even though vps_pagesize is configured to 16K, if the virtual
alignment is not suitable for a 16K page then 4K pages are used instead. The
current default value for vps_pagesize is 4 kilobytes (vps_pagesize = 4).

•vps_ceiling is the maximum page size the kernel uses when selecting page
size "transparently". vps_ceiling is specified in units of kilobytes. Like
vps_pagesize, vps_ceiling should be a valid page size; if not, the value is
rounded down to the closest valid page size. vps_ceiling places a limit on the
page size used for process data/stack and the size of pages the kernel selects
transparently for non-dynamic objects (text, shared memory, etc). The default
value for vps_ceiling is 16K (vps_ceiling = 16).

31
•vps_chatr_ceiling is the largest value a user is able to chatr() a program or library
to. The command itself is not limited, but the kernel checks the chatr'd value
against this maximum, and only values at or below vps_chatr_ceiling are
actually used. In the event the chatr() value exceeds vps_chatr_ceiling, the
actual value used is the value of vps_chatr_ceiling. Like the others,
vps_chatr_ceiling is specified in units of kilobytes and will be rounded down to
the closest page size if an invalid size is specified. chatr() does not require
any sort of user privilege and can therefore be used by any user. This tunable
allows the system administrator to control the use of large pages if there are
"bad citizens" who abuse the facility. The default value for vps_chatr_ceiling is
64 megabytes (vps_chatr_ceiling = 65536).

32
The downside of large pages
• Increased memory consumption
• Page faults take longer
• Memory locality for multithreaded applications

33

•Another downside to large pages on ccNUMA systems is that a large page
size can impact locality. When a new page is touched by a thread, the VM
system will attempt to allocate a page from local memory of the size specified
by the current hint. If the hint is large, then a large chunk of memory that may
ultimately include data used by other threads will be allocated. For instance, if
you have a 1GB array or other data structure that is actually accessed in
chunks by different threads, the whole array might wind up getting allocated as
a single page in the locality of the first thread that touches it. If you use smaller
pages, then you are more likely to get pages that are accessed entirely by a
single thread. When you have pages as large as 4GB you will often find that
entire data structures wind up in a single page.

•This only applies on systems with cell-local memory, and applications that
have shared memory (either heap shared by threads or shared memory
regions shared by processes and allocated IPC_MEM_FIRST_TOUCH), and
where each thread/process initializes a chunk of shared memory and is the
predominant user of that chunk. A good rule of thumb is to try to make the
page size no larger than 1/10 of the size of the chunks the threads work on
(that minimizes edge effects where a page spans the boundary between two
chunks). The flip side of this is that too small a page may generate TLB
misses.

33
•A good example of this is the stream benchmark, which has three
large arrays, copied in parallel by multiple threads. On a 64-way
machine, each thread copies a chunk that is 1/64 of each array. If
each array is 640MB, then each thread will operate on a 10MB
chunk, and a page size of 1MB will give you good locality without
generating too many TLB misses.

34
The memory hierarchy
(capacity and load-to-use latency per level)

registers          128 words    CPU speed
L1 cache           32KB         1 cycle
L2 cache           256KB        5-11 cycles
L3 cache           6 MB         12-18 cycles
local cell memory  64 GB        322 cycles
within x-bar       256 GB       472 cycles
across x-bar       768 GB       548 cycles
disk               multiple TB  7,500,000 cycles

35

Keep the most frequently accessed data as close as possible to the thread
using it.

35
Where are the cache misses
originating?
# caliper dcache_miss -o chase.dmiss ./chase L1
...
Sampling Specification
Sampling event: DATA_EAR_EVENTS
Sampling rate: 10000 events
Sampling rate variation: 500 (5.00% of sampling rate)
Sampling counter privilege: user_priv_level (user-space sampling)
Data granularity: 16 bytes
Number of samples: 2654
Data sampled: Data cache miss
Data Cache Metrics Summed for Entire Run
-----------------------------------------------
Counter Priv. Mask Count
-----------------------------------------------
L1D_READS 8 (USER) 26985136325
L1D_READ_MISSES_ALL 8 (USER) 212283019
DATA_REFERENCES 8 (USER) 26985171755
-----------------------------------------------
L1 data cache miss percentage:
0.79 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)

Percent of data references accessing L1 data cache:


100.00 = 100 * (L1D_READS / DATA_REFERENCES)
-----------------------------------------------
Load Module Summary
------------------------------------------------------------------
 % Total Cumulat          Avg.
 Latency  % of   Sampled  Latency Laten.
 Cycles   Total  Misses   Cycles  Cycles Load Module
------------------------------------------------------------------
  100.00  100.00 2654     13594   5.1    chase
------------------------------------------------------------------
  100.00  100.00 2654     13594   5.1    Total
------------------------------------------------------------------

36

36
Not such a happy L1 data cache
----------------------------------------------
Counter Priv. Mask Count
----------------------------------------------
L1D_READS 8 (USER) 9224811357
L1D_READ_MISSES_ALL 8 (USER) 9224670482
DATA_REFERENCES 8 (USER) 9224848339
----------------------------------------------
L1 data cache miss percentage:
100.00 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)

Percent of data references accessing L1 data cache:


100.00 = 100 * (L1D_READS / DATA_REFERENCES)
----------------------------------------------

Load Module Summary


------------------------------------------------------------------
% Total Cumulat Avg.
Latency % of Sampled Latency Laten.
Cycles Total Misses Cycles Cycles Load Module
------------------------------------------------------------------
100.00 100.00 115390 576990 5.0 chase
------------------------------------------------------------------
100.00 100.00 115390 576990 5.0 Total
------------------------------------------------------------------

Function Summary
--------------------------------------------------------------------------------------------
% Total Cumulat Avg.
Latency % of Sampled Latency Laten.
Cycles Total Misses Cycles Cycles Function File
--------------------------------------------------------------------------------------------
99.91 99.91 115290 576490 5.0 chase::ptrchase_loop chase.c
0.09 100.00 100 500 5.0 chase::main chase.c
--------------------------------------------------------------------------------------------

37

Average Latency Cycles is the average number of cycles it takes to resolve
the data cache miss.

37
But which source code line?
Function Details
-----------------------------------------------------------------------
% Total Avg. Line|
Latency Sampled Latency Laten. Slot| >Statement|
Cycles Misses Cycles Cycles Col,Offset Instruction
-----------------------------------------------------------------------
99.91 [chase::ptrchase_loop, 0x40000000000019f0, chase.c]
115290 576490 5.0 ~109 Function Totals
--------------------------------------------------------------
[/tmp/col/chase.c]
(105) (525) (5.0) ~109 >{
0 0 ~1,0x0000:0 M alloc r31=ar.pfs,0,2,0,0
105 525 5.0 :1 M ld8 r8=[r32]
0 0 :2 I_ mov r33=pr ;;
(115185) (575965) (5.0) ~110 > while (p = (unsigned char **) *p) ;
~3,0x0010:0 M cmp.eq.unc p6,p0=r0,r8
:1 M nop.m 0
:2 B_(p6) br.dpnt.many {self}+0x50 ;;
~3,0x0020:0 M cmp.ne.or.andcm p17,p16=42,r0
:1 I mov.i ar.ec=1
:2 I_ nop.i 0 ;;
115185 575965 5.0 ~3,0x0030:0 M_(p17) ld8 r8=[r8] ;;
0 0 :1 M (p17) cmp.eq p0,p16=0,r8
0 0 :2 I nop.i 0
~3,0x0040:0 M nop.m 0
:1 M nop.m 0
:2 B_(p16) br.wtop.dptk.many {self}+0x30 ;;
-----------------------------------------------------------------------

38

38
An even less happy data cache
Data Cache Metrics Summed for Entire Run
---------------------------------------------
Counter Priv. Mask Count
---------------------------------------------
L1D_READS 8 (USER) 524471503
L1D_READ_MISSES_ALL 8 (USER) 524330677
DATA_REFERENCES 8 (USER) 524638283
---------------------------------------------
L1 data cache miss percentage:
99.97 = 100 * (L1D_READ_MISSES_ALL / L1D_READS)

Percent of data references accessing L1 data cache:


99.97 = 100 * (L1D_READS / DATA_REFERENCES)
---------------------------------------------

Load Module Summary


------------------------------------------------------------------
% Total Cumulat Avg.
Latency % of Sampled Latency Laten.
Cycles Total Misses Cycles Cycles Load Module
------------------------------------------------------------------
100.00 100.00 5852 760180 129.9 chase
------------------------------------------------------------------
100.00 100.00 5852 760180 129.9 Total
------------------------------------------------------------------

39

39
High cache miss rate –
What can the programmer do?
• Identify the source of the problem with Caliper
• Cluster frequently read data in as few cache lines as
possible
• gets most useful data into small caches
• reduces probability of cache thrashing
• (“machinfo” will show cache line size)
• CAVEAT: Keep frequently updated data in different cache lines, if
updates occur on different CPUs
• Be wary of “pointer chasing” code
• Long linked lists – use skip lists/hash pools/whatever
• Sparse structures holding pointers – keep pointers together
outside larger structures
• Don’t mix integer and float in a cache line

40

•Data items are generally placed into memory in the order in which they are
declared in the program source. This allows the programmer control over what
data is near what other data.

•Try to cluster sequentially accessed data in as few cache lines as possible. By
packing those cache lines with data that will be accessed sequentially, we not
only get as much useful data into cache as possible, but also minimize the
probability that the line will be flushed before later data items are needed. We
also minimize the probability of cache thrashing – where two pieces of critical
data, accessed repeatedly, contend for the same line. In general, when data
isn't accessed sequentially, try to order data to minimize the stride.

•Since floating point data is not held by the L1 cache, storing a float in a cache
line causes the entire line to be flushed from L1 cache – impairing the speed of
access to the non-floating-point data items. Mixing floats and integers in a single
line is therefore not wise.
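
•A hedged sketch of this layout advice in C (the structure and field names are
invented for illustration, 64 bytes is a common line size – check machinfo – and
the aligned attribute is GCC-style; HP compilers offer equivalent controls):

#define CACHE_LINE 64

struct conn {
    /* Hot, read-mostly fields clustered into one line, with the
       chased pointer kept beside them: */
    int          state;
    int          flags;
    struct conn *next;

    char pad[CACHE_LINE - 2 * sizeof(int) - sizeof(struct conn *)];

    /* Frequently updated, possibly by other CPUs: isolated in its own
       line so stores here don't evict the read-mostly line above. */
    volatile long bytes_seen;
} __attribute__((aligned(CACHE_LINE)));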

40
Reducing memory latency with
locality domains
• HP cell-based systems are ccNUMA
• all memory locations do not have the same access latency!
• Superdome example:
• Latency within cell (A-A): ~215 ns
• Latency within crossbar (A-B): ~315 ns
• Latency across crossbars (A-C): ~365 ns

41

Using locality domains, systems can be tuned to keep the maximum number of
memory accesses in the local cell.

41
Interleaved memory
• Blocks of physical memory from cells are organized round-robin to form the
physical address space of the system
• Interleaving is performed at boot by firmware
• Stripe size is a cache line
• Memory access is uniformly mediocre
[Diagram: physical memory in cells interleaved, stripe by stripe, into the
physical address space of the system]

42

42
Cell-local memory
• Blocks of memory that come from a given cell appear at a range of addresses
• Cell-local memory is not interleaved with memory from other cells
• Access times to these addresses will be optimal for the processors in the
same cell
[Diagram: each cell's local memory mapped to its own range of the physical
address space of the system]

43

43
A typical configuration
• Typically some percentage of memory will be interleaved, and some
configured as cell-local
• Some address ranges correspond to cell-local memory, and others to
interleaved memory
• Amount of cell-local memory is configurable at boot time

44

Some system calls have been enhanced to allow applications to take explicit
control over their use of CLM. For example, shmget() allows specification of
either IPC_MEM_LOCAL for cell-local memory or IPC_MEM_INTERLEAVED
for interleaved memory. If IPC_MEM_LOCAL is specified, the memory allocated
is local to the thread calling shmget(), unless IPC_MEM_FIRST_TOUCH is also
specified, in which case the first thread to access the shared memory determines
the cell on which that memory is allocated.
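
A minimal sketch of those flags in use (assuming a CLM-capable HP-UX release
where <sys/shm.h> defines them; error handling trimmed):

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stddef.h>

int make_local_seg(size_t len)
{
    /* Placed in the cell local to the calling thread: */
    return shmget(IPC_PRIVATE, len, IPC_CREAT | 0600 | IPC_MEM_LOCAL);
}

int make_first_touch_seg(size_t len)
{
    /* Placement deferred: the first thread to touch the memory
       decides which cell it is allocated on: */
    return shmget(IPC_PRIVATE, len,
                  IPC_CREAT | 0600 | IPC_MEM_LOCAL | IPC_MEM_FIRST_TOUCH);
}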

44
Page allocation policies
• Memory that is private to a process is allocated
from same locality as the process
• e.g. stack, data segment, private memory-mapped data
• Memory that is shared between processes is
allocated from interleaved memory
• e.g. kernel memory, program text, shared libraries,
shared memory, shared memory-mapped data
• shmget() and mmap() enhanced to allow
interleave or CLM to be requested
• chatr +id requests interleave memory for
process data segment
45

For a process with lots of threads, data segments should be interleaved, since
there is a greater chance that some of the threads will end up in a different
locality domain. With only a few threads, using cell-local memory may be better.

45
Default process and thread
launch policies
• Defines where processes/threads initially start to
execute
• Processes are placed in the locality with the
fewest number of threads currently queued
waiting to run (“least loaded”)
• Threads are placed in same locality as creator
until there are more threads than CPUs; then spill
to another locality (“fill first”)
• No binding is established; free to migrate

46

46
Controlling locality with mpsched(1)
•Display localities and their CPUs:
# mpsched -s
•Bind process 1234 to CPU 3:
# mpsched -c 3 -p 1234
•Execute the command newtm in LDOM 0:
# mpsched -l 0 ./newtm
•Execute all of newtm's threads in the same
LDOM:
# mpsched -T PACKED ./newtm

47

47
How to monitor CLM usage
locinfo
• The source is in the pstat_getlocality man page!!
• Compile with +DD64 flag
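
For example (assuming the man page example has been saved as locinfo.c):
$ cc +DD64 -o locinfo locinfo.c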

# ./locinfo
--- System wide locality info: ---
index ldom physid type total free used
0 0 0 CLM 0 0 0
1 1 2 CLM 8175M 7939M 236M
2 -1 -1 ILV 23G 18G 5312M
----- ----- -----
31G 26G 5549M

48

48
How to monitor CLM usage
Caliper --memory-usage=exit
• Add “--memory-usage=exit” to any caliper invocation
# caliper fprof -o mem.cal --memory-usage=exit --att=12345 --dur=10

System Memory Configuration


------------------------------------------------------------------------
Domain Physical # Used Free Total
Id Id Type CPUs Pages Pages Pages
------------------------------------------------------------------------
0 0 CLM 4 111801 3025735 3137536
1 1 CLM 4 95210 3042326 3137536
2 2 CLM 4 80119 3057413 3137532
3 3 CLM 3 80530 3057006 3137536
-1 -1 ILV 0 1704436 2489867 4194303
------------------------------------------------------------------------
Total 2072096 14672347 16744443
------------------------------------------------------------------------

Process Memory Usage


-------------------------------------------------------------------------------
Domain Shared Private Weighted Total
Time Id Pages Pages Pages Pages
-------------------------------------------------------------------------------
End 0 0 1 1 1
1 0 258 258 258
2 0 4 4 4
3 0 3 3 3
-1 676 160 186 836
Total 676 426 452 1102
-------------------------------------------------------------------------------
49

The weighted pages field attempts to provide a measure of the total memory
cost of this process by adding the private page count to the shared page count
divided by the number of sharing processes.
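
Applying that to the interleaved (-1) row above: 160 private pages plus a
26-page share of the 676 shared pages gives the 186 weighted pages reported,
implying the shared pages are shared by roughly 676 / 26 ≈ 26 processes during
this run.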

49
Does cell-local memory really help?
Interleave memory
hpcpc17: /home/col> chatr +id enable memm2
...
hpcpc17: /home/col> ./memm2
memmove() test
data moved per cycle: 671088640
move unit: 64 to 65536 bytes
memory footprint: 16 to 1024 pages
output in MB/s
source footprint (pages)
move size 16 32 64 128 256 512 1024
64 2468 2395 1997 1958 1944 1921 227
128 3578 3364 2531 2433 2426 2552 230
256 4614 4114 2336 2172 2163 2482 230
512 7519 6489 4883 4729 4653 4719 332
1024 10051 8494 6144 5953 5803 5810 405
2048 12088 9919 7078 6863 6685 6618 447
4096 13444 10836 7652 7420 7215 7097 465
8192 14240 11361 7971 7731 7539 7376 475
16384 14667 11523 8127 7879 7702 7529 481
32768 14896 11782 8216 7968 7784 7604 484
65536 15042 11849 8261 8003 7831 7645 486
hpcpc17: /home/col>
50

•The memm2 program is a simple program that tests the speed of
memmove(). It runs many tests for a range of memory footprints and a
number of move sizes. For small move sizes, many more memmove()
calls are needed to move the required amount of data, so the overhead of
the calls to memmove() will be very high. For smaller footprints performance will
be much higher because the copies can be carried out in cache.

•The output of interest in the above test is that relating to a large (1024 page)
memory footprint, which cannot be performed in cache.

50
Does cell-local memory really help?
Cell-local memory
hpcpc17: /home/col> chatr +id disable memm2
...
hpcpc17: /home/col> ./memm2
memmove() test
data moved per cycle: 671088640
move unit: 64 to 65536 bytes
memory footprint: 16 to 1024 pages
output in MB/s
source footprint (pages)
move size 16 32 64 128 256 512 1024
64 2470 2402 1989 1957 1951 1935 373
128 3577 3421 2583 2484 2429 2457 377
256 4619 4236 2473 2253 2187 2222 375
512 7523 6926 4935 4781 4711 4554 550
1024 10064 9068 6192 6013 5911 5650 673
2048 12094 10678 7115 6918 6806 6474 730
4096 13468 11614 7682 7480 7353 7030 757
8192 14251 12316 7996 7785 7660 7347 773
16384 14694 12607 8143 7934 7813 7526 782
32768 14937 12800 8234 7975 7895 7610 787
65536 15044 12898 8285 8067 7942 7659 787
hpcpc17: /home/col>
51

51
What can a sysadmin do to
reduce memory latencies?
• Ensure enough CLM is configured on the system
• Consider whether interleave or cell-local memory
is best for process data segment
• chatr +id enable <binary>
• chatr +id disable <binary>
• Keep dissimilar processes apart to reduce cache
pollution
• mpsched
• PSETs
• Place cooperating processes close to each other

52


52
Snoopy cache coherence

Each CPU's cache “snoops” the bus, listening for transactions that
invalidate a locally cached line.
[Diagram: several CPUs, each with its own cache, sharing a single bus]

53

Snoopy cache coherence is suitable when all caches share a single bus.
However on a cell-based architecture bus traffic is segregated, with only the
necessary transactions traversing the crossbars. As a result snoopy cache
coherence is inappropriate in cell-based systems.

53
Directory cache coherence
[Diagram: groups of CPUs and caches, each group with its own directory,
connected to the other groups through the crossbar]

54

Directory-based cache coherence offers better scalability, but increased
latency over a snoopy-based system. The length of this latency depends upon
the number of cells in the hardware partition, since operations need to be
coordinated with these other cells. For this reason the performance of a cell-
based system may be surprisingly lower than a bus-based system for certain
workload types.

54
Load-to-use latencies for
sx1000-based systems
sx1000
Madison

Memory Latency (load-to-use):


4-core interleaved or Cell Local 241 nsec
8-core interleaved 292 nsec
16-core interleaved 370 nsec
32-core interleaved 417 nsec
64-core interleaved 440 nsec

Maximum Cache-to-Cache Latency (load-to-use):


4-core 383 nsec
8-core 480 nsec
16-core 715 nsec
32-core 809 nsec
64-core 809 nsec

Bandwidth (Sustainable throughput):


CPU Buses / Cell (total of 2 CPU buses) 10.4 GB/s
Memory / Cell (total of 4 memory subsystems) 7.5 GB/s
Crossbar / Cell (total of 3 crossbar links) 6.4 GB/s
I/O / Cell (ideal mix) 1.8 GB/s
PCI-X I/O card slot 1066 MB/s Peak

55

55
Load-to-use implications
• Small bus-based systems can be much faster for
workloads with low concurrency
• Cell-based systems can offer much better
throughput because of improved scaling
• Workloads that frequently share data between
CPUs can perform poorly on cell-based systems
• A 4-CPU partition of a cell-based system is NOT
equivalent to a 4-CPU bus-based system

56

56
Further learning
• Introduction to Microarchitectural Optimization for Itanium 2 Processors:
http://www.intel.com/software/products/vtune/techtopic/software_optimization.pdf

• Paul Drongowski’s “Tech Notes” on using Caliper to tune performance on HP-UX/Itanium:
http://infoeng.cup.hp.com/twiki/bin/view/Caliper/CaliperTips

• Georges Aureau IA64 400 Level Training:
http://cso.fc.hp.com/ssil/uxsk/hpux/products/ia64/

• Compiler Technical Overview:
http://devresource.hp.com/drc/STK/docs/refs/CompilersTechOverview.pdf

• HP Itanium-based Compilers: Technical Overview:
http://www.zk3.dec.com/~pjd/ipf/compiler_white_paper.pdf

• Compiling for IA64, Carol Thompson:
http://www.zk3.dec.com/~pjd/ipf/vail_IA64_compiling.pdf

• Itanium 2 Processor Reference Manual:
http://www.intel.com/design/itanium2/manuals/251110.htm

• Learn more about ccNUMA:
http://estl.cup.hp.com/vm_training_curricul.htm

57

57
