IBM Power Systems Technical University featuring IBM AIX and Linux

September 8-12, 2008, Chicago, IL

AIX 6.1 Performance Differences

Session ID: pAI09
Speaker: Steve Nasypany

- VMM Page Replacement
  - New defaults reducing the requirement for basic performance tuning
- VMM File IO Pacing
  - Enabled by default
- Performance Tunables
  - Tunables are categorized into restricted and non-restricted tunables
- AIO
  - Dynamic AIO tuning
  - AIO Fast Path for CIO
- JFS2
  - Read-only access to files opened with CIO
- NFS
  - Changes to TCP scaling window, R/W size and number of biod daemons
- Enhanced JFS no-log option
- MPSS support

Review AIX Page replacement algorithm

- When page replacement begins to run, it selects a page type to steal based on:
  - If the amount of file pages is above maxclient/maxperm, file pages are chosen
  - If the number of file pages is between minperm and maxclient, the type is chosen based on re-paging history
  - If the amount of file pages is below minperm, working storage and file pages are chosen without checking the re-paging history
- Re-paging history indicates whether individual pages have been written to disk and read back recently
- Re-paging history adds a degree of uncertainty to the selection process
  - If re-paging history decides to pick working storage pages, system paging may begin
    - This was intended as a safety valve: if we are too aggressive in stealing file pages, stop
    - But sometimes it is triggered by bad luck
  - If re-paging history decides to pick file pages and many file pages are dirty, heavy writes to disk can occur
    - This would probably happen eventually anyway due to sync

How much memory is caching files

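[Figure: contents of system memory, from 0% to 100% used for caching files, with Minperm = 20% and Maxclient = Maxperm = 80%]
- Above Maxclient/Maxperm (80% to 100%): pick file pages
- Between Minperm and Maxclient (20% to 80%): pick file pages or w/s pages based on recent history
- Below Minperm (0% to 20%): pick any pages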
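
The selection rule reviewed above can be sketched as a small function. This is only an illustrative model, not the VMM implementation: the function name, the two re-page rate parameters, and the tie-breaking rule used between minperm and maxperm (steal working storage only when file pages are being re-paged more often) are assumptions made for the example.

```python
def choose_steal_target(numperm_pct, minperm_pct=20.0, maxperm_pct=80.0,
                        file_repage_rate=0.0, ws_repage_rate=0.0):
    """Return 'file', 'working', or 'any' for a given file-cache percentage.

    numperm_pct is the share of memory currently holding file pages.
    The re-page rates are hypothetical inputs standing in for the
    re-paging history mentioned on the slide.
    """
    if numperm_pct > maxperm_pct:
        # Too much memory is caching files: steal file pages only.
        return "file"
    if numperm_pct < minperm_pct:
        # Very little file cache left: steal any page, history ignored.
        return "any"
    # In between, history decides (assumed rule, see lead-in).
    if file_repage_rate > ws_repage_rate:
        return "working"
    return "file"


if __name__ == "__main__":
    for pct in (90, 50, 10):
        print(pct, choose_steal_target(pct))
```

With lru_file_repage = 0 (the AIX 6.1 default shown in the next slide), the history branch would not be taken, which is why the middle band stops being a source of working storage steals.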

AIX v5 vs v6 VMM Page Replacement tuning

Tunable             AIX 5.2/5.3   AIX 6.1
minperm%            20            3
maxperm%            80            90
maxclient%          80            90
strict_maxperm      0             0
strict_maxclient    1             1
lru_file_repage     1             0
page_steal_method   0             1

On AIX 6.1, no paging to the paging space will occur unless the system memory is over committed (AVM > 97%)

Legacy page_steal_method=0
- Partition memory is broken up into page pools
  - A page pool is a set of physical pages organized into a list
  - One lrud per memory pool
- Inside each memory pool is a mix of working storage and file pages
- When the free list is depleted, lrud scans its page pool one scan bucket (default 128k pages) at a time
  - The scan can be targeted for working storage pages, file pages, or either
  - If scanning for file pages and the number of file pages is small (e.g. maxclient% = 10), the ratio of scanned pages to freed pages will be high (e.g. 10:1)
- This reduces performance in two ways:
  - CPU time in lrud
  - Fragmentation of memory, which can make I/O coalescing less effective


[Figure: system memory divided into Page Pool 0 and Page Pool 1; each pool holds a single list of pages, which is scanned for either w/s or file pages]

List-based LRU page_steal_method=1

- Partition memory is broken up into page pools
  - A page pool is a set of physical pages
  - There are two lists for a page pool: one of working storage pages and another of file pages
  - One lrud per memory pool
- When the free list is depleted, lrud scans the appropriate list for the type of pages it desires, one scan bucket (128k pages) at a time
  - If scanning for file pages and the number of file pages is small (e.g. maxclient% = 10), the ratio of scanned pages to freed pages should be low (e.g. 2:1 to 1:1)
- This improves performance in two ways:
  - CPU time in lrud is reduced due to less scanning
  - I/O coalescing is better preserved for reading and writing of files larger than memory

[Figure: Page Pool 0 with two lists, a list of w/s pages and a list of file pages; the page scan for w/s walks only the w/s list and the page scan for file walks only the file list]
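
The difference in scanned-to-freed ratio between the two methods can be illustrated with a toy model. This is a sketch under simplifying assumptions, not lrud's algorithm: a pool is just a Python list of page types, every page of the wanted type is treated as stealable (reference bits are ignored), and the function names are invented for the example.

```python
def scan_ratio_legacy(pool, want="file"):
    """Scan one mixed list; only pages of the wanted type are freed."""
    scanned = 0
    freed = 0
    for kind in pool:
        scanned += 1
        if kind == want:
            freed += 1
    return scanned / freed if freed else float("inf")


def scan_ratio_list_based(pool, want="file"):
    """Scan only the per-type list, so every scanned page is a candidate."""
    typed_list = [kind for kind in pool if kind == want]
    scanned = len(typed_list)
    freed = len(typed_list)
    return scanned / freed if freed else float("inf")


if __name__ == "__main__":
    # Hypothetical pool where 10% of the pages are file pages.
    pool = ["file" if i % 10 == 0 else "working" for i in range(1000)]
    print(scan_ratio_legacy(pool))      # 10.0
    print(scan_ratio_list_based(pool))  # 1.0
```

In practice some pages on the typed list are recently referenced and get skipped, which is consistent with the slide quoting a range of 2:1 to 1:1 rather than exactly 1:1.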

VMM File IO Pacing Enabled By Default

IO Pacing Enabled By Default
- Prevents system responsiveness issues due to large quantities of writes
- Limits the maximum number of pages of I/O outstanding to a file
  - Without I/O pacing, a program can fill up large amounts of memory with written pages. Those queued I/Os can result in long waits for other programs using the storage
- Better solution than the file system write-behind techniques

New defaults
- Not very aggressive; intended to keep one or a few programs from impacting system responsiveness
- Values high enough not to impact sequential write performance
  - maxpout = 8193
  - minpout = 4096
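
I/O pacing behaves like a high-water/low-water mark on the number of pages queued to a single file. The class below is a minimal sketch of that idea, assuming a writer is put to sleep once maxpout pages are outstanding and resumed once completions bring the count down to minpout; the class and method names are invented for illustration and do not correspond to any AIX interface.

```python
class PacedFile:
    """Toy model of per-file I/O pacing with maxpout/minpout thresholds."""

    def __init__(self, maxpout=8193, minpout=4096):
        self.maxpout = maxpout
        self.minpout = minpout
        self.outstanding = 0          # pages queued but not yet written
        self.writer_blocked = False   # True while the writer sleeps

    def try_write_page(self):
        """Queue one page; return False if the writer has to wait."""
        if self.writer_blocked or self.outstanding >= self.maxpout:
            self.writer_blocked = True
            return False
        self.outstanding += 1
        return True

    def complete_io(self, pages=1):
        """Record completed writes and wake the writer at the low-water mark."""
        self.outstanding = max(0, self.outstanding - pages)
        if self.writer_blocked and self.outstanding <= self.minpout:
            self.writer_blocked = False


if __name__ == "__main__":
    f = PacedFile(maxpout=5, minpout=2)   # small values for readability
    print([f.try_write_page() for _ in range(6)])
    f.complete_io(3)
    print(f.try_write_page())
```

The gap between the two thresholds is what keeps a resumed writer from being suspended again immediately, and the large AIX 6.1 defaults mean a single sequential writer rarely reaches the high-water mark.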

Performance Tunables
Tunables now in two categories
- Restricted Tunables
  - Should not be changed unless recommended by AIX development or development support
  - Are not shown by tuning commands unless the -F flag is used
  - Dynamic change will show a warning message
  - Permanent change must be confirmed
  - Permanent changes will cause an error log entry at boot time
- Non-Restricted Tunables
  - Can have restricted tunables as dependencies

Changing restricted tunables

Changing a restricted tunable dynamically

  ioo -o aio_sample_rate=6
  Warning: a restricted tunable has been modified

A dynamic change of a restricted tunable will inform the user.

Changing a restricted tunable permanently

  ioo -po aio_sample_rate=6
  Modification to restricted tunable aio_sample_rate, confirmation yes/no

A permanent change of a restricted tunable requires a confirmation from the user.
Note: The system will log changes to restricted tunables in the system error log at boot time.

List restricted tunables

  > ioo -aF
  aio_active = 0
  aio_maxreqs = 65536
  ...
  posix_aio_minservers = 3
  posix_aio_server_inactivity = 300
  ##Restricted tunables
  aio_fastpath = 1
  aio_fsfastpath = 1
  aio_kprocprio = 39
  aio_multitidsusp = 1
  aio_sample_rate = 5
  aio_samples_per_cycle = 6
  j2_maxUsableMaxTransfer = 512
  j2_nBufferPerPagerDevice = 512
  j2_nonFatalCrashesSystem = 0
  j2_syncModifiedMapped = 1
  j2_syncdLogSyncInterval = 1

  LABEL:            TUNE_RESTRICTED
  IDENTIFIER:       D221BD55
  Date/Time:        Thu May 24 15:05:48 2007
  Sequence Number:  637
  Machine Id:       000AB14D4C00
  Node Id:          quake
  Class:            O
  Type:             INFO
  WPAR:             Global
  Resource Name:    perftune


Why you ask?

- The number of tunables in AIX had grown to a ridiculously large number
  - 5.3 TL06: vmo 61, ioo 27, schedo 42, no 135, plus a few others
  - 6.1: vmo 29, ioo 21, schedo 15, no 133, plus a few others
- The potential combinations are too numerous to test and document effectively
- Many of the tunables had been created to deal with very specific customers or situations that don't apply often
- This wasn't done in a vacuum: a survey of support and recent situations was used to identify the commonly used tunables (which remain unrestricted)
- If a restricted tunable must be changed, a PMR should be opened to identify the issue

General trend toward file system I/O with concurrent I/O

- Concurrent I/O (CIO) has been a feature of AIX since AIX 5.2
- Concurrent I/O gives applications that do their own buffering of disk I/O and locking a means of bypassing operating system caching and i-node file locking
  - This improves CPU efficiency of I/O to very near that of raw logical volumes
  - It also improves scalability by eliminating operating system i-node locking in the read/write paths
- Concurrent I/O is not for all applications
  - Some applications require operating system i-node locking to function correctly
  - Other applications do not do sophisticated storage buffering and benefit from caching in the operating system, or from the read-ahead/write-behind mechanisms that the AIX virtual memory management subsystem provides to improve sequential file performance

CIO and Applications

- DB2 Version 9.5 implements CIO as the DEFAULT mechanism for table spaces on AIX
  - NO FILE SYSTEM CACHING/FILE SYSTEM CACHING clauses on CREATE TABLESPACE or ALTER TABLESPACE
  - View caching: DB2 GET SNAPSHOT FOR TABLES ON db
  - DB2 has supported CIO since V8.1
- Oracle 10g/11g have support, but it is not a default
  - Requires filesystemio_options set to SETALL or DIRECTIO
  - CIO is the recommended deployment solution for JFS2; however, some 3rd party tools have issues

CIO and Applications

- If you use legacy VMM tuning (e.g. AIX 5.2/5.3 defaults) and you switch an application from non-CIO to CIO operation, you will likely need to retune
  - The amount and distribution of memory may change quite radically
  - Usually, switching file usage to CIO reduces the memory required, as the operating system will no longer be buffering file pages for those files
  - Upgrading from DB2 9.1 (non-CIO) to DB2 9.5 may require some tuning preparation
- With AIX 6.1 default tuning, it should not be necessary to change tuning when converting from non-CIO to CIO operation

AIX 6.1 AIO Support

- Interface changes
  - All the AIO entries in the ODM and the AIO smit panels have been removed
  - The aioo command will no longer be shipped
  - All the AIO tunables have current, default, minimum and maximum values that can be viewed with ioo
- AIO kernel extension loaded at system boot
  - Applications no longer fail to run because you forgot to load the kernel extension (you may applaud here)
  - No AIO servers are active until requests are present
  - Extremely low impact on memory requirements with this implementation

Improvements to AIO CIO

- AIO Fast Path for CIO enabled by default
  - With the fast path, the AIO server threads no longer participate in the I/O path
  - By removing the AIO servers from the path, we get three things:
    - The removal of AIO servers as any potential resource bottleneck
    - The reduction in path length for AIO read/write services, as less dispatching is required
    - Potentially better coalescing of sequential I/O requests initiated through AIO or LISTIO services

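[Figure: I/O path with and without the fast path. "FS no Fast Path": requests pass through an AIO Server on the way from the application through the file system and LVM to the device driver. "CIO Fast Path": Application, File System, LVM, Device Driver, with no AIO Server in the path.]

Fast Path enabled for LV and PVs for a long time
No change in behavior for environments such as Oracle 10G/ASM on raw hdisks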

General improvements to AIO

- The number of AIO servers varies between minservers and maxservers (times #CPUs), based on workload
  - AIO servers stay active as long as they service requests
  - The number of AIO servers is dynamically increased/reduced based on the demand of the workload
  - aio_server_inactivity defines after how many seconds of idle time an AIO server will exit
  - Do not confuse "no active servers" with "kernel extension not loaded". The kernel extension is always loaded
- Changes to AIO tunables are dynamic through ioo
  - Changes do not require a system reboot
  - minservers is changed to a per-CPU tunable
  - maxservers is changed to 30
  - maxreqs is changed to 65536
- Benefit: it is no longer necessary to tune minservers/maxservers/maxreqs as in the past
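
As a rough worked example of the "times #CPUs" scaling, the helper below multiplies the per-CPU settings by a CPU count. It is a sketch that takes the slide's wording at face value (both bounds scale with the number of CPUs); the function name and the 4-CPU partition are hypothetical, and the lower figure is only the per-CPU setting scaled up, since no servers are active until requests arrive.

```python
def aio_server_bounds(ncpus, minservers=3, maxservers=30):
    """Return (scaled minservers, scaled maxservers) for a partition.

    Assumes both tunables are multiplied by the CPU count, as the
    slide's "(times #CPUs)" suggests.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return minservers * ncpus, maxservers * ncpus


if __name__ == "__main__":
    low, high = aio_server_bounds(ncpus=4)
    print(low, high)  # 12 120
```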

AIO Tunables

  ioo -a
  aio_active = 0
  aio_maxreqs = 65536
  aio_maxservers = 30
  aio_minservers = 3
  aio_server_inactivity = 300
  posix_aio_active = 0
  posix_aio_maxreqs = 65536
  posix_aio_maxservers = 30
  posix_aio_minservers = 3
  posix_aio_server_inactivity = 300

AIO Restricted Tunables

  > ioo -aF
  ...
  ##Restricted tunables
  aio_fastpath = 1
  aio_fsfastpath = 1
  aio_kprocprio = 39
  aio_multitidsusp = 1
  aio_sample_rate = 5
  aio_samples_per_cycle = 6
  posix_aio_fastpath = 1
  posix_aio_fsfastpath = 1
  posix_aio_kprocprio = 39
  posix_aio_sample_rate = 5
  posix_aio_samples_per_cycle = 6

CIO Read Mode Flag

- Allows an application to open a file for CIO such that subsequent opens without CIO avoid demotion
  - In the past, a second opening of a file without CIO would cause demotion, which removes many of the benefits of CIO
  - The second, read-only opening without CIO will still result in that opening having uncached reads to the file. Thus, such programs should ensure that their I/O sizes are large enough to achieve I/O efficiency
- Example: a backup application can access database files in read-only mode while the database has the file opened in concurrent I/O mode
- The open() flag is O_CIOR
- procfiles does not currently reflect O_CIO/O_CIOR
  - kdb 'u <slotnumber>', then for each file listed there 'file <filepointer>' gives some info

NFS Performance Improvements

- RFC 1323 enabled by default
  - Allows TCP window scaling beyond 64K, so more one-way packets can be in flight between acks for large sequential transfers
  - We had the nfs_rfc1323 tunable before; it just wasn't enabled by default
- Increased default number of biod daemons
  - 32 biod daemons per NFS V3 mount point
  - Very slight increase in memory (<2MB) required over the previous default of 4
  - Enables more I/Os to be outstanding at the same time; doesn't speed up sequential operations much, but helps random access (e.g. OLTP)
- Default read/write size increased to 64k for TCP connections
  - Was 32k previously
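
Why window scaling matters for large sequential transfers can be seen from the usual bound that a TCP connection cannot move more than one window of data per round trip. The sketch below computes that bound; the 65535-byte unscaled limit and the maximum shift of 14 come from the TCP and RFC 1323 definitions, while the 1 ms round-trip time and the function names are assumptions chosen for illustration.

```python
MAX_UNSCALED_WINDOW = 65535   # largest window a 16-bit TCP field can carry
MAX_WINDOW_SHIFT = 14         # largest scale factor RFC 1323 allows


def max_window(shift=0):
    """Largest advertised window in bytes for a given scale shift."""
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << shift


def window_limited_throughput(window_bytes, rtt_seconds):
    """Upper bound in bytes/second: one full window per round trip."""
    return window_bytes / rtt_seconds


if __name__ == "__main__":
    rtt = 0.001  # hypothetical 1 ms round trip
    for shift in (0, 2):
        window = max_window(shift)
        mb_per_s = window_limited_throughput(window, rtt) / 1e6
        print(f"shift={shift} window={window} bytes, at most {mb_per_s:.1f} MB/s")
```

Under that assumed round-trip time, an unscaled window caps a single connection at roughly 65 MB/s, below what a 1 Gb Ethernet link can carry, whereas a modest scale factor lifts the cap above line rate.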

NFS biod changes

- Having more biods allows better read-ahead and write-behind
- However, measured on a single-process basis, the results don't show huge performance differences over the AIX 5.3 defaults
- Results should improve in tests with multiple processes/threads operating over NFS
- NFS client tests, p5 520 on 1Gb Ethernet with 64kB I/Os (next slide)

NFS biod changes

[Figure: bar chart, NFS single-process throughput over a 256MB file, 32 biod vs. 4 biod; y-axis labeled MB/second, 0 to 120000; x-axis categories combine read/write, seq/rand, server cached/uncached, and create/overwrite]

NFS biod change with Kerberos krb5p

- The increase in biods has a much more positive impact when using Kerberos DES security
- Overlapping more compute with network traffic through more biods greatly improves throughput
- Same model as previous chart, krb5p (full packet encryption) mount option

NFS biod changes with Kerberos

[Figure: bar chart, NFS throughput with Kerberos krb5p, 32 biod vs. 4 biod; y-axis 0 to 70000; x-axis categories combine read/write, seq/rand, server cached/uncached, and create/overwrite]

Enhanced JFS nolog option

- JFS2 standard metadata logging for file system integrity can be disabled via a mount option
  - Similar to the legacy JFS nointegrity option
- Meant to enable faster migration of data to new storage
  - File system operations with heavy file create/delete activity can create log bottlenecks
- Potentially useful for temporary file systems where the file system can be easily recreated or fscked
- mount -o log=NULL during the data migration phase, then unmount and mount with standard logging

Enhanced JFS nolog option - example

- 4-way POWER5 p550, PHP Wikibench test
- Test makes heavy use of file metadata
- With a single-disk setup, the bottleneck is disk writes to the Enhanced JFS (JFS2) log

[Figure: PHP Wikibench bar chart, default log vs. nolog; y-axis 0 to 90]

[Figure: disk utilization over time (%disk busy, 0 to 100), default log vs. nolog]

With nolog, the log bottleneck is avoided

Multiple Page Size Segment (MPSS) Support

- POWER6 provides hardware support for mixing 4kB pages and 64kB pages in the same hardware segment
- This allows the AIX operating system to promote small pages to medium pages transparently to the application
  - This typically improves performance by reducing stress on hardware translation mechanisms
  - It is controlled with the vmo vmm_default_pspa parameter (-1 turns it off)
- This behavior is enabled by default on AIX 6.1 on POWER6 hardware
- Since it is not supported on POWER5, systems running identical application conditions on POWER5 and POWER6 may differ in exact memory page usage
- In general, no increase in memory consumption should be noticed; however, the usage of 64kB pages may increase on POWER6
- System paging activity may result in 64kB pages being broken into 4kB pages
  - 64kB pages that are broken by paging won't usually be reconstituted into 64kB pages later

MPSS Using svmon to see MPSS segments

  svmon -P 553068

       Pid Command      Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
    553068 java         44652     8388    37623    73342      N     Y     N

       PageSize         Inuse      Pin     Pgsp  Virtual
       s    4 KB         1132      244     4055     4798
       m   64 KB         2720      509     2098     4284

      Vsid  Esid  Type  Description                  PSize
     51b10     3  work  working storage              m
         0     0  work  kernel segment               m
     3c02d     d  work  text or shared-lib code seg  m
      d3a7     e  work  shared memory segment        sm
     61adc     -  work                               s
     65add     f  work  working storage              m
     51ad0     2  work  process private              m
     75ad9     1  work  code                         m

  [The per-segment Inuse, Pin, Pgsp and Virtual columns are garbled in the extracted text and are not reproduced here.]
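
The numbers in the svmon summary are consistent with the process totals being expressed in 4 KB frames, with each 64 KB (m) page counting as 16 of them. The short check below recomputes the four totals from the per-page-size rows shown above; the function name is made up for the example, and the reading of the units is inferred from the arithmetic rather than stated on the slide.

```python
FRAMES_4K_PER_64K_PAGE = 64 // 4  # one medium page spans 16 small frames


def total_4k_frames(small_pages, medium_pages):
    """Combine 4 KB and 64 KB page counts into a count of 4 KB frames."""
    return small_pages + medium_pages * FRAMES_4K_PER_64K_PAGE


if __name__ == "__main__":
    # (4 KB pages, 64 KB pages) taken from the PageSize rows of the output
    rows = {
        "Inuse": (1132, 2720),
        "Pin": (244, 509),
        "Pgsp": (4055, 2098),
        "Virtual": (4798, 4284),
    }
    for name, (small, medium) in rows.items():
        print(name, total_4k_frames(small, medium))
```

Each result matches the corresponding column on the java process line (44652, 8388, 37623 and 73342).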

MPSS Using svmon to detail MPSS segments

  svmon -D d3a7

  Segid: d3a7
  Type: working
  PSize: sm (4 KB - 64 KB)
  Address Range: 0..4095
  Size of page space allocation: 3744 pages (14.6 MB)
  Virtual: 4096 frames (16.0 MB)
  Inuse: 582 frames (2.3 MB)

      Page  Psize   Frame  Pin  ExtSegid
         0  m      442176  Y
         1  m      442177  Y
         2  m      442178  Y
       382  s      362140  N
       435  s      430534  N

2008 IBM Corporation


Implementation Considerations
AIX 5.2/5.3 to AIX 6.1 migration example (DB2 performance tuning)

AIX 5.2/5.3
- VMM page replacement tuning
  - Reduce minperm, maxperm, maxclient
  - Turn off strict_maxclient
  - Increase minfree, maxfree
- AIO tuning
  - Enable AIO
  - Tune minservers, maxservers and reboot
- DB2 tuning
  - Enable CIO

AIX 6.1
- VMM page replacement tuning
- AIO tuning
- DB2 tuning
  - Enable CIO

Implementation Considerations (Cont'd)

Best Practices
- Do not apply legacy tuning, since some tunables may now be restricted
  - If you do an upgrade install, your old tunings will be preserved
  - You may wish to undo them, but we won't make you
  - This level of tuning has been applied to numerous AIX 5.3 customers through field support
  - We are confident this was a good thing
  - However, we try never to change defaults in the service stream, so AIX 5.3 remains as it was
- Change restricted tunables only if recommended by AIX support

Implementation Considerations (Cont'd)

Problem Determination
- Common problems seen in the field or lab
  - Legacy VMM tuning results in error log entries (TUNE_RESTRICTED)
  - Tuning scripts fail due to the required confirmation for permanent changes of restricted tunables
  - Install/tuning scripts fail due to the missing aio0 device
- Diagnostics
  - Check AIX errpt for TUNE_RESTRICTED
  - Check /etc/tunables/lastboot.log
  - PERFPMR

