Multipathing and SAN Storage Considerations For AIX Administrators


IBM Power Systems

Multipathing and SAN Storage


Considerations for AIX Administrators

Dan Braden – dbraden@us.ibm.com


John Hock – jrhock@us.ibm.com

IBM Power Systems Advanced Technical Skills


February 28, 2013

© 2013 IBM Corporation


IBM Power Systems

Agenda

 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 2


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 3


IBM Power Systems

What is MPIO?
 MPIO is an architecture designed by AIX development (released in AIX V5.2)
 MPIO is also commonly used as an acronym for Multi-Path IO in general (the AIX PCM is often called MPIO)
► In this presentation, MPIO refers specifically to the architecture, not to multi-path IO in general

 Why was the MPIO architecture developed?


► With the advent of SANs, each disk subsystem vendor wrote their own multi-path code
► These multi-path code sets were usually incompatible
● Mixing disk subsystems was usually not supported on the same system, and if they
were, they usually required their own FC adapters
► Integration with AIX IO error handling and recovery
● Several levels of IO timeouts: basic IO timeout, FC path timeout, etc
 MPIO architecture details available to disk subsystem vendors
► Compliant code requires a Path Control Module (PCM) for each disk subsystem
● AIX PCMs for SCSI and FC ship with AIX and are often used by the vendors
► MPIO allows vendors to develop their own path selection algorithms
► Disk vendors have been moving towards MPIO compliant code

MPIO Common Interface

© 2012, 2013 IBM Corporation 4


IBM Power Systems

Overview of MPIO Architecture

 LUNs show up as an hdisk


►Architected for 32 K paths
►No more than 16 paths are necessary
 PCM: Path Control Module
►AIX PCMs exist for FC, SCSI
►Vendors may write optional PCMs
►May provide commands to manage paths
 Allows various algorithms to balance use
of paths
 Full support for multiple paths to rootvg

Tip: to keep paths <= 16, group sets of 4 host ports and 4 storage ports
and balance LUNs across them

 Hdisks can be Available, Defined or non-existent
 Paths can also be Available, Defined, Missing or non-existent
 Path status can be enabled, disabled or failed if the path is Available
(use the chpath command to change status)
 Add path: e.g. after installing a new adapter and cable to the disk,
run cfgmgr (or cfgmgr -l <adapter>)
 One must get the device layer correct before working with the path status layer

© 2012, 2013 IBM Corporation 5


IBM Power Systems

Disk configuration

Example vendor ODM update downloads:
https://tuf.hds.com/gsc/bin/view/Main/AIXODMUpdates
ftp://ftp.emc.com/pub/elab/aix/ODM_DEFINITIONS/
 The disk vendor…
 Dictates what multi-path code can be used
 Supplies the filesets for the disks and multipath code
 Supports the components that they supply
 A fileset is loaded to update the ODM to support the storage
 AIX then recognizes and appropriately configures the disk
 Without this, disks are configured using a generic ODM definition
 Performance and error handling may suffer as a result
 # lsdev -Pc disk displays supported storage
 The multi-path code will be a different fileset
 Unless using the MPIO that’s included with AIX

Beware of generic “Other” disk definition


No command queuing
Poor Performance & Error Handling
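A quick way to spot disks that fell back to the generic definition (a sketch; the exact description string varies by AIX level, "Other FC SCSI Disk Drive" is typical):

# lsdev -Cc disk | grep -i other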

© 2012, 2013 IBM Corporation 6


IBM Power Systems

AIX Path Control Module (PCM) IO basics


The AIX PCM…
 Is part of the MPIO architecture
 Chooses the path each IO will take
 Is used to balance the use of resources used to connect to the storage
 Depends on the algorithm attribute for each hdisk
 Handles path failures to ensure availability with multiple paths
 Handles path failure recovery
 Checks the status of paths
 Supports boot disks
 Not all multi-path code sets support boot disks
 Offers PCMs for both Fibre Channel and SCSI protocol disks
 Supports active/active, active/passive and ALUA disk subsystems
 But not all disk subsystems
 Supports SCSI-2 and SCSI-3 reserves
 SCSI reserves are often not used

© 2012, 2013 IBM Corporation 7


IBM Power Systems

How many paths for a LUN?

• Paths = (# of paths from server to switch) x (# of paths from storage to switch)
…Here there are potentially 6 paths per LUN
…But reduced via:
• LUN masking at the storage
Assign LUNs to specific FC adapters at the host, and thru specific ports on the storage
• Zoning
WWPN or SAN switch port zoning
• Dual SAN fabrics
divides potential paths by two
• 4 paths per LUN are sufficient for availability and reduce CPU overhead for choosing the path
• Path selection overhead is relatively low, usually negligible
• MPIO has no practical limit on the number of paths
• Other products have path limits
• SDDPCM is limited to 16 paths per LUN

(Diagram: server connected thru an FC switch to storage)

© 2012, 2013 IBM Corporation 8


IBM Power Systems

How many paths for a LUN?, cont’d


Dual SAN Fabric for SAN Zoning Reduces Potential Paths

(Diagram: server and storage connected thru one fabric vs. two separate fabrics)

Single fabric: 4 x 4 = 16 paths        Dual fabric: 2 x 2 + 2 x 2 = 8 paths
With single initiator to single target zoning, both examples would have 4 paths
A popular approach is to use 4 host and 4 storage ports, zoning one host port to one
storage port, yielding 4 paths
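To verify how many paths each LUN actually has, a rough sketch that simply counts lspath lines per disk:

# for d in $(lsdev -Cc disk -F name); do print "$d: $(lspath -l $d 2>/dev/null | wc -l) paths"; done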

© 2012, 2013 IBM Corporation 9


IBM Power Systems

Path selection benefits and costs


 Path selection algorithms choose a path to hopefully minimize the latency added to
an IO to send it over the SAN to the storage
 Latency to send a 4 KB IO over an 8 Gbps SAN link is
4 KB / (8 Gb/s x 0.1 B/b x 1048576 KB/GB) = 0.0048 ms
 Multiple links may be involved, and IOs are round trip
 As compared to fastest IO service times around 1 ms

 If the links aren't busy, there likely won't be much, if any, savings from
use of sophisticated path selection algorithms vs. round robin
Generally utilization of links is low

 Costs of path selection algorithms (could outweigh latency savings)
 CPU cycles to choose the best path
 Memory to keep track of in-flight IOs down each path, or
 Memory to keep track of IO service times down each path
 Latency added to the IO to choose the best path
© 2012, 2013 IBM Corporation 10
IBM Power Systems

Balancing IOs with algorithms fail_over and round_robin

A fail_over algorithm can be efficiently used to balance IOs!


► Any load balancing algorithm must consume CPU and memory resources to determine
the best path to use.
► Using path priorities, it is possible to set up fail_over LUNs so that the loads are
balanced across the available FC adapters.
► Let's use an example with 2 FC adapters. Assume we correctly lay out our data so that
the IOs are balanced across the LUNs (this is usually a best practice). Then if we
assign half the LUNs to FC adapterA and half to FC adapterB, then the IOs are evenly
balanced across the adapters!
► A question to ask is, “If one adapter is handling more IO than another, will this have a
significant impact on IO latency?”
► Since the FC adapters are capable of handling more than 50,000 IOPS then we're
unlikely to bottleneck at the adapter and add significant latency to the IO.

round_robin may more easily ensure balanced IOs across the links for each LUN
● e.g., if the IOs to the LUNs aren't balanced, then it may be difficult to balance the
LUNs and their IO rates across the adapter ports with fail_over
● requires fewer resources than load balancing algorithms
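As a sketch of switching a LUN to round_robin (hdisk4 is a placeholder; with the AIX PCM, round_robin typically also requires reserve_policy=no_reserve, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk4 -a algorithm=round_robin -a reserve_policy=no_reserve -P
# lsattr -El hdisk4 -a algorithm -a reserve_policy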

© 2012, 2013 IBM Corporation 11


IBM Power Systems

Multi-path IO with VIO and VSCSI LUNs

(Diagram: VIO client running the AIX PCM, two VIO Servers each running multi-path code, disk subsystem)

 Two layers of multi-path code: VIOC and VIOS

 VSCSI disks always use the AIX PCM, and all IO for a LUN normally goes thru one VIOS
► algorithm = fail_over only

 Set the path priorities for the VSCSI hdisks so half use one VIOS, and half use the other

 The VIOS uses the multi-path code specified for the disk subsystem

 Typical setup: set the vscsi adapter attribute vscsi_err_recov to fast_fail. The default is
delayed_fail. This will speed up path failover in the event of a VIOS failure.
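For example, on the VIO client (a sketch; vscsi0/vscsi1 are placeholders, and -P defers the change until the adapter is reconfigured or the LPAR is rebooted):

# chdev -l vscsi0 -a vscsi_err_recov=fast_fail -P
# chdev -l vscsi1 -a vscsi_err_recov=fast_fail -P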

© 2012, 2013 IBM Corporation 12


IBM Power Systems

Multi-path IO with VIO and NPIV


 One layer of multi-path code

(Diagram: VIO client with virtual FC adapters (vFC), two VIO Servers with physical HBAs, disk subsystem)

 The VIOC has virtual FC adapters (vFC)
► Potentially one vFC adapter for every real FC adapter in each VIOC
► A maximum of 64 vFC adapters per real FC adapter is recommended

 The VIOC uses the multi-path code that the disk subsystem supports

 IOs for a LUN can go thru both VIOSs

Mixed multi-path code sets, which may be incompatible on a single LPAR, can be used on VIOC
LPARs with NPIV sharing the same physical adapter, provided the incompatible code isn't used
on the same LPAR, e.g. PowerPath for EMC and MPIO for DS8000.
© 2012, 2013 IBM Corporation 13
IBM Power Systems

Active/Active, Active/Passive and ALUA Disk Subsystem Controllers


 Active/Active controllers
► IOs can be sent to any controller for a LUN
► DS8000, DS6000 and XIV

 Active/Passive controllers
► IOs for a LUN are sent to the primary controller for the LUN, except in failure scenarios
► The storage administrator balances LUNs across the controllers
● Controllers should be active for some LUNs and passive for others
► DS3/4/5000

 ALUA – Asymmetric Logical Unit Access


► IOs can be sent to any controller, but one controller is preferred (IOs are passed to the primary)
● Preferred due to performance considerations
► SVC, V7000 and NSeries/NetApp
● Using ALUA on NSeries/NetApp is preferred
 Set on the storage
 MPIO supports Active/Passive and Active/Active disk subsystems
► SVC and V7000 are treated as Active/Passive

 Terminology regarding active/active and active/passive varies considerably


© 2012, 2013 IBM Corporation 14
IBM Power Systems

MPIO support
Storage Subsystem Family | MPIO code | Multi-path algorithm
IBM ESS, DS6000, DS8000, DS3950, DS4000, DS5000, SVC, V7000 | IBM Subsystem Device Driver Path Control Module (SDDPCM) or AIX PCM | fail_over, round_robin, and for SDDPCM: load balance, load balance port
DS3/4/5000 in VIOS | AIX FC PCM recommended | fail_over, round_robin
IBM XIV Storage System | AIX FC PCM | fail_over, round_robin
IBM System Storage N Series | AIX FC PCM | fail_over, round_robin
EMC Symmetrix | AIX FC PCM | fail_over, round_robin
HP & HDS (varies by model) | Hitachi Dynamic Link Manager (HDLM) | fail_over, round robin, extended round robin
HP & HDS (varies by model) | AIX FC PCM | fail_over, round_robin
SCSI | AIX SCSI PCM | fail_over, round_robin
VIO VSCSI | AIX SCSI PCM | fail_over
© 2012, 2013 IBM Corporation 15


IBM Power Systems

Non-MPIO multi-path code

Storage subsystem family | Multi-path code
IBM DS6000, DS8000, SVC, V7000 | SDD
IBM DS4000 | Redundant Disk Array Controller (RDAC)
EMC | PowerPath
HP | AutoPath
HDS | HDLM (older versions)
Veritas-supported storage | Dynamic MultiPathing (DMP)

© 2012, 2013 IBM Corporation 16


IBM Power Systems

Mixing multi-path code sets

 The disk subsystem vendor specifies what multi-path code is supported for their storage
► The disk subsystem vendor supports their storage, the server vendor generally doesn’t
 You can mix multi-path code compliant with MPIO and even share adapters
► There may be exceptions. Contact vendor for latest updates.
HP example: “Connection to a common server with different HBAs requires separate
HBA zones for XP, VA, and EVA”
 Generally only one non-MPIO compliant code set can coexist with MPIO compliant code sets
► Exception: SDD and RDAC can be mixed on the same LPAR
► The non-MPIO compliant code must use its own adapters
● Exception: RDAC can share adapter ports with MPIO
 Devices of a given type use only one multi-path code set
► e.g., you can’t use SDDPCM for one DS8000 and SDD for another DS8000 on the same
AIX instance

© 2012, 2013 IBM Corporation 17


IBM Power Systems

Sharing Fibre Channel Adapter ports

 Disk using MPIO compliant code sets can share adapter ports

 It’s recommended that disk and tape use separate ports

Disk (typically small block random) and
tape (typically large block sequential) IO
are different, and stability issues have
been seen at high IO rates

© 2012, 2013 IBM Corporation 18


IBM Power Systems

MPIO Command Set


 lspath – list paths, path status, path ID, and path attributes for a disk

 chpath – change path status or path attributes


► Enable or disable paths

 rmpath – delete or change path state


► Putting a path into the defined mode means it won’t be used (from available to
defined)
► One cannot define/delete the last path of an open device

 mkpath – add another path to a device or makes a defined path available


► Generally cfgmgr is used to add new paths

 chdev – change a device’s attributes (not specific to MPIO)

 cfgmgr – add new paths to an hdisk or make defined paths available


(not specific to MPIO)

© 2012, 2013 IBM Corporation 19


IBM Power Systems

Useful MPIO Commands


 List status of the paths and the parent device (or adapter)
# lspath -Hl <hdisk#>
 List connection information for a path
# lspath -l hdisk2 -F"status parent connection path_status path_id"
Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3
 The connection field contains the storage port WWPN
► In the case above, paths go to two storage ports and WWPNs:
203900a0b8478dda
201800a0b8478dda
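To see how the paths for all disks are spread across storage ports, a rough sketch that tallies the storage WWPNs from the connection field:

# for d in $(lsdev -Cc disk -F name); do lspath -l $d -F connection 2>/dev/null; done | cut -d, -f1 | sort | uniq -c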
 List a specific path's attributes
# lspath -AEl hdisk2 -p fscsi0 -w "203900a0b8478dda,f000000000000"
scsi_id 0x30400 SCSI ID False
node_name 0x200800a0b8478dda FC Node Name False
priority 1 Priority True

© 2012, 2013 IBM Corporation 20


IBM Power Systems

Path priorities
 A Priority attribute for paths can be used to specify a preference for path
IOs. How it works depends on whether the hdisk's algorithm attribute is set to
fail_over or round_robin.
With fail_over, the value specified is inverse to the priority, i.e. "1" is the highest priority

 algorithm=fail_over
►the path with the highest priority (lowest priority value) handles all the IOs unless there's a path failure.
►Set the primary path to be used by setting its priority value to 1, and the next path's
priority (in case of path failure) to 2, and so on.
►if the path priorities are the same, the primary path will be the first listed for the hdisk
in the CuPath ODM as shown by # odmget CuPath

 algorithm=round_robin
►If the priority attributes are the same, then IOs go down each path equally.
►In the case of two paths, if you set path A's priority to 1 and path B's to 255, then for
every IO going down path A, there will be 255 IOs sent down path B.

 To change the path priority of an MPIO device on a VIO client:


# chpath -l hdisk0 -p vscsi1 -a priority=2
►Set path priorities for VSCSI disks to balance use of VIOSs
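A minimal sketch of alternating priorities across two client adapters (vscsi0/vscsi1 and the even/odd split are assumptions; it also assumes every VSCSI disk has a path thru both adapters):

#!/usr/bin/ksh
# Give vscsi0 the preferred (priority 1) path on every other disk, vscsi1 on the rest
i=0
for d in $(lsdev -Cc disk -s vscsi -F name); do
    if [ $((i % 2)) -eq 0 ]; then p0=1; p1=2; else p0=2; p1=1; fi
    chpath -l $d -p vscsi0 -a priority=$p0
    chpath -l $d -p vscsi1 -a priority=$p1
    i=$((i + 1))
done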

© 2012, 2013 IBM Corporation 21


IBM Power Systems

Path priorities
# lsattr -El hdisk9
PCM PCM/friend/otherapdisk Path Control Module False
algorithm fail_over Algorithm True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
lun_id 0x5000000000000 Logical Unit Number ID False
node_name 0x20060080e517b6ba FC Node Name False
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
ww_name 0x20160080e517b6ba FC World Wide Name False

# lspath -l hdisk9 -F"parent connection status path_status"


fscsi1 20160080e517b6ba,5000000000000 Enabled Available
fscsi1 20170080e517b6ba,5000000000000 Enabled Available

# lspath -AEl hdisk9 -p fscsi1 -w"20160080e517b6ba,5000000000000"


scsi_id 0x10a00 SCSI ID False
node_name 0x20060080e517b6ba FC Node Name False
priority 1 Priority True

Note: whether or not path priorities apply depends on the PCM.
With SDDPCM, path priorities only apply when the algorithm used is fail over (fo).
Otherwise, they aren't used.
© 2012, 2013 IBM Corporation 22
IBM Power Systems

Path priorities – why change them?


 With VIOCs, send the IOs for half the LUNs to one VIOS and half to the other

►Set priorities for half the LUNs to use VIOSa/vscsi0 and half to use
VIOSb/vscsi1
►Uses both VIOSs CPU and virtual adapters
►algorithm=fail_over is the only option at the VIOC for VSCSI disks

 With NSeries – have the IOs go to the primary controller for the LUN if not using
ALUA (ALUA is preferred)
►When not using ALUA, use the dotpaths utility to set path priorities to ensure most IOs go to
the preferred controller

To see to which VIOS a vscsi adapter is connected:


# echo "cvai" | kdb | grep vscsi | grep vhost

vscsi0 0x000007 0x0000000000 0x0 vios1->vhost0


vscsi1 0x000007 0x0000000000 0x0 vios2->vhost1

© 2012, 2013 IBM Corporation 23


IBM Power Systems

Path Health Checking and Recovery


Validates a path is working & automates recovery of failed path
Note: applies to open disks only
 For SDDPCM and MPIO compliant disks, two hdisk attributes apply:
# lsattr -El hdisk26
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True

 hcheck_interval
► Defines how often (1 to 3600 seconds) the health check is performed on the paths for a device.
When a value of 0 is selected (the default), health checking is disabled
► Preferably set to at least 2X the IO timeout value, often 30 seconds

 hcheck_mode
► Determines which paths should be checked when the health check capability is used:

● enabled: Sends the healthcheck command down paths with a state of enabled
● failed: Sends the healthcheck command down paths with a state of failed
● nonactive: (Default) Sends the healthcheck command down paths that have no active I/O, including
paths with a state of failed. If the algorithm selected is failover, then the healthcheck command is
also sent on each of the paths that have a state of enabled but have no active IO. If the algorithm
selected is round_robin, then the healthcheck command is only sent on paths with a state of failed,
because the round_robin algorithm keeps all enabled paths active with IO.
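For example, to turn health checking on for one LUN (a sketch; hdisk26 and 60 seconds are placeholders, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk26 -a hcheck_interval=60 -a hcheck_mode=nonactive -P
# lsattr -El hdisk26 -a hcheck_interval -a hcheck_mode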

 Consider setting up error notification for path failures (later slide)

© 2012, 2013 IBM Corporation 24


IBM Power Systems

Path Recovery
 MPIO will recover failed paths if path health checking is enabled with hcheck_mode=nonactive
or failed and the device has been opened

 Trade-offs exist:
► Lots of path health checking can create a lot of SAN traffic
► Automatic recovery requires turning on path health checking for each LUN
► Lots of time between health checks means paths will take longer to recover after repair
► Health checking for a single LUN is often sufficient to monitor all the physical paths,
but not to recover them
 SDD and SDDPCM also recover failed paths automatically
 In addition, SDDPCM provides a health check daemon to provide an automated method of
reclaiming failed paths to a closed device.

 To manually enable a failed path after repair or re-enable a disabled path:


# chpath -l hdisk1 -p <parent> -w <connection> -s enable
or run cfgmgr or reboot

© 2012, 2013 IBM Corporation 25


IBM Power Systems

Path Recovery With Flaky Links


 When a path fails, it takes AIX time to recognize it, and to redirect in-flight IOs previously sent
down the failed path
► IO stalls during this time, along with processes waiting on the IO
► Turning off a switch port results in a 20 second stall
● Other types of failures may take longer
► AIX must distinguish between slow IOs and path failures

 With flaky paths that go up and down, this can be a problem


 The MPIO timeout_policy attribute for hdisks addresses this for command timeouts
► IZ96396 for AIX 7.1, IZ96302 for AIX 6.1
► timeout_policy=retry_path Default, and similar to the behavior before the attribute existed. The first
occurrence of a command timeout on the path does not cause immediate path failure.
► timeout_policy=fail_path Fail the path on a command timeout; recover it only after several clean health checks
► timeout_policy=disable_path Disable the path and leave it that way
● Manual intervention will be required so be sure to use error notification in this case
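A sketch of setting the policy on one hdisk (hdisk4 is a placeholder; -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk4 -a timeout_policy=fail_path -P
# lsattr -El hdisk4 -a timeout_policy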

 SDDPCM recoverDEDpath attribute – similar to timeout_policy but for all kinds of path errors
► recoverDEDpath=no Default and failed paths stay that way
► recoverDEDpath=yes Allows failed paths to be recovered
► SDDPCM V2.6.3.0 or later

© 2012, 2013 IBM Corporation 26


IBM Power Systems

Path management with AIX PCM


 Includes examining, adding, removing, enabling and disabling paths
► Adapter failure/replacement or addition
► Planned VIOS outages
► Cable failure and replacement
► Storage controller/port failure and repair
 Adapter replacement
► Paths will not be in use if the adapter has failed; the paths will be in the Failed state
1. Remove the adapter and its child devices, including the paths using the adapter, with
# rmdev -Rdl <fcs#>
2. Replace the adapter
3. cfgmgr
4. Check the paths with lspath
 It's better to stop using a path before you know the path will disappear
► Avoid timeouts, application delays or performance impacts and potential error
recovery bugs
► To disable all paths using a specific FC port on the host:
# chpath -l hdisk1 -p <parent> -s disable
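To disable every path thru one host FC port before planned maintenance, a loop such as this can help (a sketch; fscsi0 is a placeholder, and it assumes each disk still has at least one other enabled path):

# for d in $(lsdev -Cc disk -F name); do chpath -l $d -p fscsi0 -s disable 2>/dev/null; done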
© 2012, 2013 IBM Corporation 27
IBM Power Systems

Example: Active/Passive Paths

© 2012, 2013 IBM Corporation 28


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 29


IBM Power Systems

Path Health Checking and Recovery – Notification!

 One should also set up error notification for path failure, so that someone knows
about it and can correct it before something else fails.

 This is accomplished by determining the error that shows up in the error log when a
path fails (via testing), and then

 Adding an entry to the errnotify ODM class for that error which calls a script (that you
write) that notifies someone that a path has failed.

Hint: You can use # odmget errnotify to see what the entries (or stanzas) look like,
then you create a stanza and use the odmadd command to add it to the errnotify
class.
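For illustration only, a sketch of such a stanza and method script; it assumes the label logged on a path failure in your environment is SC_DISK_ERR7 (one of the PATH HAS FAILED labels shown later in this deck; confirm by testing), and the script name is a placeholder:

errnotify:
        en_name = "path_fail_notify"
        en_persistenceflg = 1
        en_class = "H"
        en_label = "SC_DISK_ERR7"
        en_method = "/usr/lib/ras/path_fail_mail.sh $1 $6"

/usr/lib/ras/path_fail_mail.sh:
#!/usr/bin/ksh
# $1 = error log sequence number, $2 = failing resource name (from the $6 keyword)
errpt -a -l $1 | mail -s "Path failure on $(hostname): $2" root

Add the stanza with odmadd and test it, for example with the errlogger or ras_logger techniques covered later.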

© 2012, 2013 IBM Corporation 30


Why Notification?

• Automatic notification of incidents supports the goal of restoring normal IT service
operations as quickly as possible
• Automatic notification of incidents can minimize disruption to users and business
operations
– May allow correction of a problem before a critical outage (e.g. MPIO path failure)
• Facilitates establishment of well-defined and controlled processes for effective handling
of events and alerts
• Notification is defined as an IT Best Practice
– Within the ITIL V3 Service Support Framework: Event & Alert Management
 Event & Alert Management defines monitoring and handling of all events occurring
throughout the IT services and systems

The Information Technology Infrastructure Library (ITIL) is a set of practices for IT service
management that focuses on aligning IT services with the needs of business.

© 2012, 2013 IBM Corporation 31
Why Notification? – Problem Resolution Time

Without Automated Notification:
Incident Occurs -> Incident Recognized -> Support Contacted -> Problem Analysis ->
Solution Determined -> Corrective Action and Testing -> Return to Service

With Automated Notification…Reduced Problem Resolution Time:
The same steps, but incident recognition and contacting support happen immediately,
shortening the overall timeline.

© 2012, 2013 IBM Corporation 32
Error Logging Components in AIX

© 2012, 2013 IBM Corporation 33


Options for Error Notification

(Diagram: the four options: ODM-Based Error Notification, Custom Notification, diag Command
Diagnostics, Concurrent Error Logging)

© 2012, 2013 IBM Corporation 34


Options for Error Notification

• ODM-Based
The errdemon program uses the errnotify ODM class for error notification
• diag Command Diagnostics
The diag command package contains a periodic diagnostic procedure called diagela. Hardware
(only) errors generate mail messages to members of the system group, or other email
addresses, as configured.
• Custom Notification
Write a shell script to check the error log periodically
• Concurrent Error Logging
Start errpt -c and each error is then reported when it occurs.

© 2012, 2013 IBM Corporation 35
Error Notification – diag Error Log Analysis
Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.) Menu

© 2012, 2013 IBM Corporation 36


Concurrent Error Logging - Easy

Start errpt -c to have each error reported when it occurs.

Hint: redirect the output to the console to have an operator informed about each new error entry.
# errpt -c > /dev/console &

© 2012, 2013 IBM Corporation 37


Custom Notification Script
Write a shell script to check the error log periodically

#!/usr/bin/ksh
#######################################################
# Sample script to perform simple error notification  #
#######################################################
errpt > /tmp/error_log_1          # save version 1 of the error log

while true                        # loop forever checking error log
do
    sleep 60                      # wait one minute
    errpt > /tmp/error_log_2      # save version 2 of the error log

    # Compare version 1 and version 2 of the error logs
    # If they are the same, then go back to sleep
    cmp -s /tmp/error_log_1 /tmp/error_log_2 && continue

    # Files are different. A new error log entry detected
    # Send messages to the console and to root user
    print "Warning: error log has changed" > /dev/console
    mail -s "Warning: error log has changed" root <<-EOF
ALERT! Error Log Has Changed ALERT!
EOF
    errpt > /tmp/error_log_1      # save new copy of error log
done                              # Go back to sleep

© 2012, 2013 IBM Corporation 38
ODM-based Error Notification: errnotify

The Error Notification object class specifies the conditions and actions to be taken when
errors are recorded in the system error log.

The user specifies these conditions and actions in the errnotify Error Notification object.

Useful ODM Commands
 odmadd
Adds objects to an object class. The odmadd command takes an ASCII stanza file as input
and populates object classes with the objects found in the stanza file.
 odmdelete
Removes objects from an object class.
 odmshow
Displays the description of an object class.

© 2012, 2013 IBM Corporation 39
ODM-based Error Notification: errnotify
errnotify Description

# odmshow errnotify
class errnotify {
        long en_pid;                /* offset: 0xc  ( 12) */
        char en_name[16];           /* offset: 0x10 ( 16) */
        short en_persistenceflg;    /* offset: 0x20 ( 32) */
        char en_label[20];          /* offset: 0x22 ( 34) */
        ulong en_crcid;             /* offset: 0x38 ( 56) */
        char en_class[2];           /* offset: 0x3c ( 60) */
        char en_type[5];            /* offset: 0x3e ( 62) */
        char en_alertflg[6];        /* offset: 0x43 ( 67) */
        char en_resource[16];       /* offset: 0x49 ( 73) */
        char en_rtype[16];          /* offset: 0x59 ( 89) */
        char en_rclass[16];         /* offset: 0x69 ( 105) */
        char en_symptom[6];         /* offset: 0x79 ( 121) */
        char en_err64[6];           /* offset: 0x7f ( 127) */
        char en_dup[6];             /* offset: 0x85 ( 133) */
        char en_method[255];        /* offset: 0x8b ( 139) */

© 2012, 2013 IBM Corporation 40
ODM-based Error Notification: Object Descriptors

en_alertflg Indicates whether the error can be alerted. For use by alert agents. TRUE or FALSE
en_class Class of the error log entry to match: H-hw S-sw O-from errlogger U-undetermined
en_crcid Specifies the unique error identifier associated with a particular error.
en_dup If set, identifies whether duplicate errors should be matched. TRUE or FALSE
en_err64 If set, identifies whether errors from a 64-bit or 32-bit environment should be matched.
en_label Specifies the label associated with a particular error identifier as defined in errpt –t output
en_method Specifies a user-programmable action to be run when error matches selection criteria
en_name Uniquely identifies the Error Notification object. Name used when removing the object
en_persistenceflg Designates if the object should persist through boot. 0-non-persistent 1-persistent
en_pid Specifies a process ID (PID) for use in identifying the Error Notification object.
en_rclass Identifies the class of the failing resource. Not applicable for software class
en_resource Identifies the name of the failing resource
en_rtype Identifies the type of the failing resource
en_symptom Enables notification of an error accompanied by a symptom string when set to TRUE
en_type Identifies severity of error log entries to match. INFO PEND PERM PERF TEMP UNKN
© 2012, 2013 IBM Corporation 41
ODM-based Error Notification: errnotify

Basic Configuration Steps:
1. Create an ASCII stanza file containing the Error Notification object with
desired conditions and actions (method).
2. Add the object to the errnotify Error Notification object class in the
/etc/objrepos/errnotify file:
odmadd /tmp/en_sample.add
3. Copy any user-written en_method action script to the /usr/lib/ras directory

/tmp/en_sample.add file (mails the error entry to root each time a disk error
of type PERM is logged; note the use of $n keywords):
errnotify:
        en_name = "sample"
        en_persistenceflg = 0
        en_class = "H"
        en_type = "PERM"
        en_rclass = "disk"
        en_method = "errpt -a -l $1 | mail -s 'Disk Error' root"
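To load, verify, and later remove the object (a sketch using the sample name above):

# odmadd /tmp/en_sample.add
# odmget -q "en_name=sample" errnotify
# odmdelete -o errnotify -q "en_name=sample"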
© 2012, 2013 IBM Corporation 42
ODM-based Error Notification: Arguments to Notify
Method

The following keywords are automatically expanded by the Error Notification


daemon as arguments to the notify method:
$1 Sequence number from error log entry
$2 Error ID from error log entry
$3 Class from the error log entry
$4 Type from the error log entry
$5 Alert flags value from the error log entry
$6 Resource name from the error log entry
$7 Resource type from the error log entry
$8 Resource class from the error log entry
$9 Error label from the error log entry

© 2012, 2013 IBM Corporation 43


Notification – What to Monitor?
Path and Fibre Channel Related Errors

# errpt -t | egrep "PATH|FCA"   => 23 unique IDs
02A8BC99 SC_DISK_PCM_ERR8  PERM H PATH HAS FAILED
080784A7 DISK_ERR6         PERM H PATH HAS FAILED
13484BD0 SC_DISK_PCM_ERR16 PERM H PATH ID
14C8887A FCA_ERR10         PERM H COMMUNICATION PROTOCOL ERROR
1D20EC72 FCA_ERR1          PERM H ADAPTER ERROR
1F22F4AA FCA_ERR14         TEMP H DEVICE ERROR
278804AD FCA_ERR5          PERM S SOFTWARE PROGRAM ERROR
2BD0BD1A FCA_ERR9          TEMP H ADAPTER ERROR
3B511B1A FCA_ERR8          UNKN H UNDETERMINED ERROR
40535DDB SC_DISK_PCM_ERR17 PERM H PATH HAS FAILED
7BFEEA1F FCA_ERR4          TEMP H LINK ERROR
84C2184C FCA_ERR3          PERM H LINK ERROR
9CA8C9AD SC_DISK_PCM_ERR12 PERM H PATH HAS FAILED
A6F5AE7C SC_DISK_PCM_ERR9  INFO H PATH HAS RECOVERED
D666A8C7 FCA_ERR2          TEMP H ADAPTER ERROR
DA930415 FCA_ERR11         TEMP H COMMUNICATION PROTOCOL ERROR
DE3B8540 SC_DISK_ERR7      PERM H PATH HAS FAILED
E8F9BA61 CRYPT_ERROR_PATH  INFO H SOFTWARE PROGRAM ERROR
ECCE4018 FCA_ERR6          TEMP S SOFTWARE PROGRAM ERROR
F29DB821 FCA_ERR7          UNKN H UNDETERMINED ERROR

You must test for the common errors in your environment, e.g.:
# errpt -atJ FCA_ERR4
---------------------------------------------------------------------------
IDENTIFIER 7BFEEA1F
Label: FCA_ERR4
Class: H
Type: TEMP
Loggable: YES   Reportable: YES   Alertable: NO
Description
LINK ERROR
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
SENSE DATA

© 2012, 2013 IBM Corporation 44
Notification – What to Monitor?
Disk Related Errors

# errpt -t | egrep "DISK|SAS|SCSI"   => 166 unique IDs!
00B984B3 SC_DISK_ERR5        UNKN H UNDETERMINED ERROR
0118DD96 SISSAS_BAT_P        PERM H BATTERY PACK FAILURE
01A236F0 SC_DISK_PCM_ERR11   PERM H REQUESTED OPERATION CANNOT BE PERFORMED
02A8BC99 SC_DISK_PCM_ERR8    PERM H PATH HAS FAILED
02E74ED4 ICS_ERR11           INFO O Additional iSCSI Adapter Information
03913B94 LVM_HWREL           UNKN H HARDWARE DISK BLOCK RELOCATION ACHIEVED
0502F666 SCSI_ERR1           PERM H ADAPTER ERROR
05EFA03B SC_DISK_PCM_ERR15   PERM H REMOTE VOL MIRRORING: ILLEGAL I/O ORIGIN
0734DA1D DISKETTE_ERR3       PERM H DISKETTE MEDIA ERROR
078ED5D2 SAS_ERR1            PERM H ADAPTER ERROR
080784A7 DISK_ERR6           PERM H PATH HAS FAILED
08F9C47C SC_DISK_PCM_ERR14   PERM H SNAPSHOT REPOSITORY METADATA ERROR
0C10BB8C SC_DISK_PCM_ERR4    INFO H ARRAY CONFIGURATION CHANGED
1081B888 SISSAS_LINK_CABLE   PERM H ADAPTER TO ADAPTER CABLING ERROR
12308453 SC_DISK_PCM_ERR20   PERM H SINGLE CONTROLLER RESTART FAILURE
13484BD0 SC_DISK_PCM_ERR16   PERM H PATH ID
15FD5EE8 SCSI_ARRAY_ERR5     PERM H DISK OPERATION ERROR
16F35C72 DISK_ERR2           PERM H DISK OPERATION ERROR
1AE69D3A FSCSI_ERR9          PERM H POTENTIAL DATA LOSS CONDITION
...
425BDD47 DISK_ERR1           PERM H DISK OPERATION ERROR
44C5506E ISCSI_ERR9          PERM H COMMUNICATION PROTOCOL ERROR
8580332D SISSAS_LINK_CONFIG  PERM H MULTIPLE ADAPTER LINK CONFIGURATION ERRO
85D29B05 SISSAS_ERR16T       TEMP H ARRAY CONFIGURATION ERROR
8647C4E2 DISK_ERR3           PERM H DISK OPERATION ERROR
F43A59CD FSCSI_ERR3          PERM H ADAPTER ERROR
F4C2CCF7 SISIOA_ERR01PD      PERM H SCSI DEVICE OR MEDIA ERROR
F7863CFE SSA_DISK_ERR4       PERM H DISK OPERATION ERROR
F91ADEB5 SISSAS_LOGICAL_BAD  INFO H LOGICAL READ ERROR
FBEE4B29 SISSAS_ERR11P       PERM H SAS FABRIC OR DEVICE ERROR
FBF0BFC1 TMSCSI_UNRECVRD_ERR PERM H ATTACHED SCSI TARGET DEVICE ERROR
FE9E9357 SSA_DEVICE_ERROR    PERM H DISK OPERATION ERROR
FEC31570 SCSI_ERR7           PERM H UNDETERMINED ERROR
FEFD41FF DISK_ERR5           UNKN H UNDETERMINED ERROR

Common Disk Errors to Monitor:
DISK_ERR1: Volume failure. Action: replace the disk
DISK_ERR2/3: Device does not respond. Action: check the power supply
DISK_ERR4: (Temporary) bad block, etc. Action: replace the disk if it persists > 1/wk
SCSI_ERRn: SCSI communication problem. Action: check cables, SCSI addresses, terminator
SAS*: SAS errors

© 2012, 2013 IBM Corporation 45
Error Log - Manual Logging and Testing

errlogger command
allows the system administrator to record messages of up to 1024 bytes in the error log.
# errlogger system hard disk '(hdisk0)' replaced.

Whenever you perform system maintenance activity, it is a good idea to record this activity
in the system error log
clearing entries from the error log
replacing/moving hardware
applying a software fix
re-cabling storage…

The command may also be helpful for testing notification programs:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION


AA8AB241 1023160212 T O OPERATOR OPERATOR NOTIFICATION

# errpt -t | grep OPERATOR


AA8AB241 OPMSG TEMP O OPERATOR NOTIFICATION

© 2012, 2013 IBM Corporation 46


Error Log – Injecting Errors
bookmark this page!

ras_logger command
allows the system administrator to record any error from the command line.
 log an error from a shell script
 test newly-created error templates
Example: /usr/lib/ras/ras_logger < tfile where,
tfile contains the error information using the error's template to determine
how to log the data. The format of the input is the following:

error_label
resource_name
64_bit_flag
detail_data_item1
detail_data_item2
...

• error_label is the error's label defined in the template in /var/adm/ras/errtmplt


• resource_name field is up to 16 characters in length.
• 64_bit_flag field's values are 0 for a 32-bit error and 1 for a 64-bit error.
• detail_data fields correspond to the Detail_Data items in the template.

© 2012, 2013 IBM Corporation 47


Example ras_logger Usage
inject a DMA error…

View the selected error template:
# errpt -atJ DMA_ERR
----------------------------------------------------------------
IDENTIFIER 00530EA6
Label: DMA_ERR
Class: H
Type: UNKN
Loggable: YES   Reportable: YES   Alertable: NO
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS NUMBER
CHANNEL UNIT ADDRESS
ERROR CODE

Then inject the error with an input file whose lines map to the template fields:
# /usr/lib/ras/ras_logger < tfile

tfile:
+1 DMA_ERR
+2 resourcex
+3 0
+4 15
+5 A0
+6 9999

© 2012, 2013 IBM Corporation 48


Example ras_logger Usage
inject a DMA error…

# /usr/lib/ras/ras_logger < tfile

tfile:
+1 DMA_ERR
+2 resourcex
+3 0
+4 15
+5 A0
+6 9999

# errpt -a
---------------------------------------------------------------------------
LABEL: DMA_ERR
IDENTIFIER: 00530EA6
Date/Time: Wed Oct 24 10:11:28 CDT 2012
Sequence Number: 37
Machine Id: 0004A9C6D700
Node Id: hock
Class: H
Type: UNKN
Resource Name: resourcex
Resource Class: NONE
Resource Type: NONE
Location:
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS NUMBER
0000 0015
CHANNEL UNIT ADDRESS
0000 00A0
ERROR CODE
0000 9999
© 2012, 2013 IBM Corporation 49
IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 50


IBM Power Systems

Monitoring SAN storage performance


 For random IO, look at read and write service times from

# iostat -RDTl <interval> <# intervals>


Disks: xfers read write queue time
-------------- -------------------------------- ------------------------------------ ------------------------------------ -------------------------------------- ---------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg min max avg avg serv
act serv serv serv outs serv serv serv outs time time time wqsz sqsz qfull
hdisk1 0.9 12.0K 2.5 0.0 12.0K 0.0 0.0 0.0 0.0 0 0 2.5 9.2 0.6 92.9 0 0 3.7 0.0 71.4 0.0 0.0 0.3 15:58:27
hdisk0 0.8 12.1K 2.6 119.4 12.0K 0.0 4.4 0.1 12.1 0 0 2.5 8.7 0.8 107.0 0 0 3.3 0.0 61.1 0.0 0.0 0.3 15:58:27
hdisk2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 15:58:27
hdisk3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 15:58:27
hdisk4 66.4 58.9M 881.1 58.9M 28.8K 879.1 6.4 0.1 143.5 0 0 2.0 1.8 0.2 27.5 0 0 54.7 0.0 3.6S 48.0 5.0 746.8 15:58:27
hdisk6 66.3 51.1M 797.4 51.0M 24.5K 795.9 7.6 0.1 570.1 0 0 1.5 1.5 0.2 32.9 0 0 51.3 0.0 2.7S 41.0 6.0 678.5 15:58:27
hdisk5 61.9 55.9M 852.9 55.9M 28.5K 850.5 6.0 0.1 120.8 0 0 2.4 1.6 0.1 33.6 0 0 46.1 0.0 3.8S 39.0 5.0 714.7 15:58:27
hdisk7 58.3 55.4M 843.1 55.4M 21.2K 841.9 6.7 0.1 167.6 0 0 1.3 1.3 0.2 20.8 0 0 48.3 0.0 2.1S 40.0 5.0 734.8 15:58:27
hdisk8 42.6 53.5M 729.1 53.5M 3.4K 728.9 5.7 0.1 586.4 0 0 0.2 0.9 0.2 5.9 0 0 54.3 0.0 2.8S 39.0 4.0 687.8 15:58:27
hdisk10 44.1 37.1M 583.0 37.0M 16.9K 582.0 3.7 0.1 467.7 0 0 1.0 1.4 0.2 12.9 0 0 23.1 0.0 1.3S 13.0 2.0 465.0 15:58:27

 Misleading indicators of disk subsystem performance


► %tm_act (percent time active)
● Not meaningful for virtual disks, meaningful for real physical disks
► %iowait
● A measure of CPU idle while there are outstanding IOs

 IOPS, tps, and xfers all refer to the same thing
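To keep a record of these statistics for later analysis, something like the following can be left running (a sketch; the interval, count, and output file name are arbitrary):

# iostat -RDTl 60 60 > /tmp/iostat.$(hostname).$(date +%Y%m%d_%H%M) &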

© 2012, 2013 IBM Corporation 51


IBM Power Systems

Monitoring SAN storage performance


# topas -D
or just press D when in topas

(Screenshot: the topas disk panel, showing Avg. Read Time, Avg. Write Time, and Avg. Queue Wait columns)

© 2012, 2013 IBM Corporation 52


IBM Power Systems

What are reasonable IO service times?


 It depends!
► Random vs. sequential IO
● Concentrate on thruput with sequential IO where we expect poor IO latency
► Small (4-16 KB) vs. large IOs (128 KB and up)
● Larger IOs have longer transfer times
► Disk drive technology
● 10 K RPM vs. 15 K RPM
● Fibre Channel and SAS vs. SATA
● HDD vs SSD
► Using synchronous disk subsystem mirroring or not
● If mirroring, what is inter-site latency?
► Disk subsystem cache size and hit rate
● Read cache vs. write cache
► Short stroked HDDs or not

 HDD IO service times are variable and probabilistic

© 2012, 2013 IBM Corporation 53


IBM Power Systems

Disk IO service times: "ZBR" Geometry

(Diagram: ZBR geometry makes more efficient use of outer track space)

 Multiple interface types


 ATA
 SATA
 SCSI
 FC
 SAS

 If the disk is very busy, IOs will wait for the IOs ahead of them
 Queueing time on the disk (not queueing in the hdisk driver or elsewhere)
© 2012, 2013 IBM Corporation 54
IBM Power Systems

Seagate 7200 RPM SATA HDD performance

 As IOPS increase, IOs queue on the disk and wait for IOs ahead to complete first

© 2012, 2013 IBM Corporation 55


IBM Power Systems

What are reasonable IO service times?

 Assuming the disk isn’t too busy and IOs are not queueing there
 SSD IO service times around 0.2 to 0.4 ms and they can do over 10,000 IOPS

© 2012, 2013 IBM Corporation 56


IBM Power Systems

What are reasonable IO service times?


 Rules of thumb for IO service times for random IO and typical disk subsystems that are not mirroring data
synchronously and using HDDs
► Writes should average <= 2.5 ms
● Typically they will be around 1 ms
► Reads should average < 15 ms
● Typically they will be around 5-10 ms

 For random IO with synchronous mirroring


► Writes will take longer to get to the remote disk subsystem, write to its cache, and return an acknowledgement
► 2.5 ms + round trip latency between sites (light thru fiber travels 1 km in 0.005 ms)

 When using SSDs


► For SSDs on SAN, reads and writes should average < 2.5 ms, typically around 1 ms
► For SSDs attached to Power via SAS adapters without write cache
● Reads and writes should average < 1 ms
 Typically < 0.5 ms
 Writes take longer than reads for SSDs
► What if we don’t know if the data resides on SSDs or HDDs (e.g. in an EasyTier environment)?
● Look to the disk subsystem performance reports

 For sequential IO, don’t worry about IO service times, worry about thruput
► We hope IOs queue, wait and are ready to process

© 2012, 2013 IBM Corporation 57


IBM Power Systems

What IO service times are you experiencing?


# iostat -RDl [interval] [count]
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0

(queue statistics removed for space) or

# iostat -RD hdisk0

System configuration: lcpu=4 drives=35 paths=35 vdisks=2

hdisk0 xfer: %tm_act bps tps bread bwrtn


0.3 26.7K 3.1 19.3K 7.5K
read: rps avgserv minserv maxserv timeouts fails
1.4 1.7 0.4 19.8 0 0
write: wps avgserv minserv maxserv timeouts fails
1.6 0.8 0.6 6.9 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0.0

© 2012, 2013 IBM Corporation 58


IBM Power Systems

What if IO times are worse than that?

 You have a bottleneck somewhere from the hdisk driver to the physical disks
► Possibilities include:
● CPU (local LPAR or VIOS)
● Adapter driver
● Physical host adapter/port
● Overloaded SAN links (unlikely)
● Storage port(s) overloaded
● Disk subsystem processor overloaded
● Physical disks overloaded
● SAN switch buffer credits
● Temporary hardware errors
► Evaluate VIOS, adapter, adapter driver from AIX/VIOS
► Evaluate the storage from the storage side
 If the write IO service times are marginal, the write IO rate is low, and the read IO rate is
high, it’s often not worth worrying about
► Can occur due to caching algorithms in the storage

© 2012, 2013 IBM Corporation 59


IBM Power Systems

What about IO size and sequential IO?


Disks: xfers
-------------- --------------------------------
%tm bps tps bread bwrtn
act
hdisk4 99.6 591.4M 2327.5 590.7M 758.7K

 Large IOs typically imply sequential IO – check your iostat data


 bps/tps = bytes/transaction or bytes/IO
 591.4 MB / 2327.5 tps = 260 KB/IO - likely sequential IO
 Use filemon to examine sequentiality, e.g.:
# filemon -o /tmp/filemon.out -O all,detailed -T 1000000; sleep 60; trcstop
VOLUME: /dev/hdisk4 description: N/A
reads: 9156 (0 errs)
read sizes (blks): avg 149.2 min 8 max 512 sdev 218.2
read times (msec): avg 6.817 min 0.386 max 1635.118 sdev 22.469
read sequences: 7155*
read seq. lengths: avg 191.0 min 8 max 34816 sdev 811.9
writes: 806 (0 errs)
write sizes (blks): avg 352.3 min 8 max 512 sdev 219.2
write times (msec): avg 20.705 min 0.702 max 7556.756 sdev 283.167
write sequences: 377*
write seq. lengths: avg 753.1 min 8 max 8192 sdev 1136.7
seeks: 7531 (75.6%)*

 Here % sequential = 100% - 75.6% = 24.4%


 Perhaps multiple sequential IO threads accessing hdisk4
* Adjacent IOs coalesced into fewer IOs
© 2012, 2013 IBM Corporation 60
IBM Power Systems

A situation you may see


# iostat -lD
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0

 Note the low write rate and high write IO service times

 Disk subsystem cache and algorithms may favor disks doing


sequential or heavy IO relative to disks doing limited IO or no IO
for several seconds
► The idea being to reduce overall IO service times
► Varies among disk subsystems

 Overall performance impact is low due to low write rates

© 2012, 2013 IBM Corporation 61


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 62


IBM Power Systems

Disk subsystem IO bandwidth metrics


 There are many different metrics
► Maximum IOPS for some R/W ratio and IO size
● Simple but misses IO service time
► IOPS vs IO service time graph
● Used for Storage Performance Council SPC-1 benchmark reports
► Maximum MB/s for large block sequential IO for reads and/or for writes
● Similar to part of SPC-2 benchmark reports
► The above metrics for IO to/from disk subsystem cache
► The above metrics for storage ports, LUNs, backend HBAs, processors, etc

 Use a metric appropriate for your application


► Characterize your IO workload with the NMON analyzer
► Consider if sizing for both IOPS and MB/s is needed
► Most commercial applications size for sufficient IOPS

© 2012, 2013 IBM Corporation 63


IBM Power Systems

Challenges measuring disk subsystem IO bandwidth


 If the disk subsystem is being used by other systems during your testing
► Partial and variable results
 If server IO bandwidth < disk subsystem bandwidth
► Usually not a problem with Power, but potentially HBAs can be a bottleneck
► Look out for server benchmarks where the storage is the bottleneck
● Are you measuring the storage or system performance?
 Understanding the disk subsystem architecture
► Are you connected to enough storage ports?
► Are you using all the back end spindles?
► Are you using all the storage resources?
 Help from the storage administrators
 Cache effects
 Tiered storage
► Measure each tier separately
 Existing data on disk
► Stick to 100% read testing – calculate write IOPS bandwidth based on RAID levels
● Sustained RAID 5 write IOPS bandwidth is almost ¼ of read IOPS bandwidth
● Sustained RAID 10 write IOPS bandwidth is almost ½ of read IOPS bandwidth
 Write IOPS will be fast until cache fills up
 Variability in the results
 How the disk subsystem is configured (RAID levels and other settings) affects its IO bandwidth

© 2012, 2013 IBM Corporation 64


IBM Power Systems

Cache effects
 Avoid using AIX file system cache
► Using raw hdisks or LVs is best

 Disk subsystem cache


► Read hit % will be at least (cache size)/(allocated disk space used for testing)
● Test 100% of the space you will allocate from the unit
► Write cache operates at electronic speeds until the cache fills
● Be aware that performance will degrade when cache fills if your write rate is
high enough
 Monitor performance for sufficient time during write tests

 To test IO rates to/from cache, use allocated space < cache size and prime
the cache for reads
► Prime the cache with # cat /dev/rhdisk10 > /dev/null

© 2012, 2013 IBM Corporation 65


IBM Power Systems

The ndisk64 IO load generator


 Generates IOs to raw disks, raw LVs, or files in file systems
 Able to generate IO to multiple devices
 User specified number of threads generating IOs
► Each thread does IOs synchronously

 Sequential or random IO
 Other inputs:
► How long the test should run in seconds
► R/W ratio
► IO size or a set of IO sizes
► There’s more but the above options cover most cases

 Use the character device (e.g. /dev/rhdisk0) for raw IO


 Google ndisk or nstress to get the nstress package which contains
ndisk64

© 2012, 2013 IBM Corporation 66


IBM Power Systems

The ndisk64 IO load generator help


# ndisk64
Command: ndisk64
Usage: ndisk64 version 6.2
Complex Disk tests - sequential or random read and write mixture
ndisk64 -S Seqential Disk I/O test (file or raw device)
-R Random Disk I/O test (file or raw device)
-t <secs> Timed duration of the test in seconds (default 5)
-f <file> use "File" for disk I/O (can be a file or raw device)
-f <list> use separated list of filenames (max 16) [separators :,+]
example: -f f1,f2,f3 or -f /dev/rlv1:/dev/rlv2
-F <file> <file> contains list of filenames, one per line
-M <num> Mutliple processes used to generate I/O
-s <size> file Size, use with K, M or G (mandatory for raw device)
examples: -s 1024K or -s 256M or -s 4G
The default is 32MB
-r <read%> Read percent min=0,max=100 (default 80 =80%read+20%write)
example -r 50 (-r 0 = write only, -r 100 = read only)
-b <size> Block size, use with K, M or G (default 4KB)
-O <size> first byte offset use with K, M or G (times by proc#)
-b <list> or use a colon separated list of block sizes (804400328 max)
example -b 512:1k:2K:8k:1M:2m
-q flush file to disk after each write (fsync())
-Q flush file to disk via open() O_SYNC flag
-i <MB> Use shared memory for I/O MB is the size(max=536874656 MB)
-v Verbose mode = gives extra stats but slower
-l Loging disk I/O mode = see *.log but slower still
-o "cmd" Other command - pretend to be this other cmd when running
Must be the last option on the line
-K num Shared memory key (default 0xdeadbeef) allows multiple programs
Note: is you halt a run, you may have a shared memory
segment left over. Use ipcs and then ipcrm to remove it.
-p Pure = each Sequential thread does read or write not both
-P file Pure with separate file for writers
-z percent Snooze percent - time spent sleeping (default 0)
To make a file use dd, for 8 GB: dd if=/dev/zero of=myfile bs=1M count=8196
For example:
dd if=/dev/zero of=bigfile bs=1m count=1024
ndisk64 -f bigfile -S -r100 -b 4096:8k:64k:1m -t 600
ndisk64 -f bigfile -R -r75 -b 4096:8k:64k:1m -q
ndisk64 -F filelist -R -r75 -b 4096:8k:64k:1m -M 16
ndisk64 -F filelist -R -r75 -b 4096:8k:64k:1m -M 16 -l -v
ndisk64a -A -F filelist -R -r50 -b 4096:8k:64k:1m -M 16 -x 8 -X 64

© 2012, 2013 IBM Corporation 67


IBM Power Systems

Using ndisk64
# lsdev -Cc disk
hdisk0 Available C7-T1-01 MPIO DS4800 Disk
# getconf DISK_SIZE /dev/hdisk0
30720 <- size needed for raw device in MB
# ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Command: ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Synchronous Disk test (regular read/write)
No. of processes = 1
I/O type = Random
Block size = 4096
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 32212254720 bytes = 31457280 KB = 30720 MB
Run time = 20 seconds
Snooze % = 0 percent
----> Running test with block Size=4096 (4KB) .
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 3008 300.7 | 1.17 1202.97 20.00

 Monitor IO service times in another window using # iostat -RDTl <interval>


 Increase the number of threads to get a peak IOPS
 Increase queue_depth until it is >= number of threads
 Increasing the number of threads > 2X queue_depth won’t lead to more IOPS
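To check and raise a LUN's queue depth (a sketch; 64 is an arbitrary value, the allowed maximum depends on the disk's ODM definition, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# lsattr -El hdisk0 -a queue_depth
# chdev -l hdisk0 -a queue_depth=64 -P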

© 2012, 2013 IBM Corporation 68


IBM Power Systems

Using ndisk64 – random read IOPS from a single LUN

DS4800 LUN Read Performance
Threads   IOPS     IO service time
1         300.7    2.8 ms
5         1389.5   3.5 ms
10        2296.8   4.3 ms
15        3020.8   5.0 ms
20        3662.5   5.5 ms
30        4576.2   6.6 ms
40        5114.7   7.8 ms
50        5620.6   8.8 ms
60        5872.4   10.1 ms
70        6099.7   11.4 ms
100       6271.0   16.0 ms
128       6714.0   19.0 ms

(Chart: IO service time in ms vs. IOPS for the data above)

 IOPS for the LUN peaked at 7082 IOPS with service times > 20 ms using 256 threads

© 2012, 2013 IBM Corporation 69


IBM Power Systems

Using ndisk64 – random read IOPS for a disk subsystem


 Ensure you understand the disk subsystem architecture and you are
doing IO to ALL the physical disks and using all the available resources
 You’ll typically need several LUNs, preferably all the same size
 Create a file of hdisk names
# cat hdisk.list
/dev/rhdisk2
/dev/rhdisk3

/dev/rhdisk10
# ndisk64 -R -r 100 -F hdisk.list -s 51200 -t 10 -M 400

Reading file of filenames "hdisk.list"


Command: ndisk64 -R -r 100 -s 51200 -t 10 -M 400 -F hdisk.list
Synchronous Disk test (regular read/write)
No. of processes = 400
I/O type = Random
Block size = 4096
Read-Write = Read Only

Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
TOTALS 1256006 126497.2 | 494.13 Rand procs=400 read=100% bs= 4KB

 Create an IOPS vs. IO service time chart if you like; a loop like the sketch below can collect the data points
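A minimal sketch for collecting those data points (thread counts and run time are illustrative; the grep assumes the TOTALS summary line shown above):

# for t in 50 100 200 400 800; do ndisk64 -R -r 100 -F hdisk.list -s 51200 -t 60 -M $t | grep TOTALS; done
# iostat -RDTl 10    <- in a second window, to capture the matching IO service times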

© 2012, 2013 IBM Corporation 70


IBM Power Systems

Using ndisk64 – other random IO tests


 Write IOPS, or mixes other than 100% reads
 Keep in mind that the write cache can fill up, and IO performance drops once it does

 Measuring disk cache IOPS bandwidth


 Create a LUN so it fits entirely in cache, and prime the cache
# ndisk64 -R -t 10 -f /dev/rperf_testlv -M 128 -s 16M -r 100

Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
TOTALS 545498 54578.0 | 213.20 Rand procs=128 read=100% bs= 4KB

 Measuring storage port IOPS bandwidth


 Disable paths to all but a single port on the storage using chpath (see the sketch after this list)
 Do IO to/from disk cache so that the disks are not a bottleneck

 Measuring host port IOPS bandwidth


 Disable paths to all but a single port on the host using chpath
 Be sure to have enough storage ports
 Do IO to/from disk cache so that the disks are not a bottleneck
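A minimal chpath sketch for the port isolation described above (hdisk and fscsi names are examples; list the paths first to find the real parent adapters and connections):

# lspath -l hdisk2 -F "status name parent connection"    <- show each path, its parent FC port and remote port
# chpath -l hdisk2 -p fscsi1 -s disable                  <- disable all of hdisk2's paths through fscsi1
# chpath -l hdisk2 -p fscsi1 -s enable                   <- re-enable them when the test is done

Add -w <connection> to act on a single remote storage port rather than every path through the adapter, and repeat for each hdisk in the test.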

© 2012, 2013 IBM Corporation 71


IBM Power Systems

Using ndisk64 – sequential IO


 Use one thread per “file” to get data from disk

 Use multiple threads to drive up thruput, but some/most of the data will be from
disk cache

 Have enough LUNs to get the thruput you need

 Use a large IO size, e.g. 256 KB or larger

 Are you measuring the interconnect bandwidth or the storage bandwidth?

 Be aware of the interconnect setup

© 2012, 2013 IBM Corporation 72


IBM Power Systems

Using ndisk64 – sequential IO


# ndisk64 -S -t 10 -f /dev/rhdisk0 -M 1 -s 30720M -r 100 -b 256K
Command: ndisk64 -S -t 10 -f /dev/rhdisk0 -M 1 -s 30720M -r 100 -b 256K
Synchronous Disk test (regular read/write)
No. of processes = 1
I/O type = Sequential
Block size = 262144
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 32212254720 bytes = 31457280 KB = 30720 MB
Run time = 10 seconds
Snooze % = 0 percent
----> Running test with block Size=262144 (256KB) .
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 10368 1036.8 | 259.19 265414.62 10.00
# timex dd if=/dev/rhdisk0 of=/dev/null bs=256K count=4000
4000+0 records in
4000+0 records out

real 3.95 -> (256 KB x 4000)/3.95s = 259.24 MB/s


user 0.00
sys 0.16
# ndisk64 -S -t 10 -f /dev/rhdisk0 -M 4 -s 30720M -r 100 -b 256K
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds

TOTALS 15284 1528.4 | 382.09 Seq procs= 4 read=100% bs=256KB
 This setup has a single 4 Gb FC adapter
 With a 4 Gb SAN, we can get close to 400 MB/s simplex per link
© 2012, 2013 IBM Corporation 73
IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 74


IBM Power Systems

Introduction to AIX IO Tuning

 Tuning IO involves removing logical bottlenecks in the AIX IO stack


 Requires some understanding of the AIX IO stack
 General rule is to increase buffers and queue depths so no IOs wait unnecessarily
due to lack of a resource, but not to send so many IOs to the disk subsystem that
it loses the IO requests

 Four possible situations:


1. No IOs waiting unnecessarily
 No tuning needed
2. Some IOs are waiting and IO service times are good
 Tuning will help
3. Some IOs are waiting and IO service times are poor
 Tuning may or may not help
 Poor IO service times indicate a bottleneck further down the stack and
typically at the storage
 Often needs more storage resources or storage tuning
4. The disk subsystem is losing IOs and IO service times are bad
 Leads to IO retransmissions, error handling code, blocked IO stalls and
crashes

© 2012, 2013 IBM Corporation 75


IBM Power Systems

AIX IO Stack
Application                       Application memory area caches data to avoid IO
Logical file system
Raw disks / Raw LVs
JFS / JFS2 / NFS / Other          NFS caches file attributes; NFS has a cached filesystem for NFS clients
VMM                               JFS and JFS2 cache use extra system RAM
LVM (LVM device drivers)
Multi-path IO driver (optional)
Disk Device Drivers               Queues exist for both adapters and disks
Adapter Device Drivers            Adapter device drivers use DMA for IO
Disk subsystem (optional)         Disk subsystems have read and write cache
Disk                              Disks have memory to store commands/data
Write cache / Read cache or memory area used for IO
© 2012, 2013 IBM Corporation 76
IBM Power Systems

AIX IO Stack – Basic Tunables


Application                       Application memory area size
Logical file system
Raw disks / Raw LVs
JFS / JFS2 / NFS / Other          File system buffers or fsbufs
VMM                               Cache size or use of cache
LVM (LVM device drivers)          Disk buffers or pbufs
Multi-path IO driver (optional)
Disk Device Drivers               Hdisk queue depth
Adapter Device Drivers            Adapter queue depth and DMA
Disk subsystem (optional)         Disk subsystem tunables - varies
Disk
Write cache / Read cache or memory area used for IO
© 2012, 2013 IBM Corporation 77
IBM Power Systems

AIX IO Facts
 Fewer, larger IOs get more thruput than many smaller IOs
 IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
 Adjacent IOs in a file/LV/disk can be coalesced into a single IO
 IOs greater than the maximum IO size supported will be split up (see the check sketched after this list)
 Data layout affects IO performance more than tuning
 The goal is to balance the IOs evenly across the physical disks
 Requires extra work to fix after the fact
 Queues and buffers control the number of in-flight IOs for a structure
 hdisk queue_depth controls the number of in-flight IOs from the hdisk driver for an
hdisk
 A queue_depth of 10 means you can have up to 10 IOs in-flight for the hdisk, while
if more are requested, they will wait until other IOs complete
 file system buffers control the number of in-flight IOs from the file system layer for a
file system
 Reducing real IOs improves application performance, and often also improves IO service
times for the remaining real IOs

© 2012, 2013 IBM Corporation 78


IBM Power Systems

Filesystem and Disk Buffers


# vmstat -v

0 pending disk I/Os blocked with no pbuf
171 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
66 client filesystem I/Os blocked with no fsbuf
17 external pager filesystem I/Os blocked with no fsbuf

 Numbers are counts of temporarily blocked IOs since boot


 blocked count / uptime = rate of IOs blocked/second
 Low rates of blocking implies less improvement from tuning
 For pbufs, use lvmo to increase pv_pbuf_count (see the next slide)
 For psbufs, stop paging (add memory or use less) or add paging spaces
 For filesystem fsbufs, increase numfsbufs with ioo
 For external pager fsbufs, increase j2_dynamicBufferPreallocation with ioo
 For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the
NFS4 equivalents)
 Run # ioo -FL to see defaults, current settings and what’s required to make the changes
go into effect (a short sketch follows below)
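A minimal sketch of the checks and changes above (values are examples only, not recommendations; numfsbufs is a restricted tunable on recent AIX levels, so ioo will ask for confirmation):

# uptime                                          <- to turn the blocked counts into a rate
# vmstat -v | grep -i blocked                     <- the counters shown above
# ioo -FL numfsbufs                               <- current value, default, and when a change takes effect
# ioo -p -o numfsbufs=1024                        <- takes effect as file systems are remounted
# ioo -p -o j2_dynamicBufferPreallocation=64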

© 2012, 2013 IBM Corporation 79


IBM Power Systems

Disk Buffers
# lvmo -v rootvg -a
vgname = rootvg
pv_pbuf_count = 512 Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512 Current pbufs available for the VG
max_vg_pbuf_count = 16384 Max pbufs available for this VG; requires varyoff/varyon of the VG to change
pervg_blocked_io_count = 1243 Delayed IO count since last varyon for this VG
pv_min_pbuf = 512 Minimum number of pbufs added when PV is added to any VG
global_blocked_io_count = 1243 System wide delayed IO count for all VGs and disks

# lvmo -v rootvg -o pv_pbuf_count=1024 Increases pbufs for rootvg and is dynamic

 Check disk buffers for each VG

© 2012, 2013 IBM Corporation 80


IBM Power Systems

Hdisk queue depth tuning


 The queue_depth attribute controls the maximum number of in-flight IOs for the hdisk
 This cannot be changed dynamically – requires varyoff of the VG
# lsattr -El hdisk0
PCM PCM/friend/vscsi Path Control Module False
algorithm fail_over Algorithm True
hcheck_cmd test_unit_rdy Health Check Command True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid cee79e5f30f8a20000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True

# lsattr -Rl hdisk0 -a queue_depth


1...256 (+1) Allowable values for the attribute

© 2012, 2013 IBM Corporation 81


IBM Power Systems

Hdisk queue depth tuning


# iostat -lD hdisk0
System configuration: lcpu=4 drives=35 paths=35 vdisks=2

Disks: xfers read write


-------------------------------------------- ------------------------------------ ------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.1 1.3K 0.2 308.8 955.6 0.0 3.6 0.3 149.2 0 0 0.2 8.3 0.5 219.4 0 0

Disks:        <-------------------------- queue -------------------------->
              avg time   min time   max time   avg wqsz   avg sqsz   serv qfull
hdisk0        6.8        0.0        980.0      0.1        0.0        0.1
(This data reformatted for readability; serv qfull is the rate at which IOs are submitted to a full queue)
# iostat -D hdisk0
System configuration: lcpu=4 drives=35 paths=35 vdisks=2

hdisk0 xfer: %tm_act bps tps bread bwrtn


0.1 1.3K 0.2 308.9 955.6
read: rps avgserv minserv maxserv timeouts fails
0.0 3.6 0.3 149.2 0 0
write: wps avgserv minserv maxserv timeouts fails
0.2 8.3 0.5 219.4 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
6.8 0.0 980.0 0.0 0.0 0.1

 From the application point of view, IO service time is the read/write avg. serv. plus avg
time in the queue
 Where to tune: hdisks with non-zero values for sqfull or avg time in the queue (a quick scan is sketched below)
 Especially with high IOPS
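A rough scan for such hdisks (this assumes qfull is the last column of the iostat -Dl listing, as in the reformatted output above; depending on AIX level, sqfull/qfull may be a count since boot or a rate per second):

# iostat -Dl | awk '$1 ~ /^hdisk/ && $NF+0 > 0 {print $1, $NF}'    <- hdisks reporting a non-zero qfull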

© 2012, 2013 IBM Corporation 82


IBM Power Systems

Hdisk queue depth tuning


 If IO service times are good, and IOs are waiting in the queue, we can eliminate the
wait by increasing queue_depth

# lsattr -HEl hdisk0


attribute value description user_settable
PCM PCM/friend/vscsi Path Control Module False
algorithm fail_over Algorithm True
hcheck_cmd test_unit_rdy Health Check Command True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid 00cee79e5f30f8a20000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True

# lsattr -Rl hdisk0 -a queue_depth


1...256 (+1) <- allowable values for queue_depth for this hdisk
# chdev -l hdisk0 -a queue_depth=8 <- change queue_depth when hdisk not in use
hdisk0 changed
# chdev -l hdisk0 -a queue_depth=8 -P <- change queue_depth when hdisk in use, requires reboot
hdisk0 changed

 The –P flag for chdev makes the change in the ODM and it goes into effect at reboot
 The attribute can be changed without a reboot if you stop using the device

© 2012, 2013 IBM Corporation 83


IBM Power Systems

Hdisk queue depth tuning


 A thruput and IO service time tradeoff (go for thruput)
 As you increase queue_depth, more in-flight IOs will be sent to the disk subsystem
 Expect IO service times to slightly degrade, but thruput to improve
 Allows the disk subsystem to use elevator algorithms to improve thruput
 Reduces actuator seek times
 Conversely, low queue depths help ensure good IO service times, at the cost of less thruput
 But more waiting in the queue

[Diagrams: disk head movement when using the elevator algorithm vs. not using the elevator algorithm]

© 2012, 2013 IBM Corporation 84


IBM Power Systems

Hdisk queue depth tuning


 If the storage has poor IO service times, increasing queue depth may or may not
improve performance
 The storage is already a bottleneck

 If the storage administrator won’t allow greater queue depths, ask for more LUNs

 Potential IOPS = queue_depth/avg. IO service time


e.g. IOPS = 3 / 0.010 = 300 IOPS

 Total in-flight IOs <= sum of the hdisk queue depths

 How much will this help?


 As IOs are often done in parallel, it’s hard to determine
 IO time savings = IOPS x avg time in the queue
e.g. 10,000 IOPS x 0.003 s = 30 seconds of savings each second
 Proportional savings estimate: queue wait time / (queue wait time + IO service time)
 e.g. 3 ms in queue / (3 ms + 5 ms IO service time ) = 37.5% improvement

© 2012, 2013 IBM Corporation 85


IBM Power Systems

FC adapter port tuning


 The num_cmd_elems attribute controls the maximum number of in-flight IOs for the FC port
 The max_xfer_size attribute controls the maximum IO size the adapter will send to the
storage, as well as a memory area to hold IO data
 Doesn’t apply to virtual adapters
 Default memory area is 16 MB at the default max_xfer_size=0x100000
 Memory area is 128 MB for any other allowable value
 This cannot be changed dynamically – requires stopping use of adapter port
# lsattr -El fcs0
DIF_enabled no DIF (T10 protection) enabled True
bus_intr_lvl Bus interrupt level False
bus_io_addr 0xff800 Bus I/O address False
bus_mem_addr 0xffe76000 Bus memory address False
bus_mem_addr2 0xffe78000 Bus memory address False
init_link auto INIT Link flags False
intr_msi_1 209024 Bus interrupt level False
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
tme no Target Mode Enabled True

© 2012, 2013 IBM Corporation 86


IBM Power Systems

FC adapter port queue depth tuning


# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

World Wide Port Name: 0x10000000C99C184E

Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT

FC SCSI Adapter Driver Information <- Look at this section: numbers are counts of blocked IOs since boot
No DMA Resource Count: 452380 <- increase max_xfer_size for large values
No Adapter Elements Count: 726832 <- increase num_cmd_elems for large values
No Command Resource Count: 342000 <- increase num_cmd_elems for large values

FC SCSI Traffic Statistics

Input Bytes: 56443937589435
Output Bytes: 4849112157696
# chdev -l fcs0 -a num_cmd_elems=4096 -a max_xfer_size=0x200000 -P <- requires reboot
fcs0 changed

 Calculate the rate the IOs are blocked


 # blocked / uptime (or since the adapter was made Available)
 Bigger tuning improvements when the rate of blocked IOs is higher
 If you’ve increased num_cmd_elems and max_xfer_size and still get blocked IOs, it
suggests you need another adapter port for more bandwidth
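A short sketch of the rate calculation described above (adapter name is an example):

# uptime
# fcstat fcs0 | grep -E "No DMA Resource|No Adapter Elements|No Command Resource"

Divide each count by the seconds of uptime (or the time since the adapter was configured) to get the blocked IO rate per second.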

© 2012, 2013 IBM Corporation 87


IBM Power Systems

VSCSI adapter queue depth sizing


 VSCSI adapters also have a queue but it’s not tunable

 We ensure we don’t run out of VSCSI queue slots by limiting the number of hdisks using the
adapter, and their individual queue depths
 Adapter queue slots are a resource shared by the hdisks on the adapter
 Max hdisks per adapter (all with the same queue depth) = INT[510 / (queue_depth + 3)]
   More generally, keep the sum of (queue_depth + 3) across the hdisks on the adapter at or below 510
 You can exceed these limits to the extent that the average service queue size is less than the queue depth

hdisk queue depth     Max hdisks per vscsi adapter*
3 (default)           85
10                    39
24                    18
32                    14
64                    7
100                   4
128                   3
252                   2
256                   1

* To assure no blocking of IOs at the vscsi adapter
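A minimal sketch of checking the queue slots consumed on one VSCSI adapter against the 510 limit (vscsi0 is an example):

# lsdev -p vscsi0 -F name | grep hdisk | while read d; do lsattr -El $d -a queue_depth -F value; done | awk '{s += $1 + 3} END {print s, "of 510 queue slots needed"}'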

© 2012, 2013 IBM Corporation 88


IBM Power Systems

NPIV adapter tuning


 The real adapters’ queue slots and DMA memory area are shared by the vFC NPIV adapters

 Tip: Set num_cmd_elems to its maximum value and max_xfer_size to 0x200000 on the
real FC adapter for maximum bandwidth, to avoid having to tune it later. Some
configurations won’t allow this and will result in errors in the error log or devices showing
up as Defined.

 Only tune num_cmd_elems for the vFC adapter based on fcstat statistics

© 2012, 2013 IBM Corporation 89


IBM Power Systems

Asynchronous IO
 Asynchronous IO (aka. AIO) is a programming technique which allows applications to request
a lot of IO without waiting for each IO to complete
 The tuning goal is to ensure sufficient AIO servers when the application uses them
 AIO kernel threads automatically exit after aio_server_inactivity seconds
 AIO kernel threads not used for AIO to raw LVs or CIO mounted file systems
 Only aio_maxservers and aio_maxreqs need to be changed
 Defaults are 21 and 8K respectively per logical CPU
 Set via ioo
 Some may want to adjust minservers for heavy AIO use
 maxservers is the maximum number of AIOs that can be processed at any one time
 maxreqs is the maximum number of AIO requests that can be handled at one time and is
a total for the system (they are queued to the AIO kernel threads)
 Typical values:

Default OLTP SAP


minservers 3 200 400
maxservers 10 800 1200
maxreqs 4096 16384 16384
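A minimal sketch of checking and raising the AIO tunables with ioo (the values follow the OLTP column above and are examples only):

# ioo -L aio_maxservers
# ioo -L aio_maxreqs
# ioo -p -o aio_maxservers=800 -o aio_maxreqs=16384    <- -p also records the values for the next reboot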

© 2012, 2013 IBM Corporation 90


IBM Power Systems

AIO tuning
 Use iostat -A to monitor AIO (or -P for POSIX AIO)
# iostat -A <interval> <number of intervals>
System configuration: lcpu=4 drives=1 ent=0.50
aio: avgc avfc maxg maxf maxr avg-cpu: %user %sys %idle %iow physc %entc
25 6 29 10 4096 30.7 36.3 15.1 17.9 0.0 81.9

Disks: % tm_act Kbps tps Kb_read Kb_wrtn


hdisk0 100.0 61572.0 484.0 8192 53380

 avgc - Average global non-fastpath AIO request count per second for the specified interval
 avfc - Average AIO fastpath request count per second for the specified interval for IOs to
raw LVs (doesn’t include CIO fast path IOs)
 maxg - Maximum non-fastpath AIO request count since the last time this value was fetched
 maxf - Maximum fastpath request count since the last time this value was fetched
 maxr - Maximum AIO requests allowed - the AIO device maxreqs attribute
 If maxg or maxf gets close to maxr or maxservers then increase maxreqs or maxservers
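For a rough check of how many AIO kernel threads are currently running (the process name and ps flags shown are assumptions and may vary by AIX level):

# ps -ek | grep -c aioserver    <- count of AIO server kernel processes started so far
# ioo -L aio_maxservers         <- compare against the per-logical-CPU limit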

© 2012, 2013 IBM Corporation 91


IBM Power Systems

Thank You !

© 2012, 2013 IBM Corporation 92
