Multipathing and SAN Storage Considerations For AIX Administrators


IBM Power Systems

Multipathing and SAN Storage


Considerations for AIX Administrators

Dan Braden – dbraden@us.ibm.com


John Hock – jrhock@us.ibm.com

IBM Power Systems Advanced Technical Skills


February 28, 2013

© 2013 IBM Corporation


IBM Power Systems

Agenda

 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 2


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 3


IBM Power Systems

What is MPIO?
 MPIO is an architecture designed by AIX development (released in AIX V5.2)
 MPIO is also commonly used as an acronym for Multi-Path IO in general (the AIX PCM is often called MPIO)
► In this presentation, MPIO refers specifically to the architecture, not to multi-path IO in general

 Why was the MPIO architecture developed?


► With the advent of SANs, each disk subsystem vendor wrote their own multi-path code
► These multi-path code sets were usually incompatible
● Mixing disk subsystems was usually not supported on the same system, and if they
were, they usually required their own FC adapters
► Integration with AIX IO error handling and recovery
● Several levels of IO timeouts: basic IO timeout, FC path timeout, etc
 MPIO architecture details available to disk subsystem vendors
► Compliant code requires a Path Control Module (PCM) for each disk subsystem
● AIX PCMs for SCSI and FC ship with AIX and are often used by the vendors
► MPIO allows vendors to develop their own path selection algorithms
► Disk vendors have been moving towards MPIO compliant code

MPIO Common Interface

© 2012, 2013 IBM Corporation 4


IBM Power Systems

Overview of MPIO Architecture

 LUNs show up as an hdisk


►Architected for 32 K paths
►No more than 16 paths are necessary
 PCM: Path Control Module
►AIX PCMs exist for FC, SCSI
►Vendors may write optional PCMs
►May provide commands to manage paths
 Allows various algorithms to balance use
of paths
 Full support for multiple paths to rootvg

Tip: to keep paths <= 16, group sets of 4 host ports and 4 storage ports
and balance LUNs across them

 Hdisks can be Available, Defined or non-existent
 Paths can also be Available, Defined, Missing or non-existent
 Path status can be enabled, disabled or failed if the path is Available
(use the chpath command to change status)
 Add path: e.g. after installing a new adapter and cable to the disk,
run cfgmgr (or cfgmgr -l <adapter>)
 One must get the device layer correct before working with the path status layer

© 2012, 2013 IBM Corporation 5


IBM Power Systems

Disk configuration

Example vendor ODM update downloads:
https://tuf.hds.com/gsc/bin/view/Main/AIXODMUpdates
ftp://ftp.emc.com/pub/elab/aix/ODM_DEFINITIONS/
 The disk vendor…
 Dictates what multi-path code can be used
 Supplies the filesets for the disks and multipath code
 Supports the components that they supply
 A fileset is loaded to update the ODM to support the storage
 AIX then recognizes and appropriately configures the disk
 Without this, disks are configured using a generic ODM definition
 Performance and error handling may suffer as a result
 # lsdev -Pc disk displays supported storage
 The multi-path code will be a different fileset
 Unless using the MPIO that’s included with AIX

Beware of generic “Other” disk definition


No command queuing
Poor Performance & Error Handling
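A quick way to spot disks that fell back to the generic definition (a sketch; the exact description string varies by AIX level, "Other FC SCSI Disk Drive" is typical):

# lsdev -Cc disk | grep -i other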

© 2012, 2013 IBM Corporation 6


IBM Power Systems

AIX Path Control Module (PCM) IO basics


The AIX PCM…
 Is part of the MPIO architecture
 Chooses the path each IO will take
 Is used to balance the use of resources used to connect to the storage
 Depends on the algorithm attribute for each hdisk
 Handles path failures to ensure availability with multiple paths
 Handles path failure recovery
 Checks the status of paths
 Supports boot disks
 Not all multi-path code sets support boot disks
 Offers PCMs for both Fibre Channel and SCSI protocol disks
 Supports active/active, active/passive and ALUA disk subsystems
 But not all disk subsystems
 Supports SCSI-2 and SCSI-3 reserves
 SCSI reserves are often not used

© 2012, 2013 IBM Corporation 7


IBM Power Systems

How many paths for a LUN?

• Paths = (# of paths from server to switch) x (# of paths from storage to switch)
…Here there are potentially 6 paths per LUN
…But reduced via:
• LUN masking at the storage
Assign LUNs to specific FC adapters at the host, and thru specific ports on the storage
• Zoning
WWPN or SAN switch port zoning
• Dual SAN fabrics
divides potential paths by two
• 4 paths per LUN are sufficient for availability and reduce CPU overhead for choosing the path
• Path selection overhead is relatively low, usually negligible
• MPIO has no practical limit on the number of paths
• Other products have path limits
• SDDPCM is limited to 16 paths per LUN

(Diagram: server connected thru an FC switch to storage)

© 2012, 2013 IBM Corporation 8


IBM Power Systems

How many paths for a LUN?, cont’d


Dual SAN Fabric for SAN Zoning Reduces Potential Paths

(Diagram: server and storage connected thru one fabric vs. two separate fabrics)

Single fabric: 4 x 4 = 16 paths        Dual fabric: 2 x 2 + 2 x 2 = 8 paths
With single initiator to single target zoning, both examples would have 4 paths
A popular approach is to use 4 host and 4 storage ports, zoning one host port to one
storage port, yielding 4 paths
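To verify how many paths each LUN actually has, a rough sketch that simply counts lspath lines per disk:

# for d in $(lsdev -Cc disk -F name); do print "$d: $(lspath -l $d 2>/dev/null | wc -l) paths"; done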

© 2012, 2013 IBM Corporation 9


IBM Power Systems

Path selection benefits and costs


 Path selection algorithms choose a path to hopefully minimize the latency added to
an IO to send it over the SAN to the storage
 Latency to send a 4 KB IO over an 8 Gbps SAN link is
4 KB / (8 Gb/s x 0.1 B/b x 1048576 KB/GB) = 0.0048 ms
 Multiple links may be involved, and IOs are round trip
 As compared to fastest IO service times around 1 ms

 If the links aren't busy, there likely won't be much, if any, savings from
use of sophisticated path selection algorithms vs. round robin
Generally utilization of links is low

 Costs of path selection algorithms (could outweigh latency savings)
 CPU cycles to choose the best path
 Memory to keep track of in-flight IOs down each path, or
 Memory to keep track of IO service times down each path
 Latency added to the IO to choose the best path
© 2012, 2013 IBM Corporation 10
IBM Power Systems

Balancing IOs with algorithms fail_over and round_robin

A fail_over algorithm can be efficiently used to balance IOs!


► Any load balancing algorithm must consume CPU and memory resources to determine
the best path to use.
► Using path priorities, it is possible to set up fail_over LUNs so that the loads are
balanced across the available FC adapters.
► Let's use an example with 2 FC adapters. Assume we correctly lay out our data so that
the IOs are balanced across the LUNs (this is usually a best practice). Then if we
assign half the LUNs to FC adapterA and half to FC adapterB, then the IOs are evenly
balanced across the adapters!
► A question to ask is, “If one adapter is handling more IO than another, will this have a
significant impact on IO latency?”
► Since the FC adapters are capable of handling more than 50,000 IOPS then we're
unlikely to bottleneck at the adapter and add significant latency to the IO.

round_robin may more easily ensure balanced IOs across the links for each LUN
● e.g., if the IOs to the LUNs aren't balanced, then it may be difficult to balance the
LUNs and their IO rates across the adapter ports with fail_over
● requires fewer resources than load balancing algorithms
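As a sketch of switching a LUN to round_robin (hdisk4 is a placeholder; with the AIX PCM, round_robin typically also requires reserve_policy=no_reserve, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk4 -a algorithm=round_robin -a reserve_policy=no_reserve -P
# lsattr -El hdisk4 -a algorithm -a reserve_policy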

© 2012, 2013 IBM Corporation 11


IBM Power Systems

Multi-path IO with VIO and VSCSI LUNs

(Diagram: VIO client running the AIX PCM, two VIO Servers each running multi-path code, disk subsystem)

 Two layers of multi-path code: VIOC and VIOS

 VSCSI disks always use the AIX PCM, and all IO for a LUN normally goes thru one VIOS
► algorithm = fail_over only

 Set the path priorities for the VSCSI hdisks so half use one VIOS, and half use the other

 The VIOS uses the multi-path code specified for the disk subsystem

 Typical setup: set the vscsi adapter attribute vscsi_err_recov to fast_fail. The default is
delayed_fail. This will speed up path failover in the event of a VIOS failure.
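For example, on the VIO client (a sketch; vscsi0/vscsi1 are placeholders, and -P defers the change until the adapter is reconfigured or the LPAR is rebooted):

# chdev -l vscsi0 -a vscsi_err_recov=fast_fail -P
# chdev -l vscsi1 -a vscsi_err_recov=fast_fail -P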

© 2012, 2013 IBM Corporation 12


IBM Power Systems

Multi-path IO with VIO and NPIV


 One layer of multi-path code

(Diagram: VIO client with virtual FC adapters (vFC), two VIO Servers with physical HBAs, disk subsystem)

 The VIOC has virtual FC adapters (vFC)
► Potentially one vFC adapter for every real FC adapter in each VIOC
► A maximum of 64 vFC adapters per real FC adapter is recommended

 The VIOC uses the multi-path code that the disk subsystem supports

 IOs for a LUN can go thru both VIOSs

Mixed multi-path code sets, which may be incompatible on a single LPAR, can be used on VIOC
LPARs with NPIV sharing the same physical adapter, provided the incompatible code isn't used
on the same LPAR, e.g. PowerPath for EMC and MPIO for DS8000.
© 2012, 2013 IBM Corporation 13
IBM Power Systems

Active/Active, Active/Passive and ALUA Disk Subsystem Controllers


 Active/Active controllers
► IOs can be sent to any controller for a LUN
► DS8000, DS6000 and XIV

 Active/Passive controllers
► IOs for a LUN are sent to the primary controller for the LUN, except in failure scenarios
► The storage administrator balances LUNs across the controllers
● Controllers should be active for some LUNs and passive for others
► DS3/4/5000

 ALUA – Asymmetric Logical Unit Access


► IOs can be sent to any controller, but one controller is preferred (IOs are passed to the primary)
● Preferred due to performance considerations
► SVC, V7000 and NSeries/NetApp
● Using ALUA on NSeries/NetApp is preferred
 Set on the storage
 MPIO supports Active/Passive and Active/Active disk subsystems
► SVC and V7000 are treated as Active/Passive

 Terminology regarding active/active and active/passive varies considerably


© 2012, 2013 IBM Corporation 14
IBM Power Systems

MPIO support
Storage Subsystem Family | MPIO code | Multi-path algorithm
IBM ESS, DS6000, DS8000, DS3950, DS4000, DS5000, SVC, V7000 | IBM Subsystem Device Driver Path Control Module (SDDPCM) or AIX PCM | fail_over, round_robin, and for SDDPCM: load balance, load balance port
DS3/4/5000 in VIOS | AIX FC PCM recommended | fail_over, round_robin
IBM XIV Storage System | AIX FC PCM | fail_over, round_robin
IBM System Storage N Series | AIX FC PCM | fail_over, round_robin
EMC Symmetrix | AIX FC PCM | fail_over, round_robin
HP & HDS (varies by model) | Hitachi Dynamic Link Manager (HDLM) | fail_over, round robin, extended round robin
HP & HDS (varies by model) | AIX FC PCM | fail_over, round_robin
SCSI | AIX SCSI PCM | fail_over, round_robin
VIO VSCSI | AIX SCSI PCM | fail_over
© 2012, 2013 IBM Corporation 15


IBM Power Systems

Non-MPIO multi-path code

Storage subsystem family | Multi-path code
IBM DS6000, DS8000, SVC, V7000 | SDD
IBM DS4000 | Redundant Disk Array Controller (RDAC)
EMC | PowerPath
HP | AutoPath
HDS | HDLM (older versions)
Veritas-supported storage | Dynamic MultiPathing (DMP)

© 2012, 2013 IBM Corporation 16


IBM Power Systems

Mixing multi-path code sets

 The disk subsystem vendor specifies what multi-path code is supported for their storage
► The disk subsystem vendor supports their storage, the server vendor generally doesn’t
 You can mix multi-path code compliant with MPIO and even share adapters
► There may be exceptions. Contact vendor for latest updates.
HP example: “Connection to a common server with different HBAs requires separate
HBA zones for XP, VA, and EVA”
 Generally only one non-MPIO compliant code set can coexist with MPIO compliant code sets
► Exception: SDD and RDAC can be mixed on the same LPAR
► The non-MPIO compliant code must use its own adapters
● Exception: RDAC can share adapter ports with MPIO
 Devices of a given type use only one multi-path code set
► e.g., you can’t use SDDPCM for one DS8000 and SDD for another DS8000 on the same
AIX instance

© 2012, 2013 IBM Corporation 17


IBM Power Systems

Sharing Fibre Channel Adapter ports

 Disk using MPIO compliant code sets can share adapter ports

 It’s recommended that disk and tape use separate ports

Disk (typically small block random) and
tape (typically large block sequential) IO
are different, and stability issues have
been seen at high IO rates

© 2012, 2013 IBM Corporation 18


IBM Power Systems

MPIO Command Set


 lspath – list paths, path status, path ID, and path attributes for a disk

 chpath – change path status or path attributes


► Enable or disable paths

 rmpath – delete or change path state


► Putting a path into the defined mode means it won’t be used (from available to
defined)
► One cannot define/delete the last path of an open device

 mkpath – add another path to a device or makes a defined path available


► Generally cfgmgr is used to add new paths

 chdev – change a device’s attributes (not specific to MPIO)

 cfgmgr – add new paths to an hdisk or make defined paths available


(not specific to MPIO)

© 2012, 2013 IBM Corporation 19


IBM Power Systems

Useful MPIO Commands


 List status of the paths and the parent device (or adapter)
# lspath -Hl <hdisk#>
 List connection information for a path
# lspath -l hdisk2 -F"status parent connection path_status path_id"
Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3
 The connection field contains the storage port WWPN
► In the case above, paths go to two storage ports and WWPNs:
203900a0b8478dda
201800a0b8478dda
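To see how the paths for all disks are spread across storage ports, a rough sketch that tallies the storage WWPNs from the connection field:

# for d in $(lsdev -Cc disk -F name); do lspath -l $d -F connection 2>/dev/null; done | cut -d, -f1 | sort | uniq -c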
 List a specific path's attributes
# lspath -AEl hdisk2 -p fscsi0 -w "203900a0b8478dda,f000000000000"
scsi_id 0x30400 SCSI ID False
node_name 0x200800a0b8478dda FC Node Name False
priority 1 Priority True

© 2012, 2013 IBM Corporation 20


IBM Power Systems

Path priorities
 A Priority attribute for paths can be used to specify a preference for path
IOs. How it works depends on whether the hdisk's algorithm attribute is set to
fail_over or round_robin.
With fail_over, the value specified is inverse to the priority, i.e. "1" is the highest priority

 algorithm=fail_over
►the path with the highest priority (lowest priority value) handles all the IOs unless there's a path failure.
►Set the primary path to be used by setting its priority value to 1, and the next path's
priority (in case of path failure) to 2, and so on.
►if the path priorities are the same, the primary path will be the first listed for the hdisk
in the CuPath ODM as shown by # odmget CuPath

 algorithm=round_robin
►If the priority attributes are the same, then IOs go down each path equally.
►In the case of two paths, if you set path A's priority to 1 and path B's to 255, then for
every IO going down path A, there will be 255 IOs sent down path B.

 To change the path priority of an MPIO device on a VIO client:


# chpath -l hdisk0 -p vscsi1 -a priority=2
►Set path priorities for VSCSI disks to balance use of VIOSs
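A minimal sketch of alternating priorities across two client adapters (vscsi0/vscsi1 and the even/odd split are assumptions; it also assumes every VSCSI disk has a path thru both adapters):

#!/usr/bin/ksh
# Give vscsi0 the preferred (priority 1) path on every other disk, vscsi1 on the rest
i=0
for d in $(lsdev -Cc disk -s vscsi -F name); do
    if [ $((i % 2)) -eq 0 ]; then p0=1; p1=2; else p0=2; p1=1; fi
    chpath -l $d -p vscsi0 -a priority=$p0
    chpath -l $d -p vscsi1 -a priority=$p1
    i=$((i + 1))
done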

© 2012, 2013 IBM Corporation 21


IBM Power Systems

Path priorities
# lsattr -El hdisk9
PCM PCM/friend/otherapdisk Path Control Module False
algorithm fail_over Algorithm True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
lun_id 0x5000000000000 Logical Unit Number ID False
node_name 0x20060080e517b6ba FC Node Name False
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
ww_name 0x20160080e517b6ba FC World Wide Name False

# lspath -l hdisk9 -F"parent connection status path_status"


fscsi1 20160080e517b6ba,5000000000000 Enabled Available
fscsi1 20170080e517b6ba,5000000000000 Enabled Available

# lspath -AEl hdisk9 -p fscsi1 -w"20160080e517b6ba,5000000000000"


scsi_id 0x10a00 SCSI ID False
node_name 0x20060080e517b6ba FC Node Name False
priority 1 Priority True

Note: whether or not path priorities apply depends on the PCM.
With SDDPCM, path priorities only apply when the algorithm used is fail over (fo).
Otherwise, they aren't used.
© 2012, 2013 IBM Corporation 22
IBM Power Systems

Path priorities – why change them?


 With VIOCs, send the IOs for half the LUNs to one VIOS and half to the other

►Set priorities for half the LUNs to use VIOSa/vscsi0 and half to use
VIOSb/vscsi1
►Uses both VIOSs CPU and virtual adapters
►algorithm=fail_over is the only option at the VIOC for VSCSI disks

 With NSeries – have the IOs go to the primary controller for the LUN if not using
ALUA (ALUA is preferred)
►When not using ALUA, use the dotpaths utility to set path priorities to ensure most IOs go to
the preferred controller

To see to which VIOS a vscsi adapter is connected:


# echo "cvai" | kdb | grep vscsi | grep vhost

vscsi0 0x000007 0x0000000000 0x0 vios1->vhost0


vscsi1 0x000007 0x0000000000 0x0 vios2->vhost1

© 2012, 2013 IBM Corporation 23


IBM Power Systems

Path Health Checking and Recovery


Validates a path is working & automates recovery of failed path
Note: applies to open disks only
 For SDDPCM and MPIO compliant disks, two hdisk attributes apply:
# lsattr -El hdisk26
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True

 hcheck_interval
► Defines how often (1 to 3600 seconds) the health check is performed on the paths for a device.
When a value of 0 is selected (the default), health checking is disabled
► Preferably set to at least 2X the IO timeout value, often 30 seconds

 hcheck_mode
► Determines which paths should be checked when the health check capability is used:

● enabled: Sends the healthcheck command down paths with a state of enabled
● failed: Sends the healthcheck command down paths with a state of failed
● nonactive: (Default) Sends the healthcheck command down paths that have no active I/O, including
paths with a state of failed. If the algorithm selected is failover, then the healthcheck command is
also sent on each of the paths that have a state of enabled but have no active IO. If the algorithm
selected is round_robin, then the healthcheck command is only sent on paths with a state of failed,
because the round_robin algorithm keeps all enabled paths active with IO.
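For example, to turn health checking on for one LUN (a sketch; hdisk26 and 60 seconds are placeholders, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk26 -a hcheck_interval=60 -a hcheck_mode=nonactive -P
# lsattr -El hdisk26 -a hcheck_interval -a hcheck_mode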

 Consider setting up error notification for path failures (later slide)

© 2012, 2013 IBM Corporation 24


IBM Power Systems

Path Recovery
 MPIO will recover failed paths if path health checking is enabled with hcheck_mode=nonactive
or failed and the device has been opened

 Trade-offs exist:
► Lots of path health checking can create a lot of SAN traffic
► Automatic recovery requires turning on path health checking for each LUN
► Lots of time between health checks means paths will take longer to recover after repair
► Health checking for a single LUN is often sufficient to monitor all the physical paths,
but not to recover them
 SDD and SDDPCM also recover failed paths automatically
 In addition, SDDPCM provides a health check daemon to provide an automated method of
reclaiming failed paths to a closed device.

 To manually enable a failed path after repair or re-enable a disabled path:


# chpath -l hdisk1 -p <parent> -w <connection> -s enable
or run cfgmgr or reboot

© 2012, 2013 IBM Corporation 25


IBM Power Systems

Path Recovery With Flaky Links


 When a path fails, it takes AIX time to recognize it, and to redirect in-flight IOs previously sent
down the failed path
► IO stalls during this time, along with processes waiting on the IO
► Turning off a switch port results in a 20 second stall
● Other types of failures may take longer
► AIX must distinguish between slow IOs and path failures

 With flaky paths that go up and down, this can be a problem


 The MPIO timeout_policy attribute for hdisks addresses this for command timeouts
► IZ96396 for AIX 7.1, IZ96302 for AIX 6.1
► timeout_policy=retry_path Default, and similar to the behavior before the attribute existed. The first
occurrence of a command timeout on the path does not cause immediate path failure.
► timeout_policy=fail_path Fail the path on a command timeout; recover it only after several clean health checks
► timeout_policy=disable_path Disable the path and leave it that way
● Manual intervention will be required so be sure to use error notification in this case
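A sketch of setting the policy on one hdisk (hdisk4 is a placeholder; -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# chdev -l hdisk4 -a timeout_policy=fail_path -P
# lsattr -El hdisk4 -a timeout_policy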

 SDDPCM recoverDEDpath attribute – similar to timeout_policy but for all kinds of path errors
► recoverDEDpath=no Default and failed paths stay that way
► recoverDEDpath=yes Allows failed paths to be recovered
► SDDPCM V2.6.3.0 or later

© 2012, 2013 IBM Corporation 26


IBM Power Systems

Path management with AIX PCM


 Includes examining, adding, removing, enabling and disabling paths
► Adapter failure/replacement or addition
► Planned VIOS outages
► Cable failure and replacement
► Storage controller/port failure and repair
 Adapter replacement
► Paths will not be in use if the adapter has failed; the paths will be in the Failed state
1. Remove the adapter and its child devices, including the paths using the adapter, with
# rmdev -Rdl <fcs#>
2. Replace the adapter
3. cfgmgr
4. Check the paths with lspath
 It's better to stop using a path before you know the path will disappear
► Avoid timeouts, application delays or performance impacts and potential error
recovery bugs
► To disable all paths using a specific FC port on the host:
# chpath -l hdisk1 -p <parent> -s disable
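To disable every path thru one host FC port before planned maintenance, a loop such as this can help (a sketch; fscsi0 is a placeholder, and it assumes each disk still has at least one other enabled path):

# for d in $(lsdev -Cc disk -F name); do chpath -l $d -p fscsi0 -s disable 2>/dev/null; done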
© 2012, 2013 IBM Corporation 27
IBM Power Systems

Example: Active/Passive Paths

© 2012, 2013 IBM Corporation 28


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 29


IBM Power Systems

Path Health Checking and Recovery – Notification!

 One should also set up error notification for path failure, so that someone knows
about it and can correct it before something else fails.

 This is accomplished by determining the error that shows up in the error log when a
path fails (via testing), and then

 Adding an entry to the errnotify ODM class for that error which calls a script (that you
write) that notifies someone that a path has failed.

Hint: You can use # odmget errnotify to see what the entries (or stanzas) look like,
then you create a stanza and use the odmadd command to add it to the errnotify
class.
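For illustration only, a sketch of such a stanza and method script; it assumes the label logged on a path failure in your environment is SC_DISK_ERR7 (one of the PATH HAS FAILED labels shown later in this deck; confirm by testing), and the script name is a placeholder:

errnotify:
        en_name = "path_fail_notify"
        en_persistenceflg = 1
        en_class = "H"
        en_label = "SC_DISK_ERR7"
        en_method = "/usr/lib/ras/path_fail_mail.sh $1 $6"

/usr/lib/ras/path_fail_mail.sh:
#!/usr/bin/ksh
# $1 = error log sequence number, $2 = failing resource name (from the $6 keyword)
errpt -a -l $1 | mail -s "Path failure on $(hostname): $2" root

Add the stanza with odmadd and test it, for example with the errlogger or ras_logger techniques covered later.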

© 2012, 2013 IBM Corporation 30


Why Notification?

• Automatic notification of incidents supports the goal of restoring normal IT service
operations as quickly as possible
• Automatic notification of incidents can minimize disruption to users and business
operations
– May allow correction of a problem before a critical outage (e.g. MPIO path failure)
• Facilitates establishment of well-defined and controlled processes for effective handling
of events and alerts
• Notification is defined as an IT Best Practice
– Within the ITIL V3 Service Support Framework: Event & Alert Management
 Event & Alert Management defines monitoring and handling of all events occurring
throughout the IT services and systems

The Information Technology Infrastructure Library (ITIL) is a set of practices for IT service
management that focuses on aligning IT services with the needs of business.

© 2012, 2013 IBM Corporation 31
Why Notification? – Problem Resolution Time

Without Automated Notification:
Incident Occurs -> Incident Recognized -> Support Contacted -> Problem Analysis ->
Solution Determined -> Corrective Action and Testing -> Return to Service

With Automated Notification…Reduced Problem Resolution Time:
The same steps, but incident recognition and contacting support happen immediately,
shortening the overall timeline.

© 2012, 2013 IBM Corporation 32
Error Logging Components in AIX

© 2012, 2013 IBM Corporation 33


Options for Error Notification

(Diagram: the four options: ODM-Based Error Notification, Custom Notification, diag Command
Diagnostics, Concurrent Error Logging)

© 2012, 2013 IBM Corporation 34


Options for Error Notification

• ODM-Based
The errdemon program uses the errnotify ODM class for error notification
• diag Command Diagnostics
The diag command package contains a periodic diagnostic procedure called diagela. Hardware
(only) errors generate mail messages to members of the system group, or other email
addresses, as configured.
• Custom Notification
Write a shell script to check the error log periodically
• Concurrent Error Logging
Start errpt -c and each error is then reported when it occurs.

© 2012, 2013 IBM Corporation 35
Error Notification – diag Error Log Analysis
Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.) Menu

© 2012, 2013 IBM Corporation 36


Concurrent Error Logging - Easy

Start errpt -c to have each error reported when it occurs.

Hint: redirect the output to the console to have an operator informed about each new error entry.
# errpt -c > /dev/console &

© 2012, 2013 IBM Corporation 37


Custom Notification Script
Write a shell script to check the error log periodically

#!/usr/bin/ksh
#######################################################
# Sample script to perform simple error notification  #
#######################################################
errpt > /tmp/error_log_1          # save version 1 of the error log

while true                        # loop forever checking error log
do
    sleep 60                      # wait one minute
    errpt > /tmp/error_log_2      # save version 2 of the error log

    # Compare version 1 and version 2 of the error logs
    # If they are the same, then go back to sleep
    cmp -s /tmp/error_log_1 /tmp/error_log_2 && continue

    # Files are different. A new error log entry detected
    # Send messages to the console and to root user
    print "Warning: error log has changed" > /dev/console
    mail -s "Warning: error log has changed" root <<-EOF
ALERT! Error Log Has Changed ALERT!
EOF
    errpt > /tmp/error_log_1      # save new copy of error log
done                              # Go back to sleep

© 2012, 2013 IBM Corporation 38
ODM-based Error Notification: errnotify

The Error Notification object class specifies the conditions and actions to be taken when
errors are recorded in the system error log.

The user specifies these conditions and actions in the errnotify Error Notification object.

Useful ODM Commands
 odmadd
Adds objects to an object class. The odmadd command takes an ASCII stanza file as input
and populates object classes with the objects found in the stanza file.
 odmdelete
Removes objects from an object class.
 odmshow
Displays the description of an object class.

© 2012, 2013 IBM Corporation 39
ODM-based Error Notification: errnotify
errnotify Description

# odmshow errnotify
class errnotify {
        long en_pid;                /* offset: 0xc  ( 12) */
        char en_name[16];           /* offset: 0x10 ( 16) */
        short en_persistenceflg;    /* offset: 0x20 ( 32) */
        char en_label[20];          /* offset: 0x22 ( 34) */
        ulong en_crcid;             /* offset: 0x38 ( 56) */
        char en_class[2];           /* offset: 0x3c ( 60) */
        char en_type[5];            /* offset: 0x3e ( 62) */
        char en_alertflg[6];        /* offset: 0x43 ( 67) */
        char en_resource[16];       /* offset: 0x49 ( 73) */
        char en_rtype[16];          /* offset: 0x59 ( 89) */
        char en_rclass[16];         /* offset: 0x69 ( 105) */
        char en_symptom[6];         /* offset: 0x79 ( 121) */
        char en_err64[6];           /* offset: 0x7f ( 127) */
        char en_dup[6];             /* offset: 0x85 ( 133) */
        char en_method[255];        /* offset: 0x8b ( 139) */

© 2012, 2013 IBM Corporation 40
ODM-based Error Notification: Object Descriptors

en_alertflg Indicates whether the error can be alerted. For use by alert agents. TRUE or FALSE
en_class Class of the error log entry to match: H-hw S-sw O-from errlogger U-undetermined
en_crcid Specifies the unique error identifier associated with a particular error.
en_dup If set, identifies whether duplicate errors should be matched. TRUE or FALSE
en_err64 If set, identifies whether errors from a 64-bit or 32-bit environment should be matched.
en_label Specifies the label associated with a particular error identifier as defined in errpt –t output
en_method Specifies a user-programmable action to be run when error matches selection criteria
en_name Uniquely identifies the Error Notification object. Name used when removing the object
en_persistenceflg Designates if the object should persist through boot. 0-non-persistent 1-persistent
en_pid Specifies a process ID (PID) for use in identifying the Error Notification object.
en_rclass Identifies the class of the failing resource. Not applicable for software class
en_resource Identifies the name of the failing resource
en_rtype Identifies the type of the failing resource
en_symptom Enables notification of an error accompanied by a symptom string when set to TRUE
en_type Identifies severity of error log entries to match. INFO PEND PERM PERF TEMP UNKN
© 2012, 2013 IBM Corporation 41
ODM-based Error Notification: errnotify

Basic Configuration Steps:
1. Create an ASCII stanza file containing the Error Notification object with
desired conditions and actions (method).
2. Add the object to the errnotify Error Notification object class in the
/etc/objrepos/errnotify file:
odmadd /tmp/en_sample.add
3. Copy any user-written en_method action script to the /usr/lib/ras directory

/tmp/en_sample.add file (mails the error entry to root each time a disk error
of type PERM is logged; note the use of $n keywords):
errnotify:
        en_name = "sample"
        en_persistenceflg = 0
        en_class = "H"
        en_type = "PERM"
        en_rclass = "disk"
        en_method = "errpt -a -l $1 | mail -s 'Disk Error' root"
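To load, verify, and later remove the object (a sketch using the sample name above):

# odmadd /tmp/en_sample.add
# odmget -q "en_name=sample" errnotify
# odmdelete -o errnotify -q "en_name=sample"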
© 2012, 2013 IBM Corporation 42
ODM-based Error Notification: Arguments to Notify
Method

The following keywords are automatically expanded by the Error Notification


daemon as arguments to the notify method:
$1 Sequence number from error log entry
$2 Error ID from error log entry
$3 Class from the error log entry
$4 Type from the error log entry
$5 Alert flags value from the error log entry
$6 Resource name from the error log entry
$7 Resource type from the error log entry
$8 Resource class from the error log entry
$9 Error label from the error log entry

© 2012, 2013 IBM Corporation 43


Notification – What to Monitor?
Path and Fibre Channel Related Errors

# errpt -t | egrep "PATH|FCA"   => 23 unique IDs
02A8BC99 SC_DISK_PCM_ERR8  PERM H PATH HAS FAILED
080784A7 DISK_ERR6         PERM H PATH HAS FAILED
13484BD0 SC_DISK_PCM_ERR16 PERM H PATH ID
14C8887A FCA_ERR10         PERM H COMMUNICATION PROTOCOL ERROR
1D20EC72 FCA_ERR1          PERM H ADAPTER ERROR
1F22F4AA FCA_ERR14         TEMP H DEVICE ERROR
278804AD FCA_ERR5          PERM S SOFTWARE PROGRAM ERROR
2BD0BD1A FCA_ERR9          TEMP H ADAPTER ERROR
3B511B1A FCA_ERR8          UNKN H UNDETERMINED ERROR
40535DDB SC_DISK_PCM_ERR17 PERM H PATH HAS FAILED
7BFEEA1F FCA_ERR4          TEMP H LINK ERROR
84C2184C FCA_ERR3          PERM H LINK ERROR
9CA8C9AD SC_DISK_PCM_ERR12 PERM H PATH HAS FAILED
A6F5AE7C SC_DISK_PCM_ERR9  INFO H PATH HAS RECOVERED
D666A8C7 FCA_ERR2          TEMP H ADAPTER ERROR
DA930415 FCA_ERR11         TEMP H COMMUNICATION PROTOCOL ERROR
DE3B8540 SC_DISK_ERR7      PERM H PATH HAS FAILED
E8F9BA61 CRYPT_ERROR_PATH  INFO H SOFTWARE PROGRAM ERROR
ECCE4018 FCA_ERR6          TEMP S SOFTWARE PROGRAM ERROR
F29DB821 FCA_ERR7          UNKN H UNDETERMINED ERROR

You must test for the common errors in your environment, e.g.:
# errpt -atJ FCA_ERR4
---------------------------------------------------------------------------
IDENTIFIER 7BFEEA1F
Label: FCA_ERR4
Class: H
Type: TEMP
Loggable: YES   Reportable: YES   Alertable: NO
Description
LINK ERROR
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
SENSE DATA

© 2012, 2013 IBM Corporation 44
Notification – What to Monitor?
Disk Related Errors

# errpt -t | egrep "DISK|SAS|SCSI"   => 166 unique IDs!
00B984B3 SC_DISK_ERR5        UNKN H UNDETERMINED ERROR
0118DD96 SISSAS_BAT_P        PERM H BATTERY PACK FAILURE
01A236F0 SC_DISK_PCM_ERR11   PERM H REQUESTED OPERATION CANNOT BE PERFORMED
02A8BC99 SC_DISK_PCM_ERR8    PERM H PATH HAS FAILED
02E74ED4 ICS_ERR11           INFO O Additional iSCSI Adapter Information
03913B94 LVM_HWREL           UNKN H HARDWARE DISK BLOCK RELOCATION ACHIEVED
0502F666 SCSI_ERR1           PERM H ADAPTER ERROR
05EFA03B SC_DISK_PCM_ERR15   PERM H REMOTE VOL MIRRORING: ILLEGAL I/O ORIGIN
0734DA1D DISKETTE_ERR3       PERM H DISKETTE MEDIA ERROR
078ED5D2 SAS_ERR1            PERM H ADAPTER ERROR
080784A7 DISK_ERR6           PERM H PATH HAS FAILED
08F9C47C SC_DISK_PCM_ERR14   PERM H SNAPSHOT REPOSITORY METADATA ERROR
0C10BB8C SC_DISK_PCM_ERR4    INFO H ARRAY CONFIGURATION CHANGED
1081B888 SISSAS_LINK_CABLE   PERM H ADAPTER TO ADAPTER CABLING ERROR
12308453 SC_DISK_PCM_ERR20   PERM H SINGLE CONTROLLER RESTART FAILURE
13484BD0 SC_DISK_PCM_ERR16   PERM H PATH ID
15FD5EE8 SCSI_ARRAY_ERR5     PERM H DISK OPERATION ERROR
16F35C72 DISK_ERR2           PERM H DISK OPERATION ERROR
1AE69D3A FSCSI_ERR9          PERM H POTENTIAL DATA LOSS CONDITION
...
425BDD47 DISK_ERR1           PERM H DISK OPERATION ERROR
44C5506E ISCSI_ERR9          PERM H COMMUNICATION PROTOCOL ERROR
8580332D SISSAS_LINK_CONFIG  PERM H MULTIPLE ADAPTER LINK CONFIGURATION ERRO
85D29B05 SISSAS_ERR16T       TEMP H ARRAY CONFIGURATION ERROR
8647C4E2 DISK_ERR3           PERM H DISK OPERATION ERROR
F43A59CD FSCSI_ERR3          PERM H ADAPTER ERROR
F4C2CCF7 SISIOA_ERR01PD      PERM H SCSI DEVICE OR MEDIA ERROR
F7863CFE SSA_DISK_ERR4       PERM H DISK OPERATION ERROR
F91ADEB5 SISSAS_LOGICAL_BAD  INFO H LOGICAL READ ERROR
FBEE4B29 SISSAS_ERR11P       PERM H SAS FABRIC OR DEVICE ERROR
FBF0BFC1 TMSCSI_UNRECVRD_ERR PERM H ATTACHED SCSI TARGET DEVICE ERROR
FE9E9357 SSA_DEVICE_ERROR    PERM H DISK OPERATION ERROR
FEC31570 SCSI_ERR7           PERM H UNDETERMINED ERROR
FEFD41FF DISK_ERR5           UNKN H UNDETERMINED ERROR

Common Disk Errors to Monitor:
DISK_ERR1: Volume failure. Action: replace the disk
DISK_ERR2/3: Device does not respond. Action: check the power supply
DISK_ERR4: (Temporary) bad block, etc. Action: replace the disk if it persists > 1/wk
SCSI_ERRn: SCSI communication problem. Action: check cables, SCSI addresses, terminator
SAS*: SAS errors

© 2012, 2013 IBM Corporation 45
Error Log - Manual Logging and Testing

errlogger command
allows the system administrator to record messages of up to 1024 bytes in the error log.
# errlogger system hard disk '(hdisk0)' replaced.

Whenever you perform system maintenance activity, it is a good idea to record this activity
in the system error log
clearing entries from the error log
replacing/moving hardware
applying a software fix
re-cabling storage…

The command may also be helpful for testing notification programs:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION


AA8AB241 1023160212 T O OPERATOR OPERATOR NOTIFICATION

# errpt -t | grep OPERATOR


AA8AB241 OPMSG TEMP O OPERATOR NOTIFICATION

© 2012, 2013 IBM Corporation 46


Error Log – Injecting Errors
bookmark this page!

ras_logger command
allows the system administrator to record any error from the command line.
 log an error from a shell script
 test newly-created error templates
Example: /usr/lib/ras/ras_logger < tfile where,
tfile contains the error information using the error's template to determine
how to log the data. The format of the input is the following:

error_label
resource_name
64_bit_flag
detail_data_item1
detail_data_item2
...

• error_label is the error's label defined in the template in /var/adm/ras/errtmplt


• resource_name field is up to 16 characters in length.
• 64_bit_flag field's values are 0 for a 32-bit error and 1 for a 64-bit error.
• detail_data fields correspond to the Detail_Data items in the template.

© 2012, 2013 IBM Corporation 47


Example ras_logger Usage
inject a DMA error…

View the selected error template:
# errpt -atJ DMA_ERR
----------------------------------------------------------------
IDENTIFIER 00530EA6
Label: DMA_ERR
Class: H
Type: UNKN
Loggable: YES   Reportable: YES   Alertable: NO
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS NUMBER
CHANNEL UNIT ADDRESS
ERROR CODE

Then inject the error with an input file whose lines map to the template fields:
# /usr/lib/ras/ras_logger < tfile

tfile:
+1 DMA_ERR
+2 resourcex
+3 0
+4 15
+5 A0
+6 9999

© 2012, 2013 IBM Corporation 48


Example ras_logger Usage
inject a DMA error…

# /usr/lib/ras/ras_logger < tfile

tfile:
+1 DMA_ERR
+2 resourcex
+3 0
+4 15
+5 A0
+6 9999

# errpt -a
---------------------------------------------------------------------------
LABEL: DMA_ERR
IDENTIFIER: 00530EA6
Date/Time: Wed Oct 24 10:11:28 CDT 2012
Sequence Number: 37
Machine Id: 0004A9C6D700
Node Id: hock
Class: H
Type: UNKN
Resource Name: resourcex
Resource Class: NONE
Resource Type: NONE
Location:
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS NUMBER
0000 0015
CHANNEL UNIT ADDRESS
0000 00A0
ERROR CODE
0000 9999
© 2012, 2013 IBM Corporation 49
IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 50


IBM Power Systems

Monitoring SAN storage performance


 For random IO, look at read and write service times from

# iostat -RDTl <interval> <# intervals>


Disks: xfers read write queue time
-------------- -------------------------------- ------------------------------------ ------------------------------------ -------------------------------------- ---------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg min max avg avg serv
act serv serv serv outs serv serv serv outs time time time wqsz sqsz qfull
hdisk1 0.9 12.0K 2.5 0.0 12.0K 0.0 0.0 0.0 0.0 0 0 2.5 9.2 0.6 92.9 0 0 3.7 0.0 71.4 0.0 0.0 0.3 15:58:27
hdisk0 0.8 12.1K 2.6 119.4 12.0K 0.0 4.4 0.1 12.1 0 0 2.5 8.7 0.8 107.0 0 0 3.3 0.0 61.1 0.0 0.0 0.3 15:58:27
hdisk2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 15:58:27
hdisk3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 15:58:27
hdisk4 66.4 58.9M 881.1 58.9M 28.8K 879.1 6.4 0.1 143.5 0 0 2.0 1.8 0.2 27.5 0 0 54.7 0.0 3.6S 48.0 5.0 746.8 15:58:27
hdisk6 66.3 51.1M 797.4 51.0M 24.5K 795.9 7.6 0.1 570.1 0 0 1.5 1.5 0.2 32.9 0 0 51.3 0.0 2.7S 41.0 6.0 678.5 15:58:27
hdisk5 61.9 55.9M 852.9 55.9M 28.5K 850.5 6.0 0.1 120.8 0 0 2.4 1.6 0.1 33.6 0 0 46.1 0.0 3.8S 39.0 5.0 714.7 15:58:27
hdisk7 58.3 55.4M 843.1 55.4M 21.2K 841.9 6.7 0.1 167.6 0 0 1.3 1.3 0.2 20.8 0 0 48.3 0.0 2.1S 40.0 5.0 734.8 15:58:27
hdisk8 42.6 53.5M 729.1 53.5M 3.4K 728.9 5.7 0.1 586.4 0 0 0.2 0.9 0.2 5.9 0 0 54.3 0.0 2.8S 39.0 4.0 687.8 15:58:27
hdisk10 44.1 37.1M 583.0 37.0M 16.9K 582.0 3.7 0.1 467.7 0 0 1.0 1.4 0.2 12.9 0 0 23.1 0.0 1.3S 13.0 2.0 465.0 15:58:27

 Misleading indicators of disk subsystem performance


► %tm_act (percent time active)
● Not meaningful for virtual disks, meaningful for real physical disks
► %iowait
● A measure of CPU idle while there are outstanding IOs

 IOPS, tps, and xfers all refer to the same thing
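To keep a record of these statistics for later analysis, something like the following can be left running (a sketch; the interval, count, and output file name are arbitrary):

# iostat -RDTl 60 60 > /tmp/iostat.$(hostname).$(date +%Y%m%d_%H%M) &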

© 2012, 2013 IBM Corporation 51


IBM Power Systems

Monitoring SAN storage performance


# topas -D
or just press D when in topas

(Screenshot: the topas disk panel, showing Avg. Read Time, Avg. Write Time, and Avg. Queue Wait columns)

© 2012, 2013 IBM Corporation 52


IBM Power Systems

What are reasonable IO service times?


 It depends!
► Random vs. sequential IO
● Concentrate on thruput with sequential IO where we expect poor IO latency
► Small (4-16 KB) vs. large IOs (128 KB and up)
● Larger IOs have longer transfer times
► Disk drive technology
● 10 K RPM vs. 15 K RPM
● Fibre Channel and SAS vs. SATA
● HDD vs SSD
► Using synchronous disk subsystem mirroring or not
● If mirroring, what is inter-site latency?
► Disk subsystem cache size and hit rate
● Read cache vs. write cache
► Short stroked HDDs or not

 HDD IO service times are variable and probabilistic

© 2012, 2013 IBM Corporation 53


IBM Power Systems

Disk IO service times: "ZBR" Geometry

(Diagram: ZBR geometry makes more efficient use of outer track space)

 Multiple interface types


 ATA
 SATA
 SCSI
 FC
 SAS

 If the disk is very busy, IOs will wait for the IOs ahead of them
 Queueing time on the disk (not queueing in the hdisk driver or elsewhere)
© 2012, 2013 IBM Corporation 54
IBM Power Systems

Seagate 7200 RPM SATA HDD performance

 As IOPS increase, IOs queue on the disk and wait for IOs ahead to complete first

© 2012, 2013 IBM Corporation 55


IBM Power Systems

What are reasonable IO service times?

 Assuming the disk isn’t too busy and IOs are not queueing there
 SSD IO service times around 0.2 to 0.4 ms and they can do over 10,000 IOPS

© 2012, 2013 IBM Corporation 56


IBM Power Systems

What are reasonable IO service times?


 Rules of thumb for IO service times for random IO and typical disk subsystems that are not mirroring data
synchronously and using HDDs
► Writes should average <= 2.5 ms
● Typically they will be around 1 ms
► Reads should average < 15 ms
● Typically they will be around 5-10 ms

 For random IO with synchronous mirroring


► Writes will take longer to get to the remote disk subsystem, write to its cache, and return an acknowledgement
► 2.5 ms + round trip latency between sites (light thru fiber travels 1 km in 0.005 ms)

 When using SSDs


► For SSDs on SAN, reads and writes should average < 2.5 ms, typically around 1 ms
► For SSDs attached to Power via SAS adapters without write cache
● Reads and writes should average < 1 ms
 Typically < 0.5 ms
 Writes take longer than reads for SSDs
► What if we don’t know if the data resides on SSDs or HDDs (e.g. in an EasyTier environment)?
● Look to the disk subsystem performance reports

 For sequential IO, don’t worry about IO service times, worry about thruput
► We hope IOs queue, wait and are ready to process

© 2012, 2013 IBM Corporation 57


IBM Power Systems

What IO service times are you experiencing?


# iostat -RDl [interval] [count]
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0

(queue statistics removed for space) or

# iostat -RD hdisk0

System configuration: lcpu=4 drives=35 paths=35 vdisks=2

hdisk0 xfer: %tm_act bps tps bread bwrtn


0.3 26.7K 3.1 19.3K 7.5K
read: rps avgserv minserv maxserv timeouts fails
1.4 1.7 0.4 19.8 0 0
write: wps avgserv minserv maxserv timeouts fails
1.6 0.8 0.6 6.9 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
0.0 0.0 0.0 0.0 0.0 0.0

© 2012, 2013 IBM Corporation 58


IBM Power Systems

What if IO times are worse than that?

 You have a bottleneck somewhere from the hdisk driver to the physical disks
► Possibilities include:
● CPU (local LPAR or VIOS)
● Adapter driver
● Physical host adapter/port
● Overloaded SAN links (unlikely)
● Storage port(s) overloaded
● Disk subsystem processor overloaded
● Physical disks overloaded
● SAN switch buffer credits
● Temporary hardware errors
► Evaluate VIOS, adapter, adapter driver from AIX/VIOS
► Evaluate the storage from the storage side
 If the write IO service times are marginal, the write IO rate is low, and the read IO rate is
high, it’s often not worth worrying about
► Can occur due to caching algorithms in the storage

© 2012, 2013 IBM Corporation 59


IBM Power Systems

What about IO size and sequential IO?


Disks: xfers
-------------- --------------------------------
%tm bps tps bread bwrtn
act
hdisk4 99.6 591.4M 2327.5 590.7M 758.7K

 Large IOs typically imply sequential IO – check your iostat data


 bps/tps = bytes/transaction or bytes/IO
 591.4 MB / 2327.5 tps = 260 KB/IO - likely sequential IO
 Use filemon to examine sequentiality, e.g.:
# filemon -o /tmp/filemon.out -O all,detailed -T 1000000; sleep 60; trcstop
VOLUME: /dev/hdisk4 description: N/A
reads: 9156 (0 errs)
read sizes (blks): avg 149.2 min 8 max 512 sdev 218.2
read times (msec): avg 6.817 min 0.386 max 1635.118 sdev 22.469
read sequences: 7155*
read seq. lengths: avg 191.0 min 8 max 34816 sdev 811.9
writes: 806 (0 errs)
write sizes (blks): avg 352.3 min 8 max 512 sdev 219.2
write times (msec): avg 20.705 min 0.702 max 7556.756 sdev 283.167
write sequences: 377*
write seq. lengths: avg 753.1 min 8 max 8192 sdev 1136.7
seeks: 7531 (75.6%)*

 Here % sequential = 100% - 75.6% = 24.4%


 Perhaps multiple sequential IO threads accessing hdisk4
* Adjacent IOs coalesced into fewer IOs
© 2012, 2013 IBM Corporation 60
IBM Power Systems

A situation you may see


# iostat -lD
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0

 Note the low write rate and high write IO service times

 Disk subsystem cache and algorithms may favor disks doing


sequential or heavy IO relative to disks doing limited IO or no IO
for several seconds
► The idea being to reduce overall IO service times
► Varies among disk subsystems

 Overall performance impact is low due to low write rates

© 2012, 2013 IBM Corporation 61


IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 62


IBM Power Systems

Disk subsystem IO bandwidth metrics


 There are many different metrics
► Maximum IOPS for some R/W ratio and IO size
● Simple but misses IO service time
► IOPS vs IO service time graph
● Used for Storage Performance Council SPC-1 benchmark reports
► Maximum MB/s for large block sequential IO for reads and/or for writes
● Similar to part of SPC-2 benchmark reports
► The above metrics for IO to/from disk subsystem cache
► The above metrics for storage ports, LUNs, backend HBAs, processors, etc

 Use a metric appropriate for your application


► Characterize your IO workload with the NMON analyzer
► Consider if sizing for both IOPS and MB/s is needed
► Most commercial applications size for sufficient IOPS

© 2012, 2013 IBM Corporation 63


IBM Power Systems

Challenges measuring disk subsystem IO bandwidth


 If the disk subsystem is being used by other systems during your testing
► Partial and variable results
 If server IO bandwidth < disk subsystem bandwidth
► Usually not a problem with Power, but potentially HBAs can be a bottleneck
► Look out for server benchmarks where the storage is the bottleneck
● Are you measuring the storage or system performance?
 Understanding the disk subsystem architecture
► Are you connected to enough storage ports?
► Are you using all the back end spindles?
► Are you using all the storage resources?
 Help from the storage administrators
 Cache effects
 Tiered storage
► Measure each tier separately
 Existing data on disk
► Stick to 100% read testing – calculate write IOPS bandwidth based on RAID levels
● Sustained RAID 5 write IOPS bandwidth is almost ¼ of read IOPS bandwidth
● Sustained RAID 10 write IOPS bandwidth is almost ½ of read IOPS bandwidth
 Write IOPS will be fast until cache fills up
 Variability in the results
 How the disk subsystem is configured (RAID levels and other settings) affects its IO bandwidth

© 2012, 2013 IBM Corporation 64


IBM Power Systems

Cache effects
 Avoid using AIX file system cache
► Using raw hdisks or LVs is best

 Disk subsystem cache


► Read hit % will be at least (cache size)/(allocated disk space used for testing)
● Test 100% of the space you will allocate from the unit
► Write cache operates at electronic speeds until the cache fills
● Be aware that performance will degrade when cache fills if your write rate is
high enough
 Monitor performance for sufficient time during write tests

 To test IO rates to/from cache, use allocated space < cache size and prime
the cache for reads
► Prime the cache with # cat /dev/rhdisk10 > /dev/null

© 2012, 2013 IBM Corporation 65


IBM Power Systems

The ndisk64 IO load generator


 Generates IOs to raw disks, raw LVs, or files in file systems
 Able to generate IO to multiple devices
 User specified number of threads generating IOs
► Each thread does IOs synchronously

 Sequential or random IO
 Other inputs:
► How long the test should run in seconds
► R/W ratio
► IO size or a set of IO sizes
► There’s more but the above options cover most cases

 Use the character device (e.g. /dev/rhdisk0) for raw IO


 Google ndisk or nstress to get the nstress package which contains
ndisk64

© 2012, 2013 IBM Corporation 66


IBM Power Systems

The ndisk64 IO load generator help


# ndisk64
Command: ndisk64
Usage: ndisk64 version 6.2
Complex Disk tests - sequential or random read and write mixture
ndisk64 -S Seqential Disk I/O test (file or raw device)
-R Random Disk I/O test (file or raw device)
-t <secs> Timed duration of the test in seconds (default 5)
-f <file> use "File" for disk I/O (can be a file or raw device)
-f <list> use separated list of filenames (max 16) [separators :,+]
example: -f f1,f2,f3 or -f /dev/rlv1:/dev/rlv2
-F <file> <file> contains list of filenames, one per line
-M <num> Mutliple processes used to generate I/O
-s <size> file Size, use with K, M or G (mandatory for raw device)
examples: -s 1024K or -s 256M or -s 4G
The default is 32MB
-r <read%> Read percent min=0,max=100 (default 80 =80%read+20%write)
example -r 50 (-r 0 = write only, -r 100 = read only)
-b <size> Block size, use with K, M or G (default 4KB)
-O <size> first byte offset use with K, M or G (times by proc#)
-b <list> or use a colon separated list of block sizes (804400328 max)
example -b 512:1k:2K:8k:1M:2m
-q flush file to disk after each write (fsync())
-Q flush file to disk via open() O_SYNC flag
-i <MB> Use shared memory for I/O MB is the size(max=536874656 MB)
-v Verbose mode = gives extra stats but slower
-l Loging disk I/O mode = see *.log but slower still
-o "cmd" Other command - pretend to be this other cmd when running
Must be the last option on the line
-K num Shared memory key (default 0xdeadbeef) allows multiple programs
Note: is you halt a run, you may have a shared memory
segment left over. Use ipcs and then ipcrm to remove it.
-p Pure = each Sequential thread does read or write not both
-P file Pure with separate file for writers
-z percent Snooze percent - time spent sleeping (default 0)
To make a file use dd, for 8 GB: dd if=/dev/zero of=myfile bs=1M count=8196
For example:
dd if=/dev/zero of=bigfile bs=1m count=1024
ndisk64 -f bigfile -S -r100 -b 4096:8k:64k:1m -t 600
ndisk64 -f bigfile -R -r75 -b 4096:8k:64k:1m -q
ndisk64 -F filelist -R -r75 -b 4096:8k:64k:1m -M 16
ndisk64 -F filelist -R -r75 -b 4096:8k:64k:1m -M 16 -l -v
ndisk64a -A -F filelist -R -r50 -b 4096:8k:64k:1m -M 16 -x 8 -X 64

© 2012, 2013 IBM Corporation 67


IBM Power Systems

Using ndisk64
# lsdev -Cc disk
hdisk0 Available C7-T1-01 MPIO DS4800 Disk
# getconf DISK_SIZE /dev/hdisk0
30720 <- size needed for raw device in MB
# ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Command: ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Synchronous Disk test (regular read/write)
No. of processes = 1
I/O type = Random
Block size = 4096
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 32212254720 bytes = 31457280 KB = 30720 MB
Run time = 20 seconds
Snooze % = 0 percent
----> Running test with block Size=4096 (4KB) .
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 3008 300.7 | 1.17 1202.97 20.00

 Monitor IO service times in another window using # iostat -RDTl <interval>


 Increase the number of threads to get a peak IOPS
 Increase queue_depth until it is >= number of threads
 Increasing the number of threads > 2X queue_depth won’t lead to more IOPS
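To check and raise a LUN's queue depth (a sketch; 64 is an arbitrary value, the allowed maximum depends on the disk's ODM definition, and -P defers the change until the disk is closed and reconfigured or the system is rebooted):

# lsattr -El hdisk0 -a queue_depth
# chdev -l hdisk0 -a queue_depth=64 -P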

© 2012, 2013 IBM Corporation 68


IBM Power Systems

Using ndisk64 – random read IOPS from a single LUN

DS4800 LUN Read Performance
Threads   IOPS     IO service time
1         300.7    2.8 ms
5         1389.5   3.5 ms
10        2296.8   4.3 ms
15        3020.8   5.0 ms
20        3662.5   5.5 ms
30        4576.2   6.6 ms
40        5114.7   7.8 ms
50        5620.6   8.8 ms
60        5872.4   10.1 ms
70        6099.7   11.4 ms
100       6271.0   16.0 ms
128       6714.0   19.0 ms

(Chart: IO service time in ms vs. IOPS for the data above)

 IOPS for the LUN peaked at 7082 IOPS with service times > 20 ms using 256 threads

© 2012, 2013 IBM Corporation 69


IBM Power Systems

Using ndisk64 – random read IOPS for a disk subsystem


 Ensure you understand the disk subsystem architecture and you are
doing IO to ALL the physical disks and using all the available resources
 You’ll typically need several LUNs, preferably all the same size
 Create a file of hdisk names
# cat hdisk.list
/dev/rhdisk2
/dev/rhdisk3

/dev/rhdisk10
# ndisk64 -R -r 100 -F hdisk.list -s 51200 -t 10 -M 400

Reading file of filenames "hdisk.list"


Command: ndisk64 -R -r 100 -s 51200 -t 10 -M 400 -F hdisk.list
Synchronous Disk test (regular read/write)
No. of processes = 400
I/O type = Random
Block size = 4096
Read-Write = Read Only

Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
TOTALS 1256006 126497.2 | 494.13 Rand procs=400 read=100% bs= 4KB

 Create an IOPS vs. IO service time chart if you like; a loop like the sketch below can collect the data points
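A minimal sketch for collecting those data points (thread counts and run time are illustrative; the grep assumes the TOTALS summary line shown above):

# for t in 50 100 200 400 800; do ndisk64 -R -r 100 -F hdisk.list -s 51200 -t 60 -M $t | grep TOTALS; done
# iostat -RDTl 10    <- in a second window, to capture the matching IO service times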

© 2012, 2013 IBM Corporation 70


IBM Power Systems

Using ndisk64 – other random IO tests


 Write IOPS, or mixes other than 100% reads
 Keep in mind that the write cache can fill up, and IO performance drops once it does

 Measuring disk cache IOPS bandwidth


 Create a LUN so it fits entirely in cache, and prime the cache
# ndisk64 -R -t 10 -f /dev/rperf_testlv -M 128 -s 16M -r 100

Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
TOTALS 545498 54578.0 | 213.20 Rand procs=128 read=100% bs= 4KB

 Measuring storage port IOPS bandwidth


 Disable paths to all but a single port on the storage using chpath (see the sketch after this list)
 Do IO to/from disk cache so that the disks are not a bottleneck

 Measuring host port IOPS bandwidth


 Disable paths to all but a single port on the host using chpath
 Be sure to have enough storage ports
 Do IO to/from disk cache so that the disks are not a bottleneck
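A minimal chpath sketch for the port isolation described above (hdisk and fscsi names are examples; list the paths first to find the real parent adapters and connections):

# lspath -l hdisk2 -F "status name parent connection"    <- show each path, its parent FC port and remote port
# chpath -l hdisk2 -p fscsi1 -s disable                  <- disable all of hdisk2's paths through fscsi1
# chpath -l hdisk2 -p fscsi1 -s enable                   <- re-enable them when the test is done

Add -w <connection> to act on a single remote storage port rather than every path through the adapter, and repeat for each hdisk in the test.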

© 2012, 2013 IBM Corporation 71


IBM Power Systems

Using ndisk64 – sequential IO


 Use one thread per “file” to get data from disk

 Use multiple threads to drive up thruput, but some/most of the data will be from
disk cache

 Have enough LUNs to get the thruput you need

 Use a large IO size, e.g. 256 KB or larger

 Are you measuring the interconnect bandwidth or the storage bandwidth?

 Be aware of the interconnect setup

© 2012, 2013 IBM Corporation 72


IBM Power Systems

Using ndisk64 – sequential IO


# ndisk64 -S -t 10 -f /dev/rhdisk0 -M 1 -s 30720M -r 100 -b 256K
Command: ndisk64 -S -t 10 -f /dev/rhdisk0 -M 1 -s 30720M -r 100 -b 256K
Synchronous Disk test (regular read/write)
No. of processes = 1
I/O type = Sequential
Block size = 262144
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 32212254720 bytes = 31457280 KB = 30720 MB
Run time = 10 seconds
Snooze % = 0 percent
----> Running test with block Size=262144 (256KB) .
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 10368 1036.8 | 259.19 265414.62 10.00
# timex dd if=/dev/rhdisk0 of=/dev/null bs=256K count=4000
4000+0 records in
4000+0 records out

real 3.95 -> (256 KB x 4000)/3.95s = 259.24 MB/s


user 0.00
sys 0.16
# ndisk64 -S -t 10 -f /dev/rhdisk0 -M 4 -s 30720M -r 100 -b 256K
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds

TOTALS 15284 1528.4 | 382.09 Seq procs= 4 read=100% bs=256KB
 This setup has a single 4 Gb FC adapter
 With a 4 Gb SAN, we can get close to 400 MB/s simplex per link
© 2012, 2013 IBM Corporation 73
IBM Power Systems

Agenda
 Multipath IO Considerations
 AIX Notification Techniques
 Monitoring SAN storage performance
 Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
 Basic IO tuning

© 2012, 2013 IBM Corporation 74


IBM Power Systems

Introduction to AIX IO Tuning

 Tuning IO involves removing logical bottlenecks in the AIX IO stack


 Requires some understanding of the AIX IO stack
 General rule is to increase buffers and queue depths so no IOs wait unnecessarily
due to lack of a resource, but not to send so many IOs to the disk subsystem that
it loses the IO requests

 Four possible situations:


1. No IOs waiting unnecessarily
 No tuning needed
2. Some IOs are waiting and IO service times are good
 Tuning will help
3. Some IOs are waiting and IO service times are poor
 Tuning may or may not help
 Poor IO service times indicate a bottleneck further down the stack and
typically at the storage
 Often needs more storage resources or storage tuning
4. The disk subsystem is losing IOs and IO service times are bad
 Leads to IO retransmissions, error handling code, blocked IO stalls and
crashes

© 2012, 2013 IBM Corporation 75


IBM Power Systems

AIX IO Stack
Application                       Application memory area caches data to avoid IO
Logical file system
Raw disks / Raw LVs
JFS / JFS2 / NFS / Other          NFS caches file attributes; NFS has a cached filesystem for NFS clients
VMM                               JFS and JFS2 cache use extra system RAM
LVM (LVM device drivers)
Multi-path IO driver (optional)
Disk Device Drivers               Queues exist for both adapters and disks
Adapter Device Drivers            Adapter device drivers use DMA for IO
Disk subsystem (optional)         Disk subsystems have read and write cache
Disk                              Disks have memory to store commands/data
Write cache / Read cache or memory area used for IO
© 2012, 2013 IBM Corporation 76
IBM Power Systems

AIX IO Stack – Basic Tunables


Application                       Application memory area size
Logical file system
Raw disks / Raw LVs
JFS / JFS2 / NFS / Other          File system buffers or fsbufs
VMM                               Cache size or use of cache
LVM (LVM device drivers)          Disk buffers or pbufs
Multi-path IO driver (optional)
Disk Device Drivers               Hdisk queue depth
Adapter Device Drivers            Adapter queue depth and DMA
Disk subsystem (optional)         Disk subsystem tunables - varies
Disk
Write cache / Read cache or memory area used for IO
© 2012, 2013 IBM Corporation 77
IBM Power Systems

AIX IO Facts
 Fewer, larger IOs get more thruput than many smaller IOs
 IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
 Adjacent IOs in a file/LV/disk can be coalesced into a single IO
 IOs greater than the maximum IO size supported will be split up (see the check sketched after this list)
 Data layout affects IO performance more than tuning
 The goal is to balance the IOs evenly across the physical disks
 Requires extra work to fix after the fact
 Queues and buffers control the number of in-flight IOs for a structure
 hdisk queue_depth controls the number of in-flight IOs from the hdisk driver for an
hdisk
 A queue_depth of 10 means you can have up to 10 IOs in-flight for the hdisk, while
if more are requested, they will wait until other IOs complete
 file system buffers control the number of in-flight IOs from the file system layer for a
file system
 Reducing real IOs improves application performance, and often also improves IO service
times for the remaining real IOs

© 2012, 2013 IBM Corporation 78


IBM Power Systems

Filesystem and Disk Buffers


# vmstat -v

0 pending disk I/Os blocked with no pbuf
171 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
66 client filesystem I/Os blocked with no fsbuf
17 external pager filesystem I/Os blocked with no fsbuf

 Numbers are counts of temporarily blocked IOs since boot


 blocked count / uptime = rate of IOs blocked/second
 Low rates of blocking implies less improvement from tuning
 For pbufs, use lvmo to increase pv_pbuf_count (see the next slide)
 For psbufs, stop paging (add memory or use less) or add paging spaces
 For filesystem fsbufs, increase numfsbufs with ioo
 For external pager fsbufs, increase j2_dynamicBufferPreallocation with ioo
 For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the
NFS4 equivalents)
 Run # ioo -FL to see defaults, current settings and what’s required to make the changes
go into effect (a short sketch follows below)
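A minimal sketch of the checks and changes above (values are examples only, not recommendations; numfsbufs is a restricted tunable on recent AIX levels, so ioo will ask for confirmation):

# uptime                                          <- to turn the blocked counts into a rate
# vmstat -v | grep -i blocked                     <- the counters shown above
# ioo -FL numfsbufs                               <- current value, default, and when a change takes effect
# ioo -p -o numfsbufs=1024                        <- takes effect as file systems are remounted
# ioo -p -o j2_dynamicBufferPreallocation=64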

© 2012, 2013 IBM Corporation 79


IBM Power Systems

Disk Buffers
# lvmo -v rootvg -a
vgname = rootvg
pv_pbuf_count = 512 Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512 Current pbufs available for the VG
max_vg_pbuf_count = 16384 Max pbufs available for this VG; requires varyoff/varyon of the VG to change
pervg_blocked_io_count = 1243 Delayed IO count since last varyon for this VG
pv_min_pbuf = 512 Minimum number of pbufs added when PV is added to any VG
global_blocked_io_count = 1243 System wide delayed IO count for all VGs and disks

# lvmo -v rootvg -o pv_pbuf_count=1024 Increases pbufs for rootvg and is dynamic

 Check disk buffers for each VG

© 2012, 2013 IBM Corporation 80


IBM Power Systems

Hdisk queue depth tuning


 The queue_depth attribute controls the maximum number of in-flight IOs for the hdisk
 This cannot be changed dynamically – requires varyoff of the VG
# lsattr -El hdisk0
PCM PCM/friend/vscsi Path Control Module False
algorithm fail_over Algorithm True
hcheck_cmd test_unit_rdy Health Check Command True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid cee79e5f30f8a20000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True

# lsattr -Rl hdisk0 -a queue_depth


1...256 (+1) Allowable values for the attribute

© 2012, 2013 IBM Corporation 81


IBM Power Systems

Hdisk queue depth tuning


# iostat -lD hdisk0
System configuration: lcpu=4 drives=35 paths=35 vdisks=2

Disks: xfers read write


-------------------------------------------- ------------------------------------ ------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.1 1.3K 0.2 308.8 955.6 0.0 3.6 0.3 149.2 0 0 0.2 8.3 0.5 219.4 0 0

Disks:        <-------------------------- queue -------------------------->
              avg time   min time   max time   avg wqsz   avg sqsz   serv qfull
hdisk0        6.8        0.0        980.0      0.1        0.0        0.1
(This data reformatted for readability; serv qfull is the rate at which IOs are submitted to a full queue)
# iostat -D hdisk0
System configuration: lcpu=4 drives=35 paths=35 vdisks=2

hdisk0 xfer: %tm_act bps tps bread bwrtn


0.1 1.3K 0.2 308.9 955.6
read: rps avgserv minserv maxserv timeouts fails
0.0 3.6 0.3 149.2 0 0
write: wps avgserv minserv maxserv timeouts fails
0.2 8.3 0.5 219.4 0 0
queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull
6.8 0.0 980.0 0.0 0.0 0.1

 From the application point of view, IO service time is the read/write avg. serv. plus avg
time in the queue
 Where to tune: hdisks with non-zero values for sqfull or avg time in the queue (a quick scan is sketched below)
 Especially with high IOPS
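A rough scan for such hdisks (this assumes qfull is the last column of the iostat -Dl listing, as in the reformatted output above; depending on AIX level, sqfull/qfull may be a count since boot or a rate per second):

# iostat -Dl | awk '$1 ~ /^hdisk/ && $NF+0 > 0 {print $1, $NF}'    <- hdisks reporting a non-zero qfull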

© 2012, 2013 IBM Corporation 82


IBM Power Systems

Hdisk queue depth tuning


 If IO service times are good, and IOs are waiting in the queue, we can eliminate the
wait by increasing queue_depth

# lsattr -HEl hdisk0


attribute value description user_settable
PCM PCM/friend/vscsi Path Control Module False
algorithm fail_over Algorithm True
hcheck_cmd test_unit_rdy Health Check Command True
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
max_transfer 0x40000 Maximum TRANSFER Size True
pvid 00cee79e5f30f8a20000000000000000 Physical volume identifier False
queue_depth 3 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True

# lsattr -Rl hdisk0 -a queue_depth


1...256 (+1) <- allowable values for queue_depth for this hdisk
# chdev -l hdisk0 -a queue_depth=8 <- change queue_depth when hdisk not in use
hdisk0 changed
# chdev -l hdisk0 -a queue_depth=8 -P <- change queue_depth when hdisk in use, requires reboot
hdisk0 changed

 The –P flag for chdev makes the change in the ODM and it goes into effect at reboot
 The attribute can be changed without a reboot if you stop using the device

© 2012, 2013 IBM Corporation 83


IBM Power Systems

Hdisk queue depth tuning


 A thruput and IO service time tradeoff (go for thruput)
 As you increase queue_depth, more in-flight IOs will be sent to the disk subsystem
 Expect IO service times to slightly degrade, but thruput to improve
 Allows the disk subsystem to use elevator algorithms to improve thruput
 Reduces actuator seek times
 Conversely, low queue depths help ensure good IO service times, at the cost of less thruput
 But more waiting in the queue

[Diagrams: disk head movement when using the elevator algorithm vs. not using the elevator algorithm]

© 2012, 2013 IBM Corporation 84


IBM Power Systems

Hdisk queue depth tuning


 If the storage has poor IO service times, increasing queue depth may or may not
improve performance
 The storage is already a bottleneck

 If the storage administrator won’t allow greater queue depths, ask for more LUNs

 Potential IOPS = queue_depth/avg. IO service time


e.g. IOPS = 3 / 0.010 = 300 IOPS

 Total in-flight IOs <= sum of the hdisk queue depths

 How much will this help?


 As IOs are often done in parallel, it’s hard to determine
 IO time savings = IOPS x avg time in the queue
e.g. 10,000 IOPS x 0.003 s = 30 seconds of savings each second
 Proportional savings estimate: queue wait time / (queue wait time + IO service time)
 e.g. 3 ms in queue / (3 ms + 5 ms IO service time ) = 37.5% improvement

© 2012, 2013 IBM Corporation 85


IBM Power Systems

FC adapter port tuning


 The num_cmd_elems attribute controls the maximum number of in-flight IOs for the FC port
 The max_xfer_size attribute controls the maximum IO size the adapter will send to the
storage, as well as a memory area to hold IO data
 Doesn’t apply to virtual adapters
 Default memory area is 16 MB at the default max_xfer_size=0x100000
 Memory area is 128 MB for any other allowable value
 This cannot be changed dynamically – requires stopping use of adapter port
# lsattr -El fcs0
DIF_enabled no DIF (T10 protection) enabled True
bus_intr_lvl Bus interrupt level False
bus_io_addr 0xff800 Bus I/O address False
bus_mem_addr 0xffe76000 Bus memory address False
bus_mem_addr2 0xffe78000 Bus memory address False
init_link auto INIT Link flags False
intr_msi_1 209024 Bus interrupt level False
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
tme no Target Mode Enabled True

© 2012, 2013 IBM Corporation 86


IBM Power Systems

FC adapter port queue depth tuning


# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

World Wide Port Name: 0x10000000C99C184E

Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT

FC SCSI Adapter Driver Information <- Look at this section: numbers are counts of blocked IOs since boot
No DMA Resource Count: 452380 <- increase max_xfer_size for large values
No Adapter Elements Count: 726832 <- increase num_cmd_elems for large values
No Command Resource Count: 342000 <- increase num_cmd_elems for large values

FC SCSI Traffic Statistics

Input Bytes: 56443937589435
Output Bytes: 4849112157696
# chdev -l fcs0 -a num_cmd_elems=4096 -a max_xfer_size=0x200000 -P <- requires reboot
fcs0 changed

 Calculate the rate the IOs are blocked


 # blocked / uptime (or since the adapter was made Available)
 Bigger tuning improvements when the rate of blocked IOs is higher
 If you’ve increased num_cmd_elems and max_xfer_size and still get blocked IOs, it
suggests you need another adapter port for more bandwidth
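A short sketch of the rate calculation described above (adapter name is an example):

# uptime
# fcstat fcs0 | grep -E "No DMA Resource|No Adapter Elements|No Command Resource"

Divide each count by the seconds of uptime (or the time since the adapter was configured) to get the blocked IO rate per second.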

© 2012, 2013 IBM Corporation 87


IBM Power Systems

VSCSI adapter queue depth sizing


 VSCSI adapters also have a queue but it’s not tunable

 We ensure we don’t run out of VSCSI queue slots by limiting the number of hdisks using the
adapter, and their individual queue depths
 Adapter queue slots are a resource shared by the hdisks on the adapter
 Max hdisks per adapter (all with the same queue depth) = INT[510 / (queue_depth + 3)]
   More generally, keep the sum of (queue_depth + 3) across the hdisks on the adapter at or below 510
 You can exceed these limits to the extent that the average service queue size is less than the queue depth

hdisk queue depth     Max hdisks per vscsi adapter*
3 (default)           85
10                    39
24                    18
32                    14
64                    7
100                   4
128                   3
252                   2
256                   1

* To assure no blocking of IOs at the vscsi adapter
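A minimal sketch of checking the queue slots consumed on one VSCSI adapter against the 510 limit (vscsi0 is an example):

# lsdev -p vscsi0 -F name | grep hdisk | while read d; do lsattr -El $d -a queue_depth -F value; done | awk '{s += $1 + 3} END {print s, "of 510 queue slots needed"}'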

© 2012, 2013 IBM Corporation 88


IBM Power Systems

NPIV adapter tuning


 The real adapters’ queue slots and DMA memory area are shared by the vFC NPIV adapters

 Tip: Set num_cmd_elems to its maximum value and max_xfer_size to 0x200000 on the
real FC adapter for maximum bandwidth, to avoid having to tune it later. Some
configurations won’t allow this and will result in errors in the error log or devices showing
up as Defined.

 Only tune num_cmd_elems for the vFC adapter based on fcstat statistics

© 2012, 2013 IBM Corporation 89


IBM Power Systems

Asynchronous IO
 Asynchronous IO (aka. AIO) is a programming technique which allows applications to request
a lot of IO without waiting for each IO to complete
 The tuning goal is to ensure sufficient AIO servers when the application uses them
 AIO kernel threads automatically exit after aio_server_inactivity seconds
 AIO kernel threads not used for AIO to raw LVs or CIO mounted file systems
 Only aio_maxservers and aio_maxreqs need to be changed
 Defaults are 21 and 8K respectively per logical CPU
 Set via ioo
 Some may want to adjust minservers for heavy AIO use
 maxservers is the maximum number of AIOs that can be processed at any one time
 maxreqs is the maximum number of AIO requests that can be handled at one time and is
a total for the system (they are queued to the AIO kernel threads)
 Typical values:

Default OLTP SAP


minservers 3 200 400
maxservers 10 800 1200
maxreqs 4096 16384 16384
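A minimal sketch of checking and raising the AIO tunables with ioo (the values follow the OLTP column above and are examples only):

# ioo -L aio_maxservers
# ioo -L aio_maxreqs
# ioo -p -o aio_maxservers=800 -o aio_maxreqs=16384    <- -p also records the values for the next reboot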

© 2012, 2013 IBM Corporation 90


IBM Power Systems

AIO tuning
 Use iostat -A to monitor AIO (or -P for POSIX AIO)
# iostat -A <interval> <number of intervals>
System configuration: lcpu=4 drives=1 ent=0.50
aio: avgc avfc maxg maxf maxr avg-cpu: %user %sys %idle %iow physc %entc
25 6 29 10 4096 30.7 36.3 15.1 17.9 0.0 81.9

Disks: % tm_act Kbps tps Kb_read Kb_wrtn


hdisk0 100.0 61572.0 484.0 8192 53380

 avgc - Average global non-fastpath AIO request count per second for the specified interval
 avfc - Average AIO fastpath request count per second for the specified interval for IOs to
raw LVs (doesn’t include CIO fast path IOs)
 maxg - Maximum non-fastpath AIO request count since the last time this value was fetched
 maxf - Maximum fastpath request count since the last time this value was fetched
 maxr - Maximum AIO requests allowed - the AIO device maxreqs attribute
 If maxg or maxf gets close to maxr or maxservers then increase maxreqs or maxservers
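For a rough check of how many AIO kernel threads are currently running (the process name and ps flags shown are assumptions and may vary by AIX level):

# ps -ek | grep -c aioserver    <- count of AIO server kernel processes started so far
# ioo -L aio_maxservers         <- compare against the per-logical-CPU limit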

© 2012, 2013 IBM Corporation 91


IBM Power Systems

Thank You !

© 2012, 2013 IBM Corporation 92
