Multipathing and SAN Storage Considerations For AIX Administrators
Agenda
Multipath IO Considerations
AIX Notification Techniques
Monitoring SAN storage performance
Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
Basic IO tuning
What is MPIO?
MPIO is an architecture designed by AIX development (released in AIX V5.2)
MPIO is also a commonly used acronym for Multi-Path IO (AIX PCM aka MPIO)
► In this presentation, MPIO refers to the architecture, not to multi-path IO in general
Hdisks can be Available, Defined or non-existent
Paths can also be Available, Defined, Missing or non-existent
Path status can be enabled, disabled or failed if the path is Available
(use chpath command to change status)
Add path: e.g. after installing new adapter and cable to the disk,
run cfgmgr (or cfgmgr -l <adapter>)
One must get the device layer correct before working with the path status layer
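A hedged illustration of working with the two layers (hdisk and adapter names are examples):
# lsdev -Cc disk                          <- device layer: hdisk states
# lspath -l hdisk4                        <- path layer: one line per path with its status
# chpath -l hdisk4 -p fscsi1 -s disable   <- change the status of a path
# cfgmgr -l fscsi1                        <- discover a new path after adding adapter/cabling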
Disk configuration
https://tuf.hds.com/gsc/bin/view/Main/AIXODMUpdates
ftp://ftp.emc.com/pub/elab/aix/ODM_DEFINITIONS/
The disk vendor…
Dictates what multi-path code can be used
Supplies the filesets for the disks and multipath code
Supports the components that they supply
A fileset is loaded to update the ODM to support the storage
AIX then recognizes and appropriately configures the disk
Without this, disks are configured using a generic ODM definition
Performance and error handling may suffer as a result
# lsdev -Pc disk displays supported storage
The multi-path code will be a different fileset
Unless using the MPIO that’s included with AIX
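A quick way to see whether an hdisk picked up a vendor-specific ODM definition or the generic one (output below is illustrative):
# lsdev -Pc disk | grep -i DS4                        <- storage types the installed ODM definitions support
# lsdev -Cc disk
hdisk2 Available C7-T1-01 MPIO DS4800 Disk            <- vendor/MPIO definition in use
hdisk3 Available C7-T1-01 Other FC SCSI Disk Drive    <- generic definition, vendor fileset missing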
Diagram: server and storage connected through FC switches in Fabric 1 and Fabric 2
4 x 4 = 16 paths vs. 2 x 2 + 2 x 2 = 8 paths
With single initiator to single target zoning, both examples would have 4 paths
A popular approach is to use 4 host and 4 storage ports, zoning one host port to one
storage port, yielding 4 paths
If the links aren't busy, there likely won't be much, if any, savings from
use of sophisticated path selection algorithms vs. round robin
Generally, utilization of links is low
Costs of path selection algorithms (could outweigh latency savings)
CPU cycles to choose the best path
Memory to keep track of in-flight IOs down each path, or
Memory to keep track of IO service times down each path
Latency added to the IO to choose the best path
round_robin may more easily ensure balanced IOs across the links for each LUN
● e.g., if the IOs to the LUNs aren't balanced, then it may be difficult to balance the
LUNs and their IO rates across the adapter ports with fail_over
● requires fewer resources than load balancing
Set the path priorities for the VSCSI hdisks so half use one VIOS, and half use the other
Diagram: two VIO Servers, each running its own multi-path code
Multi-path code sets that would be incompatible within a single LPAR can still be used on VIOC
LPARs with NPIV sharing the same physical adapter, provided the incompatible code sets aren't
used on the same LPAR. E.g. PowerPath + EMC and MPIO + DS8000.
Active/Passive controllers
► IOs for a LUN are sent to the primary controller for the LUN, except in failure scenarios
► The storage administrator balances LUNs across the controllers
● Controllers should be active for some LUNs and passive for others
► DS3/4/5000
MPIO support
Storage Subsystem Family | MPIO code | Multi-path algorithm
IBM ESS, DS6000, DS8000, DS3950, DS4000, DS5000, SVC, V7000 | IBM Subsystem Device Driver Path Control Module (SDDPCM) or AIX PCM | fail_over, round_robin and for SDDPCM: load balance, load balance port
DS3/4/5000 in VIOS | AIX FC PCM recommended | fail_over, round_robin
IBM XIV Storage System | AIX FC PCM | fail_over, round_robin
The disk subsystem vendor specifies what multi-path code is supported for their storage
► The disk subsystem vendor supports their storage, the server vendor generally doesn’t
You can mix multi-path code compliant with MPIO and even share adapters
► There may be exceptions. Contact vendor for latest updates.
HP example: “Connection to a common server with different HBAs requires separate
HBA zones for XP, VA, and EVA”
Generally one non-MPIO compliant code set can exist with other MPIO compliant code sets
► Except that SDD and RDAC can be mixed on the same LPAR
► The non-MPIO compliant code must be using its own adapters
● Except RDAC can share adapter ports with MPIO
Devices of a given type use only one multi-path code set
► e.g., you can’t use SDDPCM for one DS8000 and SDD for another DS8000 on the same
AIX instance
Disks using MPIO compliant code sets can share adapter ports
Path priorities
A Priority attribute for paths can be used to specify a preference for which path handles
IOs. How it works depends on whether the hdisk's algorithm attribute is set to
fail_over or round_robin.
The value specified is inverse to the priority, i.e. "1" is the highest priority
algorithm=fail_over
►the path with the highest priority (lowest priority value) handles all the IOs unless there's a path failure.
►Set the primary path to be used by setting its priority value to 1, and the next path's
priority (in case of path failure) to 2, and so on.
►if the path priorities are the same, the primary path will be the first listed for the hdisk
in the CuPath ODM as shown by # odmget CuPath
algorithm=round_robin
►If the priority attributes are the same, then IOs go down each path equally.
►In the case of two paths, if you set path A's priority to 1 and path B's to 255, then for
every IO going down path A, there will be 255 IOs sent down path B.
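A hedged example of setting fail_over priorities (adapter and hdisk names are examples):
# lspath -l hdisk9                           <- shows each path and its parent adapter
# chpath -l hdisk9 -p fscsi0 -a priority=1   <- primary path
# chpath -l hdisk9 -p fscsi1 -a priority=2   <- used only if the primary path fails
# lspath -AEl hdisk9 -p fscsi0               <- verify the path's priority attribute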
Path priorities
# lsattr -El hdisk9
PCM PCM/friend/otherapdisk Path Control Module False
algorithm fail_over Algorithm True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
lun_id 0x5000000000000 Logical Unit Number ID False
node_name 0x20060080e517b6ba FC Node Name False
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
ww_name 0x20160080e517b6ba FC World Wide Name False
…
Note: whether or not path priorities apply depends on the PCM.
With SDDPCM, path priorities only apply when the algorithm used is fail over (fo).
Otherwise, they aren't used.
►Set priorities for half the LUNs to use VIOSa/vscsi0 and half to use
VIOSb/vscsi1
►Uses both VIOSs' CPU and virtual adapters
►algorithm=fail_over is the only option at the VIOC for VSCSI disks
With N series, have the IOs go to the primary controller for the LUN if not using
ALUA (ALUA is preferred)
►When not using ALUA, use the dotpaths utility to set path priorities to ensure most IOs go to
the preferred controller
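For VSCSI disks, a minimal sketch (vscsi and hdisk names are examples); reverse the priorities for half the LUNs:
# lspath -l hdisk4
Enabled hdisk4 vscsi0
Enabled hdisk4 vscsi1
# chpath -l hdisk4 -p vscsi0 -a priority=1   <- this LUN prefers the VIOS behind vscsi0
# chpath -l hdisk4 -p vscsi1 -a priority=2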
hcheck_interval
► Defines how often (1–3600 seconds) the health check is performed on the paths for a device.
When a value of 0 is selected (the default), health checking is disabled
► Preferably set to at least 2X IO timeout value…often 30 seconds
hcheck_mode
► Determines which paths should be checked when the health check capability is used:
● enabled: Sends the healthcheck command down paths with a state of enabled
● failed: Sends the healthcheck command down paths with a state of failed
● nonactive: (Default) Sends the healthcheck command down paths that have no active I/O, including
paths with a state of failed. If the algorithm selected is failover, then the healthcheck command is
also sent on each of the paths that have a state of enabled but have no active IO. If the algorithm
selected is round_robin, then the healthcheck command is only sent on paths with a state of failed,
because the round_robin algorithm keeps all enabled paths active with IO.
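A hedged example of enabling path health checking on an hdisk (values are illustrative):
# lsattr -El hdisk9 -a hcheck_interval -a hcheck_mode
# chdev -l hdisk9 -a hcheck_interval=60 -a hcheck_mode=nonactive -P   <- -P: change recorded in the ODM, takes effect at reboot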
Path Recovery
MPIO will recover failed paths if path health checking is enabled with hcheck_mode=nonactive
or failed and the device has been opened
Trade-offs exist:
► Lots of path health checking can create a lot of SAN traffic
► Automatic recovery requires turning on path health checking for each LUN
► Lots of time between health checks means paths will take longer to recover after repair
► Health checking for a single LUN is often sufficient to monitor all the physical paths,
but not to recover them
SDD and SDDPCM also recover failed paths automatically
In addition, SDDPCM provides a health check daemon to provide an automated method of
reclaiming failed paths to a closed device.
SDDPCM recoverDEDpath attribute – similar to timeout_policy but for all kinds of path errors
► recoverDEDpath=no Default and failed paths stay that way
► recoverDEDpath=yes Allows failed paths to be recovered
► SDDPCM V2.6.3.0 or later
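A sketch for SDDPCM, assuming recoverDEDpath is changed like any other hdisk attribute (verify against the SDDPCM documentation for your level):
# pcmpath query device 4                   <- SDDPCM view of the paths for device 4
# chdev -l hdisk4 -a recoverDEDpath=yes -P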
Agenda
Multipath IO Considerations
AIX Notification Techniques
Monitoring SAN storage performance
Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
Basic IO tuning
One should also set up error notification for path failure, so that someone knows
about it and can correct it before something else fails.
This is accomplished by determining the error that shows up in the error log when a
path fails (via testing), and then
Adding an entry to the errnotify ODM class for that error which calls a script (that you
write) that notifies someone that a path has failed.
Hint: You can use # odmget errnotify to see what the entries (or stanzas) look like,
then you create a stanza and use the odmadd command to add it to the errnotify
class.
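As an illustration only, the stanza could take this shape once testing reveals the error label your configuration logs on path failure (PATH_FAIL_LABEL and the script path are placeholders):
errnotify:
        en_name = "path_fail_notify"
        en_persistenceflg = 1
        en_label = "PATH_FAIL_LABEL"
        en_method = "/usr/local/bin/notify_path_failure.sh $1"
# odmadd /tmp/path_fail_notify.add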
Diagram: four AIX error notification techniques
ODM-Based Notification, Custom Notification, diag Command Diagnostics, Concurrent Error Logging
• ODM-Based
errdemon program uses errnotify ODM class for
error notification
• diag Command Diagnostics
• Custom Notification
Write a shell script to check the error log
periodically
• Concurrent Error Logging
Start errpt -c and each error is then reported
when it occurs.
Error Notification – diag Error Log Analysis
Task Selection (Diagnostics, Advanced Diagnostics, Service Aids, etc.) Menu
Custom Notification
#!/usr/bin/ksh
#######################################################
# Sample script to perform simple error notification  #
#######################################################
errpt > /tmp/error_log_1     # save version 1 of the error log
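A minimal sketch of how such a script might continue (file names, interval and mail recipient are examples):
while true
do
        sleep 300                                # take a new snapshot every 5 minutes
        errpt > /tmp/error_log_2                 # save version 2 of the error log
        cmp -s /tmp/error_log_1 /tmp/error_log_2
        if [ $? -ne 0 ]                          # the logs differ: new entries arrived
        then
                diff /tmp/error_log_1 /tmp/error_log_2 | mail -s "New AIX error log entries" root
        fi
        mv /tmp/error_log_2 /tmp/error_log_1     # the new snapshot becomes the baseline
done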
The user specifies these conditions and actions in the errnotify ODM class
# odmshow errnotify
class errnotify {
        long en_pid;    /* offset: 0xc ( 12) */
        ...
en_aertflg Indicates whether the error can be alerted. For use by alert agents. TRUE or FALSE
en_class Class of the error log entry to match: H-hw S-sw O-from errlogger U-undetermined
en_crcid Specifies the unique error identifier associated with a particular error.
en_dup If set, identifies whether duplicate errors should be matched. TRUE or FALSE
en_err64 If set, identifies whether errors from a 64-bit or 32-bit environment should be matched.
en_label Specifies the label associated with a particular error identifier as defined in errpt –t output
en_method Specifies a user-programmable action to be run when error matches selection criteria
en_name Uniquely identifies the Error Notification object. Name used when removing the object
en_persistenceflg Designates if the object should persist through boot. 0-non-persistent 1-persistent
en_pid Specifies a process ID (PID) for use in identifying the Error Notification object.
en_rclass Identifies the class of the failing resource. Not applicable for software class
en_resource Identifies the name of the failing resource
en_rtype Identifies the type of the failing resource
en_symptom Enables notification of an error accompanied by a symptom string when set to TRUE
en_type Identifies severity of error log entries to match. INFO PEND PERM PERF TEMP UNKN
ODM-based Error Notification: errnotify
# odmadd /tmp/en_sample.add
/tmp/en_sample.add file mails error entry to root each time a disk error
of type PERM is logged. Note use of $n keywords
errnotify:
        en_name = "sample"
        en_persistenceflg = 0
        en_class = "H"
        en_type = "PERM"
        en_rclass = "disk"
        en_method = "errpt -a -l $1 | mail -s 'Disk Error' root"
ODM-based Error Notification: Arguments to Notify Method
errlogger command
allows the system administrator to record messages of up to 1024 bytes in the error log.
# errlogger system hard disk ‘(hdisk0)’ replaced.
Whenever you perform system maintenance activity, it is a good idea to record this activity
in the system error log
clearing entries from the error log
replacing/moving hardware
applying a software fix
re-cabling storage…
ras_logger command
allows the system administrator to record any error from the command line.
log an error from a shell script
test newly-created error templates
Example: /usr/lib/ras/ras_logger < tfile where,
tfile contains the error information using the error's template to determine
how to log the data. The format of the input is the following:
error_label
resource_name
64_bit_flag
detail_data_item1
detail_data_item2
...
# /usr/lib/ras/ras_logger < tfile

tfile:
DMA_ERR
resourcex
0
15
A0
9999

# errpt -a
---------------------------------------------------------------------------
LABEL: DMA_ERR
IDENTIFIER: 00530EA6
Date/Time: Wed Oct 24 10:11:28 CDT 2012
Sequence Number: 37
Machine Id: 0004A9C6D700
Node Id: hock
Class: H
Type: UNKN
Resource Name: resourcex
Resource Class: NONE
Resource Type: NONE
Location:
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS NUMBER
0000 0015
CHANNEL UNIT ADDRESS
0000 00A0
ERROR CODE
0000 9999
Agenda
Multipath IO Considerations
AIX Notification Techniques
Monitoring SAN storage performance
Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
Basic IO tuning
If the disk is very busy, IOs will wait for the IOs ahead of them
Queueing time on the disk (not queueing in the hdisk driver or elsewhere)
As IOPS increase, IOs queue on the disk and wait for IOs ahead to complete first
Assuming the disk isn’t too busy and IOs are not queueing there
SSD IO service times around 0.2 to 0.4 ms and they can do over 10,000 IOPS
For sequential IO, don’t worry about IO service times, worry about thruput
► We hope IOs queue, wait and are ready to process
You have a bottleneck somewhere from the hdisk driver to the physical disks
► Possibilities include:
● CPU (local LPAR or VIOS)
● Adapter driver
● Physical host adapter/port
● Overloaded SAN links (unlikely)
● Storage port(s) overloaded
● Disk subsystem processor overloaded
● Physical disks overloaded
● SAN switch buffer credits
● Temporary hardware errors
► Evaluate VIOS, adapter, adapter driver from AIX/VIOS
► Evaluate the storage from the storage side
If the write IO service times are marginal, the write IO rate is low, and the read IO rate is
high, it’s often not worth worrying about
► Can occur due to caching algorithms in the storage
Note the low write rate and high write IO service times
Agenda
Multipath IO Considerations
AIX Notification Techniques
Monitoring SAN storage performance
Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
Basic IO tuning
Cache effects
Avoid using AIX file system cache
► Using raw hdisks or LVs is best
To test IO rates to/from cache, use allocated space < cache size and prime
the cache for reads
► Prime the cache with # cat /dev/rhdisk10 > /dev/null
Sequential or random IO
Other inputs:
► How long the test should run in seconds
► R/W ratio
► IO size or a set of IO sizes
► There’s more but the above options cover most cases
Using ndisk64
# lsdev -Cc disk
hdisk0 Available C7-T1-01 MPIO DS4800 Disk
# getconf DISK_SIZE /dev/hdisk0
30720 <- size needed for raw device in MB
# ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Command: ndisk64 -R -t 20 -f /dev/rhdisk0 -M 1 -s 30720M -r 100
Synchronous Disk test (regular read/write)
No. of processes = 1
I/O type = Random
Block size = 4096
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 32212254720 bytes = 31457280 KB = 30720 MB
Run time = 20 seconds
Snooze % = 0 percent
----> Running test with block Size=4096 (4KB) .
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 3008 300.7 | 1.17 1202.97 20.00
Chart: IO service time (ms) vs. IOPS for the LUN as the number of threads increases
Threads  IOPS    Avg IO service time
15       3020.8  5.0 ms
20       3662.5  5.5 ms
30       4576.2  6.6 ms
40       5114.7  7.8 ms
50       5620.6  8.8 ms
60       5872.4  10.1 ms
70       6099.7  11.4 ms
100      6271.0  16.0 ms
128      6714.0  19.0 ms
IOPS for the LUN peaked at 7082 IOPS with service times > 20 ms using 256 threads
Use multiple threads to drive up thruput, but some/most of the data will be from
disk cache
Agenda
Multipath IO Considerations
AIX Notification Techniques
Monitoring SAN storage performance
Measuring SAN storage bandwidth
►SAN storage bandwidth metrics
►SAN storage measurement challenges
►The ndisk64 tool
►Small block random IO
►Large block sequential IO
►Bandwidth of SAN storage components
Basic IO tuning
AIX IO Stack
Diagram: the AIX IO stack, from the application (application memory area caches data to
avoid IO) down through the logical file system, or directly to raw LVs and raw disks
AIX IO Facts
Fewer larger IOs get more thruput than more smaller IOs
IOs can be coalesced (good) or split up (bad) as they go thru the IO stack
Adjacent IOs in a file/LV/disk can be coalesced into a single IO
IOs greater than the maximum IO size supported will be split up
Data layout affects IO performance more than tuning
The goal is to balance the IOs evenly across the physical disks
Requires extra work to fix after the fact
Queues and buffers control the number of in-flight IOs for a structure
hdisk queue_depth controls the number of in-flight IOs from the hdisk driver for an
hdisk
A queue_depth of 10 means you can have up to 10 IOs in-flight for the hdisk, while
if more are requested, they will wait until other IOs complete
file system buffers control the number of in-flight IOs from the file system layer for a
file system
Reducing real IOs improves application performance, and often also improves IO service
times for the remaining real IOs
Disk Buffers
# lvmo -v rootvg -a
vgname = rootvg
pv_pbuf_count = 512 Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512 Current pbufs available for the VG
max_vg_pbuf_count = 16384 Max pbufs available for this VG, requires varyoff/varyon of the VG to change
pervg_blocked_io_count = 1243 Delayed IO count since last varyon for this VG
pv_min_pbuf = 512 Minimum number of pbufs added when PV is added to any VG
global_blocked_io_count = 1243 System wide delayed IO count for all VGs and disks
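If pervg_blocked_io_count keeps growing, pbufs can be added per VG; a hedged example (datavg is a placeholder VG name):
# lvmo -v datavg -o pv_pbuf_count=1024   <- pbufs added per PV in this VG
# lvmo -v datavg -a                      <- confirm total_vg_pbufs and watch the blocked IO counts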
# iostat -D hdisk0
System configuration: lcpu=4 drives=35 paths=35 vdisks=2
Disks:   queue (this data reformatted for readability)
         avg time  min time  max time  avg wqsz  avg sqsz  serv qfull
hdisk0   6.8       0.0       980.0     0.1       0.0       0.1
serv qfull = rate at which IOs are submitted to a full queue
From the application point of view, IO service time is the read/write avg. serv. plus avg
time in the queue
Where to tune: hdisks with non-zero values for sqfull or avg time in the queue
Especially with high IOPS
The -P flag for chdev makes the change in the ODM and it goes into effect at reboot
The attribute can be changed without a reboot if you stop using the device
If the storage administrator won’t allow greater queue depths, ask for more LUNs
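A hedged example of raising queue_depth (the hdisk name and value are illustrative; agree on the value with the storage administrator):
# lsattr -El hdisk9 -a queue_depth
# chdev -l hdisk9 -a queue_depth=32 -P   <- change in the ODM, takes effect at reboot
# chdev -l hdisk9 -a queue_depth=32      <- immediate, but only if the hdisk isn't in use (e.g. varyoffvg first)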
We ensure we don’t run out of VSCSI queue slots by limiting the number of hdisks using the
adapter, and their individual queue depths
Adapter queue slots are a resource shared by the hdisks using the adapter
Max hdisks per adapter = INT[510 / (hdisk queue depth + 3)]
(more generally, the sum of (queue_depth + 3) over the hdisks should not exceed 510)

hdisk queue depth   Max hdisks per vscsi adapter*
3 (default)         85
10                  39
24                  18
32                  14
64                  7
100                 4
128                 3
252                 2
256                 1

You can exceed these limits to the extent that the average service queue size is less than the queue depth
* To assure no blocking of IOs at the vscsi adapter
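Worked example: with queue_depth=32 on every hdisk, INT[510 / (32 + 3)] = INT[14.6] = 14 hdisks per vscsi adapter, matching the table.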
Tip: Set num_cmd_elems to its maximum value and max_xfer_size to 0x200000 on the
real FC adapter for maximum bandwidth, to avoid having to tune it later. Some
configurations won't allow this and will result in errors in the error log or devices showing
up as Defined.
Only tune num_cmd_elems for the vFC adapter based on fcstat statistics
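A hedged sketch for a physical FC adapter (the usable maximum for num_cmd_elems depends on the adapter, so query it first):
# lsattr -Rl fcs0 -a num_cmd_elems   <- allowed values/range for this adapter
# chdev -l fcs0 -a num_cmd_elems=2048 -a max_xfer_size=0x200000 -P
# fcstat fcs0                        <- check "No Command Resource Count" before tuning a vFC adapter's num_cmd_elems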
Asynchronous IO
Asynchronous IO (aka. AIO) is a programming technique which allows applications to request
a lot of IO without waiting for each IO to complete
The tuning goal is to ensure sufficient AIO servers when the application uses them
AIO kernel threads automatically exit after aio_server_inactivity seconds
AIO kernel threads not used for AIO to raw LVs or CIO mounted file systems
Only aio_maxservers and aio_maxreqs need to be changed
Defaults are 21 and 8K respectively per logical CPU
Set via ioo
Some may want to adjust minservers for heavy AIO use
maxservers is the maximum number of AIOs that can be processed at any one time
maxreqs is the maximum number of AIO requests that can be handled at one time and is
a total for the system (they are queued to the AIO kernel threads)
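A hedged example of setting the AIO tunables with ioo (values are illustrative, not recommendations):
# ioo -a | grep aio                                   <- current AIO tunable values
# ioo -p -o aio_maxservers=400 -o aio_maxreqs=65536   <- -p also makes the change persistent across reboots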
AIO tuning
Use iostat -A to monitor AIO (or -P for POSIX AIO)
# iostat -A <interval> <number of intervals>
System configuration: lcpu=4 drives=1 ent=0.50
aio: avgc avfc maxg maxf maxr avg-cpu: %user %sys %idle %iow physc %entc
25 6 29 10 4096 30.7 36.3 15.1 17.9 0.0 81.9
avgc - Average global non-fastpath AIO request count per second for the specified interval
avfc - Average AIO fastpath request count per second for the specified interval for IOs to
raw LVs (doesn’t include CIO fast path IOs)
maxg - Maximum non-fastpath AIO request count since the last time this value was fetched
maxf - Maximum fastpath request count since the last time this value was fetched
maxr - Maximum AIO requests allowed - the AIO device maxreqs attribute
If maxg or maxf gets close to maxr or maxservers then increase maxreqs or maxservers
Thank You !