Fabric Bottleneck Monitoring
Version: 1.0
Abstract:
Protective Mark: CSC Private
Document Number:
Document Owner: Sumit Arora
Document Approver:

Global Infrastructure & Enterprise Services
Contents
Purpose
Scope
Overview
Status & Query process
Troubleshooting Steps
Amendment Record

Version Control Log

No   Date         By            Nature of change
     09-05-2013   Sumit Arora   New document
Purpose
Troubleshooting document for SAN fabric bottleneck monitoring.
Scope
This document applies to all SAP SAN fabric troubleshooting.
Overview
Bottleneck Detection
A bottleneck is a port in the fabric where frames cannot get through as fast as they should. In other words, a
bottleneck is a port where the offered load is greater than the achieved egress throughput. Bottlenecks can cause
undesirable degradation in throughput on various links. When a bottleneck occurs at one place, other points in the
fabric can experience bottlenecks as the traffic backs up.
The bottleneck detection feature detects two types of bottlenecks:
Latency bottleneck
Congestion bottleneck
A latency bottleneck is a port where the offered load exceeds the rate at which the other end of the link can
continuously accept traffic, but does not exceed the physical capacity of the link. This condition can be caused by
a device attached to the fabric that is slow to process received frames and send back credit returns. A latency
bottleneck due to such a device can spread through the fabric and slow down unrelated flows that share links
with the slow flow. In this case the load does not exceed the physical capacity of the channel; the bottleneck
arises because of an underperforming device connected to the F_Port, or because of back pressure from other
congestion or latency bottlenecks on the E_Ports.
A congestion bottleneck is a port that is unable to transmit frames at the offered rate because the offered rate is
greater than the physical data rate of the line. Congestion bottlenecks arise from link over-utilization: the offered
load exceeds throughput while throughput is already at 100%, so frames attempt to egress at a faster rate than
the line rate allows. For example, this condition can be caused by trying to transfer data at 8 Gbps over a 4 Gbps
ISL.
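The two bottleneck types can be summarized in a small sketch. This is a toy model for illustration only (the function and its inputs are hypothetical, not how Fabric OS classifies ports): a port is congestion-bottlenecked when the offered load exceeds the physical line rate, and latency-bottlenecked when the line could carry the load but the far end cannot absorb it.

```python
def classify_bottleneck(offered_gbps, line_rate_gbps, sink_rate_gbps):
    """Toy classification following the definitions above (hypothetical,
    not the Fabric OS detection algorithm)."""
    if offered_gbps > line_rate_gbps:
        return "congestion"   # offered load exceeds the physical line rate
    if offered_gbps > sink_rate_gbps:
        return "latency"      # link can carry it, but the far end cannot absorb it
    return "none"

# The example from the text: 8 Gbps offered over a 4 Gbps ISL is congestion.
print(classify_bottleneck(8.0, 4.0, 4.0))   # prints "congestion"
```

The same function reports a latency bottleneck when the offered load fits the link but exceeds what the attached device continuously accepts.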
Bottleneck Detection identifies and alerts you about "slow-drain" devices that can cause latency and I/O timeouts.
This capability is particularly valuable for optimizing performance in highly virtualized server environments.
Latency detection is frame-based and identifies buffer credit problems. One of the major strengths of Fibre Channel
is that it creates lossless connections by implementing a flow control scheme based on buffer credits. The
disadvantage of this approach is that the number of available buffers is limited and may eventually be totally
consumed. The temporary unavailability of buffer credits creates a temporary bottleneck; the longer the credits
are unavailable, the more serious the bottleneck.
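The credit mechanism can be illustrated with a minimal tick-based sketch (a hypothetical model, not the Fibre Channel state machine): the sender may transmit only while it holds buffer credits, so a device that is slow to return credits throttles the link even though the physical capacity is never exceeded.

```python
def frames_delivered(ticks, credits, send_rate, credit_return_rate):
    """Toy buffer-credit model (illustrative only). Each tick the sender
    emits up to send_rate frames, but never more than its credit balance;
    the receiver hands back credit_return_rate credits per tick."""
    sent = 0
    for _ in range(ticks):
        burst = min(send_rate, credits)   # no credit, no frame: lossless flow control
        credits -= burst
        sent += burst
        credits += credit_return_rate     # a slow-drain device returns credits slowly
    return sent

# A receiver returning 4 credits/tick sustains the full offered load (40 frames
# in 10 ticks); one returning 1 credit/tick drags the same sender down to 17.
```

Nothing is ever dropped in this model; the slow receiver simply converts its latency into reduced link throughput, which is exactly how a slow-drain device back-pressures the fabric.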
Some temporary credit loss is expected in normal Fibre Channel operation; it is the longer durations that concern
us here. Long periods without buffer credits typically manifest as performance problems and are usually the
result of device latencies. Exceptional situations cause fabric back pressure that can extend all the way across the
fabric and back. Excessive back pressure can create serious problems in an operational SAN. Chronic back
pressure can exacerbate the effect of hardware failures and misbehaving devices, and also contributes to serious
operational issues because existing bottlenecks increase the probability of a failure.
Storage ports (targets) often produce latencies that can slow down applications because they do not
deliver data at the rate expected by the host platform. Even well-architected storage array performance
can deteriorate over time. For example, LUN provisioning policies such as putting too many LUNs
behind a given port can contribute to poor storage performance if the control processor in the
array cannot deliver data from all the LUNs quickly enough to satisfy read requests. The overhead of dealing
with a very large number of LUNs may be the cause of the slow delivery.

Hosts (initiators) may also produce significant latencies by requesting more data than they are capable of
processing in a timely manner.

Distance links can frequently consume all the buffer credits reserved for them and create a serious
bottleneck in the middle of a fabric, with serious consequences for any applications sharing that link.

Misbehaving devices such as defective HBAs can create havoc in a well-constructed SAN and increase
the threat to the fabric.
Bottleneck Detection can detect ports that are blocked due to lost credits and generate special stuck VC and lost
credit alerts for the E_Port with the lost credits.
The slow-drain device may also be "down the road": it could be connected to the adjacent switch, or to another
switch connected to it. Credit starvation typically propagates back pressure across wide areas of the fabric.

The ISL could also simply have too few buffers: perhaps the link is too long, or the average frame size is much
smaller than expected.
It is very important to note that such a bottleneck can spread throughout the fabric and slow down traffic
unrelated to the slow-draining device.
What causes slow-draining devices?
The most common cause is within the device or server itself: if the device is overloaded in terms of CPU or
memory, it may have a hard time handling the data it has requested. Another common cause is devices that have
slower link rates than the rest of the environment.

Queue depth settings can help to mitigate slow-draining devices, which are the most common cause of credit
issues. Queue depth settings limit the number of transactions a device can have open at any one time, whereas
credit issues involve individual frames rather than transactions. By limiting the number of transactions, you can
throttle slow devices down to levels of data that they can consume and prevent them from impacting other devices
that share resources such as ISLs and storage ports. Typically, queue depth settings are set too high for optimum
performance. The SAN Performance Probe allows users to see the true impact of queue depth settings and overall
latency in the environment.
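The throttling effect of queue depth can be approximated with Little's law: a device that completes one I/O in a given service time can sustain at most queue-depth concurrent I/Os, so its achievable throughput is bounded by queue depth times I/O size divided by service time. A sketch with illustrative (not measured) numbers:

```python
def max_throughput_mbps(queue_depth, io_size_kb, service_time_ms):
    """Little's law upper bound on what a device can pull through the fabric:
    IOPS <= queue_depth / service_time. Illustrative numbers only."""
    iops = queue_depth / (service_time_ms / 1000.0)
    return iops * io_size_kb / 1024.0   # MB/s

# Halving the queue depth halves the ceiling a slow device can demand:
# 32 outstanding 8 KB I/Os at 10 ms each -> ~25 MB/s; 16 -> ~12.5 MB/s.
```

This is why lowering queue depth on a slow host reduces the load it can place on shared ISLs and storage ports without touching the fabric itself.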
You can filter the output to display only latency or congestion bottleneck statistics.
c5154556@ls2928:~> rek -l fcd2026 "bottleneckmon --show -latency" | more
==================================================================
Wed May 08 20:08:36 UTC 2013
==================================================================
List of bottlenecked ports in most recent interval:
None
==================================================================
From               To                 Number of
                                      bottlenecked ports
==================================================================
May 08 20:08:26    May 08 20:08:36    0
May 08 20:08:16    May 08 20:08:26    0
May 08 20:08:06    May 08 20:08:16    0
May 08 20:07:56    May 08 20:08:06    0
May 08 20:07:46    May 08 20:07:56    0
May 08 20:07:36    May 08 20:07:46    0
May 08 20:07:26    May 08 20:07:36    0
May 08 20:07:16    May 08 20:07:26    0
May 08 20:07:06    May 08 20:07:16    0
May 08 20:06:56    May 08 20:07:06    0
May 08 20:06:46    May 08 20:06:56    0
May 08 20:06:36    May 08 20:06:46    0
May 08 20:06:26    May 08 20:06:36    0
May 08 20:06:16    May 08 20:06:26    0
May 08 20:06:06    May 08 20:06:16    0
May 08 20:05:56    May 08 20:06:06    0
May 08 20:05:46    May 08 20:05:56    0
May 08 20:05:36    May 08 20:05:46    0
May 08 20:05:26    May 08 20:05:36    0
May 08 20:05:16    May 08 20:05:26    0
May 08 20:05:06    May 08 20:05:16    0
May 08 20:04:56    May 08 20:05:06    0
May 08 20:04:46    May 08 20:04:56    0
May 08 20:04:36    May 08 20:04:46    0
May 08 20:04:26    May 08 20:04:36    0
May 08 20:04:16    May 08 20:04:26    0
May 08 20:04:06    May 08 20:04:16    0
May 08 20:03:56    May 08 20:04:06    0
May 08 20:03:46    May 08 20:03:56    0
May 08 20:03:36    May 08 20:03:46    0
The --show option displays a history of the bottleneck severity for a specified port or for all ports. Each
line of output shows the percentage of one-second intervals affected by bottleneck conditions during the
time window shown on that line. When issued for all ports, the union of all port statistics is displayed in
addition to the individual port statistics. The union value provides a good indicator of the overall bottleneck
severity on the switch.
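When reviewing saved captures offline, a short sketch like the following can flag only the intervals that actually had bottlenecked ports. It assumes the two-timestamp-plus-count layout shown above (a convenience parser, not an official tool):

```python
import re

def bottlenecked_intervals(capture):
    """Return (from, to, count) tuples with count > 0 from a saved
    'bottleneckmon --show' history in the layout shown above."""
    row = re.compile(r"(\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
                     r"(\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+(\d+)")
    hits = []
    for line in capture.splitlines():
        m = row.search(line)
        if m and int(m.group(3)) > 0:
            hits.append((m.group(1), m.group(2), int(m.group(3))))
    return hits
```

Feeding it the capture above returns an empty list, since every interval shows 0 bottlenecked ports.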
Configuration at SAP is default:

Bottleneck detection - Enabled
==============================
Switch-wide alerting parameters:
================================
Alerts                      - Yes
Latency threshold for alert - 0.100
Averaging time for alert    - 300 seconds
Quiet time for alert        - 300 seconds
Troubleshooting Steps:
1.) Link Speed: One possible reason for a bottleneck is the link speed at which the server is operating. Check
the speed the server is operating at with the switchshow command.
Fcd2026:
 12    1   12   200c00   id    N2   Online      FC  F-Port  10:00:00:00:c9:51:58:cb
Fcd2025:
 12    1   12   210c00   id    N2   Online      FC  F-Port  10:00:00:00:c9:51:58:ed
Generally in our environment all ports are set to AUTO, meaning the speed at which a port operates depends on
the speed set on the server side. We can check the status with the command below:
c5154556@ls2928:~> rek -l fcd2025 "portcfgshow" | more
Ports of Slot 1      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
-------------------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---
Speed               AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN  AN
Fill Word            3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
AL_PA Offset 13     ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Trunk Port          ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON
Long Distance       ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
VC Link Init        ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Locked L_Port       ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Locked G_Port       ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON
Disabled E_Port     ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Locked E_Port       ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
ISL R_RDY Mode      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
RSCN Suppressed     ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Persistent Disable  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
NPIV capability     ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON
NPIV PP Limit      126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126
QOS E_Port          AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE  AE
EX Port             ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Mirror Port         ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Rate Limit          ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Credit Recovery     ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON  ON
Fport Buffers       ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
Fault Delay          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Suggestive Action: Raise a ticket with the server team to set the speed to an appropriate level to handle the load.
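For a saved switchshow capture, a quick sketch can flag ports that negotiated below an expected speed. The regular expression assumes the row layout and Nx speed tokens shown above; adjust it for your FOS version:

```python
import re

def slow_ports(switchshow, min_gbps):
    """Ports from a saved 'switchshow' whose negotiated speed Nx is below
    min_gbps; expects rows like '12 1 12 200c00 id N2 Online ...'."""
    row = re.compile(r"^\s*(\d+)\s+\d+\s+\d+\s+[0-9a-f]+\s+\S+\s+N(\d+)\s+Online")
    return [(int(m.group(1)), int(m.group(2)))          # (port index, Gbps)
            for m in map(row.match, switchshow.splitlines())
            if m and int(m.group(2)) < min_gbps]
```

Against the fcd2026 row above, slow_ports(text, 4) would flag port index 12 running at 2 Gbps.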
2.) Port Errors: In most cases an erroneous port on the switch is the reason for latency and leads to
bottleneck problems in the fabric, so we need to look for port errors and contain them before they spread
through the entire fabric over the ISLs.

Slow-drain devices lead to latency, and latency leads to timeouts and hence C3 discards; these errors result in
bottlenecks and performance degradation.
          frames      enc  crc  crc    too   too   bad  enc  disc  link  loss  loss  frjt  fbsy
       tx     rx      in   err  g_eof  shrt  long  eof        c3    fail  sync  sig
=============================================================================================
149:   0      0      146  292    0
151:   0      0      146  292    0     0     0     0

[Document Ref: <CSCSAP-011>, Issues v4.0]
Suggestive Action: Please refer to the document on dealing with different kinds of port errors, located at the
below-mentioned location:
\\usphlhost\storage\CSC\SAP Germany\Fabric Troubleshooting
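Because absolute counter values matter less than their growth, comparing two porterrshow polls taken a few minutes apart shows which ports are actively discarding. A minimal sketch, assuming the C3 discard counts have already been extracted into per-port dictionaries (the function and sample numbers are illustrative):

```python
def rising_c3_discards(poll1, poll2):
    """Ports whose Class-3 discard counter grew between two polls.
    poll1/poll2 map port index -> disc c3 count (already parsed out of
    porterrshow, whose column layout varies by FOS version)."""
    return {port: poll2[port] - poll1[port]
            for port in poll2
            if poll2[port] > poll1.get(port, 0)}

# rising_c3_discards({149: 146, 151: 146}, {149: 146, 151: 210})
# flags only port 151, with 64 new discards.
```

A static counter usually reflects an old incident; a rising one points at an active slow-drain problem.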
3.) SFP: A faulty SFP is the cause of the bottleneck in many cases; Tx/Rx power and SFP speed are the
determining factors of SFP health.
Suggestive Action: If the SFP is faulty, please raise a case with EMC and get it replaced.
4.) Cabling: Sometimes the cabling is not proper and has many folds, which leads to errors and bad signal
strength and also results in bottleneck problems.
Suggestive Action: Raise a case with the cabling team to clean the cable; if the issue still appears, get the cable
changed.
5.) Throughput: In most cases we should check the throughput achieved on the trunks created in the fabric to
get an idea of bottlenecks at the ISL level.
c5154556@ls2928:~> rek -l fcd2019 "trunkshow -perf" | more
1: 18-> 64 10:00:00:05:33:9c:f3:00 29 deskew 15 MASTER
19-> 65 10:00:00:05:33:9c:f3:00 29 deskew 15
Tx: Bandwidth 16.00Gbps, Throughput 241.57Mbps (1.76%)
Rx: Bandwidth 16.00Gbps, Throughput 87.82Mbps (0.64%)
Tx+Rx: Bandwidth 32.00Gbps, Throughput 329.39Mbps (1.20%)
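For saved trunkshow -perf captures, the utilization percentages can be pulled out programmatically, for example to alert when a trunk runs hot. A sketch assuming the Tx/Rx line format shown above:

```python
import re

def trunk_utilization(trunkshow):
    """Pull (direction, percent) pairs from saved 'trunkshow -perf' output,
    e.g. 'Tx: Bandwidth 16.00Gbps, Throughput 241.57Mbps (1.76%)'."""
    pat = re.compile(r"(Tx\+Rx|Tx|Rx):.*\((\d+\.\d+)%\)")
    return [(m.group(1), float(m.group(2)))
            for m in map(pat.search, trunkshow.splitlines()) if m]
```

On the capture above this yields Tx 1.76%, Rx 0.64%, and Tx+Rx 1.20%, all far below any level that would suggest ISL congestion.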