
CSC Private
Global Infrastructure & Enterprise Services

FABRIC BOTTLENECK MONITORING AND TROUBLESHOOTING

Version: 1.0

Abstract: This document outlines FABRIC Bottleneck Monitoring & Troubleshooting.

Protective Mark: CSC Private
Document Ref: <CSCSAP-011>
Document Owner: Sumit Arora
Document Approver: Gurpreet Wadhwa / Sandeep Kumar Nandi Duli


Contents
Purpose
Scope
Overview
Status & Query process
Troubleshooting Steps

Amendment Record
Version Control Log

No   Date         By            Nature of change
     09-05-2013   Sumit Arora   New document


Purpose
This is the troubleshooting document for SAN FABRIC bottleneck monitoring.

Scope
This document applies to all SAP SAN fabric troubleshooting.

Overview

Bottleneck Detection
A bottleneck is a port in the fabric where frames cannot get through as fast as they should. In other words, a
bottleneck is a port where the offered load is greater than the achieved egress throughput. Bottlenecks can cause
undesirable degradation in throughput on various links. When a bottleneck occurs at one place, other points in the
fabric can experience bottlenecks as the traffic backs up.
The bottleneck detection feature detects two types of bottlenecks:
Latency bottleneck
Congestion bottleneck
A latency bottleneck is a port where the offered load exceeds the rate at which the other end of the link can
continuously accept traffic, but does not exceed the physical capacity of the link. This condition can be caused by
a device attached to the fabric that is slow to process received frames and send back credit returns. A latency
bottleneck due to such a device can spread through the fabric and can slow down unrelated flows that share links
with the slow flow. In this case, the load does not exceed the physical capacity of the channel as such, but can
occur because of an underperforming device connected to the F_Port, or because of back pressure from other
congestion or latency bottlenecks on the E_Ports.
A congestion bottleneck is a port that is unable to transmit frames at the offered rate because the offered rate is
greater than the physical data rate of the line. A congestion bottleneck arises from link over-utilization: the
offered load exceeds the achieved throughput while throughput is already at 100%, so frames attempt to egress at a
faster rate than the line rate allows. For example, this condition can be caused by trying to transfer data at
8 Gbps over a 4 Gbps ISL.
Bottleneck Detection identifies and alerts you about "slow-drain" devices that can cause latency and I/O timeouts.
This capability is particularly valuable for optimizing performance in highly virtualized server environments.
Latency detection is frame-based and identifies buffer credit problems. One of the major strengths of Fibre Channel
is that it creates lossless connections by implementing a flow control scheme based on buffer credits. The
disadvantage of such an approach is that the number of available buffers is limited and may eventually be totally
consumed. The temporary unavailability of buffer credits creates a temporary bottleneck; the longer the credits
are unavailable, the more serious the bottleneck.
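
Before investigating credit starvation, it is worth confirming that bottleneck detection is actually enabled on the
switch in question. A minimal sketch, assuming a Brocade FOS switch with admin access (the exact output layout may
vary by FOS release; the full configuration display is shown later in this document):

switch:admin> bottleneckmon --status
Bottleneck detection - Enabled

If detection is disabled, the --show queries described later will return no history.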


Some temporary credit loss is expected in normal Fibre Channel operation. It is the longer durations that concern
us here. Long periods without buffer credits typically manifest as performance problems and are usually the result
of device latencies. Exceptional situations cause fabric back pressure that can extend all the way across the
fabric and back. Excessive back pressure can create serious problems in an operational SAN. Chronic back pressure
can exacerbate the effect of hardware failures and misbehaving devices and also contribute to serious operational
issues, because existing bottlenecks increase the probability of a failure.

There are several common sources of high latencies:

Storage ports (targets) often produce latencies that can slow down applications because they do not
deliver data at the rate expected by the host platform. Even well-architected storage array performance
can deteriorate over time. For example, LUN provisioning policies such as putting too many LUNs
behind a given port can contribute to poor storage performance if the control processor in the
array cannot deliver data from all the LUNs quickly enough to satisfy read requests. The overhead of
dealing with a very large number of LUNs may be the cause of the slow delivery.

Hosts (initiators) may also produce significant latencies by requesting more data than they are capable of
processing in a timely manner.

Distance links can frequently consume all the buffer credits reserved for them and create a serious
bottleneck in the middle of a fabric, with serious consequences for any applications sharing that
link.

Misbehaving devices such as defective HBAs can create havoc in a well-constructed SAN and increase
the threat to the fabric.

Bottleneck Detection can detect ports that are blocked due to lost credits and generate special stuck-VC and
lost-credit alerts for the E_Port with the lost credits.
There could also be a slow-drain device "down the road": the slow-drain device could be connected to the adjacent
switch or to another switch connected to it. Credit starvation typically back-pressures to affect wide areas of the
fabric.
Alternatively, the ISL could simply have too few buffers: maybe the link is just too long, or the average frame size
is much smaller than expected.
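
Where the alerts point at lost credits on E_Ports or back-end ports, FOS also provides credit recovery tools. A
hedged sketch, assuming a FOS 7.x platform that supports them (verify the syntax in the Command Reference for your
release before enabling):

switch:admin> bottleneckmon --cfgcredittools -intport -recover onLrOnly
switch:admin> bottleneckmon --showcredittools

The first command enables link-reset-based credit recovery on internal ports; the second displays the current
credit tool settings.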

What is a slow-draining device?

Slow-draining devices are devices that request more information than they can consume. This can be because they
run at a slower link rate than the rest of the environment, or because other factors within the device prevent it
from functioning as fast as it could. A slow-draining device can exist at any link utilization level where the
achieved throughput into the slow-draining port is lower than the intended throughput. It is very important to note
that the effect can spread throughout the fabric and slow down traffic unrelated to the slow-draining device.

What causes slow-draining devices?

The most common cause is within the device or server itself. If the device is overloaded in terms of CPU or
memory, it may have a hard time handling the data it has requested. Another common cause is devices that have
slower link rates than the rest of the environment.

Queue depth settings can help mitigate slow-draining devices, which are the most common cause of credit issues.
Queue depth settings limit the number of transactions a device can have open at any one time, whereas credit issues
deal with individual frames rather than transactions. By limiting the number of open transactions you can throttle
slow devices down to data rates they can actually consume and prevent them from impacting other devices that share
resources such as ISLs and storage ports. Typically, queue depth settings are set too high for optimum performance.
The SAN Performance Probe allows users to see the true impact of the queue depth settings and overall latency in
the environment.
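
As an illustration only, queue depth can be inspected and lowered per LUN on a Linux host through sysfs. The device
address 2:0:0:1 below is a made-up example, and the change is non-persistent, so the HBA vendor's documented method
should be used for a permanent setting:

host:~ # cat /sys/bus/scsi/devices/2:0:0:1/queue_depth
64
host:~ # echo 16 > /sys/bus/scsi/devices/2:0:0:1/queue_depth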

How to check status?

We can identify the ports contributing to a bottleneck situation with the following command:
switch:admin> bottleneckmon --show
======================================================
Fri Feb 26 22:00:00 UTC 2010
======================================================
List of bottlenecked ports in most recent interval:
13 16
=======================================================
                                    Number of
From              To                bottlenecked ports
=======================================================
Feb 26 21:59:50 Feb 26 22:00:00 2
Feb 26 21:59:40 Feb 26 21:59:50 0
Feb 26 21:59:30 Feb 26 21:59:40 0
Feb 26 21:59:20 Feb 26 21:59:30 0
Feb 26 21:59:10 Feb 26 21:59:20 0
Feb 26 21:59:00 Feb 26 21:59:10 0
Feb 26 21:58:50 Feb 26 21:59:00 0
Feb 26 21:58:40 Feb 26 21:58:50 0
Feb 26 21:58:30 Feb 26 21:58:40 0
Feb 26 21:58:20 Feb 26 21:58:30 2
Feb 26 21:58:10 Feb 26 21:58:20 3
Feb 26 21:58:00 Feb 26 21:58:10 3
Feb 26 21:57:50 Feb 26 21:58:00 3
Feb 26 21:57:40 Feb 26 21:57:50 3
Feb 26 21:57:30 Feb 26 21:57:40 2
Feb 26 21:57:20 Feb 26 21:57:30 2
Feb 26 21:57:10 Feb 26 21:57:20 0
Feb 26 21:57:00 Feb 26 21:57:10 0
Feb 26 21:56:50 Feb 26 21:57:00 0

Feb 26 21:56:40 Feb 26 21:56:50 0
Feb 26 21:56:30 Feb 26 21:56:40 0
Feb 26 21:56:20 Feb 26 21:56:30 0

To display bottleneck statistics for a single port:


switch:admin> bottleneckmon --show -interval 5 -span 30 2/4
=============================================
Wed Jan 13 18:54:35 UTC 2010
=============================================
                                    Percentage of
From              To                affected secs
==============================================
Jan 13 18:54:05 Jan 13 18:54:10 20.00%
Jan 13 18:54:10 Jan 13 18:54:15 60.00%
Jan 13 18:54:15 Jan 13 18:54:20 0.00%
Jan 13 18:54:20 Jan 13 18:54:25 0.00%
Jan 13 18:54:25 Jan 13 18:54:30 40.00%
Jan 13 18:54:30 Jan 13 18:54:35 80.00%

You can filter the output to display only latency or congestion bottleneck statistics.
c5154556@ls2928:~> rek -l fcd2026 "bottleneckmon --show -latency" | more
==================================================================
Wed May 08 20:08:36 UTC 2013
==================================================================
List of bottlenecked ports in most recent interval:
None
==================================================================
                                    Number of
From              To                bottlenecked ports
==================================================================
May 08 20:08:26   May 08 20:08:36   0
May 08 20:08:16   May 08 20:08:26   0
May 08 20:08:06   May 08 20:08:16   0
May 08 20:07:56   May 08 20:08:06   0
May 08 20:07:46   May 08 20:07:56   0
May 08 20:07:36   May 08 20:07:46   0
May 08 20:07:26   May 08 20:07:36   0
May 08 20:07:16   May 08 20:07:26   0
May 08 20:07:06   May 08 20:07:16   0
May 08 20:06:56   May 08 20:07:06   0
May 08 20:06:46   May 08 20:06:56   0
May 08 20:06:36   May 08 20:06:46   0
May 08 20:06:26   May 08 20:06:36   0
May 08 20:06:16   May 08 20:06:26   0
May 08 20:06:06   May 08 20:06:16   0
May 08 20:05:56   May 08 20:06:06   0
May 08 20:05:46   May 08 20:05:56   0
May 08 20:05:36   May 08 20:05:46   0
May 08 20:05:26   May 08 20:05:36   0
May 08 20:05:16   May 08 20:05:26   0
May 08 20:05:06   May 08 20:05:16   0
May 08 20:04:56   May 08 20:05:06   0
May 08 20:04:46   May 08 20:04:56   0
May 08 20:04:36   May 08 20:04:46   0
May 08 20:04:26   May 08 20:04:36   0
May 08 20:04:16   May 08 20:04:26   0
May 08 20:04:06   May 08 20:04:16   0
May 08 20:03:56   May 08 20:04:06   0
May 08 20:03:46   May 08 20:03:56   0
May 08 20:03:36   May 08 20:03:46   0

The --show option displays a history of the bottleneck severity for a specified port or for all ports. Each
line of output shows the percentage of one-second intervals affected by bottleneck conditions during the
time window shown on that line. When issued for all ports, the union of all port statistics is displayed in
addition to the individual port statistics. The union value provides a good indicator of the overall bottleneck
severity on the switch.
The configuration at SAP is the default:

Bottleneck detection - Enabled
==============================
Switch-wide alerting parameters:
================================
Alerts                         - Yes
Latency threshold for alert    - 0.100
Congestion threshold for alert - 0.800
Averaging time for alert       - 300 seconds
Quiet time for alert           - 300 seconds
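
Should these defaults ever need to be re-applied after a change, a hedged sketch of the corresponding commands
follows (flag names per the FOS Command Reference; verify against the FOS release in use before running):

switch:admin> bottleneckmon --enable -alert
switch:admin> bottleneckmon --config -alert -lthresh 0.1 -cthresh 0.8 -time 300 -qtime 300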

Computer Sciences Corporation


All rights reserved

CSC Private
Page 7
of 12

Document Ref : <CSCSAP-011>


Issues v4.0

CSC Private
Global Infrastructure &
Enterprise Services

Troubleshooting Steps:
1.) Link Speed: One possible reason for a bottleneck is the link speed at which the server is operating. Check the
speed at which the server is operating with the switchshow command.

Fcd2026:
 12   1   12   200c00   id   N2   Online   FC  F-Port  10:00:00:00:c9:51:58:cb

Fcd2025:
 12   1   12   210c00   id   N2   Online   FC  F-Port  10:00:00:00:c9:51:58:ed

Generally, in our environment all ports are set to AUTO, which means the speed at which a port operates depends on
the speed set on the server side (the N2 in the switchshow output above indicates a negotiated speed of 2 Gbps). We
can check the port configuration with the following command:
c5154556@ls2928:~> rek -l fcd2025 "portcfgshow" | more
Ports of Slot 1       0   1   2   3    4   5   6   7    8   9  10  11   12  13  14  15
------------------+---+---+---+---+----+---+---+---+----+---+---+---+----+---+---+----
Speed              AN  AN  AN  AN   AN  AN  AN  AN   AN  AN  AN  AN   AN  AN  AN  AN
Fill Word           3   3   3   3    3   3   3   3    3   3   3   3    3   3   3   3
AL_PA Offset 13    ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Trunk Port         ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON
Long Distance      ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
VC Link Init       ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Locked L_Port      ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Locked G_Port      ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON
Disabled E_Port    ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Locked E_Port      ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
ISL R_RDY Mode     ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
RSCN Suppressed    ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Persistent Disable ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
LOS TOV enable     ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
NPIV capability    ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON
NPIV PP Limit     126 126 126 126  126 126 126 126  126 126 126 126  126 126 126 126
QOS E_Port         AE  AE  AE  AE   AE  AE  AE  AE   AE  AE  AE  AE   AE  AE  AE  AE
EX Port            ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Mirror Port        ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Rate Limit         ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Credit Recovery    ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON   ON  ON  ON  ON
Fport Buffers      ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Port Auto Disable  ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
CSCTL mode         ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..   ..  ..  ..  ..
Fault Delay         0   0   0   0    0   0   0   0    0   0   0   0    0   0   0   0

Suggestive Action: Raise a ticket with the server team to bring the speed to an appropriate level to handle the load.
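
If the server cannot be changed quickly, the port speed can also be locked on the switch side as an interim
measure. A hedged sketch (port 2/4 and the speed values are illustrative; verify the syntax in your FOS Command
Reference):

switch:admin> portcfgspeed 2/4 4
(locks port 2/4 to 4 Gbps; use a speed of 0 to return the port to auto-negotiation)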

2.) Port Errors: In most cases an erroneous port on the switch is the reason for latency and leads to bottleneck
problems in the fabric, so we need to look for port errors and contain them before they spread across the entire
fabric through the ISLs.
Slow-drain devices lead to latency, and latency leads to timeouts and hence C3 discards; these errors then show up
as bottlenecks and performance degradation.
          frames      enc  crc  crc    too  too  bad  enc  disc  link  loss  loss  frjt  fbsy
        tx    rx      in   err  g_eof  shrt long eof  out  c3    fail  sync  sig
=============================================================================================
149:     0     0       0    0    0      0    0    0   146   292    0     0    0     0    0
150:  5.5m  6.0m       0    0    0      0    0    0     0     0    0     0    0     0    0
151:     0     0       0    0    0      0    0    0   146   292    0     0    0     0    0
Suggestive Action: Please refer to the document on dealing with the different kinds of port errors, located at:
\\usphlhost\storage\CSC\SAP Germany\Fabric Troubleshooting
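
To distinguish stale counters from an active problem, the error counters can be cleared and re-sampled under
production load. A hedged sketch using the port number from the example above (confirm command availability on
your FOS release):

switch:admin> portstatsclear 149
switch:admin> porterrshow
(wait a few minutes between the two commands; a disc c3 count that keeps climbing indicates a live latency
condition rather than history)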
3.) SFP: A faulty SFP is the cause of the bottleneck in many cases; the Tx/Rx power and the SFP speed are the
determining factors of SFP health.

Suggestive Action: If the SFP is faulty, please raise a case with EMC and get it replaced.
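
The optics can be read directly from the switch with sfpshow. A hedged sketch (port 2/4 and the power readings are
illustrative; acceptable ranges depend on the SFP type and the vendor specification):

switch:admin> sfpshow 2/4
  ...
  RX Power: -3.1 dBm (489.9 uW)
  TX Power: -3.3 dBm (467.6 uW)

Rx/Tx power near the SFP's minimum sensitivity, or a large mismatch between the two ends of the link, supports the
faulty-SFP diagnosis.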
4.) Cabling: Sometimes the cabling is not proper and has too many folds, which leads to errors and bad signal
strength, which also results in bottleneck problems.
Suggestive Action: Raise a case with the cabling team to clean the cable; if the issue still appears, get the cable
changed.


5.) Throughput: In most cases we should check the throughput achieved by the trunks created in the fabric to get an
idea of bottlenecks at the ISL level.
c5154556@ls2928:~> rek -l fcd2019 "trunkshow -perf" | more
1: 18-> 64 10:00:00:05:33:9c:f3:00 29 deskew 15 MASTER
   19-> 65 10:00:00:05:33:9c:f3:00 29 deskew 15
    Tx: Bandwidth 16.00Gbps, Throughput 241.57Mbps (1.76%)
    Rx: Bandwidth 16.00Gbps, Throughput 87.82Mbps (0.64%)
    Tx+Rx: Bandwidth 32.00Gbps, Throughput 329.39Mbps (1.20%)

2: 30->  0 10:00:00:05:33:ec:0a:7a 19 deskew 15 MASTER
   31->  1 10:00:00:05:33:ec:0a:7a 19 deskew 15
    Tx: Bandwidth 16.00Gbps, Throughput 4.06Kbps (0.00%)
    Rx: Bandwidth 16.00Gbps, Throughput 468.48Kbps (0.00%)
    Tx+Rx: Bandwidth 32.00Gbps, Throughput 472.54Kbps (0.00%)

3: 46-> 30 10:00:00:05:33:22:c6:00 96 deskew 15 MASTER
   47-> 31 10:00:00:05:33:22:c6:00 96 deskew 15
    Tx: Bandwidth 16.00Gbps, Throughput 434.14Kbps (0.00%)
    Rx: Bandwidth 16.00Gbps, Throughput 6.30Mbps (0.05%)
    Tx+Rx: Bandwidth 32.00Gbps, Throughput 6.73Mbps (0.02%)

Suggestive Action: If the trunks are running at consistently high utilization, please add more trunks or ISLs to
distribute the load.
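
To watch ISL load live while a bottleneck is suspected, per-port throughput can be sampled on an interval. A hedged
sketch using the same access method as above (the 5-second interval is illustrative):

c5154556@ls2928:~> rek -l fcd2019 "portperfshow 5" | more

Each column shows the throughput of one port, refreshed every 5 seconds, which makes an unevenly loaded trunk
member easy to spot.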

