
Known Issues on Cisco 7600 Router ES+ Line Cards

VERSION 2

Introduction

ES+ linecards on Cisco 7600 Series Routers use highly programmable components. Some of the
issues observed on these cards have symptoms that would normally be interpreted as a
hardware failure, e.g. double-bit or repeated single-bit parity errors.

Purpose Of This Document


This document provides an overview of known issues related to ES+ linecards on Cisco 7600
Series Routers, with a twofold purpose:

1. increase awareness of the fact that the traditional distinction between a HW failure and a
SW failure may no longer apply
2. help Cisco customers and partners evaluate issues observed on ES+ cards

This is not an exhaustive list. If your symptom does not match any of the bugs listed in this
document, please run an additional search in the Bug Toolkit before opening a TAC Service
Request.

Where To Look For Failure Symptoms


ES+ linecards have a local flash disk used for storing on-board logging data and for crashinfo
files.

Locations where ES+ failure symptoms should be looked up are:

• logging buffer on the active supervisor:
  o relevant command: "show logging"
• logging buffer on the ES+ linecard:
  o relevant command: "remote command module <slot> show logging"
• on-board logging file on the ES+ linecard's flash disk
  o relevant command: "remote command module <slot> show logging onboard"
• crashinfo and mini-crashinfo files on ES+ linecard's flash disk
  o relevant command: "dir dfc#<slot>-bootdisk:", "more dfc#<slot>-bootdisk:<filename>"
  o NOTE: before running the "more" command, execute "terminal length 0"
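
The commands above can be generated for a given slot with a few lines of script. The following Python sketch only assembles the command strings listed above; the function name es_plus_diag_commands is hypothetical and is not part of any Cisco tool, and the sketch does not connect to a router.

```python
from typing import List


def es_plus_diag_commands(slot: int) -> List[str]:
    """Build the show/dir commands listed above for the ES+ card in `slot`."""
    return [
        "terminal length 0",  # disable paging before using "more" on crashinfo files
        "show logging",  # logging buffer on the active supervisor
        f"remote command module {slot} show logging",  # linecard logging buffer
        f"remote command module {slot} show logging onboard",  # on-board log
        f"dir dfc#{slot}-bootdisk:",  # crashinfo and mini-crashinfo files
    ]


if __name__ == "__main__":
    for command in es_plus_diag_commands(7):
        print(command)
```

The "more dfc#<slot>-bootdisk:<filename>" command is left out on purpose, because the file name is only known after the directory listing has been read.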

The on-board log may show isolated occurrences of single-bit parity errors. This should not be a
concern because:

1. isolated single-bit parity errors can be considered soft-parity errors, caused by sources
external to the memory chip
2. ECC logic on ES+ linecards corrects single-bit errors

List Of Known Issues

• CSCsv05515: x40g: Improve the message wordings for recoverable tcam errors
• CSCsw31515: ES+: %DEV_SELENE-DFCx-3-SRAM_ECC: Selene SRAM ECC Errors
• CSCtb76621: ES+ ROMMON: MPC8548 DDR20 errata fix for Multi-bit ECC errors
• CSCtb78538: ES+ ROMMON: controller setting changes to prevent Multi-bit ECC errors
• CSCtc17311: ES+: TCAM_MGR_HW_ERR: TCAM device had corrupted data errors
• CSCtd66014: ES+: ECC_DOUBLE: Double-bit ECC error detected on NP - High T, Normal V
• CSCtd99244: ES+: ECC_SINGLE or ECC_DOUBLE error detected on NP
• CSCtd99248: ES+: ECC_DOUBLE: Double-bit ECC error detected on NP
• CSCte14535: Invalid LinkFPGA or LINKFPGA Bus Error
• CSCtg31984: DBUS-HDR error in ES/ES+ Modules
• CSCth11714: ES+ ECC_DOUBLE: Double-bit ECC error or reset due to eznp_ecc_err_isr
• CSCth15790: Low-queue ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 16
• CSCth20868: Link FPGA Update Failures with Different signatures
• CSCth25959: IOS changes for updating the new temperature thresholds for ES+
• CSCti80887: Temperature incorrect when sensor is Not_Operational
• CSCtn41667: IOS fix for handling the Power calcuation issues with ES+ Combo cards
• CSCtn68668: Fix LC inlet temp issue (ES+XC) and Alarm handling issues (All ES+)
• CSCtn95122: ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 17
• CSCto55567: ES+: LC failed to recover due to Metropolis lockup
• CSCtq07626: ES+: DEV_SELENE XAUI_LEN, FIFO_FULL, XAUI_GNT and XAUI_MIN errors
• CSCtr37182: XAUI code error reporting needs to be changed
• CSCtr74529: ES+: LONGBUSYREAD: C2W Interface busy for long time reading temp sensor
• CSCtr74953: ES+: Watchdog resets fail to write crashinfo, causing Keep Alive failure
• CSCts25729: ES+: PCI read hang causes Keep Alive failure, fails to write crashinfo
• CSCtt13344: ES+: Ingress traffic will not pass with > 7091 bytes packet size
• CSCsy88170: x40g: Failed to read register Id while reading NP registers val
• CSCsz04660: Traceback %X40G-DFC4-3-TCAM_MGR_HW_ERR: GTM HW ERROR

IMPORTANT NOTE: Multiple issues related to ECC parity errors have been reported on ES+
linecards. All of the known issues are fixed in the latest releases. Customers with ES+
deployments are advised to upgrade to 12.2(33)SRE5, 15.0(1)S5, or a later release.

Customers who have deployed a 12.2(33)SRD release and cannot upgrade to 12.2(33)SRE5
should upgrade to the latest rebuild, 12.2(33)SRD6.
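
As an aid for matching a syslog line against the entries in this document, the mnemonic at the end of the message header can be used as a lookup key. The Python sketch below assumes the usual %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC header layout seen in the examples in this document. The table is a partial, hand-built index taken from the symptom text of the entries that follow; the names KNOWN_ISSUES and candidate_bugs are hypothetical. A match only means that the entry is worth reading, not that the bug is confirmed.

```python
import re
from typing import List

# Mnemonic -> bug IDs, copied from the symptom text of entries in this document.
# Partial index for illustration; it is not an official Cisco mapping.
KNOWN_ISSUES = {
    "MBE": ["CSCtb76621"],
    "TCAM_MGR_HW_ERR": ["CSCtc17311", "CSCsz04660"],
    "ECC_SINGLE": ["CSCtd99244"],
    "ECC_DOUBLE": ["CSCtd66014", "CSCtd99244", "CSCtd99248",
                   "CSCth11714", "CSCth15790", "CSCtn95122"],
    "INVALID_IMG_VER": ["CSCte14535", "CSCth20868"],
    "IOFPGA_IO_BUS_ERROR": ["CSCte14535", "CSCth20868"],
    "DBUS_HDR_ERR": ["CSCtg31984"],
    "MINORTEMPALARM": ["CSCth25959"],
    "FABRICCRCERRS": ["CSCto55567"],
    "XAUI_CODE": ["CSCtr37182"],
    "LONGBUSYREAD": ["CSCtr74529"],
}

# Header layout: %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC,
# for example %NP_DEV-DFC2-3-ECC_DOUBLE or %ENV-4-MINORTEMPALARM.
MESSAGE_RE = re.compile(r"%[A-Z0-9_]+(?:-[A-Za-z0-9]+)?-\d-([A-Z0-9_]+)")


def candidate_bugs(log_line: str) -> List[str]:
    """Return bug IDs from this document whose symptom uses the same mnemonic."""
    match = MESSAGE_RE.search(log_line)
    if match is None:
        return []
    return KNOWN_ISSUES.get(match.group(1), [])


if __name__ == "__main__":
    print(candidate_bugs("%ENV-4-MINORTEMPALARM: temperature alarm"))
```

Several ECC_DOUBLE entries differ only in the memory number or in the conditions, so the returned list still has to be narrowed down by reading each entry.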

========================================
CSCsv05515
x40g: Improve the message wordings for recoverable tcam errors
----------------------------------------

If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCsw31515
ES+: %DEV_SELENE-DFCx-3-SRAM_ECC: Selene SRAM ECC Errors
----------------------------------------

If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCtb76621
ES+ ROMMON: MPC8548 DDR20 errata fix for Multi-bit ECC errors
----------------------------------------
Symptom:

%C6K_MEM_ECC-DFCx-2-MBE: Multiple bit error detected at ...
%C6K_MEM_ECC-DFCx-3-SYNDROME_MBE: 8-bit Syndrome for the detected Multi-bit error: ...
%C7600_MEM_ECC-DFCx-2-MBE: Multiple bit error detected at ...
%C7600_MEM_ECC-DFCx-3-SYNDROME_MBE: 8-bit Syndrome for the detected Multi-bit error: ...

Conditions:

Observed on ES+ line card of Cisco 7600 Series Router.

Workaround:

There is no workaround.

Further Problem Description:

This fix is integrated in the 12.2(33r)SRD7 ROMMON image for ES+ cards. The SRD7 ROMMON
image is bundled into the IOS package for Cisco 7600 Series Routers starting from 15.0(1)S. Cisco
7600 Series Routers running a 12.2(33)SRD or 12.2(33)SRE image may also run the
SRD7 ROMMON. If affected by this issue, contact Cisco TAC and request the 12.2(33r)SRD7
image. Refer to this link for the ROMMON upgrade procedure:
http://www.cisco.com/en/US/docs/routers/7600/rommon/rsp720_rommon.html#wp180816

========================================
CSCtb78538
ES+ ROMMON: controller setting changes to prevent Multi-bit ECC errors
----------------------------------------

If this error message is encountered, please contact Cisco TAC for further support.

========================================
CSCtc17311
ES+: TCAM_MGR_HW_ERR: TCAM device had corrupted data errors
----------------------------------------
Symptoms: TCAM device is reporting corrupted data:

%X40G-DFC2-3-TCAM_MGR_HW_ERR: GTM HW ERROR: TCAM device had corrupted data, the error is corrected for channel ...

Conditions: Observed on ES+ linecards of Cisco 7600 Series Routers, by a background TCAM
consistency checker.

Workaround: There is no workaround.

Further Problem Description: These messages can safely be ignored as the entries are already
corrected.

========================================
CSCtd66014
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP - High T, Normal V
----------------------------------------
Symptoms: An ES+ line card crashes at power-up of a Cisco 7600 router that is
running a Cisco IOS 12.2SRE image if either the Traffic Manager or the Frame
memories in the ES+ network processors report a double-bit ECC error. The ES+
line card crashinfo will contain the following string:

%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions: Router reloads, OIR of ES+ cards, or system environment temperatures
that vary slowly around an ambient temperature of about 30 degrees C. The problem
occurs at system power-up. Double-bit ECC errors have also been reported
after a few hours of traffic when the ambient temperature varies around 30
degrees C.

Workaround: No configuration workaround is available. The line card will
reset itself and will be operational after the second reload.
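
When many ECC messages have to be compared (for example to check which NP and which memory are affected), the fields can be extracted with a regular expression. The following Python sketch is based only on the message layout shown in the example above; the function name parse_np_ecc and the field names are hypothetical, and the message is expected on a single line (join wrapped lines first).

```python
import re
from typing import Dict, Optional

# Layout taken from the example message in this document; other IOS releases
# may format the message differently (assumption).
ECC_RE = re.compile(
    r"%NP_DEV-DFC(?P<dfc>\d+)-3-ECC_(?P<kind>SINGLE|DOUBLE):"
    r".*?NP (?P<np>\d+), Mem (?P<mem>\d+),\s*SubMem (?P<submem>0x[0-9A-Fa-f]+)"
)


def parse_np_ecc(line: str) -> Optional[Dict[str, object]]:
    """Extract DFC number, error kind, NP, Mem and SubMem from one ECC message."""
    match = ECC_RE.search(line)
    if match is None:
        return None
    return {
        "dfc": int(match.group("dfc")),
        "kind": match.group("kind"),
        "np": int(match.group("np")),
        "mem": int(match.group("mem")),
        "submem": int(match.group("submem"), 16),
    }


if __name__ == "__main__":
    sample = ("%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, "
              "Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1")
    print(parse_np_ecc(sample))
```

The "mem" value is the most useful field for narrowing down the entry, since some entries in this document are specific to one memory (Mem 16 for CSCth15790, Mem 17 for CSCtn95122).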

========================================
CSCtd99244
ES+: ECC_SINGLE or ECC_DOUBLE error detected on NP
----------------------------------------
Symptoms:

7600 series router with ES+ line card crashes reporting single bit or double bit ECC error.

%NP_DEV-DFC2-3-ECC_SINGLE: Single-bit ECC error detected on NP 0, Mem 18, SubMem 0x1,SingleErr 1, DoubleErr 0 Count 1 Total 1

%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:

Symptom observed on ES+ linecard of C7600 series routers, usually in the initial phases of line
card bootup, but this has also been reported after a few hours of traffic through the ES+ line
card ports.

Workaround:

There is no workaround.

Further Problem Description:

Software fix is available in:

12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher

If symptom persists after IOS upgrade please contact Cisco TAC.

========================================
CSCtd99248
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP
----------------------------------------
Symptoms:

On 7600 series routers with ES+ line cards, occasional double-bit ECC errors may be reported
by the network processor on the ES+ line card for the traffic manager and other metadata
memories.

Example error message:

%NP_DEV-DFC9-3-ECC_DOUBLE: Double-bit ECC error detected on NP 3, Mem 18, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:

This symptom is observed on router reloads, OIR of ES+ cards, or when system environment
temperatures vary slowly around an ambient temperature of about 30 degrees C. It
occurs at system power-up. Double-bit ECC errors have also been reported after a few hours
of traffic.

Workaround: No configuration workaround is available. The line card will
reset itself and will be operational after the second reload.

Further Problem Description:

Software fix is available in:

12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher

If symptom persists after IOS upgrade please contact Cisco TAC.

========================================
CSCte14535
Invalid LinkFPGA or LINKFPGA Bus Error
----------------------------------------
Symptom:

Possible symptoms are:

%FPD_MGMT-3-INVALID_IMG_VER: Invalid ... LinkFPGA .. image version detected for ... card in slot-dc ...
%FPD_MGMT-6-UPGRADE_PASSED: ... LinkFPGA ... image in the ... card in slot-dc 7-2 has been successfully updated from version ?.? to version ...
%C7600_ES-2-IOFPGA_IO_BUS_ERROR: C7600-ES Line Card IOFPGA IO LINKFPGA Bus Error

Conditions:
Observed during boot/reload of ES+ line card in Cisco 7600 Series Routers. Rare in normal
working ES+ cards.

Workaround:
This fix is an enhancement which adds an additional recovery cycle for reading the LinkFPGA.

Further Problem Description:

The link FPGA should recover in the next recovery reload of the ES+. If the card does not
recover after 3 consecutive reloads, a persistent hardware fault may be the reason. Contact
TAC for RMA procedures.

========================================
CSCtg31984
DBUS-HDR error in ES/ES+ Modules
----------------------------------------
Symptom:
A 7600 with an ES/ES+ module may report the error EARL_L2_ASIC-DFC2-4-DBUS_HDR_ERR
after boot-up. There is no functional impact to the router due to this error.

Conditions:
7600 with ES/ES+ modules present. The problem can happen up to a few hours
after boot up.

Workaround:
No workaround. Problem has been resolved in 12.2(33)SRD5 and 12.2(33)SRE2.

========================================
CSCth11714
ES+ ECC_DOUBLE: Double-bit ECC error or reset due to eznp_ecc_err_isr
----------------------------------------
Symptom:

7600 Series router with ES+ line card crashes reporting error:

%NP_DEV-DFC2-3-ECC_DOUBLE: Double-bit ECC error detected on NP 0, Mem 19, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Another possible symptom is:

%PM_SCP-SP-1-LCP_FW_ERR: System resetting module 1 to recover from error: eznp_ecc_err_isr: ECC intr handler for NP: 1 failed

Conditions:

Symptom observed on ES+ linecard of C7600 series routers.

Workaround:

None.

Further Problem Description:

Software fix is available in:

12.2(33)SRD5 or higher
12.2(33)SRE2 or higher
15.0(1)S or higher

If symptom persists after IOS upgrade please contact Cisco TAC.

========================================
CSCth15790
Low-queue ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 16
----------------------------------------
Symptoms:

%NP_DEV-DFC9-3-ECC_DOUBLE: Double-bit ECC error detected on NP 3, Mem 16, SubMem 0x1,SingleErr 1, DoubleErr 1 Count 1 Total 1

Conditions:
Symptom observed on Low-queue ES+ line cards (ES+T) of C7600 series routers, in NP Mem
16.

Workaround:
There is no workaround.

Further Problem Description:


If symptom persists after IOS upgrade please contact Cisco TAC.

========================================
CSCth20868
Link FPGA Update Failures with Different signatures
----------------------------------------
Symptom:

The ES+ card crashes with different failure messages during production. In most cases the
initial message for the reload is an FPD upgrade failure after multiple attempts.

The crash messages differ between boot-up attempts. They can be System Exception, FPD
upgrade failure, or IOFPGA bus error. Examples:

Initial symptom:

%FPD_MGMT-3-INVALID_IMG_VER: Invalid 20x1G LinkFPGA (FPD ID=7) image version detected for 7600-ES+20G card in slot-dc 7-2.

IOFPGA bus error symptom:

%C7600_ES-DFC7-2-IOFPGA_IO_BUS_ERROR: C7600-ES Line Card IOFPGA IO LINKFPGA Bus Error:

Other system exceptions may also be reported.

Conditions:

Symptom observed during boot-up of 7600-ES+ linecards.

Workaround:

None.

========================================
CSCth25959
ENV-4-MINORTEMPALARM - updating the new temperature thresholds for ES+
----------------------------------------
Symptom:
Temperature alarm (ENV-4-MINORTEMPALARM) is reported, with AMBER LED on the line
card faceplate.

Conditions:
7600 series router with any model of the ES+ line card.

Workaround:
No workaround.

Further Problem Description:


Temperature thresholds were set too low before this bug fix. Correct settings are:

--------------------------------------------
Sensor          Minor        Major
ID              Threshold    Threshold
--------------------------------------------
BB Outlet 0     65           80
BB Outlet 1     70           85
--------------------------------------------

It is also recommended to evaluate the related bug CSCtn68668.
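
To check a temperature reading against the corrected settings, the two rows of the table above can be placed in a small lookup. The Python sketch below assumes that an alarm is raised when the reading is equal to or higher than the threshold; this comparison rule is an assumption and is not stated in the bug description. The names CORRECTED_THRESHOLDS and classify_temperature are hypothetical.

```python
from typing import Dict, Tuple

# (minor, major) thresholds in degrees C, copied from the table above.
CORRECTED_THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "BB Outlet 0": (65, 80),
    "BB Outlet 1": (70, 85),
}


def classify_temperature(sensor: str, temp_c: int) -> str:
    """Return "normal", "minor" or "major" for a reading on a listed sensor.

    Assumption: a reading equal to the threshold already counts as an alarm.
    """
    minor, major = CORRECTED_THRESHOLDS[sensor]
    if temp_c >= major:
        return "major"
    if temp_c >= minor:
        return "minor"
    return "normal"


if __name__ == "__main__":
    print(classify_temperature("BB Outlet 0", 62))
```

Sensors that are not listed in the table raise a KeyError, because their corrected thresholds are not given in this entry.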

========================================
CSCti80887
Temperature 128 degC reported when sensor is Not_Operational
----------------------------------------
Symptom:
Faceplate LED on the linecard is red. Temperature sensor is reporting 128 degC.

In addition, the following I2C error may be reported by the linecard, confirming that the
temperature sensor cannot be read:

I2C Read Error READ bus=0x1 addr=0x4D port_sel=0x0 flags = 0x0 cmd=0x0 size=2

Conditions:

Faulty sensor on a ES+ linecard of a C7600 Series Router.

Workaround:

None.

Further Problem Description:

This SW fix corrects the reporting of an invalid sensor. Under the same circumstances, 'NO'
(Not Operational) will be reported instead of 128 degC.

========================================
CSCtn41667
IOS fix for handling the Power calcuation issues with ES+ Combo cards
----------------------------------------
Symptom:
The following ES+ PIDs consume more power than the expected values:

76-ES+XC-20G3C
76-ES+XC-20G3CXL
76-ES+XC-40G3C
76-ES+XC-40G3CXL

This might lead to other modules being powered down due to "power deny".

Conditions:
Specific to ES+XC variants (Combo cards) of Cisco 7600 Series Routers.

Workaround:
Configure power redundancy-mode combined until the IOS is upgraded to a release with
correct power settings.

========================================
CSCtn68668
Fix LC inlet temp issue (ES+XC) and Alarm handling issues (All ES+)
----------------------------------------
Symptoms: The following symptoms are observed:

1. The STATUS LED on the line card faceplate is amber.
2. The "remote command module <slot> show platform hardware environment temperature"
command reports a high line card inlet temperature:

Router#remote command mod 1 show plat hard env temp

----------------------------------------------------------
Temperature and Threshold Table
----------------------------------------------------------
Sensor          Minor        Major        Current
ID              Threshold    Threshold    Temperature
----------------------------------------------------------
BB Outlet 0     60           75           47
BB Inlet 0      50           65           27
BB Outlet 1     75           85           54
BB Inlet 1      50           65           32
PE Outlet       60           75           53
PE Inlet        50           65           34
LC Outlet       60           75           49
LC Inlet        50           65           50   <<<<<<<<

Conditions: This issue is specific to the following Cisco 7600 ES+ combo
cards:

76-ES+XC-20G3C
76-ES+XC-20G3CXL
76-ES+XC-40G3C
76-ES+XC-40G3CXL

Line card inlet sensor is inappropriately positioned in a place where
temperatures are higher than on the inlet point.

Workaround: There is no workaround.

Further Problem Description: There are no problems with the functioning of
the board. Only the external communication is affected. "BB Inlet 1" shows
the actual inlet temperature. It can be used for reliable measurement of line
card inlet temperature.
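
The temperature table shown in the symptom can be checked by script when several cards have to be reviewed. The Python sketch below parses rows in the layout of the sample output above and lists sensors whose current reading is at or above the minor threshold. The "at or above" rule is an assumption inferred from the marked "LC Inlet" row (50 against a minor threshold of 50); the function name sensors_at_or_above_minor is hypothetical.

```python
import re
from typing import List, Tuple

# A data row is: sensor name, minor threshold, major threshold, current reading,
# optionally followed by a "<<<<" marker as in the sample above.
ROW_RE = re.compile(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s*<*\s*$")


def sensors_at_or_above_minor(output: str) -> List[Tuple[str, int, int]]:
    """Return (sensor, minor_threshold, current) for readings >= minor threshold."""
    flagged = []
    for line in output.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, separator, or prompt line
        sensor = match.group(1)
        minor, current = int(match.group(2)), int(match.group(4))
        if current >= minor:
            flagged.append((sensor, minor, current))
    return flagged


SAMPLE = """\
BB Outlet 0 60 75 47
BB Inlet 0 50 65 27
BB Outlet 1 75 85 54
BB Inlet 1 50 65 32
PE Outlet 60 75 53
PE Inlet 50 65 34
LC Outlet 60 75 49
LC Inlet 50 65 50 <<<<<<<<
"""

if __name__ == "__main__":
    print(sensors_at_or_above_minor(SAMPLE))  # [('LC Inlet', 50, 50)]
```

On the affected combo cards a flagged "LC Inlet" row should be compared with the "BB Inlet 1" reading, which this entry names as the reliable inlet measurement.
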
========================================
CSCtn95122
ES+: ECC_DOUBLE: Double-bit ECC error detected on NP, Mem 17
----------------------------------------
Symptoms: The ECC double-bit error is reported in syslog, followed by a linecard crash:

%NP_DEV-DFC5-3-ECC_DOUBLE: Double-bit ECC error detected on NP ... Mem 17

Conditions: Observed on ES+ linecards of C7600 Series Routers when heavy configuration
changes are applied to the linecard. In addition, there are other unknown race conditions that can
cause this. This bug-fix is specific to Double-bit errors on Mem 17.

Workaround: There is no workaround.

========================================
CSCto55567
ES+: FABRICCRCERRS after SSO due to Metropolis lockup
----------------------------------------
Symptoms: The line card reports fabric errors:

%FABRIC_INTF_ASIC-DFC9-4-FABRICCRCERRS: Fabric ASIC 0: 322 Fabric CRC error events in 100ms period

Also, TestMacNotification and TestFabricCh0Health diagnostic tests are failing.

Conditions: Symptom is observed on ES+ line cards of C7600 Series Routers after SSO with
multicast traffic flowing through the line card.

Workaround: Soft reload the line card using the "hw-module module <slot> reset" exec
command.

========================================
CSCtq07626
ES+: DEV_SELENE XAUI_LEN, FIFO_FULL, XAUI_GNT and XAUI_MIN errors
----------------------------------------
Symptom:
Errors detected by selene ASIC:

%DEV_SELENE-DFC1-3-XAUI_LEN
%DEV_SELENE-DFC1-3-FIFO_FULL
%DEV_SELENE-DFC1-3-XAUI_GNT
%DEV_SELENE-DFC1-3-XAUI_MIN

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers.

Workaround:
None.

Further Problem Description:

The listed error types are not HW failures. Instead of being reported through error messages,
occurrences of these errors can be tracked through the CLI: "remote command module <slot> show
platform hardware drops".

========================================
CSCtr37182
ES+: single occurrence of DEV_SELENE XAUI_CODE error
----------------------------------------
Symptoms: Single occurrence of XAUI_CODE and XAUI_RX_RDY message in the syslog:

%DEV_SELENE-DFC1-3-XAUI_CODE: Selene 1 XAUI 1 Coding Error
%DEV_SELENE-DFC1-3-XAUI_RX_RDY: Selene 1 XAUI 1 Rx Rdy changed state

Conditions: This symptom is observed on ES+ linecards of Cisco 7600 series router.

Workaround: There is no workaround.

Further Problem Description: Single occurrence of this error can safely be ignored.

========================================
CSCtr74529
ES+: LONGBUSYREAD: C2W Interface busy for long time reading temp sensor
----------------------------------------
Symptoms:

%ENVM-4-LONGBUSYREAD: C2W Interface busy for long time reading temperature sensor

Conditions: Observed on ES+ linecard of Cisco 7600 Series Routers.

Workaround: There is no workaround.

========================================
CSCtr74953
ES+: Watchdog resets fail to write crashinfo, causing Keep Alive failure
----------------------------------------
Symptom:
%OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled off (Module not responding
to Keep Alive polling)
%C7600_PWR-SP-4-DISABLED: power to module in slot 1 set off (Module not responding to
Keep Alive polling)

No crashinfo file is created.

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers. This bug is specific to a condition
where no other explanations exist for the failure of Keep Alive polling.

Workaround:
There is no workaround.

Further Problem Description:

This fix does not prevent the line card crash, but it prevents the silent crash. This fix ensures that
a crashinfo will be written on the ES+ line card flash disk. It also ensures that the line card is reset
as soon as the error condition is detected, as opposed to waiting for a Keep Alive failure.

========================================
CSCts25729
ES+: PCI read hang causes Keep Alive failure, fails to write crashinfo
----------------------------------------
Symptom:

%OIR-SP-3-PWRCYCLE: Card in module 1, is being power-cycled off (Module not responding
to Keep Alive polling)
%C7600_PWR-SP-4-DISABLED: power to module in slot 1 set off (Module not responding to
Keep Alive polling)

No crashinfo file is created.

Conditions:
Observed on ES+ linecards of Cisco 7600 Series Routers. This bug is specific to a condition
where no other explanations exist for the failure of Keep Alive polling.

Workaround:
There is no workaround.

Further Problem Description:

This fix does not prevent the line card crash, but it prevents the silent crash. This fix ensures that
a crashinfo and mini crashinfo will be written on the ES+ line card flash disk. It also ensures that
the line card is reset as soon as the error condition is detected, as opposed to waiting for a Keep
Alive failure.

========================================
CSCtt13344
ES+: Ingress traffic will not pass with > 7091 bytes packet size
----------------------------------------
Symptom:

Traffic will not pass with greater than 7091 byte packet size.

Conditions:

When the MTU is set greater than 7091, sending packets larger than 7092 bytes may hit the issue.
There is no specific trigger for this. When the issue is hit, the last byte of the ifdma_status
register reads "C0".

From the ES+ line card, run the command below to read the ifdma_status register:

show platform hardware npc 1 register all | i ifdma

1 ifdma_config 297 0x34010000
1 ifdma_counter_base 298 0x00000001
1 ifdma_frame_length 299 0x3DC0003C
1 ifdma_buffer_recycle 300 0x00F300CE
1 ifdma_enable 301 0xFFFFFFFF
1 ifdma_status 302 0x000200C0 <<<<

Where "1" after npc is the NP number.
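
The register check described above can be scripted. The Python sketch below looks for the ifdma_status row in the command output and tests whether the low-order byte of the value is 0xC0. It assumes that "last byte" refers to the rightmost two hex digits of the printed value, and that the row layout matches the sample above; the function name ifdma_status_matches_symptom is hypothetical.

```python
import re
from typing import Optional

# Row layout from the sample: <np> ifdma_status <register index> <hex value>
STATUS_RE = re.compile(r"ifdma_status\s+\d+\s+(0x[0-9A-Fa-f]+)")


def ifdma_status_matches_symptom(register_dump: str) -> Optional[bool]:
    """Return True if the low-order byte of ifdma_status is 0xC0.

    Returns None when no ifdma_status row is found in the output.
    """
    match = STATUS_RE.search(register_dump)
    if match is None:
        return None
    value = int(match.group(1), 16)
    return (value & 0xFF) == 0xC0


if __name__ == "__main__":
    print(ifdma_status_matches_symptom("1 ifdma_status 302 0x000200C0<<<<"))
```

A True result only shows that the register value matches the signature given for CSCtt13344; it should be read together with the traffic symptom.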

Workaround:

Fix is to disable the buffer recovery mechanism.

========================================
CSCsy88170
x40g: Failed to read register Id while reading NP registers val
----------------------------------------
Symptom:

DFC3: ERROR! number: 0x80003902, NPprmReg_Read_NP_3c: register <num> is not supported for NP-3c2.

Conditions:
Observed on the console or syslog of ES+ linecards of Cisco 7600 Series Routers.

Workaround:

None.

Further Problem Description:

Issue is cosmetic. Some registers are not meant to be read by the firmware on the chip. When the
chip tries to read these registers, it prints the error.

========================================
CSCsz04660
Traceback %X40G-DFC4-3-TCAM_MGR_HW_ERR: GTM HW ERROR
----------------------------------------
Symptom:

On bootup or normal operations, a few ES+ cards might show the following traceback.

%X40G-DFC4-3-TCAM_MGR_HW_ERR: GTM HW ERROR: TCAM device contains corrupted uninitialized data for channel

Conditions:

Observed on a small number of ES+ linecards of Cisco 7600 Series Routers.

Workaround:

None.

Further Problem Description:

This message indicates that the TCAM consistency checker has detected a few TCAM entries
that were not in the initialized states. The TCAM consistency checker has already corrected these
TCAM entries.
