Lucent Technologies Proprietary. This document contains proprietary information of Lucent Technologies and is not to be disclosed or used except in accordance with applicable agreements. Copyright 2000 Lucent Technologies. Unpublished and Not for Publication. All Rights Reserved.
Notice
Every effort was made to ensure that the information in this document was complete and accurate at the time of printing. However, information is subject to change.
Security Statement
In rare instances, unauthorized individuals make connections to the telecommunications network through the use of remote access features. In such event, applicable tariffs require that the customer pay all network charges for traffic. Lucent Technologies cannot be responsible for such charges and will not make any allowance or give any credit for charges that result from unauthorized access.
Trademarks
5ESS is a registered trademark of Lucent Technologies. AUTOPLEX is a registered trademark of Lucent Technologies. AutoPACE is a registered trademark of Lucent Technologies. BILLDATS is a registered trademark of Lucent Technologies. DEFINITY is a registered trademark of Lucent Technologies. DOS Windows is a trademark of Sun Microsystems, Inc. Informix is a registered trademark of Informix Software, Inc. Intel is a registered trademark of the Intel Corporation. Motorola is a registered trademark of the Motorola Corporation. Paradyne is a trademark of Paradyne Corporation. Sun is a trademark of Sun Microsystems, Inc. Solaris is a trademark of Sun Microsystems, Inc. SPARC is a trademark of Sun Microsystems, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Ltd. Other trademarks may appear in this document as well. They are marked on first usage.
Contents
About This Document
Purpose Reasons for Reissue Intended Audience How to Use This Document Conventions Used Product Safety Labels How to Order Documentation How to Comment on This Document
DSN/CSN/ICN Hardware Descriptions CDN Hardware Description CDN CDN-I CDN-II CDN-IIx CDN-III RPCN Hardware Description Direct Link Node Hardware Description SS7 Node Hardware Description EIN Ethernet Interface Node CNI Integrity Process Descriptions Error Analysis and Recovery Process Automatic Ring Recovery Process Node Audit Capability Ring Audit Capability RPCN Token Audit CNI Safety Net Capability Inhibiting CNI Safety Net Allowing CNI Safety Net Feature General Maintenance Daily Activity Recommendation Faulty Node Recovery Strategy Routine Diagnostics Fault Descriptions
Issue 16.0
December 2000
401-661-045
RAC Parity/Format Error Unexplained Loss of Token SRC Match RAC Output Parity Error General RAC Error Detected Node Audit Failure Interframe Buffer Parity Error Read Format Error Write Format Error Emergency Maintenance Ring Down Recovery Rolling CNI Initializations Global CDN Recovery Single CDN Recovery
General Operation of the Ring Ring Nodes Ring Peripheral Controller Nodes Basic IMS User Nodes Direct Link Nodes (DLN) Call Processor/Data Base Nodes (CDN) Interframe Buffers Node Names and Addresses Ring Message Format Reconfigurations Node Quarantine Node Isolation The Ring Config Module Initializations Level-3 IMS Initializations (FPI and Boot) Level-4 IMS Initializations (FPI and Boot) Audits Central Node Control Audit (AUD CNC) Node State Audit (AUD NODEST) Node Audit
3 Ring Maintenance
Overview Automatic Ring Maintenance EAR or Ring Recovery ARR or Deferrable Node Recovery Manual Ring Maintenance Ring Maintenance Interfaces Ring Diagnostics Guide to Critical Ring Maintenance Examples of Ring Maintenance Responses to Single, Ring-Related Faults Responses to Multiple, Ring-Related Faults
Introduction Ring Fault Conditions and Maintenance Approach Ring Node Out-of-Service Single-Ring Node Isolation Multiple-Ring Node Isolation Ring Down Ring Generic Access Package (RGRASP) Feature Definition Feature Description Software Impact Software Description User Profile Description of Feature Operation Equipment Configuration Data (ECD) Recent Change Procedures Measurement Network Management Impact Maintenance/Troubleshooting Impact Recording Output Messages Audits
Critical Events Support Tools Related Documentation Cross-References
Introduction Critical Event Message Output Logging Critical Events Short Form CNCE Message Long Form CNCE Message Using the CHG:CEPARM Command CNCE Descriptions
Introduction Overview Diagnostics Hardware and Interfaces System Maintenance Interfaces Performing Diagnostics Diagnostic Message Structure System Diagnostics Denied Diagnostic Requests Inhibiting Diagnostic Requests Diagnostic Aborts and Audits Operating System Diagnostics
Introduction Equipment Description and Handling Precautions Power Packs and Fusing Descriptions Fan and Filter Maintenance Ring Node Circuit Pack Handling Precautions
Ring Node Equipment Visual Indicators Removing Affected Equipment From Service UN122C and UN123B Combination Circuit Pack Installation Voice Frequency Link Hardware Equipment Replacement Procedures
Introduction Data Structures General Information Blockage Error Hard Ring Parity Errors Orphan Byte Error Soft Ring Parity Error Interframe Buffer Parity Error RAC Output Parity Error Write Format Error Read Format Error Received Too Short Error Read Inhibit Error Excessive Ring Command Interrupts Token Removed from Ring Source Match Error Miscellaneous RAC Problem Unexpected Loss of Token Checksum Audit Failure Node Processor Parity Failure
Ring Transport Errors Ring-Related Errors Node-Related Errors Errors Without Consequences Unexplained Loss of Token Some IMS Input Messages
Setting the ECD Flag for Manual Ring Mode ECD Values for Interframe Buffers
Figures
Ring Maintenance
3-1. A 1105 Display Page
3-2. An 1106 Display Page
3-3. Isolated RACs of BISO and EISO Nodes
3-4. Manual Recovery - Method One
3-5. Manual Recovery - Method Two
4-3. New BISO Established
4-4. Diagnosing EISO Node
4-5. Two or More Faulty Nodes
4-6. New BISO Node
4-7. More Than One Faulty Node
Tables
Ring Maintenance
3-1. Node Problems Mapped to Maintenance States and EAR Actions
3-2. ARR Responses to Maintenance-States
3-3. Output Messages that Report ARR Actions
3-4. Alarms Associated with IMS Output Messages
3-5. 1105-Page Symbols of Node Major States
3-6. Circuit Pack LED States
6-6. IRN LN (LI4S/SS7) Node Diagnostic Phases
6-7. IRN DLNE Node Diagnostic Phases
6-8. IRN2 DLN30 Node Diagnostic Phases
6-9. IRN2 DLN60 Node Diagnostic Phases
6-10. IRN CDN-I Diagnostic Phases
6-11. IRN2 CDN-II/CDN-IIx Diagnostic Phases
6-12. IRN2 CDN-III Diagnostic Phases
6-13. IRN2 EIN Node Diagnostic Phases
6-14. IRN MDL (SCN, DSN, ICN) Diagnostic Phases
6-15. Discontinued Availability CP Listings
6-16. IRN and IRN2 RPC Trouble Location CP List
6-17. IRN LN (LIN-E/SS7) Trouble Location CP List
6-18. IRN LN (LI4S/SS7) Trouble Location CP List
6-19. IRN DLNE Trouble Location CP List
6-20. IRN2 DLN30 Trouble Location CP List
6-21. IRN2 DLN60 Trouble Location CP List
6-22. IRN CDN-I Manual Trouble Location CP List
6-23. IRN2 CDN-II/CDN-IIx Manual Trouble Location CP List
6-24. IRN2 CDN-III Trouble Location CP List
6-25. IRN2 EIN Node Trouble Location CP List
6-26. IRN MDL (CSN, DSN, ICN) Trouble Location CP List
6-27. Physical Node ID (Decimal Representation)
6-28. Physical Node ID (Hexadecimal Representation)
6-29. Physical Node Addresses (Decimal Representation)
6-30. Physical Node Addresses (Hexadecimal Representation)
This chapter gives an overview of the contents, intended audience, and use of the Flexent/AUTOPLEX Wireless Network Systems Common Network Interface (CNI) Ring Maintenance manual.
Purpose
This guide gives you the instructions to maintain and troubleshoot the CNI Ring as used in a Flexent/AUTOPLEX wireless network. NOTE: This document is not intended for use with the 5ESS Digital Cellular Switch (DCS) component of a Flexent/AUTOPLEX wireless network. The 5ESS DCS documentation should be used for ring maintenance.
Reasons for Reissue
To correct erroneous information
To revise any technical errors
To make quality improvements
Intended Audience
The audience for this guide includes users who maintain the CNI ring. These users may be Lucent Technologies support personnel (CTSO) or the cellular provider's technicians.
How to Use This Document
Chapter 1, Overview of the CNI Ring: Describes the components of a CNI ring.
Chapter 3, Ring Maintenance: Explains the maintenance philosophy behind the CNI ring.
Chapter 4, Ring and Ring Node Maintenance Procedures: Explains how to run the maintenance procedures for both the ring and the ring nodes.
Chapter 5, Ring Critical Events: Explains events that indicate abnormal behavior in the ring.
Chapter 6, Diagnostic User's Guide: Explains how to perform diagnostics on ring nodes for a CNI ring-based office.
Chapter 7, Equipment Handling Procedures: Describes how to handle equipment when replacing hardware on the CNI ring.
Appendix A, Ring Error Analysis and Recovery: Describes the ring error analysis and recovery procedures and mechanisms.
Appendix B, Ring Maintenance Reference Material: Contains reference material for maintaining the CNI ring.
Conventions Used
Specific typography is used in this guide to show actions or results.
Commands you enter on the keyboard are shown in bold
Data screens or responses from the system are shown in constant width
Options for commands are shown in italics
Keys that must be pressed on your keyboard are shown in ENTER
Product Safety Labels
DANGER:
Indicates the presence of a hazard that will cause death or severe personal injury if the hazard is not avoided.
WARNING:
Indicates the presence of a hazard that can cause death or severe personal injury if the hazard is not avoided.
CAUTION:
Indicates the presence of a hazard that will or can cause minor personal injury or property damage if the hazard is not avoided.
How to Order Documentation
Locations outside of the United States:
Australia and all European countries: (317) 322-6416
Asia Pacific and China: (317) 322-6411
North America (excluding U.S.) and all other countries: (317) 322-6646
FAX for all international customers: (317) 322-6699
Product documentation can be ordered by mail using this address:
Lucent Technologies Customer Information Center
Attention: Order Entry Section
2855 N. Franklin Road
P.O. Box 19901
Indianapolis, Indiana 46219 U.S.A.
To order documentation electronically, visit the Lucent Technologies Customer Information Center web site at:
http://www.cic.lucent.com
Contents
DSN/CSN/ICN Hardware Descriptions CDN Hardware Description
CDN CDN-I Double Plate CDN-I Single Plate CDN-I CDN-II CDN-IIx CDN-III
RPCN Hardware Description Direct Link Node Hardware Description SS7 Node Hardware Description CNI Integrity Process Descriptions Error Analysis and Recovery Process Automatic Ring Recovery Process Node Audit Capability Ring Audit Capability RPCN Token Audit CNI Safety Net Capability
Inhibiting CNI Safety Net Allowing CNI Safety Net Feature Daily Activity Recommendation Faulty Node Recovery Strategy
General Maintenance
Routine Diagnostics RAC Parity/Format Error Cause Effect Craft Recovery Action Unexplained Loss of Token Effect Craft Recovery Action SRC Match Cause Effect Craft Recovery Action RAC Output Parity Error Cause Effect Craft Recovery Action General RAC Error Detected Cause Effect Craft Recovery Action Node Audit Failure Cause Effect Craft Recovery Action Interframe Buffer Parity Error Cause Effect Craft Recovery Action Read Format Error Cause Effect Craft Recovery Action Write Format Error Cause Effect Craft Recovery Action
Fault Descriptions
Emergency Maintenance
Ring Down Recovery Rolling CNI Initializations Global CDN Recovery Single CDN Recovery
The Common Network Interface (CNI) ring serves as the medium that connects the various cellular processors together. The following sections describe the basic hardware configuration of each type of processor.
MC3F026A1B UN303C
MC3F026A1C UN304
All of these versions can be used in a CSN, DSN or ICN. The IRN board can be found in the Node Processor (NP) slot of each node. A new circuit pack, the UN304/UN304B, has replaced the UN303 in many applications. When the UN304 is used, the node is called an IRN2. When the UN304B is used, the node is called the IRN2B. Unless specifically stated, the term IRN can apply to any of these circuit packs. When an IRN2B is used in a CSN, it is known as a CSN Enhanced (CSNE). Unless specified otherwise, all references to CSN can include the CSNE.
The memory data link (MDL) circuit pack handles the transfer of information between the data links and the node processor. A CSN can be equipped with two MDL boards (MDL0 and MDL1), with each MDL capable of handling four data links. DSNs and ICNs should be equipped with only one MDL board. There are two types of MDL circuit packs: a TN1317 version and a TN1640 version. Either type can be used in a CSN, DSN or ICN. The TN1640 version provides additional message throughput and should be used in CSNs containing heavily loaded cell sites. See the System Capacity Monitoring and Engineering Guidelines, 401-610-009, for recommendations on how to assign CSN, DSN or ICN data links.
The data links coming into each of these node types connect to an 11A, 12A, 13A, or 13B adaptor board. The 11A adaptor board is used for RS232 connections, the 12A adaptor board is used for RS449 connections, and the 13A and 13B adaptor boards are used for V.35 connections. These adaptor boards are attached to the backplane of the CSN/DSN/ICN on the vertical slot location occupied by the MDL boards. Each adaptor board holds up to four data links and there is one adaptor board for each equipped MDL board.
CDN
CDN-I [sometimes referred to as a Standard Multi-Application Real Time (SMART) Node (SN)]
CDN-II [sometimes referred to as a Turbo CDN (TCDN)]
CDN-IIx
CDN-III.
Unless specified otherwise, references to CDN in this document apply to any of these versions.
CDN
The original CDN used a double-plate RAP with 2-Mbyte memory boards. A double plate CDN occupies two horizontal mounting plate locations in a CNI frame. The CCC and CCS pair can be either a UN237 and UN236 pair or a UN625 and UN626 pair. They must be a matched pair. That is, a UN2XX series CCC/CCS board is not compatible with a UN6XX series CCC/CCS board. The MASC board can be either a UN95 board or a UN295 board. There can be up to four MASC boards in the FLEXENT/AUTOPLEX environment (MASC0 through MASC3). The MASA boards are always TN56 boards. Each TN56 board provides 2 Mbytes of memory, and there can be up to eight MASA boards per MASC memory group. The NPI board is always a TN1349 board.
CDN-I
In the FLEXENT/AUTOPLEX environment, the node is always equipped with an IRN circuit pack. Only two of the three possible microcode versions are approved for use in a CDN-I. The approved versions are:
MC3F018A1 UN303B
MC3F026A1 UN303B
The RAP portion of a CDN-I is a 3B15-based computer. The basic functional components that make up this unit are a central controller cache (CCC) board, a central controller support (CCS) board, a main store controller (MASC) board, the main store array (MASA) memory boards, and a node processor interface (NPI) board. A CDN-I comes in two different versions commonly referred to as double plate or single plate CDN-I.
CDN-II
The CDN-II is a Turbo CDN node type. The CDN-II is composed of an IRN2, an 80386-based NP, and an AP30 (prime) attached processor (AP). The AP30 is a 68030-based processor board with 80 Mbytes of local memory (16 Mbytes on the base board and an additional 64 Mbytes of zig-zag in-line package (ZIP) memory on a mezzanine board).
CDN-IIx
The CDN-IIx is a modified Turbo CDN node type. The CDN-IIx is composed of an IRN2, an 80386-based NP, and a modified AP30 attached processor. The modified AP30 is a 68030-based processor board with 16 Mbytes of local memory on the base board and from 64 to 256 Mbytes on a mezzanine board. The additional memory comes from two to eight 32-Mbyte serial in-line memory modules (SIMM). Unless otherwise specified, any reference to CDN-II applies to both the CDN-II and CDN-IIx.
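The memory figures quoted for the CDN-II and CDN-IIx can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and treating total local memory as the base-board amount plus the mezzanine amount is an assumption drawn from the CDN-II description (16 + 64 = 80 Mbytes).

```python
BASE_BOARD_MBYTES = 16   # AP30 base board memory, from the text
ZIP_MBYTES = 64          # CDN-II mezzanine ZIP memory, from the text
SIMM_MBYTES = 32         # size of each CDN-IIx mezzanine SIMM, from the text


def cdn_ii_memory_mbytes() -> int:
    """Total AP30 local memory on a CDN-II (base board plus ZIP mezzanine)."""
    return BASE_BOARD_MBYTES + ZIP_MBYTES


def cdn_iix_mezzanine_mbytes(simm_count: int) -> int:
    """Mezzanine memory on a CDN-IIx for a given number of 32-Mbyte SIMMs."""
    if not 2 <= simm_count <= 8:
        raise ValueError("the text allows two to eight SIMMs")
    return simm_count * SIMM_MBYTES


def cdn_iix_memory_mbytes(simm_count: int) -> int:
    """Assumed total: base board plus mezzanine memory."""
    return BASE_BOARD_MBYTES + cdn_iix_mezzanine_mbytes(simm_count)
```

With two SIMMs the mezzanine holds 64 Mbytes and with eight it holds 256 Mbytes, which matches the range given above.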
CDN-III
The CDN-III is an improved CDN that may be used to upgrade CDN-II or CDN-IIx type nodes. The CDN-III consists of an IRN2 node core and AP60 attached processor (TN2523), providing greater processing and memory capacity than previous CDNs. The AP60 uses an MC68LC060 processor.
CAUTION:
Never use MC3F014A1 or MC3F18A1 microcode versions in an RPCN. Doing so could seriously hinder the ring's ability to perform automatic fault recovery tasks.
The RPCN can also be equipped with an IRN2 or IRN2B board, the UN304 or UN304B. This board is also located in the NP slot of the RPCN. The RPCN has a duplex dual serial bus selector (DDSBS) which terminates the ECP's connection to the ring. This board is a TN69B and has a connection from the RPCN to each Control Unit (CU) of the ECP (CU0, CU1).
The RPCN also contains a 3B Interface (3BI) board which serves as the interface between the DDSBS and the NP of the RPCN. This board is a TN914.
The DLNE has IRNB, AP30, 3BI, and DDSBS boards. The DLN30 replaces the IRNB board with an IRN2B to provide increased performance and higher reliability. The DLN60 provides more processing power and memory than previous types of DLNs. The DLN60 uses an IRN2 node core with an AP60 attached processor. The DLN60 does not have a 3B21D computer interface.
Integrated Ring Node (IRN) 2 (IRN2) circuit pack (CP), UN304B (MC3F024AIB)
EIN Link Interface (ELI) CP, TN4016
Paddleboard, 9822EB
Cable ED3F064-37 G80.
If this is the second time a node has been removed from service by EAR in the past hour, ARR will diagnose the node and only restore the unit if it passes all diagnostic phases. If this is the third time a node has been removed from service by EAR in the past hour, the node will be left in the out-of-service state. This link node will remain in this state until craft takes the appropriate recovery action to restore the node to service.
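The escalation described above can be summarized as a small decision function. This is a minimal sketch, not the ARR implementation: the names are hypothetical, the behavior for a first removal (restore without a full diagnostic run) is an assumption because it is not stated in this passage, and treating a fourth or later removal like the third is also an assumption.

```python
from enum import Enum


class ArrAction(Enum):
    RESTORE = "restore the node (assumed first-removal behavior)"
    DIAGNOSE_THEN_RESTORE = "diagnose; restore only if all phases pass"
    LEAVE_OUT_OF_SERVICE = "leave out of service until craft intervenes"


def arr_action(removals_in_past_hour: int) -> ArrAction:
    """Map the number of EAR removals in the past hour to an ARR response."""
    if removals_in_past_hour < 1:
        raise ValueError("the node has not been removed by EAR")
    if removals_in_past_hour == 1:
        return ArrAction.RESTORE                 # assumption, see lead-in
    if removals_in_past_hour == 2:
        return ArrAction.DIAGNOSE_THEN_RESTORE   # second removal, from the text
    return ArrAction.LEAVE_OUT_OF_SERVICE        # third removal, from the text
```

The point of the sketch is that the response depends only on how often EAR has removed the same node within the hour, so a node that keeps failing is not restored automatically.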
1. Enter a 42 poke command.
2. Enter i (inhibit) for the parameter value.
3. Next, a 50 initialization is required to set the flag in ECP memory.
Once Safety Net has been inhibited, it will remain in this state until a 54 initialization occurs or the inhibit flag is cleared from the EAI page (see following section). Whenever Safety Net is inhibited, it is critical that craft personnel remember to turn the feature back on once the source of the fault has been cleared. Failure to do so could result in an extended outage which Safety Net may have avoided.
1. Enter a 42 poke command.
2. Enter a to allow the feature to function.
3. Next, a 50 initialization is required to clear the inhibit flag in ECP memory.
General Maintenance
This section provides craft with information which could assist in identifying potentially faulty hardware before the problem is serious enough to cause a ring outage. Also included in this section are descriptions of common CNI ring faults and the steps necessary to correct the situation.
Routine Diagnostics
Given the ring's ability to detect and report suspected faulty hardware, it is not recommended that diagnostics be performed on every node around the ring. However, it is recommended that RPCNs, CDNs and DLNs be taken down at least once a month (weekly if possible) and diagnosed. These nodes have been selected for preventive maintenance due to both their importance to system performance and the extended amount of time it takes to diagnose and restore these nodes should a fault occur. While CSNs, DSNs, ICNs and SS7 nodes are certainly important to the system, their loss does not seriously threaten system performance. Also, in the event one of these nodes is lost, the recovery time is minimal if this is the first fault.
NOTE: On the subject of performing routine diagnostics, there is a critical difference between a single plate and double plate (TN1398 or TN56 memory boards) CDN-I unit. Requesting diagnostics on a double plate CDN-I will result in the entire CDN-I being diagnosed. The same cannot be said of a single plate CDN-I. For a single plate CDN-I, craft MUST specify that demand phases 54 through 61 be executed. These phases are responsible for diagnosing the 16-Mbyte memory boards (one phase for each MASA board equipped). These memory diagnostics are done on a demand basis only, due to the time required to complete memory diagnostics on the TN1398 circuit packs.
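As a planning aid, the demand phases needed for a single plate CDN-I can be listed from the number of MASA boards equipped. The sketch below uses hypothetical names and assumes that phases are assigned consecutively starting at phase 54, one per equipped board; it does not model the diagnostic input message itself.

```python
FIRST_MASA_PHASE = 54   # first demand phase for MASA memory, from the text
LAST_MASA_PHASE = 61    # last demand phase for MASA memory, from the text


def masa_demand_phases(masa_boards_equipped: int) -> list:
    """Return the demand phases to request, one per equipped MASA board.

    Assumes consecutive assignment starting at phase 54.
    """
    max_boards = LAST_MASA_PHASE - FIRST_MASA_PHASE + 1
    if not 1 <= masa_boards_equipped <= max_boards:
        raise ValueError("expected between 1 and %d MASA boards" % max_boards)
    return list(range(FIRST_MASA_PHASE, FIRST_MASA_PHASE + masa_boards_equipped))
```

For a unit with three MASA boards, this yields phases 54, 55, and 56; a fully equipped unit needs all eight phases.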
Fault Descriptions
This section describes various CNI ring faults. The output message associated with the fault is presented, followed by the cause of the fault, the effect the fault has on the ring, and the recovery action to clear the fault. For a more detailed description of possible faults, see Appendix A, Ring Error Analysis and Recovery.
In the following descriptions, the terms upstream node and downstream node will be used. These terms describe the relative position of nodes and are based on the direction of data flow on the rings. Any particular node will RECEIVE data from its upstream neighbor and will SEND data to its downstream neighbor. Since the data flows in opposite directions on the two rings, a node's upstream neighbor on ring 1 is the downstream neighbor on ring 0, and its upstream neighbor on ring 0 is the downstream neighbor on ring 1. For example, with respect to ring 0, LN00-7's upstream neighbor is LN00-6 and its downstream neighbor is LN00-8.
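The upstream and downstream relationship can be expressed as a short function. This is a simplified sketch with hypothetical names: it works on member numbers within one node group only and ignores group boundaries, RPCNs, and interframe buffers, where the neighbor is not simply the next member number.

```python
def neighbors(member: int, ring: int) -> tuple:
    """Return (upstream, downstream) member numbers for a node on a ring.

    On ring 0 the lower-numbered node is upstream; on ring 1 the data
    flows in the opposite direction, so the roles are swapped.
    """
    if ring not in (0, 1):
        raise ValueError("ring must be 0 or 1")
    if ring == 0:
        return member - 1, member + 1
    return member + 1, member - 1


# LN00-7 on ring 0: upstream is member 6, downstream is member 8.
upstream, downstream = neighbors(7, 0)
```

Calling the same function with ring 1 swaps the two values, which is why an output message must always be read together with the ring (RAC) it implicates.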
Cause
The reporting node, LN00-7 in this example, is reporting that its upstream neighbor on RAC 0 (LN00 6) tried to pass a bad message to it. This message is used to report both bad parity and an orphan byte failure. The effect and recovery action are the same for both error types, so craft does not need to determine which of the two occurred.
Effect
The node which had the bad message presented to it will refuse to accept the message. This will force the node offering the bad message to report ring blockage to EAR. EAR will attempt to reestablish normal ring communication by performing a Level 0 ring recovery. If this fails to correct the error condition, EAR will escalate to a Level 1 ring recovery which could result in nodes being removed and isolated.
1. If there is a pair of interframe buffer boards (IFB) between the node reporting the fault and the upstream neighbor, replace the IFB associated with the node reporting the problem.
2. If the fault persists, and IFBs are involved, replace the IFB in the node upstream of the node reporting the fault.
3. If the fault persists, replace the IRN board in the node upstream of the node reporting the problem.
4. If the fault persists, replace the IRN board in the node reporting the problem.
5. If the fault persists, and there are IFBs involved, there could be a cable problem. Call for assistance to isolate the source of the fault. See Figure 1-1 on page 1-15.
Figure 1-1. Flowchart (Charts 1, 1A, and 1B); recovered box text and notes follow.
Chart 1: Transient fault. Monitor the /etc/log/RPTERR1 log file for several weeks. If the fault returns, go to the "no" leg of the 1st occurrence decision.
Chart 1A, Note 1: If RAC 0 is implicated in the output message, the upstream neighbor is the lower node number (LN32-4 is upstream of LN32-5). If RAC 1 is implicated, the upstream neighbor is the higher node number (LN32-6 is upstream of LN32-5).
Chart 1B, Note 2: RPCN32 is upstream of the last node in group 00 (or group 31 if equipped) on RAC 1 and downstream on RAC 0. RPCN00 is upstream of the last node in group 32 (or group 63 if equipped) on RAC 1 and downstream on RAC 0.
Chart 1B: Replace the IFB in the node reporting the fault.
Chart 1B: Replace the IRN in the node reporting the fault. Note 3: If the node is an RPCN and it has no IRN, replace the R0 board if RAC 0 is implicated or the R1 board if RAC 1 is implicated.
Chart 1B: Possible cable problem. Call for assistance in swapping cables between rings.
Chart 1B: Bad cable. Configure cables so that the faulty cable is in RAC 1. Obtain a new cable ASAP!
Effect
EAR will initiate a token tracking procedure in an attempt to determine where the token was last seen. If the procedure is successful, the following message will result:
REPT TOKEN TRACK TOKEN WAS LOST BETWEEN LN63 1 AND LN63 6 ON RING: 0 X00000000 X3F63F104 X00300001 X40040001
There are several other versions of the message that could result, depending on the outcome of the token tracking procedure. See the FLEXENT/AUTOPLEX Output Message Manual for the other versions of this message.
EAR will attempt to reestablish normal ring communication by performing a Level 0 ring recovery. If this fails to correct the error condition, EAR will escalate the ring recovery to a Level 1, which could result in nodes being removed and isolated.
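When reviewing a saved copy of the log, the two nodes and the ring named in a token tracking report can be pulled out with a regular expression. The following is a minimal sketch based only on the single sample message shown above; the function name is hypothetical, other versions of the message are not handled, and the pattern assumes both nodes are reported in the "LNgg m" form.

```python
import re
from typing import Optional

# Pattern derived from the one sample message in this section (an assumption).
TOKEN_TRACK = re.compile(
    r"TOKEN WAS LOST BETWEEN (LN\d+ \d+) AND (LN\d+ \d+) ON RING: ([01])"
)
HEX_WORD = re.compile(r"\bX([0-9A-F]{8})\b")


def parse_token_track(message: str) -> Optional[dict]:
    """Extract the node pair, ring number, and data words, or return None."""
    match = TOKEN_TRACK.search(message)
    if match is None:
        return None
    return {
        "between": (match.group(1), match.group(2)),
        "ring": int(match.group(3)),
        "data_words": HEX_WORD.findall(message),
    }


SAMPLE = (
    "REPT TOKEN TRACK TOKEN WAS LOST BETWEEN LN63 1 AND LN63 6 ON RING: 0 "
    "X00000000 X3F63F104 X00300001 X40040001"
)
report = parse_token_track(SAMPLE)
```

The node pair returned here is the pair referred to as "the two nodes identified in the token tracking report" in the recovery steps.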
1. If there is a pair of interframe buffer boards (IFB) between the two nodes identified in the token tracking report, replace the IFB in one of the nodes.
2. If the fault persists, and IFBs are involved, replace the IFB in the other node identified in the token tracking report.
3. If the fault persists, replace the IRN board in one of the two nodes identified in the token tracking report.
4. If the fault persists, replace the IRN board in the other node identified in the token tracking report.
5. If the fault persists, call for assistance. See Figure 1-2 on page 1-20.
Figure 1-2. Flowchart (Charts 2 and 2A); recovered box text follows.
Chart 2: Examine the ROP and the UNIX file /etc/log/RPTERR1 for the token tracking report.
Chart 2: Transient fault. Monitor the /etc/log/RPTERR1 log file for several weeks to see if the fault returns.
Chart 2: Diagnose both nodes. Replace packs and diagnose as per TLP list.
Chart 2: Replace the IRN board in one of the nodes. If the node is an RPCN and it is not an IRN, replace the R0 board if ring 0 is implicated or the R1 board if ring 1 is implicated.
Chart 2A: Possible cable problem. Call for assistance in swapping cables between rings.
Chart 2A: Bad cable. Configure cables so that the faulty cable is in RAC 1. Obtain a new cable ASAP!
SRC Match
The output message present on the ROP and in the RPTERR1 log file for this fault is as follows:
REPT RING TRANSPORT ERR RMV LN33 7 RQSTD; SRC MATCH RPTD BY LN31 6 X6FB015F4 X352070B8 (2834204595)
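To tally how often a given node is being removed, it can help to split a ring transport error report into the node to be removed, the error type, and the reporting node. The sketch below is built from the one sample message above; the function name is hypothetical, and the assumption that both nodes always appear in the "LNgg m" form may not hold for every report.

```python
import re
from typing import Optional

# Pattern derived from the single sample message in this section (an assumption).
TRANSPORT_ERR = re.compile(
    r"REPT RING TRANSPORT ERR RMV (LN\d+ \d+) RQSTD; (.+?) RPTD BY (LN\d+ \d+)"
)


def parse_transport_err(message: str) -> Optional[dict]:
    """Return the removed node, error type, and reporting node, or None."""
    match = TRANSPORT_ERR.search(message)
    if match is None:
        return None
    return {
        "removed_node": match.group(1),
        "error_type": match.group(2),
        "reported_by": match.group(3),
    }


SAMPLE = (
    "REPT RING TRANSPORT ERR RMV LN33 7 RQSTD; SRC MATCH RPTD BY LN31 6 "
    "X6FB015F4 X352070B8 (2834204595)"
)
fault = parse_transport_err(SAMPLE)
```

In the sample, the destination node LN33 7 is the one requested for removal, and the source node LN31 6 is the reporter.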
Cause
An SRC match failure results when a node does not take a message from the CNI ring that was addressed to it. This message will eventually return to the source node, which removes the message from the ring and reports an SRC match to the ECP against the destination node.
Effect
As stated above, the message will eventually return to the source node. The source node will remove the message from the ring and report the SRC match to the EAR. This will always result in the destination node being removed from service. ARR will then restore the node to service either conditionally or unconditionally, depending on the frequency of the faults against this node.
NOTE: Miller Stevenson Company markets an aerosol form of the solvent-lubricant which is recommended (1.0 percent OS-124 in Freon TA) for use on CNI ring backplanes and circuit packs. This product is marketed as MS-181.
If the fault persists, replace circuit packs in the following order:
1. If the faults are occurring immediately after the node is restored to service, check the ECD (rcvecd) and the application database (apxrcv, iun form) to verify they are in sync with respect to the node type.
2. If the fault persists, replace the IRN circuit pack.
3. If the fault persists, replace the MDL boards one at a time, or replace the LLI board if the node is an SS7 node.
4. If the node is a CDN, check the RPTERR1 log file for the existence of a CDN panic message in the form of:
REPT COM100 TBL LN00 07 NADR: XC07 Panic : Hardware Local Bus Parity Error: CCS0(lba=0x0): CSRs=0x61100028,0x0 MASC0(lba=0x100000): CSRs=0x422054,0x4c00b500
CCS 61100028 MASC 00422054 NPI 00000000
5. If a message similar to this appears, it is not necessarily a local bus parity error. Go directly to page 3 of Figure 1-3 for CDN assistance.
6. If the fault persists, or the panic message is not present for a CDN, call for assistance in clearing the fault. See Figure 1-3 on page 1-24.
Figure 1-3. SRC Match. Flowchart (Charts 3, 3A, 3B, and 3C); recovered box text follows.
Chart 3: Run diagnostics on the faulted node. Replace packs and diagnose as per TLP list.
Chart 3: Examine the UNIX file /etc/log/RPTERR1. Transient fault: monitor the /etc/log/RPTERR1 log file for several weeks to see if the fault returns.
Chart 3: Determine fault frequency by examining the ROP or the RPTERR1 log file.
Chart 3B: Check the RPTERR1 error log for a PANIC: HARDWARE message for this CDN.
Chart 3B: Cache error.
Chart 3C: Starting at demand Phase 54, run one phase for each MASA board equipped (54-61).
Chart 3C: Go to next two pages for instructions on converting the address in the panic message to a MASA board location.
Chart 3C: Replace the suspected MASA board.
Chart 3C: Insert two new TN56 boards in the first two MASA slots. If the fault still exists, return the original boards and slide the new boards to the next slot. Continue until the two new boards have been tried in each MASA position.
Chart 3C: Insert a new MASC board. If the fault still exists, return the original board and slide the new board to the next MASC until the new board has been tried in each MASC.
Chart 3C: Insert a new TN1398 board in the first MASA slot. If the fault still exists, return the original board and slide the new board to the next slot. Continue until the new board has been tried in each MASA slot.
Cause
The node reporting the fault detected that it had attempted to write a message with bad parity to the ring.
Effect
The node which had the bad message presented to it will refuse to accept the message. This will force the node offering the bad message to report ring blockage to EAR. EAR will attempt to reestablish normal ring communication by performing a Level 0 ring recovery. As part of this recovery process, each node will reread the message that it had presented to the downstream neighbor. When doing this, the node reporting the fault detected that it had presented a message containing bad parity to its downstream neighbor. If this fails to correct the error condition, EAR will escalate the ring recovery to a Level 1 which could result in nodes being removed and isolated.
2. If the fault persists, call for assistance. See Figure 1-4 on page 1-30.
Figure 1-4. (Chart 4)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Transient fault. Monitor the /etc/log/RPTERR1 log file for several weeks to see if the fault returns.
- Replace the IRN board in the node reporting the problem. Note: If the node is an RPCN and it has no IRN, replace the R0 board if ring 0 is implicated or the R1 board if ring 1 is implicated.
Decisions: ATP? 1st occurrence? Cleared?
Exits: Done.
Cause
This is a catch-all error type used to report unexpected node hardware or software conditions.
Effect
The node reporting the problem will not accept any data from the upstream neighbor node, thus forcing that node to report blockage.
Figure 1-5. (Chart 5)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Replace the IRN board in the node reporting the problem. Note: If the node is an RPCN and it has no IRN, replace the R0 board if ring 0 is implicated or the R1 board if ring 1 is implicated.
- Transient fault. Monitor the /etc/log/RPTERR1 log file for several weeks to see if the fault returns.
- Replace the IRN in the upstream neighbor. Note: If RAC 0 is implicated, the upstream neighbor is the lower node number (LN32-4 is upstream of LN32-5). If RAC 1 is implicated, the upstream neighbor is the higher node number (LN32-6 is upstream of LN32-5).
Decisions: ATP? 1st occurrence? Cleared?
Exits: Done.
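The upstream-neighbor rule in the Chart 5 note can be written as a small Python helper. This is a sketch under stated assumptions: the function name and the (group, member) representation are hypothetical, and it does not handle wraparound at the first or last node of a group, which the note does not describe.

```python
def upstream_neighbor(group, member, rac):
    """Return (group, member) of the upstream neighbor for the implicated RAC.

    RAC 0: the upstream neighbor is the lower node number.
    RAC 1: the upstream neighbor is the higher node number.
    Assumption: no wraparound across group boundaries is modeled.
    """
    if rac == 0:
        return (group, member - 1)
    if rac == 1:
        return (group, member + 1)
    raise ValueError("rac must be 0 or 1")

def node_name(group, member):
    """Format a node as it appears in the text, for example LN32-5."""
    return f"LN{group:02d}-{member}"

example_rac0 = node_name(*upstream_neighbor(32, 5, 0))
example_rac1 = node_name(*upstream_neighbor(32, 5, 1))
```

With the values from the note, node LN32-5 maps to LN32-4 on RAC 0 and to LN32-6 on RAC 1.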
Cause
The Node Audit process has detected a node that is not responding to the node audit requests, but the rest of the ring seems to be functioning normally.
Effect
The node at fault will be removed from service.
Figure 1-6. NAUD Failure (Chart 6)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Diagnose the faulty node.
- Replace and diagnose packs as per the TLP list.
- Examine the UNIX file /etc/log/RPTERR1.
- Transient fault. Monitor RPTERR1 for several weeks to see if the fault returns.
- This fault could be the result of noisy data links. Run CMpfcnts to identify possible problem links.
Decisions: ATP? 1st occurrence? Is node a CDN? Familiar with CMpfcnts tool? Cleared?
Exits: Done; Go to Chart 6A.
Figure 1-6. NAUD Failure (Chart 6A)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Decisions: Cleared?
Exits: Done.
Cause
The IFB board upstream of the node reporting the fault detected that a message with bad parity has been presented to it.
Effect
The IFB will set a bit and pass the message on to the downstream node. This node will refuse to accept the bad message, thus forcing the node which presented the bad message to the IFB to report ring blockage to EAR. EAR will attempt to reestablish normal ring communication by performing a Level 0 ring recovery. If this fails to correct the error condition, EAR will escalate the ring recovery to a Level 1 which could result in nodes being removed and isolated.
2. If the fault persists, replace the IFB in the node upstream of the node reporting the fault.
3. If the fault persists, replace the IFB in the node reporting the fault.
4. If the fault persists, replace the IRN board in the node reporting the problem.
5. If the fault persists, call for assistance. See Figure 1-7 on page 1-38.
Figure 1-7. (Chart 7)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Run diagnostics on the node reporting the problem.
- Replace packs and diagnose as per the TLP list.
- Replace the IRN in the upstream neighbor. Note 1: If RAC 0 is implicated, the upstream neighbor is the lower node number (LN32-4 is upstream of LN32-5). If RAC 1 is implicated, the upstream neighbor is the higher node number (LN32-6 is upstream of LN32-5). If RPCN, see Note 2.
- Transient fault. Monitor the RPTERR1 log file for several weeks to see if the fault returns.
- Replace the IFB in the node reporting the problem.
- Replace the IRN board in the node reporting the error. Note 2: If the node is an RPCN and it has no IRN, replace the R0 board if ring 0 is implicated or the R1 board if ring 1 is implicated.
Decisions: ATP? 1st occurrence? Cleared?
Exits: Done.
Cause
The reporting node, LN00-7 in this example, is reporting that its upstream neighbor on RAC 0 (LN00-6) tried to pass a message with a bad message length. This error usually indicates that a node on the ring is clipping or mutilating messages as they pass through it. This fault type requires immediate attention. A clipped message, if undetected, could take the appearance of a valid maintenance message, possibly one that forces all nodes into a set quarantine state, thus removing them from service and resulting in a system outage.
Effect
The node which had the bad message presented to it will refuse to accept the message and will send an error report to the home RPCN. This will force the node offering the bad message to report ring blockage to EAR. EAR will attempt to reestablish normal ring communication by performing a level 0 ring recovery. If this fails to correct the error condition, EAR will escalate to a level 1 ring recovery which could result in nodes being removed and isolated.
NOTE: WRITE FORMAT ERROR messages may also be present and can be used to assist in locating the faulty segment.

All nodes in the suspected ring segment should be diagnosed. If diagnostics do not find a problem with any node, attempt to clear the fault by cleaning and reseating the circuit packs in the suspected segment using the recommended contact cleaner.

NOTE: Miller Stevenson Company markets an aerosol form of the recommended solvent-lubricant (1.0 percent OS-124 in Freon TA) for use on CNI ring backplanes and circuit packs. This product is marketed as MS-181.

If the fault persists, replace packs in the following order:

1. Select the first node in the suspected segment and replace the UN303 board. Monitor the RPTERR data daily to determine if the fault has been cleared.
2. If the fault persists, examine the additional faults reported. If the node reporting the fault is in the suspected segment, all nodes from the node reporting this new fault to the previous nodes reporting the fault can be removed from the suspected faulty list.
3. Repeat Step 1 for the next logical link node in the suspected faulty ring segment. If any node contains IFBs, replace these as well once the UN303 has been eliminated as a suspected pack.
4. If the fault persists, and all packs in the suspected segment have been replaced, call for assistance.
Cause
The reporting node, LN00-7 in this example, is reporting that a message it was attempting to write to the ring failed a validation check. This message is similar to the READ FORMAT ERROR type in that it usually indicates that a node on the ring is clipping or mutilating messages as they pass through it. This fault type requires immediate attention. A clipped message, if undetected, could take the appearance of a valid maintenance message, possibly one that forces all nodes into a set quarantine state, thus removing them from service and resulting in a system outage.
Effect
The node which was trying to write the message will not do so, nor will it accept the message being offered to it, and an error report is sent to the home RPCN. The nodes previous to the reporting node will report ring blockage to EAR. EAR will attempt to reestablish normal ring communication by performing a level 0 ring recovery. If this fails to correct the error condition, EAR will escalate to a level 1 ring recovery which could result in nodes being removed and isolated.
2. If the fault persists, examine the additional faults reported. If the node reporting the fault is in the suspected segment, all nodes from the node reporting this new fault to the previous nodes reporting the fault can be removed from the suspected faulty list.
3. Repeat Step 1 for the next logical link node in the suspected faulty ring segment. If any node contains IFBs, replace these as well once the UN303 has been eliminated as a suspect pack.
4. If the fault persists, and all packs in the suspected segment have been replaced, call for assistance.
Emergency Maintenance
This section is intended to assist craft in those instances where the CNI ring appears to be flat on its back and requires craft intervention to get the system operational. While this data provides useful information, it should not be used as a replacement for calling for immediate assistance when such a situation occurs. Lucent Technologies personnel should be contacted whenever system recovery is involved rather than waiting until the Ring Down Recovery section of this chapter has exhausted its helpful hints.
1. Determine if CNI Safety Net is requesting the CNI initializations. Do this by checking the ROP for the existence of SI15, SI22 or SI24 Defensive Check failures. If present, go to Step 2, else go to Step 5.
2. Disable CNI Safety Net by going to the Emergency Action Interface page and entering a 42 poke command. When the parameter field appears, enter i to inhibit Safety Net. Next, perform a 50 initialization to set the inhibit flag in memory. This should stop the rolling initializations so that the problem can be investigated. If so, go to Step 3, else go to Step 5.
3. If Safety Net was requesting the initializations due to no CDNs being active (SI24 asserts), determine if the rest of the ring appears to be up. If so, go to Step 4; for anything else, go to Step 5.
4. No CDNs are active, but the rest of the ring seems to be up. Go to the Global CDN Recovery and Single CDN Recovery sections in this chapter for assistance in recovering from this fault.
5. Either the ring is in a rolling initialization due to CNI not being able to get an RPCN up, or SI15/SI22 asserts were present due to CNI Safety Net firing.
6. Verify that there are no power interruptions to the ring.
7. If the problem persists, examine the ROP closely to determine if CNI software is flagging any node, or group of nodes, as being a possible source of the problem. If so, pull the IRN board out of those nodes to force isolation around that segment.
8. If the RPCNs are equipped with IRN boards, verify that they have the proper microcode versions. Again, only MC3F026A1 is approved for use in an RPCN.
9. If the problem persists, power down RPCN32 to force the ring to come up on RPCN00.
10. If the problem persists, restore power to RPCN32. The problem may be related to a bad CU in the ECP. Force the ECP to do a CU switch and attempt a CNI Level 4 initialization.
11. If the problem persists, force isolated segments by removing power from one mounting plate at a time (group of three nodes). After power is removed from a group of nodes, request a CNI Level 4. If the problem persists, restore power to the previous group and remove power from the next group. Repeat this step until every node has been tried in an isolated segment.
12. Again, it is assumed that you have already called for assistance, but if not, do so immediately. See Figure 1-8 on page 1-44.
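The branching in Steps 1 through 3 can be summarized as a small Python function. This is only an illustrative sketch of the text above, not a tool that exists on the system. The function name, its parameters, and the idea of passing the observed asserts as a set are all assumptions made for the example.

```python
SAFETY_NET_ASSERTS = {"SI15", "SI22", "SI24"}

def triage_step(asserts_seen, inits_stopped_after_inhibit, rest_of_ring_up):
    """Return the step number (4 or 5) that Steps 1-3 lead to.

    asserts_seen: set of assert names found on the ROP (hypothetical input).
    inits_stopped_after_inhibit: True if rolling initializations stopped
        after Safety Net was inhibited (Step 2).
    rest_of_ring_up: True if the ring other than the CDNs appears up (Step 3).
    """
    # Step 1: is Safety Net requesting the initializations?
    if not (asserts_seen & SAFETY_NET_ASSERTS):
        return 5
    # Step 2: did inhibiting Safety Net stop the rolling initializations?
    if not inits_stopped_after_inhibit:
        return 5
    # Step 3: SI24 asserts (no CDNs active) with the rest of the ring up.
    if "SI24" in asserts_seen and rest_of_ring_up:
        return 4
    return 5
```

For example, `triage_step({"SI24"}, True, True)` leads to Step 4 (CDN recovery), while any other combination falls through to Step 5.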
Figure 1-8. Ring Down (Chart 8)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Ring is down but taking no recovery action.
- Request a CNI Level 3 INIT to restart the driver.
- Request a CNI Level 4 INIT to repump the ring.
- Power down RPCN32 and request a CNI INIT 4.
- Check the ROP for the presence of SI15, SI22 or SI24 asserts.
- Inhibit Safety Net from the EAI page. Use pokes 42, I for inhibit and 50 boot to set the new value.
- Power RPCN32 back up and power down RPCN00. Request a CNI INIT 4.
- Verify that the RPCNs have the correct IRN microcode. Only MC3F026A1 can be used in an RPCN.
Decisions: Ring up? Rolling CNI INITS? Present? Rolling INITS stop? Correct?
Exits: Done; Go to Chart 8A; Go to Chart 8B.
Figure 1-8. Ring Down (Chart 8A)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Pull the IRN from the two nodes mentioned in the token tracking report and request a CNI INIT 4.
- Follow normal maintenance procedures to correct faulty nodes.
- If each RPCN is reporting a fault, or one RPCN and the upstream neighbor of the other (last node in groups 31 or 63), then there could be two IFB problems. Power down one RPCN to force that segment out of the ring. Place a new IFB in the other RPCN. If the problem is still present, place a new IFB in the neighbor node. If the problem still exists, try a new IRN. If the RPCN is not an IRN type, replace the R0 or R1 board based on which ring the fault is reported on, if the fault does not involve both pairs. If 1506, 1509, or 1803 IFBs, then pull the IRNs from the two nodes reporting the fault to force an isolated segment.
Decisions: Mention missing files? Rolling ring reconfigurations? Lost token report? Repeated RAC parity errors on both rings? Token tracking information? Ring up?
Exits: Done; Go to Chart 8B; Go to Chart 8C.
Figure 1-8. Ring Down (Chart 8B)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- If you are here, either your ring is in a rolling boot state or the ring is thrashing trying to find a usable segment.
- Correct the problem and INIT CNI 4.
- It may be a 3B problem. Force a CU switch and do a CNI INIT 4.
- Power down RPCN 00 and LN00-6. Do a CNI INIT 4.
- Restore RPCN power. Remove power from the last LN00 node, and do a CNI INIT 4.
- Continue this process of forcing small (three to six nodes) isolated segments until every node has been part of an isolated segment.
Decisions: Power OK? Ring up?
Exits: Done.
Figure 1-8. Ring Down (Chart 8C)
[Flowchart. Box text as extracted; the branch arrows are not reproduced.]
Actions:
- Maybe the CDN data bases need to be repumped. Power cycle one CDN and perform a manual restore. Note: Wait about 5 minutes before trying the restore so memory can initialize.
Decisions: Present? CDN up?
Exits: Done.
2
Contents
General
Operation of the Ring
Ring Nodes
  - Ring Peripheral Controller Nodes
  - Basic IMS User Nodes
  - Direct Link Nodes (DLN)
  - Call Processor/Data Base Nodes (CDN)
      CDN-I
      CDN-II
      CDN-IIx
      CDN-III
Interframe Buffers
Node Quarantine
Node Isolation
The Ring Config Module
Initializations
  - Level-3 IMS Initializations (FPI and Boot)
  - Level-4 IMS Initializations (FPI and Boot)
Audits
  - Central Node Control Audit (AUD CNC)
  - Node State Audit (AUD NODEST)
  - Node Audit
General
The Interprocess Message Switch (IMS) is a packet switch composed of ring-based communication nodes centered upon a 3B21D computer. Each ring node is controlled by a microcomputer called the node processor. The nodes are distributed around dual, parallel communication rings that propagate data in opposite directions. Ring 0, the outer ring in the illustration below, propagates data clockwise; and ring 1, the inner ring, propagates data counter-clockwise. Ordinarily, of the two ring paths, ring 0 is actively involved in transmitting user messages, while ring 1 performs as a path for internal IMS communications. Each ring node contains one interface to each of the two rings and one interface either to the 3B21D or to a user's external system. Thus, IMS has two types of nodes: nodes interconnecting the ring and the 3B21D, the most important of which are called ring peripheral controller nodes (RPCNs), and nodes interconnecting the ring with the user's external system, most of which are called basic IMS user nodes (basic IUNs). As a processing resource, the centralized 3B21D is also available to users, but its principal purpose is to provide operational, administrative and maintenance control of the switch.
Figure 2-1. Conceptual Illustration of an IMS Ring

The real situation is somewhat more complicated than this description, because IMS has other types of nodes and because users are represented not only by an external communication system but also by internal hardware and software residing in certain nodes. A full discussion of all classes of IMS nodes appears shortly below.

IMS may be used either as a local area network or as a switching system. More commonly it is used as a switch to transfer user messages from incoming transmission facilities to user-specified outgoing transmission facilities. A user message typically enters IMS through the external or user interface of an IUN, is formatted and addressed to a destination IUN by the resident node processor, and is inserted on the ring by the resident ring interface. It then passes around the ring to the destination IUN where it is recognized and extracted by the ring interface, reformatted by the node processor, delivered to the user interface and, then, returned to the user.

In this typical transmission the 3B21D is not directly involved, though it can be involved, depending on user requirements. When access to the 3B21D is needed, a user message enters the ring as described above but is first removed by an RPCN or similarly functioning node, which delivers it to the 3B21D, which processes it. The 3B21D then returns the processed message to an RPCN, which inserts it on the ring, from which it is removed by the destination IUN, which further processes and returns it to the user.
In this illustration of IMS switching, a user message is transferred between processes residing in different processors. By itself the illustration is misleading, because IMS is not an interprocessor message switch but an interprocess message switch. It is capable of transmitting messages between any two processes, whether user- or IMS-owned, residing in the same or in different processors. This capability is provided by a major IMS software module called the message switch.
If the message was addressed to the resident node or was a broadcast message (see footnote 1), the bytes composing it are offered by means of handshakes to the node processor via the 18-bit DMA channel. If the message was not addressed to the resident node, the bytes composing it are offered by means of handshakes to the downstream node via the next segment of the ring bus.
Figure 2-2. A Ring Access Circuit on the IMS Ring

IMS employs a token message on each ring to ensure that only one node at a time writes messages to the ring. A token continuously traverses a ring. When a node is ready to insert a message or a block of messages on a ring, it waits for the upstream node to offer a data byte that its receive logic recognizes as the first byte of the token header. It delays accepting this byte (does not assert the data-taken lead) until it can insert its message or messages, byte by byte, on the ring. Then it accepts and transmits the token message downstream, making it available to the next node that has messages to write.
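The token discipline described above can be illustrated with a toy Python simulation. This is a conceptual sketch only: it models nodes as list positions and ignores bytes, handshakes and the data-taken lead. The function name and the data layout are assumptions made for the example, not part of IMS.

```python
from collections import deque

def circulate_token(pending, start=0, laps=1):
    """Simulate one ring: only the node holding the token may write.

    pending[i] is the list of messages node i is waiting to insert.
    Returns (node, message) pairs in the order they reach the ring.
    """
    n = len(pending)
    queues = [deque(p) for p in pending]
    written = []
    node = start
    for _ in range(laps * n):
        # The node holds the token while it inserts its block of messages.
        while queues[node]:
            written.append((node, queues[node].popleft()))
        # Then it passes the token to its downstream neighbor.
        node = (node + 1) % n
    return written

order = circulate_token([["a"], [], ["b", "c"]])
```

Because writes happen only while a node holds the token, messages from different nodes never interleave on the ring.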
Footnote 1. IMS has two types of broadcast messages: general broadcasts, which are read by every node, and selective broadcasts, which are read by previously defined groups of nodes. Selective broadcasting, achieved by virtual addressing, allows such practices as parallel downloading of data or code into similar node types.
Ring Nodes
IMS has two classes of ring nodes: RPCNs and IUNs. RPCNs are nodes that contain no user software and that interconnect the ring and the 3B21D. IUNs, which contain both IMS and user software, perform a variety of functions. The class of IUNs has two subclasses: unextended IUNs, in which the node processor provides the only processing resource, and extended IUNs, in which the processing function is supplemented by an attached processor. At present, all unextended IUNs contain external user interfaces, but no extended IUNs do. This condition, however, is arbitrary and therefore subject to change. Currently there is one type of unextended IUN, the basic IUN. There are two types of extended IUNs: direct link nodes (DLNs) and call processor/database nodes (CDN-I).

All ring nodes of either class have a ring interface and a node processor. In this document the units of a node other than the ring interface and the node processor are called auxiliary components. Ring node hardware uses very large scale integration, housing the ring-interface and the node-processor functions in a single integrated circuit pack. These are called integrated ring nodes (IRNs). There are two versions of IRNs: the IRN/IRNB (UN303/UN303B) and the IRN2/IRN2B (UN304/UN304B).

Node processors are microcomputers composed of a CPU, memory, interrupt logic, I/O ports, and DMA circuitry. They are supplemented in DLNs by an additional microcomputer called the attached processor and in CDNs by an additional minicomputer called the ring application processor. In unextended IUNs, the node processor contains both IMS and user code. In extended IUNs, user code resides only in the attached processor, whereas both node and attached processors contain IMS code. The content of user code is determined by user needs.
Typically it provides or contributes to such functions as controlling user hardware resident in the node, managing the user's network, and providing real-time user services such as protocol conversion and message addressing. The code provided by IMS manages the ring-interface and node-processor hardware. It includes code for initialization and automatic maintenance and for such switching functions as message formatting and temporary message storage. It provides an operating system, boot monitor, memory, timers, and measurements. Except for the boot monitor, all code residing in node processors and attached processors is downloaded from the 3B21D.
A duplex dual serial bus selector (DDSBS) serves as a termination point between the ring and the dual serial channels of the 3B21D. It converts the parallel output of the ring to the serial format of the dual serial channels and vice versa. The DDSBS is duplexed, with one DDSBS function connected to the dual serial channel of the on-line 3B21D control unit and one to the off-line control unit. A 3B21D computer interface (3BI) circuit pack serves as a buffer between the node processor and the DDSBS. It also provides data conversion between the node processor's 16-bit data bus and the DDSBS's 36-bit data bus. The 3BI communication occurs either via a DMA channel or a program I/O utility of the 3B21D operating system. The DMA channel is ordinarily used for standard message interchange. The program I/O is initiated and used by the 3B21D to issue urgent commands to the RPCN or to synchronize data transfers.
An attached processor that resides on the node-processor bus and communicates with the node processor via a dual-ported memory and hardware interrupts. The attached processor contains both IMS and user code. A 3B21D computer interface (3BI) and a duplex dual serial bus selector (DDSBS) that perform in the same way and serve the same functions as they do for RPCNs, as described above.
CDN-I
IMS offers an extended node for users who require more processing power in the nodes than can be supplied by basic IUNs. The node is called a CDN-I [sometimes referred to as a standard multi-application real time node (SMART node or SN)]. It serves as an alternative to the 3B21D for the substantive processing of user data. Currently, the CDN-I has only an interface to the ring. It is capable, however, of having an external user interface, and it may have one in the future. In addition to a ring interface and a node processor that contains only IMS code, a CDN-I is composed of the following elements:
- An attached processor called a ring application processor (RAP). The RAP is a 3B15 computer mounted on an IMS backplane that has been redesigned to conform with the design of IMS ring-node frames/cabinets and the 3B15. The older version has 2 megabytes of memory and is capable of growing an additional 94 megabytes. The newer version has 16 megabytes of memory and is capable of growing an additional 112 megabytes. The following circuit packs compose the RAP:
    Central controller cache (CCC)
    Central controller support (CCS)
    Main store controller(s) (MASC)
    Main store arrays (MASAs)
- A power control interface and display (PCID) that provides manual-power, reset, and diagnostics controls and LEDs that indicate power and diagnostic failures.
- A node-processor interface (NPI) that provides message exchange between the node processor and the RAP.
CDN-II
The CDN-II (sometimes referred to as the Turbo CDN) is a newer node that is used to replace the CDN-I. The CDN-II requires only two boards and fits in a standard 3-node shelf or the new 5-node shelf. It uses newer technology and delivers about four times the performance of the CDN-I. The CDN-II has a fixed 80 Mbytes of memory and consists of the IRN2B (UN304B) and an AP (TN1630B).
CDN-IIx
The CDN-IIx has identical features to the CDN-II, but different hardware. It uses the IRN2B (UN304B) and an AP (TN1720x) but can have up to 272 Mbytes of memory using multiple AP boards. A CDN-II can be upgraded to a CDN-IIx by ordering a memory growth upgrade kit.
CDN-III
The CDN-III is an improved CDN that may be used to upgrade CDN-II or CDN-IIx type nodes. The CDN-III consists of an IRN2 node core and AP60 attached processor, providing greater processing and memory capacity than previous CDNs. The AP60 uses an MC68LC060 processor.
Interframe Buffers
Interframe buffers (IFBs) are required to extend the parallel ring buses where the distance between adjacent ring nodes is greater than a few inches. In an IRN ring, the distance is 24 inches or more. Such internodal distances occur at the boundaries of frames or cabinets where the two rings must be extended by two lengths of cable. At times they may also occur within frames/cabinets. At these boundaries, an interframe-buffer circuit pack must be inserted at each end of the parallel cables, between the cables and the nodes that are separated by the cables. Interframe-buffer circuit packs are always employed in pairs. Each member of a pair contains both send and receive circuitry. Therefore, the paired packs are mutually dependent, with each providing half of the buffering function for each parallel ring bus. The following graphic illustrates the pairing of the interframe buffers.
Figure 2-3. Interframe Buffers

Thus, if either member of a pair fails, the pair fails. In addition to providing necessary drive capability without slowing down the internodal byte transfer rate, interframe buffers in padded form may be used to increase the effective lengths of small rings, thereby permitting them to employ longer messages. For this purpose, two pairs of 4104-byte buffers may be inserted in small IRN rings. The pairs should be placed diametrically on the ring to minimize the possibility that both would be included in an isolation. If additional interframe buffers are needed, they should be of the standard 16-byte capacity. The 16-byte capacity is adequate for use on large rings where employment of long messages requires no buffer padding. Technicians should ensure that the actual sizes of their interframe buffers correspond to the sizes entered in equipment configuration data (ECD). See "ECD Values for Interframe Buffers" in Appendix B, Ring Maintenance Reference Material.
Figure 2-4.
[Message format diagram. Labels as extracted: C bit; first-byte fields 6 DC, 4 RR, 3 CF, 1 CC; word count; source address with SR; destination address with DR; data; last data.]
LEGEND: CC = Control Code; CF = Control Flag; RR = RAC Reset; DC = Destination Control; SR = Source Ring ID; DR = Destination R
The illustration leaves blank fill bits and bits that are not examined by ring-interface hardware. The first 8 bytes constitute the message header. The first byte contains a 7-bit control field from which the RAC learns how to respond to the message. Within the first byte, the control code (CC) defines the message function. Functions are token, software, destroy, set/clear quarantine, set/clear isolation, processor reset. The destination control (DC) identifies the address type. Types are normal address match, general broadcast, selective broadcast, and take message.

In addition to the 8 data bits, there is a ninth bit, called the control or C-bit, which is always set to logic-one to identify the beginning byte of every message. From association with this feature, the entire first message byte is often referred to in documentation as the control or C-byte. The tenth bit is a parity bit which provides odd parity over the data byte and C-bit. When a RAC writes a message to the ring, it generates the C-bit and modifies the parity bit from node-processor memory to include the C-bit. When a RAC reads a message from the ring, the C-bit is removed and parity is changed back to its original form before being written to node-processor memory.

The word count in the second message byte informs the RAC of the total number of 32-bit words in the message. Each message contains 4N bytes, where N is the value of this 7-bit word count. All messages are padded out to contain an integral number of 32-bit words. The longest possible message that can be placed on the ring is limited to the maximum value of this word count, which is 127 32-bit words (508 bytes) for rings that allow the short message and 543 32-bit words (2172 bytes) for rings that allow the long message. For explanations of conditions that permit short and long messages, see the discussion of interframe buffers above.

The third and fourth header bytes contain the source address, and the fifth and sixth header bytes contain the destination address. The ring-interface hardware performs address matching on the 12-bit node address and the 1-bit ring ID (which identifies which of the two rings is used for the message). The lower 10 bits of the ring address are referred to as the node identification. Each node is assigned a unique 10-bit node identification via the ID0-ID9 backplane straps. This header information enables the RAC to determine message disposition and the source and destination addresses, to check for errors in parity, format, and message length, and to perform hardware control functions required for ring maintenance.
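The parity and length arithmetic described above can be worked through in a few lines of Python. This is a minimal sketch of the stated rules (odd parity over the 8 data bits plus the C-bit, and a message length of 4N bytes); the function names are hypothetical and the code does not model the actual RAC hardware.

```python
def odd_parity_bit(data_byte, c_bit):
    """Parity bit that makes the count of 1s in data + C-bit + parity odd."""
    ones = bin(data_byte & 0xFF).count("1") + (c_bit & 1)
    return 0 if ones % 2 == 1 else 1

def message_length_bytes(word_count):
    """Each message contains 4N bytes, where N is the word count."""
    return 4 * word_count

def words_needed(payload_bytes):
    """Number of 32-bit words after padding to a whole word."""
    return (payload_bytes + 3) // 4

short_max = message_length_bytes(127)
long_max = message_length_bytes(543)
```

The two maximum lengths reproduce the figures quoted in the text: 508 bytes for the short message and 2172 bytes for the long message. The parity helper also shows why the stored parity has to be modified when the C-bit is added: setting the C-bit changes the count of 1s by one, so the parity bit flips.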
Reconfigurations
The types and number of nodes composing any ring are selected to meet the requirements of a specific user. Thus, only a ring whose components are fully in service may be thought of as properly configured. Yet rings must sometimes be temporarily reconfigured for such reasons as the need to repair or replace equipment. IMS reconfigures a ring by removing one or more nodes from service. Nodes that have been removed from service are ordinarily in one of two states. They may be quarantined or they may be isolated.
Node Quarantine
Quarantining a node consists of electrically severing the node processor from its associated ring interface, an action that prevents the node processor from communicating through or to the ring interface. However, the action does not prevent the 3B21D or other nodes from limited communications with the node processor, which they accomplish by setting registers in the ring interface. When a node is placed in quarantine, both RACs are set to forced-propagate mode, which allows them to continue propagating messages on the rings but prevents them from reading messages from or writing messages to the rings. Quarantining is the appropriate response to a fault that occurs in a node processor or in any of the auxiliary components of a node. Quarantining has the advantage over isolation in that it disturbs the ring subsystem only slightly. Throughout this document the term "quarantine" is used solely to represent a node that is in the state described above and that is in the active ring. Nodes in isolation or nodes during initialization or recovery sequences may have their node processors electrically severed from their ring interfaces, which are in forced-propagate mode. Such nodes will not be called "quarantined" since they are not in the active ring.
Node Isolation
Quarantining a node insulates the active ring from faults or activities in the node processor and in auxiliary components. Isolating a node insulates the active ring from the entire node. It is achieved by converting the ring subsystem from one dual-ring structure to two single-ring structures. Of the two single-ring structures, one is the active segment that continues to transmit user messages, and the other is the isolated segment that contains the isolated node or nodes. Isolated segments do not have a token message. The following figure schematically represents an isolated ring.
Issue 16.0
401-661-045
Figure 2-5. Illustration of an Isolated Ring (diagram not reproduced; surviving labels: 3B21D, LEGEND)

In this illustration, the active segment is composed of the unlettered nodes and of basic IUNs A and C, and the isolated segment is composed of RPCN B only. Basic IUNs A and C are called, respectively, the Beginning-of-Isolation (BISO) node and the End-of-Isolation (EISO) node. They are participants in the active ring that have the special function of altering the dual rings to form the isolated ring. They achieve this alteration by means of internal data selectors that can shunt traffic from one parallel ring to the other. This phenomenon is represented in the following illustration of a node before and after it becomes a BISO or EISO node.
Figure 2-6. Before (top) and After (bottom) Becoming a BISO or EISO Node (diagram not reproduced; surviving labels in each panel: Ring 0, Ring 1, RAC 0, RAC 1, DS)

Because all nodes have this shunting capability, any node of any class can perform as a BISO or an EISO node. The nodes actually selected to perform these functions are determined by the location of the node(s)-to-be-isolated. The node selected to be the BISO node is ordinarily the first node upstream on ring 0 of the node(s)-to-be-isolated (and therefore the next lower-numbered node), and the node selected to be the EISO node is ordinarily the first node downstream on ring 0 of the node(s)-to-be-isolated (and therefore the next higher-numbered node). If more than one node must be isolated (a phenomenon called a multiple isolation), IMS software chooses to reconfigure the ring in such a way as to
include the smallest number of nodes possible. Nodes included in a multiple isolation, not because they contain faults, but because they lie between faulty nodes, are called innocent victim nodes. The BISO and EISO nodes also provide the means by which maintenance messages are transmitted between the active and the isolated segments of an isolated ring. BISO and EISO nodes have one RAC participating in the active segment and one RAC participating in the isolated segment. Messages destined for either ring segment may be read from the sending segment by the EISO or BISO RAC participating in it, transmitted via the node processor to the RAC participating in the receiving segment, and then written to the receiving segment. It is by this means that diagnostic code is downloaded by the 3B21D into isolated nodes and diagnostic results are returned to the 3B21D. Isolation is a more drastic means than quarantine for removing a faulty node from service. It is an appropriate response to a fault in the ring interface or in the medium between ring interfaces (this may be a fault that prevents messages from being propagated on the ring).
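The selection of BISO and EISO nodes and the notion of innocent victim nodes can be sketched as follows. This is a simplified model and not the IMS algorithm: it assumes nodes are numbered 0 to N-1 in ring-0 order, so that "upstream" is the next lower number modulo N, and it assumes that at least two nodes stay active. The function name `plan_isolation` and the result keys are hypothetical.

```python
from __future__ import annotations


def plan_isolation(ring_size: int, faulty: set[int]) -> dict[str, object]:
    """Pick the smallest contiguous segment that contains every faulty node.

    Nodes are numbered 0..ring_size-1 in ring-0 order (an assumption of
    this sketch), so the upstream neighbor is the next lower number.
    """
    if not faulty or any(n < 0 or n >= ring_size for n in faulty):
        raise ValueError("faulty nodes must be a non-empty subset of the ring")
    positions = sorted(faulty)
    # The largest run of healthy nodes between two faulty nodes stays active;
    # everything else forms the isolated segment.
    best_gap, first, last = -1, 0, 0
    for i, end in enumerate(positions):
        start = positions[(i + 1) % len(positions)]
        gap = (start - end - 1) % ring_size  # healthy nodes from end to start
        if gap > best_gap:
            best_gap, first, last = gap, start, end
    if best_gap < 2:
        raise ValueError("sketch assumes at least two nodes remain active")
    isolated = [(first + k) % ring_size for k in range(ring_size - best_gap)]
    return {
        "biso": (first - 1) % ring_size,  # first node upstream of the segment
        "eiso": (last + 1) % ring_size,   # first node downstream of the segment
        "isolated": isolated,
        "innocent_victims": [n for n in isolated if n not in faulty],
    }
```

For a ten-node ring with faults at nodes 3 and 5, the sketch isolates nodes 3 through 5, names node 2 as BISO and node 6 as EISO, and reports node 4 as an innocent victim.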
to issue one token message, when the ring contains an isolation, or two token messages, when it does not
to restart the message switch; or, if continuity is bad, to abort and return control to the process that initiated ring config.
The ring config module may be executed by IMS initialization software, by Error Analysis and Recovery (EAR) software, by Automatic Ring Restoral (ARR) software, or by manual commands to change the structure of the ring. The processes mentioned here are described at length later in this document.
Initializations
IMS offers seven levels of System Initialization - 0, 1A, 1B, 3(FPI), 3(BOOT), 4(FPI), and 4(BOOT) - with each higher level providing more complete initialization and greater impact on the user.
Levels 0, 1A, and 1B reinitialize certain data in the 3B21D; they are usually run in response to program faults in the 3B21D or in response to 3B21D operating system initializations that affect 3B21D-to-node DMA interfaces. An escalation strategy ensures that repeated problems with these lower-level initializations will result in one or more of the higher-level initializations being attempted. Full Process Initialization(s) (FPI) occur without a preceding IMS abort and, therefore, require little initialization of IMS software in the 3B21D. Instead of copying all IMS code and data resident in the 3B21D from disk, FPI initializations restart the principal body of IMS code, the driver. The FPI feature has the advantage of saving initialization time, particularly in level 3(FPI) initializations, and of greatly simplifying the initialization sequence. The BOOT initializations are preceded by abort and boot sequences of IMS in the 3B21D. Thus, the two FPI levels provide partial initialization of IMS in the 3B21D, and the two BOOT levels provide full initialization of IMS in the 3B21D.
In the ring, completeness of initialization increases with the numbers. The level 3 initializations (FPI and BOOT) attempt to conserve system usability by reinstating the ring structure that existed prior to initialization. They resort to establishing a new structure only if tests indicate that the existing structure is not viable. By contrast, the level-4s make no attempt to reinstate the previous ring but immediately set about testing all nodes to determine the optimum ring structure. Thus, the level-3s provide partial initialization of the ring, and the level-4s provide complete initialization of the ring. IMS software can request the three lower levels of initialization but not the four higher levels. Instead, it responds to internally detected problems requiring higher-level initializations in one of two ways. It can request that the user choose one of the four higher levels. Or it can abort, thereby forcing the user to choose either level 3(BOOT) or level 4(BOOT). In general, it responds to an indication of software mutilation by aborting. Otherwise, it allows the user to decide how to respond. The user can also independently request any of the four higher levels.
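The seven levels and their properties, as described above, can be collected into a small lookup table. The sketch below is only a summary aid; the class name `InitLevel`, its field names, and the helper `choices_after_abort` are hypothetical, and the ring scope of levels 0, 1A, and 1B is marked "not stated" because the text describes only their effect on 3B21D data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitLevel:
    name: str
    ims_scope: str              # what is reinitialized in the 3B21D
    ring_scope: str             # what is reinitialized in the ring
    software_may_request: bool  # True if IMS software itself may request it


LEVELS = (
    InitLevel("0", "certain data", "not stated", True),
    InitLevel("1A", "certain data", "not stated", True),
    InitLevel("1B", "certain data", "not stated", True),
    InitLevel("3(FPI)", "partial", "partial", False),
    InitLevel("3(BOOT)", "full", "partial", False),
    InitLevel("4(FPI)", "partial", "complete", False),
    InitLevel("4(BOOT)", "full", "complete", False),
)


def choices_after_abort() -> list[str]:
    """Levels the user must choose between once IMS has aborted."""
    return [level.name for level in LEVELS if level.name.endswith("(BOOT)")]


def software_requestable() -> list[str]:
    """Levels that IMS software can request without involving the user."""
    return [level.name for level in LEVELS if level.software_may_request]
```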
confirm that the token is or the tokens are present
identify the current ring structure by examining the positions of node data selectors
check for inconsistencies between the actual ring structure and ECD data, and
verify that all RACs can propagate data on the ring.
During the audit, message traffic between nodes on the ring is permitted to continue, though message traffic to the 3B21D is denied.
3. If all audit tests pass, the ring config module is called to establish a ring in conformity with the prior ring structure. But if any audit test fails or reveals an inconsistency, a new strategy of empirically testing for ring continuity begins by sending test messages on the ring. If the tests reveal ring continuity, the ring config module is called to establish the normal two-ring structure; but if the tests reveal discontinuity, ring config is called to establish an isolated ring that excludes the problem node or nodes. In either case, the IMS initialization process exits when ring config has established a viable ring or aborts when it is unable to.² Unlike the audit stage, the continuity-test stage of level-3 ring initialization requires ring silence. Thus, during continuity tests, user message traffic on the ring is halted.
4. With an active ring in place, the 3B21D now queries each IUN and quarantines those that do not respond.
² An exception to this statement occurs when manual ring mode is in effect. For an explanation of manual ring mode, see the "Manual Ring Maintenance" section of Chapter 3, Ring Maintenance.
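The decision made in step 3 can be summarized in a few lines. The sketch below is a simplification under stated assumptions: the audit and continuity results are passed in as booleans instead of being measured, manual ring mode is ignored, and the names `Outcome` and `level3_ring_init` are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Outcome(Enum):
    PRIOR_STRUCTURE = "ring established in conformity with the prior structure"
    NORMAL_TWO_RING = "normal two-ring structure established"
    ISOLATED_RING = "isolated ring excluding the problem node or nodes"
    ABORT = "ring config could not establish a viable ring"


def level3_ring_init(
    audit_passed: bool,
    ring_continuous: bool,
    config_succeeded: bool = True,
) -> tuple[Outcome, bool]:
    """Return the outcome and whether ring silence was required.

    Ring silence is needed only when the audit fails and continuity
    tests have to be run.
    """
    silence_required = not audit_passed
    if audit_passed:
        target = Outcome.PRIOR_STRUCTURE
    elif ring_continuous:
        target = Outcome.NORMAL_TWO_RING
    else:
        target = Outcome.ISOLATED_RING
    if not config_succeeded:
        return Outcome.ABORT, silence_required
    return target, silence_required
```

When the audit passes, the continuity argument is ignored, which mirrors the text: continuity tests begin only after an audit failure or inconsistency.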
Audits
The following information about IMS audits is offered chiefly because output messages concerning audits will occasionally appear on the ROP. Technicians should rarely have occasion to use the input commands that manually initiate them.
Node Audit
An automatic, internal audit of nodes allows maintenance software in the 3B21D to continuously monitor the health of the ring and all ring nodes. The node audit is run routinely every few seconds. By this means, the 3B21D verifies that each active node is operating correctly, checks the communication paths of both rings, and finds nodes that have quarantined themselves or that need to be quarantined. The work of the node audit is transparent to technicians and users of IMS, unless it detects a problem that causes a node to be removed from service.
3 Ring Maintenance
Contents
Overview
Automatic Ring Maintenance
  EAR or Ring Recovery
  Error Detection
  Mechanisms Underlying Reinstatement and Reconfiguration
  Unexplained Loss of Token
  Token Track
  Reinstatement and Reconfiguration
  Ring Error Threshold
  Multiple Faults
  EAR Ring Recovery Intervals and Output Messages
  ARR or Deferrable Node Recovery
  Overview of ARR Treatment of Out-of-Service Nodes
  Maintenance States
  Ring States
  Node Major States
  Node Minor States: Ring Position
  Node Minor States: Ring Interface
  Node Minor States: Node Processor
  Node Minor States: Maintenance Mode
  Summary of EAR Actions
  Three ARR Rules
  The One-Restoral-at-a-Time Rule
  The Fourth-Time Rule
  ARR Treatment of Unstartable, Quarantined Nodes
  ARR Treatment of Isolated Nodes
  ARR Recovery Intervals and Output Messages
Manual Ring Maintenance
  Ring Maintenance Interfaces
  Alarms
  Critical Alarms
  Major Alarms
  Minor Alarms
  Special IMS Indicators
  Display Pages
  Page 1105 The Ring Status Summary Page
  Page 1106 The Ring Node Status Page
  Ring Diagnostics
  Obtaining Diagnostic Results
  Diagnostic Listings
  Using Diagnostics
  Guide to Critical Ring Maintenance
  IMS Input Messages
  Critical Maintenance Procedures for Nodes
  Critical Maintenance Procedures for Nodes in Isolation
  Low-Phase Ambiguity
  Guideline to Single-Node Isolations
  Guideline to Multiple-Node Isolations
  Responding to Ring Down
  Employing Manual Ring Mode
  Ring Application Processor Critical Maintenance Procedure
  Recognizing and Finding Intermittent Faults
  Other Suggestions for Troubleshooting
  New Circuit Pack; Old Failure
  Unconditional Restorals
  Unexplained Loss of Token
  Avoiding Trouble
  Recording Trouble
  New Installations or Ring Growth
  Responses to Single, Ring-Related Faults
  Automatic Recovery from a Transient Fault by EAR Level 0
  Manual Recovery from a Hard Fault
  Automatic Recovery from a Transient Fault by ARR
  Manual Recovery from a Hard Fault on a Small Ring
  Responses to Multiple, Ring-Related Faults
  Manual Recovery from Multiple Hard Faults
  Automatic Recovery from Two Intermittent Faults
Ring Maintenance
Overview
The design of ring maintenance reflects the need to recover rapidly from faults that disrupt the transportation of messages on the ring or that prevent the processing and transmission of messages within nodes. Ring maintenance addresses this need with three types of automatic recovery actions, which are called reinstatement, reconfiguration, and node restoral. When ring maintenance software determines that a fault has disrupted the ring subsystem, it acts to resume operation by one of two means. It can attempt to reinstate the current ring; that is, to return the ring to service as it was constituted prior to the fault. Or it can reconfigure faulty nodes out of the ring, thereby resuming operation with the surviving resources. If it reconfigures the ring, ring maintenance software then acts, in parallel with resumed operation, to restore to service nodes it has removed; or if it cannot restore them to service, it directs technicians to repair or replace them and then to restore them to service manually. Reinstatement may be achieved locally and unannounced by such means as Direct Memory Access (DMA) restarts or reexamining evidence for a fault, or it may be achieved globally and visibly by ring initializations or ring restarts. Ring restarts occur when the ring config module is called with instructions to reset the data selectors to their current positions. Reconfiguration is achieved either by quarantining or isolating faulty nodes. The design of ring maintenance associates faults with nodes. A fault in the ring interface, the node processor, or an auxiliary component is associated with the host node. A fault in the ring bus between nodes or in an interframe buffer is associated with the node immediately downstream of the fault. Associating faults
with nodes means the ring can respond to faults by removing nodes from service, either by quarantining or isolating them. The type of reconfiguration chosen depends on the impact of the fault. If the impact is confined to the internal operations of the node, then the node will be quarantined. But if the fault has disrupted operation of the ring, then the node associated with the fault will be isolated. Automatic node quarantine occurs in response to instructions from the node processor of the faulty node or from the 3B21D. Automatic node isolation occurs when the ring config module is called with instructions to set the data selectors in positions that create an isolated segment. Reinstatement will succeed in response to most soft faults, while most hard faults require reconfiguration. Soft faults are transient hardware problems or glitches in software, either of which is likely to be temporary. Soft faults may often be corrected simply by resuming operation of the system or of the component they have disrupted. (Sometimes, however, the effects of soft faults are sufficiently severe that recovery requires reconfiguration.) By contrast, hard faults are failures in hardware or software which, once manifested, are likely to persist until they or their causes are corrected. Both reinstatement and reconfiguration provide rapid recovery, with the former usually being faster but less rigorous. When confronted with a fault in the ring subsystem, ring maintenance software must always choose to resume operation by one of these two means. When its first choice is reinstatement, and that choice fails to achieve a stable and usable ring, it next tries reconfiguration. When, on the other hand, its first choice is reconfiguration, reinstatement will not ordinarily follow, since reconfiguration, being the more thorough action, should succeed in all but the rarest cases. Reconfiguration precipitates the third type of recovery action employed by ring maintenance, node restoral.
Node restoral occurs after operation of the reconfigured ring has resumed. It begins with ring maintenance software testing quarantined or isolated nodes to determine how best to treat them. In some cases, it can and does return them to service by automatic means. When it cannot or does not return them to service, it alerts technicians to repair or replace them and then to return them to service manually. Reinstatement and reconfiguration occur automatically. The work of node restoral also begins with automatic procedures, which give way to manual means only if the automatic procedures fail repeatedly or if diagnostics reveal a hard fault. Thus the usual role of technicians is to support ring maintenance by manually completing tasks software has begun. In some instances, however, manual intervention in the automatic machinery may be indicated. The organization of the next two chapters reflects the operational division between automatic and manual ring maintenance. The next chapter describes the maintenance procedures that occur automatically, and the chapter that follows explains the related responsibilities of technicians.
Error Detection
The ring assumes that faults will produce errors in message format or message delivery, so it searches for faults by looking for errors. Errors may occur as messages are propagated on the ring (that is, they may occur within ring interfaces or in the medium between ring interfaces), as messages are transmitted or processed by node processors or auxiliary components, or as messages are transmitted between the ring and the 3B21D. The task of detecting and reporting errors is assigned chiefly to the ring nodes. By means of circuitry in their ring interfaces and software in their node processors, nodes are usually able to detect errors internal to themselves. Moreover, by means of failures in message delivery, nodes can often detect external errors, errors occurring in association with other nodes. When a node detects an error, it will, if it can, report the error to the 3B21D for analysis. An error associated with a fault that disrupts traffic on the ring is ordinarily first detected by the circuitry of the ring interface. Every ring interface contains circuits for checking parity on the ring path as well as for detecting format errors in the messages it reads, writes, and propagates. When a ring-interface circuit detects an error, it informs its node processor by means of an interrupt. The node
processor then interrogates the ring-interface hardware to determine the cause of the problem and reports, if it can, the identity and location of the error to the 3B21D via one or both rings. An error associated with a fault that prevents the transmission or processing of messages within nodes will usually be detected by the node processor. Such an error is typically caused by a fault in the node processor or by a node-processor detectable fault in one of the auxiliary components. From some errors of this type, nodes can recover immediately by means of local reinstatement. They may, for example, be able to restart an attached processor that has incurred an error. Usually, however, reinstatement is not possible, and the node processor responds to the error by placing itself in quarantine, a condition that prevents it from reporting its state to the 3B21D. Instead the 3B21D usually learns of the condition from a report made by the first node that attempts to send a message to the quarantined node. During normal operation, messages are read from the ring by the destination node. A node in quarantine, however, cannot read messages. Instead, a message addressed to it will, after traversing the entire ring, be detected and removed from the ring by the sending node, which will understand this condition as a SOURCE MATCH error and report it to the 3B21D. If a source match fails to materialize, however, or if an injured node processor is unable to quarantine itself, the condition will be detected by a node audit and reported to the 3B21D, which responds, if needed, by quarantining the disabled node. Source-match errors are one of two means by which ring nodes detect errors external to themselves. The other is ring blockage. Blockage is the condition that exists when an upstream node cannot propagate data to its downstream neighbor. Every node has a timer on the output of each of its two ring paths.
The timer expires if a byte of data being offered by the upstream node is not taken by the downstream node within a specified interval. Expiration of the timer implies a problem in the downstream node, for a node processor ordinarily reacts to an error that implicates its ring interface by forcing blockage on its ring input path. In this context, all interconnections between nodes, including interframe buffer circuits, are considered part of the downstream node. When a node processor detects blockage, it immediately drains the ring of any remaining data, including the token message, and reports the blockage to the 3B21D via the alternate ring.¹ Errors may also be detected during the testing phase of ring initialization. Testing, which is more extensive in level-4 than in level-3 initializations, is in neither of these levels of initialization so detailed as in diagnostics. Nevertheless, errors detected during these tests result in the same kinds of system actions as those detected during normal operations. Therefore, a ring may become active with some of its nodes newly quarantined or isolated. Finally, errors that are transparent to the ring may be detected by a user and reported to the ring. Such errors result from faults that occur in user hardware or firmware residing in the node or in user software residing in the node processor or in an attached processor.
¹ The node that first detects blockage drains the ring to avoid confusing the 3B21D as to which node is immediately upstream of the faulty node. If it did not drain the ring, mass congestion would ensue, causing many upstream nodes to experience and report blockage. Even so, the initial blockage condition will often trigger two or three upstream blockage reports before the ring can be drained.
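The source-match mechanism described in this section lends itself to a small simulation. The sketch below is a toy model and not ring firmware: a message travels downstream one node at a time, quarantined nodes propagate it without reading it, and a message that returns to its sender is treated as a SOURCE MATCH error. The function name `deliver` and the node labels are hypothetical.

```python
from __future__ import annotations


def deliver(ring: list[str], quarantined: set[str], source: str, destination: str) -> str:
    """Trace one message around a single ring.

    ring lists node names in downstream order. Returns "DELIVERED" if the
    destination reads the message, or "SOURCE MATCH" if the message comes
    back to the sending node unread.
    """
    start = ring.index(source)  # raises ValueError if the sender is unknown
    size = len(ring)
    for step in range(1, size):
        node = ring[(start + step) % size]
        # A quarantined node is in forced-propagate mode: it passes the
        # message along but cannot read it from the ring.
        if node == destination and node not in quarantined:
            return "DELIVERED"
    # The message traversed the entire ring and reached the sender again.
    return "SOURCE MATCH"


ring = ["A", "B", "C", "D"]
normal = deliver(ring, set(), "A", "C")
blocked = deliver(ring, {"C"}, "A", "C")
```

In this model the sender would report the second case to the 3B21D, which is how the 3B21D usually learns that node C has quarantined itself.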
An extended IUN with an attached-processor problem offers an important example of local reinstatement. Such a node does not quarantine itself immediately. Instead, the node processor audits the operational code of the attached processor and, if the audit passes, attempts to restart the attached processor. Only if the attached processor fails the audit or fails to restart after one attempt does the node processor report the condition to the 3B21D and then quarantine itself. Since most errors involve blockage, EAR usually receives at least two reports, one from the downstream node that detected the error directly and another from the upstream node that experienced the blockage.
next higher-level action is tried, and so on. In this context, failure to return the ring to service means failure to resume operation of the ring at all or to sustain operation through a confidence interval of 5 seconds. The levels of ring recovery actions are as follows:
level 0: Unless the frequency of reported ring faults has exceeded a user-defined threshold, EAR first attempts to reinstate the current ring by restarting it. Even when the frequency has exceeded the threshold, EAR still attempts to restart the ring, if analysis of error messages indicates that isolating the fault would seriously impact service to a user.
level 1: If restarting fails or is not attempted, EAR uses error reports to locate the node or nodes associated with the fault, and it isolates them.
level 2: If in response to level-1 action the ring failed to recover at all, EAR expands the isolated segment one node in each direction. If in response to level-1 action the ring failed to sustain its recovery through the confidence interval, EAR bases the expansion on analysis of any additional ring transport error messages received. Level 2 is skipped on small rings.
levels 3-5: If these attempts, based on the original ring transport error reports, fail to achieve a stable ring, EAR discards the reports and initiates a new and comprehensive recovery tactic that attempts to locate and isolate the fault by employing tests for ring continuity. The continuity tests, which may escalate through three levels of increasing thoroughness, are designed to locate faults empirically by systematically testing message traffic on the ring. The two highest-level continuity tests include soak periods for finding transient faults. If a continuity test fails to find a fault, the escalative recovery strategy is terminated and the ring reinstated in conformity with its structure prior to the first level of EAR activity. Ordinarily the lowest-level continuity test will find and successfully isolate the fault. If, however, each of the three levels of continuity testing finds a fault to isolate but fails in turn to establish or to retain through the confidence interval a usable ring, IMS in the 3B21D aborts. Or, if the user prefers, the 3B21D undertakes a full process initialization and is reinitialized by the user.
The levels of EAR escalative recovery actions are described in still greater detail in the reference chapter of this document.
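The escalation through levels 0 to 5 can be sketched as a plan followed by a loop. This is a minimal model under stated assumptions: the function names and the `attempt` callback are hypothetical, the callback is taken to return True only when the ring resumes and stays up through the 5-second confidence interval, and the special cases (the jump to level 3 after an unexplained loss of token, and termination when a continuity test finds no fault) are left out.

```python
from __future__ import annotations

from typing import Callable, Optional


def ear_recovery_plan(
    threshold_exceeded: bool,
    isolation_would_impact_service: bool,
    small_ring: bool,
) -> list[int]:
    """List the recovery levels EAR would try, in order."""
    levels = []
    # Level 0 (restart) is tried unless the error threshold has been
    # exceeded, and even then if isolation would seriously impact service.
    if not threshold_exceeded or isolation_would_impact_service:
        levels.append(0)
    levels.append(1)
    if not small_ring:
        levels.append(2)  # level 2 is skipped on small rings
    levels.extend([3, 4, 5])  # continuity tests of increasing thoroughness
    return levels


def run_ear(plan: list[int], attempt: Callable[[int], bool]) -> Optional[int]:
    """Return the level that produced a stable ring, or None if all failed."""
    for level in plan:
        if attempt(level):
            return level
    return None  # in the text, IMS aborts at this point
```

For example, `run_ear(ear_recovery_plan(False, False, True), attempt)` walks levels 0, 1, 3, 4, and 5 on a small ring and stops at the first level for which `attempt` reports a stable ring.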
Token Track
The token track module runs automatically when an unexplained loss of token occurs. Its purpose is to inform technicians of the probable area where the token was lost. It is not otherwise used by ring software. Its first act is to conduct a ring continuity test. If the test fails, indicating that the loss of token was caused by a hard fault, token track aborts. If the test succeeds, indicating that the loss of token was probably caused by a transient fault, token track proceeds to search for the vicinity of the ring where the token was lost, and it reports this information to technicians in a REPT TOKEN TRACK message. The message reports either that the token was lost between specific nodes or else that, owing to failure of the continuity test, the program was unable to perform the analysis necessary to determine the area of loss. In instances when EAR continuity tests cannot locate an intermittent problem, token track may guide technicians to its vicinity. Token track operates by means of flip-flops that are toggled by the token message each time it passes a node. All IRN circuit packs are equipped with these flip-flops. Of the other pairs, 122/123 are not equipped, 122B/123B are not equipped, and 122C/123B are equipped with the flip-flops. On a ring with no token track circuit packs, token track will not work; and on a ring with a mixture of token track and nontoken track circuit packs, token track may not work effectively because the area identified for token loss may be impracticably large.
RPCNs periodically check for the presence of the token by attempting to write to the ring. If they are prevented from writing by the absence of a token, they report this condition to EAR in the 3B21D.
Reinstatement of the ring by restarting occurs as level 0 in EAR's escalative recovery strategy. It also occurs after a ring continuity test fails to find a fault. Reconfiguration by node isolation occurs as levels 1 and 2 of EAR's escalative recovery strategy. It also occurs after any ring continuity test succeeds in finding a fault. Reconfiguration by node quarantine occurs in response to an instruction from the resident node processor or from the 3B21D. The ring employs the following rules for deciding whether to respond to a fault by reinstating or reconfiguring the ring. When a fault can be corrected locally and immediately, IMS reinstates the current ring. When a fault cannot be corrected by local reinstatement but can be treated by quarantine, IMS reconfigures the ring by quarantining the faulty node. When a fault is of the type that may require node isolation, IMS first tries, subject to certain conditions described below, to reinstate the current ring by restarting it and resorts to isolating the fault only if restarting fails to achieve a stable ring.
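The three rules above reduce to a short decision function. The sketch below only restates them; the enum `Response`, the function `choose_response`, and the two boolean inputs are hypothetical names, and the conditions governing the initial restart (such as the ring error threshold) are not modeled.

```python
from enum import Enum


class Response(Enum):
    LOCAL_REINSTATEMENT = "reinstate the current ring locally"
    QUARANTINE = "reconfigure by quarantining the faulty node"
    RESTART_THEN_ISOLATE = "restart the ring, isolate only if restart fails"


def choose_response(correctable_locally: bool, treatable_by_quarantine: bool) -> Response:
    """Apply the rules in the order in which the text states them."""
    if correctable_locally:
        return Response.LOCAL_REINSTATEMENT
    if treatable_by_quarantine:
        return Response.QUARANTINE
    # Faults that may require isolation: try a restart first.
    return Response.RESTART_THEN_ISOLATE
```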
the frequency of faults to be permitted by IMS before its practice of responding initially to ring-related faults with EAR level-0 (restarting the ring) is discontinued and replaced by EAR level-1 (isolating the fault), or after an unexplained loss of token, replaced by EAR level-3 (ring continuity testing).
The user sets the frequency by specifying both the number of faults to be allowed and the interval of time over which they are allowed. After the threshold is exceeded, an error-free period the length of the threshold interval is required before IMS returns to its normal practice concerning ring restarts.
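A threshold of this kind is commonly modeled as a sliding window. The following minimal sketch assumes that the threshold is exceeded when more than the allowed number of faults falls within the interval, and that timestamps are supplied by the caller in seconds; the class name `RingErrorThreshold` and its methods are hypothetical and do not describe how IMS stores the setting.

```python
from collections import deque


class RingErrorThreshold:
    """Track ring faults against a user-set count and time interval."""

    def __init__(self, max_faults: int, interval: float) -> None:
        self.max_faults = max_faults
        self.interval = interval
        self._faults = deque()   # timestamps of faults inside the window
        self._exceeded = False

    def _refresh(self, now: float) -> None:
        # Drop faults that have aged out of the window.
        while self._faults and now - self._faults[0] >= self.interval:
            self._faults.popleft()
        # An empty window means a full interval has passed without error,
        # which is the condition for returning to normal practice.
        if self._exceeded and not self._faults:
            self._exceeded = False

    def record_fault(self, now: float) -> None:
        self._refresh(now)
        self._faults.append(now)
        if len(self._faults) > self.max_faults:
            self._exceeded = True

    def restart_allowed(self, now: float) -> bool:
        """True while EAR may still respond first with a level-0 restart."""
        self._refresh(now)
        return not self._exceeded
```

With a threshold of two faults in 60 seconds, a third fault at 20 seconds disables restarts, and they become available again only once 60 error-free seconds have elapsed after that last fault.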
Multiple Faults
If a fault occurs in the active segment of a ring that currently contains a fault-generated isolation, a multiple-fault condition exists. In this case the 3B21D determines the relative size of two ring segments as measured in each direction from the beginning of the current isolation to the point of new failure. It then directs that the larger segment become the active ring and places all nodes that comprise the smaller segment in isolation. Often when multiple faults occur, the isolated segment that results will contain innocent victim nodes, nodes that are isolated, not because they are defective, but because they are surrounded by defective nodes. Multiple faults are statistically rare but have the potential for causing many nodes to be out-of-service.⁵
Blockage
RAC Parity/Format Error
Interframe Buffer Parity Error
Source Match and SRC Match
NAUD Failure, and
Unexplained Loss of Token.
The outages that occur during ring recovery actions are chiefly the result of ring silence. Ring silence is a condition imposed upon the nodes while the ring is restarting, initializing, or reconfiguring to achieve an isolation. During ring silence the nodes are not permitted to write to the ring. Although the actions of the IMS ring config module to restart the ring or to achieve an isolation require only a brief period of ring silence, the periods of silence required by continuity tests are significantly longer. Nevertheless, most EAR ring recovery attempts will be completed very rapidly. The lower levels of EAR escalative recovery actions are brief. A level 0, 1, or 2 recovery attempt may take up to 1 second to complete, while a level 3 attempt will usually take from 1.3 to 2 seconds. The soak periods of levels 4 and 5 make them somewhat more expensive. Typically, a level 4 attempt consumes 11 to 14 seconds and a level 5 attempt 90 seconds to 3 minutes, depending on ring size.
Overall system tolerance to these partial ring outages depends on the application. Where applications require very high availability of a particular user-node function, that function can be replicated on two or more nodes. By spacing these nodes equally around the ring, at least one member of the set should remain in the active ring segment for most cases of multiple ring faults.
The brevity of all but the longest of these ring recovery attempts means that technicians will ordinarily learn of them after they have completed. Moreover, with one exception, it is the practice of the 3B21D to queue error messages and send them to the MCRT only after the recovery level to which they apply has completed its attempt to return the ring to service. Technicians may infer, however, that a high-level recovery attempt is underway from previous output messages indicating failed recovery attempts at lower levels, as well as from the blinking of the "no token" lights on the circuit packs of all ring nodes, indicating that tests are occurring. The output messages concerning each ring recovery attempt will usually consist of the following items of information in the order shown:
1. A REPT RING CFR message announcing a specific level of EAR recovery attempt.
2. If the attempt was successful, a REPT RING CFR message indicating that the ring has been configured and identifying the new ring structure.
3. If the attempt was unsuccessful, a REPT RING CFR message indicating the reason for failure.
4. Separate REPT RING TRANSPORT ERR messages identifying each error that was received by the 3B21D in response to the fault that gave rise to the recovery attempt.
Notice that REPT RING TRANSPORT ERR messages ordinarily appear on the MCRT and ROP following the REPT RING CFR messages to which they apply. Yet, because each of these message types is stamped in milliseconds by the real-time clock, it is possible to confirm their relations. The real-time stamp on a REPT RING CFR message indicates the completion time of the attempt being reported. The real-time stamp on a REPT RING TRANSPORT ERR message indicates the time the report arrived at the 3B21D from a ring node.
Remembering that, after receiving a ring transport error report that may lead to node isolation, the 3B21D observes a listening period of 100 milliseconds before analyzing its reports and acting upon them, technicians can reconstruct system events.
One exception exists to the rule that the 3B21D queues error messages until the completion of the recovery attempt to which they give rise. If the 3B21D receives a loss-of-token report, then waits the 100-millisecond listening period without receiving another error report, it immediately reports REPT RING TRANSPORT ERR/UNEXPLAINED LOSS OF TOKEN to the MCRT and ROP before jumping to a level-3 recovery attempt. Therefore, in this single case the 3B21D reports events in the order of their occurrence. There is no time stamp on messages announcing loss of token.
Though quarantining a node reconfigures the ring, it is not accomplished by the ring config module and, therefore, produces no REPT RING CFR output message. Instead, technicians learn that a node has become quarantined from
RMV RPCN or RMV IUN output messages and from indicators on display pages. Also, when a node experiences a fault that leads to quarantine, it attempts to send a message to the 3B21D identifying the type of error that occurred. Currently EAR does not use the message for fault analysis. It does, however, report the error on the MCRT and ROP in the second line of a REPT ERROR output message. In the event of an intractable problem, technicians should record and report this line. The line will indicate, among other matters, whether the error was soft (requiring no system action), firm (requiring a restart), or hard (requiring a repump of the node software).
In response to a few error-types, however, a self-quarantined node does not attempt to restart itself but waits for the 3B21D to detect its state and to return it to service by restoring it in the manner described below.
internal audit, or unable to restart after one attempt, the 3B21D will detect its disabled condition, and if it is not already quarantined, quarantine it. Then ARR in the 3B21D will restore the node to service. ARR restores a node by downloading it with new operational code and placing the code into execution. Nodes may be restored either unconditionally without being previously diagnosed or conditionally by having their return to service depend on their passing all automatically-run diagnostic tests.
Maintenance States
ARR is driven to do its work by system indicators called IMS maintenance states. Maintenance states identify the operational mode of the ring and the operational mode, functionality, and condition of each ring node. They are determined and announced by programs in the 3B21D, mainly by EAR software. In addition to driving ARR to do its work, maintenance states serve as a primary source of system information for IMS users and for technicians who should always consult them before taking any manual action. Technicians may learn of current maintenance states from the IMS 1106 display page or from the OP:RING command. They should keep in mind that because maintenance states represent the central processor's knowledge of a distributed system, this knowledge under certain conditions may be temporarily incorrect. A node processor, for example, is allowed to quarantine itself if it detects certain irregularities in its software, but the 3B21D may not learn of this change of state until it has conducted a node audit or received a source match error. The following are the different classes of maintenance states:
• Ring state
• Node major state
• Node minor state: ring position
• Node minor state: ring interface
• Node minor state: node processor
• Node minor state: maintenance mode.
Ring States
The ring state identifies the current operational mode of the ring. The following states are possible:
Ring Normal - This state represents the two-ring configuration, with one ring serving as the active path that chiefly transmits user messages and the other serving as a standby path that may also transmit administrative and maintenance messages. A normal ring contains no isolated segment, but it may contain quarantined nodes.
Ring Isolated - In this state the ring contains an isolated segment. The nodes that bound the isolation are active and are identified as the beginning-of-isolation (BISO) and the end-of-isolation (EISO) nodes. Any node, including an RPCN, may act as a BISO or an EISO node. The ring cannot contain more than one isolated segment.
Ring Restoring - When Ring Restoring appears as a transitory state, it indicates a condition that occurs very briefly during ring reconfiguration. When Ring Restoring appears as an extended state, it indicates the responses of automatic maintenance to a failed BISO or EISO node. When a BISO or EISO node experiences a node-processor failure, critical node recovery (CNR) software first attempts to conditionally restore it. (Restoral software knows to run only those diagnostic phases that do not require isolation.) If the conditional restoral fails, ring config extends the isolated segment to include the faulty node. Attending to a failed BISO or EISO node is the highest priority activity of ARR/CNR.
Ring Configuring - In this state the ring is initializing, restarting, being reconfigured to isolate or unisolate one or more nodes, or engaged in one or more levels of EAR escalative recovery action.
Ring Down - Chief among the conditions that cause the ring to go down are that the 3B21D cannot communicate with it through any RPCN or that it is so fragmented by faults that EAR cannot define an active segment long enough to satisfy the criterion for minimum length. The first condition is most likely to occur when, in a two-RPCN environment, one RPCN has been manually taken out of service, after which the other experiences a failure in its 3B interface or duplex dual serial bus selector. During the time the ring is down, it is possible in some applications of IMS that all IUNs will continue to receive and transmit messages on the ring.7 For a fuller discussion of this matter, see the section ``Responding to Ring Down'' in this chapter.
ACT - Active. An active node is on-line and capable, unless the ring is silenced or configuring, of performing all required functions. An active node is neither quarantined nor isolated. In this document, the expression ``to return a node to service'' means to give it ACT status.
OOS - Out of service. An out-of-service node is unavailable for certain uses. The uses depend upon whether the node is quarantined or isolated. If the ring position (see below) of an out-of-service node is NORM, then the node is quarantined and can propagate messages on the ring, although it cannot read, write, or otherwise process messages. If the ring position of an out-of-service node is isolated, the node is entirely excluded from the active ring. Nodes in either OOS state are ordinarily able to receive and transmit only maintenance information and instructions.
STBY - Standby. This designation is used for RPCNs only. It indicates that a healthy RPCN is prevented from doing its work by the circumstance that the ring is down or configuring. It also appears as a transitional condition when an RPCN is being grown and during system-wide initializations.
INIT - Initializing. The attached processor of an extended node is being restarted or restored. The INIT state occurs as the second stage of restarting or restoring extended nodes. In the first stage, the node processor is restarted or, in the case of restorals, downloaded with operational code and set to executing. In the second or INIT stage, the attached processor is treated similarly. For DLNs the second stage also includes tests of the DMA channel.
OFL - Off-line. The node is quarantined out-of-service preliminary to being assigned a role in the active ring. Nodes should not be allowed to remain long in this condition, because their quarantined state prevents their node processor from fulfilling its important and unassignable role of error detection and reporting.
GROW - Grow. The node is physically being added to or removed from the ring. During growth or degrowth, the node must always be isolated.
UNEQ - Unequipped. Either the unequipped node has no hardware, or ring connections physically bypass it. Still, a place holder for the node exists in IMS software.
NORM - Normal. The node is included in the active ring and is neither a BISO nor an EISO node. A node in the NORM state may be quarantined; if it is quarantined, its node major state will be OOS or OFL.
BISO - The node is included in the active segment of an isolated ring and bounds the beginning of the isolated segment.
EISO - The node is included in the active segment of an isolated ring and bounds the ending of the isolated segment.
ISOL - Isolated. The node is contained in the isolated segment of an isolated ring. Its node major state will be OOS or OFL.
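The way the node major state and the ring position combine can be summarized in a short Python sketch. This is an illustration under stated assumptions, not IMS code: the function name is hypothetical, and combinations that the text above does not describe (for example an out-of-service node at a BISO or EISO position) are deliberately reported as unspecified.

```python
def out_of_service_kind(major_state, ring_position):
    """Classify a node from its major state and ring position.

    Returns "quarantined", "isolated", "unspecified", or None when the
    node is not in an out-of-service major state (OOS or OFL).
    """
    if major_state not in ("OOS", "OFL"):
        return None
    if ring_position == "NORM":
        # In the active ring: passes messages along but cannot process them.
        return "quarantined"
    if ring_position == "ISOL":
        # Inside the isolated segment: excluded from the active ring.
        return "isolated"
    # Assumption: other positions are outside the scope of this sketch.
    return "unspecified"


print(out_of_service_kind("OOS", "NORM"))
print(out_of_service_kind("OOS", "ISOL"))
```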
USBL - Usable. This is the default state. In other words, IMS regards ring-interface hardware as usable unless it has received an error message, a diagnostic result, or has detected a ring condition indicating otherwise.
QUSBL - Quarantine-usable, that is, usable by the ring to propagate data but not usable by the node processor, which is insulated from the ring as in the quarantine (OOS NORM) state. IMS sets ring-interface hardware of any node to QUSBL when diagnostics find or suspect a fault in the ring interface that does not prevent it from propagating messages on the ring. A node that fails only diagnostic phase 10, for example, would be set to QUSBL. When, under these circumstances, a ring interface is set to QUSBL, IMS unisolates the node if possible, quarantines it, and changes its maintenance mode (see below) to manual. Before performing diagnostics or other maintenance functions on the ring interface of the node, however, the node must be isolated.
IMS sets the ring interface of an IUN to QUSBL and the node processor to FLTY when, during a level-4 initialization, the node fails a communication test of its ability to receive downloaded code. If this occurs, the ring will return to service with the node in question quarantined and in the automatic maintenance mode.
IMS sets the ring interface of a node to QUSBL as a way of unisolating a node that is suspected of being faulty but that, as a member of an isolated segment, has passed phases 1 and 2 diagnostics without being subjected to further diagnostic phases.
FLTY - Faulty. The 3B21D has received information indicating that the ring-interface hardware is faulty. Thus the node is, or is about to be, isolated.
UNTSTD - Untested. The minor states of nodes are maintained in core memory only, not on disk or in ECD. Therefore, during a level-3 or level-4 initialization, the system loses knowledge of the ring-interface states of out-of-service nodes and must retest them. The testing is done during initialization, during which time their ring-interface states will briefly be UNTSTD.
USBL - Usable. This is the default state. In other words, IMS regards node processors and auxiliary components as usable unless it has received an error message, a diagnostic result, or has detected a ring condition indicating otherwise.
FLTY - Faulty. The node processor and/or one or more auxiliary components is known or suspected to be faulty. The 3B21D sets the node-processor state to FLTY when it receives error messages implicating the node processor or an auxiliary component. It also sets the state to FLTY when it learns that a node has quarantined itself. Nodes ordinarily quarantine themselves when they detect a problem in their node processors or in an auxiliary component. Thus the node-processor FLTY state does not necessarily mean that a problem is in the node processor. It could be in the node processor or in any of the auxiliary components of the node.
UNTSTD - Untested. Node minor states are maintained in current memory only, not on disk or in ECD. Therefore, during a level-3 or level-4 ring initialization, the system loses knowledge of the node-processor states of out-of-service nodes and must retest them. The testing is done during initialization, during which time their node-processor states will briefly be untested.8
AUTO - Automatic. In this mode a node is under control of IMS software. Nodes in the ACT state are always under automatic control. Nodes in the OOS state are under automatic control as long as ARR software is acting upon them.
MAN - Manual. This mode indicates that an out-of-service node is under the control of technicians. Control will change to manual because of the following:
• A technician has entered a form of the RMV, DGN, or RST command or has entered a command with similar consequences from the 1106 display page
• ARR has determined from diagnostics that a hard fault exists and is directing technicians to correct it
• ARR finds itself being asked for the fourth time within an hour to return the same node to service (explained below), or
• The application has requested that the node be placed in the manual mode. When such a request occurs, the ring-interface and node-processor states will remain set at USBL.
8 If, during ring initialization, a fault occurs requiring an isolation that includes innocent victim nodes, the node-processor hardware of the innocent victims might not have been tested before the isolation occurred and could not be tested during the isolation. In this case, the innocent victims would be quarantined, their ring-interface states set to usable, and their node-processor states set to untested. Then, when the isolation is dissolved, ARR, assuming that UNTSTD equals USBL, returns the nodes to service in accordance with its standard algorithm, which is explained below.
Table 3-1. Node Problems Mapped to Maintenance States and EAR Actions

NODE PROBLEM                          NODE   RING            RI     NP      MAINT.  EAR ACTION
                                      STATE  POSITION        STATE  STATE   MODE
None                                  ACT    NORM/BISO/EISO  USBL   USBL    AUTO
Local restart of an attached          INIT   NORM/BISO/EISO  USBL   USBL    AUTO
  processor
Faulty NP or auxiliary component      OOS    NORM/BISO/EISO  USBL   FLTY    AUTO    Quarantine the node
User request to test user interface   OOS    NORM/BISO/EISO  USBL   USBL    AUTO    Quarantine the node
Faulty RI hardware                    OOS    ISOL            FLTY   USBL    AUTO    Isolate the node
Faulty RI hardware that does not      OOS    NORM            QUSBL  USBL    AUTO    Quarantine the node
  interfere with propagating
  messages on the ring
Innocent Victim                       OOS    ISOL            USBL   USBL    AUTO
Faulty NP or auxiliary component      OOS    ISOL            FLTY   FLTY    AUTO    Isolate the node
  and faulty RI
Needed to begin an isolation          ACT    BISO            USBL   USBL    AUTO
Needed to end an isolation            ACT    EISO            USBL   USBL    AUTO
Untested NP                           OOS    NORM            USBL   UNTSTD  AUTO
6. Application-nominated critical nodes with low priority (quarantined) 7. Innocent victim IUNs (isolated) 8. Other IUNs (quarantined)
Nodes awaiting ARR restoral efforts may be contained in the active ring segment; or they may be contained in, or as BISO and EISO nodes associated with, the isolated segment. Because ARR's highest priority is to dissolve isolations, it deals first with nodes contained in or associated with an isolated segment. First, it attempts to return to service any node that has become inactive after being designated a BISO or EISO node.9 Next, it attempts to restore nodes that, by virtue of having faulty ring interfaces, are responsible for the isolation. Then, it restores healthy nodes that were victims of the isolation. Finally, having dissolved the isolation by restoring all isolated nodes, ARR turns to restore any quarantined nodes. The restoral priority list does not apply to node restarts, however, which occur independent of, and may occur in parallel with, node restorals.
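The four-stage order described in the paragraph above can be sketched as a sort key in Python. This is a simplified illustration, not the ARR algorithm itself: the dictionary layout and node names are hypothetical, the input is assumed to contain only nodes that are awaiting restoral, and the finer ranking within the full restoral priority list is not modelled.

```python
def restoral_stage(node):
    """Return 0..3 for the stage in which a waiting node would be handled."""
    position = node["ring_position"]
    if position in ("BISO", "EISO"):
        return 0  # failed boundary nodes come first
    if position == "ISOL":
        # Faulty ring interfaces caused the isolation; other nodes are victims.
        return 1 if node["ri_state"] == "FLTY" else 2
    return 3      # quarantined nodes in the active ring come last


def restoral_order(waiting_nodes):
    """Order waiting nodes by stage; ties keep their original order."""
    return sorted(waiting_nodes, key=restoral_stage)


waiting = [
    {"name": "IUN00 11", "ring_position": "NORM", "ri_state": "USBL"},
    {"name": "IUN00 08", "ring_position": "ISOL", "ri_state": "USBL"},
    {"name": "IUN00 07", "ring_position": "ISOL", "ri_state": "FLTY"},
    {"name": "IUN00 06", "ring_position": "BISO", "ri_state": "USBL"},
]

for node in restoral_order(waiting):
    print(node["name"])
```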
9 These are termed IMS critical nodes. Their recovery efforts go by the special title critical node recovery (CNR), a title that may appear on IMS display pages.
10 Technicians may learn of the status of IMS requests at MIRA from the RTR OP:DMQ command, as well as from IMS 1105 and 1106 display pages, which are discussed in this chapter.
out-of-service and changes its maintenance mode from automatic to manual, thus delegating the problem to technicians. In this document this practice is called the fourth-time rule. Self-initiated node restarts are not counted in the fourth-time rule, nor are unconditional and conditional restorals distinguished. Thus any combination of four restorals during a 60-minute interval violates the rule.
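A minimal Python sketch of the fourth-time rule follows. It is an illustration under stated assumptions, not the ARR implementation: the class and method names are hypothetical, times are plain seconds supplied by the caller, the caller is assumed to exclude self-initiated restarts, and the treatment of a request made exactly 60 minutes after an earlier one is a guess.

```python
from collections import deque

WINDOW_SECONDS = 60 * 60  # the 60-minute interval named in the text
TRIP_COUNT = 4            # the fourth restoral request trips the rule


class FourthTimeRule:
    def __init__(self):
        self._requests = {}  # node name -> deque of request times (seconds)

    def request_restoral(self, node, now):
        """Record a restoral request; return the resulting maintenance mode."""
        times = self._requests.setdefault(node, deque())
        # Forget requests that have fallen out of the 60-minute window.
        while times and now - times[0] >= WINDOW_SECONDS:
            times.popleft()
        times.append(now)
        # Conditional and unconditional restorals are counted alike.
        return "MAN" if len(times) >= TRIP_COUNT else "AUTO"


rule = FourthTimeRule()
for t in (0, 600, 1200, 1800):
    print(t, rule.request_restoral("IUN00 03", t))
```

Keeping one queue per node means that requests for different nodes never count against each other.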
application interface. If the external user interface passes diagnostics, ARR automatically returns the node to service. If it fails diagnostics, ARR changes the maintenance mode of the node to manual.
Table 3-2.
NODE STATE
OOS
OOS   NORM   USBL    UNTSTD   n/a
OOS   NORM   QUSBL   FLTY     n/a
OOS   NORM   USBL    FLTY     extended node
OOS   ISOL   FLTY    USBL     manual maintenance
OOS   ISOL   USBL    FLTY     n/a
OOS   ISOL   USBL    USBL     isolation ends       pump & return to service
ACT   BISO   USBL    USBL                          chg. BISO to NORM
ACT   EISO   USBL    USBL                          chg. EISO to NORM
Table 3-3.
The time taken by ARR to return a node to service varies considerably, depending on such factors as the type of restoral and the number of jobs waiting in MIRA's queue. An unconditional restoral usually takes 30 to 90 seconds. A full and successful diagnosis of a basic IUN or RPCN may take 5 to 8 minutes, while a failing diagnosis usually takes somewhat longer. Diagnosis of an extended node takes longer still, perhaps as much as 15 minutes.
Alarms
The following alarms indicate trouble that may affect IMS equipment:
Critical Alarms
A critical condition or fault in or associated with the IMS ring will be indicated by an asterisk C (*C) preceding the ROP output message that identifies the problem. It may also be indicated by an audible alarm and a red CRITICAL indicator on each MCRT display-page header.
Major Alarms
A major condition or fault in the IMS ring is indicated by two asterisks (**) preceding the ROP output message that identifies the problem. It may also be indicated by the following:
• An audible alarm
• A red MAJOR indicator on each MCRT display-page header, and
• A red lamp on the aisle containing the frame/cabinet where the fault or failure occurred.
See the ``Special IMS Indicators'' section in this chapter for descriptions of other indicators that may appear with a major alarm.
If a major alarm is caused by a power failure, the POWER indicator on each MCRT display-page header will show red, and display page 1111 will identify the type and location of the problem. If the problem is a failed power converter circuit pack in an IMS frame/cabinet, the lamp at the aisle containing the disabled frame/cabinet will show red, and inside the frame/cabinet the power alarm light at the top-left will show red also.
Minor Alarms
A minor condition or fault in the IMS ring is indicated by one asterisk (*) preceding the ROP output message that identifies the problem. It may also be indicated by the following:
• An audible alarm
• A red MINOR indicator on each MCRT display-page header, and
• A yellow lamp on the aisle containing the frame/cabinet where the fault or failure occurred.
See ``Special IMS Indicators'' below for descriptions of other indicators that may appear with a minor alarm. If a minor alarm is caused by a power failure, the POWER indicator on each MCRT display-page header will show red, and display page 1111 will identify the type and location of the problem. If the problem is a single failed fan in an IMS frame/cabinet, the lamp at the aisle containing the disabled frame/cabinet will show yellow, and inside the frame/cabinet the power alarm light at the top-left will show red.
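The three alarm prefixes described in this section (*C, **, and *) can be told apart with a few string tests. The Python sketch below is illustrative only: the function name is hypothetical, and the sample lines are invented to show the prefixes, not copied from real ROP output.

```python
def alarm_level(rop_line):
    """Return the alarm level implied by the prefix of a ROP output line."""
    text = rop_line.lstrip()
    if text.startswith("*C"):
        return "CRITICAL"
    if text.startswith("**"):
        return "MAJOR"
    if text.startswith("*"):
        return "MINOR"
    return None  # no alarm prefix


# Hypothetical sample lines; real message layouts may differ.
for line in ("*C REPT RING INIT", "** REPT RING TRANSPORT ERR", "* REPT IUN", "REPT ERROR"):
    print(alarm_level(line), "|", line)
```

The critical prefix is tested first because a line starting with *C would otherwise also satisfy the single-asterisk test.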
initialization tests confirm that the firmware within the pack is executing. The nature and uses of these LEDs are explained in the section ``Ring Application Processor Critical Maintenance Procedure.''
The application-processor circuit pack in a direct link node (DLN) is equipped with green, red, and yellow LEDs. The green stays on during normal operation and goes off when the node is taken out-of-service, when a hard panic occurs in the node processor, or when diagnostic code begins to be downloaded, whichever occurs first. The red and yellow LEDs come into play as either diagnostic or operational code is downloaded. Diagnostic phase 41 begins with a firmware test. During the test the red and yellow LEDs come on and stay on permanently if the test fails. If the test passes, the red goes off briefly, then joins the yellow back on again as the diagnostic proper begins. If the diagnostic fails, the yellow goes off and the red stays on. If the diagnostic passes, the red goes off and the yellow stays on until the node processor receives the diagnostic results, at which time it goes off. Then red and yellow come on and go off again as operational code is downloaded, and the green comes on as the attached processor is placed in execution. If technicians wish to consult support about the performance of a DLN, they might first observe the behavior of these LEDs so they can report it.
Output messages on the ROP are preceded, when appropriate, by an M or an A, indicating that the action described in the message is the result of a manual or an automatic IMS request. Table 3-4 on page 3-27 shows the IMS output messages accompanied by the types of alarms.
Table 3-4. Alarms Associated with IMS Output Messages (Page 1 of 2)
MESSAGE
REPT DB INIT
REPT ERROR
REPT IMSDRV AUD
REPT IMSDRV FLT
REPT IMSDRV INIT
REPT IUN
REPT MSDC FLT
REPT OP_RTM FLT
REPT PSDO_UMS>P FLT
REPT RING GROWTH
REPT RING INIT
SEVERITY CRT: X X X
SEVERITY MAJ: X X
SEVERITY MIN: X X X X X X X X X X
Table 3-4.
Alarms Associated with IMS Output Messages (Page 2 of 2) SEVERITY MESSAGE CRT MAJ MIN
X X X X
REPT RING TRANSPORT ERR REPT TDTP FLT AUD CNC AUD NODEST
Other IMS output messages are not accompanied by audible or visual alarms.
Display Pages
IMS provides technicians with two MCRT display pages, page 1105, the Ring Status Summary Page, and page 1106, the Ring Node Status Page. These pages are similar in appearance and function to RTR display pages, and the procedure used to access them is also the same. The first three lines of the IMS pages, consisting of the standard header information that appears on all RTR display pages, are omitted from the illustrations that follow. For more information on Status Display Page(s), see 410-610-160, The FLEXENT/AUTOPLEX Wireless Networks, Executive Cellular Processor (ECP) Operations, Administration, and Maintenance Guide.
To access a particular display page, perform the following actions in the order indicated.
1. Type the NORM/DISP key.
2. Place the MCRT in the command mode by typing the CMD/MSG key.
3. Type and enter 1105 or 1106 on the numeric key pad.
During ring initialization and configuration, indicators or data shown on display pages may be invalid or out of date; and during disk independent operation, the display page process is terminated.
-- 1105 RING STATUS SUMMARY --
[Ring Error Threshold State]
CMD Function
400 OP Ring Detailed
[ARR Restore; System Indicator; IMSRTS.P indicator]
[ARR Restart]
[ACNR Restore or Restart]
00 AAAOAAAiigAOO...
01 .AAAAOOAA...AAAA
02 .AAAAAAAAA...AAA
32 AAAAAAAAOOOAAA..
33 .AAAAAAAAAAAAAAA
34 .AOOOOOAAAAAAAAA
Figure 3-1. An 1105 Display Page
The 1105 page, as exemplified in the above figure, offers the following information and capabilities: The first line contains, on the left, the CMD> prompt for command entries and, on the right, the page title. To enter display commands, move the cursor to the CMD> prompt by typing the CMD/MSG key, then enter the command. The next three lines identify, in square brackets, locations on the page where the types of information, shown within the square brackets, will appear, when appropriate. The brackets themselves will not appear on display pages.
• [Ring Major State] appears at the location where the current ring state will be displayed. One of the following states should always be present:
RING STATE ACTIVE
RING STAT ISOLATED SEGMENT
RING STAT CONFIGURING
RING STAT DOWN
RING STAT RESTORE
• [Ring Error Threshold State] is the location where a message will appear when the Ring Error Threshold has been exceeded. The threshold is set by the user to indicate the number of faults per interval of time to be permitted before the IMS practice of responding initially to ring-related faults with EAR level-0 (restarting the ring) is discontinued and replaced by
EAR level-1 (isolating the fault) or, in response to unexplained loss of token, by EAR level-3 (ring continuity testing). After the threshold is exceeded, an error-free period of time the length of the threshold interval is required before IMS returns to its normal practice concerning ring restarts. When IMS returns to its normal practice, the Ring Error Threshold Exceeded tag will disappear from the 1105 page, and the location will be blank.
• The information CMD Function/400 OP Ring Detailed appears permanently on the 1105 page to remind technicians that the page also allows entry, at the CMD> prompt, of the 400 command, which produces the same output as the input message OP:RING;DETD.
• [ARR Restore; System Indicator; imsrts.p Indicator] appears at the location where a, b, or c, below, will appear:
a. A node that ARR is currently attempting to restore, conditionally or unconditionally. The identification will read ARR followed by the method of restoral (UCL for unconditional, COND for conditional) followed by the node name in the form NODEa b. If ARR is attempting to restore an EISO or BISO node (see ``Three ARR Rules'' above), CNR will appear in place of ARR.
b. One of the following system states of IMS:
IMS FPI PROLOGUE (appears during the initial stage of an FPI initialization)
IMS SYS BOOT (appears during the initial stage of level-3 or -4 BOOT initialization)
IMS LVL3 INIT (appears during subsequent stages of a level-3 initialization)
IMS LVL4 INIT (appears during subsequent stages of a level-4 initialization)
IMS SYS CRIT SEQ CMPL (appears at the conclusion of a level-3 or -4 FPI or BOOT initialization)
IMS SYS ABORT (appears prior to a level-3 or level-4 BOOT initialization)
c. One of the following states of the imsrts.p process, which creates the IMS display pages:
IMSRTS.P CREATED (see below)
If ARR is not currently attempting to restore a node and none of the system or IMSRTS.P conditions exist, the location will be blank.
• [ARR Restart] appears at the location where any node (other than an application-nominated critical node) that ARR is currently attempting to restart will be identified. Node restarts that are initiated locally by the node processor are neither recognized nor recorded by this indicator.
• [ACNR Restore or Restart] appears at the location where any application-nominated critical node (see ``Three ARR Rules'' above) that ARR is currently attempting to restore or restart will be identified. Because one ARR restart and one ACNR restart may occur in parallel and because one or both restarts may occur in parallel with a single restore, it is possible to have all three node-activity indicators lighted simultaneously. It is not, however, possible to have two restorals occurring simultaneously, since IMS can restore only one node at a time (see ``Three ARR Rules'' above).
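The Ring Error Threshold behavior described for the [Ring Error Threshold State] indicator can be sketched as a small state machine. The Python below is an illustration under stated assumptions, not IMS code: the class and parameter names are hypothetical, "exceeded" is taken to mean more faults within the interval than the configured number, and the fault that trips the threshold is assumed to receive the escalated response already.

```python
class RingErrorThreshold:
    def __init__(self, max_faults, interval_s):
        self.max_faults = max_faults  # faults permitted per interval
        self.interval_s = interval_s  # length of the interval in seconds
        self.exceeded = False
        self._times = []              # times of recent ring faults

    def first_response(self, now, unexplained_loss_of_token=False):
        """Record a ring fault at `now`; return the initial EAR level."""
        # An error-free period as long as the interval restores normal practice.
        if self.exceeded and now - self._times[-1] >= self.interval_s:
            self.exceeded = False
        # Keep only the faults that fall inside the current interval.
        self._times = [t for t in self._times if now - t < self.interval_s]
        self._times.append(now)
        if len(self._times) > self.max_faults:
            self.exceeded = True
        if not self.exceeded:
            return 0  # level 0: restart the ring
        # Level 3 for unexplained loss of token, otherwise level 1.
        return 3 if unexplained_loss_of_token else 1


threshold = RingErrorThreshold(max_faults=2, interval_s=60)
print([threshold.first_response(t) for t in (0, 10, 20)])
```

In this sketch the exceeded flag plays the part of the Ring Error Threshold Exceeded tag on the 1105 page: it is set when the count passes the limit and cleared only after a quiet period of one full interval.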
The next section of the display page, beginning in the above example with the fifth line, identifies all frames/cabinets in the IMS system, each node within each frame/cabinet, and the major state of each node. The nodes that occupy a frame/cabinet are called a group. The example shows six groups identified by their group numbers as 00, 01, 02, 32, 33, and 34. To the right of the group numbers are characters representing the sixteen nodes or node positions within each group. Thus the first character represents the RPCN, and the next fifteen characters represent IUNs. In the IMS numbering scheme, nodes are identified by the formula RPCNa b or IUNa b, where a is the two-digit group number and b is a number between 00 and 15 that corresponds to the sequential location of the node within its group on the downstream path of ring 0. Thus RPCNs are always numbered 00 and IUNs are always numbered 01 to 15. The characters also identify, in accordance with the following formulas, the current major state of each of the sixteen nodes. See Table 3-5 on page 3-31.
Table 3-5.
Active
Standby
Out of service, quarantined
Out of service, isolated
Grow
Offline
Unequipped
Initializing
In the instances that provide an alternative of an upper- or a lower-case letter, the lower-case signifies that the node is isolated, and the upper-case signifies that the node is in the active ring. In the example of an 1105 page above:
• RPCN00 00 is in the active node major state
• LN00 01 and LN00 02 are also active
• LN00 03 is out-of-service quarantined
• LN00 04, LN00 05, and LN00 06 are active
• LN00 07 and LN00 08 are out-of-service isolated
• LN00 09 is in the grow state and is isolated
• LN00 10 is active
• LN00 11 and LN00 12 are out-of-service quarantined, and
• LN00 13, LN00 14, and LN00 15 are unequipped12
12 When a group contains any out-of-service nodes, IMS color-codes the entire group with red background on white lettering.
For additional information on the node and ring maintenance states, refer to the ``ARR or Deferrable Node Recovery'' section of this chapter.
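The decoding of one group line on the 1105 page can be sketched in Python. This is an illustration, not an IMS tool: the function name is hypothetical, node names follow the RPCNa b / IUNa b formula given in the text, and the character meanings are limited to those that the worked example for group 00 attests. Any other character is reported as unknown here.

```python
# Characters attested by the worked example for group 00; other codes
# are not covered by this sketch.
STATE_BY_CHAR = {
    "A": "active",
    "O": "out-of-service, quarantined",
    "i": "out-of-service, isolated",
    "g": "grow (isolated)",
    ".": "unequipped",
}


def decode_group_line(line):
    """Return [(node_name, state), ...] for one group line of the 1105 page."""
    group, chars = line[:2], line[2:].strip()
    if len(chars) != 16:
        raise ValueError("expected 16 node positions after the group number")
    nodes = []
    for member, char in enumerate(chars):
        kind = "RPCN" if member == 0 else "IUN"  # position 00 is the RPCN
        name = f"{kind}{group} {member:02d}"
        nodes.append((name, STATE_BY_CHAR.get(char, "unknown")))
    return nodes


for name, state in decode_group_line("00AAAOAAAiigAOO..."):
    print(name, "-", state)
```

The function accepts the line with or without a space between the group number and the sixteen state characters.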
CMD>                                          -- 1106 - RING NODE STATUS --
NODE>                                                     RING  MAJOR  RI     NP     MAINT
[Ring Status]                              NODE NAME      POS   STATE  STATE  STATE  MODE
[ARR Restore, etc.]                    01  RPCN00 00      NORM  ACT    USBL   USBL   AUTO
[ARR Restart]                          02  LN00 01        NORM  ACT    USBL   USBL   AUTO
[ACNR Restore or Restart]              03  LN00 02        BISO  ACT    USBL   USBL   AUTO
CMS      FUNCTION                      04  LN00 03        ISO   OOS    FLTY   USBL   MAN
2xx      RMV node (line xx)            05  LN00 04        ISO   OOS    FLTY   USBL   MAN
3xx      RST node (line xx)(UCL)       06  LN00 09        EISO  ACT    USBL   USBL   AUTO
400      BISO-EISO                     07  LN00 14        NORM  OOS    USBL   FLTY   AUTO
401/402  all non-ACT(next/prev)        08  LN00 15        NORM  ACT    USBL   USBL   AUTO
403/404  all Equipped(next/prev)       09
500      DGN Isolated Segment          10
5xx      DGN node (line xx)            11
6nn      Group nn                      12
7xx      RST node (line xx)(COND)      13
                                       14
TOTAL                                  15
                                       16
Figure 3-2.
An 1106 Display Page The 1106 page is composed of three areas. The area to the right, beginning with and including the column of line numbers 01 through 16, displays the major and minor states of a group of up to sixteen technician-specied nodes. In this document, this is called the display area. The area at the top left beginning CMD> and ending ACNR Restore or Restart is the command-interface and system-status area. In this document, this is called the command area. The area below the command area and to the left of the column of line numbers is a nonselectable command menu. In this document, this is called the menu area. The Menu Area. Entries in the CMS column of the menu area list the input forms for commands identied under the FUNCTION column. These commands may be typed and entered at the CMD> prompt. The xx in the rst, second, seventh, and ninth commands represent a line numbernot a node numberfrom the column of numbers, beginning 01 and ending 16, at the center of the page. Each line number is associated with the node to its right. In the above example, line 02 represents IUN00 01; and to quarantine IUN00 01, a technician would enter 202 at the CMD> prompt. By contrast, the nn in the next-to-the-last command represents not a line number but a group number. In the above example, to have the nodes contained in group 32 displayed, a technician would enter 632. Below is a listing of the results obtained from entering these 3-digit commands: 2xx 3xx Quarantines the node identied on line xx. Unconditionally restores the node identied on line xx.
Issue 16.0
December 2000
3-33
401-661-045
400  Displays, if the ring has an isolated segment, the currently isolated nodes preceded by the BISO node and followed by the EISO node. If the isolated segment contains more than 14 nodes, the display lists first the BISO node, then the first seven isolated nodes downstream of the BISO node, then the last seven isolated nodes upstream of the EISO node, then the EISO node. The Total line below the menu area shows when a portion of an isolated segment is missing from the display (because the isolation contains more than 14 nodes): after the 400 command is entered, this line displays a number that includes all currently isolated nodes plus the BISO and EISO nodes. The count on the Total line updates interactively.

401  Initially provides in the display area a list of nodes in the ring that are neither active nor unequipped. Thus it lists any nodes that are in the out-of-service, standby, initializing, and grow states. After the 401 command is entered, the total number of nonactive nodes is given on the Total line below the menu area and updated interactively. If this number is greater than 16, technicians may page forward and backward in the list by reentering 401 and 402, respectively.

403  Entered the first time, provides a list of nodes in the ring that are equipped. Thus it lists all nodes that are in the active, out-of-service, standby, initializing, and grow states. After the 403 command is entered, the total number of equipped nodes is given on the Total line below the menu area and updated interactively. If this number is greater than 16, technicians may page forward and backward in the list by reentering 403 and 404, respectively.

500  Runs diagnostic phases 1 and 2 on all RACs in the isolated ring segment.

5xx  Runs all automatic diagnostic phases on the node identified at line xx.

6nn  Displays all equipped nodes in group nn, where nn is not the line number but the group number. After the 6nn command is entered, the total number of equipped nodes within the group is given on the Total line below the menu area and updated interactively.

7xx  Conditionally restores the node identified on line xx.
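The command forms listed above can be summarized in a short sketch. It is illustrative only: the function names, the returned wording, and the decision to accept any two-digit group number for 6nn are assumptions, and the display-page numbers that the CMD> prompt also accepts are not modeled. The second function assumes the isolated nodes are supplied in ring order from the BISO node toward the EISO node.

```python
import re

# Fixed commands from the 1106 menu; descriptions paraphrase the text above.
FIXED = {
    "400": "display BISO node, isolated nodes, and EISO node",
    "401": "list non-active nodes (page forward)",
    "402": "list non-active nodes (page backward)",
    "403": "list equipped nodes (page forward)",
    "404": "list equipped nodes (page backward)",
    "500": "diagnose isolated segment",
}

# Commands whose last two digits are a display line number (01 through 16).
LINE_COMMANDS = {
    "2": "remove (quarantine) node on line",
    "3": "unconditionally restore node on line",
    "5": "diagnose node on line",
    "7": "conditionally restore node on line",
}


def classify_command(text):
    """Return (response, meaning), where response is 'OK' or 'NG'."""
    if not re.fullmatch(r"[0-9]{3}", text):
        return ("NG", "not a 3-digit command")
    if text in FIXED:
        return ("OK", FIXED[text])
    head, arg = text[0], int(text[1:])
    if head in LINE_COMMANDS and 1 <= arg <= 16:
        return ("OK", f"{LINE_COMMANDS[head]} {arg:02d}")
    if head == "6":
        # Assumption: any two-digit group number is accepted.
        return ("OK", f"display equipped nodes in group {arg:02d}")
    return ("NG", "unrecognized command")


def isolation_display(biso, isolated, eiso):
    """Nodes listed by the 400 command, in display order.

    With more than 14 isolated nodes, only the first seven and the last
    seven are shown between the BISO and EISO nodes.
    """
    shown = list(isolated)
    if len(shown) > 14:
        shown = shown[:7] + shown[-7:]
    return [biso] + shown + [eiso]


def isolation_total(isolated):
    """Value on the Total line: all isolated nodes plus BISO and EISO."""
    return len(isolated) + 2
```

Comparing `isolation_total` with the length of the list returned by `isolation_display` mirrors the way a technician would notice from the Total line that part of a long isolated segment is not shown.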
The Command Area. CMD> is the prompt for any of the 3-character commands listed in the command menu. Entering a valid command here evokes an OK response. Entering an invalid command evokes an NG response. To enter a command, manipulate the cursor with the CMD/MSG key until it is at the prompt.
Then type and enter a 3-character command from the CMS column of the menu area. The prompt also accepts as input the numbers of display pages to which the technician wishes to turn.

Node> is the prompt for a command that allows technicians to select the sequence of nodes displayed after having entered a 401 or 403 command. To employ this feature, enter 401 or 403, manipulate the cursor with the arrow keys to the Node> prompt, and then type and enter the identification, in the form IUNa b or RPCNa b, of the node you wish to form the starting point of the sequence. The display will be redrawn with the specified node as the last entry in the 401 display and as the first entry in the 403 display. This feature is not available for the 400 and 6nn commands, where its reordering might be confusing.

[Ring Status] appears at the location where the current ring state will be displayed. One of the following states should always be present:
RING STATE ACTIVE
RING STAT ISOLATED SEGMENT
RING STAT RESTORING
RING STAT CONFIGURING
RING STAT DOWN

[ARR Restore, etc] [ARR Restart] [ACNR Restore or Restart] provide the same information as they do for the 1105 display page, as explained above. Because one ARR restart and one ACNR restart may occur in parallel, and because one or both restarts may occur in parallel with a single restore, it is possible to have all three node-activity indicators appear simultaneously. It is not possible, however, to have two restorals appear simultaneously, since IMS can restore only one node at a time (see ``Three ARR Rules'' above).

The Display Area. The display area lists up to 16 nodes and identifies their major and minor maintenance states. Node major and minor states are explained above in the ``ARR or Deferrable Node Recovery'' section of this chapter. A listing of the maintenance states follows:
Node Major States
ACT - Active
OOS - Out of service
STBY - Standby
INIT - Initializing
OFL - Off-line
Node Minor States: Ring Position
NORM - Normal
BISO - Beginning of Isolation
EISO - End of Isolation
ISOL - Isolated

Node Minor States: Ring Interface
USBL - Usable
QUSBL - Quarantine-usable
FLTY - Faulty
UNTSTD - Untested

Node Minor States: Node Processor
USBL - Usable
FLTY - Faulty
UNTSTD - Untested
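The state lists above, together with the 401 and 403 selections described earlier, can be modeled as a small data structure. This is a sketch under stated assumptions: the GROW and UNEQ members are assumed names for the grow and unequipped conditions mentioned in the 401 and 403 descriptions (they do not appear in the major-state list), and OFL is left out of both selections because the text does not say how those pages treat off-line nodes.

```python
from dataclasses import dataclass
from enum import Enum


class Major(Enum):
    ACT = "Active"
    OOS = "Out of service"
    STBY = "Standby"
    INIT = "Initializing"
    OFL = "Off-line"
    GROW = "Grow"        # assumed name, taken from the 401/403 text
    UNEQ = "Unequipped"  # assumed name, taken from the 401 text


@dataclass
class Node:
    name: str
    major: Major
    ring_pos: str = "NORM"  # NORM, BISO, EISO, or ISOL
    ri: str = "USBL"        # USBL, QUSBL, FLTY, or UNTSTD
    np: str = "USBL"        # USBL, FLTY, or UNTSTD


# States listed explicitly for each command in the text.
NON_ACTIVE = {Major.OOS, Major.STBY, Major.INIT, Major.GROW}
EQUIPPED = NON_ACTIVE | {Major.ACT}


def select_401(nodes):
    """Nodes that are neither active nor unequipped."""
    return [n for n in nodes if n.major in NON_ACTIVE]


def select_403(nodes):
    """Nodes that are equipped."""
    return [n for n in nodes if n.major in EQUIPPED]


def page(nodes, index, size=16):
    """One display page of up to 16 nodes; index 0 is the first page."""
    return nodes[index * size:(index + 1) * size]
```

The length of the list returned by `select_401` or `select_403` corresponds to the count shown on the Total line, and `page` reflects the 16-line limit of the display area.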
Nodes may be added to 401 and 403 displays by manipulating the cursor to any vacant line in the display and typing and entering a node name in the form LNa b or RPCNa b. The display will provide status information for the node and also display the line number in reverse video, indicating its special status. The special-status node will disappear when a new command is entered at the CMD> prompt. Prior to that time, the node may be deleted manually by manipulating the cursor to the line and then pressing only the RETURN key.
Ring Diagnostics
IMS provides diagnostic tests for all circuit packs that reside in the ring node frames/cabinets except power supplies. These tests are submitted as requests to MIRA and performed in a manner similar to standard RTR diagnostics. They may be initiated automatically by ARR or manually by technicians through input messages or display-page commands.
Each IMS node-type is tested by a distinct diagnostic routine; each diagnostic routine is composed of units of sequential execution called phases; and each phase tests functionally related hardware. Phases are automatic or optional (available on demand). Automatic phases are executed when a diagnostic is run at the request of ARR or in response to a manual request without the PH option. Optional phases are executed only in response to manual requests in which they are specified in the PH option.

Phases are identified by the node-type on which they are executed and by phase numbers. Node-types are further distinguished by their hardware composition. The currently available node-types are IRN RPCNs, IRN2 RPCNs, IRN LNs (LIN-E/SS7), IRN LNs (LI4S/SS7), IRN DLNEs, IRN DLN30s, IRN CDN-Is, IRN CDN-IIs, IRN CDN-IIxs, CDN-IIIs, SS7NEs, DLN60s, and IRN MDLs. Phase numbers reflect the relative order in which phases are run within a routine.

Diagnostic phases 1 and 2 are special in two ways. They are common to all node-types; and when full, automatic diagnostics are requested, whether manually or by ARR, on any node (thus requiring that the node be isolated), phases 1 and 2 test the entire path within the isolation as a preliminary step to testing the specified node. Testing the isolated path requires partial tests of all nodes and interframe buffers within the isolated segment as well as tests of the isolated RACs of the EISO and BISO nodes. Running phases 1 and 2 also has the effect of clearing RAC status registers. RAC status registers may become improperly set as a consequence of a fault, of the node being powered down, or of the RAC circuit pack being removed or reset.

Phase 40 is a critical juncture in IMS diagnostics. When a diagnostic request includes only phases above 39, IMS quarantines the node before running the diagnostic phases on it. When, on the other hand, a diagnostic request includes any phases below 40, IMS attempts to isolate the node prior to running diagnostics on it. If, however, ring conditions do not permit the node to be isolated, IMS runs, while the node is quarantined, all requested phases that do not require the node to be isolated. These will include all requested phases above 39 and some requested phases below 40.

Most IMS diagnostic routines terminate at the end of a phase in which a test fails. A few terminate at the end of a failing test. Important exceptions to this statement are as follows:

If phase 1 or 2 fails in any node-type, all of phases 1 and 2 are still run.

If either or both of phases 1 and 2 fail in RPCNs, phases 10 through 27 are still run unless a test fails in these upper phases, in which case diagnostics terminate at the end of the failing upper phase.
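The phase-40 rule described above can be expressed as a small decision function. This is a minimal sketch, not IMS logic: the function name and return values are illustrative, and `needs_isolation` is a hypothetical parameter standing in for the set of low phases that cannot run on a quarantined node, which the text does not enumerate.

```python
def plan_diagnostic(phases, can_isolate, needs_isolation):
    """Return (node_preparation, phases_to_run) for a manual PH request.

    phases          -- requested phase numbers (must not be empty)
    can_isolate     -- True if ring conditions permit isolating the node
    needs_isolation -- hypothetical set of phases that require isolation
    """
    requested = sorted(set(phases))
    if not requested:
        raise ValueError("a PH request must name at least one phase")
    # Only phases above 39: the node is quarantined, never isolated.
    if all(p > 39 for p in requested):
        return ("quarantine", requested)
    # Some phase below 40: IMS attempts to isolate the node first.
    if can_isolate:
        return ("isolate", requested)
    # Isolation not possible: run what can run on a quarantined node.
    runnable = [p for p in requested if p not in needs_isolation]
    return ("quarantine", runnable)
```

For example, with `needs_isolation={1}` and a ring that cannot accept the isolation, a request for phases 1, 12, and 45 would be reduced to phases 12 and 45 run under quarantine.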
Diagnostic Fault Tables, also available for each node type, associate phases with the circuit packs they test, thereby providing a list of suspect circuit packs for any failing phase. Whether diagnostics are initiated automatically or manually, their results appear as output messages on the ROP. The DGN output message identifies failing phases and failing tests for a faulty node. The ANALY TLPFILE output message provides a list of suspect circuit packs in the faulty node. The ANALY TLPFILE message, invoked by the TLP option of the RST command, is always included by ARR requests to restore a node. In the ANALY TLPFILE message, each circuit pack associated with a diagnostic failure is assigned a number between one and ten. The number represents the probability, as calculated by IMS software, that the location of the fault is in the pack; the higher the number, the greater the probability. The DGN and ANALY TLPFILE output messages are primary sources of diagnostic information for technicians.
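Because a higher number means a more likely fault location, the suspect list can be ordered before packs are replaced. The sketch below assumes the technician has already copied each pack and its number from the ANALY TLPFILE message by hand; it does not parse the message, whose layout is not shown here, and the pack labels in the usage note are invented.

```python
def rank_suspects(suspects):
    """Order (pack, number) pairs so the most probable pack comes first.

    Each number must lie between 1 and 10. Packs with equal numbers
    keep the order in which they were supplied.
    """
    for pack, number in suspects:
        if not 1 <= number <= 10:
            raise ValueError(f"{pack}: number {number} is outside 1-10")
    ordered = sorted(suspects, key=lambda item: item[1], reverse=True)
    return [pack for pack, _ in ordered]
```

Called with `[("pack A", 3), ("pack B", 9), ("pack C", 3)]`, the function places pack B first and leaves packs A and C in their original relative order.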
Diagnostic Listings
If the information provided by ROP output messages fails to identify faulty equipment, further scrutiny of the diagnostic results is possible using diagnostic listings. A diagnostic listing is a document that describes a particular diagnostic phase. Common Network Interface has available the diagnostic listings that pertain to the CNI configuration of the ring. They consist of the listings for ring peripheral controller nodes, link nodes, attached processors, and ring application processors.

A diagnostic listing is composed of a prologue and a statement sequence. The prologue introduces the subject phase by explaining what it tests, how the testing is done, and what hardware is involved. All lines in the prologue begin with the character C, indicating they are comments. The statement sequence consists of information, arranged into numbered statements, about each command within the series of commands that constitutes the phase. Each statement contains a statement number, a source-file version of the command, and an ASCII representation of the executable version of the command. The ASCII representation is on a line that begins with the string * adr, unless the command generates a test, in which case the line begins with * test followed by the test number. Most statements are preceded by one or more comment lines that explain the purpose of the command that follows.

Statement numbers correspond to numbers that appear in early termination output messages and in DGN AUDIT RING output messages. They are also used in the EX input message. Test numbers correspond to the test numbers that appear in DGN output messages. For technicians, test numbers are the most important information in diagnostic listings.
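Since test numbers are the items technicians most often look up, a listing held as text could be scanned for them. The following is a rough sketch only: it relies solely on the line prefixes named above (C, * adr, * test), it assumes a comment line is a C followed by a space or nothing, and the sample lines in the usage note are invented for illustration rather than taken from a real listing.

```python
import re

TEST_LINE = re.compile(r"\* test\s+(\d+)")


def scan_listing(lines):
    """Count comment and * adr lines and collect test numbers in order."""
    result = {"comments": 0, "adr": 0, "tests": []}
    for raw in lines:
        line = raw.strip()
        match = TEST_LINE.match(line)
        if match:
            result["tests"].append(int(match.group(1)))
        elif line.startswith("* adr"):
            result["adr"] += 1
        elif line == "C" or line.startswith("C "):
            # Assumption: comments are 'C' alone or 'C' plus a space.
            result["comments"] += 1
    return result
```

Given the invented lines `"C  prologue text"`, `"* adr 0000"`, and `"* test 5 0001"`, the function reports one comment, one * adr line, and the test number 5.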
Some long diagnostic listings subdivide the statement sequence into program units. Program units correspond to divisions of phases that serve explanatory rather than programming functions. Each program unit is preceded by a prologue that provides introductory information about the commands within the unit.
Using Diagnostics
IMS ring diagnostics serve three principal purposes: to confirm faults, to locate faults, and to verify repairs. When IMS software removes from service a node suspected of being faulty, it sometimes employs diagnostics to confirm and to locate the fault. After replacing or repairing equipment indicated as faulty, technicians employ diagnostics manually to verify that the fault has been corrected before returning the node to service. Because conditional restoral requests of ARR always include the TLP option, technicians usually have no need to manually diagnose a node in order to confirm or locate its fault. Instead, they should consult the diagnostic results on the ROP that were generated by ARR's restoral attempt. If, however, a restoral attempt fails for nondiagnostic reasons, technicians will ordinarily need to run diagnostics on the node before performing maintenance on it.
13
These commands may conform either to the Program Documentation Standards (PDS), except that terminal exclamation marks are supplied automatically by software, or to the Man-Machine Interface Language (MML). Technicians should select one or the other of these message conventions by setting the RTR ECD spooler flag to PDS or MML. For an explanation of the PDS input-message format, consult 3B21D Computer, UNIX RTR Operating System, Input Message Manual, PDS, ``Section 2, User Guidelines.'' For a complete description of PDS, consult the Bell Laboratories Program Documentation Standards Reference Manual. For an explanation of the MML input-message format, consult 3B21D Computer, UNIX RTR Operating System, Input Message Manual, MML, ``Section 2, User Guidelines.'' For a complete description of MML, consult the CCITT MML Recommendations (Z.301-Z.341), which are available from OMNICOM, Inc., Vienna, Virginia. To set the spooler flag, see the layout for the ECD splrinfo form in the RTR Operating System, Recent Change and Verify Manual for the 3B21D Computer.
If any phases below 40 are specified, DGN:NODE behaves as above except that it attempts to run only the specified phases. If only phases above 39 are specified, DGN:NODE runs the phases on the node after quarantining it (if it was not already quarantined). If a node was active or quarantined prior to the request for diagnostics, DGN:NODE attempts to quarantine it after diagnostics have completed. If a node was in another state, DGN:NODE leaves the node in the state in which it found it, provided that diagnostic results do not require a different state. (Technicians would ordinarily return a quarantined node that had passed diagnostics to service by unconditionally restoring it.) Before entering DGN:NODE for an active node with an active external user interface, remove from service the communication link or links that terminate in the node.

RST:NODE
Entered unconditionally for an out-of-service node that is not sandwiched in isolation between nodes with faulty ring interfaces, RST:NODE unisolates and/or unquarantines the node (thus placing it in the active ring), downloads operational code into it, places the code in execution, then changes the major state of the node to active. If the node is sandwiched in isolation, RST:NODE entered unconditionally leaves the node isolated, while placing it under ARR control so that it will be automatically restored when ring conditions permit.

Entered conditionally, RST:NODE completes the same actions as DGN:NODE with no phases specified, then restores the node, provided that it passes diagnostics and is not sandwiched in isolation. If it is sandwiched in isolation, RST:NODE leaves it isolated while placing it under ARR control so that it will be automatically restored when ring conditions permit. If a node fails diagnostics, RST:NODE leaves it isolated, if its ring-interface state is FLTY, or quarantines it, if its ring-interface state is USBL or QUSBL and it is not sandwiched in an isolation.

If the RST:NODE command is followed by a resource failure that prevents downloading or executing code, a REPT IUN RST output message with failure code 43 will appear on the ROP. When this occurs, technicians should wait a few minutes and try the restoral again.

Before entering RST:NODE conditionally for an active node with an active external user interface, remove from service the communication link or links that terminate in the node.
After entering RST:NODE for a node whose communication link has been manually removed from service, it may be necessary to manually return the communication link to service.

OP:RING
Produces an OP RING output message concerning the status or generic identity of specified nodes, groups of nodes, or of the ring.

CFR:RING
1. Isolates or attempts to end the isolation of specified nodes, or
2. initializes the ring if it is down.

Because the DGN and RST commands provide automatic isolation and unisolation of nodes under most conditions, this command is rarely used. The command is intended primarily for use in the first sense when growing and degrowing nodes and in the second sense when a new ring is being installed under Manual Ring Mode, which is explained below. In daily operations, the first version of the command might be used with the EXCLUDE option to isolate a node whose ring-interface state is quarantine-usable prior to changing the ring-interface or IRN circuit pack. With the MOVFLT option, the first version of the command can be used to shift an isolation on a ring that is too small for the isolation to be extended. Before the EXCLUDE version of the CFR command is entered for an active node, the node must be removed from service with the RMV:NODE command.

Tables providing brief descriptions of commonly used versions of IMS output messages appear in Chapter 5, Ring Critical Events.
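The outcomes described above for the RST:NODE command can be collected into two small functions. This is a sketch of the stated rules, not of IMS behavior in general: the function names and result strings are illustrative, and the one combination the text leaves open (a sandwiched node with a usable ring interface that fails diagnostics) is reported as not stated rather than guessed.

```python
def unconditional_restore_outcome(sandwiched):
    """Result of RST:NODE entered unconditionally for an OOS node."""
    if sandwiched:
        return "isolated, under ARR control"
    return "active"


def conditional_restore_outcome(passed_diagnostics, sandwiched, ri_state):
    """Result of RST:NODE entered conditionally.

    ri_state is the ring-interface minor state: USBL, QUSBL, or FLTY.
    """
    if passed_diagnostics:
        if sandwiched:
            return "isolated, under ARR control"
        return "active"
    if ri_state == "FLTY":
        return "isolated"
    if ri_state in ("USBL", "QUSBL") and not sandwiched:
        return "quarantined"
    return "not stated in the text"
```

A node that fails diagnostics with a ring-interface state of QUSBL and is not sandwiched is therefore reported as quarantined, matching the last rule in the RST:NODE description.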
maintenance is manually initiated, and one when, these procedures failing to clear a problem, it becomes necessary to consult diagnostic listings. The information provided by these three procedures is entirely sufficient for the maintenance of nodes that are quarantined. Maintenance of isolated nodes, however, involves these issues and others as well. The section ends with procedures for dissolving isolations. One is concerned with single-node isolations; one is concerned with multiple-node isolations; and one, to be used in conjunction with the other two, is concerned with the problems associated with a fault in a BISO or EISO node.
The following table describes the various LED indications. Nodes should be isolated before having any part of their backplanes repaired.

Table 3-6. Circuit Pack LED States

Node Type   Circuit-Pack Type   State                                                      Indication
any         auxiliary           quarantined or isolated                                    RQ LED red
VLSI        IRN                 isolated                                                   NT LED red
any         IFB                 isolate the adjacent node in the same unit as the IFB CP   NT LED red
NOTE: Before pulling any circuit pack in units not equipped with a connector assembly, isolate all nodes serviced by the power supply associated with the connector assembly. In 3-node units, the connector assembly is located at the rear of the backplane at the RI 1 position in the two external nodes and is associated with the nearest power supply. In 2-node units, the connector assembly is located at the rear of the backplane at the RI 1 position in both nodes and is associated with the nearest power supply. In 8-node units, the connector assembly is located at the back of each power supply and is associated with that power supply.

4. Replace the first circuit pack on the list, then proceed as follows:
- If you replaced a ring-interface, a node-processor, or an IRN circuit pack in any node-type other than an RPCN, restore the node conditionally with the RST:NODEa,b command.
- If you replaced any circuit pack in an RPCN other than the DDSBS circuit pack, restore the node conditionally with the RST:RPCNa,b command.
- If you replaced the DDSBS circuit pack of an RPCN, first run all automatic diagnostic phases with the DGN:RPCN command. If the automatic phases pass, next run optional diagnostic phase 14 with the command DGN:RPCNa,b:PH 14,CU c where c is 0 or 1, indicating the off-line control unit of the 3B21D. If the DDSBS circuit pack passed both optional and automatic diagnostic phases, restore the node to service unconditionally using the RST:RPCNa,b;UCL command.
- If you replaced an auxiliary circuit pack of any node other than an RPCN or CDN-I, enter the command DGN:NODEa,b:PHc where c is the range of phases that test the circuit pack you replaced. If the unit passes all specified diagnostic phases, restore the node unconditionally with the RST:NODEa,b;UCL command.
- If you replaced the DDSBS circuit pack of a DLN, first run all automatic diagnostic phases with the DGN:NODEa,b command. If the automatic phases pass, next run optional diagnostic phase 34 with the command DGN:NODEa,b:PH 34,CU c where c is 0 or 1, indicating the off-line control unit. If the DDSBS circuit pack passed both optional and automatic diagnostic phases, restore the node to service unconditionally using the RST:NODEa,b;UCL command.
- Consult the section ``Ring Application Processor Critical Maintenance Procedure'' for instructions on diagnosing and changing auxiliary circuit packs on a CDN-I.
- If you isolated an RPCN in order to replace an interframe buffer, restore the node conditionally with the RST:RPCNa,b command.
- If you isolated any other node-type in order to replace an interframe buffer, run diagnostic phases 1 through 13 with the DGN:NODEa,b:PH 1-13 command and, if the phases pass, restore the node unconditionally.
- If you permanently removed an interframe buffer or substituted a buffer with different capacity, change the ECD HV field to reflect the change before restoring the node.
5. If the list of suspect circuit packs contained more than one entry and the node failed to pass diagnostics after the first listed pack was replaced, reinstall the original pack, replace the next pack on the list, then repeat the applicable portion of steps 4 and 5 above. Continue in this fashion until either the node passes the specified diagnostic tests or all circuit packs on the list have been replaced and tested. (If the node you are troubleshooting is critically important or contributing to a multiple isolation, you may wish to replace simultaneously all its circuit packs and then, at another time, reinstall the original packs and test them individually to determine which pack was at fault.)

6. If you replaced all circuit packs without the node passing diagnostics, visually inspect the node and its housing. Look for unseated circuit packs, backplane damage, poor grounding connections, and unseated cable connections. Before repairing the backplane, isolate the node.

7. If the backplane is not at fault, consult the sections below on isolations and troubleshooting.
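The replace-and-test cycle of steps 4 and 5 can be pictured as a simple loop over the suspect list. The sketch below is illustrative only: the three callbacks are hypothetical stand-ins for the physical replacement, the reinstallation of the original pack, and the diagnostic or restoral command appropriate to the pack, none of which the code performs itself.

```python
def clear_fault(suspects, replace, reinstall, passes_diagnostics):
    """Work through the suspect list one pack at a time.

    Returns the pack whose replacement let the node pass diagnostics,
    or None if every pack was replaced and tested without success
    (in which case steps 6 and 7 apply).
    """
    previous = None
    for pack in suspects:
        if previous is not None:
            # The earlier replacement did not help: put the original back.
            reinstall(previous)
        replace(pack)
        if passes_diagnostics(pack):
            return pack
        previous = pack
    return None
```

The callbacks keep the sketch independent of any particular node-type; a caller might pass functions that simply prompt the technician and record the answer.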
IMS circuit packs are designed to be replaced while the power supply to the node is on.

1. Before entering an RMV, DGN, conditional RST, or CFR:RING,NODExx yy;EXCLUDE command for an active node with an active external user interface, remove from service the communication link or links that terminate in the node. After entering an RST command for a node whose communication link was manually removed from service, it may be necessary to manually return the communication link to service.

2. Before manually initiating maintenance on a circuit pack or interframe buffer, remove the resident or associated node from service. See Table 3-6. Before replacing a power supply circuit pack in a 3-node unit, isolate the two nodes adjacent to the power supply. In a 2-node unit, isolate the node adjacent to the power supply. In an 8-node unit, isolate the four nodes adjacent to the power supply. In a 5-node unit, learn from the unit horizontal designation strip next to the power supply in question the nodes serviced by the power supply, and isolate either three or two nodes. Nodes should be isolated before having any part of their backplanes repaired.

3. To quarantine a node, remove it from service with the RMV:NODEa b command. This action has the effect of changing the maintenance mode of the node to manual, thus preventing ARR from attempting to restore it.

4. To isolate a node, first remove it from service with the RMV:NODEa b command, and then isolate it with the CFR:RING,NODExx yy;EXCLUDE command. This also has the effect of changing the maintenance mode to manual.

5. If a quarantined or isolated node has not had a circuit pack replaced or reset, it may be restored to service unconditionally.

6. If an isolated node has not had a circuit pack replaced but has been powered down or had a circuit pack reset, run diagnostic phases 1 and 2 on it with the DGN:NODEa,b:PH 1-2 command. If it passes, it may be restored to service unconditionally.

7. If a node has had a circuit pack replaced, observe the guidelines set forth in the fifth step of the procedure ``Clearing Faults in Response to ARR Actions.''
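Steps 2 and 5 through 7 above amount to a lookup and a short decision rule, sketched below. The names and result strings are illustrative assumptions. The 5-node case is left as None because the count (two or three) depends on the unit's designation strip, and a quarantined node whose circuit pack was reset is reported as not covered because the steps do not address it.

```python
# Nodes to isolate before replacing a power supply, keyed by unit size.
# None marks the 5-node unit, where the designation strip decides.
NODES_TO_ISOLATE = {2: 1, 3: 2, 8: 4, 5: None}


def after_maintenance(state, pack_replaced, pack_reset, powered_down):
    """Suggest the follow-up for a quarantined or isolated node."""
    if state not in ("quarantined", "isolated"):
        raise ValueError("state must be 'quarantined' or 'isolated'")
    if pack_replaced:
        # Step 7: defer to the replaced-pack guidelines.
        return "follow the guidelines in Clearing Faults in Response to ARR Actions"
    if state == "isolated" and (pack_reset or powered_down):
        # Step 6: clear the RAC status registers first.
        return "run DGN:NODEa,b:PH 1-2; if it passes, restore unconditionally"
    if not pack_reset:
        # Step 5: nothing was replaced or reset.
        return "restore unconditionally"
    return "not covered by the listed steps"
```

Step 6 is checked before step 5 so that an isolated node that was only powered down still has phases 1 and 2 run on it.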
isolated-ring segment. Messages destined for the isolated segment are read from the active ring by the active-ring RAC, then transmitted by the node processor to the isolated-ring RAC, which writes them to the isolated segment of the ring. A fault in the isolated-ring RAC of either the BISO or the EISO node might go undetected, since it would not affect the transportation of messages on the active ring and could show up misleadingly as a diagnostic failure in the isolated node. Therefore, technicians who find that they cannot clear a fault that appears to reside in the isolated node should extend the isolation to include the current BISO and EISO nodes and run diagnostics again.
Low-Phase Ambiguity
The other reason for extending isolations concerns the ambiguity that IMS experiences in detecting certain ring-related faults. Faults that prevent the propagation of messages on the ring usually produce phase-1 and phase-2 diagnostic failures. In the case of such failures, IMS is often unable to decide in which of two adjacent RACs a fault resides. Because this problem is associated entirely with the parts of node hardware tested by diagnostic phases 1 and 2, this document calls it ``low-phase ambiguity.'' Low-phase ambiguity does not usually result in the isolation of two nodes because, while one suspect RAC is isolated, the other suspect RAC may be included in the isolated segment as the isolated RAC of the BISO or EISO node. The following figure illustrates the ring structure that permits this practice:
[Figure: ring nodes, each shown with a RAC 0 and a RAC 1]

Figure 3-3. Isolated RACs of BISO and EISO Nodes

Notice that either RAC 1 of the BISO node or RAC 0 of the EISO node could be included in the isolated segment as a suspect RAC. IMS has difficulty acknowledging by customary means the fact that it has included possibly faulty RACs in BISO or EISO nodes. A BISO or EISO node, being in the active ring, cannot have its ring interface marked faulty. Therefore, if a RAC of such a node is suspect, this fact will not be indicated in the minor state of the node nor in the TLP information. It will, however, be reflected in tests 5 and 10 of the ROP failure data for diagnostic phases 1 or 2, provided that the RAW option of the
DGN command has been specified. (ARR does not specify the RAW option, so the automatically output DGN failure data does not contain this information in full. It does, however, contain failing test 5, which is a sure indication that low-phase ambiguity exists.) The maintenance principle dictated by low-phase ambiguity is represented in the following procedure:
Ignore everything except the mismatch data for tests 005 and 010. If either test 005 or test 010 appears in the DGN output message, the other will appear also, provided that the RAW option to the DGN command has been specified. These tests will always identify two nodes as possibly faulty.

4. Using the physical node-address table in the reference chapter of this document, translate the hexadecimal mismatch data for test numbers 005 and 010 into the node names of two nodes. For example, in the above DGN output message, 00000E01 translates into IUN32 1 and 00000E02 translates into IUN32 2. These are the nodes suspected by IMS of being faulty. In the case of single-node isolations, one of the suspect nodes will be the isolated node and the other will be the BISO or EISO node, the suspect component of which will be RAC 1 of the former or RAC 0 of the latter.

5. When one suspect node is an EISO or BISO node, manually remove its communication link (if it has an active one) from service, then remove the node from service with the RMV:NODEa b command, thus extending the isolation to include the suspect node in the isolated segment.

6. Perform maintenance on the newly isolated node.

Low-phase ambiguity has bearing on the procedures for treating single- and multiple-node isolations. The procedures concerning isolations that follow are merely recommended. When circumstances, reason, or user practices dictate acting differently, do so. The procedures are not self-sufficient but build upon the three procedures discussed above for clearing faults in nodes.

The order of battle in these procedures is this: first, perform maintenance on suspect nodes within the isolated segment. If this fails to dissolve the isolation, next check to see if the isolated RAC of an EISO or BISO node is suspected of being faulty. If so, perform maintenance on it after including it in the isolation. Finally, if no isolated RAC in the EISO or BISO node is suspected of being faulty, extend the isolation to include the BISO and EISO nodes, one at a time, and run diagnostics again on the chance that a fault in one of their isolated RACs is being misread by diagnostic code.
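The translation in step 4 is a table lookup, not a calculation, so a sketch of it needs the physical node-address table from the reference chapter. The dictionary below contains only the two entries quoted in the example above; every other entry would have to be copied from that table. The function names are illustrative.

```python
# Only the two entries quoted in the text; the full table is in the
# reference chapter and is not reproduced here.
NODE_ADDRESS_TABLE = {
    "00000E01": "IUN32 1",
    "00000E02": "IUN32 2",
}


def translate(mismatch):
    """Translate hexadecimal mismatch data into a node name."""
    key = mismatch.strip().upper()
    if key not in NODE_ADDRESS_TABLE:
        raise KeyError(f"{key} is not in the address table")
    return NODE_ADDRESS_TABLE[key]


def node_to_add(test5_data, test10_data, isolated_node):
    """Return the suspect node that is not the isolated node.

    For a single-node isolation this is the BISO or EISO node to bring
    into the isolated segment. None means both tests named the
    isolated node itself.
    """
    suspects = {translate(test5_data), translate(test10_data)}
    if isolated_node not in suspects:
        raise ValueError("isolated node is not among the suspects")
    others = suspects - {isolated_node}
    return others.pop() if others else None
```

With the example data, `node_to_add("00000E01", "00000E02", "IUN32 1")` names IUN32 2 as the node to remove from service in step 5.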
If test 5 of a phase-1 or phase-2 failure is indicated, verify your repair using the DGN command with the RAW option specified, thereby learning, when the isolated node still fails diagnostics, whether the isolated RAC of the BISO or EISO node is also suspected by IMS of being faulty.
[Diagram: BISO Node, Isolated Node, EISO Node]
4. If the procedure that you employed on the isolated node in step 3 failed to end the isolation and test 5 and test 10 of a phase-1 and/or phase-2 failure are indicated, extend the isolation to include the BISO or EISO node identified by the mismatch data for test 10. Use the command RMV:NODEa, b, where NODE is the node name of the node identified by test 10 mismatch data. On small rings you may have to shift, rather than extend, the isolation by employing the MOVFLT option of the CFR:RING command. (If the BISO or EISO node has an active communication link, remove the link from service before removing the node.)

5. Follow the procedure ``Clearing Faults in Response to ARR Actions'' for the newly isolated node.

6. If:
a. the procedure that you employed on the isolated node in step 3 failed to end the isolation
b. and test 5 of a phase-1 and/or phase-2 failure is not indicated,

extend the isolation to include the BISO node with the command RMV:NODEa, b, where NODE is the BISO node. On small rings you may have to shift, rather than extend, the isolation by employing the MOVFLT option of the CFR:RING command. (If the BISO node has an active communication link, remove the link from service before removing the node.)
[Diagram: BISO Node, EISO Node]
7. With the former BISO node now in the isolated segment, again diagnose the originally isolated node.

8. If the originally isolated node now passes diagnostics,
a. diagnose the former BISO node and, if it fails, perform maintenance on it following the TLP instructions
b. but if it passes, change its ring-interface and node-processor circuit pack(s), then conditionally restore it to service.

- If the former BISO node now enters the active ring (thereby dissolving the isolation), unconditionally restore the originally isolated node (which should now have become quarantined) to service, and end this procedure.

9. But if the originally isolated node still fails diagnostics after the former BISO node has been included in the isolated segment, reduce the isolation by unconditionally restoring the former BISO node, thereby making it once again the BISO node. (You may have to manually return its communication link to service.)

10. Extend the isolation in the other direction to include the EISO node, and treat the former EISO node as you did the former BISO node above.
[Figure: isolated segment with the BISO node and the EISO node labeled.]
Ring Maintenance
11. If the originally isolated node still fails diagnostics after the isolation has been extended in both directions, or if the isolation repeatedly dissolves and returns, attempt any appropriate procedures described in the section below on troubleshooting. Then, if the isolation still persists, call the CTS.
[Figure: isolated segment with the BISO node and the EISO node labeled.]
3. If you are on-site, confirm that the nodes in question are indeed isolated by checking their NT LEDs.
4. Choose to begin working on either the isolated node next to the BISO node or the isolated node next to the EISO node. Base your choice on the following considerations in the order shown:
a. If diagnostic failure data is given for only one of the two nodes, begin with the node for which you have failure data.
b. If failure data is given for both nodes, begin at the end of the isolation that includes the nodes most important to your operation.
5. For the node you have chosen, follow the procedure ``Clearing Faults in Response to ARR Actions.'' If test 5 of a phase-1 or phase-2 failure is indicated for this node, verify your repair of the node using the DGN command with the RAW option specified, thereby learning, when the isolated node still fails diagnostics, whether the isolated RAC of the adjacent BISO or EISO node is also suspected by IMS of being faulty.
6. If the procedure clears the fault of the isolated node next to the BISO or EISO node, the ring should now contain only a singly-isolated node, since both the repaired node and the innocent victim nodes will have returned to the active ring. (An exception to this statement occurs when the isolated segment contains three faulty nodes. In this case, restoring one of the external faulty nodes will result in a smaller multiple isolation. If this occurs, return to the beginning of this procedure and repeat the steps up to here, then continue on.) Treat the singly-isolated node according to the procedure for ``Responding to Single-Node Isolations,'' and end this procedure.
7. If, however, the procedure that you employed failed to reduce the isolation and test 5 and test 10 of a phase-1 and/or phase-2 diagnostic failure are indicated, extend the isolation to include the BISO or EISO node identified by the mismatch data for test 10. Use the command RMV:NODEa, b, where NODE is the name of the node identified by test 10 mismatch data. On small rings you may have to shift, rather than extend, the isolation by employing the MOVFLT option of the CFR:RING command. (If the BISO or EISO node has an active communication link, remove the link from service before removing the node.)
8. For the newly isolated node, follow the procedure ``Clearing Faults in Response to ARR Actions.''
9. If the procedure clears the fault of the newly isolated node, the ring should now contain only a singly-isolated node, since the repaired node, the isolated node next to the original BISO or EISO node, and the innocent victim nodes will have returned to the active ring. (An exception to this statement occurs when the isolated segment contains three faulty nodes. In this case, restoring one of the external faulty nodes will result in a smaller multiple isolation. If this occurs, return to the beginning of this procedure and repeat the steps.) Treat the singly-isolated node according to the procedure for ``Responding to Single-Node Isolations,'' and end this procedure.
10. If the previous step of this procedure fails to reduce the isolation, or if test 5 and test 10 of a phase-1 and/or phase-2 diagnostic failure were not indicated after failure in Step 5 above, go to the other end of the isolated segment and repeat Steps 5 through 9 there.
11. If these steps fail to reduce the isolation, extend the isolation to include either the EISO or BISO node (if one has already been extended, choose the other; if neither has been extended, choose either) with the command RMV:NODEa, b, where NODE is the EISO or BISO node. (If the EISO or BISO node has an active communication link, remove the link from service before removing the node.)
12. With the former EISO or BISO node now in the isolated segment, diagnose the isolated node next to the former EISO or BISO node. If that isolated node now passes diagnostics, change the ring-interface and node-processor circuit pack(s) of the former EISO or BISO node, then conditionally restore the former EISO or BISO node to service.
[Figures: two diagrams of the isolated segment with the BISO node and the EISO node labeled.]
13. If the former EISO or BISO node enters the active ring (thereby reducing the isolation), treat the remaining isolation according to the procedure for single-node isolations.
14. If, however, the isolated node next to the former EISO or BISO node still fails diagnostics, unconditionally restore the former EISO or BISO node to the active ring. (If you manually removed its communication link from service, you may have to manually return it to service.) Then extend the isolation at the other end of the isolated segment (unless you have done so previously), and treat that end in the same way you have treated this end.
15. If both originally faulty nodes still fail diagnostics after the isolation has been extended in both directions, or if the isolation returns after nodes have been restored, follow any appropriate procedures described below in the section on troubleshooting. Then if the problem still persists, call the CTS.
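The branching in Steps 4 through 10 of this procedure can be summarized as a small decision sketch. The function and argument names below are hypothetical and do not correspond to any IMS command. The fallback used when neither end has failure data is an assumption, since the procedure does not cover that case.

```python
def choose_starting_end(biso_end_has_data, eiso_end_has_data, more_important_end):
    """Step 4: pick the external isolated node to work on first.

    more_important_end is "BISO" or "EISO", the technician's judgment of
    which end holds the nodes most important to the operation.
    """
    if biso_end_has_data != eiso_end_has_data:
        # Step 4a: failure data exists for only one end, so start there.
        return "BISO" if biso_end_has_data else "EISO"
    # Step 4b (data for both ends). The same fallback is ASSUMED here when
    # neither end has data; the procedure itself is silent on that case.
    return more_important_end


def next_action(fault_cleared, test5_indicated, test10_indicated):
    """Steps 6, 7, and 10: what to do after the repair attempt on one end."""
    if fault_cleared:
        return "treat the remaining isolation as a single-node isolation"
    if test5_indicated and test10_indicated:
        return "extend the isolation to the node named by test 10 mismatch data"
    return "repeat Steps 5 through 9 at the other end of the isolated segment"


if __name__ == "__main__":
    print(choose_starting_end(False, True, "BISO"))
    print(next_action(False, True, True))
```

The sketch only records the order of the decisions; the maintenance actions themselves remain manual.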
If so, are they all external nodes (adjacent to the BISO or EISO nodes) within the portion of the ring likely to become the isolated segment, or is one of them an internal node within that portion? If not, are they innocent victim nodes within the candidate for the isolated segment?
5. If nodes adjacent to padded interframe buffers are faulty and one of them is likely to be an external node in an isolated segment, replace (if you are in an emergency situation) the ring-interface and node-processor circuit pack(s) on both nodes adjacent to the interframe buffers and replace both interframe buffers. Then initialize the ring at level 4.
6. If nodes adjacent to padded interframe buffers are internal nodes (either faulty or innocent-victim) in the candidate for the isolated segment, approach the problem following the procedure described above for responding to multiple isolations (though of course under ring down conditions you will not be able to conduct diagnostics). Then, if a node adjacent to padded interframe buffers becomes a probable external node in a candidate for the isolated segment, treat it as in Step 5 above.
7. Study the MOVFLT option of the CFR:RING command. It may be useful in resolving an isolation on a very small ring.
8. If none of the above approaches succeeds in recovering the ring, force faults by unseating various ring circuit packs and initializing at level 4. This is a last-resort, trial-and-error attempt to force an isolation in the hope of getting the ring up. Once the ring is up, diagnostics can be run on the isolated portion.
2. Set the ECD Manual Ring Mode flag as described in the above reference. IMS is programmed to abort if, during initialization, the ring fails to come up. The ECD manual ring mode flag inhibits this response.
3. If you are employing manual ring mode for a new installation, or if you are experiencing ring down and no RPCNs are in the standby state, restore as many RPCNs as possible. When RPCNs are restored with the ring down, they will be in the STBY, not the ACT, state. This state is expected and sufficient for moving on to Step 4.
4. Enter the command CFR:RING
5. Expect to receive a form of the REPT RING INIT message indicating that the initialization was or was not successful and a CFR RING COMP message indicating that the program has completed. Forms of the REPT RING FLT message may also appear to identify nodes that failed to participate in the initialization.
6. If the initialization was successful, reset the manual ring mode flag to null.
7. If the initialization was not successful, leave the ECD flag set for manual ring mode and use the information you gained in Step 5 to troubleshoot the ring in the manner described in ``Responding to Ring Down.''
on each RAP power control interface and display (PCID) board allow them to control these functions locally. Thus RAP initialization and diagnostics may be run centrally by the host or locally by means of PCID-board switches. A RAP failure will usually be tested initially by central diagnostics at the request of ARR, and ROP output will indicate the phases that failed and the circuit pack(s) suspected of being faulty. The procedure described below for fully diagnosing a RAP fault begins by tentatively accepting the results of the automatic diagnostics and then proceeds to confirm them. (Notice in the procedure the requirement that a CDN be quarantined when its RAP circuit packs are diagnosed or replaced.)
- The node processor interface (NPI) circuit pack.
- The central controller support (CCS) circuit pack.
- The central controller cache (CCC) circuit pack.
- All equipped main store controller (MASC) circuit packs.
When power is restored the LED of each pack should come on, go off, come back on, and finally go off; and this sequence of LED blinks should be completed for all packs within [18 + (2 × the number of MASA boards) ± 2] seconds for systems with the 2-Mbyte memory and within [18 + (20 × the number of MASA boards) ± 2] seconds for systems with the 16-Mbyte memory. If an LED fails to come on initially, turn off RAP power, replace the circuit pack, and repeat this step. If any LED fails to follow the full sequence of blinks, or if all LEDs fail to complete the sequence of blinks within the allotted time, go to Step 7 of this procedure.
5. This step manually diagnoses the node. The following information is helpful in understanding it: When diagnostics begin, the LED on each non-MASA circuit pack turns on and stays on until the pack has passed diagnostics. Moreover, diagnostics run on non-MASA packs early-terminate. Therefore, when a non-MASA pack fails
diagnostics, the diagnostic routine ends and the LEDs on the failed pack and on all non-MASA packs that have not yet been diagnosed stay on. MASA LEDs, on the other hand, may or may not come on when diagnostics begin, but they will come on if their circuit packs fail diagnostics. Moreover, MASA diagnostics do not early-terminate. Therefore, it is possible during a single diagnostic routine for a MASA pack to fail and for another pack (perhaps a non-MASA pack further downstream) to fail as well.
Depress the DIAG switch on the PCID board. All non-MASA LEDs should come on, then go off within 6 minutes for systems with the 2-Mbyte memory and within 4 minutes for systems with the 16-Mbyte memory. (If more than one MASC memory group is present, add 2 minutes and 40 seconds for each additional group.) If any LED fails to come on initially, turn off RAP power, replace the circuit pack, and repeat this step. If any LED fails to go off in the time indicated, turn off RAP power, replace the circuit pack, and repeat this step. If more than one LED fails to go off in the time indicated, turn off RAP power, replace the first circuit pack in the following list whose LED is on, and then repeat this step.
a. CCS
b. Memory group 0, that is, MASC_0 and all MASA packs associated with it. (MASC diagnostics depend upon memory from the first memory board, MASA_0, so a fault in one pack may under some circumstances cause the other to fail diagnostics. Therefore, if the situation here or elsewhere indicates that either of these related packs should be replaced but replacing it does not solve the problem, try reinstalling the original pack and replacing the other pack.)
c. CCC
d. Each additional equipped memory group in numerical order.
e. NPI
If, upon repetition, a replaced circuit pack fails to pass diagnostics, leave RAP power off, quarantine the node, and contact the CTS.
6. If Step 5 succeeded, unconditionally restore the node to service and end this procedure.
7. Systematically search for the fault that is preventing initialization by following Steps 7 through 23. Turn off RAP power. Reinstall the original circuit pack removed in Step 3.
8. Unplug the following circuit packs by opening their latches and pulling them out about one inch:
- The NPI pack
- All MASA packs in memory group 0 except MASA_0.
9. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
10. Turn off RAP power and replace the CCS pack.
11. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
12. Turn off RAP power. Reinstall the original CCS pack. Replace the CCC pack.
13. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
14. Turn off RAP power. Reinstall the original CCC pack. Replace the MASC pack.
15. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
16. Turn off RAP power. Reinstall the original MASC pack. Replace the MASA_0 pack.
17. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
18. Measure the voltage at each power converter (PWRB on the main unit and PWRC on the growth unit) from + pin 056 to gnd pin 032. If the voltage is below the +5.1 to +5.3 volt range, turn RAP power off and replace the appropriate converter.
19. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
20. Steps 20-23 attempt to identify a problem that is not associated with the failure of a circuit pack.
a. Turn off RAP power.
b. Reinstall the original MASA_0 pack.
c. Check the backplane for shorted pins.
d. Check growth unit cables and bus terminators for proper installation, adjusting as needed.
e. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, go to Step 24.
21. If the RAP is not equipped with a growth unit, go to Step 23. Otherwise, turn off RAP power and remove the basic-unit ends of the six growth cables, leaving them hanging free. Remove the six terminator resistors from the growth unit and place them in the positions formerly occupied by the basic-unit ends of the six growth cables.
22. Restore RAP power and observe the LED on the CCS pack. If it goes on, off, on, off in 33 to 43 seconds, the problem is in the growth-unit backplane. Go to Step 24.
23. Leave the node quarantined, call the CTS, and end this procedure.
24. Manually diagnose the node as follows:
a. Depress the PCID DIAG switch.
b. Check that the CCS, CCC, and MASC_0 LEDs come on.
c. Check that the CCS LED goes off in 25 to 35 seconds for systems with the 2-Mbyte memory and in 35 to 45 seconds for systems with the 16-Mbyte memory.
d. Check that the LEDs of the following circuit packs all go off in the order listed within 2 minutes for systems with the 2-Mbyte memory and within 75 seconds for systems with the 16-Mbyte memory.
1. MASA_0
2. MASC_0
3. CCC
Check that the yellow fail light on the PCID has gone out.
e. If the LED on any of the four circuit packs fails to go off on time or in the indicated sequence, or if the PCID fail light fails to go off, turn off RAP power, replace the faulted pack, turn on RAP power, and repeat this step. If the repetition is unsuccessful, leave the node quarantined and call the CTS.
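The timing limits quoted in this procedure can be worked out ahead of a site visit. The following is a minimal sketch, assuming that the bracketed power-up expression means 18 seconds plus a per-board allowance (2 seconds per MASA board for the 2-Mbyte memory, 20 seconds for the 16-Mbyte memory) with a tolerance of 2 seconds either way. The function names are illustrative only.

```python
def power_up_led_window(masa_boards, memory_mbytes):
    """Return (earliest, latest) seconds for the power-up LED blink sequence.

    ASSUMPTION: the bracketed expression in the procedure is read as
    18 + (per-board seconds x number of MASA boards), plus or minus 2.
    """
    per_board = {2: 2, 16: 20}[memory_mbytes]
    nominal = 18 + per_board * masa_boards
    return nominal - 2, nominal + 2


def manual_diag_limit_seconds(memory_mbytes, masc_groups):
    """Return the Step 5 limit for non-MASA LEDs to go off, in seconds.

    6 minutes (2-Mbyte) or 4 minutes (16-Mbyte), plus 2 minutes 40 seconds
    for each MASC memory group beyond the first.
    """
    base = {2: 6 * 60, 16: 4 * 60}[memory_mbytes]
    return base + (2 * 60 + 40) * (masc_groups - 1)


if __name__ == "__main__":
    print(power_up_led_window(masa_boards=1, memory_mbytes=2))
    print(manual_diag_limit_seconds(memory_mbytes=16, masc_groups=3))
```

A memory size other than 2 or 16 raises a KeyError, which is intended: the procedure gives figures only for those two configurations.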
usually locate them. But if they occur rarely or very irregularly, they may escape the IMS net. In such cases, manual records kept by technicians are the indispensable tool for identifying, finding, and correcting them.
How will an intermittent fault show up? In a ring interface or IRN node processor, an intermittent fault may appear in several guises: as repeated losses of token, as successful ring restarts following instances of blockage, as a node that EAR isolates but ARR returns to service because it passes diagnostics, as a node that ARR turns over to technicians because it has violated the fourth-time rule, or as a combination of these automatic responses. It could also appear as a repeated failure of EAR recovery level 3 to find a fault that levels 1 and 2 had attempted unsuccessfully to isolate. Again, the existence and history of faults of this kind are likely to be caught only in the manual records of technicians. On nodes suspected of having intermittent faults, perform the following checks:
- Visually inspect the node and its housing. Look for poorly seated circuit packs, backplane damage or improper grounding, and poorly seated cable connections.
- Run diagnostics on the node in the repeat mode. Tap on the front of the circuit packs and apply pressure to the backplane with your thumb in an effort to stress cracks and to stimulate an intermittent fault to recur.
- Move the circuit packs of a suspected node one-by-one to another location to see which hardware (if any) the intermittent failure follows. (Make sure you keep careful records of each move.)
IMS attempts to recover automatically from software faults. Thus no regular software maintenance is required of the Craft. Intermittent faults are more likely to be in hardware than in software. Nevertheless, when a troubled component consistently passes diagnostics, the fault could be in software.
Messages, including messages containing diagnostic code, are sent from the 3B21D to an isolated segment of the ring through the BISO or the EISO node. BISO and EISO nodes have one RAC participating in the active-ring segment and one RAC participating in the isolated-ring segment. Messages destined for the isolated segment are read from the active ring by the active-ring RAC, then transmitted by the node processor to the isolated-ring RAC, which writes them to the isolated segment of the ring. A fault in the isolated-ring RAC of either the BISO or the EISO node might go undetected, since it would not affect the transport of messages on the active ring and could show up misleadingly as a diagnostic failure in the isolated node, thereby creating the maintenance anomaly described above. Therefore, technicians who face this problem should consider extending the isolation to include the current BISO and EISO nodes and running diagnostics on them.
Unconditional Restorals
Do not unconditionally restore a node unless you are certain it is without faults. Even when you are certain, do not unconditionally restore a node that has been powered down, that contains a ring-interface circuit pack that has been reset, or that exists in isolation with a node that has had a ring-interface circuit pack reset without first running diagnostic phases 1 and 2 on it. When a node or a circuit pack has been powered down, the status registers of its ring-interface hardware may become improperly set, and an unconditional restoral of the node will likely result in a ring transport error and an isolation. Diagnostic phases 1 and 2 reset all ring-interface status registers to their proper positions.
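The restoral rule above combines several conditions, and a small checklist sketch may make the logic easier to follow. The class and field names are hypothetical and are not part of any IMS interface. The sketch encodes only the conditions stated in the paragraph.

```python
from dataclasses import dataclass


@dataclass
class NodeHistory:
    """Facts a technician would need to know about the node (hypothetical names)."""
    certain_fault_free: bool
    powered_down: bool = False
    ri_pack_reset: bool = False
    isolated_with_reset_neighbor: bool = False
    phases_1_2_run_since: bool = False


def may_restore_unconditionally(h):
    """Return True only when the rule in the text permits an unconditional restoral."""
    if not h.certain_fault_free:
        return False
    # Any of these events may leave ring-interface status registers improperly set.
    disturbed = h.powered_down or h.ri_pack_reset or h.isolated_with_reset_neighbor
    # Diagnostic phases 1 and 2 reset the registers, which lifts the restriction.
    return (not disturbed) or h.phases_1_2_run_since


if __name__ == "__main__":
    print(may_restore_unconditionally(NodeHistory(True, powered_down=True)))
    print(may_restore_unconditionally(
        NodeHistory(True, powered_down=True, phases_1_2_run_since=True)))
```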
Avoiding Trouble
Be careful not to leave the system unattended with ARR or CNR inhibited.
Recording Trouble
When troubleshooting a ring-related problem, frequently enter the OP:RING;DETD command as a way of providing, on the ROP output, sequential records of ring status. Such records may be useful during postmortems. If a problem is likely to be referred to developers at Bell Laboratories, save the current RPTERR0 and RPTERR1 log files in /etc/log. Keep records on all circuit pack replacements and failures.
Keep records on all indications of transient and intermittent faults, identifying, if possible, the locations where they occur. Remember that a transient fault may be an intermittent fault in its infancy.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
CMD FUNCTION
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR NORMAL CONFIGURATION REQUESTED 0 1 4 3600000..........................................(4030614766)
Announces the onset of a level-0 recovery attempt, stimulated by EAR's receipt of one or more error messages indicating a ring-related fault. The onset time of the attempt appears in milliseconds in parentheses on the bottom line. Other numbers on the bottom line pertain to the ring error threshold. The first digit indicates EAR's mode, where 0 = ``threshold not exceeded'' and 1 = ``threshold exceeded.'' The second digit identifies the number of ring errors that have occurred within the current threshold interval. The third digit is the user-specified number of errors per threshold interval that causes the threshold to be exceeded. And 3600000 is the user-specified threshold interval in milliseconds. When the second number equals the third, the threshold has been exceeded.
Announces a successful restart of the ring. Thus no manual response is required. 455 ms is the duration of ring silence resulting from the configuration attempt, and in parentheses are the times when the ring configuration job started and was completed.
REPT RING CFR RING CONFIGURATION ESTABLISHED (455 ms) NORMAL CONFIGURATION, NO NODES ISOLATED .................................(4030614777)(4030615120)
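When ROP output is reviewed after the fact, the threshold fields on the bottom line of the level-0 message can be picked apart mechanically. The following is a minimal sketch that assumes the bottom line always has the layout shown in the sample (four numbers, a run of dots, and the onset time in parentheses). The function name and dictionary keys are illustrative, and treating a count greater than the limit as also having reached it is an assumption.

```python
import re

# Four numeric fields, optional filler dots, then the onset time in parentheses.
_BOTTOM_LINE = re.compile(r"^\s*(\d)\s+(\d+)\s+(\d+)\s+(\d+)\s*\.*\s*\((\d+)\)\s*$")


def parse_threshold_line(line):
    """Split the bottom line of a REPT RING CFR LEVEL 0 message into fields."""
    m = _BOTTOM_LINE.match(line)
    if m is None:
        raise ValueError("line does not match the assumed layout")
    mode, errors, limit, interval_ms, onset_ms = (int(g) for g in m.groups())
    return {
        "threshold_exceeded": mode == 1,       # first digit: EAR's mode
        "errors_in_interval": errors,          # second digit
        "errors_allowed": limit,               # third digit (user-specified)
        "interval_minutes": interval_ms / 60000,
        "onset_ms": onset_ms,
        # The text says the threshold is exceeded when the count equals the
        # limit; ">=" is an ASSUMPTION that also covers any larger count.
        "count_reached_limit": errors >= limit,
    }


if __name__ == "__main__":
    sample = "0 1 4 3600000..........................................(4030614766)"
    print(parse_threshold_line(sample))
```

For the sample line, the sketch reports one error against a limit of four within a 60-minute interval.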
REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0 ....................................................................... ....................................................(4030614653)
IMS in the 3B21D received this and the following two ring transport error messages (at the times in parentheses) as a result of the fault that stimulated the above recovery attempt. This message (the first to arrive) identifies the error type and the node and RAC associated with the error. Notice that ring transport error messages appear on the ROP following the messages announcing the system response to the error. The fault spawned two instances of blockage, one from this, the second node upstream of the faulty node...
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 9 RAC 0 ....................................................................... .....................................................(4030614663)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0 ....................................................................... .....................................................(4030614667)
and one from this, the first node upstream of the faulty node. IUN 31 9 detected blockage before IUN 31 10 could drain the ring. IUN 31 10 must have detected blockage prior to IUN 31 9, but IUN 31 9's ring transport error report reached the 3B21D first.
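Because transport error reports are printed after the recovery messages they triggered, it can help to reorder them by the time in parentheses. The sketch below assumes each report fits the pattern seen in the samples above (error type, node, RAC, filler dots, time). The function name and the returned tuple layout are illustrative.

```python
import re

_REPORT = re.compile(
    r"REPT RING TRANSPORT ERR (?P<error>.+?) DETECTED[,.]\s*"
    r"(?P<node>[A-Z]+\d+ \d+) RAC (?P<rac>\d)[.\s]*\((?P<time>\d+)\)"
)


def order_transport_errors(lines):
    """Return (time_ms, node, rac, error) tuples sorted by arrival time."""
    reports = []
    for line in lines:
        m = _REPORT.search(line)
        if m:  # lines that do not fit the assumed pattern are skipped
            reports.append(
                (int(m["time"]), m["node"], int(m["rac"]), m["error"])
            )
    return sorted(reports)


if __name__ == "__main__":
    rop = [
        "REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0 .....(4030614667)",
        "REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0 ...(4030614653)",
        "REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 9 RAC 0 .....(4030614663)",
    ]
    for report in order_transport_errors(rop):
        print(report)
```

With the three sample reports, the parity/format error on IUN31 11 sorts first, followed by the blockage reports from IUN31 9 and IUN31 10.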
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR NORMAL CONFIGURATION REQUESTED .....................................................(4030772385) REPT RING CFR RING CONFIGURATION ATTEMPT FAILED 17 COULD NOT ESTABLISH A NORMAL RING CONFIGURATION ....................................................................... (4030772397)(4030772536) REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR ISOLATION FROM IUN31 11 TO IUN31 11 REQUESTED 0 2 4 3600000..................................(4030772561)
Prompted by a ring transport error report, EAR level-0 requests that the ring config module restart the ring.
The continuity test run by the ring config module failed, an indication that the fault is probably hard.
EAR level-1 requests that the ring config module isolate the node indicated as faulty by the ring transport error messages.
REPT RING CFR RING CONFIGURATION ESTABLISHED (658 MS) BISO NODE = IUN31 10, EISO NODE = IUN31 12 (4030772580)(4030772942) REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0. ................................................(4030772270) REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0. ................................................(4030772278) REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 9 RAC 0. ................................................(4030772282) REPT ARR AUTORST ARR COND RST FOR IUN31 11 STARTED
IUN31 11 is isolated with IUN31 10 acting as BISO node and IUN31 12 acting as EISO node.
ARR requests that MIRA conditionally restore the isolated node. This is ARR's check that the removal and isolation of the node was necessary. The attempt will generate diagnostic data that the technician should use if called upon to perform maintenance on the node.
RTR message announcing that ARR's restoral request is on the active queue and being processed.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
RTR message announcing that it could not remove IUN31 11 from service (because EAR had done so previously).
Indicates that during phase 1 diagnostics, some tests (nine in all) failed and none (X00000000 X00000000) were skipped. IUN31 11 is not necessarily the node in which phase 1 failed, but the node specified in ARR's diagnostic request. Since phases 1 and 2 test all RACs in the isolated segment, the fault that produces a phase 1 or 2 failure may not reside in the specified node. The failure of test 005 indicates that, in this instance, low-phase ambiguity exists; in other words, that both a RAC of the isolated node and a RAC of either the EISO or BISO node are suspected of being faulty. See the ``Low-Phase Ambiguity'' section in this chapter.
DGN IUN31 11 PH 1 STF (9 X00000000 X00000000) TEST 004........................................................... 005 X00000dfb................................................ 006........................................................... 008........................................................... 009...........................................................
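The phase-failure message above can be scanned for the test 005 indication when many such messages must be reviewed. The following is a sketch under stated assumptions: the header layout is taken from the sample only, the two hexadecimal words are treated as skipped-test indicators because the accompanying text associates them with skipped tests, and all names are hypothetical.

```python
import re

_HEADER = re.compile(
    r"DGN (?P<node>\S+ \d+) PH (?P<phase>\d+) STF "
    r"\((?P<failed>\d+) X(?P<w1>[0-9A-Fa-f]{8}) X(?P<w2>[0-9A-Fa-f]{8})\)"
)
# A test number is assumed to be exactly three digits standing on their own.
_TEST_NO = re.compile(r"(?<![0-9A-Za-z])(\d{3})(?![0-9A-Za-z])")


def parse_phase_failure(message):
    """Extract node, phase, failure count, and listed test numbers."""
    m = _HEADER.search(message)
    if m is None:
        raise ValueError("message does not match the assumed layout")
    tail = message[m.end():]
    tests = [int(t) for t in _TEST_NO.findall(tail)]
    return {
        "node": m["node"],
        "phase": int(m["phase"]),
        "tests_failed": int(m["failed"]),
        # ASSUMPTION: nonzero words mean some tests were skipped.
        "any_skipped": int(m["w1"], 16) != 0 or int(m["w2"], 16) != 0,
        "listed_tests": tests,
        # Test 005 failing is what the text ties to low-phase ambiguity.
        "low_phase_ambiguity": 5 in tests,
    }


if __name__ == "__main__":
    msg = ("DGN IUN31 11 PH 1 STF (9 X00000000 X00000000) TEST 004..... "
           "005 X00000dfb..... 006..... 008..... 009.....")
    print(parse_phase_failure(msg))
```

The listed tests may be fewer than the failure count in the header, as in the sample, so the sketch does not compare the two.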
DGN IUN31 11 PH 2 STF (10 X00000000 X00000000) TEST 002........................................................... 004........................................................... 005 X00000dfb................................................ 006........................................................... 007........................................................... DGN IUN31 11terminated at ph 2 stmnt 36 after test 17 ANALY:TLPFILE: IUN31 11 SUMMARY DATA MSG STARTED TLP: IUN31 11 PH=1.................................................... TLP: IUN31 11 PH=2....................................................
Phase-1 diagnostics test the isolated segment beginning at the BISO node, and phase-2 diagnostics test it beginning at the EISO node. In the case of single-node isolations, the two phases should report failure data for the same node(s), but in the case of multiple isolations they usually report failure data for different nodes.
Indicates the point in the diagnostic routine at which execution terminated. Summarizes diagnostic failure data. Phases cited are those that failed; but because phases 1 and 2 are at issue, IUN31 11 is not necessarily the location of the failure.
TLPFILE COMPLETED
DGN IUN 31 11 COMPLETED STF (19........................)
ANALY TLPFILE IUN31 11 TLPSRCH MSG IP TLPFILE #983090
ANALY TLPFILE IUN31 11 SUSPECT FLTY EQUIPMENT
CODE GRP MEM CONT POS WT NOTE
UN303 31 11 -- -- 10 -
CABLE ----- 10 3
Short form of this message. The longer form is next.
This data is printed only after a test fails and only if the TLP option was specified in the DGN command (as it always is by ARR). The entry lists in weighted (WT) order equipment suspected of being faulty. The WT is a number between 1 and 10. The higher the WT, the greater the likelihood of the equipment being faulty. Because ARR does not specify the RAW option of the DGN command, failure data for test 010 is not given. (See the ``Low-Phase Ambiguity'' section of this chapter.)
Because of diagnostic failure (error code 1).
RST IUN31 11 STOPPED 1 DGN IUN31 11 STF..............................................MSG COMPL REPT ARR AUTORST ARR COND RST FOR IUN 31 11 FAILED
Confirms that ARR's restoral request has failed. Many IMS processes write to the ROP, at times resulting in some redundancy.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAiAAAA
             4
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
The subnumber 4 under the i in the above output message indicates that the ring interface of IUN31 11 is faulty. The numbers used in this way have the following meanings:
1 = manual mode
2 = RI QUSBL or NP faulty or untested
3 = combination of 1 and 2
4 = RI faulty or untested
5 = combination of 1 and 4
6 = combination of 2 and 4
7 = combination of 1, 2, and 4
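The subnumber meanings listed above behave like a sum of three basic conditions (1, 2, and 4), so any value from 1 to 7 can be decoded by checking which of the three are present. A minimal sketch follows. The function name is illustrative, and the condition texts are copied from the list.

```python
# The three basic conditions; every other subnumber is a sum of these.
BASIC_CONDITIONS = {
    1: "manual mode",
    2: "RI QUSBL or NP faulty or untested",
    4: "RI faulty or untested",
}


def decode_subnumber(n):
    """Return the basic conditions that make up a status subnumber (1 to 7)."""
    if not 1 <= n <= 7:
        raise ValueError("subnumber must be between 1 and 7")
    return [text for bit, text in BASIC_CONDITIONS.items() if n & bit]


if __name__ == "__main__":
    for value in (4, 6, 7):
        print(value, decode_subnumber(value))
```

For example, subnumber 6 decodes to conditions 2 and 4, which matches the entry ``combination of 2 and 4'' in the list.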
OP:RING, IUN31 11
OP:RING IUN31 11 COMPL
IUN31 11: MJ = OOS; NM = MAN; RI = FLTY ; NP = USBL IN ISOL SEG
Manual input message.
Like the TLP and OP:RING;DETD outputs above, this data does not reflect the low-phase ambiguity.
Following the procedures ``Responding to Single Node Isolations'' and ``Clearing Faults in Response to ARR Actions,'' a technician replaces circuit pack UN303 in IUN 31 11... and conditionally restores the node.
RST:IUN31 11
RST IUN31 11 TASK 4 MSG STARTED
RMV IUN31 11 STOPPED 5
DGN IUN31 11 COMPLETED ATP MESSAGE IN PROGRESS
REPT RING CFR RING CONFIGURATION ESTABLISHED (338 ms) NORMAL CONFIGURATION, NO NODES ISOLATED (4031118365)(40311118740)
RST IUN31 11 COMPLETED
IUN31 11 has been returned to the active ring, pumped with operational code, and placed in execution.
Repaired IUN31 11 now passes diagnostics.
The isolation is dissolved automatically as IUN31 11 is restored.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
CMD> -- 1105 RING STATUS SUMMARY -
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
CMD FUNCTION
400 OP RING DETAILED
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR NORMAL CONFIGURATION REQUESTED. 0 3 4 3600000................(4031349825)
REPT RING CFR RING CONFIGURATION ATTEMPT FAILED 17 COULD NOT ESTABLISH A NORMAL RING CONFIGURATION ..................................................... (4031349837)(4031350005)
REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR ISOLATION FROM IUN31 11 TO IUN31 11 REQUESTED. 0 3 4 3600000.................(4031350030)
REPT RING CFR RING CONFIGURATION ESTABLISHED (695 ms) BISO NODE = IUN31 10, EISO NODE = IUN31 12 (4031350049)(4031350422)
REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED. IUN31 11 RAC 0. ........................................(4031349712)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 9 RAC 0. ........................................(4031349722)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0. ........................................(4031349727)
RST IUN31 11 TASK 5 MSG STARTED
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
DGN IUN31 11 COMPLETED ATP MESSAGE IN PROGRESS
REPT RING CFR RING CONFIGURATION ESTABLISHED (338 ms) NORMAL CONFIGURATION, NO NODES ISOLATED (4031519404)(4031519780)
RST IUN31 11 COMPLETED
DGN IUN31 11 ATP MESSAGE COMPLETE
REPT ARR AUTORST ARR COND RST FOR IUN31 11 SUCCEEDED
OP:RING;DETD
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
Figure 3-4. [Ring diagram not reproduced. Labeled elements: RPCN00 0, IUN32 1, and RPCN32 0, each with RAC 0 and RAC 1; the BISO node, the isolated node, and the EISO node; the isolated ring and the active ring; the padded interframe buffers.]
In response to an ambiguous ring-interface failure associated with either RAC 0 in RPCN32 0 or RAC 0 in IUN32 1, IMS would configure the ring as in the structure illustrated. If, in such a ring, performing maintenance on RAC 0 in RPCN32 0 failed to clear the fault, the next procedural step, extending the isolation to include IUN32 1 in order to perform maintenance on it (see ``Guideline to Single-Node Isolations'' in this chapter), would bring the ring down, since both pairs of padded interframe buffers would then be included in the isolated segment. A version of the CFR command is designed especially for handling this dilemma. CFR:RING,NODEa,b;MOVFLT moves the indication of a faulty ring interface from the currently isolated node to the node identified as NODEa,b in the command. It also causes the isolation to shift so that NODEa,b becomes the newly isolated node and the formerly isolated node becomes the BISO or EISO node, as in the following illustration, which was created by the command CFR:RING,LN32,1;MOVFLT:
Figure 3-5. [Ring diagram not reproduced. Labeled elements: RPCN00 0, IUN32 1, and RPCN32 0, each with RAC 0 and RAC 1; the EISO node, the isolated node, and the BISO node; the isolated ring and the active ring; the padded interframe buffers.]
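The shift performed by the MOVFLT option can be pictured with a small model. The following is only a sketch under stated assumptions, not a description of how IMS implements the command: the four-node ring order is inferred from the BISO and EISO nodes reported in the ROP output of the example that follows in this section, and the function names `boundaries` and `move_fault` are hypothetical.

```python
from typing import List, Tuple

# Assumed ring order (inferred from the BISO/EISO reports in this section):
# traversal runs BISO -> isolated node -> EISO and wraps at the end of the list.
RING: List[str] = ["RPCN00 0", "IUN00 1", "RPCN32 0", "IUN32 1"]


def boundaries(ring: List[str], first: str, last: str) -> Tuple[str, str]:
    """Return (BISO, EISO) for an isolated segment running from first to last."""
    n = len(ring)
    biso = ring[(ring.index(first) - 1) % n]
    eiso = ring[(ring.index(last) + 1) % n]
    return biso, eiso


def move_fault(ring: List[str], isolated: str, target: str) -> Tuple[str, str, str]:
    """Model the effect of MOVFLT on a single-node isolation.

    Returns (new isolated node, new BISO, new EISO). The model insists that
    the formerly isolated node ends up as the BISO or EISO node, which is
    what the text describes.
    """
    biso, eiso = boundaries(ring, target, target)
    if isolated not in (biso, eiso):
        raise ValueError(f"{target} is not adjacent to isolated node {isolated}")
    return target, biso, eiso


print(boundaries(RING, "RPCN32 0", "RPCN32 0"))  # isolation on RPCN32 0
print(move_fault(RING, "RPCN32 0", "IUN32 1"))   # after the shift to IUN32 1
```

In this model the isolation on RPCN32 0 is bounded by IUN00 1 and IUN32 1, and after the shift IUN32 1 is bounded by RPCN32 0 and RPCN00 0, which agrees with the REPT RING CFR messages in the example.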
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR
NORMAL CONFIGURATION REQUESTED 0 1 4 3600000 (242674464)
REPT RING CFR RING CONFIGURATION ATTEMPT FAILED 17
COULD NOT ESTABLISH A NORMAL RING CONFIGURATION
(242674474)(242674649)
REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR
ISOLATION FROM RPCN32 0 TO RPCN32 0 REQUESTED 0 1 3 3600000 (242674676)
REPT RING CFR RING CONFIGURATION ESTABLISHED (610 MS)
BISO NODE = IUN00 1, EISO NODE = IUN32 1
(242674689)(242674963)
REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN32 1 RAC 0. (242674346)
REPT ARR AUTORST ATT COND RST FOR RPCN32 0 STARTED
RMV RPCN32 0 STOPPED 5

In this instance EAR did not receive or did not report blockage.
DGN RPCN32 0 PH 1 STF (11 X00000000 X00000000)
TEST
002
004
005 (X00000e00)
006
007
DGN RPCN32 0 PH 2 STF (11 X00000000 X00000000)
TEST
002
004
005 (X00000e00)
006
007
RPCN32 0 TERMINATED AT PH 27 STMNT 15 AFTER TEST 8
ANALY:TLPFILE: RPCN32 0 SUMMARY DATA
TLP: RPCN32 0 PH=1
TLP: RPCN32 0 PH=2
TLPFILE COMPLETED
DGN RPCN32 0 COMPLETED STF (21 X00000000 X00000000)
ANALY TLPFILE RPCN32 0 TLPSRCH
TLPFILE #917573
ANALY TLPFILE RPCN32 0 SUSPECT FLTY EQUIPMENT
CODE    GRP  MEM  CONT  POS  WT  NOTE
UN122C  32   0    --    --   10  --
UN123B  32   0    --    --   10  --
CABLE   --   --   --    --   10  3
The failure of test 5 means that low-phase ambiguity exists in this case; in other words, the IMS regards either RAC 1 in the BISO node or RAC 0 in the EISO node, or both, as possibly faulty.
The extended TLP output message does not identify equipment in the BISO or EISO node as faulty, because the ring interfaces of these nodes are necessarily classified as usable.
REPT ARR AUTORST ARR COND RST FOR RPCN32 0 FAILED OP:RING;DETD
Failure of the ARR restoral attempt results in the maintenance mode of the node being changed to manual.
00AA..............
01................
02................
30................
31................
32iA.............. 5
The isolation in this small ring during a time of heavy traffic creates an emergency condition. Following the procedures for ``Clearing Faults in Response to ARR Actions'' and ``Responding to Single-Node Isolations,'' the technician elects to change both UN122C and UN123B in RPCN32 0 but does not troubleshoot the cable. It is possible, of course, that the fault is in the cable, but this being a situation involving low-phase ambiguity, it is far more likely that the fault, if it is not in the circuit packs of RPCN32 0, is in the isolated RAC of either the EISO or BISO node. Then, this being a phase 1 and 2 failure, the technician diagnoses the node using the RAW option so that if phase 1 or 2 still fails, an indication will be given as to whether the isolated RAC of the BISO or EISO node is suspected of being faulty. Of course, the problem could be in the cable of RPCN32 0.
RMV RPCN32 0 STOPPED 5
DGN RPCN32 0 PH 1 STF (11 X00000000 X00000000)
TEST MISMATCH
002
004
005 X00000e00
006
007
008
009
010 X00000e01
011
016
017
DGN RPCN32 0 PH 2 STF (10 X00000000 X00000000)
TEST MISMATCH
002
004
005 X00000e00
006
007
008
009
010 X00000c01
011
016
017
DGN RPCN32 0 PH 10 ATP
DGN RPCN32 0 PH 11 ATP
DGN RPCN32 0 PH 12 ATP
DGN RPCN32 0 PH 13 ATP
DGN RPCN32 0 PH 20 ATP

The mismatch data for failing test 10 identifies both IUN32 1 and IUN00 1 as suspect nodes.
(Hexadecimal e01 is translated by the ``Physical Node Address Hexadecimal Representation'' table in the reference chapter of this document as node 32 1 and hexadecimal c01 is translated as node 00 1.) In this situation, the standard procedure calls for technicians to extend the isolation to include IUN32 1 or IUN00 1 to perform maintenance on it. Extending the isolation to include IUN32 1 would in this instance, however, bring the ring down, because it would result in the isolation of both pairs of padded interframe buffers.
(See the illustration of the ring that appears at the beginning of this section.) Therefore, the first action (which to conserve space is not shown here) was to extend the isolation to include IUN00 1 and to perform maintenance on it. This action, however, did not find a fault in IUN00 1, and so the isolation was reduced to include once again only RPCN32 0, and the MOVFLT option of the CFR command was employed to shift the isolation from RPCN32 0 to IUN32 1 as played out below.
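The translation step for test 10 mismatch data can be sketched as a lookup. This is a minimal illustration only: the dictionary holds just the two translations quoted in this section, the authoritative source remains the table in the reference chapter, and the names `KNOWN_NODE_ADDRESSES` and `suspect_node` are hypothetical.

```python
# Only the translations quoted in this section; the reference-chapter table
# is authoritative and covers every physical node address.
KNOWN_NODE_ADDRESSES = {
    0xE01: "IUN32 1",
    0xC01: "IUN00 1",
}


def suspect_node(mismatch_word: str) -> str:
    """Translate a mismatch word such as 'X00000e01' into a node name."""
    value = int(mismatch_word.lstrip("Xx"), 16)
    if value not in KNOWN_NODE_ADDRESSES:
        # Deliberately no guessing: unknown addresses must be looked up.
        raise KeyError(f"{mismatch_word}: consult the node address table")
    return KNOWN_NODE_ADDRESSES[value]


print(suspect_node("X00000e01"))  # phase 1, test 010
print(suspect_node("X00000c01"))  # phase 2, test 010
```

Raising an error for any address not listed keeps the sketch from inventing translations that this section does not give.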
DGN RPCN32 0 PH 23 ATP
DGN RPCN32 0 PH 24 ATP
DGN RPCN32 0 PH 26 ATP
DGN RPCN32 0 PH 27 ATP
DGN RPCN32 0 TERMINATED AT PH 27 STMNT 15 AFTER TEST 3
DGN RPCN32 0 STF (21 X00000000 X0000000)

Unneeded output generated by the DGN RAW option could have been stopped by terminating DGN with the STOP:DMQ command.

RMV:LN32 1

In preparation for entering the CFR command, the node specified in the command must be removed from service.
00AA..............
01................
02................
30................
31................
32iO.............. 51
REPT RING CFR WARNING: BISO AND/OR EISO NODE OOS
BISO NODE = IUN00 1, EISO NODE = IUN32 1
ACTIVE RING SEGMENT NOT LONG ENOUGH
Removing a BISO or EISO node from service would ordinarily cause the isolation to extend to include the out-of-service node. In this case it does not, however, because IMS calculates that doing so would shorten the ring below its minimum data length.
CFR:RING,IUN32 1;MOVFLT!
With the suspect IUN32 1 quarantined out-of-service, the technician enters the MOVFLT version of the CFR command to shift the isolation to include IUN32 1.
REPT RING CFR RING CONFIGURATION ESTABLISHED (290 ms)
BISO NODE = RPCN32 0, EISO NODE = RPCN00 0
(243506608)(243506934)
REPT ARR AUTORST CNR UCL REST FOR RPCN32 0 STARTED
CFR RING IUN32 1 COMPL

ARR undertakes its highest-priority task, the restoral of a node designated as a BISO or EISO node. With the isolation shifted, the ring now has the structure of the second illustration at the beginning of this section, and the probable fault in IUN32 1 may now be corrected.
00AA..............
01................
02................
30................
31................
32Ai.............. 5
request of ARR as RI faulty, and its maintenance mode changed to manual. Then, before the technician can repair and return it to service, another ring-related fault occurs on a distant part of the ring, with the result that the many nodes lying between the two faulty nodes must be removed from service as victims of the expanded isolation. The first stage of this example is identical to the example recorded above in ``Manual Recovery from a Hard Fault,'' except that the massive isolation intervenes before the first fault can be repaired. This example occurs on the following ring:
CMD>
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR
NORMAL CONFIGURATION REQUESTED (4030772385)
REPT RING CFR RING CONFIGURATION ATTEMPT FAILED 17
COULD NOT ESTABLISH A NORMAL RING CONFIGURATION
(4030772397)(4030772536)

Prompted by a ring transport error report, EAR level-0 requests that the ring config module restart the ring.
The continuity test run by the ring config module failed, an indication that the fault is probably hard.
REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR
ISOLATION FROM IUN31 11 TO IUN31 11 REQUESTED 0 2 4 3600000 (4030772561)
REPT RING CFR RING CONFIGURATION ESTABLISHED (658 MS)
BISO NODE = IUN31 10, EISO NODE = IUN31 12
(4030772580)(4030772942)
REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0. (4030772270)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0. (4030772278)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 9 RAC 0. (4030772282)
REPT ARR AUTORST ARR COND RST FOR IUN31 11 STARTED

EAR level-1 requests that the ring config module isolate the node indicated by the ring transport error messages below as faulty.
IUN31 11 is isolated with IUN31 10 acting as BISO node and IUN31 12 acting as EISO node.
ARR requests that MIRA conditionally restore the isolated node. This is ARR's check that the removal and isolation of the node was necessary. The attempt will generate diagnostic data that the technician should use if called upon to perform maintenance on the node.
RTR message announcing that ARR's restoral request is on the active queue and being processed.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
RTR message announcing that it could not remove IUN31 11 from service (because EAR had done so previously).

Indicates that during phase 1 diagnostics, some tests (nine in all) failed and none (X00000000 X00000000) were skipped. IUN31 11 is not necessarily the node in which phase 1 failed, but the node specified in ARR's diagnostic request. Since phases 1 and 2 test all RACs in the isolated segment, the fault that produces a phase 1 or 2 failure may not reside in the specified node.

The failure of test 005 indicates that, in this instance, low-phase ambiguity exists; in other words, that both a RAC of the isolated node and a RAC of either the EISO or BISO node are suspected of being faulty. See the ``Low-Phase Ambiguity'' section in this chapter.
DGN IUN31 11 PH 1 STF (9 X00000000 X00000000)
TEST
004
005 X00000dfb
006
008
009
DGN IUN31 11 PH 2 STF (10 X00000000 X00000000)
TEST
002
004
005 X00000dfb
006
007
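When many DGN phase reports have to be compared, the header line can be pulled apart mechanically. The sketch below assumes the header layout seen in this example (node, phase number, STF, a count, and two hexadecimal words) and reads the count as the number of failed tests and the two words as the skipped-test indication, following the annotation above. The names `PhaseFailure` and `parse_stf_header` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PhaseFailure(NamedTuple):
    node: str                       # node named in the diagnostic request
    phase: int                      # diagnostic phase number
    failed_tests: int               # count reported after STF
    skipped_words: Tuple[int, int]  # the two hex words; zero means none skipped


# Header layout assumed from the ROP output shown in this example.
_STF_HEADER = re.compile(
    r"DGN\s+(\S+\s+\d+)\s+PH\s+(\d+)\s+STF\s+"
    r"\((\d+)\s*X([0-9A-Fa-f]+)\s+X([0-9A-Fa-f]+)\)"
)


def parse_stf_header(line: str) -> Optional[PhaseFailure]:
    """Return the parsed header, or None if the line is not an STF header."""
    match = _STF_HEADER.search(line)
    if match is None:
        return None
    node, phase, failed, word1, word2 = match.groups()
    return PhaseFailure(node, int(phase), int(failed),
                        (int(word1, 16), int(word2, 16)))


print(parse_stf_header("DGN IUN31 11 PH 1 STF (9 X00000000 X00000000)"))
print(parse_stf_header("DGN IUN31 11 PH 2 STF (10 X00000000 X00000000)"))
```

As the annotations stress, the node in the header is the node named in the diagnostic request, not necessarily the node in which the phase failed.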
Phase-1 diagnostics test the isolated segment beginning at the BISO node, and phase-2 diagnostics test it beginning at the EISO node. In the case of single-node isolations, the two phases should report failure data for the same node(s), but in the case of multiple-node isolations they usually report failure data for different nodes.

DGN IUN31 11 terminated at ph 2 stmnt 36 after test 17

Indicates the point in the diagnostic routine at which execution terminated.

ANALY:TLPFILE: IUN31 11 SUMMARY DATA MSG STARTED
TLP: IUN31 11 PH=1
TLP: IUN31 11 PH=2
TLPFILE COMPLETED

Summarizes diagnostic failure data. Phases cited are those that failed; but because phases 1 and 2 are at issue, IUN31 11 is not necessarily the location of the failure.
DGN IUN 31 11 COMPLETED STF (19 ...)
ANALY TLPFILE IUN31 11 TLPSRCH MSG IP
TLPFILE #983090
ANALY TLPFILE IUN31 11 SUSPECT FLTY EQUIPMENT
CODE   GRP  MEM  CONT  POS  WT  NOTE
UN303  31   11   --    --   10  --
CABLE  --   --   --    --   10  3

Short form of this message. The longer form is next.

This data is printed only after a test fails and only if the TLP option was specified in the DGN command (as it always is by ARR). The entry lists in weighted (WT) order equipment suspected of being faulty. The WT is a number between 1 and 10. The higher the WT, the greater the likelihood of the equipment being faulty. Because ARR does not specify the RAW option of the DGN command, failure data for test 010 is not given. (See the ``Low-Phase Ambiguity'' section of this chapter.)

Because of diagnostic failure (error code 1).
Confirms that ARR's restoral request has failed. Many IMS processes write to the ROP, at times resulting in some redundancy. Manual input message.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
OP:RING, IUN31 11
OP:RING IUN31 11 COMPL
IUN31 11: MJ = OOS; NM = MAN; RI = FLTY; NP = USBL IN ISOL SEG
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR
ISOLATION FROM IUN31 11 TO IUN31 11 REQUESTED. 0 1 4 3600000 (403082426)

Manual input message. Like the TLP output above, this data does not reflect the low-phase ambiguity.
Before the technician can respond to the single isolation, another fault occurs. EAR level-0 attempts to restart the ring in conformity with its isolated structure prior to the occurrence of the second fault.
REPT RING CFR RING CONFIGURATION ATTEMPT FAILED 17
COULD NOT ESTABLISH BISO NODE = IUN31 10, EISO NODE = IUN31 12
(403082441)(403082625)
REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR
ISOLATION FROM IUN31 11 TO IUN32 6 REQUESTED. 0 2 4 3600000 (403082654)
REPT RING TRANSPORT ERR RMV RPCN 32 0 RQSTD; RPC ISOLATION RPTD (403082796)
REPT RING CFR RING CONFIGURATION ESTABLISHED (703 ms)
BISO NODE = IUN31 10, EISO NODE = IUN32 7
(403082671)(403082031)
REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN32 6 RAC 0. (403082306)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN32 5 RAC 0. (403082316)
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN32 4 RAC 0. (403082322)
REPT ARR AUTORST ARR COND RST FOR IUN32 6 STARTED
so the isolation must be extended to include both nodes suspected of having faulty ring interfaces.
This message notifies the technician that an innocent-victim RPCN is being included in the extended isolation. The multiple-node isolation is now established.
Having failed previously (during the single isolation stage) to restore IUN31 11, ARR now selects IUN32 6 for a conditional restoral attempt.
RMV IUN32 6 STOPPED 5
DGN IUN32 6 PH 1 STF (9 X00000000 X00000000)
TEST
004
005 X00000dfb
006
008
009

Phase-1 diagnostic tests begin running from the BISO node. Therefore, they identify IUN31 11 as faulty.

DGN IUN32 6 PH 2 STF (11 X00000000 X00000000)
TEST
002
004
005 X00000e06
006
007

Phase-2 diagnostic tests begin running from the EISO node. Therefore, they identify IUN32 6 (e06) as faulty. The failure of test 005 of phase 2 indicates that low-phase ambiguity exists surrounding IUN32 6. Probably, though not certainly, IUN32 5, whose ring interface is suspected to be faulty, is the node involved in this instance of low-phase ambiguity.

DGN IUN32 6 TERMINATED AT PH 2 STMNT 36 AFTER TEST 17
ANALY:TLPFILE: IUN32 6 SUMMARY DATA
TLP: IUN32 6 PH=1
TLP: IUN32 6 PH=2
TLPFILE COMPLETED
DGN IUN32 6 COMPLETED STF (20 ...)
ANALY TLPFILE IUN 32 6 TLPSRCH
TLPFILE # 1179716
ANALY TLPFILE IUN32 6 SUSPECT FLTY EQUIPMENT
CODE   GRP  MEM  CONT  POS  WT  NOTE
UN303  31   12   --    --   10  --
UN303  31   11   --    --   10  --
CABLE  --   --   --    --   10  3

Contrast this output with the TLP output when IUN31 11 was singly isolated. Both then and now the ring interface of IUN31 12 was suspect. The difference is that when the suspect RAC of IUN31 12 was part of an EISO node, its ring interface could not be set to FLTY. IUN32 6 is not included because the TLP output reflects only the first failing phase.
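Why phases 1 and 2 name different nodes in a multiple-node isolation can be shown with a simplified model. The sketch assumes that each phase reports the first suspect node it meets while walking the isolated segment from its own end. The segment contents are read from the OP:RING;DETD display of this example, and `first_failure` is a hypothetical helper, not an IMS routine.

```python
from typing import List, Optional, Set


def first_failure(segment: List[str], suspects: Set[str],
                  from_eiso: bool = False) -> Optional[str]:
    """Return the first suspect node met when walking the isolated segment.

    Phase 1 is modeled as walking from the BISO side (list order) and
    phase 2 as walking from the EISO side (reverse order).
    """
    order = reversed(segment) if from_eiso else segment
    for node in order:
        if node in suspects:
            return node
    return None


# Isolated segment of this example, listed from the BISO side to the EISO side.
SEGMENT = (
    [f"IUN31 {m}" for m in range(11, 16)]
    + ["RPCN32 0"]
    + [f"IUN32 {m}" for m in range(1, 7)]
)
SUSPECTS = {"IUN31 11", "IUN32 6"}

print(first_failure(SEGMENT, SUSPECTS))                  # phase 1 view
print(first_failure(SEGMENT, SUSPECTS, from_eiso=True))  # phase 2 view
```

With a single isolated node the segment has one entry, so both walks return the same node, which matches the statement that the two phases should then report the same failure data.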
REPT ARR AUTORST ARR COND RST FOR IUN32 6 FAILED OP:RING;DETD
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAiiiii 54
32iiiiiiiAAAAA.... 45
63.AAAAAAAAAAAAAAA
Notice that the subnumbers produced by the OP:RING;DETD command indicate that, as a result of low-phase ambiguity, four nodes are suspected of having faults in their ring interfaces. Because none of the four is now in the active ring as an EISO or BISO node, each can have its ring interface minor state marked FLTY.

DGN:IUN31 11;RAW!

In accordance with the procedures ``Responding to Multiple-Node Isolations'' and ``Clearing Faults in Response to ARR Actions,'' a technician replaces circuit pack UN303 in IUN31 11 and submits the node to automatic diagnostics with the RAW option.
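The rows of the ring status display used throughout this example can be decoded mechanically. The sketch below rests on assumptions inferred from how the rows change in this example rather than on a legend given here: a two-digit group number followed by 16 member positions, position 0 holding the RPCN and positions 1 to 15 the IUNs, with A read as active, i as isolated, O as out of service, and a period as unequipped. Trailing subnumbers are ignored, and `decode_row` is a hypothetical name.

```python
from typing import Dict

# Assumed legend, inferred from this example; check it against the display
# description for the 1105 page before relying on it.
STATE_BY_SYMBOL = {"A": "active", "i": "isolated", "O": "out of service"}


def decode_row(row: str) -> Dict[str, str]:
    """Map each equipped node in one status row to its displayed state."""
    group = row[:2]
    cells = row[2:18]  # 16 member positions; trailing subnumbers are ignored
    nodes = {}
    for member, symbol in enumerate(cells):
        if symbol == ".":
            continue  # assumed to mean no node equipped at this position
        kind = "RPCN" if member == 0 else "IUN"
        nodes[f"{kind}{group} {member}"] = STATE_BY_SYMBOL.get(symbol, symbol)
    return nodes


for row in ("31.AAAAAAAAAAiiiii 54", "32iiiiiiiAAAAA.... 45"):
    isolated = [n for n, s in decode_row(row).items() if s == "isolated"]
    print(row[:2], isolated)
```

Read this way, the two rows give an isolated segment running from IUN31 11 through RPCN32 0 to IUN32 6, which agrees with the isolation requested in the REPT RING CFR message.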
RMV IUN31 11 STOPPED 5
DGN IUN31 11 PH 1 STF (10 X00000000 X00000000)
TEST
004
005 X00000e05
006
007
008
009
010 X00000e06
011
016
017

This output from the manual diagnostic request with the RAW option shows IUN32 5 and IUN32 6 as suspected of having faulty ring interfaces, implying that IUN31 11 and IUN31 12 have passed phase 1, a condition that should cause their ring interface states to change to QUSBL.

REPT ARR AUTORST ARR COND RST FOR IUN31 12 STARTED

Having failed to restore IUN31 11 and IUN32 6, ARR now attempts to restore IUN31 12. This automatic action occurs at nearly the same time as the manual diagnostic procedure.

RST IUN31 12 QUEUED TASK 0
DGN IUN31 11 PH 2 STF (11 X00000000 X00000000)
TEST
002
004
005 X00000e06
006
007
008
009
010 X00000e05
011
016
017
DGN IUN31 11 TERMINATED AT PH 2 STMNT 36 AFTER TEST 17
DGN IUN31 11 COMPLETED STF (21 ...)
RST LN31 12 TASK 9

ARR restoral request on IUN31 12 started.

RMV IUN31 12 STOPPED 5
DGN IUN31 12 PH 1 STF (10 X00000000 X00000000)
TEST
004
005 X00000e05
006
007
008
DGN IUN31 12 PH 2 STF (11 X00000000 X00000000)
TEST
004
005 X00000e06
006
007
008
DGN IUN31 12 TERMINATED AT PH 2 STMNT 36 AFTER TEST 17

This is output from ARR's restoral request.

ANALY:TLPFILE: IUN31 12 SUMMARY DATA
TLP: IUN31 12 PH=1
TLP: IUN31 12 PH=2
ANALY TLPFILE IUN31 12 SUSPECT FLTY EQUIPMENT
CODE   GRP  MEM  CONT  POS  WT  NOTE
UN303  32   6    --    --   10  --
UN303  32   5    --    --   10  --
CABLE  --   --   --    --   10  3

Only the extended TLP message explicitly identifies the node(s) within the isolation that may have failed diagnostic phases 1 and 2.
REPT RING CFR RING CONFIGURATION ESTABLISHED (358 ms) BISO NODE = IUN32 4, EISO NODE = IUN32 7 (403041870)(403042272)
This action was triggered by the automatic RST command, which concludes with a request that as much as possible of an isolated segment be included in the active ring. The isolated segment is now reduced to the two nodes whose ring interfaces are still suspected of being faulty.
DGN IUN 31 12 STF
REPT ARR AUTORST ARR COND RST FOR IUN31 12 FAILED
REPT ARR AUTORST CNR UCL RST FOR IUN32 4 STARTED

The new BISO node, having been an innocent victim of the isolation, was out of service. Restoring a BISO or EISO node is the highest priority of ARR.

REPT ARR AUTORST CNR UCL RST FOR IUN32 4 SUCCEEDED
RST IUN32 4 COMPLETED
REPT ARR AUTORST ARR COND RST FOR IUN32 5 STARTED

Having previously attempted and failed to restore IUN32 6, ARR now attempts to restore IUN32 5. Consult the section ``Restoral Priorities Rule'' in this chapter for an explanation of ARR's behavior in the remainder of this example.

RST IUN32 5 TASK 0 MSG STARTED
RMV IUN32 5 STOPPED 5
DGN IUN32 5 PH 1 STF (10 X00000000 X00000000)
TEST
004
005 X00000e05
006
007
008

This is output from ARR's restoral request for IUN32 5.
DGN IUN32 5 PH 2 STF (11 X00000000 X00000000)
TEST
004
005 X00000e06
006
007
008
DGN IUN32 5 TERMINATED AT PH 2 STMNT 36 AFTER TEST 17
ANALY:TLPFILE: IUN32 5 SUMMARY DATA
TLP: IUN32 5 PH=1
TLP: IUN32 5 PH=2
ANALY TLPFILE IUN31 12 SUSPECT FLTY EQUIPMENT
CODE   GRP  MEM  CONT  POS  WT  NOTE
UN303  32   6    --    --   10  --
UN303  32   5    --    --   10  --
CABLE  --   --   --    --   10  3
RST IUN32 5 STOPPED 10
DGN IUN32 5 STOPPED COMPLETED
REPT ARR AUTORST ARR COND RST FOR IUN32 5 FAILED
REPT ARR AUTORST ARR UCL RST FOR RPCN32 0 STARTED

Having attempted to restore all nodes whose ring interfaces are possibly faulty, ARR now unconditionally restores the innocent victim RPCN...

RST RPCN32 0 COMPLETED
REPT ARR AUTORST ARR UCL RST FOR IUN31 13 STARTED

...and then the innocent victim IUNs. (The ROP output concerning restoral of the innocent victim IUNs is omitted from this example.)
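The order in which ARR works through the out-of-service nodes in this example can be summarized as a priority sort. This sketch reflects only the sequence observed here (BISO or EISO node first, then nodes with possibly faulty ring interfaces, then the innocent victim RPCN, then the innocent victim IUNs); the full rule is given in the ``Restoral Priorities Rule'' section, and the role labels and names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Priority classes as observed in this example, highest priority first.
# The labels are illustrative; they are not IMS terminology.
PRIORITY = {
    "boundary": 0,     # out-of-service BISO or EISO node (unconditional)
    "suspect": 1,      # ring interface possibly faulty (conditional)
    "victim_rpcn": 2,  # innocent victim RPCN (unconditional)
    "victim_iun": 3,   # innocent victim IUN (unconditional)
}


@dataclass
class PendingRestoral:
    node: str
    role: str


def restoral_order(pending: List[PendingRestoral]) -> List[str]:
    """Return node names in the order this model would restore them."""
    # sorted() is stable, so nodes of equal priority keep their input order.
    return [p.node for p in sorted(pending, key=lambda p: PRIORITY[p.role])]


pending = [
    PendingRestoral("IUN31 13", "victim_iun"),
    PendingRestoral("RPCN32 0", "victim_rpcn"),
    PendingRestoral("IUN32 5", "suspect"),
    PendingRestoral("IUN32 4", "boundary"),
]
print(restoral_order(pending))
```

The resulting order (IUN32 4, IUN32 5, RPCN32 0, IUN31 13) matches the sequence of REPT ARR AUTORST messages in this part of the example.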
REPT ARR AUTORST ARR UCL RST FOR IUN31 13 SUCCEEDED RST IUN31 13 COMPLETED OP:RING;DETD
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAOOAAA 33
32AAAAAiiAAAAA.... 55
63.AAAAAAAAAAAAAAA
OP:RING, IUN31 11
OP:RING IUN31 11 COMPL
IUN31 11: MJ = OOS; NM = MAN; RI = QUSBL; NP = USBL IN ACT RING
OP:RING, IUN31 12
OP:RING IUN31 12 COMPL
IUN31 12: MJ = OOS; NM = MAN; RI = QUSBL; NP = USBL IN ACT RING

Notice that IUN31 11 and IUN31 12 are now quarantined and in the manual mode. They are in the manual mode because ARR previously failed to restore them. They are quarantined (classified as QUSBL) because no diagnostic phases higher than 2 have been run on them and, therefore, IMS cannot know that their ring-interface hardware (except for the hardware tested by phases 1 and 2, that is, the hardware that propagates messages on the ring) is usable.
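A node status line from OP:RING output can be split into its fields for quick checks such as "which out-of-service nodes are quarantined and in manual mode." This is a minimal sketch assuming the line layout shown above. Treating the words after the NP value (IN ACT RING, IN ISOL SEG) as a separate location phrase is an assumption, and `parse_node_status` is a hypothetical name.

```python
from typing import Dict


def parse_node_status(line: str) -> Dict[str, str]:
    """Split 'NODE: KEY = VALUE; ...' into a dictionary of fields."""
    node, _, rest = line.partition(":")
    status = {"NODE": node.strip()}
    for part in rest.split(";"):
        key, _, value = part.partition("=")
        status[key.strip()] = value.strip()
    # Assumption: anything after the first word of NP is a location phrase.
    np_state, _, location = status.get("NP", "").partition(" ")
    status["NP"] = np_state
    status["LOCATION"] = location
    return status


lines = [
    "IUN31 11: MJ = OOS; NM = MAN; RI = QUSBL; NP = USBL IN ACT RING",
    "IUN31 12: MJ = OOS; NM = MAN; RI = QUSBL; NP = USBL IN ACT RING",
]
for line in lines:
    s = parse_node_status(line)
    if s["RI"] == "QUSBL" and s["NM"] == "MAN":
        print(s["NODE"], "is quarantined and in manual mode;", s["LOCATION"])
```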
RST:IUN32 6:TLP
Following standard procedures, the technician now assigns priority to performing maintenance on the remaining isolated segment. Choosing IUN32 6 because it was an external isolated node in the massive isolation, the technician changes the circuit pack indicated in the original TLP message and then conditionally restores the node to service. (Although manual restoral requests take priority over automatically requested conditional restorals, the former can occur in parallel with automatically requested unconditional restorals, such as are occurring. Therefore, the technician felt free to conditionally restore IUN32 6. If a conflict had existed, allowing the rapid recovery of the many innocent victim nodes to proceed without interruption would usually make sense. The decision to conditionally restore IUN32 6 rather than to follow the somewhat slower procedure of running diagnostics on it with the RAW option was dictated by the high probability that IUN32 5 is the other node involved in this instance of low-phase ambiguity.)
REPT ARR AUTORST ARR UCL RST FOR IUN31 14 STARTED
RST:IUN31 11 TASK 1
REPT ARR AUTORST ARR UCL RST FOR IUN31 14 SUCCEEDED
RST IUN31 14 COMPLETED
REPT ARR AUTORST ARR UCL RST FOR IUN31 15 STARTED
RMV IUN31 11 STOPPED 5
REPT ARR AUTORST ARR UCL RST FOR IUN31 15 SUCCEEDED
DGN IUN31 11 COMPL CATP (X00000000 X40000000)

See the OM under DGN IUN, Bit 30, which indicates that not all phases ran because the node under test was not the only isolated node.
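The Bit 30 indication can be checked directly from the CATP words. The sketch assumes that bits are numbered from 0 at the least significant end and that the flag sits in the second word, which is consistent with the value X40000000 shown above; the OM entry for DGN IUN remains the authority on bit meanings. The names `bit_is_set` and `catp_words` are hypothetical.

```python
import re
from typing import Tuple


def bit_is_set(word: str, bit: int) -> bool:
    """Test one bit of a hex word such as 'X40000000' (bit 0 = least significant)."""
    value = int(word.lstrip("Xx"), 16)
    return bool((value >> bit) & 1)


def catp_words(line: str) -> Tuple[str, str]:
    """Extract the two parenthesized hex words from a DGN completion line."""
    match = re.search(r"\((X[0-9A-Fa-f]+)\s+(X[0-9A-Fa-f]+)\)", line)
    if match is None:
        raise ValueError("no CATP words found")
    return match.group(1), match.group(2)


first, second = catp_words("DGN IUN31 11 COMPL CATP (X00000000 X40000000)")
if bit_is_set(second, 30):
    print("Bit 30 set: not all phases ran for this node")
```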
ROP output concerning ARR's unconditional restorals of the remaining innocent victims is omitted from this example.

RST IUN32 6 TASK 2 MSG STARTED
RMV IUN32 6 STOPPED 5
DGN IUN32 6 COMPL CATP (X00000000 X40000000)
REPT RING CFR RING CONFIGURATION ESTABLISHED (338 ms)
NORMAL CONFIGURATION, NO NODES ISOLATED
(403431319)(403431699)
RST IUN32 6 COMPLETED
OP:RING;DETD!
OP RING COMP RING STAT: ACTIVE

That IMS is dissolving the remaining isolation, returning the ring subsystem to a two-ring structure, indicates the fault was located in IUN32 6.
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAOOAAA 33
32AAAAAOAAAAAA.... 3
63.AAAAAAAAAAAAAAA
Ring Maintenance
RST:IUN31 12!
Now the only task remaining for the technician is to conditionally restore the remaining out-of-service nodes, none of which will be handled by ARR, since they are all in the manual mode. Probably none of the out-of-service nodes will contain faults, since one has had its ring-interface circuit pack replaced and the other two were designated as possibly faulty as a result of low-phase ambiguity. Nevertheless, the technician restores them conditionally to be certain that a fault undetected in one of them does not lead to another massive isolation. If, while diagnostics are run on these nodes, a fault were to appear elsewhere in the ring, IMS would avoid a massive isolation by immediately returning the node being diagnosed to the active ring.
RST IUN31 12 TASK 2 MSG STARTED RMV IUN31 12 STOPPED 5 RST IUN31 11 COMPLETED REPT RING CFR RING CONFIGURATION ESTABLISHED (308 ms) BISO NODE = IUN31 10, EISO NODE = IUN31 13 (403490173)(403490559) The predictable action that concludes this example is not reproduced.
CMD>
00AAAAAAAAAAAA....
01................
02................
30................
31.AAAAAAAAAAAAAAA
32AAAAAAAAAAAA....
63.AAAAAAAAAAAAAAA
CMD 400
REPT RING CFR LEVEL 0 RING CONFIGURATION INITIATED BY EAR NORMAL CONFIGURATION REQUESTED 0 1 4 3600000.......................(4034364845) REPT RING CFR RING CONFIGURATION ESTABLISHED (468 ms) NORMAL CONFIGURATION, NO NODES ISOLATED (4034364857)(4034365210) REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0 ....................................................................... ............................................(4034364730) REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 09 RAC 0 ....................................................................... ............................................(4034364740)
A ring-related fault stimulates EAR to a level-0 attempt (restart) to recover the ring.
REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0 ....................................................................... ............................................(4034364745) REPT RING CFR LEVEL 1 RING CONFIGURATION INITIATED BY EAR ISOLATION FROM IUN31 11 TO IUN31 12 REQUESTED 0 1 4 3600000.......................... (4034368158) REPT RING CFR RING CONFIGURATION ESTABLISHED (437 MS) BISO NODE = IUN31 10, EISO NODE = IUN31 12 (4034368175)(4034368492) REPT RING TRANSPORT ERR RAC PARITY/FORMAT ERROR DETECTED, IUN31 11 RAC 0 ....................................................................... ............................................(4034368041) REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 09 RAC 0 ....................................................................... ............................................(4034368051) REPT RING TRANSPORT ERR BLOCKAGE DETECTED, IUN31 10 RAC 0 ....................................................................... ............................................(4034368056) REPT RING TRANSPORT ERR UNEXPLAINED LOSS OF TOKEN REPORTED ON BOTH RINGS. REPT TOKEN TRACK TOKEN WAS LOST BETWEEN IUN32 5 AND IUN32 6 ON RING: 0 REPT RING CFR LEVEL 3 RING CONFIGURATION INITIATED BY EAR 0 1 4 3600000.............................(4034373503) ...within the confidence interval the 3B21D receives notice that the token is lost without receiving other error reports. The token-track module reports the probable location where the token left the ring. When unexplained loss of token occurs during the confidence interval of levels 0 or 1, EAR jumps to level 3. The isolation succeeds momentarily, but... ...another fault occurs less than 3 seconds into the recovery, thereby driving EAR to escalate to a level-1 attempt to isolate the faulty node.
REPT RING CFR RING CONFIGURATION ESTABLISHED (1302 MS) NORMAL CONFIGURATION, NO NODES ISOLATED (4034374032)(4034374330)
EAR level-3 tests for continuity in the rings. Because the tests succeed, EAR directs ring configuration to establish the normal, two-ring structure. The success of the ring continuity tests is the first clear indication that the recent faults are transient in nature. But again the confidence interval fails, so EAR escalates to level 4.
REPT RING CFR LEVEL 4 RING CONFIGURATION INITIATED BY EAR 0 1 4 3600000..............................(4034376599) REPT RING CFR RING CONFIGURATION ESTABLISHED (8169 MS) NORMAL CONFIGURATION, NO NODES ISOLATED (4034384478)(4034384790)
Level 4 also finds continuity in the rings and directs ring configuration to establish the normal, two-ring structure. In this instance the recovery outlasts the confidence interval, thereby ending this episode of EAR escalation. Evidently the episode was triggered by two transient faults. The location of one fault is suggested by the short-lived, level-1 isolation of IUN31 11. The location of the other was identified by token track as between IUN32 5 and IUN32 6. The technician who witnesses these events should record the occurrences and locations of the two intermittent faults and perhaps should retain the ROP output of this unusual episode.
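The escalation seen in this episode can be summarized in a short Python sketch. It models only what the example shows: a fault within the confidence interval moves EAR to a higher level, and an unexplained loss of token during the confidence interval of level 0 or 1 jumps directly to level 3. The function names, the event labels, the one-level-up rule for transitions not shown in the example, and the treatment of level 4 as the top level are assumptions made for illustration, not a description of the EAR software.

```python
# Illustrative model of the EAR escalation described in this example.
# Names, event labels, and the "one level up" rule are assumptions.
MAX_LEVEL = 4  # assumption: level 4 is treated as the highest level here


def next_ear_level(current, event):
    """Return the next EAR level when `event` occurs in the confidence interval.

    event is "fault" for a reported ring fault, or "token_loss" for an
    unexplained loss of token with no other error reports.
    """
    if event == "token_loss" and current in (0, 1):
        return 3  # stated in the text: levels 0 and 1 jump to level 3
    if event in ("fault", "token_loss"):
        return min(current + 1, MAX_LEVEL)
    raise ValueError(f"unknown event: {event!r}")


def replay(events, start=0):
    """Return the list of levels visited, beginning with `start`."""
    levels = [start]
    for event in events:
        levels.append(next_ear_level(levels[-1], event))
    return levels


# The episode above: fault after level 0, token loss after level 1,
# another failed confidence interval after level 3.
print(replay(["fault", "token_loss", "fault"]))  # prints [0, 1, 3, 4]
```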
4
Contents
Introduction
Ring Fault Conditions and Maintenance Approach
Ring Node Out-of-Service
Ring Node OOS Maintenance Approach
Single-Ring Node Isolation
Single Node Isolation Maintenance Approach
Multiple-Ring Node Isolation
Multiple Node Isolation Maintenance Approach
Ring Down
Ring Down Maintenance Approach
Feature Definition
Purpose
Incompatibilities
Interactions
Changes
Feature Description
Release Availability
Provisioning
Special Planning Considerations
Hardware
Software Impact
Software Description
User Profile
Description of Feature Operation
Initial Setup
Setting a Breakpoint
Loading Memory
Reading Memory
Loading and Dumping RGRASP Utility Variables (UVARs)
Feature Activation
Feature Deactivation
Equipment Configuration Data (ECD)
Recent Change Procedures
Measurement
Network Management Impact
Maintenance/Troubleshooting Impact
Recording
Output Messages
Audits
Critical Events
Support Tools
Related Documentation
Cross-References
Introduction
This guide serves as an aid in performing ring and ring hardware maintenance functions. It contains procedures used in detecting, troubleshooting, and clearing faults associated with the ring and ring hardware. The procedures detailed in this guide are only guidelines for resolving ring-associated maintenance problems, and are not the only methods that may be used in performing ring maintenance.
A system called trace provides a formal mechanism for embedding tracepoints within application code for use in testing and debugging. The system collects and forwards the trace messages produced by individual tracepoints to one or more destinations, including log files, ROPs, and MCRTs. The tracepoints are controlled, so a related group scattered throughout the software can be turned on or off at will. The parameters can also be set and changed using craft commands. The trace system is created automatically during initialization; the user may also create it manually. The tracepoints are designed to generate little overhead when disabled, but when used improperly, the trace system can consume large amounts of system resources while yielding little useful information. Craft commands allow one to totally inhibit all tracepoints, so that no trace messages are generated and the trace system uses little overhead, or to enable subsets of the tracepoints, thus restricting trace output to only that dealing with selected portions of application code.
ALW:TRACE and INH:TRACE provide the basic on/off switch for trace. Until ALW:TRACE is invoked, no trace messages can be generated and logged under any circumstances. Similarly, once INH:TRACE is invoked, trace becomes totally dormant except for a certain amount of fixed overhead. If trace is inhibited, the SET:TRACE command allows one to specify which tracepoints are active once trace is again enabled or, if trace is active, the command allows one to control the tracepoints during operation. The command OP:TRACE presents a summary of the current status of trace. The output message REPT TRACE reports a tracepoint from a 3B21D computer process or a node processor. The output message REPT TDTP indicates that the trace process has encountered a hardware or software fault. It should also be noted that the trace process is terminated when the system enters disk independent operation; see the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
Ring maintenance functions for an office serve to detect, troubleshoot, and clear all fault conditions associated with the ring and ring hardware. The most common fault conditions associated with the ring are the following:
- Ring node out-of-service (OOS)
- Single ring node isolation
- Multiple ring node (RN) isolation
- Ring down.
Another less common fault condition on the ring is unexplained loss of token. These fault conditions are discussed in the remainder of this section. For additional information on ring maintenance. Direct link nodes (DLNs) follow the same guidelines as link nodes (LNs) in this section. CDN-I nodes also follow these guidelines, except that removing ring application processor (RAP) circuit packs requires that power be turned off before circuit pack (CP) extraction.
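The relationship among ALW:TRACE, INH:TRACE, and SET:TRACE described earlier in this introduction can be pictured with a small Python model. It is a sketch only: the class, its method names, and the tracepoint group names are hypothetical and do not correspond to the actual trace software. It shows that a message is produced only when trace is allowed and the tracepoint group has been selected, and that the selection can be changed while trace is inhibited.

```python
class TraceControl:
    """Toy model of the trace on/off switch; not the real trace system."""

    def __init__(self):
        # Until ALW:TRACE is invoked, no trace messages can be generated.
        self.allowed = False
        self.active_groups = set()  # hypothetical tracepoint group names

    def alw_trace(self):
        """Model of ALW:TRACE: turn the master switch on."""
        self.allowed = True

    def inh_trace(self):
        """Model of INH:TRACE: trace becomes dormant."""
        self.allowed = False

    def set_trace(self, group, on):
        """Model of SET:TRACE: usable whether trace is inhibited or active."""
        if on:
            self.active_groups.add(group)
        else:
            self.active_groups.discard(group)

    def emits(self, group):
        """True only if trace is allowed and the group is selected."""
        return self.allowed and group in self.active_groups


trace = TraceControl()
trace.set_trace("ring", True)   # selected while still inhibited
print(trace.emits("ring"))      # False: ALW:TRACE not yet invoked
trace.alw_trace()
print(trace.emits("ring"))      # True
```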
Figure 4-1. (Diagram labels: xx, OOS-NORM, yy)
NOTE: Perform an unconditional restore on the OOS-NORMAL node using the command RST:nodexx y;UCL where:
For LN: node = LN, xx = group number, y = node member number, UCL = restores the node without performing diagnostics.
For RPCN: node = RPCN, xx = group number, y = 0, UCL = restores the node without performing diagnostics.
CAUTION: Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an all-tests-passed (ATP) response.
- A complete diagnostic run has produced a conditional all-tests-passed (CATP) response, and the RI and the NP minor states are both usable (USBL).
Does the faulty node remain OOS-NORMAL? No: DONE. Yes: Proceed to next step.
5. Diagnose node (yy) adjacent to the faulty node. If problems are located, correct and restore node (yy) to service.
NOTE: Perform an unconditional restore on the OOS-NORMAL node using the command RST:nodexx y;UCL where:
For LN: node = LN, xx = group number, y = node member number, UCL = restores the node without performing diagnostics.
For RPCN: node = RPCN, xx = group number, y = 0, UCL = restores the node without performing diagnostics.
CAUTION: Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an ATP response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
Does the faulty node remain OOS-NORMAL? No: DONE. Yes: Proceed to next step.
6. If all attempts fail to clear the OOS node, then detailed testing is required. Call the CTS.
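The CAUTION repeated throughout this approach amounts to a simple precondition on the unconditional restore. A minimal Python sketch of that check follows. The function name and the string values used for the diagnostic result and minor states are illustrative assumptions; only the rule itself (ATP, or CATP with both RI and NP minor states USBL) comes from the text.

```python
def may_restore_unconditionally(diag_result, ri_state=None, np_state=None):
    """Return True if the CAUTION's precondition for RST ...;UCL is met.

    diag_result: result of a complete diagnostic run, e.g. "ATP" or "CATP".
    ri_state, np_state: RI and NP minor states, e.g. "USBL".
    """
    if diag_result == "ATP":
        return True
    if diag_result == "CATP":
        # CATP is acceptable only when both minor states are usable.
        return ri_state == "USBL" and np_state == "USBL"
    return False


print(may_restore_unconditionally("ATP"))                   # True
print(may_restore_unconditionally("CATP", "USBL", "USBL"))  # True
print(may_restore_unconditionally("CATP", "USBL", "FLTY"))  # False
```

The state value "FLTY" in the last call is only a stand-in for any minor state other than USBL.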
See the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual for further information and an explanation of the message response. Assumption: An equipment malfunction has been detected; the fault recovery software has removed a single node from service, reconfigured the ring, and formed an isolation around the faulty node. The ARR has attempted to restore the node to service and has failed (manual action is required). The following diagram depicts a single node isolation.
Figure 4-2. (Diagram labels: BISO, isolated, EISO)
The following message should appear on the MCRT:
CHG SLK a b [c d] NEW REQUESTED MINOR STATE = MOOS
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
When the BISO node is configured into the isolated segment of the ring, a new BISO node is established. Once this occurs, the old BISO node can be diagnosed as any other node in an isolated segment.
Figure 4-3. (Diagram labels: BISO, isolated, EISO)
NOTE: After diagnosing and clearing problems associated with the old BISO node, restore it to service using guidelines for restoring all other nodes. If the problem with the isolation was associated with the BISO node and corrected, then it is included back into the active ring, restoring and including the isolated segment into the active ring also. If the node is OOS-NORMAL, and the isolation has cleared, then unconditionally restore the OOS-NORMAL node to service. Refer to ``Ring Node OOS Maintenance Approach'' in this chapter.
CAUTION: Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an ATP response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
If the SLK was manually removed from service, put it back in the AVAILABLE-In Service (IS) or AVAILABLE-Standby (STBY) state by typing the following message into the MCRT:
CHG:SLK (a, b, [c, d]);{IS | ARST}
where:
a = group number (00 - 63)
b = member number (01 - 15)
The following message should appear on the MCRT:
CHG SLK a b [c d] NEW REQUESTED MINOR STATE = e
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
If the isolation still exists, proceed to the next step.
3. After diagnosing and troubleshooting the BISO node, if the isolation on the ring still exists, restore the old BISO node, and then diagnose and troubleshoot the EISO node using the guidelines used in diagnosing the BISO node (Step 2 above).
Figure 4-4. (Diagram labels: BISO, isolated, NEW EISO)
NOTE: After diagnosing and clearing any problems associated with the old EISO node, restore it to service if an ATP response is received for all phases. If the fault was found in the EISO node, then the isolation should clear, leaving the original faulty node in the OOS-NORMAL state. See Figure 4-4. If the node is OOS-NORMAL, and the isolation has cleared, then refer to ``Ring Node OOS Maintenance Approach,'' and unconditionally restore the node to service. If the isolation still exists, proceed to the next step.
4. If the ring isolation is not cleared, then starting with the single isolated node, replace all RN CPs in the order of ring interface 0 (RI0), RI1, the NP, and the link interface, and perform a conditional restore. For VLSI RNs, replace the IRN circuit pack and then the link interface. If the trouble clears after replacing the CPs in the order listed, then when office traffic is minimal, the original CPs should be reinserted one at a time in the node and diagnostics run to determine the faulty CP(s). If the diagnostics fail to detect the faulty CP, but the previous CP replacements cleared the trouble, then the CP(s) should be saved, noting the failure conditions. Inform the CTS of the condition. If the trouble is located and corrected, leaving the original isolated node in the OOS-NORMAL maintenance state, then refer to ``Ring Node OOS Maintenance Approach'' in this chapter to complete this approach. If the isolation still exists, proceed to the next step.
5. Visually inspect affected equipment for shorts, bent or broken backplane/pins, etc. Correct any problems that are uncovered. Diagnose, and unconditionally restore equipment to service if an ATP response is received for all phases run. If the isolation clears, and the node is OOS-NORMAL, refer to ``Ring Node OOS Maintenance Approach'' listed in this chapter. If the isolation still exists, proceed to the next step.
6. Contact the CTS.
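The circuit pack replacement order given in Step 4 above, and the later one-at-a-time reinsertion of the original packs, can be written down as a short Python sketch. The function names and the wording of the generated steps are illustrative assumptions; the pack order itself is taken from the text.

```python
def cp_replacement_order(vlsi_rn=False):
    """Return the order in which ring node circuit packs are replaced."""
    if vlsi_rn:
        # For VLSI RNs: the IRN circuit pack, then the link interface.
        return ["IRN", "link interface"]
    return ["RI0", "RI1", "NP", "link interface"]


def reinsertion_steps(vlsi_rn=False):
    """List the follow-up steps for finding the faulty original CP.

    The originals are reinserted one at a time, with diagnostics run
    after each reinsertion (when office traffic is minimal).
    """
    steps = []
    for cp in cp_replacement_order(vlsi_rn):
        steps.append(f"reinsert original {cp}")
        steps.append(f"run diagnostics with original {cp} in place")
    return steps


for step in reinsertion_steps(vlsi_rn=True):
    print(step)
```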
to replace CPs, and to restore the ring to an operational state. The second approach (B) details guidelines that should be used when the load on the CNI is minimal. The first approach is not intended to be used as the total maintenance approach, and should only be used when time does not allow for diagnostic testing. Otherwise, approach ``B'' should be used whenever possible.
Figure 4-5. (Diagram labels: BISO, iso 0, xx, yy, zz, iso 1)
The xx, yy, and zz represent nodes that are in the isolated segment and may or may not be faulty.
5. Place all other CPs in the original static wrapping, and store them (the ``tested good'' CPs) for possible future faults.
For LNs, enter RST:LNxx y;UCL!
For RPCN, enter RST:RPCNxx yy;UCL
where:
xx = group number
y = node member number
UCL = restores the node without performing diagnostics.
CAUTION: Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an ATP response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
If iso 0 remains OOS-NORMAL, refer to ``Ring Node OOS Maintenance Approach'' in this chapter. If the original isolation still exists, proceed to next step.
2. Diagnose node xx using guidelines detailed in Chapter 6, Diagnostic User's Guide.
If node iso 0 is in the OOS-NORMAL state, and the original BISO node no longer exists after diagnosing and repairing node xx, then refer to ``Ring Node OOS Maintenance Approach.'' If the above statement is true, and all problems are corrected concerning these nodes, then a single node isolation may be formed, including a new BISO node, iso 1, and the EISO node. If this occurs, then refer to ``Single Node Isolation Maintenance Approach'' for the remainder of these guidelines. If the original isolation still exists after diagnosing node xx and correcting any problems, then repeat Steps 1 and 2 using nodes iso 1 and yy. If the original isolation still exists, then proceed to the next step.
3. Diagnose the BISO node.
NOTE: The BISO node is an active node on the ring. To diagnose the BISO node, the node must be excluded from the active ring. See Figure 4-6. To accomplish this, use the RMV command. See the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual. When the BISO node is removed from service (OOS-NORM), it is automatically included in the isolated segment (OOS-ISOLATED).
Figure 4-6. (Diagram labels: NEW BISO, iso 0, xx, yy, zz, iso 1, EISO)
The RMV request may or may not be accepted. If the request is accepted, proceed with diagnostics as usual, using guidelines listed in Chapter 6, Diagnostic User's Guide. If the request is denied, it may be necessary to remove the node and SLK from service, and then diagnose the node. To put the SLK in the AVAILABLE-MOOS state, type the following message into the MCRT, and proceed with diagnostics as usual:
CHG:SLK (a, b, [c, d]); MOOS
where:
a = group number (00 - 63)
b = member number (01 - 15)
The following message should appear on the MCRT:
CHG SLK a b [c d] NEW REQUESTED MINOR STATE = MOOS
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
NOTE: After diagnosing and clearing problems associated with the BISO node, if any are located, restore the node to service using guidelines for restoring all other nodes. After diagnosing the BISO node, if problems are found and corrected, and if an ATP response is received, the BISO node may be deleted, leaving the iso 0 node in the OOS-NORMAL state. If this occurs, restore iso 0 to service. Refer to ``Ring Node OOS Maintenance Approach'' in this chapter.
CAUTION: Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an ATP response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
If problems are corrected with the BISO, iso 0, and xx nodes, then the isolated segment of the ring should shorten, leaving only a single isolated node. If this occurs, refer to ``Single Node Isolation Maintenance Approach'' in this chapter for the remainder of this test. If the SLK was manually removed from service, put it back in the AVAILABLE-IS or AVAILABLE-STBY state by entering the following message at the MCRT:
CHG:SLK (a, b, [c, d]);{IS | ARST}
where:
a = group number (00 - 63)
b = member number (01 - 15)
The following message should appear on the MCRT:
CHG SLK a b [c d] NEW REQUESTED MINOR STATE = e
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
4. If the original ring isolation still exists, starting with node iso 0, then xx, and finally the BISO node, replace all RN CPs in this order: ring interface 0 (RI0), RI1, the NP, and the link interface. Perform a conditional restore. For VLSI RNs, replace the IRN circuit pack and then the link interface. If the trouble clears after replacing the CPs in the order listed, the original CPs should be reinserted one at a time in the node and diagnostics run to determine the faulty CP(s). If the diagnostics fail to detect the faulty CP(s), but the previous CP replacement cleared the trouble, then the CP(s) should be saved, noting the failure conditions. Inform the CTS of the condition.
5. If the original ring isolation still exists, visually inspect affected equipment for shorts, bent or broken pins, backplane faults, etc. Also ensure that proper equipment has been used with the long message option. If problems are located, correct the problems and perform a conditional restore on the affected equipment.
6. If the isolation still exists, or if all problems with the original BISO node, the iso 0 node, and node xx have been cleared, diagnose and attempt to correct problems associated with nodes iso 1, yy, and the EISO node, using Steps 3 through 5 of these guidelines. See Figure 4-7.
Figure 4-7. (Diagram labels: BISO, iso 0, xx, yy, zz, iso 1, NEW EISO)
NOTE: After correcting and restoring this portion of the isolated segment of the ring, attempt to restore iso 0, xx, and the BISO nodes if problems were not corrected in previous steps.
7. If all attempts fail to clear the isolated segment, then detailed testing is required. Contact the CTS.
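The parameter ranges listed for the CHG:SLK messages in this approach can be checked before a message is typed at the MCRT. The following Python sketch does only that range check. The function name is hypothetical, and the rule that c and d are supplied together is an assumption read from the bracketed [c d] notation; the sketch does not attempt to build or send the message itself.

```python
def validate_slk(group, member, pack=None, port=None):
    """Check CHG:SLK identifiers against the ranges listed in the text.

    group  (a): 00 - 63
    member (b): 01 - 15
    pack   (c): LI4 circuit pack, 0 - 1 (optional)
    port   (d): LI4 port, 0 - 3 (optional)
    Returns the validated values as a tuple.
    """
    if not 0 <= group <= 63:
        raise ValueError("group number must be 00 - 63")
    if not 1 <= member <= 15:
        raise ValueError("member number must be 01 - 15")
    # Assumption: c and d appear together or not at all, as in [c d].
    if (pack is None) != (port is None):
        raise ValueError("LI4 circuit pack and port must be given together")
    if pack is not None:
        if not 0 <= pack <= 1:
            raise ValueError("LI4 circuit pack must be 0 - 1")
        if not 0 <= port <= 3:
            raise ValueError("LI4 port must be 0 - 3")
        return (group, member, pack, port)
    return (group, member)


print(validate_slk(31, 11))        # (31, 11)
print(validate_slk(31, 11, 0, 3))  # (31, 11, 0, 3)
```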
Ring Down
The ring down maintenance state is a state where the ring is unable to handle traffic. In this state, communication with the 3B21D computer (except for maintenance purposes) and other nodes on the ring is lost. All LNs are in the OOS state and all ring peripheral controller nodes (RPCNs) are in the standby state. The RPCNs are left in this standby state to eliminate any need to restore them if service can be restored. This state totally affects system operation; therefore, the problem must be corrected as soon as possible. If the ring is down for more than one second, the CCS network is affected and a critical alarm results. Expect the REPT CSLM output message.
NOTE: For additional information on the initialization levels, refer to ``Initialization,'' Part 4 of this manual.
Does the ring initialize? Yes: Proceed to next step. No: Proceed to Step 5.
3. Are all nodes that were not OOS before the ring down state (except quarantined nodes) restored to service? Yes: Proceed to Step 8. No: Proceed to next step.
4. For all nodes that were not previously OOS before the ring failure, perform an unconditional RST. See Chapter 6, Diagnostic User's Guide, or the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual. Did all nodes that were not OOS prior to the ring failure restore? Yes: Proceed to Step 8. No: Proceed to next step.
5. Attempt to reinitialize the ring. Perform a level-4 initialization (see the proper application in the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual).
NOTE: For additional information on the initialization levels, refer to ``Initialization,'' Part 4 of Chapter 6, Diagnostic User's Guide.
Does the ring initialize? Yes: Proceed to next step. No: Proceed to Step 9.
6. Are all nodes that were not previously OOS prior to the ring failure restored to service? Yes: Proceed to Step 8. No: Proceed to next step.
7. For all nodes that were not previously OOS before the ring failure, perform an unconditional RST. See Chapter 6, Diagnostic User's Guide.
8. Are there any other nodes OOS left on the ring? No: DONE. Yes: Determine the ring condition (single node isolation, multiple node isolation, etc.) and proceed to that condition's maintenance approach presented in this chapter.
9. If the system still does not initialize after the level-3 and level-4 initialization attempts, call the CTS.
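The yes/no branching in the steps above can be laid out as a small transition table, which makes it easier to see where each answer leads. The Python sketch below covers only the steps reproduced here. Treating the first "Does the ring initialize?" question as Step 2 is an assumption, since the earlier steps are not shown, and the names FLOW and walk are hypothetical.

```python
DONE = "DONE"
CALL_CTS = "call the CTS"
OTHER_APPROACH = "proceed to the matching maintenance approach"

# Step number -> next step, or {"yes": ..., "no": ...} for a question.
FLOW = {
    2: {"yes": 3, "no": 5},   # Does the ring initialize? (assumed Step 2)
    3: {"yes": 8, "no": 4},   # Are the previously in-service nodes restored?
    4: {"yes": 8, "no": 5},   # After unconditional RST, did they all restore?
    5: {"yes": 6, "no": 9},   # Level-4 initialization: does the ring initialize?
    6: {"yes": 8, "no": 7},   # Are the previously in-service nodes restored?
    7: 8,                     # Unconditional RST, then continue to Step 8
    8: {"yes": OTHER_APPROACH, "no": DONE},  # Any other nodes OOS?
    9: CALL_CTS,
}


def walk(answers, start=2):
    """Follow the flow using the given yes/no answers; return the path."""
    answers = iter(answers)
    step, path = start, [start]
    while isinstance(step, int):
        nxt = FLOW[step]
        if isinstance(nxt, dict):
            nxt = nxt[next(answers)]
        step = nxt
        path.append(step)
    return path


# First initialization fails, level 4 succeeds, nodes restore, none left OOS.
print(walk(["no", "yes", "yes", "no"]))  # prints [2, 5, 6, 8, 'DONE']
```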
CAUTION: Care must be exercised when using the RGRASP tool. Improper use of RGRASP can result in program mutilation or excessive utilization of system resources. Both of these consequences of improper use of the tool can lead to call processing downtime and therefore interrupt the operation of a node on the ring or the whole ring.
Feature Description
The RGRASP tool can:
- Set (allow) breakpoints (a breakpoint corresponds to the address of the first byte of a target process instruction).
- Clear breakpoints.
- Report on current status for specified breakpoints.
- Inhibit breakpoints.
- Load a specified RGRASP utility variable (UVAR).
- Dump a specified RGRASP UVAR.
- Load a specified node with data.
- Dump the contents of a specified address in a given node.
- Direct the loading of an address.
- Dump the contents of a specified Application Processor or Node Processor register.
Software Impact
This feature does not impact customer engineerable software resources on APs. This feature could impact customer engineerable software resources on NPs, dependent on memory size.
Software Description
The software consists of the following processes:
RGP_KER: This is a UNIX process kernel for the feature. It acts as the interface between the AM (RG_CFT and RG_PRT) and the ring node (monitor) processes.
RGP_CFT: This UNIX process handles input commands from the craft shell. It parses and performs some preliminary checking on the input command. Then it relays the command to the RG_KER process for further processing.
RGP_PRT: This UNIX process handles printing of output.
monitor: This system process performs the actual operations required to handle breakpoints, memory dumping, and memory loading. It communicates with the RGP_KER.
User Profile
This feature and its associated input commands are intended for use by technicians in conjunction with the CTS.
Initial Setup
First, determine the address in memory that requires investigation. This can be done by using the latest PR/PK listings provided. This address may be provided by the CTS.
Determine which processor should be looked at. In the case of the DLN, there is an active and a standby processor. Use the OP:SLK command or poke the 118 page to determine this. As a precaution, it is a good idea to set breakpoints in only one processor at a time.
Setting a Breakpoint
You can set a breakpoint in a program using the WHEN:RUTIL input command. Before this can be done, the opcode (OPC) must be known. To verify the OPC, use the DUMP:RUTIL command to dump the memory at the breakpoint address. If the expected OPC does not match the dump output, then the listings do not match the memory. This discrepancy should be cleared up before continuing the procedure. One possible explanation is that the node software is out of date. To eliminate this possibility, you can remove and restore the target node (the node in which the breakpoint is to be set). Doing this will ensure that the newest version of code has been pumped from disk. You can use the RMV:LN and RST:LN commands or the 118 poke to achieve this. After the node has been pumped, try dumping the breakpoint address again. If it still does not match, you know the listings are out of date. In this case, you should stop and get a current listing before proceeding.
The WHEN:RUTIL command allows you to specify actions (commands) to be executed when the breakpoint you set fires. The input message manual page for WHEN:RUTIL defines the actions. Up to 24 actions may be specified in the action list for a single breakpoint. The action list must be terminated by an END:WHEN command. The action list can contain only the END:WHEN command, in which case you will simply know whether a piece of code is being executed. Only five breakpoints can be set in any one ring node processor.
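The limits stated above (an action list terminated by END:WHEN, up to 24 actions per breakpoint, and five breakpoints per ring node processor) can be expressed as a small bookkeeping sketch in Python. The class and method names are hypothetical, the action strings are treated as opaque labels rather than real command syntax, and the choice not to count END:WHEN toward the 24-action limit is an assumption.

```python
MAX_ACTIONS = 24      # from the text: up to 24 actions per breakpoint
MAX_BREAKPOINTS = 5   # from the text: five breakpoints per ring node processor


class BreakpointTable:
    """Hypothetical bookkeeping for one ring node processor."""

    def __init__(self):
        self.action_lists = {}  # breakpoint address -> list of action labels

    def set_breakpoint(self, address, actions):
        """Record a breakpoint after checking the stated limits."""
        if not actions or actions[-1] != "END:WHEN":
            raise ValueError("action list must be terminated by END:WHEN")
        # Assumption: END:WHEN itself does not count toward the limit.
        if len(actions) - 1 > MAX_ACTIONS:
            raise ValueError("more than 24 actions in one action list")
        is_new = address not in self.action_lists
        if is_new and len(self.action_lists) >= MAX_BREAKPOINTS:
            raise ValueError("five breakpoints are already set")
        self.action_lists[address] = list(actions)


table = BreakpointTable()
# An action list containing only END:WHEN simply shows the code was reached.
table.set_breakpoint(0x1000, ["END:WHEN"])
print(len(table.action_lists))  # prints 1
```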
Loading Memory
You can load memory with the LOAD:ADDR, LOAD:WORD, LOAD:SHORT, or LOAD:BYTE commands within the WHEN:RUTIL command or with the LOAD:RUTIL command. Details on the use of these commands are provided under ``Input Messages.''
CAUTION: Loading memory may drastically change program execution. If not done properly, this can interrupt or degrade service; for example, calls may be lost.
The RGRASP tool has WRITE permissions to all parts of available memory. This makes the tool powerful but dangerous. No OPC checking is performed; it is possible to specify the wrong address and overwrite the wrong data. If you should overwrite the wrong data and the original contents cannot be loaded, the ring node should be removed and restored (pumped) to get an original disk copy back. To perform the remove and restore, the RMV:LN and RST:LN commands should be used. After a load, you should use the DUMP:RUTIL command to verify the new contents in memory. Registers can be loaded only during breakpoint action lists (WHEN:RUTIL command).
Reading Memory
Dumping memory is a fairly straightforward and safe operation. You need only the address to dump. You can dump memory with the DUMP:ADDR or DUMP:REG commands within the WHEN:RUTIL command or with the DUMP:RUTIL command. RGRASP allows 468 bytes to be dumped in one operation. The output is hexadecimal. You can dump memory either higher or lower than the starting address with the DUMP:RUTIL command. A range of addresses may also be specified with DUMP:RUTIL. Registers can be read only during breakpoint action lists (WHEN:RUTIL command).
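Because a single dump operation is limited to 468 bytes, a larger region has to be read as several consecutive dumps. The Python sketch below only works out the starting address and byte count of each dump; the function name is hypothetical, and it does not generate DUMP:RUTIL syntax.

```python
MAX_DUMP_BYTES = 468  # from the text: limit for one dump operation


def plan_dumps(start, length):
    """Split a region into (address, byte_count) pieces of at most 468 bytes."""
    if length <= 0:
        raise ValueError("length must be positive")
    chunks = []
    offset = 0
    while offset < length:
        count = min(MAX_DUMP_BYTES, length - offset)
        chunks.append((start + offset, count))
        offset += count
    return chunks


# 1000 bytes starting at 0x1000 need three dump operations.
for address, count in plan_dumps(0x1000, 1000):
    print(hex(address), count)
# prints:
# 0x1000 468
# 0x11d4 468
# 0x13a8 64
```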
Feature Activation
You can activate the feature; that is, execute one or more of its functions by using any of the following input commands:
Feature Deactivation
You can deactivate the feature; that is, clear all breakpoints in a specified node with the CLR:RUTIL command. You can clear a specific breakpoint in a specified node with the CLR:RUTILFLAG command. You can temporarily disable or inhibit all breakpoints in a specified node with the INH:RUTIL command. You can temporarily disable or inhibit a specific breakpoint in a specified node with the INH:RUTILFLAG command.
Measurement
No measurements are provided as part of the RGRASP tool.
Maintenance/Troubleshooting Impact
The RGRASP tool is a debugging tool for CNI ring nodes. It is usable only at nodes that are active from an IMS viewpoint, such as the IMS ACT state. Nodes that are quarantined or isolated cannot be accessed with RGRASP. There are no new diagnostics related to this tool.
RGRASP breakpoints are affected by CNI initialization levels as follows:
Level              Effect
0, 1, FPI, 2, 3    None
4                  Clears all breakpoints
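The table above reduces to a single rule: only a level-4 initialization clears breakpoints. A minimal Python sketch of that rule follows; the function name and the representation of breakpoints as a list of addresses are assumptions for illustration.

```python
NON_CLEARING_LEVELS = {"0", "1", "FPI", "2", "3"}
CLEARING_LEVELS = {"4"}


def breakpoints_after_init(level, breakpoints):
    """Return the breakpoints that remain after a CNI initialization."""
    level = str(level).upper()
    if level in CLEARING_LEVELS:
        return []                 # level 4 clears all breakpoints
    if level in NON_CLEARING_LEVELS:
        return list(breakpoints)  # lower levels have no effect
    raise ValueError(f"unknown initialization level: {level}")


print(breakpoints_after_init(3, [0x1000, 0x1040]))  # prints [4096, 4160]
print(breakpoints_after_init(4, [0x1000, 0x1040]))  # prints []
```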
Recording
This tool has no impact on recording.
CAUTION:
Incorrect use of these commands may interrupt operation of a node on the ring or the whole ring. READ EACH PURPOSE CAREFULLY.

1. ALW:RUTIL or ALW:RUTILFLAG
The first command allows all breakpoints in the specified node; the second allows a specific breakpoint in the specified node.
2. CLR:RUTIL or CLR:RUTILFLAG
The first command clears all breakpoints in the specified node; the second clears specific breakpoints in the specified node.
3. DUMP:ADDR
Dumps the contents of the specified address in the given node. This command is allowed only within a WHEN:RUTIL command <action-list>.
4. DUMP:REG
Dumps the contents of the specified Application or Node Processor register in the given node. This command is allowed only within a WHEN:RUTIL command <action-list>.
5. DUMP:RUTIL
Dumps the contents of memory at the address range given at the specified node. It can also dump the contents of memory starting at the given address for the specified number of bytes. Currently a maximum length of 468 bytes is allowed for a single dump operation. A formatted output of the node's memory contents will follow this input command.
6. DUMP:UVAR
Dumps the contents of the specified RGRASP UVAR. This command is allowed only within a WHEN:RUTIL command <action-list>.
7. INH:RUTIL or INH:RUTILFLAG
The first command inhibits all breakpoints in the specified node; the second inhibits specific breakpoint(s) in the specified node.
8. LOAD:ADDR
Loads the specified address with the specified data. This command is allowed only within a WHEN:RUTIL command <action-list>.
9. LOAD:BYTE
Loads the address in the given node with the specified data. This command is allowed only within a WHEN:RUTIL command <action-list>.
10. LOAD:REG
Loads an Application or Node Processor register with the specified data in the given node. This command is allowed only within a WHEN:RUTIL command <action-list>.
11. LOAD:RUTIL
Loads the address at the given node with the specified data. The maximum number of data items allowed for loading is 128 bytes or 32 4-byte words.
There must be a one-to-one correspondence between the length of the data to be written and the data provided. If there are 3 bytes of data to be written, three data entries must be specified. Similarly, if there are five words to be written, five data entries must be specified.
12. LOAD:SHORT
Loads the address in the given node with the specified data. This command is allowed only within a WHEN:RUTIL command <action-list>. The address provided is expected to be on a 2-byte boundary. The data provided is expected to be a 2-byte value.
13. LOAD:UVAR
Loads the specified RGRASP UVAR with the specified data. This command is allowed only within a WHEN:RUTIL command <action-list>.
14. LOAD:WORD
Loads the address in the given node with the specified data. This command is allowed only within a WHEN:RUTIL command <action-list>. The address provided is expected to be on a 4-byte boundary for an AP or a 2-byte boundary for an NP. The data provided is expected to be a 4-byte value.
15. OP:RUTIL or OP:RUTILFLAG
The first command outputs the status of all breakpoints in the specified node; the second outputs the status of a specific breakpoint in the specified node.
16. WHEN:RUTIL <action list> END:WHEN!
Sets a RGRASP breakpoint in the specified node along with a specified action-list to be performed by the node when the breakpoint fires. Current <action-list> items available are:
ALW:RUTIL
ALW:RUTILFLAG
DUMP:ADDR
DUMP:REG
DUMP:UVAR
INH:RUTIL
INH:RUTILFLAG
LOAD:ADDR
LOAD:BYTE
LOAD:REG
LOAD:SHORT
LOAD:UVAR
LOAD:WORD
For more specific instructions on these items, see preceding listings for specific commands, or refer to the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
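Since RGRASP performs no checking on loads, it can help to check a planned load against the stated limits before entering it. The following Python sketch illustrates the LOAD:RUTIL size limit and one-to-one data rule, and the LOAD:SHORT and LOAD:WORD boundary expectations. The function names, argument names, and the "byte"/"word" unit strings are hypothetical; they are not RGRASP syntax.

```python
MAX_LOAD_BYTES = 128  # LOAD:RUTIL limit when loading bytes
MAX_LOAD_WORDS = 32   # LOAD:RUTIL limit when loading 4-byte words


def check_load_rutil(count, unit, data):
    """Return a list of problems found in a planned LOAD:RUTIL request.
    count is the number of items to write, unit is "byte" or "word",
    and data is the list of data entries to be supplied."""
    if unit not in ("byte", "word"):
        raise ValueError("unit must be 'byte' or 'word'")
    problems = []
    limit = MAX_LOAD_BYTES if unit == "byte" else MAX_LOAD_WORDS
    if count > limit:
        problems.append("too many items: %d > %d" % (count, limit))
    if len(data) != count:
        problems.append("expected %d data entries, got %d" % (count, len(data)))
    return problems


def is_aligned(address, command, processor="AP"):
    """Check the address boundary expected by LOAD:SHORT and LOAD:WORD.
    processor is "AP" or "NP"; other commands have no stated boundary."""
    if command == "LOAD:SHORT":
        boundary = 2
    elif command == "LOAD:WORD":
        boundary = 4 if processor == "AP" else 2
    else:
        boundary = 1
    return address % boundary == 0
```

An empty list from check_load_rutil means the request satisfies both the size limit and the one-to-one rule; it says nothing about whether the address itself is the intended one.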
Output Messages
The following output messages are associated with the RGRASP tool. For more information about each of these messages, refer to the 401-610-057 Output Message Manual.

1. ALW RUTIL or ALW RUTILFLAG
Prints in response to an ALW:RUTIL or ALW:RUTILFLAG command. Indicates the action that has occurred as a result of the command.
2. CLR RUTIL or CLR RUTILFLAG
Prints in response to a CLR:RUTIL or CLR:RUTILFLAG command. Indicates the action that has occurred as a result of the command.
3. DUMP RUTIL
Prints in response to a DUMP:RUTIL command. Indicates the action that has occurred as a result of the command.
4. INH RUTIL or INH RUTILFLAG
Prints in response to an INH:RUTIL or INH:RUTILFLAG command. Indicates the action that has occurred as a result of the command.
5. LOAD RUTIL
Prints in response to a LOAD:RUTIL command. Indicates the action that has occurred as a result of the command.
6. OP RUTIL or OP RUTILFLAG
Prints in response to an OP:RUTIL or OP:RUTILFLAG command. Indicates the action that has occurred as a result of the command.
7. REPT RGP PRT
Prints when anomalies occur within the print process of the RGRASP tool. Indicates the kind of anomaly that has occurred.
8. REPT RUTIL
This message has 40 formats. Formats [1] through [15] report an error condition encountered by the RGRASP RGP_KER process. Formats [16] through [40] print in response to the firing of a breakpoint.
9. WHEN RUTIL
Prints in response to a WHEN:RUTIL command.
Audits
The RGRASP tool does not affect any audits.
Critical Events
The RGRASP tool does not affect any critical events.
Support Tools
The RGRASP tool is a new support tool.
5
Contents
Introduction
Critical Event Message Output
    Logging Critical Events
    Short Form CNCE Message
    Long Form CNCE Message
    Using the CHG:CEPARM Command
CNCE Descriptions
Introduction
CCS Network Critical Events (CNCE) are predefined events that are considered indicators of abnormal network operation. They are of importance to network operation and to the proper functioning of the office. Both on-site and support system personnel must be immediately aware of events affecting the CCS network. CNCE messages are output as these critical events occur and are referred to as on-occurrence autonomous messages. CNCE messages are output as critical events occur in the office or as network events are recognized and acted upon. There are approximately 70 critical events in a system. Some critical events pertain to the CCS network in general, while others have significance to the. A CNCE could represent an occurrence, the beginning of some state, or the ending of some state. Events indicating the beginning or ending of a state should occur in pairs. A critical event never represents a length of time. The naming convention used for critical events is similar to the naming convention used for measurements. It is as follows:
The mnemonic represents as closely as possible the actual event. The mnemonic is derived from a set of abbreviations representing typical signaling events. These abbreviations are combined to describe the event.
The suffix E means the state indicated by the mnemonic has ended.
Names may include letters, digits, or special characters.
Names are unique and contain no more than 12 characters.
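The naming rules above can be illustrated with a short Python sketch that checks the 12-character limit and pairs an "ended" event with the event that began the state. The function names are hypothetical. Because a name may end in E for other reasons, the sketch treats a name as an end event only when the name without its final E is also a known event name.

```python
MAX_NAME_LENGTH = 12  # names contain no more than 12 characters


def is_valid_name(name):
    """Check only the length rule; letters, digits, and special
    characters are all permitted in a name."""
    return 0 < len(name) <= MAX_NAME_LENGTH


def begin_event_for(name, known_names):
    """Return the begin-event name that this end event closes, or None
    if the name is not the E-suffixed form of a known event."""
    if name.endswith("E") and name[:-1] in known_names:
        return name[:-1]
    return None
```

For example, with C6EMR and C6EMRE both in the known set, begin_event_for pairs C6EMRE with C6EMR, which is the begin and end pairing the naming convention describes.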
The names given to critical events are used by the Measurement Output Control Table (MOCT), which is described in the "Measurement Output Control Table" section in the. At the end of this section are tables providing explanations of each critical event by name.
Identification of the event that occurred (the CNCE name)
When the event occurred (may be set to network or local time)
Identification of the peripheral units involved, if required.
The critical event handler immediately generates a CNCE message. The CNCE message is generated in two forms: short form and long form (see the REPT CNCE message in the ). The CNCE message is automatically recorded in the CNCE log file, first, using the long form. Then, it is output to the appropriate users in the forms specified in the CET. The CNCEs are output at the MROP locally and are sent to various support system centers over BX.25 links. For more information on the CNCE message forms, see the REPT CNCE message in the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.

The CNCE log file is a circular file stored on disk (/etc/log/CNCELOG). The file contains a minimum of 90 minutes of the most recent CNCE messages. The messages in the log file can be retrieved. The file can be output using the OP:LOG:CNCELOG UNIX system Real Time Reliable (RTR) command (see the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual). Support system users cannot use this command over BX.25 sessions.
Figure 5-1. CNCE Messages
For CNCE messages related to PBX links, both long and short forms may contain circuit pack and port identification and diagnostic code.
CNCE Descriptions
The event names appearing in CNCE output messages are derived from the MOCT and are defined as shown in Table 5-1. The descriptions are presented alphabetically by event name. The table shows the information provided by the CNCE message. The field, shown in parentheses after the event name, is the group-member number, the point code, or the link set. Often, an occurrence not only causes a CNCE message but is also counted as a measurement. Some of the critical events in the table can be better understood by referring to corresponding measurements in the first part of this chapter. That part contains a table with more detailed descriptions of certain events. Some measurement names should be similar to the critical event names.

NOTE: The "C6" or "C7" at the beginning of a CNCE name identifies the event as either CCIS6 or CCS7 link related. The "CP" or "CT" at the beginning of a CNCE name identifies it as PBX node/link related. Others are per office events.
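The prefix rule in the NOTE can be expressed as a small lookup. The following Python sketch is illustrative only; the function name and the returned category strings are assumptions chosen for this example.

```python
def classify_cnce(name):
    """Classify a CNCE name by its first two characters, following the
    prefix rule for C6, C7, CP, and CT names."""
    prefix = name[:2]
    if prefix == "C6":
        return "CCIS6 link"
    if prefix == "C7":
        return "CCS7 link"
    if prefix in ("CP", "CT"):
        return "PBX node/link"
    return "per office"
```

A filter built on such a function could, for instance, separate CCS7 link events from PBX events when scanning retrieved log output.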
Table 5-1.
CNCE Descriptions

DESCRIPTION: Change back from a failure that is not a declared failure. This is an automatic change back to a link that previously did an automatic changeover and then restored. The change back must normally occur within 3 minutes of the changeover. If the LI reports a long key exchange is taking place, this time period is extended to 6 minutes. This event occurs for all automatic change backs exclusive of the C6ACBFLD event. Refer to the L6ACO_ measurements for a description of the changeover/change back sequence. This event is usually preceded by a C6ACO_ event.

EVENT: C6ACBFLD (gg-mm)
DESCRIPTION: Automatic change back from declared failure. This event indicates that the link is declared failed, has recovered, and traffic has been routed back to the link. This event is preceded by one of the C6FLD_ events (see those descriptions for more information on declared failure). Note that if a link is in the MOOS state and an emergency condition automatically forces the link back into service (called preemption), the C6MCB event occurs rather than this event.

EVENT: C6ACOCOV (gg-mm)
DESCRIPTION: Automatic changeover initiated by the far end. A changeover involves transferring signaling messages from the unavailable link to some other link. For example, in the case of a B-link, the changeover results in messages being routed to the mate link, and in the case of an A-link, the changeover results in messages being routed to a C-link. When the changeover message is received from the far end, the following occurs:
1. The link is removed from service.
2. No new messages are given to the link. New messages are diverted to the mate link or C-link.
3. Messages remaining in the transmit buffers are retrieved and an attempt is made to transmit these messages on some other link.
4. Only synchronization messages are sent to the far end.
5. The link switches VFLs and attempts to synchronize.
6. If acceptable, the link is proven in (from 3 to 15 seconds required) and restored. Messages are routed back (referred to as change back).
Both VFLs are tested alternately until one syncs. If the link cannot change back within 3 minutes (or 6 minutes if a long key exchange is involved), it is declared failed. Refer to the L6ACO_ measurements for more information.

EVENT: C6ACOER (gg-mm)
DESCRIPTION: Automatic changeover error threshold has been exceeded. The error rate monitor in the LI maintains a "leaky bucket" count of the number of SUs received in error during normal operation and also a linear count of SUs received in error during prove-in. If either count exceeds some threshold, the error is reported to the node. The node then reports this event, and alternate synchronization and changeover messages are sent to the far end (the far end recognizes this as a changeover request). Similar actions to those described for the C6ACOCOV event are taken.

EVENT: C6BOFX (gg-mm)
DESCRIPTION: Transmit buffer overflow begins (this occurs only for the telephone message transmit buffer). This event indicates that message(s) have been discarded because the buffer is full. The message is discarded and this event is reported on the first attempt to transmit a message with the buffer full. As long as the buffer is full, messages may be discarded. This event is not reported again at least until buffer overload ends (indicated by the C6BOLXE event). This event should be preceded by the C6BOLX event.
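The "leaky bucket" count mentioned in the C6ACOER description can be pictured with the following Python sketch. It is a generic leaky-bucket counter, not the LI implementation: the class name, the threshold, and the leak interval are assumptions, since this document does not state the actual values or the exact leak rule.

```python
class LeakyBucketMonitor:
    """Generic leaky-bucket error counter (assumed parameters).

    The count rises by one for each SU received in error and leaks by
    one after every leak_interval SUs received, so occasional errors
    drain away while a burst of errors drives the count upward."""

    def __init__(self, threshold, leak_interval):
        self.threshold = threshold
        self.leak_interval = leak_interval
        self.count = 0
        self._since_leak = 0

    def receive(self, in_error):
        """Record one received SU; return True once the count exceeds
        the threshold."""
        if in_error:
            self.count += 1
        self._since_leak += 1
        if self._since_leak >= self.leak_interval:
            self._since_leak = 0
            if self.count > 0:
                self.count -= 1
        return self.count > self.threshold
```

With a threshold of 3 and a leak interval of 10, four consecutive errored SUs trip the monitor, whereas errors spaced widely enough apart never do.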
DESCRIPTION: Transmit buffer overload begins (only the telephone message transmit buffer). The number of signal units in the buffer has reached the threshold for congestion controls to be activated. This event is reported only once when the threshold has been reached and not again at least until the overload ends. When the overload occurs, the node returns selected outgoing messages to their originations. The originators of these messages in turn control their traffic towards the node experiencing buffer overload. This mechanism is called selected return, and consists of the following:
Return some direct signaling messages.
Discard all IAMs and COTs and return message refusal to the sending office.
Send a group signaling congestion message to all offices that send messages on this link.
Every second, the node checks to see if buffer occupancy has dropped to an abatement threshold (see the C6BOLXE event description). When that occurs, the overload has ended. Should the link remain overloaded for one minute, it is declared failed.

EVENT: C6BOLXE (gg-mm)
DESCRIPTION: Transmit buffer overload ends. This event indicates that the number of signal units in the transmit buffer has dropped to the abatement threshold after an overload. The node checks the buffer occupancy once each second. When occupancy has reached the abatement threshold, selective message return is ended and this event is reported. Both overload and overflow are considered ended when this event occurs.

EVENT: C6DOC0 (gg-mm)
DESCRIPTION: Broadcast the remove dynamic overload controls message. These messages are in response to messages from end offices requesting the application or removal of a particular DOC state. The corresponding C6DOC_ event occurs when the message is received. The request results in a DOCx message being transmitted backwards for all bands that can send messages to the congested office. The messages are sent on each "trigger" band to the far end offices. The request may be received on a CCS7 link if virtual links are assigned. Those far end offices then apply the controls to all bands associated with the trigger bands. All DOCx messages are one signal unit in length. Two minutes after receiving the last message, an end office automatically removes the controls. The DOC0 broadcast is an explicit request for the end office to remove the controls.
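The transmit buffer overload behavior described for the C6BOLX and C6BOLXE events (onset threshold, once-per-second abatement check, declared failure after one minute) can be sketched as a small state tracker. In the following Python sketch the function name and the numeric thresholds passed in are assumptions; this document does not give the threshold values.

```python
def track_overload(occupancy_per_second, onset, abatement, fail_after=60):
    """Walk one buffer-occupancy sample per second and return a list of
    (second, event) pairs. Overload begins when occupancy reaches the
    onset threshold, ends when it drops to the abatement threshold, and
    the link is declared failed after fail_after seconds of overload."""
    events = []
    overloaded = False
    seconds_overloaded = 0
    for second, occupancy in enumerate(occupancy_per_second):
        if not overloaded:
            if occupancy >= onset:
                overloaded = True
                seconds_overloaded = 0
                events.append((second, "C6BOLX"))
        elif occupancy <= abatement:
            overloaded = False
            events.append((second, "C6BOLXE"))
        else:
            seconds_overloaded += 1
            if seconds_overloaded >= fail_after:
                events.append((second, "link declared failed"))
                break
    return events
```

Because the onset threshold is higher than the abatement threshold, occupancy that hovers between the two does not produce repeated begin and end reports.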
DESCRIPTION: Broadcast the dynamic overload control 1 message. The least severe control. DOC1 and DOC2 are progressive controls used when the congested office is only slightly overloaded or is recovering from a failure. They allow CCS messages to be slowly restored to (or removed from) the affected office. For a description of the broadcast mechanism, refer to the C6DOC0 event.

DESCRIPTION: Broadcast the dynamic overload control 2 message. Refer to the C6DOC1 description.

DESCRIPTION: Broadcast dynamic overload control message to a far end office. The most severe control. Caused by an emergency restart due to a received processor outage. This DOC message is broadcast every minute until congestion is relieved. It stops all CCS messages to the congested office. See the C6DOC0 event for a description of the broadcast mechanism.

EVENT: C6EMR (gg-mm)
DESCRIPTION: Emergency restart (EMR) begins. The specified link failed at the near end causing a complete failure of banded signaling between this office and the other office. This affects banded signaling, but if a particular office contains only one link, other types of signaling may be affected. If another path is available, the signaling load is transferred to the other link and an EMR condition is not triggered. When the last link in the C-link pool or set fails, emergency restarts are triggered on many A, B, and D-links. Refer to the EMR_ measurement descriptions. Since signaling messages cannot be routed over the affected link, alternate link messages may be lost (such as banded messages). Selective return is used so some direct signaling messages are returned to their originators. The end of the EMR condition is indicated by the C6EMRE event.

EVENT: C6EMRE (gg-mm)
DESCRIPTION: Emergency restart ends. The link restoral causes an automatic status update for the affected link, bands, and routes. This event indicates the end of the EMR condition on the specified link (regardless of what triggered the EMR).

EVENT: C6EMRPO (gg-mm)
DESCRIPTION: Emergency restart due to processor outage begins. The specified link receives a processor outage message from the far end while its mate is unavailable. This results in DOC3 messages being broadcast to all offices that could send messages to this link. See the C6EMR event for further description.

EVENT: C6FLDCOL (gg-mm)
DESCRIPTION: Declared link failure due to a 1-minute continuous receive buffer overload. If there is not an EMR, a changeover is initiated. The link is removed from service and is diagnosed.
DESCRIPTION: Declared link failure due to an automatic changeover initiated by the far end. The changeover lasted more than 3 minutes (or 6 minutes if a long key exchange is involved). Actions are taken as described under the C6FLDCOL event except no diagnostics are attempted and the changeover (the C6ACOCOV event) precedes this event.

EVENT: C6FLDER (gg-mm)
DESCRIPTION: Declared link failure due to error threshold exceeded. This is caused by an excessive number of received SUs in error. Actions are taken as described under the C6FLDCOV event except the changeover (the C6ACOER event) precedes this event.

EVENT: C6FLDPCR (gg-mm)
DESCRIPTION: Declared link failure due to continuous (lasting 30 seconds) far end processor congestion. This event occurs only on A-links. Actions are taken as described under the C6FLDCOL event. The C6PCR description (that event precedes this event) shows how a processor congestion is detected.

EVENT: C6FLDSNT (gg-mm)
DESCRIPTION: Declared link failure due to a sanity check failure. This failure is due to either software or hardware problems causing abnormal node operation. Automatic diagnostics then attempt to determine the problem. Actions are taken as described under the C6FLDCOL event.

EVENT: C6MCB (gg-mm)
DESCRIPTION: Manual change back from manual changeover. This event occurs either due to manually restoring the link or due to preemption of the MOOS state by an emergency condition. In the latter case, this event may be preceded by a C6EMR_ event on the mate link. Refer to the L6MCO_ measurements for a description of the changeover/change back sequence.
DESCRIPTION: Far end manual changeover request has been received. A changeover involves transferring signaling messages from the unavailable link to some other link, usually due to a need for link changes or maintenance. For example, in the case of a B-link, the changeover results in messages being routed to the mate link, in the case of an A-link, the changeover results in messages being routed to a C-link, and in the case of a C-link, it results in messages being load balanced over the other available C-links. The changeover request may be denied if the mate link is out-of-service or the C-link pool is unable to handle the additional load. When the request is received, the following occurs (if the request is accepted):
1. A manual changeover acknowledgment is sent to the far end, and the link is removed from service.
2. No new messages are given to the link. New messages are diverted to the mate link or C-link.
3. Messages remaining in the transmit buffers are retrieved, and an attempt is made to transmit these messages on some other link.
Refer to the L6MCO_ measurements for more information.

EVENT: C6MCON (gg-mm)
DESCRIPTION: Near end manual changeover due to local maintenance action. The maintenance and routing actions taken when this event occurs are similar to those taken for the C6MCOF event, except, before diverting messages to the other link, a manual changeover request is sent to the far end (not an acknowledgment). Upon receipt of an acknowledgment from the far end, the link is removed from service and the diversion is done. Refer to the L6MCO_ measurements for more information.

EVENT: C6PCR (gg-mm)
DESCRIPTION: Far end 1STP processor congestion event begins. This event occurs only on A-links. It indicates that the base call-processing cycle of the congested office exceeded a specified value for three consecutive cycles. The node uses selective message return to limit traffic to the congested office (described under the C6BOLX event). If a congestion message is received at least every 8 to 10 seconds for 30 seconds, the link is declared failed. The event occurs once when the message is first received and not again at least until congestion ends (indicated by the C6PCRE event).

EVENT: C6PCRE (gg-mm)
DESCRIPTION: End of received processor congestion. If more than 10 seconds elapse between congestion messages, the event is considered ended.
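The timing rule in the C6PCR and C6PCRE descriptions (congestion messages arriving no more than about 10 seconds apart, sustained for 30 seconds, lead to a declared failure; a longer gap ends the congestion) can be sketched as follows. The Python function name and its return strings are hypothetical, and the sketch simplifies the "8 to 10 seconds" wording to a single 10-second maximum gap.

```python
def congestion_outcome(message_times, max_gap=10, fail_after=30):
    """Decide how a processor congestion episode resolves.

    message_times holds the arrival times, in seconds and in ascending
    order, of the congestion messages received on one link. Returns
    "failed" if messages kept arriving with gaps of at most max_gap
    until fail_after seconds had passed, otherwise "ended"."""
    if not message_times:
        raise ValueError("at least one congestion message is required")
    start = message_times[0]
    previous = start
    for t in message_times[1:]:
        if t - previous > max_gap:
            return "ended"   # gap too long: congestion ends (C6PCRE)
        if t - start >= fail_after:
            return "failed"  # sustained congestion (C6FLDPCR)
        previous = t
    # No further messages arrived, so the gap eventually exceeds max_gap.
    return "ended"
```

The first message in the list corresponds to the single C6PCR report; later messages only extend or end the episode.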
DESCRIPTION: Adjacent processor outage begins (a PRO has been received). This indicates that the far end office is undergoing initialization or is overloaded. The far end LI goes into the processor outage send mode. In this mode, processor outage (PRO) signal units are transmitted in a continuous stream. This end treats the problem as a link failure (causes a changeover). DOC3 is broadcast every 60 seconds on links to connected offices that go into EMR due to the PROs being received on this link. The DOC message continues until synchronism is restored on this link. This is indicated by no more PROs. This event occurs once when the PRO is first received, and not again until the outage ends. This is indicated by the C6PORE event. The C6DOC3 event occurs every 60 seconds as shown above.

EVENT: C6PORE (gg-mm)
DESCRIPTION: Adjacent processor outage ends. This event occurs when the far end stops sending PRO, synchronism is regained, and the link is restored.

EVENT: C7ALCIF (gg-mm)
DESCRIPTION: Automatic link check (ALC) failure. When a link is declared failed (a C7FLD_ event), the ALC is initiated. If the ALC is not successful within 15 seconds from the link failure, this event occurs.

EVENT: C7ACB00 (gg-mm)
DESCRIPTION: Change back from a failure that is not a declared failure. This is an automatic change back to a link that previously did an automatic changeover and then was restored. The change back must normally occur within 3 minutes of the changeover. If the LI reports a long key exchange is taking place, this time period is extended to 10 minutes. This event occurs for all automatic change backs exclusive of the C7ACBFLD event. Refer to the L7ACO_ measurements for a description of the changeover/change back sequence. This event is usually preceded by a C7ACO_ event.

EVENT: C7ACBFLD (gg-mm)
DESCRIPTION: Automatic change back from declared failure. This event indicates that the link is declared failed, has recovered, and traffic has been routed back to the link. This event is preceded by one of the C7FLD_ events (see those descriptions for more information on declared failure). Note that if a link is in the MOOS state and an emergency condition automatically forces the link back into service (called preemption), the C7MCB event occurs rather than this event.
DESCRIPTION: Automatic changeover initiated by the far end. A changeover involves transferring signaling messages from the unavailable link to other links. These could be any links in the combined link set or C-links. In the case of a C-link failing, the changeover results in messages being load balanced over the other available C-links. The changeover message and the acknowledgment are both sent on some other link in the specified link set. When the changeover order is received from the far end, this event occurs and either a changeover or emergency changeover is initiated. An emergency changeover is done when the far end indicates that messages were received out of sequence or when the link node is out-of-service. The following is the changeover sequence:
1. The link is removed from service and no new messages are given to the link node (message handling pauses).
2. A changeover acknowledgment is sent to the far end on some other link in the set. Messages remaining in the transmit and retransmit buffers are retrieved and are transmitted in sequence on other links. An emergency changeover does not attempt the retrieval from the retransmit buffer (if the link node is out-of-service or the link failed due to a near end PRO, no retrieval is done).
3. Message handling resumes with new messages to the other links.
4. Only synchronization messages are sent on this link.
In the case of an automatic changeover, the link changes back when sync is regained. Then it is proven in (from 3 to 15 seconds required) and restored. CCS messages are routed back to the restored link. If the link cannot sync and change back within 3 minutes (or 10 minutes if a long key exchange is involved), it is declared failed.

EVENT: C7ACOER (gg-mm)
DESCRIPTION: Automatic changeover error threshold has been exceeded. The error rate monitor in the LI has reported excessive signal unit errors. The monitor is described in more detail under the C6ACOER event. Similar actions to those described for the C7ACOCOV event are taken.
DESCRIPTION: Declared link failure due to a 1-minute continuous receive buffer overload. This event is followed by a changeover (assuming it is not denied due to a blocked path). The link is removed from service and is diagnosed.

EVENT: C7FLDCOV (gg-mm)
DESCRIPTION: Declared link failure due to an automatic changeover initiated by the far end. The changeover lasted more than 3 minutes (or 10 minutes if a long key exchange is involved). Actions are taken as described under the C7FLDCOL event except no diagnostics are attempted and the changeover (the C7ACOCOV event) precedes this event.

EVENT: C7FLDER (gg-mm)
DESCRIPTION: Declared link failure due to error threshold exceeded. This is caused by an excessive number of received SUs in error. Actions are taken as described under the C7FLDCOV event except the changeover (the C7ACOER event) precedes this event.

EVENT: C7FLDSNT (gg-mm)
DESCRIPTION: Declared link failure due to a sanity check failure. This failure is due to either software or hardware problems causing abnormal node operation. Automatic diagnostics attempt to determine the problem. Actions are taken as described under the C7FLDCOL event.

DESCRIPTION: Transmit buffer level 1 congestion ends. Buffer occupancy has dropped below the threshold for level 1 abatement after transmit buffer congestion. Messages are not being discarded.

DESCRIPTION: Transmit buffer level 2 congestion ends. Buffer occupancy has dropped below the threshold for level 2 abatement after transmit buffer congestion. The node reverts to level 1 discard.

DESCRIPTION: Transmit buffer level 3 congestion ends. Buffer occupancy has dropped below the threshold for level 3 abatement after transmit buffer congestion. The node reverts to level 2 discard.
DESCRIPTION: Transmit buffer level 1 congestion discard begins. Buffer occupancy has reached the threshold for level 1 discard to be initiated. The SS7 discard strategy (for levels 1, 2, or 3) is as described below:
The node first checks the priority of a message before transmitting it. The priority is contained in the service information octet field and is compared with the congestion state of the transmit buffer. If the priority is less than the congestion level, the message is removed and a return message may be sent. The return message is sent only if the return indicator in the received message is set. If the message to be transmitted is a unit data type SCCP message, a UDS message is created and returned to the originator. If the priority of the message is equal to or greater than the congestion level, it is transmitted.
This event does not occur again at least until buffer occupancy drops below the level 1 abatement threshold (signaled by the C7LCABM1X event).

DESCRIPTION: Transmit buffer level 2 congestion discard begins. Buffer occupancy has reached the threshold for level 2 discard to be initiated. The C7LCDIS1X event describes the discard strategy.

DESCRIPTION: Transmit buffer level 3 congestion discard begins. Buffer occupancy has reached the threshold for level 3 discard to be initiated. At this point, all messages are being discarded. The C7LCDIS1X event describes the discard strategy.

EVENT: C7LCON1X (gg-mm)
DESCRIPTION: Transmit buffer level 1 congestion onset begins. The congestion onset thresholds (levels 1, 2, or 3), are higher than the corresponding abatement levels but lower than the corresponding discard levels. At each onset level, the node reports the congestion state to the central processor. Network management messages (transfer controlled) are then broadcast to adjacent signaling points to limit messages to the affected node. To avoid further congestion of the transmit buffer, the far end initiates the discard strategy used by nodes at the discard level (described under the C7LCDIS1X event). If the node remains in the same congestion level (1, 2, or 3) for 60 seconds, it is taken OOS and diagnosed.

EVENT: C7LCON2X (gg-mm)
DESCRIPTION: Transmit buffer level 2 congestion onset begins. Messages are being discarded according to the level 1 strategy. The node reports the level 2 congestion state to the central processor. Actions are taken as described under the C7LCON1X event.
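The SS7 discard strategy described under the level 1 congestion discard event compares each message's priority with the congestion level of the transmit buffer. The following Python sketch expresses that comparison. The function name, argument names, and returned strings are hypothetical, and the sketch assumes that a UDS message is returned for an SCCP unit data message only when the return indicator is set, which is one reading of the description.

```python
def handle_message(priority, congestion_level,
                   return_indicator=False, is_sccp_unitdata=False):
    """Apply the priority-versus-congestion-level discard rule to one
    message and report what happens to it."""
    if priority >= congestion_level:
        # Priority equal to or greater than the congestion level.
        return "transmit"
    if not return_indicator:
        return "discard"
    if is_sccp_unitdata:
        return "discard, return UDS"
    return "discard, send return message"
```

At congestion level 0 (no congestion) every message is transmitted, and as the level rises progressively higher priorities are needed for a message to pass.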
DESCRIPTION: Transmit buffer level 3 congestion onset begins. Messages are being discarded according to the level 2 strategy. The node reports the level 3 congestion state to the central processor. Actions are taken as described under the C7LCON1X event.

EVENT: C7LSF (linkset)
DESCRIPTION: Link set failure begins. When the last available link in the set fails, this event occurs. If the failure of the link set results in failure of the associated combined link set, another C7LSF CNCE message is output with the combined link set identification. The end of this event is signaled by the C7LSFE event. The CLF_ measurements describe the various link set failure scenarios. If this failure causes some destination to become isolated from this office (for example, all signaling paths to a signaling point have failed), this event is accompanied by a C7SPI event.

DESCRIPTION: Link set failure ends. When any link in the set restores, this event occurs.

DESCRIPTION: Manual change back from manual changeover. This event occurs either due to manually restoring the link (at the near end or far end) or due to preemption of the MOOS state by an emergency condition. When the link regains sync, a change back declaration is sent to the far end. The link state is changed to OOS and new messages are diverted back to the link. Until all acknowledgments are received, these messages are not transmitted; messages are diverted to other links if the link fails to return to service. Note that this event occurs before the link is made available.

EVENT: C7MCOF (gg-mm)
DESCRIPTION: Far end manual changeover request has been received, usually due to a need for link changes or maintenance. The far end has requested and permission has been granted to initiate a changeover. Either a changeover or emergency changeover is initiated. The sequence is described under the C7ACOCOV event.

EVENT: C7MCON (gg-mm)
DESCRIPTION: Near end manual changeover due to local maintenance action. The changeover could be denied if removing the link from service would cause the far end to become inaccessible. This end requests permission from the far end to initiate a changeover (the far end recognizes a C7MCOF event). If the far end grants permission, either a changeover or emergency changeover is initiated. The sequence is described under the C7ACOCOV event.

EVENT: C7POR (gg-mm)
DESCRIPTION: Adjacent processor outage event begins (the end of this event is signaled by the C7PORE event). Refer to the C6POR description.
Issue 16.0
401-661-045
Table 5-1.
CNCE Descriptions (Page 12 of 14)
DESCRIPTION

Adjacent processor outage event ends. Refer to the C6PORE description.

An adjacent signaling point isolation begins due to local failure. A link failed, causing a complete failure of all signaling paths to the indicated destination from this office. This condition is usually accompanied by a C7LSF event. The end is indicated by the C7SPIE event. See the SPI_ measurements for more detail.

C7SPIE (pointcode)
Adjacent signaling point isolation ends. Some failed path to the indicated destination has been restored due to a local link set recovery. This event indicates that the destination is no longer isolated from this office.

C7SPIPO (pointcode)
An adjacent signaling point isolation begins due to a far end processor outage. A link failed due to receiving PROs from the far end, causing a complete failure of all signaling paths to the indicated destination from this office. See also the C7SPI description. The end of this condition is indicated by the C7SPIE event.

C7SSAF (subsystem)
Received a subsystem allowed message. Receiving an SSA message indicates that the subsystem (either local or nonlocal) has become allowed. SSA messages sent by the far end are in response to subsystem status test messages. This event (and the C7SSPF event described below) occurs only if both of the following two conditions are met:
- Indicated subsystem is in the same region, and
- It is simplex, or duplex with the mate subsystem prohibited.

C7SSPF (subsystem)
Received a subsystem prohibited message. SSP messages sent by the far end are in response to signaling messages destined for the indicated prohibited subsystem. Receiving an SSP message indicates that the subsystem (either local or nonlocal) has become prohibited, causing it to be blocked. The C7SSAF description details certain conditions for the generation of this event.

Automatic return to service from a declared failure.

Automatic link check (ALC) failure on the specified link. When a link is declared failed (the CPFLD or CPFLDNS event), the ALC is initiated. If the ALC is not successful within 15 seconds from the link failure, this event occurs.
Table 5-1.
CNCE Descriptions (Page 13 of 14)
DESCRIPTION

A SERV message exchange has failed on the specified D-channel link. The SERV message is sent several times and, if no acknowledgment is received (the T321 timer expires), this event occurs. This indicates either a layer 3 protocol problem, a provisioning problem, or a hardware failure other than facility failure. This event occurs when a link attempts to transition to the IS state. Note that since the SERV message exchange is not done for standby links, a standby link could have latent layer 3 problems.

A duplex D-channel link has transitioned to the standby state. If the link was in declared failure, this event indicates that it has recovered.

The mate D-channel link fails while the indicated link is in the manual out-of-service (MOOS) state. No switchover occurs until manual action removes the MOOS state. If the link remains in MOOS, the system attempts to recover the mate link normally. This event is a warning of possible service outage.

Declared link failure (this only applies to PBX links with diagnostic). The link state is changed to OOS and the central processor is informed. For a D-channel link failure, this event indicates a signaling path failure; therefore, any associated B-channels are removed from service. There are various reasons for the failure, including:
- Layer 1 protocol down (probably failure of DS0 or DS1, no explicit indication of L1 failure)
- Layer 2 protocol down (protocol exceptions and inability to establish link within 90 sec.)
- DDS code received
- Disconnect message received from far end
- Level 2 error threshold exceeded (usually facility problems).

Nonsignaling declared link failure of a mated link. The signaling path is still available on the backup link. The link state is changed to OOS. For the reasons for this event, see the CPFLD description.

Manual out-of-service (MOOS) begins.

Manual out-of-service ends.
Table 5-1.
CNCE Descriptions (Page 14 of 14)
DESCRIPTION

Red alarm declared (near end DS1 facility failure). This is the second most severe trouble condition for a PBX node. This event obstructs sensing of the yellow alarm condition. Note that this means that there may be no explicit clearing of any yellow alarm in progress (normally indicated by the CTYELALC event).

Red alarm cleared. Any yellow alarm in progress is also cleared.

Yellow alarm declared.

Yellow alarm cleared.
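Several of the CNCE descriptions in Table 5-1 name a begin event and the event that signals its end (C7LSF/C7LSFE, C7SPI and C7SPIPO/C7SPIE, C7POR/C7PORE). The following is a minimal sketch, not part of the CNI software, of how a reader might pair such events when scanning a log. The tuple layout and the target identifiers (`"ls1"`, `"pc1"`) are assumptions made for illustration only.

```python
# Begin/end pairs taken from the Table 5-1 descriptions.
# C7SPIE ends both C7SPI and C7SPIPO conditions.
END_OF = {
    "C7LSFE": ("C7LSF",),
    "C7SPIE": ("C7SPI", "C7SPIPO"),
    "C7PORE": ("C7POR",),
}
BEGIN_EVENTS = {name for begins in END_OF.values() for name in begins}


def open_conditions(events):
    """Return the set of (begin_event, target) pairs not yet ended.

    `events` is an iterable of (event_name, target) tuples in arrival
    order. The target string is a hypothetical stand-in for the link set,
    point code, or gg-mm value printed with the event.
    """
    active = set()
    for name, target in events:
        if name in BEGIN_EVENTS:
            active.add((name, target))
        elif name in END_OF:
            for begin in END_OF[name]:
                active.discard((begin, target))
    return active


log = [
    ("C7LSF", "ls1"),
    ("C7SPI", "pc1"),
    ("C7LSFE", "ls1"),
    ("C7SPIPO", "pc2"),
]
still_open = open_conditions(log)
```

In this sample, the link set failure has ended, while the two signaling point isolation conditions remain open until a C7SPIE event arrives for each destination.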
6
Contents
Introduction
Overview
  Diagnostics
  Hardware and Interfaces
  System Maintenance Interfaces
Performing Diagnostics
  Diagnostic Message Structure
  System Diagnostics
    Use of DGN Commands
    Obtaining the Status of Diagnostics
    Node Diagnostic Phase Descriptions
  Circuit Pack Trouble Location Guide
  Diagnostic Listings
  Clearing Troubles Using the Diagnostic Listings
  LNs with Unequipped LI Boards - MV Updates
  Ring Node Addressing
  Automatic Diagnostics and Restorals
  Manual (Unit) Diagnostics
    Manual Diagnostics Using the 1106 Display Page
    Manual Diagnostics Using the DGN Command
    Manual Diagnostics Procedure Using the RST Command
  CDN-I Fault Isolation
  Panic Messages
  RAP Diagnostic Firmware
  Interactive Diagnostics
  Denied Diagnostic Requests
  Inhibiting Diagnostic Requests
Introduction
This chapter serves as an aid for performing diagnostics on ring nodes (RNs) in a Common Network Interface ring-based office. When diagnostics are performed, the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual should also be used. Diagnostics are performed both automatically and manually. Automatic diagnostics are performed by automatic ring recovery (ARR). For more information concerning ARR, refer to the "Maintenance Description" section in this manual. Manual diagnostics are performed with the aid of input messages at the Maintenance CRT (MCRT).
Overview
Diagnostics
Diagnostics serve two major purposes. First, diagnostics are run for fault detection and resolution, and are invoked by manual requests. Diagnostics are also invoked by error analysis programs as part of the automatic ring recovery (ARR) of a node that has been removed due to a fault condition. Second, diagnostics are invoked for the purpose of repair verification.
The CNI diagnostics provide diagnostic testing for the system. These diagnostics are performed in a manner similar to those of the 3B21D computer system, but diagnose totally different equipment. For a complete list and details on 3B21D computer diagnostics and UNIX system RTR, refer to the UNIX System RTR 3B20/3B21 Operators System Maintenance Manual, 254-303-106.
- Ring Peripheral Controller Nodes (RPCNs)
- Link Nodes (both LIN-E/SS7 and LI4S/SS7 nodes)
- Direct Link Nodes (DLNs)
- DLN30 nodes
- DLN60 nodes
- CDN-I, CDN-II, CDN-IIx, and CDN-III nodes
- MDL nodes
- Ethernet Interface Node(s) (EINs)
Very large scale integration (VLSI) is used for RNs. The VLSI ring node combines the two RIs and the NP of the ring node into one circuit pack called the IRN. The CNI utilizes a link interface to provide an interface between the ring and any ofce in the network, thus the name Common Network Interface. The CNI diagnostics primarily test this link interface. The following is a description of the ring nodes and their contents. NOTE: Parentheses () have been used throughout these circuit pack listings to designate that more than one type of circuit pack may exist for a particular ring node, depending upon which generic is being used (although it is preferred that the most
current circuit packs be in operation). For more information, refer to SD 3F019-02 (Application Schematic for CNI) for the features provided by each circuit pack.
Table 6-1. Discontinued Availability CP Listings
UNIT NAME RI0 RI1 NP IRN LI-E LI-E UPDATE CIRCUIT PACK UN122C UN123B TN922 UN303B TN917B TN1803
IRN RPC node
  Integrated ring node (IRN) UN303() (VLSI)
  Dual duplex serial bus selector (DDSBS) TN69B
  3B computer interface (3BI) TN914

IRN2 RPC node
  Integrated ring node (IRN2) UN304()
  Dual duplex serial bus selector (DDSBS) TN69B
  3B computer interface (3BI) TN914

IRN link (LIN-E/SS7) node
  Node processor (NP) TN922
  Integrated ring node (IRN) UN303() (VLSI)
  Not encrypted TN916 or encrypted TN917() or memory data link (MDL) TN1317

IRN link (LI4S/SS7) node
  Integrated ring node (IRN) UN303() (VLSI)
  4-Port Link Interface 0 (LI4 0) TN1316 (LI4S) (the TN1316 has an APA 12A CP, rear mount)

IRN DLNE node
  Integrated ring node (IRNB) UN303B (VLSI)
  Dual duplex serial bus selector (DDSBS) TN69B
  3B computer interface (3BI) TN914
IRN2 DLN30 node
  Integrated ring node (IRN2B) UN304B
  Dual duplex serial bus selector (DDSBS) TN69B
  3B computer interface (3BI) TN914
  Attached processor (AP) TN1630

IRN2 DLN60 node
  Integrated ring node (IRN2B) UN304B
  TN918
  TN1803
  TN1508
  Attached processor (AP) TN2522

IRN CDN-I node
  Integrated ring node (IRN) UN303()
  Node processor interface (NPI) TN1349
  3B15 computer line of boards:
  - Central controller cache (CCC) UN237(1) or UN626 for the 16-Mbyte memory board option
  - Central controller support (CCS) UN236(1) or UN625 for the 16-Mbyte memory board option
  - Main store controller (MASC) UN95(1-6) or UN507(1) for the 16-Mbyte memory board option
  - Main store array (MASA) TN56(1-48) or TN1398(1-8) for the 16-Mbyte memory board option
  - Power control interface and display (PCID) TN1128.

IRN2 CDN-II node
  Integrated ring node (IRN2B) UN304B
  Attached processor (AP) TN1630B

IRN2 CDN-IIx node
  Integrated ring node (IRN2B) UN304B
  Attached processor (AP) TN1720x
NOTE: The x represents boards lettered TN1720A through TN1720H depending upon the amount of memory installed. Each board has 32 Mbytes of memory.
IRN2 CDN-III node
  Integrated ring node (IRN2B) UN304
  TN918
  TN1803
  TN1508
  Attached processor (AP) TN2523

IRN MDL node (includes CSN, DSN, and ICN)
  Integrated ring node (IRN) UN303()/UN304
  MDL TN1640

IRN2 EIN node
  Integrated Ring Node (IRN) 2 UN304B
  TN4016 Paddleboard, 9822EB ED3F064-37 G80 cable.
An RPCN is a node where packetized information is removed from the ring and transferred to the 3B21D computer for processing, or reenters the ring after processing. It is the node on the ring where packetized information enters or exits a transmission facility. Both the RPCN and the DLNs are located in the RNF/C. DLNs function like RPCNs but have DMA capability. They contain the same circuit packs as an RPCN plus an attached processor (AP). CDN-I nodes are also located in the RNF/C. They are basically a VLSI ring node with a modified 3B15 computer as the user apparatus circuit. The Underwriters Laboratories (UL) listed RNF/C provides ring bus connections between the RNs, access to analog and digital facilities, and access to the 3B21D computer via the RPCNs.
communications for system control and display (C&D), input and output messages, and the 3B21D computer emergency action interface (EAI) control and display. Inputs entered at the MCRT are monitored via the CTS. The ROP provides hard copies of the MCRT input and output messages, report status information, fault conditions, audits, and diagnostic results. If remote maintenance is provided, it has the same terminal access and terminal capabilities as the on-site user. Because both the remote and local users have simultaneous access to the 3B21D computer, it is advised that diagnostic input requests be coordinated through the on-site MCRT user.
Performing Diagnostics
When performing manual RN diagnostics, input and output messages are entered and interpreted from the maintenance terminal. For this reason, basic terminal familiarization and operating knowledge is required. An understanding of input messages and knowledge of the message data fields and formats are also important. UNIX system Real Time Reliable (RTR) or UNIX system RTR Very Large Main Memory (VLMM) provides assistance to users for entering input messages. It can be used to complete input messages or to correct errors made by the user. Invalid values are rejected and accompanied by an appropriate error acknowledgment. Further help can be obtained by entering a question mark (?). A prompting mode can be used to lead the user through the input message. When a complete input message has been constructed, the user may either execute it or cancel it. The help session is then completed; that is, help is provided for only one input message at a time.
Action Field: An action verb (keyword) identifies the action the system should perform. This is a verb such as diagnose (DGN), inhibit (INH), remove (RMV), or restore (RST).
Identification Field: Consists of one, two, or three fields called subfields. These subfields are separated by semicolons (;), with each containing one or more keywords. The identification field aids in structuring the message to permit a complete specification, or provides other information further identifying the object of the action.
Data Field: This field is either null or composed of additional variable information pertaining to the message. This information is in keyword format with keywords separated by commas.
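The field layout described above can be illustrated with a short parsing sketch. It is only an illustration under stated assumptions: it assumes the three fields are separated by colons (as the sample DGN message in this chapter states), that subfields are separated by semicolons, and that keywords are separated by commas. The function name `split_message` and the node identifier `LN32 1` are hypothetical.

```python
def split_message(message):
    """Split an input message into action, identification, and data parts.

    Assumes colon-separated fields, semicolon-separated identification
    subfields, and comma-separated keywords. Missing fields are returned
    as empty lists.
    """
    parts = message.strip().split(":")
    action = parts[0]
    identification = parts[1] if len(parts) > 1 else ""
    data = parts[2] if len(parts) > 2 else ""

    subfields = []
    if identification:
        for subfield in identification.split(";"):
            subfields.append([kw.strip() for kw in subfield.split(",")])

    data_keywords = [kw.strip() for kw in data.split(",")] if data else []
    return {"action": action, "identification": subfields, "data": data_keywords}


parsed = split_message("DGN:LN32 1;RPT 2,RAW:PH 10")
```

Here `parsed["identification"]` holds two subfields: the object of the action and the option keywords that follow the semicolon.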
A general format for input messages and some output messages is shown in Figure 6-1.

[Figure 6-1. General Format for Input/Output Messages: ACTION field, IDENTIFICATION field (subfields naming the object, separated by semicolons), DATA field]

A typical diagnostic input message varies in length and field identifiers. The sample message below shows field separation and identification. Each field is separated by a colon (:) and square brackets [ ] indicate optional information.

DGN:NODExx y[;[RPT n][,RAW][,UCL]][:PH n [,TLP] | :TLP]

where:
DGN: = the action field
NODE = LN or RPCN
xx y[;[RPT n][,RAW][,UCL]][: = the identification field
PH n [,TLP] | :TLP] = the data field.
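As a rough illustration of how the optional parts of the sample DGN message combine, the sketch below assembles a command string from its pieces. It is not a tool supplied with the system. The function name `build_dgn`, its parameter names, and the example node values are assumptions; only the keywords (RPT, RAW, UCL, PH, TLP) and the separators come from the format shown above.

```python
def build_dgn(node, xx, y, rpt=None, raw=False, ucl=False, phase=None, tlp=False):
    """Assemble a DGN input message string from its optional parts.

    node   -- node type keyword, for example "LN" or "RPCN"
    xx, y  -- the two values that follow the node keyword (NODExx y)
    rpt    -- value for the optional RPT keyword
    phase  -- value for the optional PH keyword (an int, or a string
              such as "10-13" for a range)
    """
    command = f"DGN:{node}{xx} {y}"

    # Optional keywords in the identification field follow a semicolon
    # and are separated by commas.
    options = []
    if rpt is not None:
        options.append(f"RPT {rpt}")
    if raw:
        options.append("RAW")
    if ucl:
        options.append("UCL")
    if options:
        command += ";" + ",".join(options)

    # The data field is either "PH n", "PH n,TLP", or "TLP" alone.
    if phase is not None:
        command += f":PH {phase}"
        if tlp:
            command += ",TLP"
    elif tlp:
        command += ":TLP"
    return command


simple = build_dgn("LN", 32, 1)
full = build_dgn("RPCN", 0, 0, rpt=3, raw=True, phase=20, tlp=True)
```

Building the string from named arguments keeps the separators (colon, semicolon, comma) in one place, which is where hand-typed messages most often go wrong.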
System Diagnostics
Diagnostics may be performed manually. However, when the system detects one or more faults, diagnostics are performed automatically (ARR). The diagnostics in this section cover only the manual portions of system diagnostics, and present information to familiarize the user with the various diagnostic (DGN) input commands, phase descriptions, message interpretation, and other diagnostic information. For more information concerning ARR, refer to the "Maintenance Description" section in this manual. DLNs and CDN-I use the same commands as LNs for diagnostics.
COMMANDS
DGN:nodexx y
DGN:nodexx y:PH a
DGN:nodexx y:PH a-b
DGN:nodexx y;RPT n
DGN:nodexx y;RAW
DGN:nodexx y;UCL
DGN:nodexx y:TLP
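The `PH a` and `PH a-b` forms in the command list select one phase or a range of phases. The sketch below shows one way to expand such an argument into individual phase numbers. It assumes that `a-b` is an inclusive range and that phases not defined for the node type are simply skipped; both points are assumptions, not statements about how the system behaves. The phase set used in the example is the one listed for the IRN LN (LIN-E/SS7) node in Table 6-5.

```python
# Phases listed for the IRN LN (LIN-E/SS7) node in Table 6-5.
LIN_E_PHASES = {1, 2, 10, 12, 13, 20, 21, 39, 40, 41, 47, 48}


def expand_phase_arg(arg, defined=None):
    """Expand a PH argument ("a" or "a-b") into a sorted list of phases.

    If `defined` is given, only phases in that set are returned
    (assumption: undefined phases inside a range are skipped).
    """
    text = arg.strip()
    if "-" in text:
        first_text, last_text = text.split("-", 1)
        first, last = int(first_text), int(last_text)
    else:
        first = last = int(text)
    if first > last:
        raise ValueError(f"phase range is reversed: {arg!r}")
    phases = range(first, last + 1)
    if defined is not None:
        return [p for p in phases if p in defined]
    return list(phases)


selected = expand_phase_arg("10-21", LIN_E_PHASES)
```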
Another means of obtaining a status report of the system is by calling up the 1105 or 1106 display page from the MCRT. See the Trouble Indicators, Error Analysis, and Display Pages in this manual.
Table 6-4.
IRN and IRN2 RPCN Node Diagnostic Phases
PHASE DESCRIPTION
Tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0. Phase 1 also tests that any interframe buffers and all IRN boards in the isolated segment are equipped in accordance with ECD data, and that any interframe buffers in the isolated segment exhibit the proper data storage capacity.
Tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1. Phase 2 also tests that any interframe buffers and all IRN boards in the isolated segment are equipped in accordance with ECD data, and that any interframe buffers in the isolated segment exhibit the proper data storage capacity.
Tests the interface between the Dual Serial Channel (DSCH) and the DDSBS.
Tests the interface between the DDSBS and the 3BI.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Runs off-line CU to DDSBS tests (demand phase only).
Tests the NP RAM memory, NP parity checker, and generator circuitry.
Tests everything but the memory in the node-processor function. THIS PHASE IS NOT VALID FOR IRN2.
Tests part of both RAC circuits, and the RAC to the NP interface.
Partially tests the interface between both RACs and the ring bus.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
PHASE 01
02
10 11 12 13 14 20 21 (IRN only) 30 32 33
Table 6-5.
IRN LN (LIN-E/SS7) Node Diagnostic Phases
PHASE  DESCRIPTION
01     Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
02     Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
10     Tests part of both RACs, the RAC to the NP interface, and the interface between both RACs and the ring bus. Checks the capacity of the interframe buffers associated with the node under test.
12     Verifies that RAC0 can detect bad parity in a ring message.
13     Verifies that RAC1 can detect bad parity in a ring message.
20     Tests the NP RAM memory, NP parity checker and generator circuitry.
21     Tests the NP programmable master and slave interrupt controllers and associated circuitry. It also tests the NP programmable interval timer circuitry.
39     Verifies the ability of the node to read, write and propagate a maximum-length long message (demand only phases for transition load).
40     Tests hardware in the LI board or the LI-NP interface.
41     Tests the sanity of the microprocessor and the ROM.
47     Tests the 2.4 and 4.8 data service units, along with their respective VFLA or DSA units. CCS7 will ATP by default.
48     Ensures that the firmware and the hardware on the LI board will function as a whole.
IRN LN (LI4S/SS7) Node Diagnostic Phases (Page 1 of 2)
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the NP interface, and the interface between both RACs and the ring bus. Checks the capacity of the interframe buffers associated with the node under test.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the NP RAM memory, NP parity checker and generator circuitry.
THIS PHASE IS NOT VALID FOR IRN2.
Tests the NP programmable master and slave interrupt controllers and associated circuitry. It also tests the NP programmable interval timer circuitry.
Verifies the ability of the node to read, write and propagate a maximum-length long message (demand only phases for transition load).
Tests the LI4 0 local RAM and the Dual Port RAM from the Node Processor. The LI4 is held reset.
Tests the NP-LI4 0 interface and DPRAM from the NP view while the microprocessor on the Link Interface board is running.
This phase is downloaded to the LI4 0 via the NP.
Tests the 8086 microprocessor on the LI4 0 board. A subset of the instruction set of the 8086 is exercised to verify that the microprocessor operates properly.
This phase is downloaded to the LI4 0 via the NP.
Tests the DPRAM and the parity check circuit.
This phase is downloaded to the LI4 0 RAM via the NP.
02
10
12 13 20 21 (IRN Only) 39 50 51
52
53
IRN LN (LI4S/SS7) Node Diagnostic Phases (Page 2 of 2)
PHASE DESCRIPTION
Tests the Programmable Interrupt Controllers and the Programmable Interval Timers. This phase is downloaded to the LI4 0 RAM via the NP.
Tests the DMA, Serial Communications Chip (SCC), part of the Programmable Interrupt Controller, timers, and the formatting chips of LI4 0 when the LI4D is tested (TN1315).
No tests are run; ATPs are by default. If TLP is run, the APA13 and the DSA (Z2556L1A/2) are noted but no tests are run. Thus, when link maintenance is performed, this equipment must be taken into consideration.
56
Table 6-7.
IRN DLNE Node Diagnostic Phases
PHASE  DESCRIPTION
01     Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
02     Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
10     Tests part of both RACs, the RAC to the NP interface, and the interface between both RACs and the ring bus. Checks the capacity of the interframe buffers associated with the node under test.
12     Verifies that RAC0 can detect bad parity in a ring message.
13     Verifies that RAC1 can detect bad parity in a ring message.
20     Tests the NP RAM memory, NP parity checker and generator circuitry.
21     Tests the NP programmable master and slave interrupt controllers and associated circuitry. It also tests the NP programmable interval timer circuitry.
30     Tests the interface between the DSCH and the DDSBS.
31     Tests the interface between the DDSBS and the 3BI.
32     Tests the ability of NP to go insane and set the Interrupt Request Flag when the 3BI has an error.
33     Tests the interface between the 3BI and the NP.
34     Runs off-line CU to DDSBS tests. (Demand phase only.)
35     Cooperates with the 3B21D driver to test the DMA capability via the 3BI.
40     Tests the hardware in the LI board or the LI-NP interface.
41     Tests the sanity of the microprocessor and the ROM.
42     Tests the interface between DMA and 3BI.
IRN2 DLN30 Node Diagnostic Phases (Page 1 of 2)
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the IRN2 interface, and the interface between both RACs and the ring bus.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
Tests the interface between the DSCH and the DDSBS.
Tests the interface between the DDSBS and the 3BI.
Tests the ability of NP to go insane and set the Interrupt Request Flag when the 3BI has an error.
Tests the interface between the 3BI and the NP.
Runs off-line CU to DDSBS tests. (Demand phase only.)
Cooperates with the 3B21D driver to test the DMA capability via the 3BI.
Tests the shared static memory in the AP30 from the IRN2 side.
02*
IRN2 DLN30 Node Diagnostic Phases (Page 2 of 2)
PHASE DESCRIPTION
Tests the shared static memory from the AP30 side, the local parity error snapshot register, and the main 16 Megabytes of DRAM on the AP30.
Tests the DMA capability via the 3BI. The DMA is from the 3B21D to/from the AP Dual Port Memory (DPM).
Tests the 4 D-channel data links on the AP30.
42* 43
* Automatic Demand-Only
Table 6-9.
IRN2 DLN60 Node Diagnostic Phases
PHASE  DESCRIPTION
01*    Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
02     Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
10     Tests part of both RACs, the RAC to the IRN2 interface, and the interface between both RACs and the ring bus.
12     Verifies that RAC0 can detect bad parity in a ring message.
13     Verifies that RAC1 can detect bad parity in a ring message.
20     Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
40     Tests the shared static memory in the AP60 from the IRN2 side.
41     Tests the shared static memory from the AP60 side, the local parity error snapshot register, and the main 32 Megabytes of DRAM on the AP60.
Demand-only
IRN CDN-I Diagnostic Phases (Page 1 of 2)
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the NP interface, and the interface between both RACs and the ring bus. Checks the capacity of the interframe buffers associated with the node under test.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the NP RAM memory, NP parity checker and generator circuitry.
Tests the NP programmable master and slave interrupt controllers and associated circuitry. It also tests the NP programmable interval timer circuitry.
Tests the NPI from the IRN side.
Tests the CCS board.
Tests the MASC 0 memory group.
Tests the MASC 16 memory group.
Tests the CCC board.
Tests the NPI from the RAP side.
Tests the MASC 1 memory group.
02
10
12 13 20 21
40 42 43 43 (16 meg) 44 45 46
IRN CDN-I Diagnostic Phases (Page 2 of 2)
PHASE DESCRIPTION
Tests the MASC 2 memory group.
Tests the MASC 3 memory group.
Tests the MASC 4 memory group.
Tests the MASC 5 memory group.
Tests the MASC 6 memory group.
Tests the MASC 7 memory group.
Runs a comprehensive end-to-end test.
Tests the MASA 0.
Tests the MASA 1.
Tests the MASA 2.
Tests the MASA 3.
Tests the MASA 4.
Tests the MASA 5.
Tests the MASA 6.
Tests the MASA 7.
54 (16 meg) 55 (16 meg) 56* (16 meg) 57* (16 meg) 58* (16 meg) 59* (16 meg) 60* (16 meg) 61* (16 meg) * Demand-only
IRN2 CDN-II/CDN-IIx Diagnostic Phases (Page 1 of 2)
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the IRN2 interface, and the interface between both RACs and the ring bus.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
Tests the shared static memory in the AP30 from the IRN2 side.
Tests the shared static memory from the AP30 side, the local parity error snapshot register, and the main 16 Megabytes of DRAM on the AP30.
Tests the 4 D-channel data links on the AP30.
Tests the overall functionality of the mezzanine memory.
For CDN-II, tests the 1st 32 Mbytes of the mezzanine memory. For CDN-IIx, tests the 1st 32-Mbyte block of the mezzanine.
For CDN-II, tests the 2nd 32 Mbytes of the mezzanine memory. For CDN-IIx, tests the 2nd 32-Mbyte block of the mezzanine.
For CDN-IIx only, tests the 3rd 32-Mbyte block of the mezzanine.
02*
43 44 45
46
47
IRN2 CDN-II/CDN-IIx Diagnostic Phases (Page 2 of 2) PHASE DESCRIPTION For CDN-IIx only, tests the 4th 32-Mbyte block of the mezzanine. For CDN-IIx only, tests the 5th 32-Mbyte block of the mezzanine. For CDN-IIx only, tests the 6th 32-Mbyte block of the mezzanine. For CDN-IIx only, tests the 7th 32-Mbyte block of the mezzanine. For CDN-IIx only, tests the 8th 32-Mbyte block of the mezzanine.
Automatic. NOTE: For APX6.1 prior to Software Update that includes diagnostics for CDN-IIx, Phases 43 and 45 through 52 are demand-only phases; Phase 44 is an automatic phase. For APX6.1 with the Software Update that includes diagnostics for CDN-IIx and for APX7.0, Phase 43 does not apply; and Phases 44 through 52 are automatic phases.
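The release-dependent rule in the NOTE above can be written as a small lookup. This is only a sketch of the rule as stated in the NOTE; the function name `cdn_phase_mode` and the boolean parameter are hypothetical. The parameter is assumed to be true for APX6.1 with the Software Update that includes diagnostics for CDN-IIx and for APX7.0, and false for APX6.1 without that update.

```python
def cdn_phase_mode(phase, has_cdn_iix_update):
    """Return how a CDN-II/CDN-IIx phase in the range 43-52 is run.

    has_cdn_iix_update -- True for APX6.1 with the Software Update that
    includes diagnostics for CDN-IIx, and for APX7.0 (assumed flag).
    """
    if not 43 <= phase <= 52:
        raise ValueError("the NOTE covers only phases 43 through 52")
    if has_cdn_iix_update:
        # Phase 43 does not apply; phases 44 through 52 are automatic.
        return "not applicable" if phase == 43 else "automatic"
    # Without the update, only phase 44 is automatic.
    return "automatic" if phase == 44 else "demand-only"


before_update = [cdn_phase_mode(p, False) for p in (43, 44, 45)]
after_update = [cdn_phase_mode(p, True) for p in (43, 44, 45)]
```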
Issue 16.0
December 2000
6-21
401-661-045
IRN2 CDN-III Diagnostic Phases
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the IRN2 interface, and the interface between both RACs and the ring bus.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
Tests the shared static memory in the AP60 from the IRN2 side.
Tests the shared static memory from the AP60 side, the local parity error snapshot register, and the main 32 Megabytes of DRAM on the AP60.
Tests the database memory control circuits.
Tests the 1st 128 Mbytes of the AP60 0.5 Gbyte database memory array.
Tests the 2nd 128 Mbytes of the AP60 0.5 Gbyte database memory array.
Tests the 3rd 128 Mbytes of the AP60 0.5 Gbyte database memory array.
Tests the 4th 128 Mbytes of the AP60 0.5 Gbyte database memory array.
02
10 12 13 20 40 41
Demand-only.
IRN2 EIN Node Diagnostic Phases
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the IRN2 interface, and the interface between both RACs and the ring bus.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
Tests the shared static memory in the AP30 from the IRN2 side.
Tests the shared static memory from the AP30 side, the local parity error snapshot register, and the main 16 Megabytes of DRAM on the AP30.
Tests the 4 D-channel data links on the AP30.
Tests the overall functionality of the mezzanine memory.
For CDN-II, tests the 1st 32 Mbytes of the mezzanine memory. For CDN-IIx, tests the 1st 32-Mbyte block of the mezzanine.
For CDN-II, tests the 2nd 32 Mbytes of the mezzanine memory. For CDN-IIx, tests the 2nd 32-Mbyte block of the mezzanine.
For CDN-IIx only, tests the 3rd 32-Mbyte block of the mezzanine.
02*
43 44 45 46
47
Automatic.
IRN MDL (SCN, DSN, ICN) Diagnostic Phases
PHASE DESCRIPTION
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC0. Phase 1 also tests that a message can be relayed from the BISO node to the EISO node via the isolated segment over ring 0, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests that each node in the isolated segment is able to set and clear its data selector via hardware commands at RAC1. Phase 2 also tests that a message can be relayed from the EISO node to the BISO node via the isolated segment over ring 1, and that any interframe buffers in the isolated segment are equipped in accordance with ECD data and exhibit the proper data storage capacity.
Tests part of both RACs, the RAC to the NP interface, and the interface between both RACs and the ring bus.
Checks the capacity of the interframe buffers associated with the node under test.
Verifies that RAC0 can detect bad parity in a ring message.
Verifies that RAC1 can detect bad parity in a ring message.
Tests the IRN2 RAM memory, IRN2 parity checker and generator circuitry.
Requests download of diagnostic driver code to the IRN2 and initiates its execution to diagnose the Ethernet interface hardware. Testing ends at the loopback relay on the ELI circuit pack, CP TN4016.
* Automatic

Circuit Pack Trouble Location Guide
The following pages contain checklists of probable or suspected faulty circuit packs, to be used when a diagnostic phase has failed for a particular ring node. The listings are ordered from the most to the least probable cause of failure. When diagnosing ring nodes, if the diagnostic result returned is some-tests-failed (STF), refer to the Trouble Location CP List tables for the location of the faulty or suspected faulty CP(s).
The TLP option delivers the same information as these tables and can also be used to identify faulty or suspected faulty CPs. The TLP output is valid only for the first failing phase and only when all phases are run.
02*
10*
The TLP capability has been enhanced to provide more extensive on-line interpretation of the isolated segment diagnostic failure (phases 1 and 2). This assists in the direct localization of ring faults to nodes (or circuit packs) within a multinode isolated segment other than the node being diagnosed. Visual indicators in the form of LEDs located on the CPs can also be used to locate faulty CPs. For more information on visual indicators in this manual.
NOTE: Parentheses () have been used throughout these circuit pack listings to designate that more than one type of circuit pack may exist for a particular ring node, depending upon which generic is being used (although it is preferred that the most current circuit packs be in operation). (For more information, refer to "SD 3F019-02, the Application Schematic for CNI" for features provided by each circuit pack.)
Table 6-15. Discontinued Availability CP Listings
UNIT NAME    UPDATED CIRCUIT PACK
RI0          UN122C
RI1          UN123B
NP           TN922
IRN          UN303B
LI-E         TN917B
LI-E         TN1803
Table 6-16.
IRN and IRN2 RPC Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN303()/UN304B TN915/TN918 TN1508/TN1803 Ring Bus Cable UNIT NAME IRN/IRN2 IFB IFB RNF/C Same as Phase 01
02
rpc02.I
Same as Phase 01
Table 6-16.
IRN and IRN2 RPC Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK TN69B KBN15 (3B21D) UNIT NAME DDSBS DSCH 3BI DDSBS 3BI IRN/IRN2 3BI IRN/IRN2 DDSBS (Demand only Phase) Off-Line DSCH IRN/IRN2 IRN IRN/IRN2 IRN/IRN2 IFB IFB Same as Phase 32
11
rpc11.I
TN914 TN69B
12
rpc12.I
TN914 UN303()/UN304B
13
rpc13.I
TN914 UN303()/UN304B
14
rpc14.I
TN69B
KBN15 (3B21D) 20 21 30 32 rpci20.I rpci21.I rpci30.I rpc32.I UN303()/UN304B UN303()/UN304B UN303()/UN304B UN303()/UN304B TN915/TN918 TN1508/TN1803 33 rpc33.I Same as Phase 32
Table 6-17.
IRN LN (LIN-E/SS7) Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN303() TN915/TN918 TN1506/TN1508/TN1509 Ring Bus Cable UNIT NAME IRN IFB IFB RNF/C Same as Phase 01 IRN IRN IFB IFB Same as Phase 12 IRN IRN IRN LI-NE LI-E IRN Same as Phase 40
02 10 12
13 20 21 39 40
41
cBph1.41.I
Same as Phase 40
Table 6-17.
IRN LN (LIN-E/SS7) Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK TN916 TN917() TN919 2024-A, 2048-A TN922 LINK Cabling UNIT NAME LI-NE LI-E VFLA Data Sets NP
48
cBph8.48.I
VFLA
Z2466L1A/2 (CCS7) TN916 TN917() TN922 Link Cabling * Phase 47 - CCS7 will ATP by default. Phase 48 - test 47 will fail if Z24556L1A/2 is in Local Loop (LL).
Table 6-18.
IRN LN (LI4S/SS7) Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN303()/UN304B TN915/TN918 TN1506/TN1508/TN1509 Ring Bus Cable IRN/IRN2 IFB IFB RNF/C UNIT NAME
Table 6-18.
IRN LN (LI4S/SS7) Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK Same as Phase 01 UN303()/UN304B UN303()/UN304B TN915/TN918 TN1508/TN1803 Same as Phase 01 IRN/IRN2 IRN/IRN2 IFB IFB Same as Phase 12 IRN/IRN2 IRN IRN LI4S 0 Same as Phase 50 LI4S 0 LI4S 0 LI4S 0 LI4S 0 UNIT NAME
13 20 21 50
51 52 53 54 55 56
Same as Phase 50 TN1316 TN1316 TN1316 TN1316 ATPs are by default (APA13 and the DSA (Z2556L1A/2) are noted but no tests are run.
Table 6-19.
IRN DLNE Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN303()/UN304B TN915/TN918 TN1508/TN1803 Ring Bus Cable IRN/IRN2 IFB IFB RNF/C Same as Phase 01 IRN/IRN2 IRN/IRN2 IFB IFB Same as Phase 12 IRN/IRN2 IRN/IRN2 DDSBS DSCH 3BI DDSBS 3BI IRN/IRN2 3BI IRN/IRN2 UNIT NAME
02 10 12
13 20 21 30
31
iun31.l
TN914 TN69B
32
iun32.l
TN914 UN303()/UN304B
33
iun33.l
TN914 UN303()/UN304B
Table 6-19.
IRN DLNE Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK TN69B DDSBS (Demand only phase) Off-line DSCH Same as Phase 33 AP AP LI4E AP AP LI4E IRN AP AP LI4E UNIT NAME
KNB15 (3B21D) 35 40 iun35.I ap68.40.I Same as Phase 33 TN1340 (2 Meg) TN1641 (8 Meg) TN1630 (4ESS Only) 41 ap68.41.I TN1340 (2 Meg) TN1641 (8 Meg) TN1630 (4ESS Only) 42 ap68.42.I UN1340 (2 Meg) TN1340 (2 Meg) TN1641 (8 Meg) TN1630 (4ESS Only)
Table 6-20.
IRN2 DLN30 Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304 TN918 TN1803 TN1508 Ring Bus Cable IRN2 IFB-U IFB-4K/8 IFB-16/8 RNF/C Same as Phase 01 IRN2 IRN2 IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN/IRN2 DDSBS DSCH 3BI DDSBS 3BI IRN2 3BI IRN2 UNIT NAME
13* 20* 30
31
iun31.l
TN914 TN69B
32
iun32.l
TN914 UN304
33
iun33.l
TN914 UN304
Table 6-20.
IRN2 DLN30 Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK TN69B DDSBS (Demand only phase) Off-line DSCH Same as Phase 33 AP30 AP30 AP30 AP30 UNIT NAME
KNB15 (3B21D) 35 40* 41* 42* 43 iun35.I ap68.40.I ap60.41.I ap68.42.I Ii4e.43.I Same as Phase 33 TN1630B TN1630B TN1630B TN1630B
* Automatic Demand-Only
Table 6-21.
IRN2 DLN60 Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304 TN918 TN1803 TN1508 Ring Bus Cable IRN2 IFB-U IFB-4K/8 IFB-16/8 RNF/C Same as Phase 01 IRN2 UNIT NAME
02 10
iun02.l iuni10.
Table 6-21.
IRN2 DLN60 Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304B TN918 TN1803 TN1508 IRN2 IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN2 AP60 AP60 UNIT NAME
13 20 40 41
Table 6-22.
IRN CDN-I Manual Trouble Location CP List (Page 1 of 3) PROBABLE/SUSPECTED FAULTY PACK UN303 UN303B TN918 TN1803 TN1508 Ring Bus Cable IRN IRNB IFB-U IFB-4K/8 IFB-16/8 RNF/C Same as Phase 01 IRN IRNB UNIT NAME
02 10
iun02.I iuni10.I
Table 6-22.
IRN CDN-I Manual Trouble Location CP List (Page 2 of 3) PROBABLE/SUSPECTED FAULTY PACK UN303 UN303B TN918 TN1803 TN1508 IRN IRNB IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN IRNB Same as Phase 20 NPI CCS CCS16 MASA (0-7) MASC 0 MASA16 (0-7) MASC16 CCC CCC16 NPI MASA (0-7) MASC1 Same as Phase 46 Same as Phase 46 UNIT NAME
13 20
iun13.I iuni20.I
21 40 42
43
irap43.I
TN56 UN95
43 (16meg)
irap43_16.I
TN1398 UN507
44
irap44.I
UN237 UN626
45 46
irap45.I irap46.I
47 48
irap47.I irap48.I
Table 6-22.
IRN CDN-I Manual Trouble Location CP List (Page 3 of 3) PROBABLE/SUSPECTED FAULTY PACK Same as Phase 46 Same as Phase 46 Same as Phase 46 Same as Phase 46 all TN1398 TN1398 TN1398 TN1398 TN1398 TN1398 TN1398 TN1398 Same as Phase 46 Same as Phase 46 Same as Phase 46 Same as Phase 46 all MASA16 (0) MASA16 (1) MASA16 (2) MASA16 (3) MASA16 (4) MASA16 (5) MASA16 (6) MASA16 (7) UNIT NAME
DIAGNOSTIC PHASE PHASE 49 50 51 52 53 54 55 56* 57* 58* 59* 60* 61* * Demand-only TABLE irap49.I irap50.I irap51.I irap52.I irap53.I irap54.I irap55.I irap56.I irap57.I irap58.I irap59.I irap60.I irap61.I
Table 6-23.
IRN2 CDN-II/CDN-IIx Manual Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304 TN918 TN1803 TN1508 Ring Bus Cable IRN2 IFB-U IFB-4K/8 IFB-16/8 RNF/C Same as Phase 01 IRN2 IRN2 IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN2 AP30 AP30 AP30 AP30 AP30 AP30 AP30 UNIT NAME
Same as Phase 12 UN304 TN1630B(CDN-II) TN1720()(CDN-IIx) TN1630B(CDN-II) TN1720()(CDN-IIx) TN1630B TN1630B(CDN-II) TN1720()(CDN-IIx) TN1630B(CDN-II) TN1720()(CDN-IIx) TN1630B(CDN-II) TN1720()(CDN-IIx) TN1720() CDN-IIx
Table 6-23.
IRN2 CDN-II/CDN-IIx Manual Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK TN1720() CDN-IIx TN1720() CDN-IIx TN1720() CDN-IIx TN1720() CDN-IIx TN1720() CDN-IIx AP30 AP30 AP30 AP30 AP30 UNIT NAME
DIAGNOSTIC PHASE PHASE 48 49 50 51 52 * Automatic TABLE ap30.48.I ap30.49.I ap30.50.I ap30.51.I ap30.52.I
NOTE: For APX6.1 prior to Software Update that includes diagnostics for CDN-IIx, Phases 43 and 45 through 52 are demand-only phases; Phase 44 is an automatic phase. For APX6.1 with the Software Update that includes diagnostics for CDN-IIx and for APX7.0, Phase 43 does not apply; and Phases 44 through 52 are automatic phases.
Table 6-24.
IRN2 CDN-III Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304 TN918 TN1803 TN1508 Ring Bus Cable IRN2 IFB-U IFB-4K/8 IFB-16/8 RNF/C Same as Phase 01 IRN2 UNIT NAME
02 10
iun02.l iuni10.
Table 6-24.
IRN2 CDN-III Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304B TN918 TN1803 TN1508 IRN2 IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN2 AP60 AP60 AP60 AP60 AP60 AP60 AP60 UNIT NAME
13 20 40 41 44 45 46 47 48
iun13.l iuni20.l ap60.40I ap60.41I ap60.44I ap60.45I ap60.46I ap60.47I ap60.48I * Automatic
Same as Phase 12 UN304 TN2523 TN2523 TN2523 TN2523 TN2523 TN2523 TN2523
Table 6-25.
IRN2 EIN Node Trouble Location CP List (Page 1 of 2) PROBABLE/SUSPECTED FAULTY PACK UN304 TN918 TN1803 TN1508 Ring Bus Cable IRN2 IFB-U IFB-4K/8 IFB-16/8 RNF/C UNIT NAME
Table 6-25.
IRN2 EIN Node Trouble Location CP List (Page 2 of 2) PROBABLE/SUSPECTED FAULTY PACK Same as Phase 01 UN304B UN304B TN918 TN1803 TN1508 Same as Phase 01 IRN2 IRN2 IFB-U IFB-4K/8 IFB-16/8 Same as Phase 12 IRN2 ELI UNIT NAME
DIAGNOSTIC PHASE PHASE 02* 10* 12* TABLE iun02.l iuni10. iun12.l
Table 6-26.
IRN MDL (CSN, DSN, ICN) Trouble Location CP List PROBABLE/SUSPECTED FAULTY PACK UNIT NAME
02 10 12
Table 6-26.
IRN MDL (CSN, DSN, ICN) Trouble Location CP List PROBABLE/SUSPECTED FAULTY PACK Same as Phase 12 UN303()/UN304() UN303() TN1640 TN1640 Same as Phase 40 Same as Phase 40 TN1640 TN1640 Same as Phase 50 Same as Phase 50 Same as Phase 12 IRN/IRN2 IRN MDL_0 MDL_0 Same as Phase 40 Same as Phase 40 MDL_1 MDL_1 Same as Phase 50 Same as Phase 50 UNIT NAME
DIAGNOSTIC PHASE PHASE 13 20 21 (IRN only) 40 (IRN only) 40 (IRN2 only) 41 (IRN only) 41 (IRN2 only) 50 (IRN only) 50 (IRN2 only) 51 (IRN only) Demand Phase 51 (IRN2 only) Demand Phase TABLE iun13.l iuni20.l iuni21.I iun40.I i2mdI40.I iun41.I i2un41.I iun50.I i2mdI50.1 iun51.I i2mdI51.I
Diagnostic Listings
When diagnostic failures still exist after replacing hardware as recommended in the Manual Trouble Location Circuit Pack List tables, analysis of diagnostic test results is important. This is accomplished using the diagnostic output message and diagnostic listings (.l files), if available. The diagnostic listings are files that end with a .l suffix (such as iun01.l or rpc01.l). See the manual trouble location circuit pack list tables. Generally the first failing phase and the first few failing tests within that phase are useful for analysis. If this data is not on hand, run diagnostics using the RAW option to print all test failures at the ROP.
A diagnostic listing consists of a prologue, followed by one or more program units. Each program unit has a prologue, which gives information about what is tested, how the testing is done, and the hardware involved. The remainder of the program unit consists of the diagnostic command lines, comment lines, and lines that are the ASCII equivalent of the data found in the corresponding object file. The command lines direct the sequence of diagnostic test execution.
Each diagnostic command begins with a statement number. This is the statement number that is referred to in the interactive diagnostics (EX) input and output message (see Performing Diagnostics in this chapter), in early termination output messages, or in the DGN AUDIT RING output message.
Some diagnostic command lines are preceded by one or more comment lines. These are lines that begin with the character C. They are intended to give the purpose of the command line that follows them.
Each diagnostic command line is followed by a line that shows, in ASCII format, the data corresponding to the command that is contained in the associated executable object file. This line begins with the string * adr unless the command generates a test, in which case the line begins with the string * test. The test numbers in the diagnostic listings correspond to the test numbers in the diagnostic output messages. The only data on this line of importance to on-site users are the test numbers.
NOTE: For the rdgnrsl diagnostic command, a separate line is shown to illustrate that all failed test numbers that are returned from the NP are reported by adding 20 to the failed test number that is actually returned.
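The line conventions described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample lines in the usage note are hypothetical, and the exact column layout of a real .l file is not shown in this section, so the sketch relies solely on the leading characters the text names (a statement number, the character C, and the strings * adr and * test).

```python
RDGNRSL_OFFSET = 20  # per the NOTE: NP-returned failed test numbers are reported as number + 20


def classify_listing_line(line: str) -> str:
    """Classify one line of a diagnostic listing by its leading characters."""
    stripped = line.strip()
    if stripped.startswith("* test"):
        return "test-data"   # data line for a command that generates a test
    if stripped.startswith("* adr"):
        return "adr-data"    # data line for a command that generates no test
    if stripped[:1].isdigit():
        return "command"     # command lines begin with a statement number
    if stripped.startswith("C"):
        return "comment"     # comment lines begin with the character C
    return "other"


def reported_test_number(np_test_number: int) -> int:
    """Map a failed test number returned from the NP to the number that is reported."""
    return np_test_number + RDGNRSL_OFFSET
```

For example, a made-up line such as "C check ring parity" would be classified as a comment, and a failed test 5 returned from the NP would be reported as test 25.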
form for the LN must be changed. Therefore, if the LN is not equipped with an LI4 circuit pack, enter 0x3 in the MV field. If the LN is equipped with an LI4 circuit pack, enter 0x3d in the MV field.
Table 6-27.
GRP # 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 0 0 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272 288 304 320 336 352 368 384 400 416 432 448
Table 6-27.
GRP # 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 0 464 480 496 512 528 544 560 576 592 608 624 640 656 672 688 704 720 736 752 768 784 800 816 832 848 864 880 896 912 928
Table 6-27.
GRP # 59 60 61 62 63 0 944 960 976 992 1008
Table 6-28.
GRP # 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 0 000 010 020 030 040 050 060 070 080 090 0A0 0B0 0C0 0D0 0E0 0F0 100 110 120 130 140 150 160 170 180 190 1A0 1B0 1C0
Table 6-28.
GRP # 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 0 1D0 1E0 1F0 200 210 220 230 240 250 260 270 280 290 2A0 2B0 2C0 2D0 2E0 2F0 300 310 320 330 340 350 360 370 380 390 3A0
Table 6-28.
GRP # 59 60 61 62 63 0 3B0 3C0 3D0 3E0 3F0
Table 6-29.
GRP # 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 0 3072 3088 3104 3120 3136 3152 3168 3184 3200 3216 3232 3248 3264 3280 3296 3312 3328 3344 3360 3376 3392 3408 3424 3440 3456 3472 3488 3504 3520
Table 6-29.
GRP # 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 0 3536 3552 3568 3584 3600 3616 3632 3648 3664 3680 3696 3712 3728 3744 3760 3776 3792 3808 3824 3840 3856 3872 3888 3904 3920 3936 3952 3968 3984 4000
Table 6-29.
GRP # 59 60 61 62 63 0 4016 4032 4048 4064 4080
Table 6-30.
GRP # 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 0 C00 C10 C20 C30 C40 C50 C60 C70 C80 C90 CA0 CB0 CC0 CD0 CE0 CF0 D00 D10 D20 D30 D40 D50 D60 D70 D80 D90 DA0 DB0 DC0
Table 6-30.
GRP # 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 0 DD0 DE0 DF0 E00 E10 E20 E30 E40 E50 E60 E70 E80 E90 EA0 EB0 EC0 ED0 EE0 EF0 F00 F10 F20 F30 F40 F50 F60 F70 F80 F90 FA0
Table 6-30.
GRP # 59 60 61 62 63 0 FB0 FC0 FD0 FE0 FF0
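The four group tables above follow a regular pattern: the entry in column 0 rises by 16 per group, starting from 0 (Tables 6-27 and 6-28) or from 3072, hexadecimal C00 (Tables 6-29 and 6-30). The sketch below reproduces any column-0 entry from that pattern. The function names are hypothetical, and what the values represent is not stated in this part of the chapter, so treat this as a way to cross-check a table entry rather than a definition.

```python
STEP_PER_GROUP = 16      # difference between successive column-0 entries in the tables
LOW_BASE = 0             # first entry of Tables 6-27 (decimal) and 6-28 (hexadecimal)
HIGH_BASE = 3072         # first entry of Tables 6-29 (decimal) and 6-30 (hexadecimal, C00)


def column0_value(group: int, base: int = LOW_BASE) -> int:
    """Return the decimal column-0 entry for a group number (00 through 63)."""
    if not 0 <= group <= 63:
        raise ValueError("group number must be in the range 00 through 63")
    return base + STEP_PER_GROUP * group


def column0_hex(group: int, base: int = LOW_BASE) -> str:
    """Return the same entry as the three-digit hexadecimal string used in the tables."""
    return format(column0_value(group, base), "03X")
```

For group 28, for instance, this gives 448 (1C0) with the low base and 3520 (DC0) with the high base, matching the table rows.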
During a system-wide initialization.
When the ring maintenance state indicates that the ring is undergoing reconfiguration or is down.
5. Submit all conditional restorals under software known as ARR.
When a requested restoral is not successful, or the internal timer awaiting job completion expires, the following message is generated:
REPT ARR AUTORST FAILURE FOR aaaa b
where: aaaa b = identifying name of the node.
If the ECD restoral threshold is exceeded, the following output message is generated:
REPT ARR AUTORST THRESHOLD EXCEEDED FOR aaaa b
where: aaaa b = identifying name of the node.
If a time-out occurs while waiting for a reply message, this output message is generated:
REPT ARR AUTORST TIMEOUT AWAITING MIRA FOR aaaa b
where: aaaa b = identifying name of the node.
For additional information regarding the REPT ARR AUTORST messages, refer to the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
The following priorities determine the order in which nodes eligible for automatic restoral are served:
1. A nominated critical node (typically the BISO or EISO node)
2. Nodes with faulty ring interfaces
3. RPCNs eligible for unconditional restorals
4. RPCNs eligible for conditional restorals
5. Is eligible for unconditional restorals
6. Is eligible for conditional restorals.
For a more detailed description of automatic node restorals and ARR, refer to the "Maintenance Description" section in the 401-610-055 Input Message Manual.
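When scanning ROP output for the three REPT ARR AUTORST messages quoted above, a small pattern match is enough to separate the failure reason from the node name. The following is a minimal sketch: the class and function names are hypothetical, and it assumes the node name appears as two whitespace-separated fields (the "aaaa b" of the message templates) at the end of the message text.

```python
import re
from typing import NamedTuple, Optional

# The three reasons are taken verbatim from the message templates in this section.
_ARR_PATTERN = re.compile(
    r"REPT ARR AUTORST "
    r"(FAILURE|THRESHOLD EXCEEDED|TIMEOUT AWAITING MIRA) "
    r"FOR (\S+) (\S+)"
)


class ArrReport(NamedTuple):
    reason: str   # FAILURE, THRESHOLD EXCEEDED, or TIMEOUT AWAITING MIRA
    node: str     # the "aaaa" field of the template
    member: str   # the "b" field of the template


def parse_arr_report(line: str) -> Optional[ArrReport]:
    """Return the parsed report, or None if the line is not a REPT ARR AUTORST message."""
    match = _ARR_PATTERN.search(line)
    return ArrReport(*match.groups()) if match else None
```

A line that does not match any of the three templates returns None, so the function can be applied to every line of a captured printout.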
should be inhibited to prevent automatic diagnostics (ARR) from attempting to diagnose and restore nodes scheduled for manual diagnostics. See INH:DMQ in the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
Before any node associated with an active link can be removed from service for diagnostic purposes, the appropriate link must be removed from service. To put the signaling link (SLK) in the AVAILABLE-Manual Out-of-Service (MOOS) state, enter the following message at the MCRT, and proceed with diagnostics as usual.
CHG:SLK (a, b, [c, d]); MOOS
where:
a = group number (00 - 63)
b = member number (01 - 15)
The following message should appear on the MCRT:
CHG SLK a b [ c d ] NEW REQUESTED MINOR STATE = MOOS
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
If the SLK was manually removed from service, put it back in the AVAILABLE-In Service (IS) or Standby (STBY) state after diagnostics by entering the following message at the MCRT:
CHG:SLK (a, b, [c, d]); {IS | ARST}
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
The following message should appear on the MCRT:
CHG SLK a b [ c d ] NEW REQUESTED MINOR STATE = IS
where:
a = group number (00 - 63)
b = member number (01 - 15)
c = LI4 circuit pack (0 - 1)
d = LI4 port (0 - 3)
Refer back to these procedures as required when performing manual diagnostics.
There are basic events that must be accomplished when performing RN diagnostics. Input messages and formats can vary. As indicated in earlier paragraphs of this guide, some input messages cause the system to perform all diagnostic activities, such as removing the node from service, isolating the node, diagnosing the node, unisolating the node, and restoring the node to service. With other input messages, each individual event is acted upon according to the diagnostic message used. When performing RN diagnostics with a conditional restore (RST) or with the DGN command, a basic sequence of events (excluding obtaining a status report) occurs autonomously in the manner listed below:
1. The node under test (NUT) must first be removed from service. This is done by changing its state to out-of-service normal (OOS-NORMAL), if it was in the ACT state prior to performing the diagnostics. For additional information on node state changes, see the Maintenance Description section in this Manual.
2. The NUT is changed to the OOS-ISOLATED state to route incoming and outgoing traffic around the NUT. The request to isolate the NUT may be denied for reasons not listed here.
3. The node under test is diagnosed.
4. If the NUT was in the active ring prior to Step 2, the NUT is configured back into the active ring (OOS-NORMAL) after all diagnostic phases have run. The configuration can be denied if the diagnostics determined that the ring interface (RI) minor state is faulty (FLTY).
5. Finally, after the node has been successfully configured back into the active ring, the NUT is restored to service. It is automatically pumped with operational code, placed into execution, and changed to the active (ACT) state.
NOTE: If the request was a DGN rather than an RST, the node is not restored to service.
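The five-step sequence and the NOTE above can be summarized as a list of major states that the node passes through. The sketch below is a simplified model, not a description of the actual maintenance software: the function name is hypothetical, it assumes the node starts in the ACT state and that all diagnostic phases pass, and it assumes a node whose reconfiguration is denied simply remains OOS-ISOLATED.

```python
from typing import List


def major_state_sequence(command: str, ri_faulty: bool = False) -> List[str]:
    """Return the major states a node passes through for an RST or DGN request."""
    if command not in ("RST", "DGN"):
        raise ValueError(f"unsupported command: {command!r}")
    # Steps 1 and 2: removed from service, then isolated from the active ring.
    states = ["ACT", "OOS-NORMAL", "OOS-ISOLATED"]
    # Step 3: the diagnostic run itself does not change the major state.
    if ri_faulty:
        # Step 4 can be denied when the RI minor state is FLTY.
        # Assumption: the node then stays in the OOS-ISOLATED state.
        return states
    states.append("OOS-NORMAL")      # Step 4: configured back into the active ring
    if command == "RST":
        states.append("ACT")         # Step 5: restored to service (RST only)
    return states
```

With these assumptions, a DGN request ends in OOS-NORMAL while an RST request ends in ACT, which is the difference the NOTE points out.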
When a diagnostic failure cannot be corrected by CP replacement using the manual trouble locating process (see the trouble location circuit pack list tables in this chapter), check:
Before replacing any cables or changing any connections or pins, refer to the appropriate maintenance manuals. The following pages provide procedures used in performing RN diagnostics. Any of the following procedures can perform a diagnostic task. The following procedures are used for diagnosing either RPCNs or s. Each procedure is totally independent and should not be combined.
where:
xx = group number.
2. If a node is to be removed from service (OOS-NORMAL) for any reason, the following input command is used:
2xx
where: xx = display line number of the node to be removed from service.
The node state changes to OOS-NORMAL.
3. From the MCRT: To diagnose a node from this frame/cabinet group, enter the following command:
5xx
where: xx = display line number of the node to be diagnosed.
See the DGN command in the 401-610-057 Output Message Manual for the response to the completion of the diagnostics. If the diagnostic result is:
STF: Determine which phase(s) failed, and record the CP number(s) for that phase. See the trouble location circuit pack list tables in this chapter for additional information.
Conditional all-tests-passed (CATP): Determine the reason for the CATP response.
If the reason is the node was not singly isolated, go to Step 4. Conditionally restore (RST) the adjacent nodes. When these nodes have been restored, conditionally restore this node, the first failing node.
If the reason is the node was not isolated, correct all problems so that a duplex ring exists and conditionally restore this node.
If the reason is the ring is down, correct all problems so that an active ring exists and conditionally restore this node. For additional information on ring configuration and maintenance, see the Maintenance Description section in this manual.
No-tests-run (NTR): If an NTR response is received, go to Step 3. If the problem persists, seek technical assistance.
ABT: If an ABORT is received, determine the reason(s) for the ABORT. After determining the reason(s) for the ABORT, go to Step 3, and/or seek technical assistance.
4. From the MCRT: Unconditionally restore the node to service by entering the following input command:
3xx
where:
CAUTION:
Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an all-tests-passed (ATP) response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
The node which was being diagnosed should return to the system ACT state, and this should complete the diagnostic tests.
NOTE: Before any manual diagnostics begin, ARR should be inhibited to prevent automatic diagnostics (ARR) from attempting to diagnose and restore nodes that are queued for, or actively undergoing, manual diagnostics. See the INH:DMQ message in the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
1. At the MCRT: Obtain a report on the status of a node in a particular group, or the status of the ring, by entering the following input message, or a variation thereof, as shown in the OP: Ring Input Message Variations table, or refer to the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
OP:RING,nodexx y
For LN:
node = LN
xx = group number
y = node member number.
For RPCN:
node = RPCN
xx = group number
y = node member number.
NOTE: The input message shown above provides the status information for a specified RN. For the message completion response, observe the MCRT or the ROP. To determine what response message to expect, and for an explanation of it, see the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
2. At the MCRT: If there is an active link supported by this node, remove it from service using the procedures listed previously in this section. Request diagnostics of the node by entering the following input message, or a variation thereof, as listed in the DGN Message Input Variation table. For a complete listing of all DGN input command variations, see the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
DGN:nodexx y
For LN:
node = LN
xx = group number
y = node member number.
For RPCN:
node = RPCN
xx = group number
y = node member number.
NOTE: The input message listed above runs all automatic phases on the specified RN. To determine what response message to expect and for an explanation of this message, see the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
3. At the ROP: Examine the copy of the DGN printout to determine the status of the diagnostic tests (determine which phases failed or passed). If an ATP response is received at the ROP, proceed to Step 4. If an STF, NTR, or CATP response is received at the ROP, go to Step 5.
4. At the MCRT: If a link associated with this node was removed from service prior to diagnostics, put the link back in service using the procedures listed previously in this section. Unconditionally restore the node to service by entering the following input message:
RST:nodexx y;UCL
For LN:
node = LN
xx = group number
y = node member number
UCL = restores the node without diagnostics.
For RPCN:
node = RPCN
xx = group number
y = node member number
UCL = restores the node without diagnostics.
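The OP:RING, DGN, and RST input messages in this procedure all share the same "nodexx y" target. A small helper that fills in that template can reduce typing errors when preparing messages. This is a sketch only: the function names are hypothetical, and rendering the group number as two digits is an assumption drawn from the "xx" in the templates. Consult the Input Message Manual for the authoritative syntax and variations.

```python
NODE_TYPES = ("LN", "RPCN")  # the two node types named in this procedure


def node_target(node: str, group: int, member: int) -> str:
    """Fill in the 'nodexx y' part of the message templates."""
    if node not in NODE_TYPES:
        raise ValueError(f"node must be one of {NODE_TYPES}")
    if group < 0 or member < 0:
        raise ValueError("group and member numbers must not be negative")
    return f"{node}{group:02d} {member}"  # assumption: xx is a two-digit group number


def op_ring(node: str, group: int, member: int) -> str:
    """Status report request (Step 1)."""
    return f"OP:RING,{node_target(node, group, member)}"


def dgn(node: str, group: int, member: int) -> str:
    """Diagnostic request that runs all automatic phases (Step 2)."""
    return f"DGN:{node_target(node, group, member)}"


def rst(node: str, group: int, member: int, unconditional: bool = False) -> str:
    """Restore request; unconditional=True appends the UCL option (Step 4)."""
    message = f"RST:{node_target(node, group, member)}"
    return message + ";UCL" if unconditional else message
```

Keeping the UCL option behind an explicit keyword argument mirrors the caution in this procedure: an unconditional restore should be a deliberate choice.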
CAUTION:
Do not perform an unconditional restore unless one of the following has occurred:
- A complete diagnostic run has produced an ATP response.
- A complete diagnostic run has produced a CATP response, and the RI and the NP minor states are both USBL.
NOTE: If the major state of the node is OOS-ISOLATED, this input message requests that the node be included back into the active ring. If configuring the node back into the active ring is successful, the node major state is changed to ACT and the node is pumped with the required operational code. If the node cannot be configured back into the active ring, the restore is stopped and the node is left in the OOS-NORMAL state. If the node was not originally OOS, the restore is stopped and the node is left in the state it was in prior to the restoral request. The node's major state must be changed to OOS via a recent change and verify (RCV) command before it can be restored. For additional information concerning a node state change, refer to the Maintenance Description section in this manual.
NOTE: If the major state is changed to ACT, the DGN diagnostics are complete. Omit the remainder of this test procedure.
NOTE: Perform Steps 5 through 8 only if an ATP response is not received in Step 3.
5. From the ROP: If the diagnostic result is:
STF: Determine which phase(s) failed, and record the CP number(s) for that phase. See the trouble location circuit pack list tables in this chapter for additional information on RNs. Proceed to Step 6.
CATP: Determine the reason for the CATP response.
If the reason is the node was not singly isolated, go to Step 4. Conditionally restore (RST) the adjacent nodes. When these nodes have been restored, conditionally restore this node, the first failing node.
If the reason is the node was not isolated, correct all problems so that a duplex ring exists and conditionally restore this node.
If the reason is the ring is down, correct all problems so that an active ring exists and conditionally restore this node.
For additional information on ring configuration and maintenance, see the Maintenance Description section in this manual.
NTR: If an NTR response is received, go to Step 1 or Step 2. If the problem persists, seek technical assistance.
ABT: If an ABORT is received, determine the reason(s) for the ABORT. See the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual. After determining the reason(s) for the ABORT, go to Step 1 or Step 2, and/or seek technical assistance.
6. At the ring node frame/cabinet (RNF/C)
Use the trouble location circuit pack list tables in this chapter to determine the equipment location for each suspected or faulty CP.
7. At the RNF/C
Replace the faulty CP(s) using the procedures described in Chapter 7, Equipment Handling Procedures.
8. If time permits and there is uncertainty about node operation, repeat diagnostics to confirm proper system operations. Go to Step 2.
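The response handling in Steps 3 through 5, together with the CAUTION on unconditional restores, can be summarized as a small decision helper. The Python sketch below is illustrative only: the function names, the dictionary, and the wording of the returned strings are hypothetical and are not part of any Lucent tool or message set.

```python
# Illustrative sketch only. All names here are hypothetical; the response
# codes (ATP, STF, CATP, NTR, ABT) and the USBL minor state come from the
# procedure text.

NEXT_ACTION = {
    "ATP": "Proceed to Step 4 (restore the node).",
    "STF": "Record failed phase(s) and CP number(s); proceed to Step 6.",
    "CATP": "Determine the reason for the CATP response.",
    "NTR": "Go to Step 1 or Step 2; seek technical assistance if the problem persists.",
    "ABT": "Determine the reason(s) for the ABORT; go to Step 1 or Step 2 "
           "and/or seek technical assistance.",
}


def next_action(response: str) -> str:
    """Map a diagnostic response code to the follow-up named in the procedure."""
    key = response.strip().upper()
    return NEXT_ACTION.get(key, "Unlisted response; consult the Output Message Manual.")


def may_restore_unconditionally(response: str, ri_minor: str = "", np_minor: str = "") -> bool:
    """Apply the CAUTION: allow RST ...;UCL only after ATP, or after CATP
    when both the RI and NP minor states are USBL."""
    key = response.strip().upper()
    if key == "ATP":
        return True
    if key == "CATP":
        return ri_minor.strip().upper() == "USBL" and np_minor.strip().upper() == "USBL"
    return False
```

A helper like this is only a reading aid; the printed procedure and the Output Message Manual remain the authority for each response.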
exceptions, see the RST:/RST:RPCN input command in the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual.
1. Conditionally removes the node from service (OOS-NORMAL).
2. Isolates (OOS-ISOLATED) the node.
3. Runs all automatic phases on the node.
4. Unisolates the node (OOS-NORMAL).
5. Restores the node to service (ACT).
For additional information on the normal sequence of events when using the RST command, see the 401-610-055 Input Message Manual.
2. At the MCRT
If there is an active link supported by this node, remove it from service using the procedures listed previously in this section. Request node test by entering the following input message:
RST:nodexx y
For LN: node = LN, xx = group number, y = node member number.
For RPCN: node = RPCN, xx = group number, y = node member number.
NOTE: Upon entering the RST command at the MCRT, the following events normally occur:
1. The node is conditionally removed from service (OOS-NORMAL). The ring quarantine (RQ) LED on the node processor or IRN lights if the removal was successful.
2. The node is isolated from the active ring (OOS-ISOLATED). The no token (NT) LED lights at the node under test if the node is successfully configured out of the active ring.
3. All diagnostic phases are run on the specified node under test.
4. If the diagnostic result is an ATP response, the node is configured back into the active ring. When the node is successfully configured back into the active ring, it is restored to service. If the node cannot be configured back into the active ring, it is left in the OOS state.
To determine what completion response message to expect and for an explanation of it, see the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
If a link associated with this node was removed from service prior to diagnostics, put the link back in service using the procedures listed previously in this section.
NOTE: If the node is left in the OOS state, and the response STF, CATP, or NTR is received at the ROP, further diagnostics are required. Depending upon the severity of the failure(s), that is, if a particular phase or range of phases failed,
choose a DGN input message as listed in the DGN Message Input Variations Table or from the CNI Input Message Manual, 256-090-204, which matches the circumstances of the failed phase(s), and perform Steps 3 through 9.
3. At the ROP
From the printout received at the ROP (this step), determine which phase(s) failed. If an ATP response is received at the ROP, all diagnostics are complete and the rest of this test procedure should be omitted. If only a particular phase failed, proceed to Step 4, and enter the message as listed in the instructions. If a range of phases failed, enter the appropriate input message from the DGN Message Input Variations table in Step 4, and proceed with the test.
NOTE: Perform Steps 4 through 9 only if a CATP, NTR, or STF response is received in Steps 2 and 3.
4. At the MCRT
Request diagnostics for the failing phase by entering the following input message, or a variation thereof, as listed in the DGN Message Input Variations table:
DGN:nodexx y:PH a
For LN: node = LN, xx = group number, y = node member number, PH = phase, a = number of the particular phase to run.
For RPCN: node = RPCN, xx = group number, y = node member number, PH = phase, a = number of the particular phase to run.
NOTE: To determine what completion response message to expect and for an explanation of the message, see the 401-610-055 Input Message Manual or the 401-610-057 Output Message Manual.
1. At the ROP
Examine the printout and ascertain the failed phase(s), record the CP number(s), and use the trouble location circuit pack list tables in this chapter to determine the equipment location of the failed or faulty CP(s). The TLP option can also be used to determine the location of suspected faulty equipment.
2. At the RNF/C
Replace the faulty CP using the procedure described in Chapter 7, Equipment Handling Procedures.
3. If time permits and there is uncertainty about node operation, repeat diagnostics to confirm proper system operations. Go to Step 2.
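The DGN input message templates used in this procedure (DGN:nodexx y for all phases, DGN:nodexx y:PH a for a single phase) can be assembled mechanically. The following Python sketch simply fills in those templates; the function name is hypothetical, and the group and member values are passed as text exactly as the operator would type them, because field widths and padding are not stated here and should be checked against the Input Message Manual.

```python
# Illustrative sketch only: dgn_message is a hypothetical helper that fills
# in the "DGN:nodexx y" and "DGN:nodexx y:PH a" templates from the text.
from typing import Optional


def dgn_message(node: str, group: str, member: str, phase: Optional[str] = None) -> str:
    """Build a DGN input message string.

    node   -- "LN" or "RPCN" (the two node types named in the procedure)
    group  -- group number, as text (the xx field)
    member -- node member number, as text (the y field)
    phase  -- optional phase number, as text (the a field after PH)
    """
    node = node.strip().upper()
    if node not in ("LN", "RPCN"):
        raise ValueError("node must be LN or RPCN")
    message = f"DGN:{node}{group} {member}"
    if phase is not None:
        message += f":PH {phase}"
    return message
```

Called with the placeholder letters from the manual, dgn_message("LN", "xx", "y", phase="a") reproduces the printed template DGN:LNxx y:PH a.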
- A normal response containing failure data.
- A response without failure data because the RAP is hung in a diagnostic phase (the board being diagnosed is at fault).
- A response without failure data because the RAP firmware is not executing.
The first two faults can be isolated using standard diagnostic procedures. More than likely, however, the RAP firmware is not executing (a category 3 failure). In the automatic recovery procedure, diagnostics are run on a particular sequence of boards. The first board (on the RAP local bus) of this sequence always fails regardless of which board is bad.
Interactive Diagnostics
Interactive diagnostics (EX) are used to exercise a node in the interactive mode. In this mode, diagnostic execution is controlled so that any particular phase or portion of diagnostic execution can be exercised. Interactive diagnostics can be used in place of regular diagnostic execution for the following purposes:
1. To run diagnostics up to a particular point of execution and stop
2. To perform a specific group of tasks repeatedly
3. To start and to stop a loop of diagnostic executions
4. To step through a set of diagnostic commands
5. To suspend diagnostic execution for a specific time period.
NOTE: This capability is limited to data table statements; that is, downloaded diagnostic code when executed cannot be controlled interactively.
When EX is begun, the following sequence of events occurs:
1. The LN or RPCN is first removed from service following the rules of the RMV: or RMV:RPCN input messages.
2. The node is isolated if the node's major state is OOS, GROW, OFFLINE, or UNAV. Otherwise, the diagnostic request is aborted.
3. The EX demand executions are performed.
4. Upon successful completion of the EX routine, an attempt is made to include the node back into the active ring if it was in the active ring prior to entering the EX command. Otherwise, the node is left in the isolated segment. In all cases, the node is left in the OOS state.
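The major-state check in event 2 of the EX sequence amounts to a simple membership test. The Python sketch below is a hypothetical illustration of that rule only; the function and constant names are assumptions, and the state names are the four listed in the text.

```python
# Illustrative sketch only: names are hypothetical. The four major states
# are those listed in the EX sequence of events.

ISOLATABLE_MAJOR_STATES = frozenset({"OOS", "GROW", "OFFLINE", "UNAV"})


def ex_isolation_outcome(major_state: str) -> str:
    """Return "isolate" when the node's major state allows isolation for EX,
    or "abort" when the diagnostic request would be aborted."""
    if major_state.strip().upper() in ISOLATABLE_MAJOR_STATES:
        return "isolate"
    return "abort"
```

For example, a node whose major state is ACT would yield "abort", which matches the text: the request is aborted unless the state is one of the four listed.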
3. From the MCRT
Execute the diagnostics by entering the EX commands as listed or in the order that the diagnostics are to be performed:
To pause or suspend diagnostic execution at a specified statement number within a diagnostic phase for an RN, enter the following command:
EX:PAUSE;nodexx y :ST e
where:
node = LN or RPCN
xx = group number
y = node member number
b = phase(s) to be executed
e = statement number
See the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual for the system response to the message.
To put the diagnostics in a loop between the specified statement numbers for any RN, enter the following command:
EX:LOOP;nodexx y :ST f - g
See the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual for the system response to the message.
To step through the diagnostics and to suspend at a specified statement number for any RN, enter the following command:
EX:STEP;nodexx y :ST e
See the 401-610-057 Output Message Manual for the system response to the message.
To stop the looping started by the EX:LOOP command for any RN, enter the following command:
EX:STOP;nodexx y
See the 401-610-055 FLEXENT/AUTOPLEX Wireless Networks INPUT MESSAGES Message Manual or the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual for the system response to the message.
To exit from the interactive mode for any RN, enter the following input command:
STOP:DMQ;nodexx y
If a link associated with this node was removed from service prior to diagnostics, put the link back in service using the procedures listed previously in this section.
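The interactive commands above share one shape: a verb, the node identifier, and optionally one or two statement numbers. As a rough aid, the Python sketch below stores the printed templates verbatim and fills them in. The helper name, the dictionary, and the "EXIT" key are hypothetical; the node identifier is passed as a ready-made string (the "nodexx y" part) because its exact formatting is defined in the Input Message Manual, not here.

```python
# Illustrative sketch only. The template strings are copied from the
# procedure text; the names ex_command, EX_TEMPLATES and the "EXIT" key
# are hypothetical.

EX_TEMPLATES = {
    "PAUSE": "EX:PAUSE;{node_id} :ST {e}",
    "LOOP": "EX:LOOP;{node_id} :ST {f} - {g}",
    "STEP": "EX:STEP;{node_id} :ST {e}",
    "STOP": "EX:STOP;{node_id}",
    "EXIT": "STOP:DMQ;{node_id}",  # leaves the interactive mode
}


def ex_command(action: str, node_id: str, **statements: str) -> str:
    """Fill in one of the interactive-diagnostic command templates.

    statements -- statement numbers keyed by the letters used in the
                  manual (e for PAUSE/STEP, f and g for LOOP).
    """
    try:
        template = EX_TEMPLATES[action.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown interactive action: {action!r}") from None
    return template.format(node_id=node_id, **statements)
```

Using the manual's own placeholder letters, ex_command("LOOP", "LNxx y", f="f", g="g") gives back the printed form EX:LOOP;LNxx y :ST f - g.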
For more details and an explanation of the INH:DMQ command, refer to the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual.
- The request was entered by mistake.
- A request of higher importance is in the waiting queue, and an active queue must be cleared to allow room for another.
- An interactive diagnostic is to be exited.
- The active and waiting queues of all requests must be cleared for the field update of diagnostic files.
When it is necessary to abort or cancel a diagnostic request, the following procedure should be used:
1. At the MCRT
Enter the following input command:
OP:DMQ
The output from this command tells the user the slot number and queue assigned to a particular job. The source in the output message may be (but is not limited to) one of the following:
- ARR - Automatic ring recovery
- ADP - Automatic diagnostic process
- MAN - Manual requests input by the user
- PSM - Power switch monitor
- REX - Routine exercise.
2. At the maintenance terminal
Enter the following command to abort a diagnostic request in the active queue or cancel it from the waiting queue:
STOP:DMQ;nodexx y
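When reading OP:DMQ output, the source abbreviations listed in Step 1 can be kept in a small lookup table. The Python sketch below is a hypothetical convenience only; the names are assumptions, and because the text says the source is not limited to these five values, unknown codes are reported rather than rejected.

```python
# Illustrative sketch only: a lookup of the five request sources named in
# the procedure. The names DMQ_SOURCES and describe_source are hypothetical.

DMQ_SOURCES = {
    "ARR": "Automatic ring recovery",
    "ADP": "Automatic diagnostic process",
    "MAN": "Manual requests input by the user",
    "PSM": "Power switch monitor",
    "REX": "Routine exercise",
}


def describe_source(code: str) -> str:
    """Expand a source abbreviation; other sources may exist, so an
    unrecognized code is flagged instead of raising an error."""
    return DMQ_SOURCES.get(code.strip().upper(), "Unlisted source; see the Output Message Manual")
```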
Audits
At various points in the diagnostic execution process, checks are performed to verify that the diagnostic system is functioning properly. These verifications are:
- Called functions give correct return codes
- Needed system resources are available
- Necessary files can be opened or read, and executed
- Hardware errors have not occurred
- Illegal operations are not attempted
Audit Failures
If an audit fails, a report is printed at the MCRT. The user should respond to the audit report in the following manner:
1. If a diagnostic test or phase fails prior to an audit failure, clear the problem indicated by the test failure. This may also clear the audit failure.
2. Save the printout pertaining to the audit failure, and use the 401-610-057 FLEXENT/AUTOPLEX Wireless Networks OUTPUT MESSAGES Manual:
- to determine the reason for the audit failure,
- to determine whether or not the CTS should be contacted, and
- to see if any additional data should be collected.
When a diagnostic is aborted, one of two messages is printed at the MTTY and the ROP. Only one format and its explanation are listed here. For details and an explanation of the second format, refer to the 401-610-057 Output Message Manual.
DGN AUDIT RING R = b SYSTEM DATA D = n T = i A = j S = k I = l PH = p
where:
b = reason for the audit (in hexadecimal notation)
n = error code returned on a failing system call or a failing function call (in decimal notation)
i = last test executed (in decimal notation)
j = data table address (in hexadecimal notation)
k = data table statement number (in decimal notation)
l = task routine index (in hexadecimal notation)
p = phase number being executed when the DGN was aborted (in decimal notation).
For additional information concerning audits, refer to the Audits section of this manual.
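Because the audit message mixes hexadecimal and decimal fields, decoding it by hand is error-prone. The Python sketch below parses the one format shown above and converts each field using the notation given in the field list. It is a minimal sketch under stated assumptions: the function and key names are hypothetical, the fields are assumed to appear on one line separated by spaces exactly as printed, hexadecimal values are assumed to carry no prefix, and the sample values in the usage note are invented for illustration.

```python
# Illustrative sketch only: parses the single "DGN AUDIT RING" format shown
# in the text. Layout assumptions (single line, bare hex digits) are NOT
# confirmed by the manual; adjust the pattern to the real printout.
import re

_AUDIT_RE = re.compile(
    r"DGN AUDIT RING R = (?P<R>[0-9A-Fa-f]+) SYSTEM DATA "
    r"D = (?P<D>-?\d+) T = (?P<T>\d+) A = (?P<A>[0-9A-Fa-f]+) "
    r"S = (?P<S>\d+) I = (?P<I>[0-9A-Fa-f]+) PH = (?P<PH>\d+)"
)


def parse_dgn_audit(line: str) -> dict:
    """Return the audit fields as integers, honoring each field's notation."""
    normalized = " ".join(line.split())  # collapse runs of whitespace
    match = _AUDIT_RE.search(normalized)
    if match is None:
        raise ValueError("not a DGN AUDIT RING message in the expected format")
    return {
        "reason": int(match["R"], 16),              # R: hexadecimal
        "error_code": int(match["D"]),              # D: decimal
        "last_test": int(match["T"]),               # T: decimal
        "data_table_address": int(match["A"], 16),  # A: hexadecimal
        "statement": int(match["S"]),               # S: decimal
        "task_routine_index": int(match["I"], 16),  # I: hexadecimal
        "phase": int(match["PH"]),                  # PH: decimal
    }
```

With an invented sample such as "DGN AUDIT RING R = 1A SYSTEM DATA D = 5 T = 12 A = 3F00 S = 40 I = B PH = 3", the reason decodes to 26 and the data table address to 16128.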
7
Contents
Introduction
Equipment Description and Handling Precautions
Power Packs and Fusing Descriptions
Power Pack Description and Replacement Procedures
Fuse Description and Replacement Procedures
Fan and Filter Maintenance
Ring Node Frame Fan Unit Description
Ring Node Cabinet Fan Unit Description
Analog Facility Access Frame Fan Unit Description
Filter Maintenance
Ring Node Equipment Visual Indicators
Removing Affected Equipment From Service
UN122C and UN123B Combination Circuit Pack Installation
Voice Frequency Link Hardware
Equipment Replacement Procedures
Introduction
This chapter contains guidelines and precautions to be followed when working with equipment in a Common Network Interface (CNI) office. These guidelines and precautions must be followed closely before and during the handling of all circuit packs (CPs). They are of extreme importance, since improper handling may cause isolation of the ring or total system failure. Use them in conjunction with Chapter 4, Ring and Ring Node Maintenance Procedures, and Chapter 6, Diagnostic User's Guide.
- Integrated ring circuit packs (described for each ring node type in the Overview of Chapter 6, Diagnostic User's Guide)
- Power converter packs
- Ring node frame/cabinet (RNF/C) fan units.
NOTE: When handling ring and ring node (RN) equipment, the appropriate light emitting diodes (LEDs) must be illuminated to prevent severe system interruption or failure.
Table 7-1. Power Unit Index

REPLACE POWER UNIT    REMOVE NODES
1                     1, 2
2                     2, 3
3                     4, 5
4                     5, 6
5                     7, 8
6                     8, 9
7                     10, 11
8                     11, 12
9                     13, 14
10                    14, 15
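Table 7-1 follows a regular pattern: each consecutive pair of power units covers three consecutive nodes and shares the middle one. The Python sketch below reproduces the table from that observed pattern. It is a convenience derived from the table entries only, not a statement about the hardware; the function name is hypothetical, and the table itself remains the reference.

```python
# Illustrative sketch only: reproduces Table 7-1 from the pattern visible
# in its entries. nodes_for_power_unit is a hypothetical name.
from typing import Tuple


def nodes_for_power_unit(unit: int) -> Tuple[int, int]:
    """Return the two nodes to remove before replacing the given power unit."""
    if not 1 <= unit <= 10:
        raise ValueError("Table 7-1 lists power units 1 through 10 only")
    # Units 1-2 cover nodes 1-3, units 3-4 cover nodes 4-6, and so on.
    group_base = 3 * ((unit - 1) // 2) + 1
    # Odd-numbered units start at the first node of the group of three,
    # even-numbered units start at the second.
    first = group_base if unit % 2 == 1 else group_base + 1
    return (first, first + 1)
```

For instance, nodes_for_power_unit(4) returns (5, 6), matching the fourth row of the table.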
CFR:RING a, b;EXCLUDE
where:
a = Ring node (if b is present, a is the first of a range of RNs in the direction of flow of Ring 0), in the form of {RPCNx y | x y}
b = Last node in the range begun by a, in the same form.
EXCLUDE = Request to isolate specified node(s) from the active ring.
5. At the affected RNF/C, locate the correct faulty converter.
6. Obtain the proper replacement power pack using precautions for handling RN equipment CPs.
CAUTION: Before removing the affected power pack, ensure that the associated RPCN or LN(s) has been removed from service and isolated. Refer to Table 7-1 to determine the proper nodes to remove from service.
7. At the faulty equipment location, replace the faulty power pack (observe all equipment handling precautions).
8. At the RN control panel, press the PWR ALM RESET button to restore the frame/cabinet to normal operation.
9. At the 410AA or 495FA power converter, verify that the power alarm lamp and the LEDs are illuminated.
10. Place the faulty power pack in protective static wrapping, and return it to storage for later repair.
11. Before returning the node(s) to service, diagnose the node by entering the following at the MCRT:
DGN:nodexx y
where:
DGN = Requests the run of all diagnostics phases
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
NOTE: Before unconditionally restoring the node to the ring, it is strongly recommended that at least Phase 1 and Phase 2 diagnostics are run on the node. The above procedure will execute full diagnostics.
12. After diagnostics returns an ATP message, restore the node(s) removed from service by entering the following at the MCRT:
RST:nodexx y; UCL
where:
node = An LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
For further reference, see Chapter 6, Diagnostic User's Guide.
If the power failure is not corrected after replacing the power converter, there may be a short in the LN. If a short on an LN circuit pack is the cause of a power failure, use the following procedure to correct the malfunction:
Procedure 7-2. Fixing Power Failures Caused by a Shorted Link Node Circuit Pack
1. At the MCRT, determine the affected equipment location.
2. Press the ALM-RLS key to silence the audible alarm.
NOTE: The audible alarm may also be silenced by pressing the ACO key on the control panel of the affected RNF/C.
3. At the affected equipment location, locate the nodes affected by the power loss.
4. At the MCRT, remove either the two associated LNs or the affected RPCN from service. Enter the following command:
RMV:nodexx y
where:
node = An LN or RPCN
xx = Ring node group number
y = Node position in ring node group (member number)
UCL = Restore node unconditionally.
5. Isolate the associated RPCNs or LNs from the active ring. Enter:
CFR:RING a, b;EXCLUDE
where:
a = Ring node (if b is present, a is the first of a range of RNs in the direction of flow on Ring 0), in the form of {RPCNx y | x y}
b = Last node in the range begun by a, in the same form.
EXCLUDE = Request to isolate specified node(s) from the active ring.
6. At the faulty equipment location, unplug all circuit packs affected by the power loss. This includes either the affected RPCN or the two associated LNs.
CAUTION: Before removing the affected power pack, ensure that the associated RPCN or LN(s) has been removed from service and isolated. Refer to Table 7-1 to determine the proper nodes to remove from service.
7. At the faulty power pack, recycle power to the affected power converter.
8. If the converter does not turn on with no load on it, then replace the CP. Place the faulty power pack in protective static wrapping and return it to storage for later repair.
9. If the converter powers up, try replacing each suspect CP one at a time. At the faulty equipment location, plug in each circuit pack removed in Step 6. The CP with the short will power down the power converter.
10. Replace the faulty circuit pack with a new one.
11. If the problem is corrected after replacing the faulty CP, place the faulty CP in protective static wrapping and return it to storage for later repair.
12. At the RN control panel, press the PWR ALM RESET key to restore the frame/cabinet to normal operation.
13. Before returning the node(s) to service, diagnose the node by entering the following at the MCRT:
DGN:nodexx y
where:
DGN = Requests the run of all diagnostics phases
node = An LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
NOTE: Before unconditionally restoring the node to the ring, it is strongly recommended that at least Phase 1 and Phase 2 diagnostics are run on the node. The above procedure will execute full diagnostics.
14. After diagnostics returns an ATP message, restore the node(s) removed from service by entering the following at the MCRT:
RST:nodexx y; UCL
where:
node = An LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
For further reference, see Chapter 6, Diagnostic User's Guide.
NOTE: The audible alarm may also be silenced by pressing the ACO key on the control panel of the affected RN frame/cabinet.
3. To avoid ring interruption, the affected ring nodes should be taken out of service and isolated from the active ring before the power converter is removed. If the RNs are not already OOS and isolated, enter the following commands:
RMV:nodexx y
CFR:RING a, b;EXCLUDE
where:
node = LN or RPCN
xx = The ring node group number
y = Position in the ring node group (member number).
a = Ring node (if b is present, a is the first of a range of RNs in the direction of flow on Ring 0), in the form of {RPCNx y | x y}
b = Last node in the range begun by a, in the same form.
EXCLUDE = Request to isolate specified node(s) from the active ring.
4. At the faulty equipment location, unseat the affected power converter (that which is associated with the blown fuse and OOS nodes).
5. Replace the faulty fuse.
6. Reseat the power converter. If the fuse does not blow again, proceed to Step 8.
7. Otherwise, the power converter must be replaced:
- unseat the affected power converter,
- insert a new fuse,
- replace the power converter,
- place the faulty power converter in protective static wrapping, and return it to storage for later repair.
8. At the RN control panel, press the PWR ALM RESET key to restore the frame/cabinet to normal operation.
9. The lamp test key can be used to test the power alarm (PA) and fuse alarm (FA) lamps.
10. Before returning the node(s) to service, diagnose the node by entering the following at the MCRT:
DGN:nodexx y
where:
DGN = Requests the run of all diagnostics phases
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
NOTE: Before unconditionally restoring the node to the ring, it is strongly recommended that at least Phase 1 and Phase 2 diagnostics are run on the node. The above procedure will execute full diagnostics.
11. After diagnostics returns an ATP message, restore the node(s) removed from service by entering the following at the MCRT:
RST:nodexx y; UCL
where:
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
For further reference, see Chapter 6, Diagnostic User's Guide.
Disruption of either one LN unit or one RPCNU may be caused by a blown 20-amp fuse on the PDF or DCPD. Loss of the fuse also affects the two power converters on the LN or RPCN unit.
3. To avoid ring interruption, the affected ring nodes should be taken out of service and isolated from the active ring before the power converter is removed. If the RNs are not already OOS and isolated, enter the following commands:
CFR:RING,a, b;EXCLUDE
RST:nodexx y; UCL
where:
a = Ring node (if b is present, a is the first of a range of RNs in the direction of flow of Ring 0).
b = Last node in the range begun by a.
EXCLUDE = Request to exclude specified node(s) from the active ring.
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
4. At the faulty equipment location, unseat the affected power converters and circuit packs. Remove the fan fuse(s).
5. At the PD frame/cabinet, remove the blown fuses (both the main and indicator fuses). The GPDF does not have indicator fuses.
6. Insert the charging tool into the indicator fuse slot, and press the charge key on the PD control panel. The GPDF does not have a charging probe. When this key is pressed, the charge indicator LED illuminates and slowly decays to off as the fuse location becomes fully charged.
7. Insert a new 20A main fuse and remove the charging tool. The GPDF uses a 25-amp fuse.
8. Reinsert the indicator fuse.
9. At the affected RNF/C, reseat the power converters and replace the fan fuse.
10. Reseat all circuit packs. If all fuses hold (on both the RNF/C and the PD frame/cabinet), proceed to the next step. Otherwise, correct the problem using guidelines for the appropriate condition.
11. At the RN control panel, press the PWR ALM RESET key to restore the frame/cabinet to normal operation.
12. Before returning the node(s) to service, diagnose the node by entering the following at the MCRT:
DGN:nodexx y
where:
DGN = Requests the run of all diagnostics phases
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
NOTE: Before unconditionally restoring the node to the ring, it is strongly recommended that at least Phase 1 and Phase 2 diagnostics are run on the node. The above procedure will execute full diagnostics.
13. After diagnostics returns an ATP message, restore the node(s) removed from service by entering the following at the MCRT:
RST:nodexx y; UCL
where:
node = LN or RPCN
xx = Ring node group number
y = Node position in the ring node group (member number).
For further reference, see Chapter 6, Diagnostic User's Guide.
Procedure 7-5. Fixing Blown Fuse or Power Failures of the Digital Facility Access Frame/Cabinet
There are also cases where fuses and power failures may occur on the digital facility access (DFA) frame/cabinet or the analog facility access frame (AFAF).
1. At the affected equipment control panel, press the ACO key to silence the alarm.
2. At the affected equipment location, locate the blown fuse(s).
3. Unseat the appropriate 495H1 and the 393A power converters (those associated with the blown fuse or fuses).
4. At the fuse location, replace the blown fuse(s).
5. Reseat both the 495H1 and the 393A power converters.
6. When powering up the DFA frame/cabinet, a major alarm may be activated before the power converters stabilize. If a major alarm sounds, continue; otherwise, the problem is corrected.
7. At the DFA control panel, press the POWER ALARM RESET key to restore the frame/cabinet to normal operation.
8. Press the ACO key to silence the alarm.
Procedure 7-6. Fixing Blown Fuse or Power Failures of the Analog Facility Access Frame
1. At the affected equipment control panel, press the ACO key to silence the alarm.
2. At the affected equipment location, locate the blown fuse(s).
3. Unseat the associated 133K and the 130D power converters.
NOTE: Ensure the correct power converters are removed (those associated with the blown fuse or fuses).
4. At the fuse location, replace the blown fuse(s).
5. Reseat both the 133K and the 130D power converters.
6. At the AFAF control panel, press the POWER ALARM RESET key to restore the frame to normal operation.
7. If the alarm is due to a power failure in the fan system, do the following:
a. At the affected AFAF, replace the blown fuse. If the fuse blows again, proceed to Step b; otherwise, the problem is corrected.
b. Replace the fan or restore it to an operational state.
c. On the 64C2 data mounting unit, press the alarm reset key.
d. Press the alarm reset (ARS) key.
mounting to maintain the proper operating temperature. Although the data sets can function properly with a fan unit failure, corrective action should be taken as soon as possible.
Standard and K-cabinets have six fans in the middle of the cabinet: three fans in front and three fans in back. The three fans in front cool the upper half of the cabinet, and the three fans in back cool the lower half of the cabinet. These fans vary in speed from 1700 RPM to 3400 RPM. The LEDs and toggle switch for the fans are located on the back of the cabinet.
When a fan failure is detected (as indicated by the ALM and PWR ALM lamps illuminating), one of the following procedures should be used to correct the fault.
Filter Maintenance
The air filters are intended to eliminate dust from the cooling air. Dust buildup on frame circuitry could lead to improper system operation. Although no alarms are associated with the fan filters, they must be properly maintained by periodic replacement.
The RNF/C filters are positioned horizontally just above the fan unit. To replace the RNF/C fan filter, simply slide it out the front of the frame/cabinet. On frame installations, remove the handle from the old filter and attach it to the new filter. On cabinet installations, simply replace the old filter.
The AFAF filters are positioned horizontally just below the fan unit(s). To replace the AFAF data unit fan filter, the data unit cover must first be opened. The filter then simply slides out the front of the frame.
In the newer cabinets, the filters are above and below the front fan unit. To replace the filter, slide the filter out of the cabinet and replace it with a new filter.
- Backplane pins do not come out with the pack
- No pins are bent when the replacement CP is inserted.
Extreme care must be used when handling the ring interface CPs. These CPs require considerable force to insert and remove. Therefore, whenever replacing or inspecting these CPs, check them carefully and use care in applying pressure to them.
- Fuse failure
- Unplugged power converter
- Fan unit failure. If more than one fan fails, a major alarm sounds. If the problem is not corrected, a total RNF/RNC failure may occur.
The NT lamps are also adjacent to nodes equipped with IFBs. Before any IFB circuit pack can be replaced, the NT lamps of both adjacent nodes must be illuminated. There are only two IFBs per frame/cabinet. These are located at the RPCN node if equipped, or the first and last of the RNF/C. Since the IFB is adjacent to one node within its own RNF/C and another in the next RNF/C in line, the NT lamp adjacent to the suspected IFB on the associated frame/cabinet, and the NT lamp on the frame/cabinet next in line must be illuminated before the IFB circuit pack can be extracted.
original isolation should be corrected before attempting to correct the new problem. This eliminates the possibility of expanding the isolated segment over the additional 50 nodes.
System software puts faulty equipment OOS in one of two manners: normally and isolated. By taking it OOS normally, the system leaves it in the OOS-NORMAL maintenance state. In this state, the equipment is still part of the active ring. However, when the system removes the equipment from service and isolates it from the active ring, it is in the OOS-ISOLATED maintenance state. In this state, the node is a functional part of the ring for maintenance purposes only.
Equipment Replacement Procedures
Before any ring node equipment involving CPs is replaced or handled, all precautions and illuminated LEDs must be observed. When performing diagnostics, faulty CPs are listed in the manual trouble locating process. Therefore, all precautions must be followed before replacing these CPs. Following is a summary of the sequence of events that must take place when replacing equipment. When a malfunction or faulty equipment is detected:
1. Press the alarm cutoff (ACO) button at the affected equipment, or the ALM-RLS key at the MCRT, to silence the audible alarm.
2. Before attempting to change, inspect, or handle any CP, ground yourself using the static control wrist strap (3M-2066).
3. At the faulty equipment location, determine which CP is faulty. On the RNFs or RNCs, nodes are grouped closely together. Individual CPs are distinguishable by a color-coded bar above and across each ring node unit. To ensure that the proper pack is removed, examine each color-coded bar before any pack is extracted. Using the identification numbers on the faulty CP (be sure to check microcode, version, and issue), obtain the proper replacement CP.
4. Make sure the wrist strap is grounded and remove the suspect CP.
5. Insert the replacement CP from the storage cabinet.
6. Wrap up the old CP and place it in a carton for return.
7. Perform diagnostics on any affected equipment, and if all goes well, restore it to service.
8. If diagnostics fail, the faulty CP may not have been removed.
At the replacement CP, ensure that the proper LEDs are illuminated for the type of CP replaced:
The NT lamp on this CP is illuminated and both RQ lamps are illuminated on the NP and circuit packs. The RQ lamp on the adjacent pack and the NT lamp on the adjacent RI1 CP is illuminated. The RQ lamp on this CP is illuminated, the adjacent RI1 NT lamp is illuminated, and the adjacent RQ lamp is illuminated. The RQ lamp on this CP is illuminated, the RQ lamp on the adjacent NP pack is illuminated, and the RI1 NT lamp is illuminated. The MDL boards are not equipped with LEDs. The NT lamps adjacent to the IFB are illuminated. Both RQ lamps on the adjacent NP and the CPs are illuminated, along with the NT lamp on RI1. The RQ lamp on the adjacent IRN is illuminated, and the PCID and power converter for the RAP are turned off. The RQ and NT lamps on this CP are illuminated, and the RQ lamp on the adjacent circuit pack is illuminated.
IRN/IRNB UN303 or UN303B (VLSI only) IRN2/IRN2B UN304 or UN304B IFB-U TN918 IFB-P TN915 IFB-4K TN1506 IFB-F TN1508 IFB-F TN1509 IFB-F TN1803 IFB-F TN4016 3BI TN914 DDSBS TN69B LI TN916 or TN1317 LI4S TN1316 LI4D TN1315 T1FA UN291 LI4S TN1316 12A Applique APA12
AP: AP68 TN1340 (2 meg) or TN1641 (8 meg) for DLN AP30 TN1630 for DLNE or DLN30 AP30 TN1630B with 64-Mbyte mezzanine memory for DLNE-AP30 or CDN-II AP30 TN1630B with 64- to 256-Mbyte mezzanine memory for CDN-IIx
NPI TN1349 RAP 3B15 computer boards CCC UN237 (1) for 2-mbyte, UN626 for 16-mbyte CCS UN236 (1) for 2-mbyte, UN625 for 16-mbyte MASC UN95 (1-6) or UN507 (1) for 16-mbyte memory board option MASA TN56 (1-48) or TN1398 (1-8) for 16-mbyte memory board option PCID TN1128.
As stated earlier, all faulty equipment must be OOS before maintenance is performed. If the equipment has not been automatically made OOS, then it must be manually removed from service before any CPs are handled. Ring node CPs must be isolated before they can be removed. Caution is again stressed when isolating nodes in a ring that already contains isolated nodes. To avoid increasing the size of the original ring isolation, problems associated with the previous ring isolation should be corrected before isolating any other nodes. Otherwise, the new isolation may isolate too large a segment of the ring, leaving too few active nodes for a sufficiently operational ring.
When replacing circuit packs in ring nodes, it is important that the proper node and associated nodes are removed and isolated. There are two power supplies for each shelf, each power supply feeding 1-1/2 ring nodes. Table 7-2 displays the additional nodes that must be isolated and removed when replacing a circuit pack in a node.

Table 7-2. Ring Node Power Supply Index

REPLACE CIRCUIT PACK      REMOVE AND
IN RING NODE:             ISOLATE NODES:
 1                        1, 2
 2                        1, 2, 3
 3                        2, 3
 4                        4, 5
 5                        4, 5, 6
 6                        5, 6
 7                        7, 8
 8                        7, 8, 9
 9                        8, 9
10                        10, 11
11                        10, 11, 12
12                        11, 12
13                        13, 14
14                        13, 14, 15
15                        14, 15
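The pattern in Table 7-2 can be expressed compactly: nodes fall into groups of three, and the nodes to isolate depend on where the node sits in its group. The following is a minimal sketch only; the function name nodes_to_isolate is hypothetical and is not part of any system software.

```c
#include <stddef.h>

/*
 * Sketch of the Table 7-2 lookup (hypothetical helper).
 * For a ring node position 1-15, fill out[] with the nodes that must be
 * removed and isolated, and return how many entries were written.
 * Returns 0 for an out-of-range node position.
 */
size_t nodes_to_isolate(int node, int out[3])
{
    if (node < 1 || node > 15)
        return 0;

    int first = ((node - 1) / 3) * 3 + 1;  /* first node of the 3-node group */
    int offset = node - first;             /* 0, 1, or 2 within the group */
    size_t n = 0;

    if (offset <= 1)
        out[n++] = first;                  /* first and middle node include it */
    out[n++] = first + 1;                  /* middle node is always included */
    if (offset >= 1)
        out[n++] = first + 2;              /* middle and last node include it */
    return n;
}
```

For node 5, the sketch yields 4, 5, 6, which matches the corresponding row of Table 7-2.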
Assumption: Diagnostics have determined that there are faulty CPs in one or more nodes on the ring.

1. At the MCRT, press the ALM-RLS key if necessary to silence alarms.
   NOTE: An audible alarm may also be silenced by pressing the ACO key at the affected RNF/C.
2. If the node with the faulty CP and associated nodes have not been removed from service, remove them. Refer to Table 7-1 to determine which nodes to remove and isolate. At the MCRT, enter:
      RMV:nodexx y
   where:
      node = RPCN or LN
      xx   = Ring node group number (00-63)
      y    = Node position in the ring node group (0 for RPCN, 1-15 for LN)
3. At the MCRT, isolate the associated node from the active ring. Enter:
      CFR:RING a, b;EXCLUDE
   where:
      a       = Ring node. If b is present, a is the first of a range of RNs (in the direction of flow on Ring 0). In the form of {RPCNx y | x y}
      b       = Last node in the range begun by a, in the same form
      EXCLUDE = Request to isolate the specified node(s) from the active ring
4. At the faulty equipment location, obtain CP identification for the faulty pack. Get the proper replacement CP (use caution handling the new pack).
5. Ensure that the appropriate node is OOS, proper LEDs are illuminated, and that you are properly grounded to avoid static discharge.
6. Replace the faulty/suspected CP.
   NOTE: Ensure that the adjacent NP and (LI4 and APA12) CP RQ lamps are illuminated before removing either of these affected CPs.
   NOTE: Ensure that the adjacent RI1 NT and the adjacent RQ lamps are both illuminated before removing either of these CPs.
   NOTE: Since most CPs require considerable force to insert or remove, extreme caution must be exercised. Carefully inspect the CP edge connector and the backplane connector for bent or missing pins.
7. Place the old (or faulty) CP in the protective static wrapping, and return it to the storage cabinet for later repair.
8. At the affected RN control panel, press the PWR ALM RESET button to restore the frame/cabinet to normal operation.
9. Diagnose the node by entering the following at the MCRT:
      DGN:nodexx y
   where:
      DGN  = Requests the run of all diagnostic phases
      node = LN or RPCN
      xx   = Ring node group number
      y    = Node position in the ring node group (member number)
   NOTE: Before unconditionally restoring the node to the ring, it is strongly recommended that at least Phase 1 and Phase 2 diagnostics are run on the node. The above procedure will execute full diagnostics.
10. After diagnostics returns an ATP message, restore the node(s) removed from service by entering the following at the MCRT:
      RST:nodexx y; UCL
   where:
      node = LN or RPCN
      xx   = Ring node group number
      y    = Node position in the ring node group (member number)

For further reference, see Chapter 6, Diagnostic Users Guide.
4. To request full diagnostics on the token tracking node, enter:
      DGN:LNa=b
5. Resolve all troubles if the diagnostics fail.
6. To ensure that the minor state of the neighbor node(s) is MOOS, enter:
      CHG:SLK=a-b:MOOS
7. To remove appropriate neighbor nodes from ring service, enter:
      RMV:LNa=b
8. Isolate the token tracking node and the neighbor nodes from the active ring. Enter this command for each of the nodes:
      CFR:RING,LNa=b:EXCLUDE
9. Replace the existing CPs with the new UN122C and UN123B CPs. Be sure to use a wrist strap to protect against electrostatic discharge.
10. Update the in-core ECD for the token tracking node. First, change the UCB major state from OOS to GROW. Update the hv values. Then change the major state from GROW to OOS. See Table 7-3 for the appropriate hv values.
11. To request full diagnostics on the token tracking node, enter:
      DGN:LNa=b
12. Wait for the diagnostics on the token tracking node to return all tests passed (ATP). From the maintenance terminal, go to the 199 page and execute the activate RC/V form to copy the in-core copy of the ECD to disk.
13. To restore the neighbor nodes, enter:
      RST:LNa=b
14. If the token tracking node is an IUN node, run diagnostic phases 12 and 13 on the token tracking node. If the token tracking node is an RPC node, run diagnostic phases 32 and 33. After these diagnostics return ATP, enter the following to restore the token node:
      RST:LNa=b
Table 7-3. Hardware Version Values (with IFB)

                                       HV VALUE FOR IFB TYPE
NP, RI0, & RI1 CPs*     POSITION   TN918   TN915   TN1506  TN1508  TN1509
                        IN RNF/C                                   TN1803
TN913 UN122 UN123       Lowest     0x0001  0x0002  0x0004  0x0005  0x0006
                        Highest    0x0010  0x0020  0x0040  0x0050  0x0060
TN913 UN122B UN123B     Lowest     0x0801  0x0802  0x0804  0x0805  0x0806
                        Highest    0x0810  0x0820  0x0840  0x0850  0x0860
TN913 UN122B UN123B     Lowest     0x1001  0x1002  0x1004  0x1005  0x1006
                        Highest    0x1010  0x1020  0x1040  0x1050  0x1060
TN913 UN122C UN123B     Lowest     0x1801  0x1802  0x1804  0x1805  0x1806
                        Highest    0x1810  0x1820  0x1840  0x1850  0x1860
TN913 UN122C UN123B     Lowest     0x2001  0x2002  0x2004  0x2005  0x2006
                        Highest    0x2010  0x2020  0x2040  0x2050  0x2060
TN922 UN122 UN123       Lowest     0x0101  0x0102  0x0104  0x0105  0x0106
                        Highest    0x0110  0x0120  0x0140  0x0150  0x0160
TN922 UN122B UN123B     Lowest     0x0901  0x0902  0x0904  0x0905  0x0906
                        Highest    0x0910  0x0920  0x0940  0x0950  0x0960
TN922 UN122B UN123Bq    Lowest     0x1101  0x1102  0x1104  0x1105  0x1106
                        Highest    0x1110  0x1120  0x1140  0x1150  0x1160
TN922 UN122C UN123B     Lowest     0x1901  0x1902  0x1904  0x1905  0x1906
                        Highest    0x1910  0x1920  0x1940  0x1950  0x1960
TN922 UN122C UN123B     Lowest     0x2101  0x2102  0x2104  0x2105  0x2106
                        Highest    0x2110  0x2120  0x2140  0x2150  0x2160
UN303 (IRN)             Lowest     0x8001  0x8002  0x8004  0x8005  0x8006
                        Highest    0x8010  0x8020  0x8040  0x8050  0x8060
UN303 (IRN)             Lowest     0x8801  0x8802  0x8804  0x8805  0x8806
                        Highest    0x8810  0x8820  0x8840  0x8850  0x8860
UN303B (IRNB)           Lowest     0x9001  0x9002  0x9004  0x9005  0x9006
                        Highest    0x9010  0x9020  0x9040  0x9050  0x9060
UN303B (IRNB)           Lowest     0x9801  0x9802  0x9804  0x9805  0x9806
                        Highest    0x9810  0x9820  0x9840  0x9850  0x9860
UN304B (IRNB)           Lowest     0xc001  0xc002  0xc004  0xc005  0xc006
                        Highest    0xc010  0xc020  0xc040  0xc050  0xc060
UN304B (IRNB)           Lowest     0xc801  0xc802  0xc804  0xc805  0xc806
                        Highest    0xc810  0xc820  0xc840  0xc850  0xc860
* The RI CPs may be equipped with the Long Message Strap (LMS). This option is indicated in these tables with the symbol next to the CP number. Otherwise, the RI is not equipped with the LMS option.
Table 7-4. Hardware Version Values (No IFB)

NP, RI0, & RI1 CPs*          HV VALUE
TN913, UN122, UN123          0x0000
TN913, UN122B, UN123B        0x0800
TN913, UN122B, UN123B        0x1000
TN913, UN122C, UN123B        0x1800
TN913, UN122C, UN123B        0x2000
TN922, UN122, UN123          0x0100
TN922, UN122B, UN123B        0x0900
TN922, UN122B, UN123B        0x1100
TN922, UN122C, UN123B        0x1900
TN922, UN122C, UN123B        0x2100
UN303 (IRN)                  0x8000
UN303 (IRN)                  0x8800
UN303B (IRNB)                0x9000
UN303B (IRNB)                0x9800
UN304B (IRNB)                0xc000
UN304B (IRNB)                0xc800
* The RI CPs may be equipped with the Long Message Strap (LMS). This option is indicated in these tables with the symbol next to the CP number. Otherwise, the RI is not equipped with the LMS option.
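Comparing Table 7-3 with Table 7-4 suggests a regular pattern: each hv value with an IFB equals the no-IFB value for the same CPs plus an IFB code, placed in the lowest hex digit for the Lowest position and in the next digit for the Highest position. The sketch below illustrates that observed pattern only. The names hv_value, ifb_code, and ifb_position are hypothetical, and the tables remain the authoritative source for the values to enter.

```c
#include <stdint.h>

/* IFB codes as they appear in the low hex digits of Table 7-3 (observed). */
enum ifb_code {
    IFB_NONE          = 0x0,
    IFB_TN918         = 0x1,
    IFB_TN915         = 0x2,
    IFB_TN1506        = 0x4,
    IFB_TN1508        = 0x5,
    IFB_TN1509_TN1803 = 0x6
};

/* Position of the node in the RNF/C, as labeled in Table 7-3. */
enum ifb_position { IFB_LOWEST, IFB_HIGHEST };

/*
 * Combine a Table 7-4 base value with an IFB code.
 * Lowest position: code goes in bits 0-3. Highest position: bits 4-7.
 */
uint16_t hv_value(uint16_t base, enum ifb_code ifb, enum ifb_position pos)
{
    unsigned shift = (pos == IFB_HIGHEST) ? 4u : 0u;
    return (uint16_t)(base | ((unsigned)ifb << shift));
}
```

For example, the base value 0x0800 combined with a TN915 in the Lowest position gives 0x0802, which agrees with the corresponding entry in Table 7-3.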
Example: RI types UN122/UN123B

Remove the letter suffix (B) from the UN122/UN123B RI board code. Then look for the UN122/UN123 RI type, your node processor (NP) type, and the interframe buffer (IFB) type required, to locate the hardware version value.

The IFB unit name indicates the buffer capacity and the ring speed. In cases where it is necessary to identify a specific IFB, the following terminology and convention should be used:

Example: IFB-4K/6

This is an IFB with 4K bytes of buffer running at the ring speed of 6 MHz. The following information is a summary of current IFBs:

EXISTING CONVENTION            CODE      NEW CONVENTION
IFB                            TN918     IFB (IFB-16)
PIFB, padded IFB (IFB-P)       TN915     IFB-P (IFB-512)
                               TN1506    IFB-4k/6
                               TN1508    IFB-16/8
                               TN1509    IFB-4k/8
                               TN1803    IFB-4k/8
The plain term IFB should be used whenever it is not necessary to refer to a particular vintage of this circuit.
Procedure 7-10. Voice Frequency Link Access Circuit Pack Replacement Procedures
1. At the affected equipment location or the MCRT, silence any audible alarm by pressing the ACO key or the ALM-RLS key.
2. Before attempting to change, inspect, or handle any CP, ground yourself using the static control wrist strap (3M-2066).
Keep the CP in the protective wrapping until it is ready to be inserted in the frame/cabinet.
4. At the MCRT, put the SLK in the UNAV-TEST state. Use the Change Analog SLK VFL Access Circuit Board Procedures in the section referred to above.
   NOTE: If the SLK is already in the AVL-OOS state, it can be moved directly to the UNAV-TEST state without first being moved to the AVL-MOOS state.
5. At the affected equipment location, remove the suspect VFL access CP and insert the new CP.
6. Wrap the suspect CP, and place it in a carton to be returned for repair.
7. Restore the SLK to service. Use the Change Analog SLK VFL Access Circuit Board Procedures in the section referred to above.
CAUTION:
Keep the data set in the protective wrapping until it is ready to be inserted in the frame/cabinet.
4. At the MCRT, put the SLK in the UNAV-TEST state. Use the Change Analog SLK Data Speed Procedures in the section referred to above.
CAUTION:
   NOTE: If the SLK is already in the AVL-OOS state, it can be moved directly to the UNAV-TEST state without first being moved to the AVL-MOOS state.
5. At the back of the data set unit, remove the appropriate data set cables and the suspect data set.
6. On the data set unit, verify that the rise time option switches are set correctly:
   - In the open position, the rise time is set for fast.
   - In the closed position (toward numbers), the rise time is set for slow.
7. Insert the new data set and connect the data set cables.
8. Wrap the suspect data set, and place it in a carton to be returned for repair.
9. Set the data set options and restore the SLK to service. Use the Change Analog SLK Data Speed Procedures in the section referred to above.
Introduction
This appendix provides information about the ring node portion of the ring error analysis and recovery mechanisms. The error handling for ring errors is split between the node and the 3B21D. When an error is detected by a node, that node performs some recovery action and then reports the error by sending a message to the 3B21D. The 3B21D then takes some corrective action and notifies the craft via a message printed on the ROP. This document describes all errors reported to the 3B21D by the node. Included is a description of the error, the recovery action taken by the node, and the state of the node after the recovery is complete.
Data Structures
The following structures define the error message the node sends to the 3B21D. Throughout this document, this message will be referred to as the error message when discussing data that will be sent from the node to the 3B21D. Normally when an error occurs, the node will send error messages on both rings to the 3B21D. This ensures that a message will reach the 3B21D. In some cases, this is not possible, and those cases will be noted. This is the 3B21D view of the error message layout. See header file ims/com/head/ims_emsgs.h for the NP view.
    struct immemsg {
        struct immsg_hd immh;          /* IMS mtce. message header */
        NODE_PADD node;                /* phys. addr. of ring node */
        unsigned char imm_etype;       /* IMS error message type */
        unsigned char erring;          /* faulty ring */
        union vardata {
            struct {
                union {
                    struct header dhead;   /* header from failing msg */
                    struct {
                        short tokblk;      /* Blockage occurred on the token */
                        short sint;        /* False interrupt indicator */
                        short spare2;
                        short spare3;
                    } misc;
                } un;
                struct _riracstat ports;   /* RAC status ports, faulty ring */
                struct _riracstat opports; /* RAC status ports, opposite ring */
            } specific;
            unsigned char dchar[24];
            unsigned short dshrt[12];
            long dlong[6];
        } data;
    };
General Information
In the following descriptions, the terms upstream node and downstream node will be used. These terms describe relative positions of nodes and are based on the direction of data flow on the rings. Basically, any particular node will RECEIVE data from its upstream neighbor and will SEND data to its downstream neighbor. Since the data flows in opposite directions on the two rings, a node's upstream neighbor on ring 1 is the downstream neighbor on ring 0, and its upstream neighbor on ring 0 is the downstream neighbor on ring 1.
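The upstream/downstream relationship can be illustrated with a small model. The sketch below assumes, purely for illustration, that the n nodes are numbered 0 through n-1 in the direction of data flow on ring 0. This numbering and the function names are hypothetical and do not reflect actual node addressing.

```c
#include <assert.h>

/*
 * Hypothetical model: nodes are numbered 0..n-1 in the direction of data
 * flow on ring 0. Ring 1 carries data in the opposite direction.
 */

/* Node that receives data sent by 'node' on the given ring (0 or 1). */
int downstream_neighbor(int ring, int node, int n)
{
    assert(n > 0 && node >= 0 && node < n);
    return ring == 0 ? (node + 1) % n : (node + n - 1) % n;
}

/* Node from which 'node' receives data on the given ring (0 or 1). */
int upstream_neighbor(int ring, int node, int n)
{
    assert(n > 0 && node >= 0 && node < n);
    return ring == 0 ? (node + n - 1) % n : (node + 1) % n;
}
```

In this model, upstream_neighbor(1, i, n) always equals downstream_neighbor(0, i, n), which is the relationship described above.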
The following pages contain several headings. The error code is the defined symbol for the particular error and is placed in the immemsg.imm_etype field in the error message. The faulty ring is indicated in the erring field in the error message. The description is a detailed description of the error, and the node recovery action is a description of the node recovery process. The variable data is a description of the variable data in the error message. This data is intended to be used by the 3B21D when analyzing the error and will differ depending on the error type. There may be other data in the error message that is provided to be printed at the ROP. The ROP data is a description of the data that is printed on the ROP. This data is taken from the error message. The error message will be in the following general form:

REPT RING TRANSPORT ERR

See the output manual page for the complete description of the ROP output message. When this message is printed, various data fields will be included in the printout, and it is assumed that data taken from the error message from the node will be printed in the following order:

0xAAAAAAAA 0xBBBBBBBB 0xCCCCCCCC 0xDDDDDDDD 0xEEEEEEEE 0xFFFFFFFF(TTTTTTTTTT)

AAAAAAAA   - immemsg.data.dlong[0]
BBBBBBBB   - immemsg.data.dlong[1]
CCCCCCCC   - immemsg.data.dlong[2]
DDDDDDDD   - immemsg.data.dlong[3]
EEEEEEEE   - immemsg.data.dlong[4]
FFFFFFFF   - immemsg.data.dlong[5]
TTTTTTTTTT - The value of the real time clock.
Blockage Error
Error Code
_RG_BLKG, _RG_RDBLK
Description
The blockage timer has timed out waiting for transfer of data. The following table contains the error flags that are used to determine this error. At the present time, only _RG_BLKG is reported to the 3B21D, regardless of the type of node. The _RG_RDBLK code is provided for future use with the IRN.
The IRN nodes report blockage in two situations: the downstream node does not take the data, or the read FIFO does not take the data. The first is called propagate blockage and the latter is called read blockage. Propagate blockage means the downstream node is the cause of the fault, whereas read blockage indicates that the reporting node is at fault.
If the blockage is a read blockage, the hardware will destroy the message and switch the RAC to the force propagate mode. The token will remain on the ring and the ring will continue to operate normally. The read blockage is reported to the 3B21D with the _RG_RINH error code. This code is used to indicate that the blockage was the fault of the reporting node and not the downstream node. The error is reported by sending error messages to each RPC on the opposite ring. If a blockage occurs on a broadcast message, the error flags will indicate both propagate and read blockages. This case will be handled as a propagate blockage.
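The classification rule described above (both flags set on a broadcast message is handled as a propagate blockage) can be sketched as follows. The flag bit values and all names here are hypothetical placeholders; the actual error flags are defined by the RAC hardware and are not reproduced in this sketch.

```c
/* Hypothetical flag bits, for illustration only. */
#define FLAG_PROPAGATE_BLK 0x1u   /* downstream node did not take the data */
#define FLAG_READ_BLK      0x2u   /* read FIFO did not take the data */

enum blockage_kind {
    BLK_NONE,        /* no blockage flag set */
    BLK_PROPAGATE,   /* downstream node is the cause of the fault */
    BLK_READ         /* reporting node is at fault */
};

/*
 * Classify a blockage from its error flags. When both flags are set
 * (blockage on a broadcast message), the propagate case takes priority.
 */
enum blockage_kind classify_blockage(unsigned flags)
{
    if (flags & FLAG_PROPAGATE_BLK)
        return BLK_PROPAGATE;
    if (flags & FLAG_READ_BLK)
        return BLK_READ;
    return BLK_NONE;
}
```

Checking the propagate flag first is what gives the broadcast case its propagate handling without a separate branch.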
Variable Data
immemsg.data.specific.ports - RAC status ports from the faulty ring.
immemsg.data.specific.opports - RAC status ports from the opposite ring. See Notes.
immemsg.data.specific.un.misc.tokblk - Block on token code, which indicates whether the token was being held by the node when the blockage timeout occurred. Nonzero values indicate that the node found evidence that it was holding the token. See file ims_emsgs.h for details.
ROP Data
BLOCKAGE DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabb - Block on token code (see description above).
ccdd - not used.
ee   - The node's home RPC overflow state (IRN only).
ff   - The node's overload state (IRN only).
gg   - The node's overflow state (IRN only).
hh   - The node's silence state (IRN only).
jj   - node type, 3 = IRN.
kk   - port C, faulty ring.
ll   - port B, faulty ring.
mm   - port A, faulty ring.
nnpp - not used.
qq   - port E, faulty ring (IRN only).
rr   - port D, faulty ring (IRN only).
ss   - not used.
tt   - port C, opposite ring. See Notes.
uu   - port B, opposite ring. See Notes.
vv   - port A, opposite ring. See Notes.
wwxx - not used.
yy   - port E, opposite ring (IRN only). See Notes.
zz   - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. For this particular error type, the status is always from the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
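When reading a BLOCKAGE DETECTED printout, it can help to split the six printed words into the fields listed above. The following sketch does this for the blockage layout only. It assumes that each printed word is available as a 32-bit value, with aa as the most significant byte of the first word. The structure and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of the BLOCKAGE DETECTED ROP data (hypothetical structure). */
struct blockage_rop {
    uint16_t tokblk;                 /* aabb: block on token code */
    uint8_t home_rpc_overflow;       /* ee (IRN only) */
    uint8_t overload;                /* ff (IRN only) */
    uint8_t overflow;                /* gg (IRN only) */
    uint8_t silence;                 /* hh (IRN only) */
    uint8_t node_type;               /* jj: 3 = IRN */
    uint8_t port_c, port_b, port_a;  /* kk, ll, mm: faulty ring */
    uint8_t port_e, port_d;          /* qq, rr: faulty ring (IRN only) */
    uint8_t opp_c, opp_b, opp_a;     /* tt, uu, vv: opposite ring */
    uint8_t opp_e, opp_d;            /* yy, zz: opposite ring (IRN only) */
};

/* Byte i of a word, where byte 0 is the most significant byte. */
static uint8_t byte_of(uint32_t word, int i)
{
    return (uint8_t)(word >> (24 - 8 * i));
}

/* Split the six printed words w[0]..w[5] into named fields. */
struct blockage_rop decode_blockage(const uint32_t w[6])
{
    struct blockage_rop r;
    r.tokblk            = (uint16_t)(w[0] >> 16);
    r.home_rpc_overflow = byte_of(w[1], 0);
    r.overload          = byte_of(w[1], 1);
    r.overflow          = byte_of(w[1], 2);
    r.silence           = byte_of(w[1], 3);
    r.node_type         = byte_of(w[2], 0);
    r.port_c            = byte_of(w[2], 1);
    r.port_b            = byte_of(w[2], 2);
    r.port_a            = byte_of(w[2], 3);
    r.port_e            = byte_of(w[3], 2);
    r.port_d            = byte_of(w[3], 3);
    r.opp_c             = byte_of(w[4], 1);
    r.opp_b             = byte_of(w[4], 2);
    r.opp_a             = byte_of(w[4], 3);
    r.opp_e             = byte_of(w[5], 2);
    r.opp_d             = byte_of(w[5], 3);
    return r;
}
```

The unused fields (ccdd, nnpp, ss, wwxx) are simply skipped. Other error types use a different layout for the first two words, so this decoder should not be applied to them unchanged.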
Description
This error indicates a byte with bad parity has been presented to the input of the RAC. A hard parity error is a parity error that cannot be cleared by the node.
The faulty byte will not be accepted by the node and the upstream node will eventually detect blockage.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
ROP Data
RAC PARITY/FORMAT ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aa       - The node's home RPC overflow state (IRN only).
bb       - The node's overload state (IRN only).
cc       - The node's overflow state (IRN only).
dd       - The node's silence state (IRN only).
eeffgghh - not used.
jj       - node type, 3 = IRN.
kk       - port C, faulty ring.
ll       - port B, faulty ring.
mm       - port A, faulty ring.
nnpp     - not used.
qq       - port E, faulty ring (IRN only).
rr       - port D, faulty ring (IRN only).
ss       - not used.
tt       - port C, opposite ring. See Notes.
uu       - port B, opposite ring. See Notes.
vv       - port A, opposite ring. See Notes.
wwxx     - not used.
yy       - port E, opposite ring (IRN only). See Notes.
zz       - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. In most cases, this is the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
An orphan byte has been presented to the input of the RAC. An orphan byte condition occurs when the RAC is expecting a C byte but the byte received is not a C byte. At the present time, the orphan byte is reported to the 3B21D using the _RG_HPTY error code. The _RG_ORBYTE code is provided for future IRN application.
In the case of the orphan byte, 2 bytes are accepted into the input FIFO of the IRN. The bytes are not read into memory and will be held until the error condition is cleared.
Two bytes may have been accepted by the input FIFO. A processor RAC reset must be issued to clear the orphan byte(s) from the input FIFO. The input is inhibited to prevent the input FIFO from accepting more bytes. Because the reporting node will not accept data from the upstream node, that node will report a blockage condition. The orphan byte error is reported by sending error messages to each RPC only on the opposite ring.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
ROP Data
RAC PARITY/FORMAT ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aa       - The node's home RPC overflow state (IRN only).
bb       - The node's overload state (IRN only).
cc       - The node's overflow state (IRN only).
dd       - The node's silence state (IRN only).
eeffgghh - not used.
jj       - node type, 3 = IRN.
kk       - port C, faulty ring.
ll       - port B, faulty ring.
mm       - port A, faulty ring.
nnpp     - not used.
qq       - port E, faulty ring (IRN only).
rr       - port D, faulty ring (IRN only).
ss       - not used.
tt       - port C, opposite ring. See Notes.
uu       - port B, opposite ring. See Notes.
vv       - port A, opposite ring. See Notes.
wwxx     - not used.
yy       - port E, opposite ring (IRN only). See Notes.
zz       - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. For this particular error type, the status is always from the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
This error indicates a ring parity error occurred but was subsequently cleared by the recovery routine.
Because of the difference in the recovery action, orphan byte errors will not be included in this error class. All orphan byte errors will be hard errors.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
ROP Data
TRANSIENT RAC ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aa       - The node's home RPC overflow state (IRN only).
bb       - The node's overload state (IRN only).
cc       - The node's overflow state (IRN only).
dd       - The node's silence state (IRN only).
eeffgghh - not used.
jj       - node type, 3 = IRN.
kk       - port C, faulty ring.
ll       - port B, faulty ring.
mm       - port A, faulty ring.
nnpp     - not used.
qq       - port E, faulty ring (IRN only).
rr       - port D, faulty ring (IRN only).
ss       - not used.
tt       - port C, opposite ring. See Notes.
uu       - port B, opposite ring. See Notes.
vv       - port A, opposite ring. See Notes.
wwxx     - not used.
yy       - port E, opposite ring (IRN only). See Notes.
zz       - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. In most cases, this is the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
The upstream interframe buffer has detected a parity error.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
ROP Data
INTERFRAME BUFFER PARITY ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aa       - The node's home RPC overflow state (IRN only).
bb       - The node's overload state (IRN only).
cc       - The node's overflow state (IRN only).
dd       - The node's silence state (IRN only).
eeffgghh - not used.
jj       - node type, 3 = IRN.
kk       - port C, faulty ring.
ll       - port B, faulty ring.
mm       - port A, faulty ring.
nnpp     - not used.
qq       - port E, faulty ring (IRN only).
rr       - port D, faulty ring (IRN only).
ss       - not used.
tt       - port C, opposite ring. See Notes.
uu       - port B, opposite ring. See Notes.
vv       - port A, opposite ring. See Notes.
wwxx     - not used.
yy       - port E, opposite ring (IRN only). See Notes.
zz       - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. In most cases, this is the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
An explanation of the RAC hardware is needed to understand this error code, which is another form of blockage. When a node detects blockage while propagating a message, the hardware will be set to force read the remainder of the message that was being propagated and will then stop the ring. If the blockage occurred while the node was writing data to the ring, the write is stopped and the contents of the RAC FIFO are read into memory. As part of the recovery procedure, the data that was read into memory is checked for valid parity. Bad parity would explain the blockage because the downstream node will not accept data with bad parity. To get this error, the RAC must have received good data either from the upstream node or the node processor, but it tried to transmit bad parity to the downstream node. This implies the RAC hardware is faulty. If a node reports this error, the downstream node should have reported a hard parity error. If this error occurs during a write, a partial message may have been written to the ring, and this will cause one or more downstream nodes to report a read format error.
Variable Data
immemsg.data.specific.ports - RAC status ports from the faulty ring.
immemsg.data.specific.opports - RAC status ports from the opposite ring. See Notes.
ROP Data
RAC OUTPUT PARITY ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabb - not used.
ccdd - not used.
ee   - The node's home RPC overflow state (IRN only).
ff   - The node's overload state (IRN only).
gg   - The node's overflow state (IRN only).
hh   - The node's silence state (IRN only).
jj   - node type, 3 = IRN.
kk   - port C, faulty ring.
ll   - port B, faulty ring.
mm   - port A, faulty ring.
nnpp - not used.
qq   - port E, faulty ring (IRN only).
rr   - port D, faulty ring (IRN only).
ss   - not used.
tt   - port C, opposite ring. See Notes.
uu   - port B, opposite ring. See Notes.
vv   - port A, opposite ring. See Notes.
wwxx - not used.
yy   - port E, opposite ring (IRN only). See Notes.
zz   - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. For this particular error type, the status is always from the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
These error codes indicate some error occurred while a node was attempting to write a message to the ring. At the present time, all write errors are reported with the _RG_WFMT error code, regardless of the type of node reporting the error. The other error codes are provided for future use with the IRN.
This error code may indicate one of the following:

a. Write source match error. The node tried to write a message to the ring, but the source address did not match the node's address or the source ring in the message did not match the ring being used.
b. Write too short. A C byte was presented to the header FIFO before the FIFO had received enough of the header to determine the disposition of the message.
c. Write length error. When a write is performed, a counter is loaded with the length value from the message. If the write FIFO becomes empty and the write DMA channel asserts the end of DMA signal (EOD) before the counter reaches zero, a write length error is indicated. This error means the RAC saw at least the first 6 bytes of the message and was able to determine the disposition of the message. If this error occurs, a partial message was sent on the ring and downstream node(s) may report read format errors (_RG_RFMT).
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
msg->specific.dhead - The header of the message that was being written to the ring.
ROP Data
WRITE FORMAT ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabbccdd eeffgghh - Header of the message that was being written to the ring.
jj   - node type, 3 = IRN.
kk   - port C, faulty ring.
ll   - port B, faulty ring.
mm   - port A, faulty ring.
nnpp - not used.
qq   - port E, faulty ring (IRN only).
rr   - port D, faulty ring (IRN only).
ss   - not used.
tt   - port C, opposite ring. See Notes.
uu   - port B, opposite ring. See Notes.
vv   - port A, opposite ring. See Notes.
wwxx - not used.
yy   - port E, opposite ring (IRN only). See Notes.
zz   - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information from the RAC is used to transmit the error report. In most cases, this is the RAC opposite to that on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
Read length error. A C byte was received before the end of the message was reached.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
msg->specific.dhead - The header of the message that was being read from the ring.
ROP Data
READ FORMAT ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
MSG SRC: (LN/RPCN)GG MM, MSG TYPE: (NORMAL/BROADCAST/SEL BROADCAST/TAKE)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

MSG SRC, MSG TYPE - MSG SRC and MSG TYPE are the source node and message type, respectively, extracted from the first word of the message header (0xaabbccdd). When the node is unsuccessful in recovering the message involved in the READ FORMAT ERROR, 0xaabbccdd is set to 0xffffffff.
aabbccdd eeffgghh - Header of the message that was being read from the ring. If the node could not recover the message that was read from the ring, these fields will be set to 0xffffffff.
jj   - node type, 3 = IRN.
kk   - port C, faulty ring.
ll   - port B, faulty ring.
mm   - port A, faulty ring.
nnpp - not used.
qq   - port E, faulty ring (IRN only).
rr   - port D, faulty ring (IRN only).
ss   - not used.
tt   - port C, opposite ring. See Notes.
uu   - port B, opposite ring. See Notes.
vv   - port A, opposite ring. See Notes.
wwxx - not used.
yy   - port E, opposite ring (IRN only). See Notes.
zz   - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information is from the RAC that was used to transmit the error report. In most cases, this is the RAC opposite to the one on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
Read too short. A second C byte was received before a complete IMS header had been received.
Variable Data
msg->specific.ports - RAC status ports from the faulty ring.
msg->specific.opports - RAC status ports from the ring that was used to write the error message to the 3B21D. This information was taken just before the error message was written.
ROP Data
READ TOO SHORT DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabbccdd eeffgghh - The partial header that was read into memory. If the node could not recover the message that was read from the ring, these fields will be set to 0xffffffff.
jj - node type, 3 = IRN.
kk - port C, faulty ring.
ll - port B, faulty ring.
mm - port A, faulty ring.
nnpp - not used.
qq - port E, faulty ring (IRN only).
rr - port D, faulty ring (IRN only).
ss - not used.
tt - port C, opposite ring. See Notes.
uu - port B, opposite ring. See Notes.
vv - port A, opposite ring. See Notes.
wwxx - not used.
yy - port E, opposite ring (IRN only). See Notes.
zz - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information is from the RAC that was used to transmit the error report. In most cases, this is the RAC opposite to the one on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
When a blockage occurs during a write, the data in the FIFO should be transferred to the NP memory. If a blockage occurs during a read or while propagating a message, the data up to the next C byte should be read into memory. This error code is set if it appears that no data was put into memory. Either problem indicates that the RAC hardware is faulty or there is a problem with the DMAC. This error code indicates that the reporting node caused the blockage, not the downstream node.
At the present time, this error code is used in the IRN to report a read blockage.
Variable Data
immemsg.data.specific.ports - RAC status ports from the faulty ring.
immemsg.data.specific.opports - RAC status ports from the opposite ring. See Notes.
ROP Data
READ INHBIT ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabb - not used.
ccdd - not used.
ee - The node's home RPC overflow state (IRN only).
ff - The node's overload state (IRN only).
gg - The node's overflow state (IRN only).
hh - The node's silence state (IRN only).
jj - node type, 3 = IRN.
kk - port C, faulty ring.
ll - port B, faulty ring.
mm - port A, faulty ring.
nnpp - not used.
qq - port E, faulty ring (IRN only).
rr - port D, faulty ring (IRN only).
ss - not used.
tt - port C, opposite ring. See Notes.
uu - port B, opposite ring. See Notes.
vv - port A, opposite ring. See Notes.
wwxx - not used.
yy - port E, opposite ring (IRN only). See Notes.
zz - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information is from the RAC that was used to transmit the error report. For this particular error type, the status is always from the RAC opposite to the one on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
In order to detect and recover from problems caused by certain types of circulating hardware control messages, ring error interrupts generated by hardware control message execution are counted and thresholded at IRN RPCs. This report indicates that the number of these ring command interrupts generated at the reporting IRN RPC has exceeded a threshold.

A leaky bucket thresholding technique is used to determine when the number of interrupts is excessive; a count of ring command events is incremented during processing of ring error interrupts, and decremented on each 10 ms clock interrupt. After incrementing the leaky bucket count, the ring error interrupt handler compares the count against a predefined threshold; if the count has exceeded the threshold, a circulating hardware control message is assumed to be the cause. The leaky bucket count increment, decrement, and threshold are parameters defined in header file rg.ear.h. Two separate thresholds are defined: one for use during the normal RPC operational state (RPCS4), and one for use during the RPC initialization and ring maintenance states (RPCS2 and RPCS3).

This error condition is most likely an indication of a circulating broadcast type hardware control message, one of the nonlethal control types that do not quarantine or NP reset the affected nodes. A circulating nonbroadcast RAC reset message will also generate ring command interrupts in this way. A less likely cause is faulty ring interface hardware that generates an unclearable ring command interrupt. Refer to the contents of RAC status port D for an indication of the type of hardware control command that generated the excessive interrupt activity.
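The leaky bucket thresholding described above can be sketched in a few lines. This is a minimal illustration in Python, not the node software: the class name, the method names, and the numeric values in the usage note are hypothetical, and the real increment, decrement, and threshold are the parameters the text places in rg.ear.h. Flooring the count at zero on a clock tick is also an assumption; the text does not say what happens when the bucket is empty.

```python
class LeakyBucket:
    """Count ring command events and leak the count on each clock tick."""

    def __init__(self, increment: int, decrement: int, threshold: int) -> None:
        self.increment = increment
        self.decrement = decrement
        self.threshold = threshold
        self.count = 0

    def on_ring_command_event(self) -> bool:
        """Called from ring error interrupt handling.

        Increments the count first, then compares it with the threshold,
        in the order the text describes. Returns True when the count has
        exceeded the threshold.
        """
        self.count += self.increment
        return self.count > self.threshold

    def on_clock_tick(self) -> None:
        """Called on each 10 ms clock interrupt; never goes below zero
        (an assumption of this sketch)."""
        self.count = max(0, self.count - self.decrement)
```

With illustrative values of increment 4, decrement 1, and threshold 10, three events with no intervening clock ticks exceed the threshold, while events separated by four or more ticks never do. A second instance with a different threshold would model the separate setting used in the initialization and ring maintenance states.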
Variable Data
immemsg.data.specific.ports - RAC status ports from the interrupting ring, prior to the node recovery actions.
immemsg.data.specific.opports - RAC status ports from the interrupting ring, after the node recovery actions have been completed.

ROP Data

EXCESSIVE RING CMD INTERRUPTS DETECTED, RPCNXX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aa - RPC node state.
bb - a flag to indicate whether the leaky bucket counter's value incremented past the threshold over a span of multiple ring error interrupts (flag = 01) or entirely during the processing of one ring error interrupt (flag = 00).

value of ring cmd interrupt leaky bucket counter, after it was incremented and found to exceed the counter threshold.
leaky bucket counter increment, on each ring command event.
leaky bucket counter decrement, on each 10 ms clock tick.
leaky bucket counter threshold.
node type, 3 = IRN (should always indicate IRN).
port C, interrupting ring (prior to recovery actions).
port B, interrupting ring (prior to recovery actions).
port A, interrupting ring (prior to recovery actions).
not used.
port E, interrupting ring (prior to recovery actions).
port D, interrupting ring (prior to recovery actions).
not used.
port C, interrupting ring (after recovery actions).
port B, interrupting ring (after recovery actions).
port A, interrupting ring (after recovery actions).
not used.
port E, interrupting ring (after recovery actions).
port D, interrupting ring (after recovery actions).
Description
The opns module has determined this node removed the token from the ring. The token was taken from the ring as if there was a legitimate destination address match. The node message switch was delivering messages from the ring buffers and a message destined for the _TOKEN channel was encountered. There are no ring status ports to check; this is purely a software decision. However, if the token was actually removed from the ring, the INACT bit in the RAC status information may be set.
Variable Data
msg->specific.dhead - The header of the suspected token.
ROP Data
DEQUEUED TOKEN DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabbccdd eeffgghh - The header of the suspected token.
jj - node type, 3 = IRN.
kkllmm - not used.
nnppqqrr - not used.
ssttuuvv - not used.
wwxxyyzz - not used.
Description
The node placed a message on the ring, but the destination node was not able to remove the message from the ring. The message traveled completely around the ring and returned to the source node.
Variable Data
msg->specific.dhead - The header from the message that caused the source match.
ROP Data
RMV (LN/RPCN)XX YY; SRC MATCH RPTD BY (LN/RPCN)AA BB
0xaabbccdd 0xeeffgghh (TTTTTTTTTT)

The source address of the source match message.
The control word of the source match message.
The destination function of the source match message.
The source function of the source match message.
The destination address of the source match message.
Description
An error interrupt is generated and the error cannot be cleared. This error is a catch-all to handle some unexpected hardware or software condition. When any error interrupt is generated, the status ports are saved, the errors are cleared, and some recovery action is taken. After the recovery has completed, the status ports are checked again. If errors still exist, the cycle of clearing the error and performing the recovery is repeated. If the number of times the cycle is repeated exceeds a predefined threshold, it is assumed that the error is permanent and the RAC problem is reported to the 3B21D. It is possible for this error to be caused by a circulating message on the ring.
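The clear-and-recover cycle described above can be sketched as a bounded retry loop. This is a minimal illustration in Python under stated assumptions: the three callbacks stand in for hardware access that the text does not detail, a status value of 0 is assumed to mean that no errors are present, and max_repeats is a hypothetical name for the predefined threshold.

```python
from typing import Callable


def recover_until_clear(
    read_status: Callable[[], int],
    clear_errors: Callable[[], None],
    recover: Callable[[], None],
    max_repeats: int,
) -> bool:
    """Run the clear/recover cycle until the status ports show no error.

    Returns True if the error cleared, or False if the number of repeated
    cycles would exceed max_repeats, in which case the caller would report
    a general RAC error.
    """
    repeats = 0
    while True:
        saved_status = read_status()  # status ports are saved first
        clear_errors()
        recover()
        if read_status() == 0:        # assumption: 0 means no errors remain
            return True
        repeats += 1
        if repeats > max_repeats:
            return False              # treated as a permanent error
```

In this sketch a transient error that clears after one or two passes returns True without any report, while an unclearable error returns False after the initial pass plus max_repeats repeated passes.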
Variable Data
immemsg.data.specific.ports - RAC status ports from the faulty ring.
immemsg.data.specific.opports - RAC status ports from the opposite ring. See Notes.
immemsg.data.specific.un.misc.sint -
ROP Data
GENERAL RAC ERROR DETECTED (LN/RPCN)XX YY RAC (0/1)
0xaabbccdd 0xeeffgghh 0xjjkkllmm 0xnnppqqrr 0xssttuuvv 0xwwxxyyzz (TTTTTTTTTT)

aabb - False interrupt indicator. If this field is 0x1, the problem was a false interrupt generated by a RAC.
ccdd - not used.
ee - The node's home RPC overflow state (IRN only).
ff - The node's overload state (IRN only).
gg - The node's overflow state (IRN only).
hh - The node's silence state (IRN only).
jj - node type, 3 = IRN.
kk - port C, faulty ring.
ll - port B, faulty ring.
mm - port A, faulty ring.
nnpp - not used.
qq - port E, faulty ring (IRN only).
rr - port D, faulty ring (IRN only).
ss - not used.
tt - port C, opposite ring. See Notes.
uu - port B, opposite ring. See Notes.
vv - port A, opposite ring. See Notes.
wwxx - not used.
yy - port E, opposite ring (IRN only). See Notes.
zz - port D, opposite ring (IRN only). See Notes.
NOTE: This status port information is from the RAC that was used to transmit the error report. For this particular error type, the status is always from the RAC opposite to the one on which the error occurred. If the error report was sent by an RPC node, this status information is meaningless.
Description
The node is trying to write to the ring, and a timer expired while waiting for the token to arrive at this node. This timing interval is 60 msec. This error is reported only by an RPC and then only when it has attempted to write to the ring.
Variable Data
None.
ROP Data
UNEXPLAINED LOSS OF TOKEN ON aa

aa - RING 0, RING 1 or BOTH RINGS.
Description
The node checksum audit on a text or data section has failed.
ROP Data
RMV (LN/RPCN)XX YY; NODE CKSUM ERROR
0xaabbccdd 0xeeffgghh (TTTTTTTTTT)

aa - Current audit number.
bb - Accumulated sum.
cc - Not used.
dd - Reference sum.
eeff - Segment that the audit was running in.
gghh - Offset to the beginning of the section that was being audited.
Description
This error code should never be reported to the 3B21D, but it is included here for reference. If a node processor parity failure occurs, the node will panic but it will not send an error message. If there is bad parity and an attempt is made to send a message, it may create parity errors at the downstream node and cause that node to be removed from service. If an NP parity error occurs while writing a message to the ring, the write will be terminated. This will chop off the end of the message and cause the downstream node(s) to report a read format error. The 3B21D will be unaware of the problem until a message destined for the node is returned as a source match.
Design Issues
Some new error codes were created, but they were mapped to existing error codes. These new codes were provided for future use in the 3B21D.

1. Presently, the indication of which ring is at fault is the upper bit of the error code in the error message. Would it be any simpler to dedicate a separate field in the error message for this purpose?
Yes. A spare field in the error message will be assigned to use as the faulty ring indicator. The bit will still be set until the 3B21D code is changed to the new field in the message.

2. The error messages will contain an indication of the type of node that sent the message.

3. In the current configuration, the error message may contain the RAC ports of both rings. Is this necessary?
The information from the opposite port is not usually used by the 3B21D, but in some cases the additional information in the ROP printout is helpful in the analysis of the problem. For that reason, the opposite port information will be retained whenever possible. This status information is really data that is obtained from the RAC on which the error message was transmitted. When an error message is sent on both rings, it is not possible to tell which RAC this status belongs to. Should something be added to the error message to indicate which RAC the message was transmitted from? Should this status information be provided when the reporting node is an RPC?

4. Should the orphan byte error be handled separately from the parity error?
Yes. Previously, these errors were grouped together because the recovery action was the same in either case. That is no longer true, so a new error code will be assigned for orphan byte errors. Also, the orphan byte error requires that error messages be sent to all RPCs on the opposite ring.

5. There are three error codes that indicate blockage: _RG_BLKG, _RG_ROPF, and _RG_RINH. Is it necessary to have all of the error codes?
The ring error analysis in the 3B21D relies on the different error codes to determine how to recover from the error.

6. What if there is a source match and the destination address of the source match message is a virtual address? How does the 3B21D know which node to remove? Is it going to have to wait until the neighbor audit runs to discover which node is in error? This is a known hazard associated with using virtual addresses.
The source match will not be reported if the destination is a virtual address. The faulty node will not be removed until the neighbor audit runs.

7. At the present time, the recovery strategy for a write format error (_RG_WFMT) sets inhibit input, which will cause the upstream node to see a blockage. Is this overkill? There seem to be a couple of things to consider. Why should we block the ring because one node cannot write? The problem with causing blockage is that all traffic on the ring is lost, and this seems a harsh penalty to pay for a write format error. It seems logical that we could clean up after the error and try to continue normal operation. The 3B21D could then make the decision whether to remove the node from service. This is also the case with the input format error (_RG_IFMT). This is really a read too short error. Inhibit input is also set when this error occurs.
The final decision was to set inhibit input in the IRN to make it look like an older node.

8. In some error cases in older nodes, inhibit input is set to prevent recurring error interrupts. Will that work in the IRN? Or would it be better to disable the ring interrupt?
The IRN will continue to use inhibit input to prevent recurring interrupts. If it disabled interrupts, it would be difficult for the node to determine when to reenable the interrupt.

9. The error report printed at the ROP presently contains data taken from the error message. This is provided to help analyze the problem. The amount of data printed may change and is subject to the time required to print the message. The time used to print the report affects the total error recovery time.

10. The loss of token error message is sent to the 3B21D if the timer times out during a token write or if it times out during a priority write. Should there be separate codes for the different write failures?
The final decision was not to create a new error code.

11. The input format error is really a read too short error, so the error code is changed from _RG_IFMT to _RG_RDTOSHRT.
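Design issue 1 describes the faulty ring indicator as the upper bit of the error code. A minimal sketch in Python of separating that bit from the rest of the code follows. The width of the error code is not stated in this appendix, so the 8-bit default is an assumption, as are the function name and the sample values; which ring a set bit denotes is likewise not stated here, so the sketch only returns the raw bit.

```python
from typing import Tuple


def split_error_code(code: int, width_bits: int = 8) -> Tuple[int, int]:
    """Return (ring_bit, base_code) for an error code of width_bits bits.

    ring_bit is the uppermost bit of the code (0 or 1); base_code is the
    code with that bit masked off. width_bits = 8 is an assumption made
    for illustration only.
    """
    if not 0 <= code < (1 << width_bits):
        raise ValueError("code does not fit in %d bits" % width_bits)
    top = width_bits - 1
    ring_bit = (code >> top) & 1
    base_code = code & ((1 << top) - 1)
    return ring_bit, base_code
```

A dedicated field, as decided in the answer to issue 1, removes the need for this masking once the 3B21D code reads the new field.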
Ring-Related Errors
The following ring transport errors indicate faults that obstruct the transportation of messages on the ring. Such faults usually lead to ring restarts and/or node isolations.

BLOCKAGE
A node's blockage timer timed out waiting for the downstream node or interframe buffer board (IFB) to accept an offered data byte. The blocked node will clear the ring by reading all data from the ring, including the token message. It then reports the condition to the 3B20D/3B21D by sending a BLOCKAGE Ring Transport Error Message on the opposite ring to each RPCN.
RAC OUTPUT PARITY ERROR
A node attempted to transmit bad parity to the downstream node or IFB. Since bad parity is not accepted by the downstream node or IFB, the transmitting node eventually detects blockage and reads the data with bad parity into memory as part of the blockage recovery process. Upon recognizing the bad parity, the transmitting node will take the same recovery action as with BLOCKAGE, except that this error is reported instead of BLOCKAGE.

READ INHIBIT ERROR
Blockage occurred during a read or while propagating a message and no data was read into NP memory as part of the blockage recovery process. The node will take the same recovery action as with BLOCKAGE.

RAC PARITY/FORMAT ERROR
A node reporting this error will not accept data from its upstream neighbor, thereby forcing the upstream node to detect ring blockage. The following two conditions cause this error.
(1) A ring data byte with bad parity has been offered to the node, and the node recovery action of resampling the data could not clear the error. (If bad parity were due to a transient error, resampling should clear it.)
(2) An orphan byte has been offered to the node. An orphan byte condition occurs when a node expects to receive a control byte but is offered another byte instead. The control byte is the first byte of data in an IMS message. A special signal lead on the ring bus is asserted only during the control byte, thereby allowing the receiving node to distinguish the control byte from all other message bytes.

INTERFRAME BUFFER PARITY ERROR
The upstream interframe buffer has detected a ring parity error. The IFB will not accept any more data, thereby forcing blockage in the node upstream from the IFB.

WRITE FORMAT ERROR
Some error occurred while a node was attempting to write a message to the ring. For example, the message may have had a source address that does not match that of the writing node, or the message specified an improper message length. A node reporting this error will not accept ring data from its upstream node, thereby forcing the upstream node to detect blockage.

GENERAL RAC ERROR
A catch-all error type used to report an unexpected node hardware or software condition. A node reporting this error will not accept ring data from its upstream node, thereby forcing the upstream node to detect blockage.
DEQUEUED TOKEN
A ring node reports this error when it finds that it has read the token message from the ring. This error is intended to detect failures that cause a node to inadvertently read data from the ring.

RING INTERFACE FAILURE
During a boot, ring maintenance activity found an RPC's ring interface to be faulty.

PIO FAILURE
A Programmed IO operation at an RPCN from the 3B20D/3B21D failed.

RPCN ISOLATION
An RPCN was removed from service due to isolation. The RPCN may or may not be an innocent victim. This condition is reported as a ring transport error but is actually a status message, since it is a condition imposed upon an RPCN by the 3B20D/3B21D as a result of ring transport error messages it has previously received.
Node-Related Errors
The following ring transport errors indicate faults that prevent the processing and transmission of messages in nodes. They usually lead to node quarantine.

SOURCE MATCH
A ring message returned to the sending node because the destination node did not remove the message from the ring.

SRC MATCH
This is the same as the SOURCE MATCH error, except the detection was made by the node audit (NAUD) operation.

NAUD FAILURE
The node audit operation failed in a communication test with a node.

RPCN PANIC
This is a failure condition in RPCN software.

RPCN STATE CHANGE FAILURE
The RPCN failed to confirm that it has followed a 3B20D/3B21D directive to change into a particular software state during ring maintenance activity.
UNXPCTD STATE CHNG MSG
This is similar to the RPCN STATE CHANGE FAILURE. Without having been sent a 3B20D/3B21D directive, an RPCN reported that it has changed into a particular software state.

RING WRITE FAILURE
An RPCN reported that it failed to write a message to the active ring.

MSG RELAY FAILURE
This is similar to the RING WRITE FAILURE. An RPCN failed in relaying a message from the 3B20D/3B21D onto one of the rings during ring maintenance activity.

RING READ FAILURE
An RPCN reported that it failed to read a message from the active ring.

UNXPCTD SET QUA
The 3B20D/3B21D received an unprovoked confirmation from an RPCN that it has been directed to quarantine itself.

RAC CONTROL FAILURE
During ring maintenance activity, the ability of the 3B20D/3B21D to control an RPC's ring access circuit (RAC) failed.
READ TOO SHORT ERROR
A node read a message that was shorter than an IMS header (8 bytes). The partial message header is discarded.
Some Versions of the RST Input Message

Input Message: CFR:RING ,NODEa b ,NODEa b;INCLUDE
Result: Returns the specified range of nodes, if they are eligible, to the active ring.

Input Message: CFR:RING ,NODEa b;EXCLUDE
Result: Isolates the specified node, if it is eligible.

Input Message: CFR:RING ,NODEa b NODEa b;EXCLUDE
Result: Isolates the specified range of nodes, if they are eligible.

Input Message: CFR:RING,NODEa,b;MOVFLT
Result: Moves the indication of a faulty ring interface from the currently isolated node to the node identified as NODEa,b and causes the isolation to shift so that NODEa,b becomes the newly isolated node and the currently isolated node becomes the EISO or BISO node. See Manual Recovery from a Hard Fault on a Small Ring in Chapter 3, Ring Maintenance.
8. Enter c in response to indicate that a field is to be changed. ODIN will then prompt for the field number.
9. Enter 22 in response to specify field 22 (the equipage field). ODIN will position the cursor at field 22.
10. Change the value of field 22 as follows: 0x8 at the beginning of the manual ring initialization is used to set the flag, and at the completion of the manual ring initialization, after the ring is stable, to reset the flag. ODIN will prompt for the next field to be changed.
11. Depress the <CR> key to indicate that no other changes are desired on the page. ODIN will again display the operations prompt at the lower portion of the screen.
12. Enter u in response to update the form and inform ODIN that no other changes are required for this session.
13. The message FORM UPDATED will flash once at the upper right of the screen when the form is updated. ODIN will then return to page 1 of the form.
14. Return to the forms selection page by depressing the < key, and execute the TREND Form.
For interframe buffers that are upstream of RAC 1, set bits 4-7 of the ECD UCB HV field to the following values:

VALUE   BUFFER TYPE
0       no IFB
1       TN918 (unpadded)
2       TN915 (padded 512 byte capacity)
3       TN1507 (fiber 256 byte capacity)
4       TN1506 (padded 4104 byte capacity)
5       TN1508 (fast unpadded 16 byte capacity)
6       TN1509 (fast 4104 byte capacity)
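The bit arithmetic implied by the table can be sketched as follows. This is a minimal illustration in Python, assuming that bit 0 is the least significant bit of the field, so that bits 4-7 form the upper half of the low byte; the text does not state the bit numbering convention. The function names are hypothetical, and the sketch only computes values; it does not interact with the ECD.

```python
# Buffer type codes copied from the table above.
IFB_TYPES = {
    0: "no IFB",
    1: "TN918 (unpadded)",
    2: "TN915 (padded 512 byte capacity)",
    3: "TN1507 (fiber 256 byte capacity)",
    4: "TN1506 (padded 4104 byte capacity)",
    5: "TN1508 (fast unpadded 16 byte capacity)",
    6: "TN1509 (fast 4104 byte capacity)",
}

IFB_SHIFT = 4              # assumption: bit 0 is the least significant bit
IFB_MASK = 0xF << IFB_SHIFT


def set_ifb_type(field_value: int, ifb_value: int) -> int:
    """Return field_value with bits 4-7 replaced by ifb_value."""
    if ifb_value not in IFB_TYPES:
        raise ValueError("unknown IFB value: %r" % (ifb_value,))
    return (field_value & ~IFB_MASK) | (ifb_value << IFB_SHIFT)


def get_ifb_type(field_value: int) -> int:
    """Extract the IFB value held in bits 4-7 of field_value."""
    return (field_value & IFB_MASK) >> IFB_SHIFT
```

Under this assumption, writing value 4 (TN1506) into a field that currently holds 0x0F yields 0x4F, and the low four bits are left untouched.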
Abbreviations
For definitions of terms used in this acronym list, see the Glossary or consult the Index for text references.
Numerics
3B20D AT&T 3B20 Duplex Real Time Reliable computer
3B21D A new version of the existing 3B20D processor
5ESS Registered trademark of Lucent Technologies for its premier electronic switching system
A
ACCH Associated control channel
ACDN Administrative Call Processing/Database Node
ACT Active state
ACTS Automated Cellular Test System
ACU Analog conversion unit
AIF Antenna Interface Frame (Series II Cell)
AMA Automatic Message Accounting
AMASE Automatic Message Accounting Standard Entries
AMPS Advanced Mobile Phone Service
AP Attached Processor - Another name for the Ring Application/Attached Processor.
ATP All Tests Passed
AUTOPLEX AT&T Registered Trademark for its Cellular Switching Systems
AutoPACE Performance Analysis and Cellular Engineering
B
BBA Bus Interface Unit + Baseband Combiner & Radio + Analog Conversion Unit (BIU+BCR+ACU)
BCR Baseband Combiner & Radio
BER Bit Error Rate
BIU Bus Interface Unit
BWM Broadcast Warning Message
C
CCC CDMA Cluster Controller
CCCEQ CDMA Cluster equipage form
CCFDB Custom Calling Features Database
CCU CDMA Channel Unit
CDMA Code Division Multiple Access
CDN Call Processing/Database Node
CDN-II Call Processing/Database Node - II
CDN-IIX Call processing/database node - IIX
CE Channel Element
CELLDB Cell Site Database
CEQCOM1 Series I Cell Equipage Common form
CEQCOM2 The Series II Cell Equipage RC/V Form
CEQFACE Cell Equipage Face
CGSADB Cellular Geographic Service Area Database
CNI Common Network Interface
CNI/IMS Common Network Interface/Interprocess Message Switch
CPI Communication processor interface
CPU Core processor unit
CSC Cell Site Controller
CU Control unit
D
DAT Digital Audio Tape
DCCH Digital Control Channel
DCI Dual-Serial Channel (DSCH) Computer Interconnect
DCS Digital Cellular Switch
DCSDB Digital Cellular Switch Database
DFI Digital Facility Interface
DRTU Digital Radio Test Unit
DRU Digital Radio Unit
DS-1 Digital Signal level 1
DS0 Digital Signal-0
DSN Digital Switch Node
E
EA Emergency Action Page
EA/NORM Emergency Action/Normal Display Key on MCRT
ECD Equipment Configuration Database
ECP Executive Cellular Processor
ECPC ECP Complex
ECPDB Executive Cellular Processor Database
F
FAF Feature Activation File
FDMA Frequency Division Multiple Access
G
GPS Global Positioning System
H
HO Handoff
Hz Hertz
I
IMS Interprocessor Message Switch
IRN Integrated Ring Node
IRN2 Integrated ring node version 2
L
LAF Linear Amplifier Frame
LAN Local Area Network
M
MAHO Mobile Assisted Handoff
MB Mega Byte
MCRT Maintenance Cathode Ray Tube/Terminal
MHD Moving Head Disk
MHz Megahertz
MSC Mobile Switching Center (formerly MTSO)
MSO Multiple Size Option for Subscriber Database
MUFDB Mobile Unit Features Data Base
N
N/A Not Applicable
NVM Non-Volatile Memory
O
OA&M Operations, Administration & Maintenance
ODA Office Data Assembler
ODD Office Dependent Data
OMP Operations Mgmt Platform, previously Operations and Maintenance Processor
OOS Out-Of-Service
P
PC Personal Computer
PM Plant Measurements
PSTN Public switched telephone network
PSU Packet Switching Unit
R
RAM Random Access Memory
RCC Radio Control Complex
RCU Radio Channel Unit
RCV Recent Change & Verify
RF Radio Frequency
RFTG Reference frequency and timing generator
RN Ring Node
ROP Read/Receive-Only Printer
RPC Ring Peripheral Controller (node)
RPCN Ring Peripheral Controller Node
RTR Real Time Reliable
RTU Radio Test Unit
S
SC Stable Clear
SCSI Small Computer System Interface
SCT Synchronous Clock and Tone
SH Speech Handler
SII Series II Cell Site
SM Service Measurements
SMS Short Message Service
SS7 Signaling System 7
STBY Standby
SU Software Update
T
TDMA Time Division Multiple Access
TEA Translations Entry Assistant
TRKGRP Trunk group
TRTU TDMA Radio Test Unit
V
VCSA Voice Channel Selection Activity
W
WTSC Wireless Technical Support Center (formerly CTSO)
Glossary
A
Attached Processor (AP) A circuit pack used with the direct link node (DLN) that provides expanded storage for added processing capacity on the ring.
B
Basic Error Correction (BEC) BEC or Basic is an algorithm for Level 2 error correction on signaling links with short one-way propagation delay. In normal operation, BEC ensures correct transfer of message signal units over CCS7 and CCITT7 signaling links, in sequence and with no double delivery. Positive acknowledgments indicate correct transfer of message signal units. Negative acknowledgments request a retransmission of those signal units because they were received in a corrupt form.
C
Call Processor/Database Node (CDN) A CNI node that handles the call processing functions of the FLEXENT/AUTOPLEX Wireless Network Systems. A CDN is a two-part unit consisting of a node and ring application processor (RAP). There are several versions of CDNs: CDN, CDN-I, CDN-II, and CDN-IIx.

CCITT Consultative Committee International Telegraph and Telephone (Comite Consultatif International Telegraphique et Telephonique). An international body that controls the standards of communications protocols.

CDN Call Processor/Database Node
CDN-I A CDN that is comprised of an IRN and a 3B15-based computer. This is sometimes referred to as a SMART Node (SN).

CDN-II A CDN that is comprised of an IRN2 and an AP30. This is sometimes referred to as a Turbo CDN.

CDN-IIx A CDN that is comprised of an IRN2B and a modified AP30.

CNCE CCS Network Critical Events

Common Network Interface (CNI) A common subsystem software component supplied to various network components whose primary function is providing CCS network access and CCS message routing.

Computer Congestion Control The 3B20D/3B21D computer congestion control feature enables a craft to reduce real-time congestion by reducing CNI's activity on the 3B20D/3B21D computer. If not used by a craft, it remains inactive.

Critical Node Restore/Monitor CNI's critical node monitor looks for configurations of out-of-service link nodes and direct link nodes (DLNs) that have cut its ring off from the outside world. To restore these nodes quickly, it tells Interprocess Message Switch (IMS) to give them a user critical priority on its automatic ring recovery (ARR) priority list. The monitor also permits its ring's application to nominate nodes to this priority.

CSN Cell Site Node
D
DCS Digital Cellular Switch

Destination Point Code (DPC) A unique value associated with every network component that is used for routing.

Direct Link Node (DLN) A DLN is basically an RPCN equipped with an AP. A DLN routes the data link message traffic between cellular systems for both X.25 and SS7 messaging.
Direct Link Node 30 (DLN30) The DLN30 has IRN2B, AP30, 3BI, and DDSBS boards. The IRN2B board provides increased performance and higher reliability.

Direct Link Node Enhanced (DLNE) The DLNE has IRNB, AP30, 3BI and DDSBS boards.

DSN Digital Switch Node
E
EAI Emergency Action Interface

EAR Error Analysis and Recovery

Extended Access Links (E-Links) and Full Point Code Routing (FPCR) The ELINKS/FPCR features allow LECs to achieve the following benefits in their networks: provides additional routes to destinations, which further minimizes signaling end point (SEP) isolation; forces traffic to be directly routed (thus using fewer intermediate STPs) to more efficient and less problematic routes, which improves network performance; and allows switching traffic between Access Links (A-Links) and E-Links, which makes network reconfiguration easier.
F
Full Process Initialization (FPI) FPI will reduce failed and abandoned initializations. It is a faster and more reliable initialization response than the abort and boot initialization.
401-661-045
I
ICN
Inter-Cellular Node

IFB
Interframe Buffer Board

IMS User Node (IUN)
An IMS-provided node on the ring that, with the addition of CNI hardware, provides an interface between the ring and the transmission facility. This includes all non-RPCN nodes.

Integrated Ring Node (IRN)
A ring node that uses very large scale integration to combine the node processor and both ring interfaces into one circuit pack. There are several versions of the IRN referenced in this document: the IRN (UN303), the IRNB (UN303B), the IRN2 (UN304), and the IRN2B (UN304B). Functionally, they all serve the same purpose, but different IRN versions are used in different node types.

Interprocess Message Switch (IMS)
A common subsystem software component that provides a ring-based interfunction, interprocessor transport mechanism.

IUN Init with Optional Pump
This restores the node without repumping it. It increases the system's availability through reduced downtime.
L
LI
Link Interface

LIN-E
Link Interface Node - Encrypted

LIN-NE
Link Interface Node - Nonencrypted

Link Node (LN)
A node on the ring where digital information enters from or exits to the transmission facility.
M
MCRT
Maintenance Cathode Ray Tube

MDL
Memory Data Link

Message Switch
The portion of the IMS software that handles the sending and receiving of internal messages. There are portions of the message switch in all ring nodes and in the central processor.

Message Transfer Part (MTP)
The functional part of CCS7 that transfers signaling messages as required by all the users and also performs the necessary subsidiary functions (for example, error control and signaling security).
N
Network Interconnect (NI)
NI is used to interconnect signaling points in different North American networks that adhere to the ANSI standard specifications for the CCS7 protocol. It provides: MTP and SCCP routing to PCs in nonlocal networks; SNM and SCMG for nonlocal network PCs; administration of the associated nonlocal network routing data; new routing types to support routing to small networks and cluster-level-only routing to populated clusters; and NID-only routing.

Node Processor (NP)
The NP is the central processing unit (CPU) portion of a ring node. It controls and schedules the processes in the ring node.

Nonlocal Point Code
Any signaling point code whose network identifier value differs from the network identifier value of the local point code.

NRM
Node Recovery Monitor
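
The Nonlocal Point Code definition above comes down to comparing one field of two point codes. The following Python sketch illustrates that comparison. It assumes the usual ANSI layout of a point code as three 8-bit fields (network, cluster, member) written as "N-C-M". The names PointCode, parse_point_code, and is_nonlocal are hypothetical, chosen for illustration only, and are not part of CNI.

```python
from typing import NamedTuple


class PointCode(NamedTuple):
    """Assumed ANSI-style point code: three 8-bit fields."""
    network: int   # network identifier
    cluster: int
    member: int


def parse_point_code(text: str) -> PointCode:
    """Parse a hypothetical "N-C-M" text form, e.g. "246-10-3"."""
    parts = [int(p) for p in text.split("-")]
    if len(parts) != 3 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid N-C-M point code: {text!r}")
    return PointCode(*parts)


def is_nonlocal(local: PointCode, other: PointCode) -> bool:
    """A point code is nonlocal when its network identifier
    differs from the network identifier of the local point code."""
    return other.network != local.network
```

Under these assumptions, a point code in the same network but a different cluster is still local; only a differing network identifier makes it nonlocal.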
O
Offline Boot (OFLBOOT)
The OFLBOOT feature allows the 3B20D/3B21D duplex processor of a 5ESS-2000 switch to be logically separated into two simplex machines: the ONLINE side and the OFFLINE side. This allows personnel at a 5ESS-2000 switch to cut over to a new software release with a minimum of downtime.
P
Peripheral Routing
Provides the capability to do CCS7 MTP and SCCP routing in a node on the ring.

Preventive Cyclic Retransmission (PCR)
PCR is an algorithm for Level 2 error correction on CCS7 or CCITT7 signaling links with a long one-way propagation delay. Each message signal unit must be retained at the transmitting signaling link terminal until a positive acknowledgement arrives from the receiving signaling link terminal. During periods when there are no new signal units to be transmitted, all signal units that have not yet been positively acknowledged are retransmitted cyclically.

Protected Applications Segment (PAS)
CNI data that rarely changes is referred to as static data and is preserved in the protected applications segment area of 3B memory. CNI can reuse this data from PAS during CNI init level 2, saving the time that would otherwise be spent downloading the data from disk. To ensure that PAS data is safe, it must be protected from accidental writes. For this purpose, CNI has improved protection of the PAS area.
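
The Preventive Cyclic Retransmission (PCR) entry above describes a transmit-side rule that can be sketched in a few lines of Python. This is a simplified illustration, not the signaling link terminal's actual implementation. It assumes unbounded integer sequence numbers (real links use a small modular sequence space), cumulative positive acknowledgements, and it leaves out the forced-retransmission thresholds that a full implementation needs. The class and method names are hypothetical.

```python
from collections import deque


class PcrTransmitter:
    """Transmit side of a simplified PCR scheme."""

    def __init__(self):
        self.new_units = deque()   # payloads not yet sent
        self.unacked = deque()     # (seq, payload) sent but not acknowledged
        self.next_seq = 0
        self._cycle_index = 0      # position in the cyclic retransmission

    def queue(self, payload):
        """Accept a new message signal unit for transmission."""
        self.new_units.append(payload)

    def next_transmission(self):
        """Return the next (seq, payload) to put on the link, or None."""
        if self.new_units:
            # New signal units always take priority.
            unit = (self.next_seq, self.new_units.popleft())
            self.next_seq += 1
            self.unacked.append(unit)   # retain until positively acknowledged
            return unit
        if self.unacked:
            # Nothing new to send: retransmit unacknowledged units cyclically.
            self._cycle_index %= len(self.unacked)
            unit = self.unacked[self._cycle_index]
            self._cycle_index += 1
            return unit
        return None

    def acknowledge(self, seq):
        """Positive acknowledgement of every unit up to and including seq."""
        while self.unacked and self.unacked[0][0] <= seq:
            self.unacked.popleft()
        self._cycle_index = 0
```

Because retransmission happens only while the link would otherwise be idle, the scheme needs no negative acknowledgements, which is why it suits links with a long one-way propagation delay.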
R
Ring
Refers collectively to the RPCNs and IUNs that are serially connected to one of two circular busses. The ring provides 4 megabyte data paths in both directions between adjacent nodes and can uniquely address up to 1,024 nodes.

Ring Application Processor
A modified 3B15 computer, used in the standard multiapplication real time node, that performs processing on the ring.

Ring Configuration
For various reasons, the ring is reconfigured under control of the 3B20D/3B21D computer to isolate the faulty segment.

Ring Generic Access Package (RGRASP)
RGRASP is a debugging tool for CNI ring nodes.

Ring Interface (RI)
An RI is one of two circuits in a ring node that interface the node processor to the ring. Each RI can access either ring 0 or ring 1 to insert messages onto, or remove messages from, the active ring. The heart of the circuit is a first-in first-out (FIFO) buffer that provides access to the ring yet allows messages to circulate in the ring independent of the node.

Ring Isolation
A ring configuration where ring nodes are isolated from the active ring.
Ring Peripheral Controller Node (RPCN)
A node on the ring where digital information is removed from the ring for transfer to the 3B20D/3B21D computer for processing or, after processing, reenters the ring.
S
Signaling Connection Control Part (SCCP)
An adjunct to the MTP layer of CCS7 that performs interpoint code subsystem status.

Signaling End Point (SEP) Dual Point Code (DUALPC)
The DUALPC feature allows Signaling End Points (SEPs) to support a two-point-code assignment. This facilitates changing the point code when the SEP is resectored, with minimal Signaling System Number 7 (SS7) service disruption.

SMART Node (SN)
Standard Multi-Application Real Time node. See CDN-I.

SS7
Signaling System 7
T
Turbo CDN
See CDN-II.
W
WTSC
Wireless Technical Support Center
Index

A
About this document, xv
  comments, xix
Automatic Diagnostics and Restorals, 6-55

C
Circuit Pack Trouble Location, 6-24

D
Diagnostic Listings, 6-41
Diagnostic Message Structure, 6-6

E
Equipment Description, 7-1

G
Global Positioning System, AC-5

H
Handling Precautions, 7-1
Hardware and Interfaces, 6-2

I
Interactive Diagnostics, 6-70
IRN CDN-I Diagnostic Phases, 6-18
IRN DLNE Node Diagnostic Phases, 6-14
IRN LN (LI4S/SS7) Node Diagnostic Phases, 6-12
IRN LN (LIN-E/SS7) Node Diagnostic Phases, 6-11
IRN2 CDN-II/CDN-IIx Diagnostic Phases, 6-20
IRN2 CDN-III Diagnostic Phases, 6-22
IRN2 DLN30 Node Diagnostic Phases, 6-15
IRN2 DLN60 Node Diagnostic Phases, 6-17

L
LNs with Unequipped LI Boards - MV Updates, 6-42

M
Manual (Unit) Diagnostics, 6-56
Manual Diagnostics Using the 1106 Display Page, 6-59
Manual Diagnostics Using the DGN Command, 6-61

N
Node Diagnostic Phases
  IRN CDN-I, 6-18
  IRN DLNE, 6-14
  IRN LN (LI4S/SS7), 6-12
  IRN LN (LIN-E/SS7), 6-11
  IRN2 CDN-II/CDN-IIx, 6-20
  IRN2 CDN-III, 6-22
  IRN2 DLN30, 6-15
  IRN2 DLN60, 6-17
Node Phase Descriptions, 6-9

O
Operating System Diagnostics, 6-75

P
Performing Diagnostics, 6-6
Power Packs and Fusing, 7-2

R
RAP Diagnostic Firmware, 6-69
Ring Node Addressing, 6-43

S
System Diagnostics, 6-8
System Maintenance Interfaces, 6-5

U
Unexplained Loss of Token, B-5
Lucent Technologies welcomes your comments on this information product. Your opinion is of great value and helps us to improve.
1. Was the information product:
Response options for each item: Yes / No / Not applicable
In the language of your choice?
In the desired media (paper, CD-ROM, etc.)?
Available when you needed it?
Please provide any additional comments:
________________________________________________________________________________________________
________________________________________________________________________________________________
2. Please rate the effectiveness of this information product:
Rating scale for each item: Excellent / More than satisfactory / Satisfactory / Less than satisfactory / Unsatisfactory / Not applicable
Ease of use
Level of detail
Readability and clarity
Organization
Completeness
Technical accuracy
Quality of translation
Appearance
If your response to any of the above questions is "Less than satisfactory" or "Unsatisfactory", please explain your rating.
________________________________________________________________________________________________
________________________________________________________________________________________________
3. If you could change one thing about this information product, what would it be?
________________________________________________________________________________________________ ________________________________________________________________________________________________
4. Please write any other comments about this information product:
________________________________________________________________________________________________ ________________________________________________________________________________________________
Please complete the following if we may contact you for clarification or to address your concerns:
Date: ________________________________
If you choose to complete this form online, go to http://www.lucent-info.com/comments. Otherwise, fax it to 407 767 2760 (U.S.) or +1 407 767 2760 (outside the U.S.), or email comments to ctiphotline@lucent.com.