
HSS9860 V900R008

Troubleshooting

www.huawei.com

Copyright © 2013 Huawei Technologies Co., Ltd. All rights reserved.


This document describes the procedures for troubleshooting common faults or emergency faults during routine maintenance.

This document describes the operations to be performed periodically on the equipment to detect and solve problems in advance, and thus ensure normal running of the equipment.

References
 Prevent Failures in the electronic documentation of the GU
HSS9860 V900R008
 Troubleshooting in the electronic documentation of the GU
HSS9860 V900R008
 Fault Management Description in the electronic documentation
of the HSS9860 V900R008

Objective
Upon completion of this course, you will know:

 How to handle a fault

 How to check equipment status at various intervals to detect equipment exceptions in advance

Contents

1. Troubleshooting

2. Prevent Failures

Troubleshooting
Faults of a system are classified into common faults and emergency faults.

Common Fault
Common faults are device faults that occur unexpectedly and affect a small range of
services or devices. They do not severely affect the running and quality of service (QoS)
of a network.

Category | Description
Service failures | Service failures reported by subscribers.
Operation failures on the provisioning system | Failures in performing operations on the client of the provisioning system. An error code is displayed when an operation fails.
Maintenance faults | Failures in performing operations on the Huawei Operation & Maintenance System (an error code is displayed when an operation fails); faults identified during routine maintenance; faults identified during upgrade, migration, or capacity expansion.

Emergency Fault

Emergency faults are device faults that occur unexpectedly and affect a wide range of services or devices. They severely affect the running and quality of service (QoS) of a network.

Troubleshooting an emergency fault aims to restore the system and service provisioning as soon as possible. To improve the efficiency of troubleshooting emergency faults and minimize the loss, you must adhere to the following principles.

Overview of Alarm Handling
Alarm Console
The alarm box provides only visible and audible alarm severity information. The alarm
console on the LMT provides the details about alarms.

 Alarm Severity
The alarm severity indicates how serious an alarm is.

In descending order of severity, alarms are classified into four levels:
Critical alarm: Must be cleared immediately. Otherwise, the system may break down.
Major alarm: Requires urgent action, because this type of alarm affects the QoS of the system.
Minor alarm: Does not affect the QoS of the system, but the fault must be located and removed in time.
Warning alarm: Should be handled based on the actual conditions.
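
The severity ordering above can be sketched in code so that the most severe alarms are handled first. This is only an illustration: the numeric values, the `Alarm` record, and the sample alarm names are assumptions and do not come from the HSS9860 alarm console.

```python
from enum import IntEnum
from typing import NamedTuple


class Severity(IntEnum):
    """The four severity levels; a lower value means more severe (numbering is an assumption)."""
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4


class Alarm(NamedTuple):
    alarm_id: int       # hypothetical identifier, for illustration only
    name: str           # hypothetical alarm name
    severity: Severity


def handling_order(alarms):
    """Return the alarms sorted so that the most severe ones come first."""
    return sorted(alarms, key=lambda alarm: alarm.severity)


alarms = [
    Alarm(3, "example minor alarm", Severity.MINOR),
    Alarm(1, "example critical alarm", Severity.CRITICAL),
    Alarm(2, "example warning alarm", Severity.WARNING),
]
for alarm in handling_order(alarms):
    print(alarm.severity.name, alarm.name)
```

Using an ordered enumeration keeps the comparison rule in one place, so any list of alarms can be prioritized with a plain sort.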

Fault Detection Mechanism
The fault detection subsystem monitors the operating status of the equipment through hardware detection and software detection. It reports the detected faults so that you can rectify them in time.

 Hardware detection
Boards implement hardware detection of the following items:
Board state (normal/abnormal or active/standby)
Clock
Temperature
Online/Offline state

 Software detection
Logical errors can be detected through software detection. The logical errors that can be detected
are as follows:
Cyclic Redundancy Check (CRC) error
Memory error
Data consistency error
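
As an illustration of how software detection can catch a CRC error, the sketch below appends a CRC-32 checksum to a payload and verifies it on receipt. It uses Python's standard `zlib.crc32`. The 4-byte trailer framing and the helper names are assumptions for this example; the slides do not state which CRC variant or framing the HSS9860 uses.

```python
import zlib


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 trailer to the payload (assumed framing)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the trailer."""
    if len(frame) < 4:
        return False  # too short to contain a trailer
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


good = frame_with_crc(b"subscriber record")
# Flip one bit in the first byte to simulate corruption.
corrupted = bytes([good[0] ^ 0x01]) + good[1:]
print(crc_ok(good), crc_ok(corrupted))
```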

Analysis Methods Used in Troubleshooting
 Analyzing Indicator Status
Indicators show the status of boards and links through different colors. When a board is faulty, the indicator status can be used for fault identification.

 Analyzing Alarm Information


Alarms report faults or exceptions on the system in a clear and simple manner. The alarm box signals alarms visually and audibly, and the alarms are presented on the OMU client or the Element Management System (EMS). The alarm information includes the following parts:
Symptoms of faults or exceptions
Possible causes
Troubleshooting measures
The alarm information is important for fault identification.

 Analyzing Performance Measurement Information
Performance measurement collects the running information of the system in real time. The
performance measurement information reflects the running status of the system. It can be used for
fault identification when the system experiences a fault.
 Analyzing Traced Messages
Message tracing provides dynamic, real-time monitoring of the call connection process, resource usage, and service flow over ports and signaling links. Traced messages help you quickly locate and troubleshoot a call connection failure. They also help you learn about the signaling exchange between NEs.
 Analyzing Error Codes
Error codes are returned by the system when the operations performed on the client fail. The error
codes are used to query specific error information, which is helpful for fault identification.

 Analyzing Logs
Logs record specific running information of each module of the system and the data configuration operations performed on the client. You can use logs to identify faults, but this takes more time than the other analysis methods. Therefore, use logs to identify faults only when the other analysis methods do not work.
 Analyzing the Device Panel
The device panel provides a visual emulation pane, through which you can perform operations to
manage the hardware, software, and modules of the system. It displays the boards in different colors
according to their hardware status; it also displays the status indicators of modules in different colors
according to the process status. The colors of boards and status indicators of modules can be used for
fault identification when a board or module experiences a fault.

 Analyzing the Service Panel
The service panel provides a topology view of the logical modules of the system. It depicts the service processing of modules on different logical layers in graphs or continuous curves. It displays the modules in different colors according to their status, so you can quickly locate a faulty module by its color. For emergency faults in particular, the service panel effectively shortens the time needed for fault identification.

 Methods of Analyzing Common Faults

Fault | Method
Hardware faults | Analyzing Alarm Information, Analyzing the Device Panel, Analyzing Indicator Status, Analyzing Logs
Link faults | Analyzing Traced Messages, Analyzing Alarm Information, Analyzing Performance Measurement Information, Analyzing Indicator Status, Analyzing Logs
Operation failures on the OMU | Analyzing Alarm Information, Analyzing the Device Panel, Analyzing Error Codes, Analyzing Logs
Operation failures on the provisioning system | Analyzing Alarm Information, Analyzing Error Codes, Analyzing Logs
Subscriber service failures | Analyzing Alarm Information, Analyzing Performance Measurement Information, Analyzing Traced Messages, Analyzing Logs
Device performance faults | Analyzing Performance Measurement Information, Analyzing Alarm Information
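
The fault-to-method table above can be encoded as a simple lookup, for example in a maintenance script that suggests where to start. This is a minimal sketch: the category names and method names are taken from the table, while the dictionary layout and the `methods_for` helper are hypothetical.

```python
# Analysis methods per common-fault category, in the order listed in the table.
ANALYSIS_METHODS = {
    "Hardware faults": [
        "Analyzing Alarm Information",
        "Analyzing the Device Panel",
        "Analyzing Indicator Status",
        "Analyzing Logs",
    ],
    "Link faults": [
        "Analyzing Traced Messages",
        "Analyzing Alarm Information",
        "Analyzing Performance Measurement Information",
        "Analyzing Indicator Status",
        "Analyzing Logs",
    ],
    "Operation failures on the OMU": [
        "Analyzing Alarm Information",
        "Analyzing the Device Panel",
        "Analyzing Error Codes",
        "Analyzing Logs",
    ],
    "Operation failures on the provisioning system": [
        "Analyzing Alarm Information",
        "Analyzing Error Codes",
        "Analyzing Logs",
    ],
    "Subscriber service failures": [
        "Analyzing Alarm Information",
        "Analyzing Performance Measurement Information",
        "Analyzing Traced Messages",
        "Analyzing Logs",
    ],
    "Device performance faults": [
        "Analyzing Performance Measurement Information",
        "Analyzing Alarm Information",
    ],
}


def methods_for(fault_category):
    """Return a copy of the method list for a category (raises KeyError if unknown)."""
    return list(ANALYSIS_METHODS[fault_category])


for step, method in enumerate(methods_for("Link faults"), start=1):
    print(step, method)
```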

 Methods of Analyzing Emergency Faults

The following methods are available for analyzing emergency faults:


Analyzing the Service Panel
Analyzing Alarm Information
Analyzing Performance Measurement Information
Analyzing the Device Panel

Contents

1. Troubleshooting

2. Prevent Failures

Overview of Fault Prevention
Fault prevention is a set of preventive measures taken regularly while the system is running. Fault
prevention helps to locate and eliminate defects or helps to troubleshoot the system in time to ensure
long-term security and stability of the system.

Based on the implementation period, the fault prevention can be classified into daily maintenance and
periodic maintenance.
 Daily Maintenance
Daily maintenance consists of simple operations performed daily by common maintenance
personnel.
 Periodic Maintenance
Periodic maintenance consists of complex operations performed regularly by qualified
maintenance personnel.

 Daily Maintenance
Identify alarms generated by the equipment and existing defects on the equipment in time, and take preventive measures. This ensures the stability of the equipment and reduces the number of faults or failures.
Examine the operating status of the equipment and the network in real time, and predict their running status for the coming period. This helps maintenance engineers handle emergencies more efficiently.

 Periodic Maintenance
Ensure that the equipment is in good condition and operates safely, stably, and reliably.
Identify defects in the equipment, such as natural aging, malfunction, and deterioration of performance, through periodic checks, backup measures, tests, and cleaning processes. Take proper measures to eliminate these defects.

Category of Fault Prevention Operations
Based on the maintenance period, the fault prevention operations are categorized as follows:
 Check and troubleshoot the faults reported by the system every day.
 Check the system performance and the subscriber data backed up to the third-party device every week, and identify and rectify potential faults. This helps to ensure the normal running of the system and data consistency.
 Check the system running status and the data consistency between the active and redundancy systems every month. This helps to eliminate potential faults from the systems.
 Check the system time and the running status of the internal components of the system every quarter. This helps to ensure a normal running environment for the equipment.
 Check the system ports and system passwords every half year. This helps to ensure a normal running environment for the equipment.
 Check the switchover between the active and redundancy systems, cable connections, grounding, and power supply in the equipment room every year. This helps to ensure that the standby modules or the redundancy system can take over the services of the active system in case of a fault. It also helps to eliminate potential risks caused by equipment aging.
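
The maintenance periods above can be turned into a small scheduling helper that reports which checklists are due on a given date. The slides only give the periods; the specific trigger days used here (Monday for weekly, the 1st of the month for monthly, and so on) are assumptions chosen for this sketch, as is the `due_periods` name.

```python
from datetime import date


def due_periods(day: date) -> list:
    """Return the maintenance periods due on the given day.

    Trigger days are assumptions, not Huawei guidance:
    weekly on Mondays, monthly on the 1st, quarterly on the 1st of
    January/April/July/October, half-yearly on 1 January and 1 July,
    yearly on 1 January.
    """
    periods = ["daily"]
    if day.weekday() == 0:  # Monday
        periods.append("weekly")
    if day.day == 1:
        periods.append("monthly")
        if day.month in (1, 4, 7, 10):
            periods.append("quarterly")
        if day.month in (1, 7):
            periods.append("half-yearly")
        if day.month == 1:
            periods.append("yearly")
    return periods


print(due_periods(date(2013, 7, 1)))
```

A site would normally replace the trigger days with its own maintenance calendar; the point is only that every longer period is checked in addition to the daily items, not instead of them.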

Prevent Failures
 Daily Maintenance
Office name:________________________ Maintenance date (year-month-day):___________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking Alarm Information | Normal □ Abnormal □ | |
Checking the Power Supply Status of the Rack | Normal □ Abnormal □ | |
Checking the Fan Status | Normal □ Abnormal □ | |
Checking the Operating Status of Boards | Normal □ Abnormal □ | |
Checking the Port Status | Normal □ Abnormal □ | |
Checking the CPU Usage Measurement Data | Normal □ Abnormal □ | |
Checking the IP Traffic Measurement Data | Normal □ Abnormal □ | |
Checking the Measurement Data of the Board Memory Usage | Normal □ Abnormal □ | |
Checking the Network Port Measurement Data | Normal □ Abnormal □ | |
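
If the daily checklist is recorded electronically rather than on paper, the items that need follow-up can be extracted automatically. The sketch below is one possible approach: the item names come from the checklist above, while the `"Normal"`/`"Abnormal"` string values, the `items_needing_attention` helper, and the rule that an unrecorded item also needs attention are assumptions.

```python
# Daily maintenance items, as listed in the checklist above.
DAILY_ITEMS = [
    "Checking Alarm Information",
    "Checking the Power Supply Status of the Rack",
    "Checking the Fan Status",
    "Checking the Operating Status of Boards",
    "Checking the Port Status",
    "Checking the CPU Usage Measurement Data",
    "Checking the IP Traffic Measurement Data",
    "Checking the Measurement Data of the Board Memory Usage",
    "Checking the Network Port Measurement Data",
]


def items_needing_attention(results):
    """Return items marked abnormal or not recorded at all.

    Treating a missing entry as needing attention is an assumption.
    """
    return [item for item in DAILY_ITEMS if results.get(item) != "Normal"]


results = {item: "Normal" for item in DAILY_ITEMS}
results["Checking the Fan Status"] = "Abnormal"
print(items_needing_attention(results))
```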

 Weekly Maintenance
Office name:________________________ Maintenance date (year-month-day):__________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking the USCDB Performance Measurement | Normal □ Abnormal □ | |
Checking HSS9860 Performance Measurement | Normal □ Abnormal □ | |
Checking Link Status | Normal □ Abnormal □ | |
Checking the Resource Usage on Boards | Normal □ Abnormal □ | |
Checking the Subscriber Data Backed Up to a Third-Party Device | Normal □ Abnormal □ | |
Backup on Local Disk | Normal □ Abnormal □ | |
Backup on Network Disk | Normal □ Abnormal □ | |
Checking the Resource Status | Normal □ Abnormal □ | |
Checking the System Time | Normal □ Abnormal □ | |
Checking the Synchronization Status Between the Active and Standby OMU Servers | Normal □ Abnormal □ | |
Checking the OMU Service Status | Normal □ Abnormal □ | |
Checking the Disk Usage of the OMU Server | Normal □ Abnormal □ | |

 Monthly Maintenance
Office name:________________________ Maintenance date (year-month-day):___________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking the ME Status by Using the NHC Tool | Normal □ Abnormal □ | |
Checking the Communication Status of the Remote Maintenance Network | Normal □ Abnormal □ | |
Checking Data Consistency Between the Active and Redundancy HSS9860s | Normal □ Abnormal □ | |
Checking Pre-Warning Based Rectification Results | Normal □ Abnormal □ | |
Checking the QoS of the Bearer Network | Normal □ Abnormal □ | |
Checking the Operating Status of Hot Patches | Normal □ Abnormal □ | |
Checking User Rights | Normal □ Abnormal □ | |
Checking Security Logs | Normal □ Abnormal □ | |
Changing OS User Passwords | Normal □ Abnormal □ | |

 Quarterly Maintenance
Office name:________________________ Maintenance date (year-month-day):___________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking the Operating Status of LAN Switches | Normal □ Abnormal □ | |
Checking the Operating Status of the KVMS | Normal □ Abnormal □ | |
Cleaning the Air Filters of the Cabinet | Normal □ Abnormal □ | |
Cleaning the Dust-Preventive Air Filter | Normal □ Abnormal □ | |
Checking the Time Zone and Daylight Saving Time | Normal □ Abnormal □ | |

 Semi-annual Maintenance
Office name:________________________ Maintenance date (year-month-day):___________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking System Port Status | Normal □ Abnormal □ | |
Performing Security Audit | Normal □ Abnormal □ | |
Changing the Password for an OMU Database User | Normal □ Abnormal □ | |
Changing the Password for a USCDB Physical Database User | Normal □ Abnormal □ | |
Changing the Password for a Database Board OS User | Normal □ Abnormal □ | |
Changing the Password for the OS Account on the USRSU Board | Normal □ Abnormal □ | |
Changing the Passwords for the OS Accounts on the USPGW Board | Normal □ Abnormal □ | |
Changing the Passwords for the OS Accounts on the FEU Board | Normal □ Abnormal □ | |

 Yearly Maintenance
Office name:________________________ Maintenance date (year-month-day):___________________

Maintenance Item | Maintenance Status | Remarks | Maintenance Person Name
Checking the Power Supply System | Normal □ Abnormal □ | |
Checking the Switchover Between the Active and Redundancy HSS9860s | Normal □ Abnormal □ | |
Checking the Cable Connections | Normal □ Abnormal □ | |
Checking the Grounding System | Normal □ Abnormal □ | |
Checking Spare Boards and Parts | Normal □ Abnormal □ | |

Thank you
www.huawei.com
